Goodness of Fit Tests for Linear Mixed Models

Min Tang; Eric V Slud; Ruth M Pfeiffer

doi:10.1016/j.jmva.2014.03.012

Author manuscript; available in PMC: 2017 May 11.

Published in final edited form as: J Multivar Anal. 2014 Apr 5;130:176–193. doi: 10.1016/j.jmva.2014.03.012

PMCID: PMC5426279 NIHMSID: NIHMS585672 PMID: 28503001

Abstract

Linear mixed models (LMMs) are widely used for regression analysis of data that are assumed to be clustered or correlated. Assessing model fit is important for valid inference but to date no confirmatory tests are available to assess the adequacy of the fixed effects part of LMMs against general alternatives. We therefore propose a class of goodness-of-fit tests for the mean structure of LMMs. Our test statistic is a quadratic form of the difference between observed values and the values expected under the estimated model in cells defined by a partition of the covariate space. We show that this test statistic has an asymptotic chi-squared distribution when model parameters are estimated by maximum likelihood or by least squares and method of moments, and study its power under local alternatives both analytically and in simulations. Data on repeated measurements of thyroglobulin from individuals exposed to the accident at the Chernobyl power plant in 1986 are used to illustrate the proposed test.

Keywords: asymptotic efficiency, information matrix, maximum likelihood estimators, method of moments, model fit, random effects

1. Introduction

The linear mixed model (LMM) (McCulloch and Searle, 2001) extends the linear model by including random effects in addition to the usual fixed effects in the linear predictors. By incorporating random effects LMMs can accommodate clustered or correlated data. Developments in model fitting algorithms and their implementations in statistical packages (e.g. lme in R; PROC Mixed in SAS 9.2; SAS Institute, Cary, NC) have greatly facilitated the applications of LMMs.

Two important steps in modeling are selecting a model and checking its fit. Often model selection is done by comparing nested models, via likelihood ratio or score tests, as part of model building, and approaches are also available for comparing non-nested models (Cox, 1961; Godfrey, 1988). Variables are often selected for inclusion into a model if their p-value obtained from a Wald test meets some significance criterion. AIC, BIC and other model selection principles (Rao and Wu, 1989; Shao, 1997) also focus on selection of covariates. Once a model is selected, its fit should be assessed. For fixed effects models this is done by checking residuals and formal goodness of fit tests, such as score or Wald tests or likelihood ratio tests based on nested models. Khuri, Mathew and Sinha (1998) discussed likelihood ratio testing for fixed effects within LMMs. The literature for assessing the fit of LMMs against general alternatives is limited, and is mostly concerned with specification of the random effect distributions. Likelihood ratio testing for the presence of random effects in LMMs has been discussed by Self and Liang (1987) and Crainiceanu and Ruppert (2004). Jiang (2001) and Ritz (2004) assessed the distributional assumptions for the random effects in LMMs. Claeskens and Hart (2009) proposed tests for normality of the random effects and/or error terms. Lombardía and Sperlich (2008) introduced a test for the hypothesis of a linear fixed effect part in a generalized linear mixed model against the alternative of a semiparametric fixed effect part. Pan and Lin (2005) propose checking the adequacy of 2-level generalized linear mixed models based on the maximum absolute partial sums of residuals over a scalar projection of covariates. Their approach allows for assessing overall model fit as well as the functional form of individual components of the fixed effects part. 
However, to date there is no general easily computable test for checking the fit of the fixed-effect part of a model against unspecified alternatives, including omitted covariates or interaction terms or misspecifications of the functional form of covariates. Such a test is needed as a model-building tool.

Examination of the residuals of a model is a standard way to judge the quality of model fit. This can be done in many different ways. One useful way is to classify the response into mutually exclusive events defined in terms of the covariates and then assess for each category the deviation of the observed values and the expected values under the model. For survival data, Schoenfeld (1980) presented a class of omnibus chi-squared goodness of fit tests for the proportional hazards regression model, based on the observed minus the expected values of the covariates at each failure time. In this article, we adopt the idea of Schoenfeld (1980) and develop a goodness of fit test for the mean structure of LMMs by comparing the observed and expected values computed from the model within cells of a partition of the covariate space.

The rest of the paper is organized as follows. In Section 2 we present the linear mixed model, introduce the goodness of fit test statistic, and derive its asymptotic properties, including its theoretical power under local alternatives. We first assume that the random effects components and the error term are normally distributed and parameters are estimated by maximum likelihood (Section 2.2). We then relax the assumption of normality and only require finite higher order moments for the random effect and the error term and estimate parameters using least squares and method of moments (Section 2.3). We study the power of the test in simulations in Section 3, present a data example in Section 4 and close with a discussion in Section 5.

2. Goodness of fit test statistic for linear mixed models

2.1. The linear mixed model

We consider the linear mixed model (LMM) with additive random effects,

Y = X β + \sum_{r = 1}^{R} Z_{r} α_{r} + ε,

(1)

where Y_N×1 is the vector of observations; $X_{N \times p} = (x_{1}^{T}, \dots, x_{N}^{T})$ is the design matrix for the fixed effects part of the model, where x_i denotes the p × 1 covariate vector for individual i; β is a p × 1 vector of unknown fixed effects parameters; Z_r is the known N × m_r design matrix for the random effect α_r, an m_r × 1 random vector, for r = 1, …, R. The random effects α₁, …, α_R are i.i.d. and independent of the error term ε. In the next section we assume that the components α_kr of α_r and ε are normally distributed. Within the LMM with a single random effect, we later require no distributional assumptions on the random effect and the error terms, but only the finiteness of their 4 + δ moments for some δ > 0. We let θ = (β, ψ) be the parameters of model (1), where $ψ = (σ_{∊}^{2}, σ_{1}^{2}, \dots, σ_{R}^{2})$ is the vector of all variance components.

An important special case of model (1) is the 2-level LMM, that includes only a single random effect,

y_{ij} = x_{ij}^{T} β + α_{i} + ∊_{ij}, i = 1, \dots, m, j = 1, \dots, n_{i},

(2)

where, using a slightly different notation, the 1 × p vector $x_{i j}^{T} = (1, x_{i j 1}, \dots, x_{i j (p - 1)})$ denotes covariates for the jth observation within the ith cluster. The first entry in x_ij is set to be 1 to accommodate an intercept term in the model. We let y_i = (y_i1, …, y_{in_i}) denote the vector of observations for the ith cluster. The normally distributed cluster specific random effects $α_{i} ~ N (0, σ_{α}^{2})$ are assumed to be independent of the error terms $∊_{i j} ~ N (0, σ_{∊}^{2})$ . Then under model (2), Y is also normal, Y ~ N(Xβ, V) with a block diagonal covariance matrix V, where each of the m n_i × n_i blocks V_i, i = 1, …, m, has entries $σ_{α}^{2} + σ_{∊}^{2}$ on the diagonal and entries $σ_{α}^{2}$ elsewhere. Throughout this paper, we regard models (1) and (2) as conditional specifications of the distribution of Y given X.
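The block structure of V under model (2) can be made concrete with a short sketch. The function names `cluster_cov` and `block_diag_cov` and the example cluster sizes are illustrative assumptions, not part of the original text.

```python
import numpy as np

def cluster_cov(n_i, sigma2_alpha, sigma2_eps):
    """Covariance block V_i for one cluster of size n_i under model (2)."""
    # sigma2_alpha + sigma2_eps on the diagonal, sigma2_alpha elsewhere
    return sigma2_eps * np.eye(n_i) + sigma2_alpha * np.ones((n_i, n_i))

def block_diag_cov(sizes, sigma2_alpha, sigma2_eps):
    """Assemble the N x N block-diagonal matrix V from the m cluster blocks."""
    N = sum(sizes)
    V = np.zeros((N, N))
    start = 0
    for n_i in sizes:
        V[start:start + n_i, start:start + n_i] = cluster_cov(n_i, sigma2_alpha, sigma2_eps)
        start += n_i
    return V

# Two clusters of sizes 2 and 3 (hypothetical values)
V = block_diag_cov([2, 3], sigma2_alpha=1.0, sigma2_eps=0.25)
```

Observations in different clusters have zero covariance, which is why V decomposes into m independent blocks.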

2.2. Test statistic and its asymptotic behavior when parameters are estimated by maximum likelihood

2.2.1. LMM with a single random effect

We first discuss the 2-level LMM in (2) when both the random effect and the error term are normally distributed and derive our test statistic for the setting where the model parameters $θ = (β, ψ) = (β, σ_{α}^{2}, σ_{∊}^{2})$ are estimated by maximum likelihood (MLE). Here X is considered to be fixed.

Under Assumptions 1.1-1.6 stated below in Theorem 1, consistency and asymptotic normality of the MLE $\hat{θ} = (\hat{β}, \hat{ψ})$ follow from Miller (1977), i.e. $\sqrt{N} (\hat{θ} - θ) \overset{D}{\to} N (0, J^{- 1})$ , where J denotes the limiting Fisher information matrix. Wand (2007) showed that under model (2), $\hat{β}$ and $\hat{ψ}$ are asymptotically uncorrelated and thus

J = [\begin{matrix} J_{β β} & 0 \\ 0 & M \end{matrix}],

(3)

where

J_{β β} = \lim_{N \to \infty} X^{T} V^{- 1} X ∕ N .

(4)

We assume that J_ββ is positive definite (Assumption 1.5, Theorem 1).

To test the goodness of fit of the mean structure of the LMM (2), we first divide the covariate space into L disjoint regions E₁, …, E_L. These regions are based on categorizations of single covariates or of composites of multiple covariates that may or may not be included in the current model. For example, for a single continuous covariate with support on the interval (a, b), the cells could be defined by E_l = (c_l, c_{l+1}], l = 1, …, L, where a = c₁ < c₂ < … < c_L < c_{L+1} = b. For a categorical (discrete) variable X that takes the values c_l for l = 1, …, L, the partition can be defined by E_l = {X = c_l}. We compute the observed and expected sums in each region E_l as

f_{l} = \sum_{i = 1}^{m} \sum_{j = 1}^{n_{i}} I_{{x_{ij} \in E_{l}}} y_{ij},

(5)

e_{l} (β) = \sum_{i = 1}^{m} \sum_{j = 1}^{n_{i}} I_{{x_{ij} \in E_{l}}} E (y_{ij}) = \sum_{i = 1}^{m} \sum_{j = 1}^{n_{i}} I_{{x_{ij} \in E_{l}}} x_{ij}^{T} β,

(6)

where I denotes the indicator function. When the cell partition is based on covariates not included in the model (2), then we let x_ij denote the vector of all available covariates and $x_{i j}^{*}$ the covariates used in the model, and use $e_{l} (β^{*}) = \sum_{i = 1}^{m} \sum_{j = 1}^{n_{i}} I_{{x_{i j} \in E_{l}}} {(x_{i j}^{*})}^{T} β^{*}$ , where β* corresponds to the coefficients of $x_{i j}^{*}$ . However, for notational simplicity we employ the expressions (5) and (6) throughout.
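As a minimal sketch of (5) and (6), the observed and expected cell sums can be computed as follows when the partition is defined by cut points on a single covariate. The helper names `cell_index` and `observed_expected`, the use of unbounded outer cells, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def cell_index(z, cuts):
    """Map each value of the partitioning covariate z to a cell 0, ..., L-1.

    cuts holds the L-1 interior cut points in increasing order; cells are
    right-closed intervals, and the two outer cells are unbounded.
    """
    return np.searchsorted(cuts, z, side="left")

def observed_expected(y, X, beta, cells, L):
    """Observed sums f_l as in (5) and expected sums e_l(beta) as in (6)."""
    mu = X @ beta                                    # fitted means x_ij^T beta
    f = np.bincount(cells, weights=y, minlength=L)   # sum of y_ij within each cell
    e = np.bincount(cells, weights=mu, minlength=L)  # sum of fitted means within each cell
    return f, e

# Toy data (hypothetical): four observations, L = 3 cells
z = np.array([-1.0, 0.2, 0.7, 1.5])
X = np.column_stack([np.ones(4), z])
y = np.array([0.0, 1.0, 3.0, 4.0])
cells = cell_index(z, cuts=np.array([0.0, 1.0]))
f, e = observed_expected(y, X, np.array([1.0, 2.0]), cells, L=3)
```

When the partition uses a covariate that is not in the fitted model, `z` is that covariate while `X` and `beta` refer only to the covariates in the model.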

Letting f = (f₁, …, f_L) and e(β) = (e₁(β), …, e_L(β)), the observed minus the expected vector is

f - e (β_{0}) = (\begin{matrix} \sum_{i = 1}^{m} \sum_{j = 1}^{n_{i}} I_{{x_{i j} \in E_{1}}} (y_{ij} - x_{ij}^{T} β_{0}) \\ ⋮ \\ \sum_{i = 1}^{m} \sum_{j = 1}^{n_{i}} I_{{x_{i j} \in E_{L}}} (y_{ij} - x_{ij}^{T} β_{0}) \end{matrix}) .

(7)

Since the true parameter vector β₀ is unknown, to create a test statistic we must replace it by a consistent asymptotically normal estimator, the MLE $\hat{β}$ as in (7) within Section 2.2 and the generalized least squares estimator (15) in Section 2.3. We further make Assumption 1.7 which, together with Assumptions 1.4 and 1.5, ensures the existence of components of the limiting variance covariance matrix for the test statistic.

Theorem 1. We make the following assumptions:

Assumption 1.1. The true parameter point θ₀ = (β₀, ψ₀) is an interior point of $ϴ = (R^{p}, {(R^{+})}^{R + 1})$ .

Assumption 1.2. $α_{i} ~ N (0, σ_{α}^{2})$ and $∊_{i j} ~ N (0, σ_{∊}^{2})$ .

Assumption 1.3. X is fixed and has full rank.

Assumption 1.4. $\lim_{K \to \infty} {\lim \sup}_{m \to \infty} \frac{1}{m} \sum_{i = 1}^{m} I_{{n_{i}^{2} \geq K}} n_{i}^{2} = 0$ .

Assumption 1.5. J_ββ = lim_N→∞ X^TV⁻¹ X/N exists and is positive definite.

Assumption 1.6. The 2 × 2 matrix M with elements defined below exists and is positive definite;

{[M]}_{st} = \frac{1}{2} \lim_{N \to \infty} tr (V^{- 1} G_{s} V^{- 1} G_{t}) ∕ N, s, t = 0, 1,

where G₀ = I is the N × N identity matrix and G₁ is the block-diagonal matrix with m blocks and each block is an n_i × n_i matrix of all 1s. After some algebra,

\begin{matrix} {[M]}_{00} = \lim_{N \to \infty} \frac{1}{2 N} tr (V^{- 2}) = \lim_{N \to \infty} \frac{1}{2 N} \sum_{i = 1}^{m} (\frac{n_{i} - 1}{σ_{∊}^{4}} + \frac{1}{{(σ_{∊}^{2} + n_{i} σ_{α}^{2})}^{2}}), \\ {[M]}_{01} = {[M]}_{10} = \lim_{N \to \infty} \frac{1}{2 N} tr (V^{- 2} 1_{N}^{\otimes 2}) = \lim_{N \to \infty} \frac{1}{2 N} \sum_{i = 1}^{m} \frac{n_{i}}{{(σ_{∊}^{2} + n_{i} σ_{α}^{2})}^{2}}, \\ {[M]}_{11} = \lim_{N \to \infty} \frac{1}{2 N} tr (V^{- 1} G_{1} V^{- 1} G_{1}) = \lim_{N \to \infty} \frac{1}{2 N} \sum_{i = 1}^{m} \frac{n_{i}^{2}}{{(σ_{∊}^{2} + n_{i} σ_{α}^{2})}^{2}} \end{matrix}

It is easy to see that matrix M is the average of nonnegative definite matrices. Under Assumption 1.4, M is positive definite if and only if ${\lim \inf}_{m \to \infty} \sum_{i = 1}^{m} n_{i} ∕ m > 1$ . Thus the main restriction in Assumption 1.6 is the requirement that M exists.
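The closed forms for the entries of M can be checked numerically at finite N against the trace definition. The sketch below is only a verification aid; the function names and the cluster sizes used are assumptions.

```python
import numpy as np

def M_trace(sizes, s2a, s2e):
    """Finite-N version of [M]_st = tr(V^-1 G_s V^-1 G_t) / (2N), summed over clusters."""
    N = sum(sizes)
    M = np.zeros((2, 2))
    for n in sizes:
        Vinv = np.linalg.inv(s2e * np.eye(n) + s2a * np.ones((n, n)))
        G = [np.eye(n), np.ones((n, n))]   # cluster blocks of G_0 and G_1
        for s in range(2):
            for t in range(2):
                M[s, t] += np.trace(Vinv @ G[s] @ Vinv @ G[t])
    return M / (2 * N)

def M_closed(sizes, s2a, s2e):
    """Finite-N version of the closed-form expressions stated after Assumption 1.6."""
    n = np.asarray(sizes, dtype=float)
    N = n.sum()
    d = (s2e + n * s2a) ** 2
    m00 = np.sum((n - 1) / s2e**2 + 1 / d)
    m01 = np.sum(n / d)
    m11 = np.sum(n**2 / d)
    return np.array([[m00, m01], [m01, m11]]) / (2 * N)

sizes = [2, 3, 5]   # hypothetical cluster sizes
same = np.allclose(M_trace(sizes, 1.0, 0.25), M_closed(sizes, 1.0, 0.25))
```

With all cluster sizes equal to 1 the three closed-form entries coincide, so the matrix is singular; this illustrates why the average cluster size must exceed 1 for M to be positive definite.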

Assumption 1.7. For any cell partition E₁, … E_L of the covariate space, $\lim_{N \to \infty} \sum_{i = 1}^{m} \sum_{j = 1}^{n_{i}} I_{{x_{i j} \in E_{l}}} x_{i j}^{T} ∕ N$ exists for l = 1, …, L.

For model (2), under Assumptions 1.1-1.7, as N → ∞,

\sqrt{N} (\begin{matrix} {f - e (β_{0})} ∕ N \\ \hat{β} - β_{0} \end{matrix}) \overset{D}{\to} N (0, {DVD}^{T}),

where

D = {(\begin{matrix} N^{- 1 ∕ 2} [I_{{x_{11} \in E_{1}}} \dots I_{{x_{{mn}_{m}} \in E_{1}}}] \\ ⋮ \\ N^{- 1 ∕ 2} [I_{{x_{11} \in E_{L}}} \dots I_{{x_{{mn}_{m}} \in E_{L}}}] \\ N^{- 1 ∕ 2} J_{β β}^{- 1} X^{T} V^{- 1} \end{matrix})}_{(L + p) \times N},

and

{DVD}^{T} = {(\begin{matrix} H & Λ J_{β β}^{- 1} \\ J_{β β}^{- 1} Λ^{T} & J_{β β}^{- 1} \end{matrix})}_{(L + p) \times (L + p)} .

The off-diagonal and diagonal elements of H, H_lk and H_ll, are

H_{lk} = σ_{α}^{2} \lim_{N \to \infty} \frac{1}{N} \sum_{i = 1}^{m} {(\sum_{j = 1}^{n_{i}} I_{{x_{ij} \in E_{l}}}) (\sum_{j = 1}^{n_{i}} I_{{x_{ij} \in E_{k}}})},

(8)

H_{ll} = σ_{∊}^{2} \lim_{N \to \infty} \frac{1}{N} \sum_{i = 1}^{m} \sum_{j = 1}^{n_{i}} I_{{x_{ij} \in E_{l}}} + σ_{α}^{2} \lim_{N \to \infty} \frac{1}{N} \sum_{i = 1}^{m} {(\sum_{j = 1}^{n_{i}} I_{{x_{ij} \in E_{l}}})}^{2},

(9)

and

Λ = {(\begin{matrix} Λ_{1}^{T} \\ ⋮ \\ Λ_{L}^{T} \end{matrix})}_{L \times p} = \lim_{N \to \infty} \frac{1}{N} (\begin{matrix} \sum_{i = 1}^{m} \sum_{j = 1}^{n_{i}} I_{{x_{ij} \in E_{1}}} x_{ij}^{T} \\ ⋮ \\ \sum_{i = 1}^{m} \sum_{j = 1}^{n_{i}} I_{{x_{ij} \in E_{L}}} x_{ij}^{T} \end{matrix}) .

(10)

The proof of Theorem 1 is given in Appendix A.

Corollary 1. Consistent estimators for the quantities given in (8), (9), (10) and (4) are

\begin{matrix} {\hat{H}}_{lk} = {\hat{σ}}_{α}^{2} \frac{1}{N} \sum_{i = 1}^{m} {(\sum_{j = 1}^{n_{i}} I_{{x_{ij} \in E_{l}}}) (\sum_{j = 1}^{n_{i}} I_{{x_{ij} \in E_{k}}})}, \\ {\hat{H}}_{ll} = {\hat{σ}}_{∊}^{2} \frac{1}{N} \sum_{i = 1}^{m} \sum_{j = 1}^{n_{i}} I_{{x_{ij} \in E_{l}}} + {\hat{σ}}_{α}^{2} \frac{1}{N} \sum_{i = 1}^{m} {(\sum_{j = 1}^{n_{i}} I_{{x_{ij} \in E_{l}}})}^{2}, \\ {\hat{Λ}}_{l}^{T} = \sum_{i = 1}^{m} \sum_{j = 1}^{n_{i}} I_{{x_{ij} \in E_{l}}} x_{ij}^{T} ∕ N, {\hat{J}}_{β β} = X^{T} {\hat{V}}^{- 1} X ∕ N . \end{matrix}

Corollary 2. For model (2), under Assumptions 1.1-1.7 in Theorem 1, as N → ∞, ${f - e (\hat{β})} ∕ \sqrt{N} \overset{D}{\to} N (0, Σ)$ , where $Σ = H - Λ J_{β β}^{- 1} Λ^{T}$ is an L × L matrix that is consistently estimated by $\hat{Σ} = \hat{H} - \hat{Λ} {\hat{J}}_{β β}^{- 1} {\hat{Λ}}^{T}$ based on Corollary 1.

The proof of Corollary 2 is given in Appendix B.
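The plug-in estimators of Corollary 1 and the resulting estimate of Σ in Corollary 2 could be assembled as in the following sketch. The function name `sigma_hat`, the integer arrays `cells` (cell index per observation) and `cluster` (cluster index per observation), and the use of the closed-form inverse of each block V_i are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def sigma_hat(X, cells, cluster, L, s2a, s2e):
    """Plug-in estimate of Sigma = H - Lambda J^{-1} Lambda^T for model (2).

    X: N x p design; cells: cell index 0..L-1 per row; cluster: cluster
    index 0..m-1 per row; s2a, s2e: estimated variance components.
    """
    N, p = X.shape
    m = int(cluster.max()) + 1
    ind = np.zeros((N, L))
    ind[np.arange(N), cells] = 1.0             # N x L cell indicators
    counts = np.zeros((m, L))
    np.add.at(counts, cluster, ind)            # per-cluster counts in each cell
    # H: s2a * cross-products of cluster counts, plus s2e * cell sizes on the diagonal
    H = s2a * counts.T @ counts / N + s2e * np.diag(ind.sum(axis=0)) / N
    Lam = ind.T @ X / N
    # J = X^T V^{-1} X / N with V_i^{-1} = (I - w_i 1 1^T) / s2e
    n_i = np.bincount(cluster, minlength=m).astype(float)
    w = s2a / (s2e + n_i * s2a)
    Xsum = np.zeros((m, p))
    np.add.at(Xsum, cluster, X)                # column sums of X within each cluster
    J = (X.T @ X - Xsum.T @ (w[:, None] * Xsum)) / (s2e * N)
    Sigma = H - Lam @ np.linalg.solve(J, Lam.T)
    return Sigma, H, Lam, J
```

Working cluster by cluster avoids forming the N × N matrix V explicitly, which matters only for large N; a dense computation gives the same result.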

Our goodness of fit test statistic is then given by the quadratic form

T = \frac{1}{N} {f - e (\hat{β})}^{T} {\hat{Σ}}_{ζ}^{-} {f - e (\hat{β})},

(11)

where ${\hat{Σ}}_{ζ}^{-}$ denotes the Moore-Penrose generalized-inverse (also called ‘pseudo-inverse’) of a slight modification of the consistent variance estimator $\hat{Σ}$ . We define ${\hat{Σ}}_{ζ}^{-}$ in the following paragraph. Under the null hypothesis that model (2) is the true model, T has an asymptotic central $χ_{r}^{2}$ distribution, where with probability converging to 1 as N → ∞, $r = rank ({\hat{Σ}}_{ζ}) = rank (Σ)$ . This result applies to the modification ${\hat{Σ}}_{ζ}$ of any consistent estimator $\hat{Σ}$ of the Σ matrix, such as restricted maximum likelihood (REML) estimators.

The modification ${\hat{Σ}}_{ζ}$ of variance estimates we describe next applies in several places in this paper. The issue is always to avoid numerical instabilities and asymptotic distributional anomalies due to rank differences between consistent nonnegative-definite variance estimators $\hat{Σ}$ and the true asymptotic variance Σ. Assume in what follows that there exists a known threshold ζ (say, 10⁻⁴) smaller than all non-zero singular values of Σ. Since Σ is nonnegative definite, all singular values are nonnegative. Denoting the spectral decomposition of $\hat{Σ}$ as $\sum_{k = 1}^{q} c_{k N} v_{k N} v_{k N}^{T}$ , where ${v_{k N}}_{k = 1}^{q}$ is an orthonormal basis of eigenvectors of $\hat{Σ}$ and c_kN are the corresponding eigenvalues, the modification ${\hat{Σ}}_{ζ}$ of $\hat{Σ}$ is defined as ${\hat{Σ}}_{ζ} = \sum_{k = 1}^{q} I_{[c_{k N} > ζ]} c_{k N} v_{k N} v_{k N}^{T}$ , and its Moore-Penrose pseudo-inverse is then ${\hat{Σ}}_{ζ}^{-} = \sum_{k = 1}^{q} I_{[c_{k N} > ζ]} {(c_{k N})}^{- 1} v_{k N} v_{k N}^{T}$ . We prove in Proposition 1, Appendix D, that ${\hat{Σ}}_{ζ}$ is coordinate-free and unique, that $rank ({\hat{Σ}}_{ζ}) = r$ with probability converging to 1, and that the asymptotic distribution of T in (11) is $χ_{r}^{2}$ and would persist if the Moore-Penrose choice of pseudo-inverse were replaced by any other generalized-inverse of ${\hat{Σ}}_{ζ}$ as defined in Rao (1973, Sec. 1.b). In addition, the assumption of a fixed known threshold ζ can be replaced by allowing ζ = ζ_N to depend non-randomly on N and converge to 0 sufficiently slowly. (The rate would depend on the specific consistent estimator $\hat{Σ}$ .)
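The thresholded pseudo-inverse and the quadratic form (11) can be sketched as below. The function names and the default threshold value 1e-4 (taken from the "say, 10⁻⁴" in the text) are illustrative assumptions; the symmetrization step is an added numerical safeguard, not part of the definition.

```python
import numpy as np

def thresholded_pinv(S, zeta=1e-4):
    """Moore-Penrose inverse of S after dropping eigenvalues at or below zeta.

    Returns the pseudo-inverse and the number of retained eigenvalues, which
    serves as the degrees of freedom r of the reference chi-squared distribution.
    """
    c, v = np.linalg.eigh((S + S.T) / 2)    # symmetrize against round-off
    keep = c > zeta
    S_inv = (v[:, keep] / c[keep]) @ v[:, keep].T
    return S_inv, int(keep.sum())

def gof_statistic(f, e, S_hat, N, zeta=1e-4):
    """Quadratic form T in (11) together with its degrees of freedom r."""
    d = f - e
    S_inv, r = thresholded_pinv(S_hat, zeta)
    return d @ S_inv @ d / N, r

# Toy check (hypothetical numbers): one eigenvalue falls below the threshold
T, r = gof_statistic(np.array([2.0, 1.0, 1.0]), np.zeros(3),
                     np.diag([2.0, 1e-8, 0.5]), N=4)
```

A p-value would then be obtained by referring T to the central chi-squared distribution with r degrees of freedom.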

2.2.2. LMM with multilevel additive random effects

We now consider the general LMM with multilevel additive random effects given in equation (1) with a fixed covariate matrix X.

Again, the covariate space, comprised of covariates not all of which need be included in the model, is divided into L disjoint regions E₁, …, E_L, and for l = 1, 2, …, L, we define the observed and expected vectors f = (f₁, …, f_L) and e(β) = (e₁(β), …, e_L(β)) as $f_{l} = \sum_{k = 1}^{N} I_{{x_{k} \in E_{l}}} y_{k}$ , and $e_{l} (β) = \sum_{k = 1}^{N} I_{{x_{k} \in E_{l}}} E (y_{k}) = \sum_{k = 1}^{N} I_{{x_{k} \in E_{l}}} x_{k}^{T} β$ . The impact of the choice of L is discussed further in the simulation section (Section 3).

Conditions in Miller (1977) that ensure the consistency and asymptotic normality of the MLE of θ = (β, ψ) are given in the Supplementary Material S1 (assumptions A.1-A.7). We additionally make assumptions A.8-A.9 (Supplementary Material S1) to ensure the existence of large-sample averages involving x_k and I_{[x_k∈E_j]}.

Theorem 2. For model (1), under Assumptions A.1-A.9 given in the Supplementary Material S1,

T = {f - e (\hat{β})}^{T} {\hat{Σ}}_{ζ}^{-} {f - e (\hat{β})} ∕ N \overset{D}{\to} χ_{r}^{2},

(12)

as N → ∞, where $\hat{β}$ is the MLE of β, $\hat{Σ}$ is a consistent estimator of $Σ = H - Λ J_{β β}^{- 1} Λ^{T}$ , ${\hat{Σ}}_{ζ}$ is the modification of $\hat{Σ}$ as defined in the last paragraph of Section 2.2.1, ${\hat{Σ}}_{ζ}^{-}$ denotes the Moore-Penrose pseudoinverse of ${\hat{Σ}}_{ζ}$ , and for r ≡ rank(Σ), $P (rank ({\hat{Σ}}_{ζ}) = r) \to 1$ , as N → ∞. Here H = lim_N→∞ FVF^T, with

F = \frac{1}{\sqrt{N}} (\begin{matrix} I_{{x_{1} \in E_{1}}} \dots I_{{x_{N} \in E_{1}}} \\ ⋮ \\ I_{{x_{1} \in E_{L}}} \dots I_{{x_{N} \in E_{L}}} \end{matrix})

(13)

and the l-th row of Λ given by

Λ_{l}^{T} = \lim_{N \to \infty} \frac{1}{N} \sum_{k = 1}^{N} I_{{x_{k} \in E_{l}}} x_{k}^{T}

(14)

The proof of Theorem 2 is similar to that of Theorem 1 and is given in Supplementary Material S2.
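For the general model (1), the finite-N matrix FVF^T can be formed directly from the cell indicators in (13). The sketch below assumes the usual marginal covariance V = σ_∊² I + Σ_r σ_r² Z_r Z_r^T, which holds when the components of each α_r are uncorrelated with common variance σ_r²; this form, the function names and the toy data are assumptions rather than statements from the text.

```python
import numpy as np

def marginal_cov(Z_list, s2_list, s2e):
    """V = s2e * I + sum_r s2_r * Z_r Z_r^T (assumed form of the marginal covariance)."""
    N = Z_list[0].shape[0]
    V = s2e * np.eye(N)
    for Z, s2 in zip(Z_list, s2_list):
        V = V + s2 * (Z @ Z.T)
    return V

def H_finite(cells, L, V):
    """Finite-N version of H = F V F^T with F as in (13)."""
    cells = np.asarray(cells)
    F = np.eye(L)[cells].T / np.sqrt(cells.size)   # L x N scaled indicator matrix
    return F @ V @ F.T

# Single random intercept (hypothetical data): two clusters of sizes 2 and 3
cluster = np.array([0, 0, 1, 1, 1])
Z = np.eye(2)[cluster]                             # N x m cluster membership matrix
H = H_finite([0, 1, 0, 0, 1], 2, marginal_cov([Z], [1.0], 0.25))
```

In this single-random-intercept case the result agrees with the finite-N versions of (8) and (9), which gives a simple consistency check between the two theorems.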

2.3. Test statistic and its asymptotic properties for two-level LMM with parameters estimated by least squares and method of moments

We consider the LMM (2), but now only require that E(α_i) = E(∊_ij) = 0, $Var (α_{i}) = σ_{α}^{2}$ , $Var (∊_{i j}) = σ_{∊}^{2}$ , and that there is a δ > 0, such that $E (α_{i}^{4 + δ}) < \infty$ and $E (∊_{i j}^{4 + δ}) < \infty$ , instead of assuming normality of α_i and ∊_ij. To compensate for the weaker distributional assumptions, we assume for simplicity in probability limit theorems that the covariate vectors x_ij are random, that {(x_i1, …, x_{in_i}), n_i} are i.i.d. and that $E (n_{i}^{2}) < \infty$ and $\| E (x_{i}^{T} x_{i}) \| < \infty$ , where x_i denotes the n_i × p matrix of covariates for the ith cluster.

We estimate β by the generalized least squares estimator

\tilde{β} = {(X^{T} {\tilde{V}}^{- 1} X)}^{- 1} (X^{T} {\tilde{V}}^{- 1} Y) = {(X^{T} V^{- 1} X)}^{- 1} (X^{T} V^{- 1} Y) + o_{p} (1), as N \to \infty,

(15)

where V depends on the variance components $ψ = (σ_{α}^{2}, σ_{∊}^{2})$ . They are estimated by the method of moments by equating the right-hand sides of

E {\sum_{i = 1}^{m} \sum_{j = 1}^{n_{i}} {(y_{ij} - {\overset{‒}{y}}_{i .})}^{2}} = \sum_{i = 1}^{m} \sum_{j = 1}^{n_{i}} {(x_{ij}^{T} β - {\overset{‒}{x}}_{i}^{T} β)}^{2} + (N - m) σ_{∊}^{2}

(16)

E {\sum_{i = 1}^{m} \sum_{j = 1}^{n_{i}} {({\overset{‒}{y}}_{i .} - {\overset{‒}{y}}_{. .})}^{2}} = \sum_{i = 1}^{m} n_{i} {({\overset{‒}{x}}_{i .}^{T} β - {\overset{‒}{x}}_{. .}^{T} β)}^{2} + (N - \frac{1}{N} \sum_{i = 1}^{m} n_{i}^{2}) σ_{α}^{2} + (m - 1) σ_{∊}^{2}

(17)

respectively with their estimates based on the sum of squares within groups (SSW) and the sum of squares between groups (SSB) in the analysis of variance, given by

SSW = \sum_{i = 1}^{m} \sum_{j = 1}^{n_{i}} {(y_{ij} - {\overset{‒}{y}}_{i .})}^{2} and SSB = \sum_{i = 1}^{m} \sum_{j = 1}^{n_{i}} {({\overset{‒}{y}}_{i .} - {\overset{‒}{y}}_{. .})}^{2} = \sum_{i = 1}^{m} n_{i} {\overset{‒}{y}}_{i .}^{2} - N {\overset{‒}{y}}_{. .}^{2} .

The notation ${\overset{‒}{y}}_{i}$ stands for ${\overset{‒}{y}}_{i .} = \sum_{j} y_{i j} ∕ n_{i}$ and ${\overset{‒}{y}}_{. .} = \sum_{i} \sum_{j} y_{i j} ∕ N$ , and similar averages apply to the covariates x. Because different clusters are independent, and $\sum_{j = 1}^{n_{i}} {(y_{i j} - {\overset{‒}{y}}_{i .})}^{2}, n_{i} {\overset{‒}{y}}_{i .}^{2}$ , and $\sum_{j = 1}^{n_{i}} y_{i j}$ have finite second moments, SSW/m and SSB/m satisfy the law of large numbers. The estimating equations (15), (16) and (17) can be solved iteratively for $θ = (β, σ_{α}^{2}, σ_{∊}^{2})$ and yield consistent estimates $\tilde{θ}$ (Richardson and Welsh, 1994; Jiang, 1996).
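One way to iterate between the generalized least squares step (15) and the moment equations (16) and (17) is sketched below. The function names, the starting values, the fixed number of iterations and the truncation of the variance estimates at (near) zero are assumptions added for illustration; the text only states that the equations can be solved iteratively.

```python
import numpy as np

def gls_beta(X, y, cluster, s2a, s2e):
    """GLS estimate (15), using V_i^{-1} = (I - w_i 1 1^T) / s2e for each block."""
    m = int(cluster.max()) + 1
    n_i = np.bincount(cluster, minlength=m).astype(float)
    w = s2a / (s2e + n_i * s2a)
    Xs = np.zeros((m, X.shape[1]))
    np.add.at(Xs, cluster, X)                          # cluster sums of covariates
    ys = np.bincount(cluster, weights=y, minlength=m)  # cluster sums of responses
    A = X.T @ X - Xs.T @ (w[:, None] * Xs)
    b = X.T @ y - Xs.T @ (w * ys)
    return np.linalg.solve(A, b)                       # the common factor 1/s2e cancels

def mom_variances(X, y, cluster, beta):
    """Solve (16) and (17) for the variance components at a given beta."""
    m, N = int(cluster.max()) + 1, y.size
    n_i = np.bincount(cluster, minlength=m).astype(float)
    mu = X @ beta
    ybar = np.bincount(cluster, weights=y, minlength=m) / n_i
    mubar = np.bincount(cluster, weights=mu, minlength=m) / n_i
    ssw = np.sum((y - ybar[cluster]) ** 2)
    ssb = np.sum(n_i * (ybar - y.mean()) ** 2)
    fix_w = np.sum((mu - mubar[cluster]) ** 2)         # fixed-effects term in (16)
    fix_b = np.sum(n_i * (mubar - mu.mean()) ** 2)     # fixed-effects term in (17)
    s2e = (ssw - fix_w) / (N - m)
    s2a = (ssb - fix_b - (m - 1) * s2e) / (N - np.sum(n_i ** 2) / N)
    return max(s2a, 0.0), max(s2e, 1e-12)              # truncation is an added safeguard

def fit_ls_mom(X, y, cluster, n_iter=20):
    """Alternate between (15) and (16)-(17), starting from ordinary least squares."""
    s2a, s2e = 0.0, 1.0
    for _ in range(n_iter):
        beta = gls_beta(X, y, cluster, s2a, s2e)
        s2a, s2e = mom_variances(X, y, cluster, beta)
    return beta, s2a, s2e
```

Starting from s2a = 0 makes the first step ordinary least squares, so the first pass already yields a consistent estimate of β under the stated moment conditions.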

To obtain the test statistic, we again compute the observed and expected values in each of the L cells of the covariate space as in (5) and (6).

Theorem 3. For the LMM (2), under Assumptions 3.1-3.3 in Appendix D, as N → ∞, ${f - e (\tilde{β})} ∕ \sqrt{N} \overset{D}{\to} N (0, Σ)$ , where $Σ = H - Λ J_{β β}^{- 1} Λ^{T}$ . Thus $T = {f - e (\tilde{β})}^{T} {\hat{Σ}}_{ζ}^{-} {f - e (\tilde{β})} ∕ N \overset{D}{\to} χ_{r}^{2}$ , where $\hat{Σ}$ is a consistent estimator of Σ, ${\hat{Σ}}_{ζ}$ is the modification of $\hat{Σ}$ as defined in the last paragraph of Section 2.2.1, ${\hat{Σ}}_{ζ}^{-}$ denotes the Moore-Penrose pseudoinverse of ${\hat{Σ}}_{ζ}$ , and $P (rank ({\hat{Σ}}_{ζ}) = rank (Σ) = r) \to 1$ as N → ∞.

As X is a matrix of random variables in this section, Assumption 3.3 ensures that Assumptions 1.4-1.7 hold, which are needed when X is assumed to be fixed in Section 2.2. The matrices H, Λ and J_ββ are the same as for the two-level LMM (2) when parameters are estimated by maximum likelihood under the assumption of normality of α_i and ∊_ij and are specified in (8), (9), (10) and (4). The proof of the theorem is given in Appendix E.

Remark 1. Theorem 1 and Theorem 3 are still valid when empirical quantiles instead of fixed cut-offs are used to define the cell partition, if we assume that the empirical quantiles of coordinates of x_i converge to unique limits or under Assumption 3.3.

2.4. Power of the test

For the multi-level LMM (1), we derive the theoretical power under local, and more specifically under contiguous alternatives for the test in (12) for the situation where some covariates that influence the outcome y are omitted from model (1). This case also covers omitted interactions of covariates or omitted higher order terms and is thus practically relevant.

Let X be the true N × p covariate matrix and X* be a submatrix of X of dimension N × p* used to fit model (1), with p* < p. The null hypothesis is H₀: θ_N = θ₀. We assess the power of T under the alternative

H_{1} : θ_{N} = θ_{0} + a ∕ \sqrt{N},

(18)

with θ₀ = (β₀, ψ₀), where several components of β₀ are 0. The vector $β_{0}^{*}$ contains the components of β₀ corresponding to X*. Here $a ∕ \sqrt{N}$ is the vector difference between the parameter values under the alternative hypothesis and the parameter values under the null hypothesis.

Based on the derivation for Theorem 2, we have that under H₀, ${f - e ({\hat{β}}^{*})} ∕ \sqrt{N} \overset{D}{\to} N (0, Σ^{*})$ . By applying Le Cam’s third lemma (see Appendix F for details), we find that under the alternative hypothesis H₁ in (18),

{f - e ({\hat{β}}^{*})} ∕ \sqrt{N} \overset{D}{\to} N (τ, Σ^{*}),

(19)

where

τ = \lim_{N \to \infty} [Λ - Λ^{*} {{(X^{*})}^{T} V^{- 1} X^{*}}^{- 1} {{(X^{*})}^{T} V^{- 1} X}] a,

(20)

with Λ given by expression (14). Λ* is computed using X* in place of X in (14).

Thus under H₁, T* has a limiting noncentral χ² distribution

T^{*} = \frac{1}{N} {f - e ({\hat{β}}^{*})}^{T} {({\hat{Σ}}_{ζ}^{*})}^{-} {f - e ({\hat{β}}^{*})} \overset{D}{\to} χ_{r}^{2} (λ),

(21)

where $r = rank ({\hat{Σ}}_{ζ}^{*})$ and the noncentrality parameter is $λ = τ^{T} {({\hat{Σ}}_{ζ}^{*})}^{-} τ$ . For a given type I error level α, the power is thus $P (T^{*} > χ_{r, α}^{2})$ , where $χ_{r, α}^{2}$ is the 1 − α quantile of the central $χ_{r}^{2}$ distribution and P denotes probability under the noncentral $χ_{r}^{2} (λ)$ distribution. In the computation of the power we use the Moore-Penrose inverse of a modification ${\hat{Σ}}_{ζ}^{*}$ as in (11) of a consistent estimator ${\hat{Σ}}^{*}$ of Σ* in (19).
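Given a drift vector τ and a variance matrix Σ*, the asymptotic power can be evaluated numerically with the noncentral chi-squared distribution. The sketch below uses `scipy.stats.chi2` and `scipy.stats.ncx2`; the function name `local_power`, the threshold default and the toy inputs are assumptions, and in practice τ and Σ* would be replaced by the quantities in (19) and (20) or their estimates.

```python
import numpy as np
from scipy.stats import chi2, ncx2

def local_power(tau, Sigma_star, alpha=0.05, zeta=1e-4):
    """Asymptotic power P(T* > chi2_{r, alpha}) under the noncentral chi2_r(lambda) limit."""
    c, v = np.linalg.eigh((Sigma_star + Sigma_star.T) / 2)
    keep = c > zeta                        # same eigenvalue threshold as for T
    r = int(keep.sum())
    proj = v[:, keep].T @ tau              # coordinates of tau in the retained eigenbasis
    lam = float(np.sum(proj ** 2 / c[keep]))   # lambda = tau^T (Sigma*)^- tau
    crit = chi2.ppf(1 - alpha, r)          # 1 - alpha quantile of the central chi2_r
    return float(ncx2.sf(crit, r, lam)), r, lam

# Toy inputs (hypothetical): rank-2 variance matrix, so r = 2 and lambda = 1 + 4/4 = 2
power, r, lam = local_power(np.array([1.0, 2.0, 0.0]), np.diag([1.0, 4.0, 0.0]))
```

Because λ is quadratic in τ, doubling the drift vector multiplies the noncentrality parameter by four.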

As an illustration, we show the asymptotic power to detect lack of fit for an omitted covariate for the two-level LMM (2). We assume Y ~ N(X^T β, V), where X = (1, x₁, x₂, x₃) and V is the block diagonal covariance matrix. The x_ij = (x_ij1, x_ij2, x_ij3), i = 1, …, m; j = 1, …, n_i are i.i.d. and drawn from a multivariate normal distribution

(\begin{matrix} x_{1} \\ x_{2} \\ x_{3} \end{matrix}) \sim N ([\begin{matrix} 0 \\ 0 \\ 0 \end{matrix}], [\begin{matrix} 1 & 0 & ρ_{13} \\ 0 & 1 & ρ_{23} \\ ρ_{13} & ρ_{23} & 1 \end{matrix}]),

(22)

and x_ij and n_i are independent. In fitting the model, we omit x₃ leading to X* = (1, x₁, x₂) and a = (0, 0, 0, β₃) in (20). For this setting, τ in (20) and Σ* in (19) can be computed explicitly as functions of the moments of X and n_i (Section 2.2.3, Tang, 2010). We study the impact of the magnitude of the variance components $σ_{α}^{2}$ and $σ_{∊}^{2}$ and the correlations ρ₁₃ and ρ₂₃ in (22) between the omitted covariate x₃ and the covariates in the model (x₁ and x₂) on the theoretical power when the cell partition is based on theoretical quantiles of the omitted covariate x₃ with L = 8 cells. For ρ₁₃ = 0.5 and ρ₂₃ = 0.6, Figure 1 (left panel) plots the theoretical power against $β_{3} ∕ {(σ_{α}^{2} + σ_{∊}^{2})}^{1 ∕ 2}$ for three choices of $(σ_{α}^{2}, σ_{∊}^{2})$ all corresponding to the same overall variance $σ_{α}^{2} + σ_{∊}^{2}$ and varying β₃ on the x-axis. For any fixed pair of $(σ_{α}^{2}, σ_{∊}^{2})$ , the power of the test, not surprisingly, increases as a function of β₃, the coefficient of the omitted covariate x₃. This can also be seen by taking a first order Taylor expansion of the theoretical power formula around λ = 0, as the power for λ close to zero depends linearly on λ = τ^T(Σ*)⁻¹τ, which is a function of $β_{3}^{2}$ . Figure 1 (left panel) shows that for any fixed β₃ the power increases when the random effect variance $σ_{α}^{2}$ decreases relative to the error variance $σ_{∊}^{2}$ . Figure 1 (right panel) plots the power for $σ_{α}^{2} = 1$ , $σ_{∊}^{2} = 0.25$ for different choices of (ρ₁₃, ρ₂₃). The power increases as $ρ_{13}^{2} + ρ_{23}^{2}$ decreases. For ρ₁₃ = 0 and ρ₂₃ = 0, that is, when x₃ is uncorrelated with x₁ and x₂, the power is not affected by the individual values of $σ_{α}^{2}$ and $σ_{∊}^{2}$ , but only depends on the sum $σ_{α}^{2} + σ_{∊}^{2}$ . The theoretical formulas for power under contiguous alternatives given here will generally be close to the actual power only for very large sample sizes.
However, numerical studies presented in the next section show that these formulas often also agree with empirical power in samples of moderate size (m = 50, N = 150 to N = 200).

Figure 1. Left: Theoretical power as a function of $(σ_{α}^{2}, σ_{∊}^{2})$ ; Right: Power as a function of (ρ₁₃, ρ₂₃).

3. Simulations to assess power and robustness of the test statistic

For a given number of clusters m, we first generated the cluster sizes n_i from a uniform distribution on {2, 3, 4, 5} for i = 1, …, m and then drew $N = \sum_{i = 1}^{m} n_{i}$ independent covariates x_ij for all simulations presented below. We present scenarios for which we believe our test would be practically most relevant: models with omitted main effects (Scenario I), omitted interaction terms and main effects (Scenarios II and III) and misspecified functional forms of a covariate (Scenario IV). We covered a range of effect sizes to provide a fair assessment of the performance of our test.

3.1. Main effects only (Scenario I)

Here x_ij = (x_1ij, x_2ij, x_3ij) were drawn from the multivariate normal distribution given in (22). Given X = (1, x₁, x₂, x₃), β, σ_α = 1 and σ_∊ = 0.5, we generated Y from a multivariate normal distribution, Y ~ N(X^T β, V).
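The data-generating mechanism of Scenario I could be reproduced along the following lines. The function name `simulate_scenario1`, the seed handling and the returned `cluster` index array are assumptions; the distributional choices follow the description above and (22).

```python
import numpy as np

def simulate_scenario1(m, beta, rho13, rho23, sigma_a=1.0, sigma_e=0.5, seed=0):
    """Draw one data set (y, X, cluster) from model (2) under Scenario I."""
    rng = np.random.default_rng(seed)
    n_i = rng.integers(2, 6, size=m)            # cluster sizes uniform on {2, 3, 4, 5}
    cluster = np.repeat(np.arange(m), n_i)      # cluster index of each observation
    N = cluster.size
    R = np.array([[1.0, 0.0, rho13],
                  [0.0, 1.0, rho23],
                  [rho13, rho23, 1.0]])         # covariance matrix in (22)
    x = rng.multivariate_normal(np.zeros(3), R, size=N)
    X = np.column_stack([np.ones(N), x])        # intercept plus x1, x2, x3
    alpha = rng.normal(0.0, sigma_a, size=m)    # random intercepts
    eps = rng.normal(0.0, sigma_e, size=N)      # error terms
    y = X @ np.asarray(beta, dtype=float) + alpha[cluster] + eps
    return y, X, cluster

y, X, cluster = simulate_scenario1(500, [1, 1, 1, 0.25], rho13=0.5, rho23=0.6)
```

With m = 500 the expected total sample size is 3.5 × 500 = 1750, matching the value of E(N) reported for this scenario.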

3.1.1. Size and power of T

To check the size of the test for various choices of cell partition based on X, we let ρ₁₃ = ρ₂₃ = 0 in (22), β = (β₀, β₁, β₂, β₃) = (1, 1, 1, 1), m = 500 and fit model (2) with all covariates X in the model. Table 1 gives the number of cells and the covariates that are the basis of the cell partition in the first column. Cell partitions in the computation of the test statistic T were based on empirical quantiles of the respective components of X. For all cell partitions in Table 1 the empirical sizes were close to the nominal α levels of 0.05 and 0.1.

Table 1.

Empirical size of the test under different cell partitions (Scenario I). m = 500, E(N) = 1750, β₃ = 1, ρ₁₃ = ρ₂₃ = 0, σ_α = 1, σ_ε = .5, K = 2000. L denotes the number of cells for the test statistic.

L	α	Emp. Size	α	Emp. Size
8 (x₁)	0.05	0.052	0.1	0.103
3×4 (x₁, x₂)	0.05	0.053	0.1	0.108
5×4 (x₁, x₃)	0.05	0.045	0.1	0.094
6×7 (x₂, x₃)	0.05	0.047	0.1	0.096

To assess the power of the test, we generated data from model (2) that includes all three covariates but then omitted x₃ in fitting the model to the data. We set (ρ₁₃, ρ₂₃) = (0.5, 0.6), β = (β₀, β₁, β₂, β₃) = (1, 1, 1, 0.25). We then generated K = 2000 datasets for a given X. We repeated this data generation process for D = 1000 independently drawn design matrices X. For a given X, we computed the theoretical power of T* in (21) based on the asymptotic χ² distribution with the true values of $σ_{α}^{2}$ and $σ_{∊}^{2}$ . The empirical moments for X were used in the calculation of the non-centrality parameter λ. For a given X and each generated Y, we calculated the estimated theoretical power based on the asymptotic χ² distribution with the variance components estimated based on the given Y and empirical moments of X in (20). We then repeated the calculation of the estimated theoretical power for each of the K = 2000 generated Y and by taking the average, we obtained a mean estimated theoretical power for that given X. For each given design matrix X, we also calculated the empirical power based on K = 1000 iterations on Y. For m = 20 clusters the cell partition was based on the empirical quantiles of the omitted x₃ with L = 8 cells to avoid empty cells, but for m = 50 or 500, we used theoretical quantiles of x₃ as cell boundaries for computational ease.

The mean theoretical power (“Theo.Pow.”), the mean estimated theoretical power (“Theo.Pow.hat”) and the empirical power (“Empi.Pow.n”) agreed very well, even when m was small (Table 2). However, only for m = 500 was there adequate power to detect lack of fit when β₃ = 0.25, which is substantially smaller than the coefficients β₁ = β₂ = 1 of x₁ and x₂, the covariates included in the model. When the effect of the omitted covariate was larger, β₃ = 0.8, the test statistic had approximately 80% power even for m = 50 clusters.
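
The theoretical power described above is a tail probability of a noncentral χ² distribution evaluated at the central χ² critical value. The sketch below shows only that final step. The function name `chi2_power`, the degrees of freedom `df=7` and the listed noncentrality values are illustrative assumptions; in the paper λ comes from expressions (20) and (21), which are not reproduced here.

```python
from scipy.stats import chi2, ncx2

def chi2_power(noncentrality, df, alpha=0.05):
    """Asymptotic power of a level-alpha chi-squared test with a noncentral limit."""
    critical = chi2.ppf(1.0 - alpha, df)          # rejection threshold under H0
    return ncx2.sf(critical, df, noncentrality)   # P(noncentral chi2 > threshold)

# Illustrative noncentrality parameters only (hypothetical values of lambda).
powers = [chi2_power(lam, df=7) for lam in (1.0, 5.0, 12.0, 20.0)]
```

Because λ grows with the number of observations under a fixed alternative, this calculation also makes clear why power rises sharply from m = 50 to m = 500 for the same β₃.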

Table 2.

Power and robustness study (Scenario I) with L = 8, K = 2000, D = 1000, (ρ₁₃, ρ₂₃) = (.5, .6), σ_α = 1, σ_ε = .5. The standard deviation (std.dev.) relates to variation across the 1000 randomly generated covariate matrices X.

Power	m = 500, EN = 1750		m = 50, EN = 175		m = 20, EN = 70
β₃ = .25	mean	std.dev.	mean	std.dev.	mean	std.dev.
Theo.Pow.	0.800	0.040	0.120	0.023	0.086	0.018
Theo.Pow.hat	0.799	0.039	0.125	0.023	0.090	0.019
Empi.Pow.n	0.799	0.037	0.111	0.022	0.063	0.017

Misspecification of the error term distribution

Empi.Pow.t₃	0.798	0.036	0.112	0.023	0.062	0.019
Empi.Pow.t₅	0.799	0.036	0.111	0.022	0.063	0.018

Misspecification of the random intercept distribution

Empi.Pow.t₃	0.817	0.033	0.132	0.027	0.076	0.021
Empi.Pow.t₅	0.799	0.036	0.116	0.023	0.067	0.018

Power	m = 500, EN = 1750		m = 50, EN = 175		m = 20, EN = 70
β₃ = .8	mean	std.dev.	mean	std.dev.	mean	std.dev.
Theo.Pow.	1	0	0.847	0.104	0.541	0.189
Theo.Pow.hat	1	0	0.821	0.096	0.512	0.151
Empi.Pow.n	1	0	0.820	0.102	0.444	0.161

Misspecification of the error term distribution

Empi.Pow.t₃	1	0	0.821	0.102	0.449	0.164
Empi.Pow.t₅	1	0	0.821	0.102	0.444	0.162

Misspecification of the random intercept distribution

Empi.Pow.t₃	1	0	0.852	0.074	0.545	0.155
Empi.Pow.t₅	1	0	0.824	0.094	0.472	0.159

3.1.2. Robustness of T with respect to error and random effects distributions

In Table 2 we also assessed the impact of misspecification of the error distribution on the power of the test statistic. Using the same setting as in the power calculations given above, we generated ∊ from a t distribution with k = 3 or 5 degrees of freedom (d.f.) instead of from a $N (0, σ_{∊}^{2})$ distribution. We rescaled ∊ so that the noise had the same variance as in the normal case. The power of the test under a t distribution was virtually the same as with a normal error distribution, indicating that our test is robust to symmetric violations of normality. For example, for m = 50 with β₃ = 0.8, the power was 0.82 for the normal error distribution and for t distributions with 3 and 5 d.f. (Table 2). We also used the same misspecification for the random effects distribution, and observed very similar results (Table 2). We chose the t distribution because it is symmetric but has heavier tails than the normal distribution, and it satisfies the conditions given in Section 2.3 on the existence of moments of the random effects and errors.
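
The rescaling of the t-distributed noise can be sketched as follows. Since a t distribution with k > 2 d.f. has variance k/(k − 2), multiplying by σ·sqrt((k − 2)/k) gives variance σ². The function name `scaled_t` and the sample size are hypothetical choices made for illustration.

```python
import numpy as np

def scaled_t(rng, df, sigma, size):
    """Draw t-distributed noise rescaled to have standard deviation sigma (needs df > 2)."""
    # Var(t_df) = df / (df - 2), so multiply by sigma * sqrt((df - 2) / df).
    scale = sigma * np.sqrt((df - 2.0) / df)
    return scale * rng.standard_t(df, size=size)

rng = np.random.default_rng(1)
eps_t5 = scaled_t(rng, df=5, sigma=0.5, size=200_000)   # matches sigma_eps = 0.5
eps_t3 = scaled_t(rng, df=3, sigma=0.5, size=200_000)   # heavier tails, same variance
```

With 3 d.f. the fourth moment is infinite, so the sample variance of `eps_t3` converges slowly even though its theoretical variance equals 0.25.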

3.1.3. Impact of choice of the cell partition on power

As is true for Pearson’s chi-squared test, the choice of cell partition strongly impacts the performance of our goodness of fit test. To illustrate the impact of the cell partition on the power of our test we generated y from a model with E(y) = 1 + x₁ + x₂ + 0.15x₃, with σ_α = 1 and σ_∊ = 0.5, but then omitted x₃ in the subsequent model fitting. We studied cell partitions based on only x₁, only x₂, only x₃, both x₁ and x₂, both x₁ and x₃, or both x₂ and x₃, all based on empirical quantiles of the covariates. Table 3 shows that a lack of fit is detectable by our test statistic only when the cell partition involves the omitted covariate x₃, and that power decreases as the correlations (ρ₁₃, ρ₂₃) between the covariates increase.

Table 3.

Impact of cell partition on empirical power when covariate x₃ with β₃ = .15 is omitted from model fitting (Scenario I) with m = 500, σ_α = 1, σ_ε = 0.5, K = 2000.

	ρ₁₃ = 0, ρ₂₃ = 0		ρ₁₃ = 0.2, ρ₂₃ = 0.3		ρ₁₃ = 0.4, ρ₂₃ = 0.5
Cell Variables	L=12	L=42	L=12	L=42	L=12	L=42
x₁	0.056	0.060	0.046	0.044	0.045	0.044
x₂	0.055	0.046	0.048	0.048	0.051	0.050
x₃	0.985	0.871	0.936	0.748	0.630	0.367
x₁, x₂	0.054	0.050	0.052	0.046	0.051	0.048
x₁, x₃	0.968	0.821	0.896	0.732	0.578	0.382
x₂, x₃	0.962	0.843	0.913	0.752	0.642	0.435

3.2. Normally distributed covariates with an omitted interaction term (Scenario II)

We generated y from a linear model with E(y) = 1 + x₁ + x₂ + β₃x₁x₂, where x₁ and x₂ are independent and $x_{i} ~ N (μ_{i}, σ_{i}^{2})$ , i = 1, 2. We first let β₃ = 0.2 and set σ_α = 1, σ_∊ = 0.5. We then fit model (2) without x₃ = x₁x₂. The test had adequate power only when the cell partition is based on empirical quantiles of x₁ and x₂, or on the omitted interaction term x₃ = x₁x₂, but not if the cell partition was based on either x₁ or x₂ alone, for ρ₁₂ = 0 and ρ₁₂ = .3 (Table 1, Web Supplementary Material). Figure 2 shows the power of the test as a function of the number of cells computed based on quantiles of the omitted covariate x₃ for various values of μ_i = E(X_i), i = 1, 2. The power was higher for smaller absolute values of μ₁ and μ₂ and was largest for L = 11 cells for μ₁ = 2 and μ₂ = 1 and L = 7 cells for μ₁ = 1 and μ₂ = 0.5.

Figure 2. The impact of the number of cells in the cell partition on theoretical power

Figure 3 plots the theoretical power against $β_{3} ∕ {(σ_{α}^{2} + σ_{∊}^{2})}^{1 ∕ 2}$ for three choices of $(σ_{α}^{2}, σ_{∊}^{2})$ corresponding to the same overall variance $σ_{α}^{2} + σ_{∊}^{2}$, with β₃ varying along the x-axis, when the cell partition was based on x₃ with L = 8 cells using fixed cell boundaries. Our conclusions are consistent with those in Section 3.1. For any fixed pair $(σ_{α}^{2}, σ_{∊}^{2})$, the power of the test increased as a function of β₃, the coefficient of the omitted covariate x₃. For any fixed β₃, the power increased as the random effect variance $σ_{α}^{2}$ decreased relative to the error variance $σ_{∊}^{2}$.

Figure 3. The impact of $(σ_{α}^{2}, σ_{∊}^{2})$ on theoretical power, ρ₁₂ = 0 (Scenario II)

3.3. Omitted main effect and interaction term (Scenario III)

Here we generated y from an LMM that includes three covariates x₁, x₂ and x₃ through E(y) = 1 + x₁ + x₂ + 0.05x₃ + 0.1x₁x₃, with σ_α = 1 and σ_∊ = 0.5. We then omitted x₃ and any interactions with x₃ from the model fitting and investigated the power of our test in working model M1 with covariates x₁, x₂, and in working model M2 with x₁, x₂ and $x_{1}^{2}$. The cell partition for M1 and M2 was based on empirical quantiles of x₁ with L = 8 cells. We simulated x₃ ~ N(0, σ²) with σ = 1.5, and let x₁ = e^x₃. The covariate $x_{2} ~ χ_{1}^{2}$ was generated independently of x₁ and x₃.

Under this setting, with m = 500 clusters, our test had power 1 to detect lack of fit of the working model M1. The Wald test for inclusion of the quadratic term $x_{1}^{2}$ in model M2 had power 1. As x₃ is not available, a Wald test cannot be applied for any term related to x₃. We thus would select model M2 (with covariates x₁, x₂, $x_{1}^{2}$ ) and a Wald-type test is not able to further assist in testing model inadequacy. However, our proposed test had power of 0.938 to detect lack of fit of M2, under the cell partition based on x₁, a covariate in M2.

This example, in which the omitted covariate is not available in the dataset but a correlated variable is, shows the usefulness of our test in addition to Wald type tests. However, our test has reasonable power against omitted-covariate alternatives only when the partitions of the covariate space are based on variables that are correlated with the omitted covariates.

3.4. Misspecified functional form of a covariate (Scenario IV)

We generated y from an LMM with $E (y) = 1 + x_{1} + x_{2} + 0.1 x_{3}^{2}$, with σ_α = 1 and σ_∊ = .5, and studied the power of our test when only x₃, instead of $x_{3}^{2}$, is used in the working model. The covariate vector x = (x₁, x₂, x₃) was generated from the multivariate normal distribution given in (22) with (ρ₁₃, ρ₂₃) = (.5, .6). We used empirical quantiles of x₃ to define L = 8 cells for m = 500 or 50 clusters.

When m = 500, our test had approximately 87% power in detecting model inadequacy (Table 4). The Wald test to assess the significance of x₃, however, had only approximately 6% power. Thus, based on the Wald test, x₃ would not be included in the model, and it is therefore unlikely that the higher order term $x_{3}^{2}$ would be considered.
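
A short calculation suggests why the Wald test on x₃ has almost no power in this scenario. It rests on an assumption that this section does not restate, namely that the covariate vector in (22) has mean zero, as in the model of Section 3.5. For a zero-mean jointly normal vector all third-order moments vanish, so

```latex
Cov (x_{j}, x_{3}^{2}) = E (x_{j} x_{3}^{2}) - E (x_{j}) E (x_{3}^{2}) = 0, \qquad j = 1, 2, 3 .
```

Under that assumption the best linear approximation to 0.1x₃² in terms of (1, x₁, x₂, x₃) is the constant 0.1E(x₃²), which is absorbed into the intercept. The coefficient of x₃ in the working model is then close to zero in large samples, which is consistent with Wald rejection rates near the nominal level, whereas sums of residuals over cells defined by quantiles of x₃ still pick up the curvature.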

Table 4.

Power and robustness study (Scenario IV). L = 8, K = 2000, D = 1000, (ρ₁₃, ρ₂₃) = (.5, .6), β₃ = 0.1, σ_α = 1, σ_ε = .5. Note: The standard deviation (std.dev.) relates to variation across the randomly generated 1000 covariate matrices X. Each number was obtained as the mean over 1000 simulated covariate matrices X and for each generated X, K = 2000 iterations were used to simulate the response vector Y.

	m= 500, EN = 1750		m = 50, EN = 175
	Mean Power	std.dev.	Mean Power	std.dev.
Cell Variable x₃	0.873	0.035	0.122	0.028
Wald Test on x₃	0.059	0.024	0.065	0.026

Misspecification of the error term distribution

ε_ij simulated from t₃

Cell Variable x₃	0.873	0.035	0.123	0.030
Wald Test on x₃	0.059	0.024	0.065	0.026

ε_ij simulated from t₅

Cell Variable x₃	0.873	0.035	0.122	0.029
Wald Test on x₃	0.060	0.025	0.066	0.030

Misspecification of the random effect distribution

α_i simulated from t₃

Cell Variable x₃	0.878	0.032	0.139	0.033
Wald Test on x₃	0.061	0.027	0.067	0.031

α_i simulated from t₅

Cell Variable x₃	0.872	0.035	0.126	0.030
Wald Test on x₃	0.061	0.026	0.067	0.031

We also investigated the impact of symmetric misspecification of the error and random effect distributions in this scenario. When the t distribution was used for the error term (or for the random effect term), results were similar to those for the normally distributed error term (or random effect term) (Table 4). Thus in this scenario, our test was robust to symmetric misspecification of the error or random effect distribution, similar to Scenario I.

3.5. Remarks

The primary purpose of the goodness of fit tests studied in this paper is to assess the quality of the fixed-effect part of the mean response in the presence of a mixed-effect variance structure. Yet it is well known that there is ambiguity in Gaussian linear models as to which terms contribute to the fixed-effect predictors and which terms to the variance. To be specific, we consider the model

Y_{ij} = β_{0} + {β_{1}}^{T} X^{*} + γ X_{3} + α_{i} + ∊_{ij}

(23)

where X* = (X₁, X₂), (X*, X₃) are jointly normally distributed with means 0, and $α ~ N (0, σ_{α}^{2})$ , and the random error $∊ ~ N (0, σ_{∊}^{2})$ . By grouping the γX₃ term together with the error ∊, we see that model (23) is equivalent to the model

Y = β_{0} + {β_{1}}^{* T} X^{*} + α_{i} + ∊^{*}

(24)

where $β_{1}^{*}$ and ∊* are defined in terms of E(X₃|X*) = M^T X* and $V (X_{3} ∣ X^{*}) = σ_{R}^{2}$ by ∊* = ∊ + γ(X₃−M^TX*), $β_{1}^{*} = β_{1} + M γ$ , and $V (∊^{*}) = σ_{∊}^{2} + γ^{2} σ_{R}^{2}$ . This argument shows that the portion of a normal linear model describing $E (Y ∣ D)$ is not uniquely determined, where D denotes the data-vector of covariates, that is $D = (X^{*}, X_{3})$ in (23), and $D = X^{*}$ in (24). However, since our goodness of fit tests for adequacy of the mean structure are considered conditional on D, and are specified in terms of covariate-defined cells, these two models (23) and (24) are in fact distinguishable if cells under (23) are taken to depend non-trivially on the omitted covariate X₃.
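
The step from (23) to (24) can be written out explicitly. The display below is a worked version of the substitution described in the text; it additionally assumes that ∊ is independent of the covariates, which is implicit in the argument.

```latex
\begin{aligned}
Y_{ij} &= β_{0} + β_{1}^{T} X^{*} + γ X_{3} + α_{i} + ∊_{ij} \\
       &= β_{0} + β_{1}^{T} X^{*} + γ M^{T} X^{*} + γ (X_{3} - M^{T} X^{*}) + α_{i} + ∊_{ij} \\
       &= β_{0} + (β_{1} + M γ)^{T} X^{*} + α_{i} + ∊^{*}, \\
V (∊^{*} \mid X^{*}) &= V (∊) + γ^{2} V (X_{3} \mid X^{*}) = σ_{∊}^{2} + γ^{2} σ_{R}^{2} .
\end{aligned}
```

Because (X*, X₃) are jointly normal, the residual X₃ − M^T X* is itself normal and independent of X*, so ∊* behaves like an ordinary normal error when only X* is conditioned on.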

This argument also highlights the lack of power for the test in the setting of main effects (Scenarios I and III) with an omitted covariate when the cell partition was not based on the omitted covariate or a transformation of it, or for an omitted interaction term when the cell partition is based on only one of the variables that define the interaction (Scenario II). When cell partitions are based only on X*, no lack of fit in the mean structure can be detected, as it is correctly specified with respect to X*.

4. Data example

On April 26, 1986, an accident at the Chernobyl power plant in Ukraine, close to the border with Belarus, released large amounts of radioactive materials including iodine-131 (I-131) into the atmosphere from the destroyed reactor. Deposition of these materials contaminated the territory. Radioisotopes of iodine, e.g. I-131, are concentrated in the thyroid gland. Belarusians exposed to the accident were enrolled in a cohort study to evaluate the relationship between I-131 doses and thyroid cancer risk (Stezhko et al, 2004). Investigators were also interested in studying iodine deficiency in this population, as it impacts I-131 absorption.

We therefore evaluated the relationship between levels of serum thyroglobulin (TG), a marker of iodine deficiency, and variables that might reflect or impact dietary iodine intake, including age at the time of exam, age at the time of the accident, rural or urban residence, smoking status, urinary iodine levels, serum thyroid-stimulating hormone (TSH) levels, serum anti-thyroglobulin antibody (ATG) levels, thyroid volume, presence of thyroid nodules (yes/no), presence of goiter (yes/no) and presence of any thyroid abnormality (yes/no).

We used data on m = 933 men from four of the five study regions, who had complete covariate information, whose ATG and TSH levels were measured by a luminescence assay, and who had TG ≤ 80 (to exclude those with thyroid disease). Among these men, 404 had a single TG measurement, 484 had two, 42 had three and 3 had four TG measurements during follow-up, resulting in N = 1510 observations. The distribution of log(TG) showed no significant departure from normality (Anderson-Darling test, p = 0.09).

We fit various models using Proc GLIMMIX, SAS 9.2. Model 1 included all the variables mentioned above except presence of nodules, together with an interaction term of ATG levels with presence of any thyroid abnormality that was marginally significant (Wald test p-value p = 0.054); its log-likelihood value was −1625.2. The random effect variance estimate was ${\hat{σ}}_{α}^{2} = 0.29$ and the error variance estimate was ${\hat{σ}}_{∊}^{2} = 0.25$. Model 2 had no interaction term but included presence of nodules, and resulted in a log-likelihood of −1621.3. The variance component estimates were similar to those of model 1, ${\hat{σ}}_{α}^{2} = 0.29$ and ${\hat{σ}}_{∊}^{2} = 0.26$. However, as models 1 and 2 are not nested, we could not compare them using a likelihood ratio test.

To assess the fit of both models, each person in the dataset was assigned to one of the L = 8 cells defined by the quartiles of ATG and the response “yes” or “no” to the question “presence of any thyroid abnormality”. There was no indication of lack of fit for either model, with p = 0.32 and p = 0.40 for models 1 and 2 respectively. We also calculated the test statistic for a second cell partition with L = 4 cells defined by “presence of nodules” (yes/no) and “presence of goiter” (yes/no), with p = 0.19 and p = 0.70 for models 1 and 2 respectively. These results suggested that both models fit the data adequately. Thus omitting the interaction term of the variable “presence of any thyroid abnormality” with ATG levels does not affect the fit to the data.

5. Discussion

Schoenfeld (1980) presented a class of omnibus chi-squared goodness of fit tests for the proportional hazards regression model. We adapted this idea and proposed a class of goodness of fit tests for testing the statistical adequacy of the mean structure of a linear mixed model, with cell partitions based on covariates. We described the asymptotic properties of the test when parameters are estimated and developed its theoretical power under local alternatives. In simulations, we assessed factors that affect the power, the impact of the choice of cell partition on the test, and the robustness of the test with respect to the error distribution and the distribution of the random effects. When a specific covariate associated with outcome is omitted, such as an interaction term or a covariate correlated with terms already in the model, cell partitions based on the omitted covariate result in adequate power of the test. In our simulations we studied models involving only a few covariates. In such cases, Wald testing and likelihood-based model building tools could undoubtedly be used instead. In practical settings our test would be recommended when there are many potential predictors that should in fact not appear in the model. In such circumstances, many nonlinear terms involving omitted variables would not be Wald-tested. We also found that the estimated theoretical power calculated using Le Cam’s third lemma was reliable at least when the number of clusters m was above 50. However, when m is very small, it may be advisable to rely on the empirical power computed through simulations. Our test was also robust to symmetric violations of the normality assumption of the error distribution as well as to violation of normality of the random effects distribution.

This goodness of fit test can be used to test the statistical adequacy of the fixed effects part of a finally selected LMM. It should not be used if one wants to test if a specific covariate should be included in the model, as standard tests such as the Wald test have better power for that purpose (e.g. Scenario I, Table 2). However, when a covariate is missing from the dataset, our test can detect model inadequacy when the cell partition is based on an existing covariate in the working model, which is correlated with the omitted covariate, while no Wald-type test can be applied (Scenario III). Also, the Wald test did not have power to select a variable that entered the mean model only through a quadratic term (Scenario IV), while our test clearly showed lack of fit of the finally selected model with respect to cells defined by that variable. This is particularly important in the situation when many predictors are available, and testing all possible higher order terms or interactions is not practical. In addition, investigators might not consider the inclusion of a higher order term for a variable that has no main effect. We have shown using simple examples that our proposed test has good power to detect many sorts of model inadequacies, not all of which would be tested exhaustively by other methods.

To implement the test one only needs the final model parameter estimates and their variance covariance matrix, which are standard outputs from any statistical software. As a note of caution, in applying the test one must modify the estimated variance matrix $\hat{Σ}$ , projecting its eigenspace corresponding to extremely small eigenvalues to 0, to ensure the correct degrees of freedom for the test statistic.
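
The eigenvalue truncation mentioned above, formalized in Proposition 1 of Appendix C, can be sketched as follows. The function name `truncated_ginverse`, the threshold value `zeta=1e-6` and the toy matrix are illustrative assumptions; in practice the threshold has to be chosen below the smallest positive eigenvalue of the limiting covariance matrix.

```python
import numpy as np

def truncated_ginverse(sigma_hat, zeta):
    """Zero out eigenvalues <= zeta and invert the rest.

    Returns (Sigma_zeta, Sigma_zeta^-, rank).
    """
    sym = (sigma_hat + sigma_hat.T) / 2.0      # guard against round-off asymmetry
    eigval, eigvec = np.linalg.eigh(sym)
    keep = eigval > zeta
    v = eigvec[:, keep]
    sigma_zeta = (v * eigval[keep]) @ v.T      # sum of c_k v_k v_k' over kept k
    sigma_zeta_inv = (v / eigval[keep]) @ v.T  # sum of (1 / c_k) v_k v_k' over kept k
    return sigma_zeta, sigma_zeta_inv, int(keep.sum())

# Toy example: a rank-2 covariance matrix perturbed by tiny "estimation" noise.
rng = np.random.default_rng(2)
b = rng.normal(size=(4, 2))
sigma0 = b @ b.T
noise = 1e-8 * rng.normal(size=(4, 4))
sigma_hat = sigma0 + (noise + noise.T) / 2.0
s_zeta, s_inv, rank = truncated_ginverse(sigma_hat, zeta=1e-6)
```

The returned rank is the number of degrees of freedom to use for the χ² reference distribution; without truncation, near-zero eigenvalues would be inverted and inflate the statistic.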

Pan and Lin (2005) developed methods for checking the adequacy of generalized linear mixed models by comparing the cumulative sums of residuals over covariates or predicted values. Our proposed test has additional flexibility in defining cells based on multiple covariates, the test statistic follows a known distribution and is thus easily computed, and we present a broader class of LMMs.

Our goodness of fit test examines multiple features of the data, corresponding to residuals within each covariate cell and bears some relation to the multiaspect framework by Pesarin and Salmaso (2010), Salmaso and Solari (2005) and Marozzi (2007). Future work could attempt to adapt their permutational approaches for several populations to the goodness of fit test in a single population.

Notably, the cell partition used for our test is based on covariates, not on the response variable as for the standard Pearson χ² statistic. In future research we plan to investigate further the impact of the choice of covariate-based cell partition on the performance of our proposed test. A related issue is sparse cells. Our asymptotic results were derived letting the sample size go to infinity for a fixed cell partition, and thus asymptotically cells are not sparse. However, in a real dataset the issue of sparse cells could arise. Maydeu-Olivares and Joe (2005, 2006) and Cagnone (2012) studied the impact of sparse cells when assessing the goodness of fit of latent variable models. For use with heavily cross-classified and sparse covariate-space cell decompositions, the limited-information approach of Maydeu-Olivares and Joe (2005, 2006) could be used in our setting and will be part of future investigations. Other possible extensions include derivation of the distribution of the test statistic for random components with heavy tails, for example, under a symmetric α-stable distributional assumption for the errors and random effects. However, these extensions of the mixed model theory presented in our paper are technically difficult and we are not aware of any related results in the literature.

Supplementary Material

NIHMS585672-supplement-01.pdf (73.7 KB, PDF)

Acknowledgment

We thank the investigators of the ‘US-Belarusian Study of Thyroid Cancer and Other Thyroid Diseases Following the Chernobyl Accident’ (National Cancer Institute, Columbia University, U.S.A, and Republican Research Center of Radiation Medicine and Human Ecology, Belarus) for providing the data and Jincao Wu for help with computations. We also thank the reviewers for helpful comments and suggestions. This work is part of M. Tang’s Ph.D. thesis done at the University of Maryland.

Appendix

Appendix A: Proof of Theorem 1

Let J be the limit of the sample information matrix per observation given in (3). The consistency of the MLE $\hat{θ}$ in model (2) follows from Miller (1977). By Taylor series expansion of the score function S(θ) = ▽ log L(θ), where L(θ) denotes the likelihood function,

\sqrt{N} (\hat{θ} - θ_{0}) \approx \left\{- \frac{1}{N} \frac{\partial S (θ_{0})}{\partial θ}\right\}^{- 1} \frac{1}{\sqrt{N}} S (θ_{0}) \approx J^{- 1} \frac{1}{\sqrt{N}} S (θ_{0}) .

(25)

As the Fisher information (3) is block diagonal, $J^{- 1} = [\begin{matrix} J_{β β}^{- 1} & 0 \\ 0 & M^{- 1} \end{matrix}]$ . Under Assumption 1.2 Y − Xβ ~ N(0, V), and the score functions for β, i.e. the first p components of S(θ), are S_β(θ) = X^TV⁻¹(Y − Xβ). By extracting the first p components of (25), with A ≈ B denoting $A - B \overset{P}{\to} 0$ , we have

\sqrt{N} (\hat{β} - β_{0}) \approx J_{β β}^{- 1} S_{β} (θ_{0}) ∕ \sqrt{N} = J_{β β}^{- 1} X^{T} V^{- 1} (Y - X β_{0}) ∕ \sqrt{N} .

Thus,

\sqrt{N} \begin{pmatrix} \{f - e (β_{0})\} / N \\ \hat{β} - β_{0} \end{pmatrix} \approx \begin{pmatrix} N^{- 1 / 2} [I_{\{x_{11} \in E_{1}\}} \dots I_{\{x_{m n_{m}} \in E_{1}\}}] \\ \vdots \\ N^{- 1 / 2} [I_{\{x_{11} \in E_{L}\}} \dots I_{\{x_{m n_{m}} \in E_{L}\}}] \\ N^{- 1 / 2} J_{β β}^{- 1} X^{T} V^{- 1} \end{pmatrix} (Y - X β_{0}) = D_{(L + p) \times N} (Y - X β_{0}),

which is a linear combination of Gaussian random variables.

Under Assumptions 1.4, 1.5, 1.7, which ensure the existence of components of the covariance matrix of the test statistic, we get as N → ∞,

\sqrt{N} \begin{pmatrix} \{f - e (β_{0})\} / N \\ \hat{β} - β_{0} \end{pmatrix} \overset{D}{\to} N (0, D V D^{T}) .

(26)

Appendix B: Proof of Corollary 2

Under asymptotic normality of $\sqrt{N} (\hat{β} - β_{0})$ ,

\begin{aligned} \frac{1}{\sqrt{N}} \{f - e (\hat{β})\} & = \frac{1}{\sqrt{N}} \{f - e (β_{0})\} + \frac{1}{\sqrt{N}} \{e (β_{0}) - e (\hat{β})\} \\ & \approx \frac{1}{\sqrt{N}} \{f - e (β_{0})\} - \frac{1}{\sqrt{N}} \nabla e (β_{0}) (\hat{β} - β_{0}) \approx \frac{1}{\sqrt{N}} \{f - e (β_{0})\} - Λ \sqrt{N} (\hat{β} - β_{0}) \\ & = (I \mid - Λ) \sqrt{N} \begin{pmatrix} \{f - e (β_{0})\} / N \\ \hat{β} - β_{0} \end{pmatrix} . \end{aligned}

Since $N^{- 1 / 2} \{f - e (\hat{β})\}$ is a linear combination of components of the left hand side of (26), $N^{- 1 / 2} \{f - e (\hat{β})\} \overset{D}{\to} N (0, Σ)$, with $Σ = H - Λ J_{β β}^{- 1} Λ^{T}$.

Appendix C: Proposition 1

Proposition 1. Suppose that a sequence Z_N of random q-vectors is asymptotically distributed as $N (0, Σ_{0})$, where rank(Σ₀) = r ≥ 1 and there exists a known ζ > 0 smaller than the minimum positive eigenvalue of Σ₀, and that $\hat{Σ}$ is a consistent covariance-matrix-valued estimator of Σ₀.

Let the spectral decomposition of $\hat{Σ}$ be given by

\hat{Σ} = \sum_{k = 1}^{q} c_{kN} v_{kN} v_{kN}^{T}

where c_kN are the eigenvalues and $\{v_{kN}\}_{k = 1}^{q}$ form an orthonormal eigenbasis determined from $\hat{Σ}$. Define

{\hat{Σ}}_{ζ} = \sum_{k = 1}^{q} c_{kN} I_{[c_{kN} > ζ]} v_{kN} v_{kN}^{T} \quad \text{and} \quad {\hat{Σ}}_{ζ}^{-} \equiv \sum_{k = 1}^{q} I_{[c_{kN} > ζ]} (1 / c_{kN}) v_{kN} v_{kN}^{T}

and let ${\tilde{Σ}}^{-}$ be any other generalized inverse of ${\hat{Σ}}_{ζ}$ , i.e. any matrix such that ${\hat{Σ}}_{ζ} {\tilde{Σ}}^{-} {\hat{Σ}}_{ζ} = {\hat{Σ}}_{ζ}$ . Then

(i) $P (\mathrm{rank} ({\hat{Σ}}_{ζ}) = \mathrm{rank} (Σ_{0})) \to 1$ and ${\hat{Σ}}_{ζ} \overset{P}{\to} Σ_{0}$.
(ii) $Z_{N}^{T} {\hat{Σ}}_{ζ}^{-} Z_{N} \overset{D}{\to} χ_{r}^{2}$ and $Z_{N}^{T} {\tilde{Σ}}^{-} Z_{N} \overset{D}{\to} χ_{r}^{2}$ as N → ∞.

Proof of Proposition. Note first that while the eigenvectors v_kN are not necessarily uniquely determined if any eigenvalues have multiplicity greater than 1, the eigenspaces spanned by {v_kN: 1 ≤ k ≤ q, c_kN ≤ s} are uniquely and measurably determined from $\hat{Σ}$ for each real s > 0. Therefore $\{c_{kN}\}_{k = 1}^{q}$ and all of the random variance matrices ${\hat{Σ}}_{ζ}$, ${\hat{Σ}}_{ζ}^{-}$ are well-defined, coordinate-free and measurably defined from $\hat{Σ}$.

Without loss of generality, let the eigenvalues c_k,N of $\hat{Σ}$ be indexed in nondecreasing order. Since the k-th smallest eigenvalue is a continuous function on the set of q × q symmetric nonnegative definite matrices (Golub and van Loan 1983, pp. 18-19), it follows from the convergence $\hat{Σ} - Σ_{0} \overset{P}{\to} 0$ that for arbitrarily small ∊ ∈ (0, ζ), the event

A_{N} (∊) \equiv [c_{q - r, N} \leq ∊, \; c_{q - r + 1, N} > ζ, \; \sup_{x : ‖ x ‖ = 1} ‖ (\hat{Σ} - Σ_{0}) x ‖ \leq ∊]

has probability converging to 1 as N → ∞. This implies that on A_N(∊), the range space of ${\hat{Σ}}_{ζ}$ is exactly the span of the eigenvectors v_kN with k ≥ q − r + 1, and therefore that rank $({\hat{Σ}}_{ζ}) = r$ on the event A_N(∊). Moreover, on the event A_N(∊), for all x ∈ R^q with ||x|| = 1,

\begin{aligned} ‖ ({\hat{Σ}}_{ζ} - Σ_{0}) x ‖ & \leq ‖ (\hat{Σ} - Σ_{0}) x ‖ + ‖ ({\hat{Σ}}_{ζ} - \hat{Σ}) x ‖ \\ & \leq ∊ + ‖ \sum_{k = 1}^{q - r} c_{kN} v_{kN} (x^{T} v_{kN}) ‖ \leq ∊ + ∊ \end{aligned}

since max{|c_kN|: k ≤ q − r} ≤ ∊ and $\sum_{k = 1}^{q} {(x^{T} v_{kN})}^{2} = ‖ x ‖^{2} = 1$. This shows that the matrix sup-norm of ${\hat{Σ}}_{ζ} - Σ_{0}$ converges in probability to 0 as N → ∞, completing the proof of (i).

By (i), the asymptotic distribution of Z_N is the same as that of ${\hat{Σ}}_{ζ}^{1 / 2} W$, where $W ~ N (0, I_{q \times q})$ is independent of Z_N and the matrix square-root is the symmetric square-root equal to $\sum_{k = 1}^{q} c_{kN}^{1 / 2} I_{[c_{kN} > ζ]} v_{kN} v_{kN}^{T}$. Therefore, by the continuous mapping theorem, the asymptotic distribution of $Z_{N}^{T} {\hat{Σ}}_{ζ}^{-} Z_{N}$ is the same as the distribution of $W^{T} {({\hat{Σ}}_{ζ})}^{1 / 2} {\hat{Σ}}_{ζ}^{-} {({\hat{Σ}}_{ζ})}^{1 / 2} W$, which is $χ_{r}^{2}$ since ${({\hat{Σ}}_{ζ})}^{1 / 2} {\hat{Σ}}_{ζ}^{-} {({\hat{Σ}}_{ζ})}^{1 / 2}$ is symmetric and idempotent with trace r. The only feature of ${\hat{Σ}}_{ζ}^{-}$ that has been used in this proof is the generalized-inverse property ${\hat{Σ}}_{ζ} {\tilde{Σ}}^{-} {\hat{Σ}}_{ζ} = {\hat{Σ}}_{ζ}$ shared by ${\tilde{Σ}}^{-}$. This fact about generalized inverses, which completes the proof of assertion (ii), was previously proved in detail by Rao (1973, 1b.5.(viii), 3b.4.(vii) or 3b.5.(iv)).

Appendix D: Assumptions for Theorem 3

For the rest of the Appendices we employ the notation v^⊗2 = vv^T for any vector v.

Assumption 3.1. The true parameter point θ₀ = (β₀, ψ₀) is an interior point of $Θ = (R^{p}, {(R^{+})}^{R + 1})$.

Assumption 3.2. E(α_i) = E(∊_ij) = 0, $Var (α_{i}) = σ_{α}^{2}$, $Var (∊_{ij}) = σ_{∊}^{2}$ and there is a δ > 0 such that $E (α_{i}^{4 + δ}) < \infty$ and $E (∊_{ij}^{4 + δ}) < \infty$.

Assumption 3.3. X is a matrix of random variables, (x_i, n_i) are i.i.d. with $‖ E (x_{i}^{T} x_{i}) ‖ < \infty$, $E (x_{1}^{T} x_{1})$ being positive definite, and $E (n_{i}^{2}) < \infty$.

Appendix E: Proof of Theorem 3

The following Lemma is used in proving Theorem 3.

Lemma 1. Let {u_in: n ≥ 1, 1 ≤ i ≤ n} be a triangular array of random variables that are i.i.d. within each row (i.e., across i) with mean 0 and finite variance $σ_{u}^{2}$, and suppose that these variables are independent of the random array {c_in: n ≥ 1, 1 ≤ i ≤ n}, which satisfies, as n → ∞, (a) $\max_{1 \leq i \leq n} |c_{in}| \to 0$ and (b) $\sum_{i = 1}^{n} c_{in}^{2} \to κ$ in probability, where κ ∈ (0, ∞). Then $\sum_{i = 1}^{n} c_{in} u_{in} \overset{D}{\to} N (0, κ)$ as n → ∞.

Proof of Lemma 1: $\{\sum_{i = 1}^{k} c_{in} u_{in}\}_{k = 1}^{n}$ is a martingale with respect to the filtration $F_{kn} = σ (\{c_{in}, u_{in} : 1 \leq i \leq k\})$ and the Lemma follows directly from the Martingale Central Limit Theorem (Hall and Heyde, 1980).

Proof of Theorem 3: Let the n_i×p covariate matrix for the i-th cluster be $x_{i} = (x_{i 1}^{T}, \dots, x_{i n_{i}}^{T})$ . Then

\sqrt{N} (\tilde{β} - β_{0}) = \sqrt{N} {(X^{T} {\tilde{V}}^{- 1} X)}^{- 1} X^{T} {\tilde{V}}^{- 1} (Y - X β_{0}) \approx {(\frac{X^{T} V^{- 1} X}{N})}^{- 1} \frac{1}{\sqrt{N}} \sum_{i = 1}^{m} x_{i}^{T} V_{i}^{- 1} (y_{i} - x_{i} β_{0}) .

Then

\begin{aligned} \{f - e (\tilde{β})\} / \sqrt{N} & = \{f - e (β_{0})\} / \sqrt{N} + \{e (β_{0}) - e (\tilde{β})\} / \sqrt{N} \\ & \approx \{f - e (β_{0})\} / \sqrt{N} - \nabla e (β_{0}) (\tilde{β} - β_{0}) / \sqrt{N} \\ & \approx \sum_{i = 1}^{m} \frac{1}{\sqrt{N}} \left\{ \begin{pmatrix} z_{i1} \\ \vdots \\ z_{iL} \end{pmatrix} - \frac{\nabla e (β_{0})}{N} \left( \frac{X^{T} V^{- 1} X}{N} \right)^{- 1} x_{i}^{T} V_{i}^{- 1} (y_{i} - x_{i} β_{0}) \right\}, \end{aligned}

with $z_{il} = \sum_{j = 1}^{n_{i}} I_{\{x_{ij} \in E_{l}\}} (y_{ij} - E (y_{ij})) = \sum_{j = 1}^{n_{i}} I_{\{x_{ij} \in E_{l}\}} (y_{ij} - x_{ij} β_{0})$, i = 1, …, m, l = 1, …, L. Let $\tilde{Λ} = N^{- 1} \nabla e (β_{0}) \overset{P}{\to} Λ$ and ${\tilde{J}}_{β β} = N^{- 1} X^{T} V^{- 1} X \overset{P}{\to} J_{β β}$. We next show that $\{f - e (\tilde{β})\} / \sqrt{N}$ has a limiting Gaussian distribution by using the multivariate Central Limit Theorem. For any constant vector C = (C₁, …, C_L)^T, since the inverse of V_i is $V_{i}^{- 1} = I_{n_{i}} / σ_{∊}^{2} - σ_{α}^{2} / (σ_{∊}^{2} (σ_{∊}^{2} + n_{i} σ_{α}^{2})) 1^{\otimes 2}$, we have

\begin{aligned} C^{T} N^{- 1 / 2} \{f - e (\tilde{β})\} & \approx \sum_{i = 1}^{m} \frac{1}{\sqrt{N}} \left\{ \sum_{l = 1}^{L} C_{l} z_{il} - C^{T} \tilde{Λ} {\tilde{J}}_{β β}^{- 1} x_{i}^{T} V_{i}^{- 1} (y_{i} - x_{i} β_{0}) \right\} \\ & = \sum_{i = 1}^{m} \left[ \frac{1}{\sqrt{N}} \sum_{j = 1}^{n_{i}} \left\{ \sum_{l = 1}^{L} C_{l} I_{\{x_{ij} \in E_{l}\}} - C^{T} \tilde{Λ} {\tilde{J}}_{β β}^{- 1} \left( \frac{1}{σ_{∊}^{2}} x_{ij} - \frac{n_{i} σ_{α}^{2}}{σ_{∊}^{2} (σ_{∊}^{2} + n_{i} σ_{α}^{2})} {\bar{x}}_{i \cdot} \right) \right\} \right] α_{i} \\ & \quad + \sum_{i = 1}^{m} \sum_{j = 1}^{n_{i}} \frac{1}{\sqrt{N}} \left\{ \sum_{l = 1}^{L} C_{l} I_{\{x_{ij} \in E_{l}\}} - C^{T} \tilde{Λ} {\tilde{J}}_{β β}^{- 1} \left( \frac{1}{σ_{∊}^{2}} x_{ij} - \frac{n_{i} σ_{α}^{2}}{σ_{∊}^{2} (σ_{∊}^{2} + n_{i} σ_{α}^{2})} {\bar{x}}_{i \cdot} \right) \right\} ∊_{ij} \\ & = \sum_{i = 1}^{m} c_{i, n_{i}} α_{i} + \sum_{s = 1}^{N} w_{s} ∊_{s}, \end{aligned}

where the double index (i, j) is placed in one-to-one correspondence with the single index s. Because ${α_{i}}_{i = 1}^{m}$ and ${∊_{s}}_{s = 1}^{N}$ are i.i.d and satisfy conditions (a) and (b) of Lemma 1, the above sums have limiting normal distributions as m → ∞. Because α_i and ∊_ij are independent $\sum_{i = 1}^{m} c_{i, n_{i}} α_{i}$ and $\sum_{s = 1}^{N} w_{s} ∊_{s}$ are conditionally independent given (x_i, n_i). As the two sums are jointly normal and asymptotically uncorrelated they are asymptotically independent and the limiting distribution of $C^{T} N^{- 1 ∕ 2} (f - e (\tilde{β}))$ is normal. Moreover, for any constant vector C, its limiting variance is of the form C^TΣC with the same fixed Σ,

\begin{matrix} Σ = & \lim_{N \to \infty} \frac{1}{N} \sum_{i = 1}^{m} Var (\begin{matrix} \sum_{j = 1}^{n_{i}} I_{\{x_{ij} \in E_{1}\}} (y_{ij} - x_{ij} β_{0}) \\ ⋮ \\ \sum_{j = 1}^{n_{i}} I_{\{x_{ij} \in E_{L}\}} (y_{ij} - x_{ij} β_{0}) \end{matrix}) \\ & - \lim_{N \to \infty} [\frac{\nabla e (β_{0})}{N}] {[\frac{X^{T} V^{- 1} X}{N}]}^{- 1} {[\frac{\nabla e (β_{0})}{N}]}^{T} = H - Λ J_{β β}^{- 1} Λ^{T} . \end{matrix}

Therefore, $N^{- 1 ∕ 2} \{f - e (\tilde{β})\} \overset{D}{\to} N (0, Σ)$, and $T = \{f - e (\tilde{β})\}^{T} Σ^{-} \{f - e (\tilde{β})\} ∕ N \overset{D}{\to} χ_{r}^{2}$, where r = rank(Σ) and $Σ^{-}$ denotes a generalized inverse of Σ. We replace Σ with ${\hat{Σ}}_{ζ}$, the reconstructed estimated variance matrix defined immediately after equation (11) by means of the singular value decomposition applied to any consistent estimator $\hat{Σ}$ of Σ. One such consistent estimator is obtained by replacing all parameters in Σ with their least squares and method of moments estimators. By Proposition 1 in Appendix C, $rank ({\hat{Σ}}_{ζ}) = rank (Σ)$ for large N. Thus

T = \{f - e (\tilde{β})\}^{T} {\hat{Σ}}_{ζ}^{-} \{f - e (\tilde{β})\} ∕ N \overset{D}{\to} χ_{r}^{2} .

Appendix F: Derivation of the power of the test

We derive the power of the test for LMM (1) under contiguous alternatives, based on Le Cam’s third lemma (Van der Vaart, 2000).

Lemma 2. (Le Cam’s third lemma) Let P_N and Q_N be two measures on a measurable space, corresponding to the null distribution under investigation and to an alternative hypothesis, respectively. Suppose W_N is an R^L-valued statistic for every N. If

(W_{N}, \log \frac{{dQ}_{N}}{{dP}_{N}}) \overset{P_{N}}{\to} N_{L + 1} ([\begin{matrix} μ \\ - σ^{2} ∕ 2 \end{matrix}], [\begin{matrix} Σ & τ \\ τ^{T} & σ^{2} \end{matrix}]),

(27)

then $W_{N} \overset{Q_{N}}{\to} N_{L} (μ + τ, Σ)$ .
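The way the lemma yields a power formula can be sketched as follows. This is a standard property of quadratic forms in Gaussian vectors, written out here as an aid and not quoted from the original derivation. It assumes μ = 0, takes Σ⁻ to be the Moore-Penrose inverse, and writes χ²_r(λ) for the noncentral chi-squared distribution with r degrees of freedom and noncentrality parameter λ.

```latex
W_N \overset{Q_N}{\to} N_L(\tau, \Sigma)
\quad\Longrightarrow\quad
W_N^{T}\,\Sigma^{-}\,W_N \overset{Q_N}{\to} \chi^{2}_{r}(\lambda),
\qquad \lambda = \tau^{T}\Sigma^{-}\tau, \quad r = \operatorname{rank}(\Sigma).
```

Under these assumptions the limiting power at level α is the probability that a χ²_r(λ) variable exceeds the upper-α quantile of the central χ²_r distribution, so the power depends on the alternative only through λ.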

Let H₀ : θ_N = θ₀, and $H_{1} : θ_{N} = θ_{0} + a ∕ \sqrt{N}$, where a is a constant vector and θ_N → θ₀ as N → ∞. By Taylor expansion, under Theorem 5.21 in van der Vaart (2000),

\begin{matrix} \log \frac{{dQ}_{N}}{{dP}_{N}} = & \log \frac{Likelihood (θ_{N}; Y, X)}{Likelihood (θ_{0}; Y, X)} = \log \frac{L (θ_{N})}{L (θ_{0})} \\ \approx & {(\nabla \log (L (θ_{0})))}^{T} \frac{a}{\sqrt{N}} + \frac{1}{2} \frac{a^{T}}{\sqrt{N}} (\nabla^{\otimes 2} \log (L (θ_{0}))) \frac{a}{\sqrt{N}} \\ \approx & {(S_{N} (θ_{0}))}^{T} \frac{a}{\sqrt{N}} - \frac{1}{2} a^{T} J (θ_{0}) a, \end{matrix}

where

\begin{matrix} S_{N} (θ_{0}) = & \nabla \log (L (θ_{0})) \\ = & [\begin{matrix} X^{T} V^{- 1} (Y - X β_{0}) \\ - \frac{1}{2} tr (V^{- 1} \frac{\partial V}{\partial σ_{α}^{2}}) + \frac{1}{2} {(Y - X β_{0})}^{T} V^{- 1} \frac{\partial V}{\partial σ_{α}^{2}} V^{- 1} (Y - X β_{0}) \\ - \frac{1}{2} tr (V^{- 1}) + \frac{1}{2} {(Y - X β_{0})}^{T} V^{- 1} V^{- 1} (Y - X β_{0}) \end{matrix}] \end{matrix}

(28)

and the limit of the sample Fisher information per observation is

J (θ_{0}) = \lim_{N \to \infty} - \nabla^{\otimes 2} \log (L (θ_{0})) ∕ N = \lim_{N \to \infty} Var (S_{N} (θ_{0}) ∕ \sqrt{N}) .

Thus

\log \frac{{dQ}_{N}}{{dP}_{N}} \overset{P_{N}}{\to} N (- \frac{1}{2} a^{T} J (θ_{0}) a, a^{T} J (θ_{0}) a) .
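The expansion of the log-likelihood ratio implies a Gaussian limit under P_N with mean −a^T J(θ₀) a ∕ 2 and variance a^T J(θ₀) a, which is the form required by Lemma 2. A minimal simulation sketch illustrates this in a deliberately simple setting that is not the LMM of the paper: i.i.d. observations Y_i ~ N(θ, 1) with θ₀ = 0, for which J(θ₀) = 1 and the log-likelihood ratio equals a ΣY_i ∕ √N − a² ∕ 2 exactly. The function name and all numerical settings are assumptions made for illustration.

```python
import math
import random
import statistics


def simulate_log_lr(a, n_obs, n_rep, seed=0):
    """Draw log(dQ_N / dP_N) under P_N for the toy model Y_i ~ N(theta, 1).

    Null: theta = 0. Local alternative: theta_N = a / sqrt(n_obs).
    In this toy model the log-likelihood ratio is exactly
    a * sum(Y_i) / sqrt(n_obs) - a**2 / 2.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_rep):
        total = sum(rng.gauss(0.0, 1.0) for _ in range(n_obs))  # data under P_N
        draws.append(a * total / math.sqrt(n_obs) - 0.5 * a * a)
    return draws


if __name__ == "__main__":
    a = 1.2
    draws = simulate_log_lr(a, n_obs=50, n_rep=4000)
    print("sample mean    :", statistics.fmean(draws), " theory:", -0.5 * a * a)
    print("sample variance:", statistics.variance(draws), " theory:", a * a)
```

The sample mean should be close to −a² ∕ 2 and the sample variance close to a², matching the limit stated above with J(θ₀) = 1.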

For the special case when we fit a reduced model to the data, using $X_{N \times p^{*}}^{*}$ instead of X_N×p with p* < p, we estimate the coefficient β* corresponding to X*. The sum over the expected values under the model, e(·), in (6) has R^p as its domain. Let e* (·) denote the sum over the expected values under the reduced model, computed using $X_{N \times p^{*}}^{*}$ instead of X_N×p with domain R^p*. Let $W_{N} = (f - e^{*} ({\hat{β}}^{*})) ∕ \sqrt{N}$ be the first vector component of (27). Under the null hypothesis P_N, W_N → N(0, Σ*), based on Corollary 2.

Next, we compute the covariance term τ in (27), that is, the limiting covariance between $(f - e^{*} ({\hat{β}}^{*})) ∕ \sqrt{N}$ and $a^{T} S_{N} (θ_{0}) ∕ \sqrt{N}$. We first expand

\begin{matrix} \frac{1}{\sqrt{N}} (f - e^{*} ({\hat{β}}^{*})) \approx & \frac{1}{\sqrt{N}} (f - e^{*} (β_{0}^{*})) - \frac{1}{\sqrt{N}} \nabla e^{*} (β_{0}^{*}) ({\hat{β}}^{*} - β_{0}^{*}) \\ \approx & \frac{1}{\sqrt{N}} (f - e^{*} (β_{0}^{*})) - Λ^{*} \sqrt{N} ({\hat{β}}^{*} - β_{0}^{*}) \\ \approx & \frac{1}{\sqrt{N}} (f - e^{*} (β_{0}^{*})) - Λ^{*} {(J_{β β}^{*})}^{- 1} {(X^{*})}^{T} V^{- 1} (Y - X^{*} β_{0}^{*}) ∕ \sqrt{N} \\ = & \frac{1}{\sqrt{N}} (A - B) (Y - X^{*} β_{0}^{*}), \end{matrix}

where $J_{β β}^{*}$ denotes the information matrix corresponding to β*, and

A = [\begin{matrix} I_{\{x_{11} \in E_{1}\}} & \dots & I_{\{x_{m n_{m}} \in E_{1}\}} \\ ⋮ & & ⋮ \\ I_{\{x_{11} \in E_{L}\}} & \dots & I_{\{x_{m n_{m}} \in E_{L}\}} \end{matrix}], \quad B = Λ^{*} {(J_{β β}^{*})}^{- 1} {(X^{*})}^{T} V^{- 1} .

Thus,

\begin{matrix} Cov (\frac{f - e^{*} ({\hat{β}}^{*})}{\sqrt{N}}, \log \frac{{dQ}_{N}}{{dP}_{N}}) = & Cov (\frac{f - e^{*} ({\hat{β}}^{*})}{\sqrt{N}}, \frac{a^{T} S_{N} (θ_{0})}{\sqrt{N}}) \\ = & \frac{1}{N} Cov (f - e^{*} ({\hat{β}}^{*}), a_{1}^{T} S_{β} + a_{2} S_{σ_{α}^{2}} + a_{3} S_{σ_{∊}^{2}}) . \end{matrix}

(29)

By equation (28), since both $tr (V^{- 1} (\partial V ∕ \partial σ_{α}^{2}))$ and tr(V⁻¹) are non-random scalars, they do not contribute to the covariance, and we have

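\begin{matrix} & Cov (f - e^{*} ({\hat{β}}^{*}), S_{σ_{α}^{2}}) \\ = & Cov ((A - B) (Y - X^{*} β_{0}^{*}), \frac{1}{2} {(Y - X β_{0})}^{T} V^{- 1} \frac{\partial V}{\partial σ_{α}^{2}} V^{- 1} (Y - X β_{0})) \\ = & (A - B) Cov (Y - X^{*} β_{0}^{*}, \frac{1}{2} {(Y - X β_{0})}^{T} V^{- 1} \frac{\partial V}{\partial σ_{α}^{2}} V^{- 1} (Y - X β_{0})) = 0 . \end{matrix}

The last equality holds because, under the null hypothesis, $Y - X β_{0} = Y - X^{*} β_{0}^{*}$ is a centered normal vector, so the covariance between a linear form and a quadratic form in it involves only third moments, which vanish.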

Similarly, we get $Cov (f - e^{*} ({\hat{β}}^{*}), S_{σ_{∊}^{2}}) = 0$. Therefore (29) becomes

\begin{matrix} Cov (\frac{f - e^{*} ({\hat{β}}^{*})}{\sqrt{N}}, \log \frac{{dQ}_{N}}{{dP}_{N}}) = & \frac{1}{N} Cov (f - e^{*} ({\hat{β}}^{*}), a_{1}^{T} S_{β}) = \frac{1}{N} (A - B) Var (Y) V^{- 1} X a_{1} \\ = & \{Λ - \frac{1}{N} Λ^{*} {(J_{β β}^{*})}^{- 1} [{(X^{*})}^{T} V^{- 1} X]\} a_{1} \\ = & \{Λ - Λ^{*} {[{(X^{*})}^{T} V^{- 1} (X^{*})]}^{- 1} [{(X^{*})}^{T} V^{- 1} X]\} a_{1} . \end{matrix}
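A short worked check of this expression may help; it is a sketch under assumptions that the derivation above does not spell out. Suppose the full design matrix splits as X = [X*, X₂], partition a₁ = (a*^T, a₂^T)^T conformably, and use ∇e = AX and ∇e* = AX*, so that Λ = [Λ*, Λ₂] with Λ₂ = lim N⁻¹ A X₂ (the symbols X₂, a*, a₂ and Λ₂ are introduced here for illustration only). Writing τ for the covariance in (27), the component of the shift that lies in the reduced model cancels:

```latex
\begin{aligned}
\tau &= \{\Lambda - \Lambda^{*}[(X^{*})^{T}V^{-1}X^{*}]^{-1}[(X^{*})^{T}V^{-1}X]\}\,a_{1} \\
     &= \Lambda^{*}a^{*} + \Lambda_{2}a_{2}
        - \Lambda^{*}[(X^{*})^{T}V^{-1}X^{*}]^{-1}(X^{*})^{T}V^{-1}(X^{*}a^{*} + X_{2}a_{2}) \\
     &= \{\Lambda_{2} - \Lambda^{*}[(X^{*})^{T}V^{-1}X^{*}]^{-1}(X^{*})^{T}V^{-1}X_{2}\}\,a_{2}.
\end{aligned}
```

In particular τ = 0 when a₂ = 0, so under these assumptions only the coefficients of the omitted covariates contribute to the asymptotic shift.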

Since both $f - e^{*} ({\hat{β}}^{*})$ and $a_{1}^{T} S_{β}$ can be written, up to asymptotically negligible terms, as a matrix multiplied by the same normal vector $Y - X β_{0} = Y - X^{*} β_{0}^{*}$, we obtain asymptotic joint normality of $f - e^{*} ({\hat{β}}^{*})$ and $a_{1}^{T} S_{β}$. Because $f - e^{*} ({\hat{β}}^{*})$ is asymptotically uncorrelated with both $S_{σ_{α}^{2}}$ and $S_{σ_{∊}^{2}}$, as shown above, $f - e^{*} ({\hat{β}}^{*})$ and a^TS_N(θ₀) are also asymptotically jointly normal.
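To illustrate how the shift τ translates into power, the following Monte Carlo sketch uses a toy configuration that is not taken from the paper: by Lemma 2 with μ = 0 the statistic W_N is asymptotically N(τ, Σ) under the alternative, and here Σ is assumed to be the 2 × 2 identity matrix so that T = W^T W. With two degrees of freedom the central chi-squared upper-α quantile is exactly −2 log α, which avoids any dependence on external libraries. The function name and all settings are hypothetical.

```python
import math
import random


def rejection_rate(tau, alpha=0.05, n_rep=20000, seed=1):
    """Monte Carlo rejection rate of T = W^T W when W ~ N(tau, I_2).

    tau must have length 2 (toy assumption: Sigma = I_2, so r = 2).
    The critical value is the exact upper-alpha quantile of the central
    chi-squared distribution with 2 degrees of freedom.
    """
    if len(tau) != 2:
        raise ValueError("this sketch assumes a 2-dimensional statistic")
    rng = random.Random(seed)
    crit = -2.0 * math.log(alpha)
    hits = 0
    for _ in range(n_rep):
        t_stat = sum((shift + rng.gauss(0.0, 1.0)) ** 2 for shift in tau)
        hits += t_stat > crit
    return hits / n_rep


if __name__ == "__main__":
    for tau in [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]:
        print(tau, rejection_rate(tau))
```

With τ = 0 the rejection rate should be close to the nominal level α, and it should increase as the length of τ grows, which is the qualitative behavior that the covariance computed above determines.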

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

Cagnone S. A note on goodness of fit test in latent variable models with categorical variables. Commun. Stat. A Theor. 2012;41:2983–2990.
Claeskens G, Hart JD. Goodness-of-fit tests in mixed models. Test. 2009;10:1100–1120.
Cox DR. Tests of separate families of hypotheses; Proc. Fourth Berkeley Symp. on Math. Statist. and Prob. 1961. pp. 105–123.
Crainiceanu CM, Ruppert D. Likelihood ratio tests in linear mixed models with one variance component. J. R. Statist. Soc. B. 2004;66:165–185.
Godfrey LG. Misspecification Tests in Econometrics: The Lagrange Multiplier Principle and Other Approaches (Econometric Society Monographs). Cambridge University Press; Cambridge, United Kingdom: 1988.
Golub GH, Van Loan CF. Matrix Computations. The Johns Hopkins Studies in Mathematical Sciences, The Johns Hopkins University Press; Baltimore and London: 1983.
Hall P, Heyde CC. Martingale Limit Theory and its Applications. Academic Press; New York: 1980.
Jiang J. REML Estimation: Asymptotic Behavior and Related Topics. Ann. Stat. 1996;24:255–286.
Jiang J. Goodness-of-fit tests for mixed model diagnostics. Ann. Stat. 2001;4:1137–1164.
Khuri AI, Mathew T, Sinha BK. Statistical Tests for Mixed Linear Models. John Wiley & Sons; New York: 1998.
Lombardía MJ, Sperlich S. Semiparametric inference in generalized mixed effects models. J. R. Statist. Soc. B. 2008;70:913–930.
Marozzi M. Multivariate tri-aspect non-parametric testing. J. Nonparametr. Stat. 2007;19:269–282.
Marozzi M. A modified Cucconi test for location and scale change alternatives. Colomb. J. Statist. 2012;35:369–382.
Maydeu-Olivares A, Joe H. Limited- and full-information estimation and goodness-of-fit testing in 2^n contingency tables: A unified framework. J. Am. Stat. Assoc. 2005;100:1009–1020.
Maydeu-Olivares A, Joe H. Limited information goodness-of-fit testing in multidimensional contingency tables: a unified framework. Psychometrika. 2006;71:713–732.
McCulloch CE, Searle SR. Generalized, Linear, and Mixed Models. John Wiley & Sons; New York: 2001.
Miller JJ. Asymptotic properties of maximum likelihood estimates in the mixed model of the analysis of variance. Ann. Stat. 1977;5:746–762.
Pan Z, Lin DY. Goodness-of-fit methods for generalized linear mixed models. Biometrics. 2005;61:1000–1009. doi: 10.1111/j.1541-0420.2005.00365.x.
Pesarin F, Fortunato L. Permutation Tests for Complex Data: Theory, Applications and Software. John Wiley & Sons; New York: 2010.
Rao CR. Linear statistical inference and its applications. Second edition. John Wiley & Sons; New York: 1973.
Rao CR, Wu Y. A strongly consistent procedure for model selection in a regression problem. Biometrika. 1989;76(2):369–374.
Richardson AM, Welsh AH. Asymptotic properties of restricted maximum likelihood (REML) estimates for hierarchical mixed linear models. Austral. J. Statist. 1994;36:31–43.
Ritz C. Goodness-of-fit tests for mixed models. Scand. J. Stat. 2004;31:443–458.
Salmaso L, Solari A. Multiple aspect testing for case-control designs. Metrika. 2005;62:331–340.
Schoenfeld D. Chi-squared goodness-of-fit tests for the proportional hazards regression model. Biometrika. 1980;67:145–153.
Self SG, Liang K. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Amer. Statist. Assoc. 1987;82:605–610.
Shao J. An asymptotic theory for linear model selection. Stat. Sinica. 1997;7:221–264.
Stezhko VA, Buglova EE, Danilova LI, et al. A cohort study of thyroid cancer and other thyroid diseases after the Chernobyl accident: Objectives, design and methods. Radiat. Res. 2004;161:481–492. doi: 10.1667/3148.
Tang M. Ph.D Thesis: Goodness of fit test for generalized linear mixed models. Department of Mathematics, University of Maryland; College Park: 2010.
van der Vaart AW. Asymptotic Statistics. Cambridge University Press; Cambridge, United Kingdom: 2000.
Wand MP. Fisher information for generalized linear mixed models. J. Multivariate. Anal. 2007;98:1412–1416.

Associated Data

Supplementary Materials

NIHMS585672-supplement-01.pdf (73.7 KB, pdf)
