Bayesian model selection in linear mixed models for longitudinal data

Oludare Ariyo; Adrian Quintero; Johanna Muñoz; Geert Verbeke; Emmanuel Lesaffre

doi:10.1080/02664763.2019.1657814

2019 Aug 22;47(5):890–913. doi: 10.1080/02664763.2019.1657814

PMCID: PMC9041623 PMID: 35707327

ABSTRACT

Linear mixed models (LMMs) are popular for analyzing repeated measurements with a Gaussian response. For longitudinal studies, the LMMs consist of a fixed part expressing the effect of covariates on the mean evolution in time and a random part expressing the variation of the individual curves around the mean curve. Selecting the appropriate fixed and random effect parts is an important modeling exercise. In a Bayesian framework, there is little agreement on the appropriate selection criteria. This paper compares the performance of the deviance information criterion (DIC), the pseudo-Bayes factor and the widely applicable information criterion (WAIC) in LMMs, with an extension to LMMs with skew-normal distributions. We focus on the comparison between the conditional criteria (given random effects) and the marginal criteria (averaged over random effects). In spite of theoretical arguments, there is not much enthusiasm among applied statisticians to make use of the marginal criteria. We show in an extensive simulation study that the three marginal criteria are superior in choosing the appropriate longitudinal model. In addition, the marginal criteria selected the most appropriate model for growth curves of Nigerian chickens. A self-written R function can be combined with standard Bayesian software packages to obtain the marginal selection criteria.

KEYWORDS: Deviance information criterion, linear mixed models, marginalized likelihood, pseudo-Bayes factor, widely applicable information criterion

1. Introduction

Longitudinal studies have become central in a great variety of research areas. The longitudinal study design is the only study design that allows one to relate determinants measured at the start of the study to changes in the subjects' condition over time. Numerous books have recently appeared on longitudinal study designs, see e.g. [2,12,13,21,35]. When the response is Gaussian, linear mixed-effects models (LMMs) are one of the most popular tools to analyze longitudinal data. Since its introduction by Laird and Ware [27], the LMM has been applied in a great variety of research areas and extended in many ways, e.g. to generalized linear mixed-effects models and non-linear mixed-effects models. Its popularity has much to do with its ability to describe both the impact of covariates on the mean longitudinal evolution and how individual profiles differ over time from the mean curve. The impact of the covariates on the mean longitudinal curve is evaluated by their regression coefficients, which are referred to as the fixed effects. The subject-specific profiles are expressed as latent variables called random effects. In this way, the LMM fits subject-specific profiles and accounts for correlation among responses from the same subject. Another important feature is that the LMM allows for unbalanced data, i.e. when the number and timing of the observations per subject differ between subjects. The LMM parameters may be estimated using a frequentist approach. The properties of the estimated model parameters are then based on (restricted) maximum likelihood theory [54]. Alternatively, one could use the Bayesian framework. In the Bayesian approach, prior information on the model parameters is combined with information coming from the data. Using Bayes' theorem, an updated idea on the model parameters is obtained from the posterior distribution.
The posterior distribution provides all information that is needed, and hence there is no need to refer to asymptotic normality properties for inference on the model parameters. This is especially useful in longitudinal studies with a small number of subjects and when the data are unbalanced [45]. Since most posterior distributions are analytically intractable, they need to be determined in a numerical way. The most popular numerical techniques are based on sampling from the posterior distribution. Markov chain Monte Carlo (MCMC) techniques provide an important class of such methods. In this paper, we focus on fitting Bayesian LMMs to longitudinal data and compare the performance of different selection criteria. While in a Bayesian model, all parameters are stochastic (and thus random), we will (as many others) still use the standard terminology of fixed and random effects.

A variety of LMMs can be fitted to the data at hand depending on several aspects such as (i) the covariates that are considered in the fixed part of the model, (ii) the random effects structure to be included, e.g. random intercepts and/or random slopes, and (iii) possible transformations of the response. When considering several LMMs, it is important to select a parsimonious model that adequately fits the current and also future data. Unfortunately, there is little agreement on what criterion to choose for Bayesian model selection.

One of the first model selection criteria suggested in the literature is the Bayes factor [24], which is defined as the ratio of the marginal likelihoods of two competing models. Although this criterion has a natural interpretation, its computation remains difficult in practice and the results can be sensitive to the choice of the prior distributions, presenting difficulties especially with improper priors. Gelfand and Dey [15] proposed the pseudo-Bayes factor (PSBF), which updates the (improper) prior to a proper posterior and calculates the Bayes factor using the generated posterior as prior. This alternative criterion, although relatively easy to compute, is not yet commonly used. The most popular Bayesian model selection criterion is the deviance information criterion (DIC) [48]. The DIC is similar to the AIC often used in the frequentist framework, i.e. it represents a trade-off between model fit and model complexity. The aim of DIC is to estimate the predictive ability of the fitted model to future samples from the same population. More recently, the widely applicable information criterion (WAIC) was proposed [55] for model selection in the Bayesian framework. This criterion estimates the predictive accuracy of the model and includes a bias correction for using the data twice, i.e. to estimate the model and to evaluate the model's accuracy. It has also been argued that WAIC is a more fully Bayesian approach (compared to DIC) and is suitable for singular models, such as LMMs for longitudinal data when the random effects are considered as parameters in the model [18].

Apart from the above three model selection criteria, a wide variety of (Bayesian) statistical approaches have been suggested to select the most appropriate LMM. While it is not the aim of this paper to give a comprehensive overview, the reader should be aware of the large number of alternative approaches proposed in the literature. For instance, a popular alternative approach is to use Bayesian variable selection techniques, often based on the SSVS approach of George and McCulloch [19]. Examples of this approach can be found in Chen and Dunson [7], Cai and Dunson [5] and Gong et al. [20].

Bayesian software for hierarchical models most often makes use of the data augmentation (DA) algorithm. For the LMM, this implies that the random effects are estimated jointly with the other parameters. In this way, the DA algorithm avoids integrating over the distribution of the random effects, which is the classical approach in the frequentist framework. Thus, in the frequentist approach the marginal version of the LMM is classically fitted to the data, while in the Bayesian approach the hierarchical or conditional version of the LMM is usually fitted.

Whether the marginal or the conditional version of the LMM is fitted to the data has an impact on the performance of the model selection criteria, even when the conditional and marginal LMM essentially lead to the same model. A model selection criterion applied to the hierarchical specification of the LMM is referred to as a conditional criterion. Hence, one has the conditional DIC (cDIC), and similarly the conditional PSBF (cPSBF) and the conditional WAIC (cWAIC). On the other hand, when the model selection criterion is applied to the marginal specification of the LMM, one speaks of the marginal DIC (mDIC), marginal PSBF (mPSBF) and marginal WAIC (mWAIC). As will be shown in Section 5, these two versions of the model selection criteria are associated with different aims: cDIC (and similarly cPSBF and cWAIC) considers the random effects as parameters of focus in the model, whereas for mDIC (also mPSBF and mWAIC) the population of random effects represents the focus. In practice, this implies for mixed effects models that the conditional selection criteria evaluate the performance of the model when the population consists of all (future) measurements of the subjects included in the current study, while the marginal version of the criteria measures the performance of the model for all (future measurements of all) future subjects from the same population.

The problem is that in practice, model selection is most often based on cDIC (cPSBF, cWAIC) because of computational convenience. Indeed, cDIC can be immediately calculated using the conditional likelihood and it is automatically reported by WinBUGS [50] and other Bayesian software. However, most researchers are interested in knowing how well the model performs in the future. That is why one argues that conditional model selection criteria have the wrong focus, see e.g. [52]. Apart from not having the correct focus, model selection based on cDIC is questionable because the properties of DIC are based on the log-concavity of the likelihood, a condition that is violated in hierarchical models when the latent variables are considered as parameters in the model [33]. The implication of using cDIC as a model selection criterion has been documented via simulations for financial volatility models [6]. The authors concluded that in contrast to mDIC, cDIC tends to select overly complex models. For overdispersed count data, Millar [37] pointed out that the conditional-level DIC is an unreliable tool for model selection, while the same is true for the conditional WAIC [38]. Merkle et al. [36] advocated the use of marginal information criteria for item response models and showed that mWAIC corresponds to leave-one-cluster-out, whereas cWAIC corresponds to leave-one-unit-out.

While we focus in this paper on Bayesian model selection, we note that also in the frequentist paradigm the performance of the conditional versus marginal model selection criteria has been compared extensively. A broad overview of model selection criteria for the LMM in a frequentist context, including conditional and marginal information criteria, is given in Müller et al. [39]. A short section in that paper is devoted to the Bayesian paradigm. Further, Fang [11] showed that the marginal AIC (mAIC) is asymptotically equivalent to the leave-one-cluster-out cross-validation while the conditional AIC (cAIC) is asymptotically equivalent to the leave-one-observation-out cross-validation. Srivastava and Kubokawa [51] derived three conditional AICs and showed theoretically and by simulations that their proposals outperform cAIC and mAIC of Vaida and Blanchard [52]. Finally, Sefken et al. [46] introduced the R-package ‘cAIC4’ for the calculation of the cAIC for LMMs estimated with lme4. To determine the marginal criteria, extra computations are needed, which renders them less popular.

In practice, researchers are often not aware of the difference between the marginal and conditional versions of the information criteria and therefore rely on the software default [36]. That is why we have set up a simulation study that compares the performance of the two versions of the selection criteria for LMMs with longitudinal data. The first set of simulations makes use of the classical LMM assumptions, i.e. the random effects and measurement errors have a normal distribution. In the second set of simulations, we have simulated from LMMs with a skew-normal and t-distribution for the random effects and measurement errors. Finally, we considered settings where we select both fixed and random effects jointly. All these sets of simulations clearly show the superiority of the marginal selection criteria. Moreover, in the analysis of a real data set, we again illustrate that the conditional criteria choose the least appropriate LMM. In order to promote the use of the marginal criteria for LMMs, we have written R software for the LMMs considered in our simulation study that can easily be combined with classical Bayesian software to compute the criteria mDIC, mPSBF and mWAIC for LMMs.

The rest of the article is organized as follows. In Section 2, we present the classical linear mixed model for longitudinal data. In Section 3, we treat the skew-normal LMM. The model selection criteria are introduced in Section 4 and the difference between conditional and marginalized versions is discussed in Section 5. In Section 6, we compare the criteria in an extensive simulation study, in order to give some practical recommendations. We also compared alternative versions of DIC and WAIC as suggested in the literature. In the same section, we discuss the simulation results when the normality assumption in the LMM is relaxed. A comparison of the conditional and marginal criteria on a real data set is done in Section 7. We give concluding remarks in Section 8.

2. The linear mixed-effects model

The classical LMM [27] for longitudinal data can be expressed as

Y_{i} = X_{i} β + Z_{i} b_{i} + ϵ_{i},

(1)

where $Y_{i}$ is an $m_{i}$ -dimensional response vector of measurements for the $i$th subject ( $i = 1, \dots, n$ ). $X_{i}$ and $Z_{i}$ are $m_{i} \times p$ and $m_{i} \times q$ -dimensional covariate matrices, respectively, and $β$ is a p-dimensional vector of fixed effects. The residual component vector $ϵ_{i}$ is distributed as $N_{m_{i}} (0, Σ_{i})$ , where $Σ_{i}$ is an $m_{i} \times m_{i}$ positive-definite covariance matrix. It is usually assumed that $Σ_{i} = σ_{ϵ}^{2} I_{m_{i}}$ , where $I_{m_{i}}$ denotes the identity matrix of dimension $m_{i}$ .

The q-dimensional random-effects vectors $b_{i}$ are assumed independent from the residuals and distributed as $N_{q} (0, D)$ , where $D$ is a $q \times q$ positive-definite covariance matrix. Model (1) is called a mixed-effects model because it combines the fixed-effects structure $β$ with the subject-specific random effects $b_{1}, \dots, b_{n}$ . The LMM is advantageous because the data are not required to be balanced, and additionally, the within- and between-individual variations can be explicitly modeled through $Σ_{i}$ and $D$ , respectively.

In the frequentist setting, the model parameters are estimated from the marginalized model for the response, after integrating out the random effects [54]. The marginalized distribution has a closed form for model (1), namely

p (y_{i} | β, D, Σ_{i}) = N_{m_{i}} (X_{i} β, Z_{i} D Z_{i}^{'} + Σ_{i}) .

(2)

In the Bayesian framework, inference is usually based on the hierarchical formulation of the model. In the first hierarchical stage, the response follows the conditional distribution $p (y_{i} | β, Σ_{i}, b_{i}) = N_{m_{i}} (μ_{i}, Σ_{i}) = N_{m_{i}} (X_{i} β + Z_{i} b_{i}, Σ_{i})$ , whilst in the second stage, the subject-specific effects are specified with distribution $p (b_{i} | D) = N_{q} (0, D)$ .
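To make the link between the hierarchical formulation and the marginalized form (2) concrete, the following minimal sketch treats the random-intercept special case, assuming $Z_{i}$ is a column of ones, $D = σ_{b}^{2}$ and $Σ_{i} = σ_{ϵ}^{2} I_{m_{i}}$ . Under these assumptions the marginal covariance is compound symmetric, so its determinant and inverse have simple closed forms. The function names are illustrative and are not taken from the R function described in this paper.

```python
import math

def conditional_loglik(y, mu, b, sigma2_eps):
    """log p(y_i | beta, b_i, sigma2_eps) for a random intercept b.

    y  : responses of one subject
    mu : fixed part X_i beta for the same time points
    """
    m = len(y)
    rss = sum((yj - (mj + b)) ** 2 for yj, mj in zip(y, mu))
    return -0.5 * m * math.log(2.0 * math.pi * sigma2_eps) - 0.5 * rss / sigma2_eps

def marginal_loglik(y, mu, sigma2_b, sigma2_eps):
    """log N_m(y | mu, sigma2_b * J + sigma2_eps * I): Equation (2) with Z_i = 1_m."""
    m = len(y)
    r = [yj - mj for yj, mj in zip(y, mu)]
    s1 = sum(r)                       # sum of residuals
    s2 = sum(rj * rj for rj in r)     # sum of squared residuals
    total = sigma2_eps + m * sigma2_b
    # Determinant of the compound-symmetric covariance matrix.
    log_det = (m - 1) * math.log(sigma2_eps) + math.log(total)
    # Quadratic form r' V^{-1} r via the Sherman-Morrison identity.
    quad = (s2 - sigma2_b * s1 * s1 / total) / sigma2_eps
    return -0.5 * (m * math.log(2.0 * math.pi) + log_det + quad)

# Hypothetical subject with four measurements and an arbitrary fixed part.
y_i = [21.0, 20.0, 21.5, 23.0]
mu_i = [21.2, 21.9, 22.6, 23.3]
print(conditional_loglik(y_i, mu_i, b=-0.5, sigma2_eps=3.0))
print(marginal_loglik(y_i, mu_i, sigma2_b=2.0, sigma2_eps=3.0))
```

The conditional value depends on the particular $b_{i}$ supplied, whereas the marginal value depends only on the variance components; integrating the conditional density against $N (0, σ_{b}^{2})$ numerically should reproduce the marginal value.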

3. The skew-normal linear mixed model

An m-dimensional random vector $Y$ follows an m-variate skew-normal (SN) distribution with location vector $μ_{0} \in \mathbb{R}^{m},$ $m \times m$ positive definite scale matrix $H$ and $m \times q$ skewness matrix $Δ,$ if its density function is given by

\begin{aligned} f (y | μ_{0}, H, Δ) & = 2^{q} ϕ_{m} (y | μ_{0}, H + Δ Δ^{'}) \\ \times Φ_{q} (Δ^{'} {(H + Δ Δ^{'})}^{- 1} (y - μ_{0}) | 0, {(I_{q} + Δ^{'} H^{- 1} Δ)}^{- 1}), \end{aligned}

(3)

where $ϕ_{m}$ and $Φ_{q}$ are the density function and the cumulative distribution function of the m-dimensional and q-dimensional normal distribution, respectively. If we substitute $Δ = 0,$ Equation (3) reduces to the usual symmetric multivariate normal distribution $N_{m} (μ_{0}, H) .$ Arellano et al. [3] denote this by $Y \sim S N_{m, q} (μ_{0}, H, Δ)$ and by $Y \sim S N_{m} (μ_{0}, H, Δ)$ when $m = q .$ Also, when $m = q,$ $Δ = diag (δ_{1}, \dots, δ_{m})$ and $H$ diagonal, Equation (3) reduces to the multivariate skew-normal distribution, see e.g. [47]. In practical settings, when the distributions of the response and the covariate are highly skewed, it might be more realistic to assume a multivariate SN for both random effects and measurement error [22].
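As an illustration of Equation (3), the sketch below evaluates the density in the scalar case $m = q = 1$ , where $H = h > 0$ and $Δ = δ$ are numbers. This specialization and the function names are assumptions made for illustration only; the general case requires multivariate normal densities and distribution functions.

```python
import math

def std_normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def skew_normal_pdf(y, mu0, h, delta):
    """Equation (3) with m = q = 1, scale h > 0 and skewness delta."""
    var = h + delta * delta            # H + Delta Delta'
    z = y - mu0
    phi = math.exp(-0.5 * z * z / var) / math.sqrt(2.0 * math.pi * var)
    # Phi_1(x | 0, v) with x = delta * z / var and
    # v = (1 + delta^2 / h)^(-1) = h / var.
    x = delta * z / var
    v = h / var
    return 2.0 * phi * std_normal_cdf(x / math.sqrt(v))

# With delta = 0 the density is the N(mu0, h) density; delta > 0 skews it to the right.
for delta in (0.0, 1.0, 2.0):
    print(delta, skew_normal_pdf(1.0, mu0=0.0, h=1.0, delta=delta))
```

In this parameterization the mean of the scalar distribution is $μ_{0} + δ \sqrt{2 / π}$ , which gives a quick numerical check of an implementation.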

The classical LMM (1) can be extended by assuming that

b_{i} \sim S N_{q} (0, D, Δ_{b}) and ϵ_{i} \sim S N_{m_{i}} (0, Ψ_{i}, Δ_{ϵ_{i}}), i = 1, \dots, n,

all independent. This results in the following skew-normal linear mixed model (SNLMM):

\begin{aligned} y_{i} | b_{i}, β, Ψ_{i}, Δ_{ϵ_{i}} & \sim S N_{m_{i}} (X_{i} β + Z_{i} b_{i}, Ψ_{i}, Δ_{ϵ_{i}}) \\ b_{i} | D, Δ_{b} & \sim S N_{q} (0, D, Δ_{b}), \end{aligned}

where $D = D (α)$ is a dispersion matrix, usually associated with the between-units variances, with $α$ unknown parameters in $D .$ In addition, $Δ_{ϵ_{i}}$ and $Δ_{b}$ are diagonal matrices with unknown elements $δ_{ϵ_{i 1}}, \dots, δ_{ϵ_{i_{m_{i}}}}$ and $δ_{b_{1}}, \dots, δ_{b_{q}},$ respectively. These components correspond to the skewness parameters. The marginal version of the SNLMM was shown by Arellano et al. [4] to be equal to

f_{Y_{i}} (y_{i} | Θ, ϑ) = 2^{m_{i} + q} ϕ_{m_{i}} (y_{i} | X_{i} β, Ψ_{i}) Φ_{m_{i} + q} (μ_{2 i} - Γ_{i} μ_{1 i} | 0, R_{i} + Γ_{i} Λ_{i} Γ_{i}^{'}),

where for $i = 1, \dots, n$ :

\begin{aligned} Ψ_{i} & = (δ_{ϵ}^{2} + σ_{ϵ}^{2}) I_{m_{i}} + Z_{i} (Δ_{b}^{2} + G) Z_{i}^{'}, μ_{1 i} = \frac{Λ_{i} Z_{i}^{'} (y_{i} - X_{i} β)}{δ_{ϵ}^{2} + σ_{ϵ}^{2}}, \\ μ_{2 i} & = (\frac{δ_{ϵ}}{\sqrt{σ_{ϵ}^{2} (δ_{ϵ}^{2} + σ_{ϵ}^{2})}} (y_{i} - X_{i} β)), Γ_{i} = (\begin{matrix} \frac{δ_{ϵ}}{\sqrt{σ_{ϵ}^{2} (δ_{ϵ}^{2} + σ_{ϵ}^{2})}} Z_{i} \\ - Δ_{b} (Δ_{b}^{2} + G)^{- 1} \end{matrix}), \\ R_{i} & = (\begin{matrix} I_{m_{i}} & 0 \\ 0 & (I_{q} + Δ_{b} G^{- 1} Δ_{b})^{- 1} \end{matrix}), Λ_{i} = ((Δ_{b}^{2} + G)^{- 1} + \frac{Z_{i}^{'} Z_{i}}{δ_{ϵ}^{2} + σ_{ϵ}^{2}}) . \end{aligned}

Note that Arellano et al. [4] also suggested a skew-t distribution whereby the basic Gaussian distribution is replaced by the t-distribution.

4. Bayesian criteria for model selection

Let $θ$ represent all model parameters of the LMM. For the marginal LMM, this includes the fixed effects and the parameters making up the covariance matrix of the random effects augmented with skewness parameters for the SNLMM. With the conditional LMM, the random effects are part of $θ$ . Further, we denote the collected (longitudinal) responses by $y$ and the obtained covariate values by the matrix $X$ . The posterior distribution is $p (θ ∣ y, X) = p (y ∣ θ, X) p (θ) / p (y ∣ X)$ . Since the posterior distribution does not have a closed form for the LMM, it is approximated using MCMC methods. Namely, K (dependent) values $θ^{1}, \dots, θ^{K}$ are sampled from the posterior distribution. The true posterior summary measures can then be approximated by their sampled versions.

When describing longitudinal data, a set of well-justified models can be established with different specifications for the fixed effects, random effects, covariance structure of the random effects and measurement error. Therefore, a model selection procedure is necessary to find an adequate model that explains current and future data. A variety of model selection procedures has been proposed in the Bayesian framework, but there is no consensus about the best criterion. Here we discuss the most popular criteria; they are also relatively easy to compute in practice.

4.1. The pseudo-Bayes factor

The Bayes factor (BF) could be viewed as the Bayesian equivalent of the likelihood ratio test. The Bayes factor can be used for testing the hypothesis that $y$ is generated by model $M_{1}$ with parameters $θ_{1}$ versus the alternative model $M_{2}$ with parameters $θ_{2}$ . Hereby BF measures the change from prior to posterior odds in favor of the null model, namely

{BF}_{1, 2} = \frac{p (M_{1} ∣ y)}{1 - p (M_{1} ∣ y)} = \frac{p (M_{1} ∣ y)}{p (M_{2} ∣ y)} = \frac{p (y ∣ M_{1}) p (M_{1})}{p (y ∣ M_{2}) p (M_{2})},

where $p (M_{1})$ and $p (M_{2})$ are the prior model probabilities, commonly set as $p (M_{1}) = p (M_{2}) = 0.5$ . In that case, the Bayes factor in favor of model $M_{1}$ is given by ${BF}_{1, 2} = p (y ∣ M_{1}) / p (y ∣ M_{2})$ where $p (y ∣ M_{r}) = \int p (y ∣ θ_{r}, M_{r}) p (θ_{r} ∣ M_{r}) d θ_{r}$ for $r = 1, 2$ . The use of the Bayes factor is, however, limited in practice since it has been shown to be quite sensitive to the choice of the prior distributions $p (θ_{r} ∣ M_{r})$ and is not defined for improper priors, see e.g. [15].

Several alternatives for BF have been suggested to reduce the impact of $p (θ_{r} ∣ M_{r})$ . One proposal is PSBF, which is based on partitions of the data set as follows. For the ith subject, one partitions the data set into a learning set $y_{L} = \{y_{i} : i \in L\}$ and a testing set $y_{T} = \{y_{i} : i \in T\}$ [14], whereby the testing and learning parts are defined respectively as $T = \{i\}$ and $L = \{1, \dots, i - 1, i + 1, \dots, n\}$ . The pseudo-Bayes factor in favor of model $M_{1}$ with respect to model $M_{2}$ is then obtained as

{PSBF}_{1, 2} = \frac{\prod_{i = 1}^{n} p (y_{i} ∣ y_{(i)}, M_{1})}{\prod_{i = 1}^{n} p (y_{i} ∣ y_{(i)}, M_{2})},

where $y_{(i)}$ is the total sample without $y_{i} .$ The component $p (y_{i} ∣ y_{(i)}, M_{r})$ is the probability of observing $y_{i}$ given the model $M_{r}$ fitted with all observations in the sample except $y_{i}$ . Thus the PSBF makes use of pseudo-marginal likelihoods in the numerator and denominator instead of the classical marginal likelihoods. The product terms are called conditional predictive ordinates (CPOs) [15]. For the ith subject under model $M_{r}$ , ${CPO}_{r, i}$ is defined as ${CPO}_{r, i} = p (y_{i} ∣ y_{(i)}, M_{r})$ . ${CPO}_{r, i}$ is computed from the sampled values $θ_{r}^{1}, \dots, θ_{r}^{K}$ under model $M_{r}$ as follows:

{CPO}_{r, i} \approx {[\frac{1}{K} \sum_{k = 1}^{K} \frac{1}{p (y_{i} ∣ θ_{r}^{k}, M_{r})}]}^{- 1} .

This statistic can be highly unstable for a very small value of the likelihood [44]. To ensure stability, different approaches have been prescribed in the literature [9,10,15,44]. However, there is no perfect approach due to computational issues [25].

The log-pseudo marginal likelihood is then for each model equal to ${LPML}_{r} = \sum_{i = 1}^{n} \log ({CPO}_{r, i})$ . Therefore, the ${PSBF}_{1, 2}$ in favor of model $M_{1}$ with respect to model $M_{2}$ can be computed as

{PSBF}_{1, 2} = \exp ({LPML}_{1} - {LPML}_{2}) .
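The computation of CPO, LPML and PSBF from MCMC output can be sketched as follows, assuming the log-likelihood contributions $\log p (y_{i} ∣ θ_{r}^{k}, M_{r})$ have been stored as a list of K rows with n entries each. The sketch uses the harmonic-mean form of the CPO estimator, evaluated on the log scale for numerical stability; it does not include any of the additional stabilization approaches cited above, and the function names are illustrative.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the arithmetic mean of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def log_cpo(loglik_draws):
    """log CPO_i from the K values log p(y_i | theta^k).

    CPO_i is the harmonic mean of the likelihood values, so its log is
    minus the log of the mean of exp(-loglik).
    """
    return -log_mean_exp([-v for v in loglik_draws])

def lpml(loglik):
    """LPML = sum_i log CPO_i, with loglik[k][i] = log p(y_i | theta^k)."""
    n = len(loglik[0])
    return sum(log_cpo([row[i] for row in loglik]) for i in range(n))

def log_psbf(loglik_m1, loglik_m2):
    """log PSBF_{1,2} = LPML_1 - LPML_2; positive values favor model 1."""
    return lpml(loglik_m1) - lpml(loglik_m2)

# Hypothetical log-likelihood matrices: K = 3 draws, n = 2 units.
ll_m1 = [[-1.0, -2.0], [-1.2, -1.8], [-0.9, -2.1]]
ll_m2 = [[-1.5, -2.5], [-1.4, -2.2], [-1.6, -2.4]]
print(lpml(ll_m1), lpml(ll_m2), log_psbf(ll_m1, ll_m2))
```

For the marginal version of the criterion, a unit $i$ is a subject and the entries are the marginal log-densities of the whole response vector $y_{i}$ ; for the conditional version, the entries are computed given the sampled random effects.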

4.2. The deviance information criterion

The DIC suggested by Spiegelhalter et al. [48] is based on the predictive accuracy of the estimated model defined as

DIC = - 2 \log p (y | \bar{θ}) + 2 p_{D I C},

(4)

where $p_{D I C}$ corresponds to the effective number of parameters, given by

p_{D I C} = - 2 E_{θ | y} [\log p (y | θ)] + 2 \log [p (y | \bar{θ})],

which quantifies the number of parameters to be estimated after incorporating the prior information into the model. As seen above, the point estimator is the posterior mean of the parameters, but other estimates such as the median have also been suggested.

Defining the deviance as $D (θ) = - 2 \log {p (y | θ)} + 2 \log {f (y)}$ , the effective number of parameters can alternatively be written as $p_{D I C} = \overline{D (θ)} - D (\bar{θ})$ where $\overline{D (θ)}$ is the posterior mean of the deviance.

For practical purposes, we can ignore $f (y) .$ The mean deviance $\overline{D (θ)}$ can be approximated by $\frac{1}{K} \sum_{k = 1}^{K} D (θ^{k})$ and the plug-in deviance $D (\bar{θ})$ by $D (\frac{1}{K} \sum_{k = 1}^{K} θ^{k})$ . This criterion is popular because it is easy to compute once we have an MCMC sample and can be directly obtained in several Bayesian packages such as WinBUGS. However, DIC has been criticized, see [49] for details. For instance, DIC is not invariant to non-linear transformations of $θ$ and negative values for $p_{D I C}$ can occur in some cases.
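The approximation of DIC from an MCMC sample can be sketched as below, assuming the draws are stored as lists of parameter values and that a user-supplied function returns $\log p (y | θ)$ for one parameter vector (with $f (y)$ ignored). The function name `dic` and the toy log-likelihood are hypothetical and serve only to illustrate Equation (4).

```python
def dic(draws, loglik):
    """DIC and p_DIC from posterior draws.

    draws  : list of K parameter vectors (each a list of floats)
    loglik : function returning log p(y | theta) for one parameter vector
    """
    n_draws = len(draws)
    dim = len(draws[0])
    # Posterior mean of the parameters (the plug-in point estimate).
    theta_bar = [sum(d[j] for d in draws) / n_draws for j in range(dim)]
    # Posterior mean of the deviance, approximated by the sample average.
    mean_deviance = sum(-2.0 * loglik(d) for d in draws) / n_draws
    # Plug-in deviance evaluated at the posterior mean.
    deviance_at_mean = -2.0 * loglik(theta_bar)
    p_dic = mean_deviance - deviance_at_mean
    return deviance_at_mean + 2.0 * p_dic, p_dic

# Toy example: one observation y = 0 from N(theta, 1), constant terms dropped.
def toy_loglik(theta):
    return -0.5 * theta[0] ** 2

value, p_dic = dic([[-1.0], [0.5], [1.0], [-0.5]], toy_loglik)
print(value, p_dic)
```

Passing the marginal log-likelihood gives mDIC and passing the conditional log-likelihood (with the random effects included in each draw) gives cDIC.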

4.3. The widely applicable information criterion

The widely applicable information criterion (WAIC) [55] is a fully Bayesian estimator that averages over the posterior distribution of $θ$ instead of conditioning on a point estimator $\hat{θ} (y)$ as done for DIC. For a future observation ${\tilde{y}}_{i}$ , this criterion measures the predictive accuracy of the model based on the log-posterior predictive distribution $\log p_{θ | y} ({\tilde{y}}_{i})$ of the parameter vector $θ$ . Since ${\tilde{y}}_{i}$ is unknown, predictive accuracy is defined by the expected log-predictive distribution (elpd) as

{elpd}_{i} = E_{f} [\log p_{θ | y} ({\tilde{y}}_{i})] = \int \log p_{θ | y} ({\tilde{y}}_{i}) f ({\tilde{y}}_{i}) d {\tilde{y}}_{i},

where f is the unknown distribution under the true model. For each observation of a new data set, elpd is computed to establish the predictive accuracy of that data set. This is called the expected log-pointwise predictive density (elppd) defined as $elppd = \sum_{i = 1}^{n} E_{f} [\log p_{θ | y} ({\tilde{y}}_{i})]$ .

Predictive accuracy can also be defined with a point estimate $\hat{θ} (y)$ , often $\hat{θ} (y) = E (θ | y)$ , as the expected log predictive distribution given the point estimator, ${elpd}_{\hat{θ} (y)} = E_{f} [\log p ({\tilde{y}}_{i} | \hat{θ} (y))] = \int \log p ({\tilde{y}}_{i} | \hat{θ} (y)) f ({\tilde{y}}_{i}) d {\tilde{y}}_{i}$ . The log pointwise predictive distribution (lppd) based on the observed data is calculated as follows:

lppd = \log \prod_{i = 1}^{n} p_{θ | y} (y_{i}) = \sum_{i = 1}^{n} \log \int_{θ} p (y_{i} | θ) p (θ | y) d θ .

In practice, lppd can be estimated using an MCMC sample from the posterior distribution as

\hat{lppd} = \sum_{i = 1}^{n} \log [\frac{1}{K} \sum_{k = 1}^{K} p (y_{i} | θ^{k})] .

With the WAIC criterion, the expected log pointwise predictive density elppd is estimated as the log pointwise predictive distribution lppd with a bias correction ${\hat{elppd}}_{W A I C} = \hat{lppd} - p_{W A I C}$ . The measure $p_{W A I C}$ corresponds to an estimate of the effective number of parameters given by

p_{W A I C} = 2 \sum_{i = 1}^{n} [\log (\frac{1}{K} \sum_{k = 1}^{K} p (y_{i} | θ^{k})) - \frac{1}{K} \sum_{k = 1}^{K} \log p (y_{i} | θ^{k})] .

Note that, WAIC can be alternatively expressed as

WAIC = - 2 \hat{lppd} + 2 p_{W A I C},

similar to DIC in (4).

One of the strengths of WAIC is its invariance to the parameterization of the model, which implies that WAIC does not change when $θ$ is replaced by $ψ = h (θ)$ , with h a strictly monotone function.
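The WAIC formulas above translate directly into a short computation on the matrix of pointwise log-likelihood values. The sketch below assumes the same storage as for an MCMC run with K draws and n units, i.e. a list of K rows with entries $\log p (y_{i} | θ^{k})$ ; the function names are illustrative.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the arithmetic mean of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def waic(loglik):
    """Return (WAIC, lppd, p_WAIC) with loglik[k][i] = log p(y_i | theta^k)."""
    n_draws = len(loglik)
    n_units = len(loglik[0])
    lppd = 0.0
    p_waic = 0.0
    for i in range(n_units):
        column = [loglik[k][i] for k in range(n_draws)]
        lpd_i = log_mean_exp(column)          # log of the mean likelihood
        mean_log_i = sum(column) / n_draws    # mean of the log-likelihood
        lppd += lpd_i
        p_waic += 2.0 * (lpd_i - mean_log_i)
    return -2.0 * lppd + 2.0 * p_waic, lppd, p_waic

# Hypothetical log-likelihood matrix: K = 3 draws, n = 2 units.
ll = [[-1.0, -2.0], [-1.2, -1.8], [-0.9, -2.1]]
print(waic(ll))
```

By Jensen's inequality each term of $p_{W A I C}$ is non-negative, and it is zero when the likelihood of a unit does not vary over the draws.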

5. Marginal and conditional criteria

In practice, the choice between conditional and marginal information criteria should be motivated by the aim of the study [52]. Most often, this means that the marginal model selection criteria should be used since they estimate the predictiveness of the model when new clusters (in longitudinal studies, this implies new subjects) are involved, whereas the conditional criteria estimate the predictiveness of the model when new elements in the cluster (in longitudinal studies, new observations from the existing subjects) are involved. Nevertheless, when it comes to selecting the correct LMM it might still be that conditional criteria do a good job. In other words, it might be that the relative ordering of preference models is basically the same for both the conditional and marginal criteria. All of these comments apply to all three considered model selection criteria, but since cDIC is obtained automatically in most Bayesian software, it is the standard criterion in practice. Therefore, the literature shows some focus on DIC when examining the performance of conditional and marginal criteria. Despite the popularity of DIC, many have shown that the asymptotic justification of DIC [48] does not hold for hierarchical models, see e.g. Li et al. [31].

6. Simulation studies

We have carried out three simulation studies. In the first two studies, we based the simulated data on two classical data sets: the Potthoff and Roy data set [41] and the Jimma Infant Growth study [28]. They were chosen because the first is representative for a balanced longitudinal study, while for the second study the time points are (somewhat) irregular and subjects drop out from the study. Using the fitted LMMs as population models, the performance of the conditional and marginal versions of DIC, PSBF and WAIC are contrasted using simulations. mDIC can be obtained from a WinBUGS run by working with the marginal model instead of the hierarchical model. To avoid specifying the marginal model in the estimation process, an R function was implemented, which computes the marginalized version of DIC, PSBF and WAIC for a Gaussian, skew-normal and skew-t distribution of the random effects and measurement error. This R function takes the parameters sampled in the MCMC procedure from any Bayesian package and calculates the marginalized version using the closed form (2) and its extensions allowing for skew-normal and skew-t distributions. In addition, the conditional version of the three criteria is also computed by this function.

The main objective of the simulation study is to assess how well PSBF, DIC and WAIC select the correct model. According to the minimum value strategy, the model with the minimum value for the criterion is selected. Several simulation studies examining the performance of AIC and BIC, see e.g. [29], suggest selecting the more complex model only if its criterion value is more than 5 lower than that of the simpler model. This will be referred to as the absolute difference strategy. We will apply this strategy to all criteria. However, there is no evidence that this threshold is justified outside DIC.
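The two selection strategies can be written down in a few lines. The sketch below assumes that every criterion is expressed on a scale where smaller values are better (for PSBF this would mean working with, for example, $- 2 \, {LPML}$ , which is an assumption here), and that the user indicates which of two candidate models is the simpler one. The function names and the example values are hypothetical.

```python
def select_minimum(criteria):
    """Minimum value strategy: return the model with the smallest criterion value.

    criteria : dict mapping model name to criterion value (smaller is better)
    """
    return min(criteria, key=criteria.get)

def select_absolute_difference(criteria, simple, complex_, threshold=5.0):
    """Absolute difference strategy for two nested candidates.

    The more complex model is chosen only if its criterion value is lower
    than that of the simpler model by more than `threshold`.
    """
    if criteria[simple] - criteria[complex_] > threshold:
        return complex_
    return simple

# Hypothetical criterion values for a random-intercept model (M1)
# and a random-intercept-and-slope model (M2).
values = {"M1": 432.1, "M2": 429.4}
print(select_minimum(values))
print(select_absolute_difference(values, simple="M1", complex_="M2"))
```

In this example the two strategies disagree: the minimum value strategy picks the more complex model, while the difference of 2.7 is too small for the absolute difference strategy to abandon the simpler one.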

6.1. The data sets and population models

In the dental study analyzed by Potthoff and Roy [41], the distance (in mm) from the pituitary to the pterygomaxillary fissure was measured at ages 8, 10, 12 and 14 years on 11 girls and 16 boys. We fitted the following linear mixed model as a function of age and sex (0 = Female, 1 = Male):

y_{i j} = β_{0} + β_{1} {sex}_{i} + β_{2} {age}_{i j} + b_{0 i} + ϵ_{i j}, (i = 1, \dots, 27; j = 1, \dots, 4),

(5)

where $y_{i j}$ is the distance (mm) measured for child i at time j and $b_{0 i}$ is a random intercept assumed to follow $b_{0 i} \sim N (0, σ_{b}^{2})$ . Using the SAS procedure MIXED [34], we obtained the following maximum likelihood estimates: ${\hat{β}}_{0} = 24.9688$ , ${\hat{β}}_{1} = 1.4831$ , ${\hat{β}}_{2} = - 2.3210$ , ${\hat{σ}}_{b}^{2} = 2.0495$ and ${\hat{σ}}_{ϵ}^{2} = 3.2668$ . These values were used as true parameters in this simulation study.

The Jimma Infant Growth data set is based on the growth characteristics of about 8000 live births from South-West Ethiopia examined between September 1992 and September 1993. The growth characteristics height, weight and arm circumference of the babies were examined approximately every 60 days, but there were occasional deviations from the planned visits. Also, some children dropped out from the study for a variety of reasons such as relocation of their parents during the study or death of the child. This creates an unbalanced structure for the data. For the purpose of this simulation study, we have taken weight as response with covariates age and sex (0 = Girls, 1 = Boys) of the child, and age of the mother at delivery (agem). The details of the original analysis can be found in [28,30] where a sample of 495 children was selected to fit the model. This subset will also be the basis for this simulation study. The weight evolves in a non-linear way. To make use of an LMM, the time variable age was transformed into ${newage}_{i j} = \sqrt{{age}_{i j}} - ({age}_{i j} + 1) - 0.02 \times {age}_{i j}$ using fractional polynomials [30]. Initially, our population model is based on the following random intercept and slope model:

y_{i j} = β_{0} + β_{1} {sex}_{i} + β_{2} {newage}_{i j} + β_{3} {agem}_{i} + b_{0 i} + b_{1 i} \times {newage}_{i j} + ϵ_{i j},

(6)

assuming $(b_{0 i}, b_{1 i})^{'} \sim N (0, D)$ . Again, the estimates from this model (see Appendix) are used as the true values for the parameters in the simulation.

6.2. Simulation study 1

In the first simulation study, we consider the most popular setting, in which both the random effects and the measurement error are assumed to be normally distributed. We believe that it is essential to show the performance of the selection criteria in this setting. The performance of the model selection criteria may depend on whether the models differ in the fixed components or in the random effects structure. Therefore, for each of the two data sets, we examined the performance of the conditional and marginal criteria under two scenarios. In Scenario I, we assumed that the random effects structure is known but that the considered models differ from the true model in the fixed part. In Scenario II, we assumed that the fixed part is known but the random effects part is unknown.

Regarding the prior distributions, we assigned independent vague normal priors, $N (0, 1000^{2})$ for the regression coefficients and a vague inverse gamma prior for the residual variance, i.e. $σ^{2} \sim I G (0.001, 0.001)$ . The conditionally conjugate prior for the random-effects covariance matrix is the inverse Wishart distribution, but this choice has been shown to be problematic when the number of clusters (here subjects) is small [16,42]. Therefore, we have taken uniform priors $U (0, 100)$ for the standard deviation of the random effects, see [16]. For the models with at least random intercept and slope, we assigned a uniform prior distribution $U (- 0.5, 0.5)$ for all pairwise correlations between random effects to ensure positive definiteness of the covariance matrix $D$ [40] following a proof in [8].
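The effect of restricting pairwise correlations to (−0.5, 0.5) can be checked numerically for the case of three random effects. The sketch below is an illustration only and does not reproduce the proof referred to above: it draws correlations from the uniform prior and tests positive definiteness of the 3 × 3 correlation matrix with Sylvester's criterion. The function name, seed and number of draws are assumptions made for this example.

```python
import random

def is_positive_definite_3x3(r12, r13, r23):
    """Sylvester's criterion for a 3x3 correlation matrix with unit diagonal:
    all leading principal minors must be strictly positive."""
    minor2 = 1.0 - r12 ** 2
    det3 = 1.0 + 2.0 * r12 * r13 * r23 - r12 ** 2 - r13 ** 2 - r23 ** 2
    return minor2 > 0.0 and det3 > 0.0

rng = random.Random(2019)  # arbitrary seed, for reproducibility only

# Draw the three pairwise correlations from U(-0.5, 0.5), as in the prior above
draws = [tuple(rng.uniform(-0.5, 0.5) for _ in range(3)) for _ in range(10_000)]
all_pd = all(is_positive_definite_3x3(*d) for d in draws)

# Correlations outside that range can break positive definiteness
counterexample_pd = is_positive_definite_3x3(0.9, 0.9, -0.9)

print("all draws positive definite:", all_pd)
print("(0.9, 0.9, -0.9) positive definite:", counterexample_pd)
```

With unit diagonal and every off-diagonal entry below 0.5 in absolute value, each row of a 3 × 3 matrix is strictly diagonally dominant, which is why no draw fails; the same bound does not give this guarantee for larger matrices.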

6.2.1. The balanced case: the Potthoff and Roy data set

As indicated above, we have considered two scenarios:

Scenario I: We assumed that the random effects structure is correct and considered models that differ in the fixed part. Besides the true data-generating model (5), we considered an overspecified model, which includes the interaction of age with sex and an underspecified model, which ignores the effect of sex. Hence, the alternative models are

$y_{i j} = β_{0} + β_{1} {age}_{i j} + β_{2} {sex}_{i} + β_{3} {age}_{i j} \times {sex}_{i} + b_{0 i} + ϵ_{i j}$ (overspecified),
$y_{i j} = β_{0} + β_{1} {age}_{i j} + b_{0 i} + ϵ_{i j}$ (underspecified).

Scenario II: We assumed that the fixed structure is correct and considered models that differ in the random effects. The overspecified model includes an additional random slope whereas the underspecified alternative ignores the random intercept in the data, more specifically

$y_{i j} = β_{0} + β_{1} {age}_{i j} + β_{2} {sex}_{i} + b_{0 i} + b_{1 i} \times {age}_{i j} + ϵ_{i j}$ (overspecified),
$y_{i j} = β_{0} + β_{1} {age}_{i j} + β_{2} {sex}_{i} + ϵ_{i j}$ (underspecified).

We simulated 500 data sets based on model (5). The covariate age was taken as in the original data set (8, 10, 12 and 14 years) and sex was generated from a Bernoulli distribution with probability of success equal to 0.6, where 0.6 is the proportion of boys in the original data set. All the models in this simulation study were estimated based on three chains of 15,000 iterations (discarding the first 5000 as burn-in) with thinning equal to 10. Convergence of the MCMC samples was assessed with the Brooks–Gelman–Rubin (BGR) diagnostic. In cases where BGR was larger than 1.1, a new MCMC sample was drawn with 10,000 extra iterations, and this was repeated until convergence was obtained.
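The data-generating step can be sketched as follows. This is a minimal illustration using Python's standard library, not the authors' simulation code; it uses the parameter values exactly as quoted in Section 6.1, and the function name `simulate_dataset` and the seed are hypothetical.

```python
import random

# "True" values as quoted in Section 6.1
BETA0, BETA_SEX, BETA_AGE = 24.9688, 1.4831, -2.3210
VAR_B, VAR_EPS = 2.0495, 3.2668
AGES = (8, 10, 12, 14)  # measurement ages in the original data set

def simulate_dataset(rng, n_subjects=27, p_boy=0.6):
    """Simulate one data set from the random-intercept model (5).
    Returns a list of (subject, sex, age, y) tuples."""
    rows = []
    for i in range(n_subjects):
        sex = 1 if rng.random() < p_boy else 0       # Bernoulli(0.6)
        b0 = rng.gauss(0.0, VAR_B ** 0.5)            # random intercept
        for age in AGES:
            eps = rng.gauss(0.0, VAR_EPS ** 0.5)     # measurement error
            y = BETA0 + BETA_SEX * sex + BETA_AGE * age + b0 + eps
            rows.append((i, sex, age, y))
    return rows

rng = random.Random(1)  # arbitrary seed
datasets = [simulate_dataset(rng) for _ in range(500)]
print(len(datasets), "data sets,", len(datasets[0]), "rows each")
```

Each simulated data set would then be fitted with the true, overspecified and underspecified models, and the selection criteria compared across the 500 replicates.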

In Table 1, we present for each criterion and for the two selection strategies, the percentage of times the correct, the overspecified or the underspecified model was chosen. The performance of the marginalized criteria is clearly better than the conditional counterparts in all cases. For instance, when using the minimum value selection rule, in most cases the percentage of correct selection for the marginalized version is almost twice that of the conditional counterpart. In addition, note that for the absolute difference rule in Scenario I, the percentage of correct model selections for the conditional version of DIC and of WAIC is basically zero. This strategy seems to work well also for PSBF and WAIC in Scenario II, but not in Scenario I. In Scenario II, the conditional versions of DIC, PSBF and WAIC favor overspecified models with additional random effects as also observed in [6] for financial volatility models.
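The percentages reported under the minimum value rule can be computed from per-replicate criterion values as sketched below. The sketch assumes that every criterion is expressed on a scale where smaller is better; the absolute difference rule, which is defined earlier in the paper, is not reproduced. The function name and the toy values are made up for illustration.

```python
from collections import Counter

def percent_selected(criterion_values, labels=("over", "correct", "under")):
    """Minimum value rule: in each replicate the model with the smallest
    criterion value wins. `criterion_values` holds one tuple per replicate,
    ordered like `labels`. Returns the percentage of wins per model."""
    wins = Counter()
    for values in criterion_values:
        best = min(range(len(values)), key=lambda k: values[k])
        wins[labels[best]] += 1
    n = len(criterion_values)
    return {lab: 100.0 * wins[lab] / n for lab in labels}

# Made-up criterion values for four replicates (overspecified, correct, underspecified)
toy = [
    (210.3, 208.9, 215.0),
    (199.1, 199.8, 204.2),
    (230.5, 229.0, 228.7),
    (205.0, 203.1, 209.9),
]
result = percent_selected(toy)
print(result)
```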

Table 1. Simulation study 1: performance of the Bayesian model selection criteria for the Potthoff & Roy data set.

		Minimum value			Absolute difference
Scenario	Criteria	Over	Correct	Under	Over	Correct	Under
I	cDIC	18.6	67.6	13.8	2.4	1.0	96.6
	mDIC	16.8	76.4	6.8	1.4	55.2	43.4
	cPSBF	27.0	43.0	30.0	18.6	29.8	51.6
	mPSBF	17.6	75.2	7.2	2.8	65.2	32.0
	cWAIC	19.8	31.0	49.2	2.6	0.0	97.4
	mWAIC	18.8	75.0	6.2	1.4	58.4	40.2
II	cDIC	46.2	53.8	0.0	10.4	89.6	0.0
	mDIC	15.0	85.0	0.0	0.6	99.4	0.0
	cPSBF	52.4	47.6	0.0	32.0	68.0	0.0
	mPSBF	14.4	85.6	0.0	1.2	98.8	0.0
	cWAIC	63.2	36.8	0.0	16.0	84.0	0.0
	mWAIC	18.0	82.0	0.0	0.8	99.2	0.0

6.2.2. The unbalanced case: the Jimma infant growth study

Again we considered two scenarios:

Scenario I: We assumed that the random effects structure is correct and considered the following models that differ in the fixed part parameters, namely

Model (6) and including the interaction $newage \times sex$ (overspecified),
Model (6) but ignoring the covariate sex (underspecified).

Scenario II: We assumed that the covariates in the fixed part are correct and considered the following models that differ in the random effects structure, i.e.

Model (6) and including an additional random slope for ${newage}^{2}$ (overspecified),
Model (6) but ignoring the random slope for $newage$ (underspecified).

We generated 500 data sets from model (6). The covariate sex was generated from a Bernoulli distribution with probability of success equal to 0.6, where 0.6 is the proportion of boys in the original data set. The age of the mother was generated from a normal distribution ${agem}_{i} \sim N (24.49, 6.29)$, and we have taken $0, 60, 120, \dots, 360$ days as the measurement times, mirroring the visit schedule of approximately every 60 days in the original study. We created an unbalanced data set by allowing subjects to drop out randomly at days 240, 300 or 360.
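One way to generate such an unbalanced visit pattern is sketched below. The text only states that subjects may drop out randomly at days 240, 300 or 360, so the dropout probability `p_dropout`, the equal chance given to each dropout day, and the use of 495 children are assumptions made for this illustration.

```python
import random

VISIT_DAYS = list(range(0, 361, 60))   # planned visits: 0, 60, ..., 360 days
DROPOUT_DAYS = (240, 300, 360)         # dropout moments named in the text

def visit_schedule(rng, p_dropout=0.3):
    """Return the visit days observed for one child. With probability
    `p_dropout` (hypothetical) the child leaves at a randomly chosen dropout
    day and misses that visit and all later ones."""
    if rng.random() < p_dropout:
        leave = rng.choice(DROPOUT_DAYS)
        return [d for d in VISIT_DAYS if d < leave]
    return list(VISIT_DAYS)

rng = random.Random(42)  # arbitrary seed
schedules = [visit_schedule(rng) for _ in range(495)]
visits_per_child = sorted({len(s) for s in schedules})
print("distinct numbers of visits per child:", visits_per_child)
```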

As shown in Table 2, the marginalized criteria strongly outperform their conditional counterparts in both scenarios and selection strategies. We see again for Scenario II that all conditional criteria support the overspecified alternative with an additional random slope and that in this scenario the absolute difference strategy also works for PSBF and WAIC. With the minimum value rule in Scenario I, the probability of correctly selecting the data-generating model is about 1/3 with the conditional criteria. Hence, with three candidate models, model selection based on the conditional criteria performs no better than selecting a model at random.

Table 2. Simulation study 1: performance of the Bayesian model selection criteria for the Jimma infant growth data set.

		Minimum value			Absolute difference
Scenario	Criteria	Over	Correct	Under	Over	Correct	Under
I	cDIC	34.4	34.0	31.6	15.2	29.0	55.8
	mDIC	21.2	58.0	20.8	0.8	32.4	66.8
	cPSBF	33.0	32.8	34.2	47.0	31.8	21.2
	mPSBF	21.0	57.8	21.2	3.0	44.0	53.0
	cWAIC	36.2	31.2	32.6	14.4	26.4	59.2
	mWAIC	21.2	58.2	20.6	0.8	32.6	66.6
II	cDIC	63.2	36.8	0.0	43.2	56.8	0.0
	mDIC	26.4	73.6	0.0	0.2	99.8	0.0
	cPSBF	55.2	44.8	0.0	51.8	48.2	0.0
	mPSBF	28.0	72.0	0.0	2.8	97.2	0.0
	cWAIC	66.0	34.0	0.0	49.2	50.8	0.0
	mWAIC	27.4	72.6	0.0	0.2	99.8	0.0

6.3. Simulation study 2: additional simulations for the balanced case

We evaluated the sensitivity of the results to some changes in the population model based on the Potthoff and Roy data. First, we varied the signal-to-noise ratio in model (5) by setting $σ_{ϵ}^{2}$ to $\frac{1}{4}$, $\frac{1}{2}$, 1, 2 and 4 times the estimated residual variance given in Section 6.1. Table 3 displays the results on model selection. Again, the marginal criteria outperform their conditional counterparts irrespective of the scenario and selection strategy. Note that, under the absolute difference strategy, the performance of mDIC decreases with increasing residual variance.
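To see how strongly these multipliers change the balance between subject-level signal and noise, one can compute the share of the total variance that is due to the random intercept (the intraclass correlation). This summary is an assumption of the sketch, not a quantity reported in the paper; the variances are those quoted in Section 6.1.

```python
VAR_B, VAR_EPS = 2.0495, 3.2668       # values as quoted in Section 6.1
MULTIPLIERS = (0.25, 0.5, 1, 2, 4)    # factors applied to the residual variance

def intraclass_correlation(var_b, var_eps):
    """Share of the total variance attributable to the random intercept."""
    return var_b / (var_b + var_eps)

icc = {k: intraclass_correlation(VAR_B, k * VAR_EPS) for k in MULTIPLIERS}
for k, value in icc.items():
    print(f"residual variance x {k}: ICC = {value:.3f}")
```

The intraclass correlation falls steadily as the residual variance grows, so the five settings range from data dominated by between-subject differences to data dominated by measurement error.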

Table 3. Simulation study 2: percentage correct selection when changing the residual variance in the Potthoff & Roy data set.

		Minimum value					Absolute difference
Scenario	Criteria	0.25	0.5	1	2	4	0.25	0.5	1	2	4
I	cDIC	64.6	70.2	77.0	77.8	79.2	0.6	1.2	3.2	10.8	24.6
	mDIC	81.6	83.0	83.0	82.8	82.0	93.0	92.8	92.6	88.6	78.6
	cPSBF	31.8	36.6	40.8	58.4	68.0	30.3	38.2	39.0	39.6	39.4
	mPSBF	91.2	94.0	83.2	90.8	87.8	95.4	97.8	93.0	97.8	94.0
	cWAIC	41.4	36.6	39.4	38.4	39.0	0.4	0.2	0.2	0.4	0.2
	mWAIC	81.2	81.6	82.4	82.0	81.6	92.2	93.0	92.8	89.0	79.0
II	cDIC	44.4	47.4	50.8	51.6	55.4	86.2	86.2	87.2	88.4	89.0
	mDIC	80.4	82.4	83.6	85.4	86.4	99.2	99.4	99.6	99.6	90.2
	cPSBF	60.4	58.4	44.8	62.2	73.4	52.0	55.8	65.8	67.5	69.6
	mPSBF	83.8	86.8	84.2	84.0	83.8	98.7	97.9	97.6	91.2	86.4
	cWAIC	34.4	32.8	34.2	36.6	36.2	81.4	81.8	83.4	81.0	82.6
	mWAIC	77.6	81.0	82.6	82.0	82.4	97.6	99.2	99.2	99.0	92.2

Second, we varied the number of subjects in the study, taking 25, 50, 75 and 100. As shown in Table 4, the marginal criteria perform best regardless of the sample size. Note also that the performance of the marginal criteria tends to increase with increasing sample size in both scenarios and selection strategies, which is not the case for the conditional criteria. For instance, the percentage of correct model selection for cDIC decreases with sample size for Scenario II with both selection rules.

Table 4. Simulation study 2: percentage correct selection when changing the sample size in the Potthoff & Roy data set.

		Minimum value				Absolute difference
Scenario	Criteria	25	50	75	100	25	50	75	100
I	cDIC	67.6	77.0	79.0	80.6	1.0	3.2	7.2	19.0
	mDIC	76.4	83.0	84.2	82.8	52.2	92.6	98.4	99.0
	cPSBF	43.0	40.8	49.4	45.4	0.0	0.4	0.8	0.0
	mPSBF	75.2	83.0	84.4	83.0	83.1	93.0	93.2	96.1
	cWAIC	31.0	39.4	41.4	43.8	0.0	44.8	44.6	40.6
	mWAIC	75.0	82.4	83.8	82.0	56.2	92.8	98.8	98.8
II	cDIC	53.8	50.8	47.4	41.0	89.6	87.2	87.2	84.8
	mDIC	85.0	83.6	86.2	83.8	99.2	99.6	99.2	99.4
	cPSBF	47.6	44.8	47.8	53.0	65.2	65.8	66.2	65.8
	mPSBF	85.6	84.2	86.0	83.0	90.2	97.6	97.6	97.9
	cWAIC	36.8	34.2	34.2	31.8	83.8	83.4	83.2	82.6
	mWAIC	82.0	82.6	84.6	80.2	99.4	99.2	99.2	99.3

Our results are in line with the findings in [33], who pointed out asymptotic problems with cDIC. Our simulation study also indicates that cWAIC is not better in this sense.

We additionally evaluated the model selection performance for alternative versions of DIC and WAIC. We denote as ${DIC}_{1}$ the criterion advocated in [48], where the complexity ($p_{DIC1}$) is defined in Section 4.2. The alternative version ${DIC}_{2}$ is the approximation to ${DIC}_{1}$ [17]. Its complexity penalty ($p_{DIC2}$) is a function of the variance of the deviance calculated as

p_{DIC2} = 2 {var}_{θ | y} (\log p (y | θ)) .

(7)

Further, we modified DIC by letting the penalty term depend on the sample size. It has been suggested in [23] that the penalization should be defined based on the effective sample size $n_{e}$, which depends on the within-subjects error structure. In the context of the LMM, statistical software like SAS defines $n_{e}$ as the total number of (independent) subjects, i.e. $n_{e} = n$. Otherwise, $n_{e}$ is defined as the total number of data points, $n_{e} = n_{T}$. We defined the criteria ${DIC}_{3}$ and ${DIC}_{4}$ with effective degrees of freedom $p_{DIC3} = \log (n) p_{DIC1}$ and $p_{DIC4} = \log (n_{T}) p_{DIC1}$, respectively. These modifications are more of a BIC type, as pointed out by a referee; however, we believe that it is a useful exercise to evaluate their performance in this context.
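The alternative penalties are simple functions of quantities already available from the MCMC output. The sketch below, with hypothetical function names and made-up inputs, computes the variance-based penalty of Equation (7) from posterior draws of the log-likelihood and the two BIC-type penalties from a given value of the first penalty. The subject and observation counts (27 and 108) correspond to the dimensions of model (5).

```python
import math
import statistics

def p_dic2(loglik_draws):
    """Equation (7): twice the posterior variance of log p(y | theta),
    estimated from one log-likelihood value per posterior draw."""
    return 2.0 * statistics.variance(loglik_draws)

def bic_type_penalties(p_dic1, n_subjects, n_total):
    """Return (p_DIC3, p_DIC4) = (log(n) * p_DIC1, log(n_T) * p_DIC1)."""
    return math.log(n_subjects) * p_dic1, math.log(n_total) * p_dic1

toy_loglik = [-100.0, -102.0, -98.0]   # made-up log-likelihood draws
p2 = p_dic2(toy_loglik)

# p_dic1 = 5.0 is a made-up value; 27 subjects with 4 measurements each
p3, p4 = bic_type_penalties(p_dic1=5.0, n_subjects=27, n_total=108)
print(p2, p3, p4)
```

Because the total number of observations is at least the number of subjects, the fourth penalty is never smaller than the third.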

The effective number of parameters of WAIC can be estimated in two ways [18]: $p_{WAIC1}$ as defined in Section 4.3, and the alternative version $p_{WAIC2}$, given as the sum over units of the posterior variance of the log-likelihood contributions,

p_{WAIC2} = \sum_{i = 1}^{n} {var}_{θ | y} (\log p (y_{i} | θ)) .
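Given a matrix of pointwise log-likelihood values, this penalty and the resulting WAIC can be computed as sketched below. The sketch follows the deviance-scale definition in [18], WAIC = −2 (lppd − p_WAIC); the function name and the toy values are assumptions. For a conditional criterion the columns would hold the observation-level log-likelihoods given the random effects, and for a marginal criterion the subject-level log-likelihoods with the random effects integrated out.

```python
import math
import statistics

def waic_from_loglik(loglik):
    """loglik[s][i] = log p(y_i | theta_s) for posterior draw s and unit i.
    Returns the lppd, the variance-based penalty and WAIC (deviance scale)."""
    n_draws = len(loglik)
    n_units = len(loglik[0])
    lppd = 0.0
    p_waic2 = 0.0
    for i in range(n_units):
        col = [loglik[s][i] for s in range(n_draws)]
        m = max(col)  # log-sum-exp trick for numerical stability
        lppd += m + math.log(sum(math.exp(v - m) for v in col) / n_draws)
        p_waic2 += statistics.variance(col)
    return {"lppd": lppd, "p_waic2": p_waic2, "waic": -2.0 * (lppd - p_waic2)}

# Made-up log-likelihood values: 3 posterior draws, 2 units
toy = [[-1.1, -2.3], [-0.9, -2.0], [-1.0, -2.2]]
out = waic_from_loglik(toy)
print(out)

# Degenerate check: with no posterior variation the penalty vanishes
flat = waic_from_loglik([[-1.0, -2.0]] * 3)
```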

We notice from Table 5 that Spiegelhalter's DIC (${DIC}_{1}$) outperforms ${DIC}_{2}$ for the conditional versions in Scenario I. This may be expected since the alternative definition (7) is explicitly based on approximate posterior normality, which is likely not satisfied in the hierarchical version of the model. In Scenario II, however, Table 5 shows the reverse ordering for the conditional versions. The marginal versions of ${DIC}_{1}$ and ${DIC}_{2}$ perform similarly.

Table 5. Simulation study 2: performance of alternative criteria for the Potthoff & Roy data set.

		Minimum value			Absolute difference
Scenario	Criteria	Over	Correct	Under	Over	Correct	Under
I	$cDIC_{1}$	18.6	67.6	13.8	2.4	1.0	96.6
	$cDIC_{2}$	11.8	36.0	52.2	1.6	0.0	98.4
	$cDIC_{3}$	3.2	85.0	11.8	0.6	22.8	76.6
	$cDIC_{4}$	4.2	40.4	55.4	1.4	19.6	79.0
	$cWAIC_{1}$	19.8	31.0	49.2	2.6	0.0	97.4
	$cWAIC_{2}$	16.8	41.2	42.0	2.6	0.0	97.4
	$mDIC_{1}$	16.8	76.4	6.8	1.4	55.2	43.4
	$mDIC_{2}$	16.8	73.8	9.4	1.4	52.2	46.4
	$mDIC_{3}$	1.8	65.4	32.8	0.2	34.0	65.8
	$mDIC_{4}$	2.8	53.2	44.0	0.2	24.0	75.8
	$mWAIC_{1}$	18.8	75.0	6.2	1.4	58.4	40.2
	$mWAIC_{2}$	17.8	75.2	7.0	1.4	56.2	42.4
II	$cDIC_{1}$	46.2	53.8	0.0	10.4	89.6	0.0
	$cDIC_{2}$	0.6	99.2	0.2	0.0	99.4	0.6
	$cDIC_{3}$	0.0	47.8	52.2	0.0	36.8	63.2
	$cDIC_{4}$	0.0	0.8	99.2	0.0	0.6	99.4
	$cWAIC_{1}$	63.2	36.8	0.0	16.0	84.0	0.0
	$cWAIC_{2}$	55.8	44.2	0.0	10.4	89.6	0.0
	$mDIC_{1}$	15.0	85.0	0.0	0.6	99.4	0.0
	$mDIC_{2}$	8.0	92.0	0.0	0.2	99.8	0.0
	$mDIC_{3}$	2.4	97.6	0.0	0.2	99.6	0.2
	$mDIC_{4}$	0.4	99.6	0.0	0.0	99.2	0.8
	$mWAIC_{1}$	18.0	82.0	0.0	0.8	99.2	0.0
	$mWAIC_{2}$	15.4	84.6	0.0	0.8	99.2	0.0

As expected, ${DIC}_{4}$ penalizes model complexity more heavily than ${DIC}_{3} .$ Regardless of the selection strategy, we observed that by increasing the penalization, the percentage of correct model selection decreases under the marginal versions and increases under the conditional versions.

As for the different versions of WAIC, we observed that the percentage of correct selection for ${WAIC}_{2}$ is slightly higher in the conditional version, whereas the performance of the marginal versions is similar irrespective of the scenario. The absolute difference rule, however, is not a good choice for the conditional versions of the DIC and WAIC alternatives.

6.4. Simulation study 3: extra simulation for possible extensions of LMM

6.4.1. Simulation study: joint selection of both fixed and random effects

Depending on the data at hand, researchers are usually faced with the challenge of choosing the correct model. It is therefore important to select a parsimonious model that fits the data accurately. Since there is little agreement on which criteria to choose for Bayesian model selection, we evaluated the performance of the marginal and conditional criteria in choosing the correct model among several alternatives. Based on the Potthoff & Roy data, we generated 500 data sets from Equation (5) and considered five alternative models for the data. The alternatives differ in (i) the scale of the covariates, (ii) the distributional assumptions for the random effects, the measurement error or both, (iii) the nature of the measurement error (homoscedastic or heteroscedastic) and (iv) the random effects structure. The following models were considered jointly with the model given by Equation (5).

C1: The model generating data specified in Equation (5).
C2: Equation (5) with age replaced by ${age}^{2}$ and including an additional random slope for age.
C3: Equation (5) with age replaced by ${age}^{2}$.
C4: Equation (5) with age replaced by $\log (age)$.
C5: Equation (5) with the normality assumption for random effects replaced by the skew-normal assumption.
C6: Equation (5) with the normality assumption for random effects replaced by the skew-normal assumption and heteroscedastic measurement error is assumed.

As seen in Table 6, the marginal criteria select the data-generating model (C1) about 70% of the time, in contrast to the conditional criteria, which select the true model only about 10% of the time. It is interesting to note that the conditional criteria select C5 (the model that assumes a skew-normal distribution for the random effects) about 65% of the time, while the marginal criteria choose C5 about 2% of the time. The results show the superiority of the marginal criteria in selecting the true data-generating model.

Table 6. Simulation study 3: percentage of times each criterion selected each of the candidate models described in Section 6.4.1, for the Potthoff & Roy data set.

	Model
Criteria	C1	C2	C3	C4	C5	C6
cDIC	12.8	7.0	3.6	4.0	70.6	2.0
cWAIC	13.2	8.4	8.0	4.6	64.2	1.6
cPSBF	10.8	10.6	6.0	5.8	66.8	0.0
mDIC	76.2	18.4	1.2	2.8	1.4	0.0
mWAIC	67.4	20.4	2.2	3.0	4.2	2.8
mPSBF	74.8	8.6	11.4	3.4	1.8	0.0

6.4.2. Simulation study: relaxing the normality assumption for the random effects and measurement errors

We also assessed the performance of the model selection criteria when the normality assumption for the random effects and measurement errors is relaxed. For this simulation study, we generated 500 data sets from the model

y_{i j} = β_{0} + x_{i} β_{1} + t_{i j} β_{2} + b_{0 i} + ϵ_{i j}, i = 1, \dots, n = 200, j = 1, \dots, 6

(8)

where $t_{i j} = j,$ $β_{1} = 2,$ $β_{2} = 1$ and $ϵ_{i j} \sim S N_{1} (0, {0.5}^{2}, 4) .$

First, we assumed that $β_{0} + b_{0 i} \sim N (4, 4)$, i.e. $β_{0} = 4$ and $b_{0 i} \sim N (0, 4)$. Second, to show the advantage of the skew-normal distribution for the random effects, namely its ability to accommodate skewness, we repeated the previous setting but generated $β_{0} + b_{0 i}$ from a $Gamma (2, 1)$ distribution (as done also in [4,26]) with probability density $f (x) = x \exp (- x)$, yielding a highly skewed distribution. The subject-specific covariate $x_{i}$ is binary with $x_{i} = 1$ if $i \leq n / 2$ and zero otherwise, while $t_{i j}$ represents a covariate with values varying within individuals and the same for all individuals. For each of the 500 simulated data sets, model (8) was fitted together with the alternative models constructed as described in Section 6.2.1. We sampled 7000 iterations after discarding the initial 3000 iterations. The thinning factor was set to 7 to limit autocorrelation in the generated chains.
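A data set from model (8) can be simulated with the standard library as sketched below. The skew-normal draws use the usual stochastic representation based on two independent standard normal variables. Reading $S N_{1} (0, {0.5}^{2}, 4)$ as location 0, squared scale 0.5² and shape 4, and reading the second argument of $N (4, 4)$ as a variance, are assumptions of this sketch, as are the function names and the seed.

```python
import math
import random

def rskewnormal(rng, loc, scale, shape):
    """One skew-normal draw via z = delta*|u0| + sqrt(1 - delta^2)*u1,
    with delta = shape / sqrt(1 + shape^2) and u0, u1 standard normal."""
    delta = shape / math.sqrt(1.0 + shape ** 2)
    u0, u1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return loc + scale * (delta * abs(u0) + math.sqrt(1.0 - delta ** 2) * u1)

def simulate_model8(rng, n=200, n_times=6, beta1=2.0, beta2=1.0, intercept="normal"):
    """Simulate (subject, time, x, y) rows from model (8)."""
    rows = []
    for i in range(1, n + 1):
        x = 1 if i <= n // 2 else 0
        if intercept == "normal":
            a = rng.gauss(4.0, 2.0)          # beta0 + b0i ~ N(4, 4), sd = 2
        else:
            a = rng.gammavariate(2.0, 1.0)   # Gamma(2, 1): density x * exp(-x)
        for j in range(1, n_times + 1):
            eps = rskewnormal(rng, 0.0, 0.5, 4.0)
            rows.append((i, j, x, a + beta1 * x + beta2 * j + eps))
    return rows

rng = random.Random(8)  # arbitrary seed
data_normal = simulate_model8(rng, intercept="normal")
data_gamma = simulate_model8(rng, intercept="gamma")
eps_draws = [rskewnormal(rng, 0.0, 0.5, 4.0) for _ in range(5000)]
print(len(data_normal), len(data_gamma))
```

Under this parameterization the skew-normal error has a positive mean, so its location would need to be shifted if a zero-mean error were required.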

The following vague priors were assigned: $β \sim N (0, 10^{2})$, $σ_{ϵ}^{2} \sim I G (0.001, 0.001)$, $σ_{b}^{2} \sim I G (0.001, 0.001)$, $δ_{ϵ} \sim N (0, 10^{2}) \mathbb{I} (δ_{ϵ} > 0)$ and $δ_{b} \sim N (0, 10^{2}) \mathbb{I} (δ_{b} > 0)$, where $\mathbb{I} (\cdot)$ denotes the indicator function. The marginal distribution corresponding to Equation (8) is available in closed form, as seen in Section 3. The simulation results shown in Table 7 confirm the results obtained above under the Gaussian distribution.

Table 7. Simulation study 3: performance of the Bayesian model selection criteria for Gamma(2,1) for random error and $N (0, 4)$ for random effect.

		Minimum value			Absolute difference
Scenario	Criteria	Over	Correct	Under	Over	Correct	Under
I	cDIC	29.6	43.2	27.2	39.8	60.2	0.0
	mDIC	13.0	60.8	26.2	22.4	77.6	0.0
	cPSBF	59.0	28.2	12.8	46.6	52.4	1.0
	mPSBF	11.0	67.4	21.6	44.2	55.8	0.0
	cWAIC	25.4	51.4	23.2	32.6	67.4	0.0
	mWAIC	11.0	62.4	26.6	20.2	79.8	0.0
II	cDIC	18.2	26.4	55.4	38.2	61.8	0.0
	mDIC	18.2	64.4	17.4	15.6	84.4	0.0
	cPSBF	19.2	56.4	37.2	47.2	51.4	1.4
	mPSBF	14.6	70.2	15.2	19.2	78.8	2.0
	cWAIC	15.6	20.4	64.0	32.2	67.8	0.0
	mWAIC	18.2	66.0	15.8	14.4	85.6	0.0

Finally, we repeated the above simulation when (i) both random effects and random error have a skew-normal distribution and when (ii) the random error follows a $t (3)$ distribution. The results (not shown) confirm the above simulation results.

7. Application

The Nigerian indigenous chicken (NIC) data set describes the longitudinal evolution of the body weight (BW) of chickens of different breeds raised on a university experimental farm. Four hundred and sixteen chickens were measured every week from hatching up to 20 weeks. The study aimed to evaluate the growth of different chicken breeds. Here we considered two classes of progenies. Two hundred and seventy chickens were produced from the same parent stock (pure breed), while 146 chickens had different parents (cross breed). The rationale for the study and the experimental design can be found in [1]. See Figure 1 for the evolution of the weights of the chickens over time. Assuming a quadratic growth model with subject-specific random intercept and slopes, we fitted an LMM to the weight at the jth measurement time of the ith chicken as

y_{i j} = β_{0} + β_{1} {breed}_{i} + β_{2} {age}_{i j} + β_{3} {age}_{i j}^{2} + b_{0 i} + b_{1 i} {age}_{i j} + b_{2 i} {age}_{i j}^{2} + ϵ_{i j},

(9)

where $y_{i j}$ is the chicken body weight (kg), ${breed}_{i}$ is the breed indicator (1 = pure breed, 2 = cross breed) and ${age}_{i j}$ is the standardized age. For the purpose of this study, we limited the chicken's age to 13 weeks since a considerable number of chickens died after that age. Thus $x_{i j} = (1, {breed}_{i}, {age}_{i j}, {age}_{i j}^{2})^{'},$ $b_{i} = (b_{0 i}, b_{1 i}, b_{2 i})^{'}$ and $Z_{i j} = (1, {age}_{i j}, {age}_{i j}^{2}),$ $i = 1, \dots, 416, j = 1, \dots, 13$.

Figure 1. — Nigerian indigenous chicken data set: longitudinal profiles of body weight for 416 chickens highlighting 10 randomly chosen chickens.

We first used model (9) together with the classical Gaussian assumptions as model to fit the weights of the chickens over time, and we refer to this as Model 9(a). Based on the model fit, Figure 2 shows histograms and the corresponding Q–Q plots of the standardized posterior means of $b_{i}$ and $ϵ_{i j}$ , whereby the posterior means were divided by their corresponding posterior standard deviations. The plots show that there is apparently a non-normal pattern for subject-specific intercepts and slopes. Also, the residual plot suggests deviation from normality. We note that such plots may be difficult to interpret because the shrinkage effect depends on the number of measurements per subject, see e.g. [53]. But here there were no missing responses up to week 13 and standardization was applied. Nevertheless, these plots triggered us to consider three additional models with the same fixed effects structure but differing in the error and random effects distribution:

Model 9(b): LMM with a univariate skew-normal distribution for measurement error and a trivariate Gaussian distribution for the random effects.
Model 9(c): LMM with a trivariate skew-normal distribution for the random effects and Gaussian measurement error.
Model 9(d): LMM with a univariate skew-normal distribution for measurement error and a trivariate skew-normal distribution for the random effects.

Figure 2. — Nigerian indigenous chicken data set: Histogram and normal Q–Q plots for standardized posterior means of random effects based on Model 9(a): subject-specific intercepts in the first row, subject-specific slope of $age$ in the second row, subject-specific slope for the ${age}^{2}$ in the third row and residual in the fourth row.

The vague priors used are the same as those described in Section 6.4.2. We used 25,000 iterations after discarding the first 10,000, and thinning was set to 10. Convergence of the MCMC samples was assessed with the BGR diagnostic. The resulting parameter estimates are shown in Table 8.
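For readers who want to check convergence outside the sampling software, the sketch below computes the classical Gelman–Rubin potential scale reduction factor for one scalar parameter from several chains. It is a simplified stand-in: the BGR diagnostic reported by Bayesian software may use a different variant (for example one based on interval widths), and the function name and the simulated chains are assumptions.

```python
import random
import statistics

def gelman_rubin(chains):
    """Potential scale reduction factor for one scalar parameter.
    `chains` is a list of equal-length lists of post burn-in draws."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5

rng = random.Random(3)  # arbitrary seed

# Three chains sampling the same distribution (converged)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(3)]
# Three chains stuck around different values (not converged)
stuck = [[rng.gauss(mu, 1.0) for _ in range(1000)] for mu in (0.0, 5.0, 10.0)]

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(f"converged chains: {rhat_mixed:.3f}, stuck chains: {rhat_stuck:.3f}")
```

Values above the 1.1 threshold used in Section 6.2.1 would signal that the chains need to be run longer.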

Table 8. Nigerian indigenous chicken data set: posterior mean (regression coefficients) and median (variance parts), 95% probability intervals and the conditional and marginal criteria under the four fitted models, see Section 7.

	Model 9a			Model 9b			Model 9c			Model 9d
	Estimate	2.5%	97.5%	Estimate	2.5%	97.5%	Estimate	2.5%	97.5%	Estimate	2.5%	97.5%
$β_{0}$	0.335	0.321	0.349	0.369	0.284	0.848	0.359	0.353	0.374	0.315	0.299	0.329
$β_{1}$	−0.008	−0.014	−0.001	−0.009	−0.018	0.000	−0.028	−0.030	−0.021	−0.029	−0.035	−0.023
$β_{2}$	0.239	0.229	0.249	0.308	0.227	0.853	0.235	0.231	0.245	0.232	0.221	0.242
$β_{3}$	0.031	0.027	0.034	0.046	0.028	0.223	0.031	0.030	0.032	0.030	0.028	0.031
$δ_{b 1}$	–	–	–	–	–	–	0.003	0.001	0.009	0.003	0.000	0.009
$δ_{b 2}$	–	–	–	–	–	–	0.002	0.001	0.007	0.002	0.000	0.007
$δ_{b 3}$	–	–	–	–	–	–	0.002	0.001	0.007	0.002	0.000	0.007
$δ_{ϵ}$	–	–	–	0.051	0.048	0.054	–	–	–	0.060	0.055	0.064
$d_{11}$	0.013	0.011	0.015	0.013	0.011	0.319	0.015	0.014	0.017	0.014	0.012	0.016
$d_{12}$	0.010	0.009	0.012	0.010	0.008	0.318	0.007	0.001	0.040	0.008	−0.012	0.031
$d_{13}$	0.001	0.000	0.001	0.000	0.000	0.098	0.005	−0.002	0.023	0.004	−0.019	0.024
$d_{22}$	0.010	0.008	0.011	0.010	0.008	0.383	0.008	0.003	0.122	0.009	0.001	0.085
$d_{23}$	0.002	0.001	0.002	0.002	0.001	0.123	−0.003	−0.011	0.002	−0.003	−0.069	0.002
$d_{33}$	0.001	0.001	0.001	0.001	0.001	0.039	0.008	0.004	0.081	0.006	0.001	0.068
$σ_{ϵ}$	0.001	0.001	0.001	0.000	0.000	0.001	0.002	0.002	0.002	0.001	0.001	0.001
cDIC		−19117.4			−19809.10			−19710.85			−18574.70
cWAIC		−19782.2			−20361.42			−19117.48			−20128.30
cplppd		−15242.3			−15945.86			−15414.23			−16113.33
mDIC		−16821.6			−15673.10			−17269.46			−17362.04
mWAIC		−16808.5			−15472.20			−17488.41			−17511.63
mlppd		−16665.4			−16965.43			−16765.43			−17165.43

It can be observed from Table 8 that the conditional criteria support Model 9(b), which seems to be an incorrect model based on Figure 2. In contrast, the marginal criteria favor Model 9(d), which also appears to be the most appropriate model here. We further evaluated the effect of the quadratic term in the fixed and random effects. The results (not shown) of both versions of the criteria show that ${age}^{2}$ is more important in the random effects part than in the fixed part, and the conditional and the marginal criteria agree on this.

8. Discussion

We have compared three Bayesian selection criteria in the context of LMMs for longitudinal data. In addition, we extended these settings to the skew-normal and t(3) distributions for the random effects and measurement error. The simulation studies show that the marginal criteria outperform their conditional counterparts. Our results confirm the results of [6] for volatility models, [32,36,38] for item response models and [43] for hierarchical models.

It is important to remark that calculating the marginalized criteria does not represent an additional computational effort for LMMs, since the marginalized likelihood can be written in closed form, at least for a number of important distributions for the random effects and measurement errors. However, for generalized linear mixed models computing the marginalized likelihood is more involved and numerical integration methods are needed [43]. The performance of the conditional criteria will be examined in a subsequent paper.
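The closed form is easy to illustrate for the Gaussian random-intercept model (5). For a subject with residual vector r (the responses minus the fixed part) of length n, the marginal covariance matrix is σ_ϵ² I + σ_b² J, whose determinant and inverse are available analytically. The sketch below, with hypothetical function names and made-up residuals, computes the conditional log-likelihood given the random intercept, the closed-form marginal log-likelihood, and a numerical integration over the random intercept as a cross-check.

```python
import math

def conditional_loglik(resid, b, var_eps):
    """log p(y_i | b_i): independent normal errors around the random intercept."""
    n = len(resid)
    ss = sum((r - b) ** 2 for r in resid)
    return -0.5 * n * math.log(2 * math.pi * var_eps) - 0.5 * ss / var_eps

def marginal_loglik(resid, var_b, var_eps):
    """log p(y_i) with b_i integrated out, using the closed-form determinant
    and inverse of V = var_eps * I + var_b * J."""
    n = len(resid)
    s1 = sum(resid)
    s2 = sum(r * r for r in resid)
    total = var_eps + n * var_b
    log_det = (n - 1) * math.log(var_eps) + math.log(total)
    quad = (s2 - var_b * s1 ** 2 / total) / var_eps
    return -0.5 * (n * math.log(2 * math.pi) + log_det + quad)

def marginal_loglik_numeric(resid, var_b, var_eps, half_width=15.0, step=0.01):
    """Trapezoid-rule check: integrate p(y_i | b) p(b) over b ~ N(0, var_b)."""
    n_steps = int(round(2 * half_width / step))
    total = 0.0
    for k in range(n_steps + 1):
        b = -half_width + k * step
        log_prior = -0.5 * math.log(2 * math.pi * var_b) - 0.5 * b * b / var_b
        weight = 0.5 if k in (0, n_steps) else 1.0
        total += weight * math.exp(conditional_loglik(resid, b, var_eps) + log_prior)
    return math.log(total * step)

resid = [1.2, -0.4, 0.9, 2.1]   # made-up residuals for one subject
closed = marginal_loglik(resid, var_b=2.0495, var_eps=3.2668)
numeric = marginal_loglik_numeric(resid, var_b=2.0495, var_eps=3.2668)
print(closed, numeric)
```

The marginal criteria evaluate the closed-form function at each posterior draw of the fixed effects and variance components, whereas the conditional criteria evaluate the conditional function at the sampled random effects as well.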

We examined two selection rules for all criteria: minimum value and absolute difference. However, our results provide no justification for using the absolute difference rule with criteria other than DIC.

In our simulation study, the performance of the marginalized versions of DIC, WAIC and PSBF is similar. However, in contrast to DIC, WAIC and PSBF have the advantage of being invariant to non-linear transformations of the parameters in focus. For this reason, our advice is to base model selection on the marginal versions of WAIC or PSBF. Nevertheless, our R function computes both the marginal and conditional versions of all three selection criteria with no additional computational effort. The function can be downloaded from https://ibiostat.be/online-resources/bayesian.

Another useful exercise is to evaluate the performance of the selection criteria when varying the vague prior for the covariance matrix of the random effects. This is under current examination.

Acknowledgements

The computational resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation – Flanders (FWO) and the Flemish Government department EWI. We would like to thank the anonymous reviewers and associate editor, whose suggestions led to substantial improvement of the paper. The authors thank Dr. Mathew Adeleke of the Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, South Africa, for the NIC dataset.

Funding Statement

The research of the first author was funded by Tertiary Education Trust Fund (TETFund) – AS&D grant of the Federal University of Agriculture, Abeokuta, Nigeria.

Disclosure statement

No potential conflict of interest was reported by the authors.

ORCID

Oludare Ariyo http://orcid.org/0000-0003-3375-1831

Adrian Quintero http://orcid.org/0000-0001-7268-2221

Geert Verbeke http://orcid.org/0000-0001-8430-7576

References

1. Adeleke M., Peters S., Ozoje M., Ikeobi C., Bamgbose A. and Adebambo O.A., Genetic parameter estimates for body weight and linear body measurements in pure and crossbred progenies of Nigerian indigenous chickens, Livestock Res. Rural Dev. 23 (2011), pp. 1–7.
2. Anderson S.J., Longitudinal study designs, Handbook Res. Methods Health Soc. Sci. (2018), pp. 1–20.
3. Arellano-Valle R., Bolfarine H. and Lachos V., Skew-normal linear mixed models, J. Data. Sci. 3 (2005), pp. 415–438.
4. Arellano-Valle R., Bolfarine H. and Lachos V., Bayesian inference for skew-normal linear mixed models, J. Appl. Stat. 34 (2007), pp. 663–682. doi: 10.1080/02664760701236905
5. Cai B. and Dunson D.B., Bayesian covariance selection in generalized linear mixed models, Biometrics 62 (2006), pp. 446–457. doi: 10.1111/j.1541-0420.2005.00499.x
6. Chan J. and Grant A., On the observed-data deviance information criterion for volatility modeling, J. Financ. Econom. 14 (2016), pp. 772–802. doi: 10.1093/jjfinec/nbw002
7. Chen Z. and Dunson D.B., Random effects selection in linear mixed models, Biometrics 59 (2003), pp. 762–769. doi: 10.1111/j.0006-341X.2003.00089.x
8. Coakley E.S. and Rokhlin V., A fast divide-and-conquer algorithm for computing the spectra of real symmetric tridiagonal matrices, Appl. Comput. Harmon. Anal. 34 (2013), pp. 379–414. doi: 10.1016/j.acha.2012.06.003
9. Congdon P., Bayesian Models for Categorical Data, John Wiley & Sons, West Sussex, 2005.
10. Dey D.K., Chen M.-H. and Chang H., Bayesian approach for nonlinear random effects models, Biometrics 53 (1997), pp. 1239–1252. doi: 10.2307/2533493
11. Fan T.-H., Wang Y.-F. and Zhang Y.-C., Bayesian model selection in linear mixed effects models with autoregressive (p) errors using mixture priors, J. Appl. Stat. 41 (2014), pp. 1814–1829. doi: 10.1080/02664763.2014.894001
12. Funatogawa I., Longitudinal Data Analysis: Autoregressive Linear Mixed Effects Models, Springer, Singapore, 2017.
13. Gayle V. and Lambert P., What is Quantitative Longitudinal Data Analysis?, Bloomsbury Publishing, London, 2018.
14. Geisser S. and Eddy W.F., A predictive approach to model selection, J. Am. Stat. Assoc. 74 (1979), pp. 153–160. doi: 10.1080/01621459.1979.10481632
15. Gelfand A. and Dey D., Bayesian model choice: Asymptotics and exact calculations, J. R. Stat. Soc. Ser. B 56 (1994), pp. 501–514.
16. Gelman A., Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper), Bayesian Anal. 1 (2006), pp. 515–534. doi: 10.1214/06-BA117A
17. Gelman A., Carlin J., Stern H. and Rubin D., Bayesian Data Analysis, Chapman and Hall, Florida, USA, 2004.
18. Gelman A., Hwang J. and Vehtari A., Understanding predictive information criteria for Bayesian models, Stat. Comput. 24 (2014), pp. 997–1016. doi: 10.1007/s11222-013-9416-2
19. George E.I. and McCulloch R.E., Variable selection via Gibbs sampling, J. Am. Stat. Assoc. 88 (1993), pp. 881–889. doi: 10.1080/01621459.1993.10476353
20. Gong L., Flegal J.M., Spindler S.R. and Mote P.L., Bayesian model selection on linear mixed-effects models for comparisons between multiple treatments and a control, arXiv preprint arXiv:1509.07510 (2015).
21. Hoffman L., Longitudinal Analysis: Modeling within-Person Fluctuation and Change, Routledge, New York, 2015.
22. Huang Y. and Dagne G., Bayesian semiparametric nonlinear mixed-effects joint models for data with skewness, missing responses, and measurement errors in covariates, Biometrics 68 (2012), pp. 943–953. doi: 10.1111/j.1541-0420.2011.01719.x
23. Jones R.H., Bayesian information criterion for longitudinal and clustered data, Stat. Med. 30 (2011), pp. 3050–3056. doi: 10.1002/sim.4323
24. Kass R.E. and Raftery A.E., Bayes factors, J. Am. Stat. Assoc. 90 (1995), pp. 773–795. doi: 10.1080/01621459.1995.10476572
25. Lachos V.H., Castro L.M. and Dey D.K., Bayesian inference in nonlinear mixed-effects models using normal independent distributions, Comput. Stat. Data. Anal. 64 (2013), pp. 237–252. doi: 10.1016/j.csda.2013.02.011
26. Lachos V.H., Ghosh P. and Arellano-Valle R.B., Likelihood based inference for skew-normal independent linear mixed models, Stat. Sin. 20 (2010), pp. 303–322.
27. Laird N.M. and Ware J.H., Random-effects models for longitudinal data, Biometrics 38 (1982), pp. 963–974. doi: 10.2307/2529876
28. Lesaffre E., Asefa M. and Verbeke G., Assessing the goodness-of-fit of the Laird and Ware model - an example: The Jimma Infant Survival Differential Longitudinal Study, Stat. Med. 18 (1999), pp. 835–854.
29. Lesaffre E. and Lawson A., Bayesian Biostatistics (Statistics in Practice), Wiley, Chichester, 2012.
30. Lesaffre E., Todem D. and Verbeke G., Flexible modelling of the covariance matrix in a linear random effects model, Biom. J. 42 (2000), pp. 807–822.
31. Li B., Bruyneel L. and Lesaffre E., A multivariate multilevel Gaussian model with a mixed effects structure in the mean and covariance part, Stat. Med. 33 (2013), pp. 1877–1899. doi: 10.1002/sim.6062
32. Li L., Qiu S., Zhang B. and Feng C.X., Approximating cross-validatory predictive evaluation in Bayesian latent variable models with integrated IS and WAIC, Stat. Comput. 26 (2016), pp. 881–897. doi: 10.1007/s11222-015-9577-2
33. Li Y., Zeng T. and Yu J., Robust deviance information criterion for latent variable models. CAFE Research Paper No. 13.19 Available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2316341 (2013)
34. Littell R.C., Milliken G.A., Stroup W.W., Wolfinger R.D. and Schabenberger O., SAS for Mixed Models, SAS Institute, North Carolina, USA, 2007.
35. McArdle J.J. and Nesselroade J.R., Longitudinal Data Analysis Using Structural Equation Models, American Psychological Association, Washington, DC, 2014.
36. Merkle E., Furr D. and Rabe-Hesketh S., Bayesian model assessment: Use of conditional vs marginal likelihoods. arXiv preprint arXiv:1802.04452 (2018)
37. Millar R., Comparison of hierarchical Bayesian models for overdispersed count data using DIC and Bayes' factors, Biometrics 65 (2009), pp. 962–969. doi: 10.1111/j.1541-0420.2008.01162.x
38. Millar R.B., Conditional vs marginal estimation of the predictive loss of hierarchical models using WAIC and cross-validation, Stat. Comput. 28 (2018), pp. 375–385. doi: 10.1007/s11222-017-9736-8
39. Müller S., Scealy J.L. and Welsh A.H., Model selection in linear mixed models, Stat. Sci. 28 (2013), pp. 135–167. doi: 10.1214/12-STS410
40. Plummer M., Cannot invert matrix, November 2011 [Online; posted 11-November-2011].
41. Potthoff R. and Roy S., A generalized multivariate analysis of variance model useful especially for growth curve problems, Biometrika 51 (1964), pp. 313–326. doi: 10.1093/biomet/51.3-4.313
42. Quintero A. and Lesaffre E., Multilevel covariance regression with correlated random effects in the mean and variance structure, Biom. J. 59 (2017), pp. 1047–1066. doi: 10.1002/bimj.201600193
43. Quintero A. and Lesaffre E., Comparing hierarchical models via the marginalized deviance information criterion, Stat. Med. 37 (2018), pp. 2440–2454. doi: 10.1002/sim.7649
44. Raftery A.E., Newton M.A., Satagopan J.M. and Krivitsky P.N., Estimating the integrated likelihood via posterior simulation using the harmonic mean identity, Bayesian Stat. 8 (2007), pp. 1–45.
45. Raudenbush S.W. and Bryk A.S., Hierarchical Linear Models: Applications and Data Analysis Methods, vol. 1. Sage, California, 2002.
46. Säfken B., Rügamer D., Kneib T. and Greven S., Conditional model selection in mixed-effects models with cAIC4. arXiv preprint arXiv:1803.05664 (2018)
47. Sahu S.K., Dey D.K. and Branco M.D., A new class of multivariate skew distributions with applications to Bayesian regression models, Can. J. Stat. 31 (2003), pp. 129–150. doi: 10.2307/3316064
48. Spiegelhalter D., Best N., Carlin B. and van der Linde A., Bayesian measures of model complexity and fit, J. R. Stat. Soc. Ser. B 64 (2002), pp. 583–639. doi: 10.1111/1467-9868.00353
49. Spiegelhalter D., Best N., Carlin B. and van der Linde A., The deviance information criterion: 12 years on, J. R. Stat. Soc. Ser. B 76 (2014), pp. 485–493. doi: 10.1111/rssb.12062
50. Spiegelhalter D., Thomas A., Best N. and Lunn D., WinBUGS User Manual, 1.4 ed., 2003
51. Srivastava M.S. and Kubokawa T., Conditional information criteria for selecting variables in linear mixed models, J. Multivar. Anal. 101 (2010), pp. 1970–1980. doi: 10.1016/j.jmva.2010.05.007
52. Vaida F. and Blanchard S., Conditional Akaike information for mixed-effects models, Biometrika 92 (2005), pp. 351–370. doi: 10.1093/biomet/92.2.351
53. Verbeke G. and Lesaffre E., A linear mixed-effects model with heterogeneity in the random-effects population, J. Am. Stat. Assoc. 91 (1996), pp. 217–221. doi: 10.1080/01621459.1996.10476679
54. Verbeke G. and Molenberghs G., Linear Mixed Models for Longitudinal Data, Springer Series in Statistics, New York, 2000.
55. Watanabe S., Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory, J. Mach. Learn. Res. 11 (2010), pp. 3571–3594.


