Author manuscript; available in PMC: 2008 Jul 15.
Published in final edited form as: Comput Stat Data Anal. 2007 Jul 15;51(11):5220–5235. doi: 10.1016/j.csda.2006.09.031

Flexible Random Intercept Models for Binary Outcomes Using Mixtures of Normals

Brian Caffo a,*, Ming-Wen An a, Charles Rohde a
PMCID: PMC2031853  NIHMSID: NIHMS24237  PMID: 18628822

Abstract

Random intercept models for binary data are useful tools for addressing between-subject heterogeneity. Unlike linear models, the non-linearity of link functions used for binary data forces a distinction between marginal and conditional interpretations. This distinction is blurred in probit models with a normally distributed random intercept because the resulting model implies a probit marginal link as well. That is, this model is closed in the sense that the distribution associated with the marginal and conditional link functions and the random effect distribution are all of the same family. It is shown that the closure property is also attained when the distributions associated with the conditional and marginal link functions and the random effect distribution are mixtures of normals. The resulting flexible family of models is demonstrated to be related to several others present in the literature and can be used to synthesize several seemingly disparate modeling approaches. In addition, this family of models offers considerable computational benefits. A diverse series of examples is explored that illustrates the wide applicability of this approach.

Keywords: Probit-normal, logit-normal, marginalized multilevel models

1. Introduction

Random intercept models for binary data are useful tools for addressing between-subject heterogeneity. Typically, random intercept models are implemented by adding a normally distributed random effect into the linear predictor of a generalized linear model (or GLM, see Nelder and Wedderburn, 1972; McCullagh and Nelder, 1989), giving rise to a generalized linear mixed model (or GLMM, see Breslow and Clayton, 1993). Because of the non-linearity of the link functions for binary GLMMs, such models force a distinction between parameter interpretations conditional on random effects and marginal interpretations averaged over random effects.

Random intercept models for binary outcomes with a probit link function and normally distributed random intercept (probit-normal models) have the interesting property that the marginal link function is the inverse of a normal cumulative distribution function (CDF). In this case, we say the model is “closed” in the sense that the distributions associated with the marginal and conditional link functions and the random effect distribution are all of the same family.

In this manuscript we explore a general family of closed random intercept models. In particular, we consider instances when the distribution associated with the conditional link function and the random effect distribution are mixtures of normals. Simple properties of mixture of normals then imply that the distribution function associated with the marginal link function is also a mixture of normals. We emphasize both the conceptual and practical benefits of this class of models. Notably, we explore models that yield conditional and marginal interpretations of parameters.

To summarize the results, the proposed model offers the conceptual benefit of containing a wide class of common models for binary data as either special or limiting cases. Furthermore, we highlight some interesting practical advantages of these models. In particular, the added flexibility of placing a mixture distribution on the random effects protects against misspecification of this distribution. Placing a mixture distribution on the conditional link distribution both allows for easy post-hoc approximations of marginal link functions and easier fitting for marginalized multilevel models. Finally we demonstrate that the latent variable representations of the mixture distributions for the conditional link and random effect distributions can result in simple and elegant Gibbs samplers for Bayesian analysis.

The manuscript is laid out as follows. In Section 2 we present the notation and the model. In Section 3 we connect the mixture of normals model with several variants of random effect models in the literature. In Section 4 we illustrate with a diverse collection of useful applications of the mixture of normals approximation. Finally, in Section 5, we provide a summary and discussion of future work.

2. Random intercept model for binary outcomes

2.1. Notation

Consider the data given in Table 1, which arose from a teratology experiment (Weil, 1970), and was subsequently analyzed in Liang and Hanfelt (1994) and Heagerty and Zeger (1996). The objective is to compare the survival of rat pups in 16 control litters with that of the pups in the 16 treated litters. The treatment was a chemical agent administered to the mothers of each treated litter. We use this data set and experiment to motivate the model.

Table 1.

Teratology data. Numbers are (number survived, number dead) in each litter by treatment arm. For example, in the first control litter, all thirteen pups survived. Source Weil (1970).

Arm        (number survived, number dead)
Control    (13, 0) (12, 0) (9, 0) (9, 0) (8, 0) (8, 0) (12, 1) (11, 1)
           (9, 1) (9, 1) (8, 1) (11, 2) (4, 1) (5, 2) (7, 3) (7, 3)
Treatment  (12, 0) (11, 0) (10, 0) (9, 0) (10, 1) (9, 1) (9, 1) (8, 1)
           (8, 1) (4, 1) (7, 2) (4, 3) (5, 5) (3, 3) (3, 7) (0, 7)

Assume that {Yij} are repeated binary responses for subject/cluster i = 1, …, I and response j = 1, …, Ji. Therefore, in the Teratology data set, Yij represents survival or not (1 versus 0 respectively) for pup j from litter i. Let xij be a vector of covariates associated with Yij. For the Teratology data xij = (1, xij1)t, containing an intercept term and a treatment indicator, respectively.

Let $F_w^{-1}$ be a link function (see McCullagh and Nelder, 1989) that relates the probability of a success to a function of the covariates. As is typical for binary data, we assume that $F_w$ (the inverse link function) is a distribution function, referred to as the “link distribution”. We assume that

$$\Pr(Y_{ij} = 1 \mid U_i = u_i) = F_w(\Delta_{ij} - u_i), \tag{1}$$

where the {Ui} are cluster-specific random effects, used to model correlation and heterogeneity arising from unmeasured covariates specific to a cluster. The {Ui} are assumed to be independent and identically distributed random variables, having distribution function Fu. Throughout we assume that the {Yij} are conditionally independent given the {Ui}.

Users familiar with GLMMs will note two departures from common notation. First, the “transfer function”, Δij, is typically omitted and replaced with a linear combination of the covariates and slope parameters, such as

$$\Delta_{ij} = x_{ij}^t \beta^c. \tag{2}$$

This departure is adopted to consider a broader class of marginal and conditional models, which we describe in detail. Secondly, the random effect is subtracted in (1) rather than added, a convention that will be discussed below.

2.2. Conditional models

A conditional model specifies Δij as in (2). The superscript c on the slope effects is used to denote that the effects are conditional, having an interpretation on the conditional link function’s scale.

Defining the Δij as such implies a marginal model. Specifically

$$\Pr(Y_{ij} = 1) = F_q(\Delta_{ij}), \tag{3}$$

where Fq is the distribution of the sum of independent random variables having distribution functions Fu and Fw. To prove this fact, let {Wij} be iid draws from Fw, then note that

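$$\begin{aligned}
\Pr(Y_{ij} = 1) &= E_{U_i}[\Pr(Y_{ij} = 1 \mid U_i = u_i)] = \int F_w(\Delta_{ij} - u_i)\, dF_u(u_i) \\
&= \int \Pr(W_{ij} \le \Delta_{ij} - u_i \mid U_i = u_i)\, dF_u(u_i) \\
&= E_{U_i}[\Pr(W_{ij} + u_i \le \Delta_{ij} \mid U_i = u_i)] \\
&= \Pr(W_{ij} + U_i \le \Delta_{ij}) = F_q(\Delta_{ij}).
\end{aligned}$$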

From this proof, we hope that the reason for the somewhat unusual convention of subtracting the random intercept is now clear.

We summarize the basic properties of the conditional model as

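$$\begin{array}{ll}
\text{Conditional model} & \Pr(Y_{ij} = 1 \mid U_i = u_i) = F_w(\Delta_{ij} - u_i) \\
\text{Transfer function} & \Delta_{ij} = x_{ij}^t \beta^c \\
\text{Random effect distribution} & \Pr(U_i \le u_i) = F_u(u_i) \\
\text{Implied marginal model} & \Pr(Y_{ij} = 1) = F_q(x_{ij}^t \beta^c).
\end{array}$$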

As an example, consider again the Teratology data set. Assume that $F_w$ is the standard normal distribution, $F_u$ is a normal distribution with mean 0 and variance $\sigma_{u,1}^2$, and $\Delta_{ij}$ is defined as in Equation 2. This model then corresponds to a probit-normal GLMM. By the standard properties of the normal distribution, the sum of a standard normal variable ($F_w$) and an independent normal variable with mean 0 and variance $\sigma_{u,1}^2$ ($F_u$) is normal with mean 0 and variance $1 + \sigma_{u,1}^2$, which is therefore $F_q$. Thus, using Equation 3, we have the well known result (see Zeger et al., 1988, for example) that the induced marginal model is

$$\Pr(Y_{ij} = 1) = F_q(\Delta_{ij}) = F_q(\beta_0^c + x_{ij1}\beta_1^c) = \Phi\left\{\frac{\beta_0^c + x_{ij1}\beta_1^c}{(1 + \sigma_{u,1}^2)^{1/2}}\right\},$$

where $\Phi$ denotes the standard normal distribution function. Hence, the marginal link is also a probit, with the marginal effects being scaled versions of the conditional effects, $\beta^c/(1 + \sigma_{u,1}^2)^{1/2}$.

To illustrate, the fitted values for the Teratology data set are $\hat\beta_0^c = 1.474$, $\hat\beta_1^c = -.585$ and $\hat\sigma_{u,1} = .749$. Hence −.585 estimates the conditional probit-scale change in the probability of survival comparing the treated to the untreated pups. Correspondingly, $\hat\beta_1^c/(1 + \hat\sigma_{u,1}^2)^{1/2} = -.468$ estimates the marginal probit-scale change in the probability of survival.
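The rescaling from the conditional to the marginal probit slope can be checked directly. The following is a minimal sketch; the function name `marginal_probit_slope` is hypothetical, and the inputs are the fitted Teratology values quoted in the text.

```python
import math


def marginal_probit_slope(beta_c: float, sigma_u: float) -> float:
    """Scale a conditional probit slope to the marginal probit scale.

    Implements beta^c / (1 + sigma_u^2)^(1/2) for a probit-normal model.
    """
    return beta_c / math.sqrt(1.0 + sigma_u ** 2)


# Fitted Teratology values quoted in the text (conditional slope, random-effect SD).
beta1_m = marginal_probit_slope(-0.585, 0.749)
print(round(beta1_m, 3))  # -0.468
```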

2.3. Marginal Models

Consider again the Teratology probit-normal example from the previous section, i.e. $F_w$ is a standard normal and $F_u$ is a normal with mean 0 and variance $\sigma_{u,1}^2$. Had we defined

$$\Delta_{ij} = (\beta_0^m + x_{ij1}\beta_1^m)(1 + \sigma_{u,1}^2)^{1/2},$$

then the marginal probability of success would satisfy

$$\Pr(Y_{ij} = 1) = F_q(\Delta_{ij}) = F_q\{(\beta_0^m + x_{ij1}\beta_1^m)(1 + \sigma_{u,1}^2)^{1/2}\} = \Phi(\beta_0^m + x_{ij1}\beta_1^m).$$

Therefore, the estimated slope parameters would have a marginal probit interpretation without rescaling; hence the superscript m. By the invariance of the MLE, the marginal slope estimate is identical to that calculated from the conditional model, $\hat\beta_1^m = -.468$.

The probit-normal example has emphasized that appropriately defining Δij results in parameters with marginal interpretations. In fact, Heagerty and Zeger (2000) showed that this technique can be applied more generally. Specifically, consider defining

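$$\Delta_{ij} = F_q^{-1}\{F_w(x_{ij}^t \beta^m)\}. \tag{4}$$

Under this definition for $\Delta_{ij}$ and using (3), the marginal probability of success satisfies

$$\Pr(Y_{ij} = 1) = F_q(\Delta_{ij}) = F_q[F_q^{-1}\{F_w(x_{ij}^t \beta^m)\}] = F_w(x_{ij}^t \beta^m).$$

That is, under an appropriate modification of $\Delta_{ij}$, the slope parameters can be given a marginal interpretation with $F_w$ as the link distribution. We summarize the marginal model with

$$\begin{array}{ll}
\text{Conditional model} & \Pr(Y_{ij} = 1 \mid U_i = u_i) = F_w(\Delta_{ij} - u_i) \\
\text{Transfer function} & \Delta_{ij} = F_q^{-1}\{F_w(x_{ij}^t \beta^m)\} \\
\text{Random effect distribution} & \Pr(U_i \le u_i) = F_u(u_i) \\
\text{Implied marginal model} & \Pr(Y_{ij} = 1) = F_w(x_{ij}^t \beta^m).
\end{array}$$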

Consider again the Teratology data set. Allowing $F_w$ to be the logistic distribution yields $\hat\beta_1^m = -.86$, which estimates the marginal log-odds ratio of survival.

Marginalized multilevel models defined as such offer several advantages over competing methods. Unlike generalized estimating equations (GEE, see Liang and Zeger, 1986), they enjoy the benefits of a completely specified model, which includes the ability to plot profile likelihoods, the availability of likelihood ratio tests and Bayesian analysis and the relaxation on assumptions for missing data. Also, these models are more parsimonious and extensible than other marginal likelihood based models (see Lang and Agresti, 1994).

2.4. Mixtures of normals

The distinction between the conditional and marginal approaches is especially interesting for the probit-normal model because that model is closed: the conditional, random effect and marginal link distributions all belong to the same family. In this manuscript we present another closed random intercept model for binary data that is considerably more flexible than the probit-normal model. In particular, when $F_w$ and $F_u$ are mixtures of normal distributions, then so is $F_q$.

To prove this, consider a model of the form

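$$F_w(w) = \sum_{l=1}^{L_w} \pi_{w,l}\, \Phi\left(\frac{w - \mu_{w,l}}{\sigma_{w,l}}\right) \quad \text{and} \quad F_u(u) = \sum_{l=1}^{L_u} \pi_{u,l}\, \Phi\left(\frac{u - \mu_{u,l}}{\sigma_{u,l}}\right),$$

where the $\{\pi_{w,l}\}$ and $\{\pi_{u,l}\}$ are each assumed to be greater than 0 and to sum to one. Using simple properties of mixtures of normals and Equation 3, we have that

$$F_q(q) = \sum_{l=1}^{L_w} \sum_{l'=1}^{L_u} \pi_{w,l}\, \pi_{u,l'}\, \Phi\left\{\frac{q - \mu_{w,l} - \mu_{u,l'}}{(\sigma_{w,l}^2 + \sigma_{u,l'}^2)^{1/2}}\right\}. \tag{5}$$

That is, under this model, the random effect, conditional and marginal link distributions are all mixtures of normals. We summarize the model as

$$\begin{array}{ll}
\text{Conditional model} & \Pr(Y_{ij} = 1 \mid U_i = u_i) = \sum_{l=1}^{L_w} \pi_{w,l}\, \Phi\left(\dfrac{\Delta_{ij} - u_i - \mu_{w,l}}{\sigma_{w,l}}\right) \\
\text{Transfer function} & \Delta_{ij} \text{ defined by either (2) or (4)} \\
\text{Random effect distribution} & \Pr(U_i \le u_i) = \sum_{l=1}^{L_u} \pi_{u,l}\, \Phi\left(\dfrac{u_i - \mu_{u,l}}{\sigma_{u,l}}\right) \\
\text{Implied marginal model} & \Pr(Y_{ij} = 1) = \sum_{l=1}^{L_w} \sum_{l'=1}^{L_u} \pi_{w,l}\, \pi_{u,l'}\, \Phi\left\{\dfrac{\Delta_{ij} - \mu_{w,l} - \mu_{u,l'}}{(\sigma_{w,l}^2 + \sigma_{u,l'}^2)^{1/2}}\right\}.
\end{array} \tag{6}$$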
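The closure property in Equation 5 amounts to pairing every component of the link mixture with every component of the random effect mixture, multiplying weights, adding means and adding variances. The sketch below illustrates this under the assumption that a mixture is stored as a list of (weight, mean, standard deviation) triples; the names `convolve_mixtures` and `mixture_cdf` are hypothetical and not part of the original work.

```python
import math
from statistics import NormalDist


def convolve_mixtures(mix_w, mix_u):
    """Return the components of F_q as (weight, mean, sd) triples.

    mix_w and mix_u are lists of (weight, mean, sd) triples describing
    F_w and F_u; the result has len(mix_w) * len(mix_u) components.
    """
    return [
        (pw * pu, mw + mu, math.hypot(sw, su))  # hypot gives sqrt(sw^2 + su^2)
        for pw, mw, sw in mix_w
        for pu, mu, su in mix_u
    ]


def mixture_cdf(mix, x):
    """Evaluate a normal-mixture distribution function at x."""
    return sum(p * NormalDist(m, s).cdf(x) for p, m, s in mix)


# Probit-normal special case: one standard normal link component and one
# N(0, 0.749^2) random effect component give a single N(0, 1 + 0.749^2) term.
f_q = convolve_mixtures([(1.0, 0.0, 1.0)], [(1.0, 0.0, 0.749)])
print(f_q)
```

With one component in each mixture this reduces to the probit-normal result of Section 2.2; richer mixtures only lengthen the list.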

To summarize, the model of interest in this manuscript combines the conditional and marginal approaches, while adding the constraint that the conditional link and random effect distributions are both mixtures of normals. For completeness, we add that the log-likelihood for (6) is

$$\sum_{i=1}^{I} \log \int \prod_{j=1}^{J_i} F_w(\Delta_{ij} - u_i)^{y_{ij}} \{1 - F_w(\Delta_{ij} - u_i)\}^{1 - y_{ij}}\, dF_u(u_i) \tag{7}$$

(an equation that holds regardless of whether Fw and Fu are mixtures of normals).
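To make the log-likelihood (7) concrete, the sketch below evaluates it for a single zero-mean normal random intercept. It is an illustration only: the integral is approximated with a plain midpoint rule on a grid (a simple stand-in, not the Gauss/Hermite quadrature used for the fits in this manuscript), and the function name `cluster_loglik` and its arguments are hypothetical.

```python
import math
from statistics import NormalDist


def cluster_loglik(clusters, link_cdf, sigma_u, n_grid=400, width=8.0):
    """Approximate the log-likelihood (7) with U_i ~ N(0, sigma_u^2).

    clusters: list of clusters, each a list of (delta_ij, y_ij) pairs.
    link_cdf: the link distribution function F_w.
    The integral over u_i uses a midpoint rule on [-width*sigma_u, width*sigma_u].
    """
    h = 2.0 * width * sigma_u / n_grid          # grid spacing
    density = NormalDist(0.0, sigma_u)          # random effect density
    total = 0.0
    for cluster in clusters:
        integral = 0.0
        for k in range(n_grid):
            u = -width * sigma_u + (k + 0.5) * h
            lik = 1.0
            for delta, y in cluster:
                p = link_cdf(delta - u)         # Pr(Y_ij = 1 | U_i = u)
                lik *= p if y == 1 else 1.0 - p
            integral += lik * density.pdf(u) * h
        total += math.log(integral)
    return total


# One cluster with one success under a probit link: the integral is F_q(delta).
ll_one = cluster_loglik([[(0.7, 1)]], NormalDist().cdf, 0.749)
print(ll_one)
```

For a cluster with a single observation the integral reduces to the marginal probability in Equation 3, which gives a convenient check of the approximation.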

Of course, Model 6 is excessively rich with all of the mixture probabilities, means and variances left unspecified; estimating both the conditional link distribution and the random effect distribution is a hopeless cause for most binary data sets. However, by specifying components of one or both of the free mixture distributions, one can achieve a variety of important models. We emphasize the following uses. First, specifying the mixture components for the conditional link distribution yields the ability to (approximately) fit models with other link functions, such as the logit, while retaining many of the computational benefits of probit links. Secondly, estimating a small number of mixture components on the random effects can be used to protect against canonical forms of misspecification. Finally, we find that using mixtures of normals for the conditional link and the random effect distribution leads to particularly convenient Gibbs samplers for conditional models. While many of these ideas have been explored, to our knowledge, they have not been placed in a unified modeling framework combining both conditional and marginal approaches. In the following literature review we highlight connections with existing research.

3. Literature review

In this section we demonstrate that Model 6 contains several important random intercept models for binary data as special or limiting cases. Clearly if $F_u$ is degenerate at 0 and $\Delta_{ij} = x_{ij}^t \beta^c$, then the model yields a GLM for binary data. Extending this setting so that $F_u$ is not degenerate, $L_u = 1$ and $\mu_{u,1} = 0$ yields a GLMM for binary data with a normally distributed random intercept (see Breslow and Clayton, 1993; Agresti et al., 2000).

To be technical, only those GLM and GLMMs for binary data whose conditional link distribution, Fw, is a mixture of normals are special cases of the model we have suggested. However, all of the common link functions (logit, complementary log-log) can be obtained as limiting cases. In Appendix C we provide an algorithm to solve for πw,l, σw,l and μw,l that yields very accurate approximations for a finite number of mixture components.

As an example, consider a mixture of normals as an approximation of the logistic distribution. Using the algorithm in Appendix C with 150 quadrature points and $\{\mu_{w,j}\} = \{0\}$ yields the values given in Table 2. Figure 1 shows how accurate the approximation is by plotting the exact logistic quantiles against those of the mixture of normals approximation. The mixture of normals approximation, with 5 mixture components, is nearly exact to logits of ± 10. By comparison, the plot also shows the standard normal and T quantiles, both of which are also used as approximations to the logit (see Caffo and Griswold, 2005). The linearity of the probit approximation breaks down at logits of around ± 3, while that of the T approximation breaks down at around ± 5. Furthermore, we note that the mixture of normals approximation applies generally, to links other than the logistic, and can be made more accurate by simply adding more mixture components.

Table 2.

Mixing probabilities, standard deviations and means of the mixture components for a mixture-of-normals approximation to the logistic distribution.

Component l   π_l           σ_l         μ_l
1             0.126840496   2.8420536   0
2             0.543170220   1.8257138   0
3             0.261711982   1.1943048   0
4             0.066181589   1.0757749   0
5             0.002066853   0.5631853   0
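The values in Table 2 can be checked numerically against the exact logistic distribution function. The following sketch does this on a grid from −10 to 10; the grid, the helper names and the choice of the maximum absolute error as the summary are illustrative assumptions rather than part of the algorithm in Appendix C.

```python
import math
from statistics import NormalDist

# Table 2: mixing probabilities and standard deviations (all means are 0).
PI_W = [0.126840496, 0.543170220, 0.261711982, 0.066181589, 0.002066853]
SD_W = [2.8420536, 1.8257138, 1.1943048, 1.0757749, 0.5631853]
_PHI = NormalDist().cdf  # standard normal distribution function


def mixture_logistic_cdf(x):
    """Five-component mixture-of-normals approximation to the logistic CDF."""
    return sum(p * _PHI(x / s) for p, s in zip(PI_W, SD_W))


def logistic_cdf(x):
    """Exact logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-x))


grid = [k / 20.0 for k in range(-200, 201)]  # -10, -9.95, ..., 10
max_abs_err = max(abs(mixture_logistic_cdf(x) - logistic_cdf(x)) for x in grid)
print(f"max |error| on [-10, 10]: {max_abs_err:.2e}")
```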

Fig. 1.

Quantile-quantile plot of the logistic distribution (vertical axis) by three approximations: the mixture of normals (solid), the probit (dotted), the T (dashed). A reference identity line is depicted in grey. The corresponding probability scale is given on the right and upper axes.

Approximating the logistic distribution with a single normal distribution or mixture of normals has a rich history (see Demidenko, 2004, and the references therein). Perhaps most relevant, Monahan and Stefanski (1992) used weighted Gaussian distributions to explore the logistic-normal integral.

Representing the link function by a latent variable was considered in the proof of Equation 3. In Section 4.3 we consider a much more ambitious latent variable representation of Model 6, using latent variables to represent the normal mixture distributions as well. The general latent variable approach to binary data was considered in Albert and Chib (1993), who also introduced a Gibbs sampler that motivates the one presented in Section 4.3. Relevant extensions to multivariate settings were considered in Chib and Greenberg (1998); however they focused on probit links and more general covariance structures than the random intercept models considered here.

Consider again the instance where Lu = 1, μu,1 = 0 (the random intercept is normally distributed). As described in Section 2.3, Heagerty and Zeger (2000) defined the Δij to be non-linear (see Equation 4), so that the slope parameters have linear interpretations on the marginal link’s scale. These marginalized multilevel models for binary data are a special case of the models presented (by appropriately defining Δij). Moreover, later we demonstrate that using mixtures of normals for the link distribution can greatly facilitate computing for these models.

A potentially negative aspect of this model is that, because $\Delta_{ij}$ is defined non-linearly, the conditional model is non-linear. The degree to which this is true depends on how close to linear $F_q^{-1} \circ F_w$ is. However, this may be of no concern whatsoever if only marginal interpretations are required (though see Lee and Nelder, 2004).

A clear generalization of the marginalized model would replace Fw in Equation 4 with any other desired link distribution, thus, allowing the conditional and marginal link functions to be different. This idea was explored in Griswold (2005) and extended to ordered multinomial data in Caffo and Griswold (2005). Again, this approach easily fits into the current framework by appropriately redefining Δij.

There has been a relatively small amount of research using mixtures of normals to estimate the link distribution. Geweke and Keane (1997) used a mixture of normals as a link function for dichotomous choice models. They presented an MCMC algorithm for fitting the model, including estimating the mixture components. In related work, Erkanli et al. (1993) used mixtures of normals to estimate the link function for ordinal data models and also presented an MCMC algorithm for estimating the mixture components. These approaches are conceptually related to the proposed model by forcing the random effect distribution to be degenerate at 0 and estimating {πw,l}, {μw,l} and { σw,l}.

In contrast, using mixtures to estimate the random effect distribution has received much more attention. Perhaps most relevant, Magder and Zeger (1996) used mixtures of normals as the random effect distribution and estimated the mixture parameters with an MCMC algorithm. This corresponds to estimating the {μu,l}, {σu,l} and {πu,l}. Aitkin (1999) and Follmann and Lambert (1989) used discrete mixtures to non-parametrically estimate the random effect distribution using maximum likelihood. Such models are obtained under the current framework as the {σu,j} tend to 0 and {μu,j} and {πu,j} are estimated.

4. Examples

In this section we explore a subset of practical considerations illustrated through four data sets. We explore two marginal and one conditional modeling settings, where computations are significantly simplified by using mixtures of normals. Moreover, we consider a case where mixture modeling of the random effect offers additional protection against model misspecification.

We consider four well studied data sets for illustration:

  1. The Teratology data, introduced in Section 2.1.

  2. The Approval Rating data set given in Table A.1. This 2×2 contingency table cross-classifies approval ratings of the British Prime Minister collected at two occasions. Here, $Y_{ij}$ represents approval (1) or not (0) for individual i on occasion j, where j = 1, 2 for the two sampling occasions. The covariate vector, $x_{ij} = (1, x_{ij1})^t$, contains an intercept term and an indicator function representing occasion, taking the value 1 when j = 2.

  3. The Crossover data, given in Table A.2, concerns a well-studied crossover study from Jones and Kenward (1987). Here, Yij represents an abnormal (1) or a normal (0) response for subject i during period j for j = 1, 2. The objective is to study the response in relation to the treatment and period. Thus, xij = (1, xij1, xij2)t contains an intercept term, a treatment indicator and a period indicator, taking the value 1 for the second period.

  4. The Item Response data, given in Table A.3, concerns subjects’ response to three scenarios (given in the table) on abortion stratified by gender. We let $Y_{ij}$ be the response of subject i on question j, where a response of 1 is supportive of legalized abortion (and 0 is not). The covariate vector, $x_{ij} = (1, x_{ij1}, x_{ij2}, x_{ij3})^t$, contains an intercept term, an indicator for male gender, an indicator for Scenario 1, and an indicator for Scenario 2, respectively. We use the Item Response data to illustrate an instance where a mixture random effect distribution is warranted.

To focus this discussion, we assume that the principal parameter of interest for each data set is: the (marginal or conditional) log odds-ratio comparing treated to controls in the Teratology data, the log odds-ratio comparing time 2 to time 1 for the Approval Rating data, the log odds-ratio comparing treated to controls in the Crossover data and the log odds-ratio comparing males to females in the Item response data. Therefore, in each case the regressor corresponding to the effect of interest is xij1.

4.1. Post-hoc calculation of marginal effects

Given results from a conditional random effect model, an obvious question is, “What are the corresponding marginal effects and link distribution?”. Such a question is especially relevant when interpreting published results, where only effect estimates (and not the original data) are available. Model 6 allows one to approximate the necessary calculations easily.

Consider the conditional logit model

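$$\text{logit}\{\Pr(Y_{ij} = 1 \mid U_i = u_i)\} = x_{ij}^t \beta^c - u_i \quad \text{and} \quad U_i \sim N(0, \sigma_{u,1}^2). \tag{8}$$

If we are willing to accept the approximation that $F_w$ is the five-component mixture of normals, then Model 8 is simply a special case of Model 6. Hence, we have that

$$\Pr(Y_{ij} = 1) = F_q(x_{ij}^t \hat\beta^c) = \sum_{l=1}^{L_w} \pi_{w,l}\, \Phi\left\{\frac{x_{ij}^t \hat\beta^c - \mu_{w,l}}{(\sigma_{w,l}^2 + \sigma_{u,1}^2)^{1/2}}\right\}, \tag{9}$$

where $\{\pi_{w,l}\}$, $\{\mu_{w,l}\}$ and $\{\sigma_{w,l}\}$ are from Table 2.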

Below we use this approximation to obtain marginal logit interpretations from conditional logit models. However, before doing so, we emphasize the benefits of Equation 9 over Monte Carlo and numerical integration, which can also give very accurate approximations of marginal effects. For example, unlike numerical integration or Monte Carlo approximations, the approximation (9) can be performed quickly and easily. In addition, obtaining delta method estimates of standard errors is also easy. Furthermore, the method applies to any conditional link function, provided the relevant mixture components are known. Finally, and perhaps most importantly, we note that this method leads to an accurate and simple approximation to the marginal link distribution, Fq, whereas quadrature or Monte Carlo approximations only yield Fq for specific values of the covariates.

Table 4 gives estimated marginal logit effects for the four data sets calculated using (9). To illustrate the calculations, consider the Teratology data set. The fitted values (SE) from Model 8 using the SAS procedure NLMIXED are $\hat\beta_0^c = 2.63\ (0.48)$, $\hat\beta_1^c = -1.08\ (0.63)$ and $\hat\sigma_{u,1} = 1.35\ (0.33)$. Plugging the estimated parameters into (9) yields a marginal probability of survival of 0.76 for the treated and 0.88 for the untreated. Then, the marginal log odds ratio of survival (SE) comparing the treated to the control litters is logit(0.76) − logit(0.88) = −0.86 (0.51) (see Appendix D for details about obtaining standard errors).
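The Teratology calculation can be reproduced from the published estimates alone. The sketch below plugs the conditional estimates reported in Table 3 into Equation 9 using the Table 2 mixture; the helper names `marginal_prob` and `logit` are hypothetical, and the standard error calculation of Appendix D is not attempted here.

```python
import math
from statistics import NormalDist

# Table 2: mixing probabilities and standard deviations (all means are 0).
PI_W = [0.126840496, 0.543170220, 0.261711982, 0.066181589, 0.002066853]
SD_W = [2.8420536, 1.8257138, 1.1943048, 1.0757749, 0.5631853]
_PHI = NormalDist().cdf  # standard normal distribution function


def marginal_prob(eta_c, sigma_u):
    """Equation 9 with mu_{w,l} = 0: marginal Pr(Y = 1) at linear predictor eta_c."""
    return sum(p * _PHI(eta_c / math.hypot(s, sigma_u)) for p, s in zip(PI_W, SD_W))


def logit(p):
    return math.log(p / (1.0 - p))


# Conditional logit-normal estimates for the Teratology data (Table 3).
beta0_c, beta1_c, sigma_u = 2.63, -1.08, 1.35
p_control = marginal_prob(beta0_c, sigma_u)
p_treated = marginal_prob(beta0_c + beta1_c, sigma_u)
log_odds_ratio = logit(p_treated) - logit(p_control)
print(round(p_treated, 2), round(p_control, 2), log_odds_ratio)
```

Because the same function can be evaluated at any value of the linear predictor, it also traces out the whole marginal link distribution rather than a single covariate pattern.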

Table 4.

Marginal logit estimates (standard errors) for the examples from Section 4.1.

Data Set        Stratum      Marginal estimate ($\hat\beta_1^m$)
Teratology                   −0.86 (0.51)
Approval                     −0.16 (0.04)
Crossover       Period 1     0.59 (0.31)
                Period 2     0.58 (0.29)
Item Response   Question 1   −0.002 (0.03)
                Question 2   −0.002 (0.03)
                Question 3   −0.002 (0.03)

Table 4 applies these techniques to the three other data sets as well, each time taking the conditional estimates output by SAS (Table 3). Because of the additional covariates in the Crossover and Item Response data sets, the estimated marginal logit effects are reported within strata.

Table 3.

Conditional Estimates (standard errors).

Data Set        $\hat\sigma_{u,1}$   $\hat\beta_0^c$   $\hat\beta_1^c$   $\hat\beta_2^c$   $\hat\beta_3^c$
Teratology      1.35 (0.33)          2.63 (0.48)       −1.08 (0.63)
Approval        5.16 (0.35)          1.24 (0.19)       −0.56 (0.14)
Crossover       4.94 (1.91)          2.22 (1.17)       1.86 (0.93)       −1.04 (0.82)
Item Response   8.75 (0.54)          −0.61 (0.34)      −0.013 (0.49)     0.83 (0.16)       0.29 (0.16)

4.2. Easier marginalized multilevel models

The previous section addressed the issue of obtaining marginal effects from conditional results, which is useful when interpreting published results without access to the underlying data. However, when the data are available and marginal interpretations are desired, direct fitting is preferable. This section illustrates how the mixture of normals modeling framework can ease the calculations required to directly obtain marginal estimates.

We consider the marginal Model 6 where $\Delta_{ij}$ is given by Equation 4. Furthermore, assume that $U_i \sim N(0, \sigma_{u,1}^2)$ and that $F_w$ is the five-component mixture of normals approximation to the logistic distribution function.

The benefit of using a mixture of normals for $F_w$ rather than the exact logistic distribution is that there is a closed form for $F_q$ (see Equation 5); also, its quantiles can easily be calculated using Newton’s method. Hence, representing the logistic distribution as such eliminates the difficult task of numerically approximating the convolution integral defining $F_q$ and its inverse. It should be emphasized that while defining $F_w$ as a mixture of normals eases the calculation of $F_q$ and hence $\Delta_{ij}$, calculation of the likelihood (7) still requires numerical integration, for which we employed Gauss/Hermite quadrature.
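As an illustration of how the transfer function in Equation 4 might be computed, the sketch below evaluates $F_q$ in closed form for the Table 2 mixture with a normal random intercept and inverts it with Newton's method started at 0. The function names, starting value, tolerance and iteration cap are all assumptions made for this example rather than details taken from the fitted models.

```python
import math
from statistics import NormalDist

# Table 2: mixing probabilities and standard deviations (all means are 0).
PI_W = [0.126840496, 0.543170220, 0.261711982, 0.066181589, 0.002066853]
SD_W = [2.8420536, 1.8257138, 1.1943048, 1.0757749, 0.5631853]
_STD = NormalDist()


def fw_cdf(x):
    """Mixture-of-normals link distribution F_w."""
    return sum(p * _STD.cdf(x / s) for p, s in zip(PI_W, SD_W))


def fq_cdf(q, sigma_u):
    """Closed-form marginal link distribution F_q for U ~ N(0, sigma_u^2)."""
    return sum(p * _STD.cdf(q / math.hypot(s, sigma_u)) for p, s in zip(PI_W, SD_W))


def fq_pdf(q, sigma_u):
    """Density of F_q, used as the derivative in Newton's method."""
    return sum(
        p * _STD.pdf(q / math.hypot(s, sigma_u)) / math.hypot(s, sigma_u)
        for p, s in zip(PI_W, SD_W)
    )


def fq_quantile(prob, sigma_u, tol=1e-12, max_iter=100):
    """Solve F_q(q) = prob by Newton's method, starting from q = 0."""
    q = 0.0
    for _ in range(max_iter):
        step = (fq_cdf(q, sigma_u) - prob) / fq_pdf(q, sigma_u)
        q -= step
        if abs(step) < tol:
            break
    return q


def transfer(eta_m, sigma_u):
    """Equation 4: Delta = F_q^{-1}{F_w(eta_m)} for a marginal linear predictor."""
    return fq_quantile(fw_cdf(eta_m), sigma_u)


# Marginal intercept and random-effect SD reported for the Teratology data.
delta = transfer(2.03, 1.35)
print(delta)
```

By construction, applying `fq_cdf` to the result returns the marginal probability `fw_cdf(eta_m)`, which is the defining property of the marginalized model.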

We implemented this model for the four data sets. We highlight the use of profile likelihoods - the functions obtained by maximizing the likelihood for each value of the parameter of interest. See Royall (1997) for more information regarding the benefits and interpretation of profile likelihoods.

The results of the model fits are given in Table 5. For example, for the Teratology data, −0.86 (the estimate for $\beta_1^m$) estimates the change in the marginal log-odds of survival comparing a treated pup to a control. For each of the four data sets, Figure 2 shows the profile likelihoods, with 1/8 and 1/16 reference lines (see Royall, 1997), for the parameter of interest ($\beta_1^m$) and the variance component ($\sigma_{u,1}$).

Table 5.

Estimates (standard errors) for marginalized multilevel models from Section 4.2.

Data Set        $\hat\sigma_{u,1}$   $\hat\beta_0^m$   $\hat\beta_1^m$   $\hat\beta_2^m$   $\hat\beta_3^m$
Teratology      1.35 (0.33)          2.03 (0.39)       −0.87 (0.51)
Approval        5.16 (0.35)          0.36 (0.05)       −0.16 (0.04)
Crossover       4.94 (1.91)          0.68 (0.28)       0.58 (0.23)       −0.32 (0.23)
Item Response   8.71 (0.54)          −0.048 (0.054)    0.004 (0.074)     0.150 (0.028)     0.053 (0.028)

Fig. 2.

Profile likelihood plots with 1/8 and 1/16 reference lines (see Royall, 1997) for $\beta_1^m$ and $\sigma_{u,1}$ for the marginalized multilevel model from Section 4.2. The rows from top to bottom correspond to the Teratology, Approval, Crossover and Item Response data sets respectively.

4.3. Bayesian analysis

In this section, we illustrate how specific instances of Model (6) are particularly well suited for Bayesian analysis via MCMC. We note that similar methods utilizing latent variables have been proposed to simulate from the posterior distributions of parameters for binary and multinomial responses (see Albert and Chib, 1993; McCulloch and Rossi, 1994; Chib et al., 1998; Imai and van Dyk, 2005). In addition, close variants of the sampling schemes can be used for the Monte Carlo EM algorithm (see Chib et al., 1998; Natarajan et al., 2000).

We apply these methods to binary responses with random effects, using the mixture of normals link approximation (similar to Geweke and Keane, 1997; Erkanli et al., 1993). Consider the latent variable representation of Model (6) given by

  1. the $\{D_{u,i}\}$ are iid discrete random variables with support $1, \ldots, L_u$ so that $\Pr(D_{u,i} = l) = \pi_{u,l}$,

  2. the $\{U_i\}$ given that the $\{D_{u,i} = d_{u,i}\}$ are independent $N(\mu_{u,d_{u,i}}, \sigma_{u,d_{u,i}}^2)$,

  3. the $\{D_{w,ij}\}$ are iid discrete random variables with support $1, \ldots, L_w$ so that $\Pr(D_{w,ij} = l) = \pi_{w,l}$,

  4. the $\{M_{ij}\}$ given that the $\{D_{w,ij} = d_{w,ij}\}$ and $\{U_i = u_i\}$ are independent normals with mean $\mu_{w,d_{w,ij}} + u_i - \Delta_{ij}$ and variance $\sigma_{w,d_{w,ij}}^2$,

  5. the $\{Y_{ij}\}$ are 1 iff $M_{ij} \le 0$ and 0 otherwise,

  6. each $\Delta_{ij} = x_{ij}^t \beta^c$.

To summarize the model, items 1 and 2 yield the mixture model for $F_u$, items 3–5 yield the conditional model for the $y_{ij}$ and item 6 forces a conditional interpretation for the $\beta^c$. To prove that items 3–5 induce the mixture of normals model for the $y_{ij}$, consider

$$\begin{aligned}
\Pr(Y_{ij} = 1 \mid U_i = u_i) &= \Pr(M_{ij} \le 0 \mid U_i = u_i) \\
&= \sum_{l=1}^{L_w} \Pr(M_{ij} \le 0 \mid U_i = u_i, D_{w,ij} = l) \Pr(D_{w,ij} = l) \\
&= \sum_{l=1}^{L_w} \Phi\left(\frac{\Delta_{ij} - \mu_{w,l} - u_i}{\sigma_{w,l}}\right) \pi_{w,l}.
\end{aligned}$$

We complete the Bayesian model by specifying that $\beta^c \sim \text{Normal}(\mu_{\beta^c}, \Sigma)$ and $\sigma_{u,l}^2 \sim \text{IG}(\nu, \tau)$. In the examples where the random effect mixture distribution had more than one component, the $\{\mu_{u,l}\}$ were independent normals with mean $\eta$ and variance $\theta$ and the $\{\pi_{u,l}\}$ were Dirichlet with shape parameters $\alpha$. We note a small complication is that the mean of the random effect distribution is aliased with an intercept parameter. Therefore, throughout this section we assume that the intercept term is excluded and instead the random effect mean is estimated. A second complication could potentially arise when the random effect mixture distribution has more than one component and the $\{\pi_{u,l}\}$, $\{\mu_{u,l}\}$ and the $\{\sigma_{u,l}\}$ are of primary interest, because of the non-identifiability of the parameters due to permutation invariance of the likelihood. In the settings we present this is not an issue, because we focus on the label invariant parameters, such as $\beta^c$. In our investigations of the label-dependent parameters, we addressed this issue by imposing an ordering on the $\{\mu_{u,l}\}$. See Jasra et al. (2005) and the references therein for further information on the “label-switching” problem.

The benefit of this model specification is that all of the full conditionals are common distributions and an elegant Gibbs sampler, which does not employ any Metropolis/Hastings steps, is available for exploring the posterior. The full conditionals are as follows:

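  1. the full conditional for $D_{u,i}$ is discrete so that the probability $D_{u,i}$ takes value $l$ is
    $$\frac{\sigma_{u,l}^{-1}\exp\{-(u_i-\mu_{u,l})^2/2\sigma_{u,l}^2\}\,\pi_{u,l}}{\sum_k \sigma_{u,k}^{-1}\exp\{-(u_i-\mu_{u,k})^2/2\sigma_{u,k}^2\}\,\pi_{u,k}};$$
  2. the full conditional for $U_i$ is normal with mean
    $$\left(\sum_j \sigma_{w,d_{w,ij}}^{-2}+\sigma_{u,d_{u,i}}^{-2}\right)^{-1}\left(\sum_j \frac{m_{ij}-\mu_{w,d_{w,ij}}+\Delta_{ij}}{\sigma_{w,d_{w,ij}}^{2}}+\frac{\mu_{u,d_{u,i}}}{\sigma_{u,d_{u,i}}^{2}}\right)$$
    and variance
    $$\left(\sum_j \sigma_{w,d_{w,ij}}^{-2}+\sigma_{u,d_{u,i}}^{-2}\right)^{-1};$$
  3. the full conditional for $D_{w,ij}$ is discrete so that the probability $D_{w,ij}$ takes value $l$ is
    $$\frac{\sigma_{w,l}^{-1}\exp\{-(m_{ij}-\mu_{w,l}-u_i+\Delta_{ij})^2/2\sigma_{w,l}^2\}\,\pi_{w,l}}{\sum_k \sigma_{w,k}^{-1}\exp\{-(m_{ij}-\mu_{w,k}-u_i+\Delta_{ij})^2/2\sigma_{w,k}^2\}\,\pi_{w,k}};$$
  4. the full conditional for $M_{ij}$ is truncated normal with mean $\mu_{w,d_{w,ij}}+u_i-\Delta_{ij}$ and variance $\sigma_{w,d_{w,ij}}^2$, with $M_{ij}\le 0$ when $y_{ij}=1$ and $M_{ij}>0$ when $y_{ij}=0$; that is, the distribution function is
    $$\frac{\Phi\{(m_{ij}-\mu_{w,d_{w,ij}}-u_i+\Delta_{ij})/\sigma_{w,d_{w,ij}}\}}{\Phi\{(-\mu_{w,d_{w,ij}}-u_i+\Delta_{ij})/\sigma_{w,d_{w,ij}}\}}\,I(m_{ij}\le 0)$$
    when $y_{ij}=1$ and
    $$\frac{\Phi\{(m_{ij}-\mu_{w,d_{w,ij}}-u_i+\Delta_{ij})/\sigma_{w,d_{w,ij}}\}-\Phi\{(-\mu_{w,d_{w,ij}}-u_i+\Delta_{ij})/\sigma_{w,d_{w,ij}}\}}{1-\Phi\{(-\mu_{w,d_{w,ij}}-u_i+\Delta_{ij})/\sigma_{w,d_{w,ij}}\}}\,I(m_{ij}>0)$$
    when $y_{ij}=0$;
  5. the full conditional for $\beta^c$ is multivariate normal with mean
    $$(\Sigma^{-1}+X^tW^{-1}X)^{-1}(\Sigma^{-1}\mu_{\beta^c}+X^tW^{-1}\xi),$$
    where $X$ is the design matrix, $W$ is a diagonal matrix of the $\sigma_{w,d_{w,ij}}^2$ and $\xi$ is a vector with elements $\mu_{w,d_{w,ij}}+u_i-m_{ij}$, and variance
    $$(\Sigma^{-1}+X^tW^{-1}X)^{-1};$$
  6. the full conditional for $\sigma_{u,l}^2$ is inverted gamma with shape parameter
    $$\nu+\sum_i I(d_{u,i}=l)/2$$
    and rate parameter
    $$\tau+\sum_i I(d_{u,i}=l)(u_i-\mu_{u,l})^2/2;$$
  7. the full conditional for $\mu_{u,l}$ is normal with mean
    $$\left\{\sum_i I(d_{u,i}=l)\frac{1}{\sigma_{u,d_{u,i}}^2}+\frac{1}{\theta}\right\}^{-1}\left\{\sum_i I(d_{u,i}=l)\frac{u_i}{\sigma_{u,d_{u,i}}^2}+\frac{\eta}{\theta}\right\}$$
    and variance
    $$\left\{\sum_i I(d_{u,i}=l)\frac{1}{\sigma_{u,d_{u,i}}^2}+\frac{1}{\theta}\right\}^{-1};$$
  8. the full conditional for the $\{\pi_{u,l}\}$ is Dirichlet with shape parameter
    $$\alpha+\sum_i\{I(d_{u,i}=1),\ldots,I(d_{u,i}=L_u)\}^t.$$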
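The latent-variable updates in steps 1 to 4 can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`sample_label`, `sample_u`, `sample_m`) and argument layout are hypothetical, the truncated normal is drawn by inverting the distribution function (one of several possible methods), and the clamp on the uniform variate is a crude numerical guard. Steps 5 to 8 are omitted.

```python
import math
import random
from statistics import NormalDist


def sample_label(x, pi, mu, sigma, rng):
    """Steps 1 and 3: draw a label with probability proportional to
    pi_l * Normal(x; mu_l, sigma_l^2).

    For D_{u,i}, pass x = u_i with the random effect mixture.
    For D_{w,ij}, pass x = m_ij - u_i + delta_ij with the link mixture.
    """
    weights = [p * NormalDist(m, s).pdf(x) for p, m, s in zip(pi, mu, sigma)]
    return rng.choices(range(len(pi)), weights=weights)[0]


def sample_u(m_i, delta_i, mu_w_i, sigma_w_i, mu_u, sigma_u, rng):
    """Step 2: normal full conditional for one random intercept U_i.

    The lists run over j and hold, for subject i, the latent m_ij, the
    linear predictors delta_ij and the mean and standard deviation of the
    currently selected link component; mu_u and sigma_u describe the
    currently selected random effect component.
    """
    precision = sum(1.0 / s ** 2 for s in sigma_w_i) + 1.0 / sigma_u ** 2
    weighted = sum(
        (m - mw + d) / s ** 2
        for m, mw, d, s in zip(m_i, mu_w_i, delta_i, sigma_w_i)
    ) + mu_u / sigma_u ** 2
    return rng.gauss(weighted / precision, math.sqrt(1.0 / precision))


def sample_m(y, u, delta, mu_w, sigma_w, rng):
    """Step 4: truncated normal for M_ij, drawn by inverting the cdf.

    M_ij <= 0 when y = 1 and M_ij > 0 when y = 0.
    """
    dist = NormalDist(mu_w + u - delta, sigma_w)
    p0 = dist.cdf(0.0)                       # Pr(M_ij <= 0) before truncation
    v = rng.random()
    p = v * p0 if y == 1 else p0 + v * (1.0 - p0)
    p = min(max(p, 1e-12), 1.0 - 1e-12)      # keep inv_cdf inside (0, 1)
    return dist.inv_cdf(p)


# Illustrative draws with made-up current values.
rng = random.Random(0)
d_u = sample_label(0.4, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0], rng)
u_i = sample_u([0.2, -0.3], [0.1, 0.1], [0.0, 0.0], [1.0, 1.0], 0.0, 1.0, rng)
m_ij = sample_m(1, u_i, 0.1, 0.0, 1.0, rng)
```

Inverting the distribution function keeps the sketch short, but it loses accuracy when the truncation region lies far in a tail; a dedicated truncated normal sampler would be preferable there.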

We apply the Gibbs sampler to the four data sets employing diffuse priors with a single normal random intercept. Throughout we assume that $F_w$ is the five component mixture of normals approximation to the logistic distribution. Figure 3 shows the estimated posterior distributions for the parameter of interest after 20,000 simulations and 1,000 burn-in samples, for each of the data sets employing 1, 2 and 3 mixture components for the random effect distribution. For each of the examples we set $\nu = 10^{-4}$, $\tau = 10^{-6}$, $\eta = 0$, $\theta = 10^{4}$, $\alpha = (1, \ldots, 1)^t$, $\mu_{\beta^c} = (0, \ldots, 0)^t$, and $\Sigma$ as a diagonal matrix with entries 10. Starting values were obtained in an ad-hoc manner, though using the maximum likelihood estimates when available. Diagnostic checks were made by examining trace plots and plots of some sequential quantiles for parameters of interest (see Figure B.1 for the Item Response data as an example). The hyperparameter specification had little impact for a range of reasonably diffuse priors (results not reported). To investigate the impact of occasional extremely large values of $\sigma_{u,l}$, the gamma priors were truncated, which again did not impact results.

Fig. 3.

Estimated posterior densities for the examples from Section 4.3 using one (solid), two (dashed) and three (dotted) component mixtures for the random effect distributions.

For the random effect distribution, using a large number of mixture components, or estimating the number of mixture components, is generally not advisable in this setting, because there is typically insufficient information to identify this distribution in such detail. Instead we emphasize the use of a small number of mixture components, specifically 2 and 3, to protect against some canonical types of misspecification in the form of skewness or a small number of modes (see Agresti et al., 2004). This is particularly interesting for the Item Response data, since a random effect distribution with three modes makes practical sense in this situation. Specifically, it is likely that three populations, one opposed to abortion under any circumstance, one in favor of abortion rights regardless of the circumstance, and a more heterogeneous group, dominate the random effect distribution.

Regardless, for the parameter of interest for these four data sets, misspecification of the random effect distribution does not appear to be impacting results. The estimated posterior densities appear to be the same regardless of the number of mixture components implemented (Figure 3).

5. Discussion

In this manuscript, we discussed the conceptual and computational benefits of using mixtures of normals as the conditional link distribution and random effect distribution for random intercept models for binary outcomes. The use of mixtures of normals could be exploited for further generalizations of the random intercept model. In particular, the extension to multivariate random effects, using mixtures of multivariate normals, is plausible. Furthermore, this mixture approach is potentially very useful for jointly modeling discrete and continuous outcomes. Finally, further work may also explore how the mixture approach facilitates description of the “bridge” random effect distribution as introduced by Wang and Louis (2003) and Wang and Louis (2004).

In closing, we note that all of the code needed to reproduce the results, together with the derivations of the Bayesian full conditionals, is available at the corresponding author’s web site.

Appendix A. Data sets

Table A.1.

Prime minister approval rating. Source Agresti (2002).

| First Survey | Second Survey: Approve | Second Survey: Disapprove |
|---|---|---|
| Approve | 794 | 150 |
| Disapprove | 86 | 570 |

Table A.2.

Crossover data, frequency of responses by treatment regimen. Source Jones and Kenward (1987).

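| Response, Period 1 | Response, Period 2 | Treatment sequence: Drug-Placebo | Treatment sequence: Placebo-Drug |
|---|---|---|---|
| Normal | Normal | 22 | 18 |
| Abnormal | Normal | 0 | 4 |
| Normal | Abnormal | 6 | 2 |
| Abnormal | Abnormal | 6 | 9 |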

Table A.3.

Response to questions on abortion stratified by gender, from Agresti (2002). A response of “1” was in favor of legalized abortion in a specific scenario while a response of “0” was not. The scenarios are (i) if the family has a very low income, (ii) if the woman is not married and does not want to marry the man, and (iii) for any reason.

Sequence of Responses

| Gender | 111 | 110 | 011 | 010 | 101 | 100 | 001 | 000 |
|---|---|---|---|---|---|---|---|---|
| male | 342 | 26 | 6 | 21 | 11 | 32 | 19 | 356 |
| female | 440 | 25 | 14 | 18 | 14 | 47 | 22 | 457 |

Appendix B. Example diagnostic plots

Fig. B.1.

Diagnostic trace plots for $\beta_1^c$ for the Item Response data. The upper right plot displays the (.1, .2, .4, .6, .8, .9) sequential quantiles.

Appendix C. Approximating link functions with mixtures of normals

In this section we give an estimation procedure for approximating a distribution with a mixture of normals. For a given number of mixture elements, we chose to minimize the Kullback-Leibler divergence (Kullback and Leibler, 1951) between the mixture approximation and the true density. That is, if $g$ is the density associated with the link function of interest and $f$ is the mixture approximation, we minimize $E_g[\log\{g(X)/f(X)\}]$ or, equivalently, maximize $E_g[\log\{f(X)/g(X)\}]$. The algorithm was obtained as the limit of the standard EM algorithm for estimating normal mixture components as the number of observed data points goes to infinity.

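Let $\pi_j^{(t)}$, $\sigma_j^{(t)}$ and $\mu_j^{(t)}$ be the current estimates; then

$$\begin{aligned}
P_j^{(t)}(x) &= \frac{\pi_j^{(t)}\,\varphi\{(x-\mu_j^{(t)})/\sigma_j^{(t)}\}/\sigma_j^{(t)}}{\sum_l \pi_l^{(t)}\,\varphi\{(x-\mu_l^{(t)})/\sigma_l^{(t)}\}/\sigma_l^{(t)}}, \\
\pi_j^{(t+1)} &= E_g[P_j^{(t)}(X)], \\
\mu_j^{(t+1)} &= E_g[X P_j^{(t)}(X)]/\pi_j^{(t+1)}, \\
\sigma_j^{(t+1)} &= \left\{E_g[X^2 P_j^{(t)}(X)]/\pi_j^{(t+1)}-(\mu_j^{(t+1)})^2\right\}^{1/2}.
\end{aligned}$$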

The expected values generally need to be evaluated numerically. In this manuscript we use Gauss/Hermite quadrature (see Lange, 1999).
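The updates above can be sketched in Python for the logistic density. This is a minimal illustration under stated assumptions: the expectations are approximated with a midpoint rule on a truncated grid rather than the Gauss/Hermite quadrature used in the manuscript, and the function names, grid limits, starting values and number of iterations are all hypothetical choices.

```python
import math


def logistic_pdf(x):
    """Standard logistic density, written to avoid overflow for large |x|."""
    e = math.exp(-abs(x))
    return e / (1.0 + e) ** 2


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_mixture(g, pi, mu, sigma, lo=-25.0, hi=25.0, n=1000, iters=100):
    """Iterate the limiting EM updates for a normal mixture approximating g.

    E_g[.] is approximated by a midpoint rule on [lo, hi] (an assumption of
    this sketch, standing in for Gauss/Hermite quadrature).
    """
    h = (hi - lo) / n
    xs = [lo + (i + 0.5) * h for i in range(n)]
    gw = [g(x) * h for x in xs]              # g(x) dx at each node
    k = len(pi)
    for _ in range(iters):
        s0, s1, s2 = [0.0] * k, [0.0] * k, [0.0] * k
        for x, w in zip(xs, gw):
            dens = [pi[j] * normal_pdf(x, mu[j], sigma[j]) for j in range(k)]
            total = sum(dens)
            if total == 0.0:                 # every component underflowed
                continue
            for j in range(k):
                r = w * dens[j] / total      # P_j(x) g(x) dx
                s0[j] += r
                s1[j] += r * x
                s2[j] += r * x * x
        pi = s0[:]                                       # E_g[P_j(X)]
        mu = [s1[j] / s0[j] for j in range(k)]           # E_g[X P_j(X)] / pi_j
        sigma = [math.sqrt(s2[j] / s0[j] - mu[j] ** 2) for j in range(k)]
    return pi, mu, sigma


# Three-component approximation to the logistic from arbitrary starting values.
pi_hat, mu_hat, sigma_hat = fit_mixture(
    logistic_pdf, [1.0 / 3.0] * 3, [0.0, 0.0, 0.0], [1.0, 2.0, 3.0]
)
```

A useful check on any implementation is that, after each update, the mixture reproduces the first two moments of $g$; for the standard logistic the fitted mixture variance should therefore be close to $\pi^2/3$.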

Appendix D. Obtaining standard error estimates of marginal parameters using the Multivariate Delta Method

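In this section, we detail how to obtain the standard error estimate for $\hat\beta_1^m$ when there is one binary covariate. Note that $\hat\beta_1^m$ is a function of $\hat\beta_0^c$ and $\hat\beta_1^c$:

$$\hat\beta_1^m = g(\hat\beta_0^c, \hat\beta_1^c) = \log\left\{\frac{F_q(\hat\beta_0^c+\hat\beta_1^c)}{1-F_q(\hat\beta_0^c+\hat\beta_1^c)}\right\} - \log\left\{\frac{F_q(\hat\beta_0^c)}{1-F_q(\hat\beta_0^c)}\right\},$$

with gradient

$$\nabla g^t = \begin{bmatrix} \dfrac{f_q(\hat\beta_0^c+\hat\beta_1^c)}{F_q(\hat\beta_0^c+\hat\beta_1^c)\{1-F_q(\hat\beta_0^c+\hat\beta_1^c)\}} - \dfrac{f_q(\hat\beta_0^c)}{F_q(\hat\beta_0^c)\{1-F_q(\hat\beta_0^c)\}} \\[2ex] \dfrac{f_q(\hat\beta_0^c+\hat\beta_1^c)}{F_q(\hat\beta_0^c+\hat\beta_1^c)\{1-F_q(\hat\beta_0^c+\hat\beta_1^c)\}} \end{bmatrix}.$$

Since $(\hat\beta_0^c, \hat\beta_1^c)^t$ is normally distributed with covariance matrix $V_\beta$, say, we can apply the multivariate Delta Method to obtain a standard error estimate of $\hat\beta_1^m$, namely $SE(\hat\beta_1^m) = \sqrt{\nabla g\, V_\beta\, \nabla g^t}$.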
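The calculation can be sketched in Python as follows. This is a minimal illustration: the function name `marginal_se`, the logistic choice for $F_q$ in the usage lines and the covariance values are assumptions made for the example, not quantities from the manuscript.

```python
import math


def marginal_se(b0, b1, V, F, f):
    """Delta Method standard error of the marginal log odds ratio.

    b0, b1 : conditional estimates (intercept and binary covariate effect)
    V      : their 2 x 2 covariance matrix as nested lists
    F, f   : distribution function and density playing the role of F_q, f_q
    """
    def h(x):
        # Derivative of log{F(x) / (1 - F(x))} with respect to x.
        return f(x) / (F(x) * (1.0 - F(x)))

    grad = (h(b0 + b1) - h(b0), h(b0 + b1))
    variance = sum(
        grad[r] * V[r][c] * grad[c] for r in range(2) for c in range(2)
    )
    return math.sqrt(variance)


def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))


def logistic_pdf(x):
    p = logistic_cdf(x)
    return p * (1.0 - p)


# Hypothetical estimates and covariance matrix, for illustration only.
V_beta = [[0.04, -0.01], [-0.01, 0.09]]
se = marginal_se(-0.5, 0.8, V_beta, logistic_cdf, logistic_pdf)
```

When $F_q$ is logistic, $g$ reduces to $\hat\beta_1^c$, the gradient is $(0, 1)$, and the result is simply the square root of the second diagonal entry of $V_\beta$, which gives a quick sanity check on the code.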

References

  1. Agresti A. Categorical Data Analysis. 2nd ed. Wiley; 2002.
  2. Agresti A, Caffo B, Ohman-Strickland P. Examples in which misspecification of a random effects distribution reduces efficiency, and possible remedies. Computational Statistics and Data Analysis. 2004;47(3):639–653.
  3. Agresti A, Booth J, Hobert J, Caffo BS. Random effects modeling of categorical response data. Sociological Methodology. 2000;30:27–80.
  4. Aitkin M. A general maximum likelihood analysis of variance components in generalized linear models. Biometrics. 1999;55:117–128. doi: 10.1111/j.0006-341x.1999.00117.x.
  5. Albert J, Chib S. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association. 1993;88(422):669–678.
  6. Breslow NE, Clayton DG. Approximate inference in generalized linear mixed models. Journal of the American Statistical Association. 1993;88:9–25.
  7. Caffo B, Griswold M. A user-friendly tutorial on link-probit-normal models. Tech rep. Johns Hopkins University, Department of Biostatistics; 2005.
  8. Chib S, Greenberg E. Analysis of multivariate probit models. Biometrika. 1998;85(2):347–361.
  9. Chib S, Greenberg E, Chen Y. MCMC methods for fitting and comparing multinomial response models. Economics Working Paper Archive Econ WPA: Econometrics. 1998. Http://econwpa.wustl.edu:80/eps/em/papers/9802/9802001.pdf.
  10. Demidenko E. Mixed Models: Theory and Applications. Wiley; 2004.
  11. Erkanli A, Stangl D, Mueller P. A Bayesian analysis of ordinal data using mixtures. American Statistical Association Proceedings of the Section on Bayesian Statistical Science. 1993:51–56.
  12. Follmann DA, Lambert D. Generalizing logistic regression by nonparametric mixing. Journal of the American Statistical Association. 1989;84:295–300.
  13. Geweke J, Keane M. Mixture of normals probit model. Tech Rep. Vol. 237. Federal Reserve Bank, Minneapolis; 1997.
  14. Griswold M. Complex distributions, hmmmm... hierarchical mixtures of marginalized multilevel models. PhD thesis. Johns Hopkins University; 2005.
  15. Heagerty PJ, Zeger SL. Marginal regression models for clustered ordinal measurements. Journal of the American Statistical Association. 1996;91:1024–1036.
  16. Heagerty PJ, Zeger SL. Marginalized multilevel models and likelihood inference. Statistical Science. 2000;15(1):1–26.
  17. Imai K, van Dyk D. A Bayesian analysis of the multinomial probit model using marginal data augmentation. Journal of Econometrics. 2005;124:311–334.
  18. Jasra A, Holmes C, Stephens D. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science. 2005;20(1):50–61.
  19. Jones B, Kenward M. Modelling binary data from a three-period cross-over trial. Statistics in Medicine. 1987;6:555–564. doi: 10.1002/sim.4780060504.
  20. Kullback S, Leibler RA. On information and sufficiency. The Annals of Mathematical Statistics. 1951;22(1):79–86.
  21. Lang JB, Agresti A. Simultaneously modeling joint and marginal distributions of multivariate categorical responses. Journal of the American Statistical Association. 1994;89:625–632.
  22. Lange K. Numerical Analysis for Statisticians. Springer-Verlag; 1999.
  23. Lee Y, Nelder J. Conditional and marginal models: Another view. Statistical Science. 2004;19(2):219–238.
  24. Liang K, Hanfelt J. On the use of the quasi-likelihood method in teratology experiments. Biometrics. 1994;50:872–880.
  25. Liang K, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
  26. Magder L, Zeger S. A smooth nonparametric estimate of a mixing distribution using mixtures of Gaussians. Journal of the American Statistical Association. 1996;91:1141–1151.
  27. McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. Chapman & Hall; London: 1989.
  28. McCulloch R, Rossi P. An exact likelihood analysis of the multinomial probit model. Journal of Econometrics. 1994;64:207–240.
  29. Monahan J, Stefanski L. Normal scale mixture approximations to f*(z) and computation of the logistic-normal integral. In: Balakrishnan, editor. Handbook of the Logistic Distribution. Marcel Dekker; 1992. pp. 529–540.
  30. Natarajan R, McCulloch C, Kiefer N. A Monte Carlo EM method for estimating multinomial probit models. Computational Statistics and Data Analysis. 2000;34:33–50.
  31. Nelder JA, Wedderburn RWM. Generalized linear models. Journal of the Royal Statistical Society, Series A, General. 1972;135:370–384.
  32. Royall R. Statistical Evidence: A Likelihood Paradigm. Chapman and Hall; 1997.
  33. Wang Z, Louis T. Matching conditional and marginal shapes in binary random intercept models using a bridge distribution function. Biometrika. 2003;90(4):765–775.
  34. Wang Z, Louis T. Marginalized binary mixed-effects with covariate-dependent random effects and likelihood inference. Biometrics. 2004;60(4):884–891. doi: 10.1111/j.0006-341X.2004.00243.x.
  35. Weil C. Selection of the valid number of sampling units and a consideration of their combination in toxicological studies involving reproduction, teratogenesis or carcinogenesis. Food and Cosmetics Toxicology. 1970;8:177–182. doi: 10.1016/s0015-6264(70)80337-6.
  36. Zeger S, Liang K, Albert P. Models for longitudinal data: a generalized estimating equation approach. Biometrics. 1988;44:1049–1060.
