A General Class of Pattern Mixture Models for Nonignorable Dropout with Many Possible Dropout Times

Jason Roy; Michael J Daniels

doi:10.1111/j.1541-0420.2007.00884.x

Author manuscript; available in PMC: 2009 Dec 10.

Published in final edited form as: Biometrics. 2007 Sep 26;64(2):538–545. doi: 10.1111/j.1541-0420.2007.00884.x

PMCID: PMC2791415 NIHMSID: NIHMS143240 PMID: 17900312

Summary

In this article we consider the problem of fitting pattern mixture models to longitudinal data when there are many unique dropout times. We propose a marginally specified latent class pattern mixture model. The marginal mean is assumed to follow a generalized linear model, whereas the mean conditional on the latent class and random effects is specified separately. Because the dimension of the parameter vector of interest (the marginal regression coefficients) does not depend on the assumed number of latent classes, we propose to treat the number of latent classes as a random variable. We specify a prior distribution for the number of classes, and calculate (approximate) posterior model probabilities. In order to avoid the complications with implementing a fully Bayesian model, we propose a simple approximation to these posterior probabilities. The ideas are illustrated using data from a longitudinal study of depression in HIV-infected women.

Keywords: Bayesian model averaging, Incomplete data, Latent variable, Marginal model, Random effects

1. Introduction

Dropout is a common occurrence in longitudinal studies. Missingness induced by dropout that depends only on the observed data is called missing at random (MAR) or random dropout. If missingness depends on the unobserved response at the time of dropout or at future times, even after conditioning on the observed data, then the missingness is called nonignorable or informative dropout (Little, 1995). There are many model-based approaches to deal with informative dropout that are characterized by how they factor the joint distribution of missingness and the response. We will focus on the pattern mixture approach. Pattern mixture models (PMM) are a flexible and transparent way to analyze incomplete longitudinal data where the missingness is nonignorable (Little, 1994; Hogan and Laird, 1997). The typical approach taken in PMM is to stratify on dropout time (i.e., the pattern) and assume that missing data within a pattern are MAR. Consider the case of T unique dropout times and define D_i to be the dropout time and Y_i to be the response vector for subject i. PMM account for nonignorable missingness by allowing the distribution of Y_i to differ by dropout time, that is, f(y_i|D_i) ≠ f(y_i). So, models are built for [Y_i|D_i], but inferences are based on f(y) = ∑_Df(y|D)p(D). One issue in this formulation, addressed in Fitzmaurice, Laird, and Shneyer (2001) and Wilkins and Fitzmaurice (2006), is that when a nonlinear link function connects the conditional means E[Y_it|D_i] to covariates, that is, g(E[Y_it|D_i, X_it]) = X_it × β(D_i), the marginal mean E[Y_it] does not in general satisfy the same link relationship: g(E[Y_it|X_it]) ≠ X_it ∑_Dβ(D)p(D). This is one issue we will address in our model.

The other issue we will address is the situation where the number of unique dropout times T is large. In this setting, stratification by dropout pattern may lead to sparse patterns, which in turn lead to unstable parameter estimates (or unidentified parameters) in those patterns. There are several ways to remedy this, including allowing parameters to be shared across patterns (Hogan and Laird, 1997) or grouping the dropout times into m<T groups in an ad hoc fashion (Hogan, Roy and Korkontzelou, 2004). Roy (2003) proposed an automated way to do the latter using a latent variable approach within the context of normal models for continuous data. This approach assumes the existence of a discrete latent variable that explains the dependence between the response vector and the dropout time and allows incorporation of uncertainty about the groupings, conditional on a fixed number of groups. We will extend the approach of Roy (2003) by incorporating uncertainty in the number of classes through (approximate) Bayesian model averaging.

A common way to account for the longitudinal correlation in the vector of responses for subject i, Y_i, is to introduce random effects. However, for nonlinear link functions, as in the discussion above, the link no longer holds for marginal covariate effects (Diggle et al., 2002). We will use the ideas in Heagerty (1999) within our model to directly model the marginal covariate effects. We briefly review Heagerty's approach below.

Let Y_it denote the response for the ith subject (i = 1, ..., n) at time t (t = 1,..., T). Heagerty (1999) specifies marginalized logistic models in the following ways. First, the marginal mean of Y_it is specified as

logit {P (Y_{i t} = 1 ∣ β)} = X_{i t}^{T} β .

(1)

Then the dependence among the Y_it is specified via a conditional model that is consistent with (1),

logit {P (Y_{i t} = 1 ∣ b_{i})} = Δ_{i t} + b_{i},

(2)

where b_i ~ N(0, θ). The quantity Δ_it is determined by the other parameters in the model and can be computed by solving the following convolution equation,

P (Y_{i t} = 1) = \int P (Y_{i t} = 1 ∣ b_{i}) d F (b_{i}) .

Note that Δ_it is a function of X_itβ and θ. The overall objective in our approach will be to propose a model that marginalizes over the random effects and the dropout distribution to directly model the marginal covariate effects of interest.
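As a rough illustration of how Δ_it might be obtained numerically, the following Python sketch solves the convolution equation for a single observation. The function names, the trapezoid-rule integration, the grid settings, and the bisection bracket are all assumptions made for this illustration; they are not taken from Heagerty (1999) or from the authors' implementation.

```python
import math

def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def marginal_prob(delta, sd, n_grid=2001, width=8.0):
    """Approximate the integral of expit(delta + b) dF(b) for b ~ N(0, sd^2).

    Uses a trapezoid rule on [-width*sd, width*sd] (an illustrative choice).
    """
    if sd == 0.0:
        return expit(delta)
    step = 2.0 * width * sd / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        b = -width * sd + k * step
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        density = math.exp(-0.5 * (b / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        total += weight * expit(delta + b) * density
    return total * step

def solve_delta(x_beta, theta, tol=1e-10):
    """Find delta so that the marginal probability equals expit(x_beta)."""
    target = expit(x_beta)
    sd = math.sqrt(theta)
    lo, hi = -50.0, 50.0  # marginal_prob is increasing in delta
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if marginal_prob(mid, sd) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical inputs: marginal linear predictor -1.1, random-intercept variance 4.
delta = solve_delta(x_beta=-1.1, theta=4.0)
print(delta)
```

When θ > 0 the solution lies farther from zero than X_itβ, which reflects the usual attenuation of marginal relative to conditional logistic coefficients.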

This work is widely applicable, but was motivated by an HIV natural history study of depression. The HIV Epidemiology Research Study (HERS; Smith et al., 1997) was a longitudinal study of women with, or at high risk for, HIV infection. Data were collected from 1310 women at baseline. Investigators then attempted to collect data from each subject every 6 months for a total of 6 years. Thus, 12 total visits from each subject would be obtained if there were no missing data. Our interest is in studying the course of depression in the 849 women who had HIV infection at baseline. Depression was treated as a binary, yes/no, variable (Cook et al., 2004). A challenge with the analysis of these data is that less than half of these women remained in the study until the end. It is not hard to imagine a scenario where the course of depression over time might vary as a function of dropout time. Because there are many unique dropout times (12), some of which include very few subjects, we apply the latent class pattern mixture modeling approach to the analysis of these data.

In Section 2 we introduce the model. We provide computational details in Section 3. The example is analyzed in Section 4. A brief simulation study is given in Section 5. We conclude with a discussion in Section 6.

2. Model

Before we introduce the model, we first go through some additional notation needed for the latent class component. Define S_i = (S_i₁,..., S_iM )^T to be a vector of latent indicators, where S_ij is defined as an indicator for class j, j = 1,..., M (M<T; e.g., if subject i is in class j, then S_ij = 1 and S_ij′ = 0 for all j ≠ j′). The idea here will be to “group” the dropout times into the M classes as in Roy (2003).

All of the parameters in the following specification are a function of the number of latent classes, M; for example, β^(M). However, we suppress the superscripts without loss of clarity in the following. First, we specify the marginal mean as

g {E (Y_{i t} ∣ β)} = X_{i t}^{T} β .

(3)

By marginal, we mean marginalized over subject-specific random effects and over the latent class distribution (implicitly over the dropout distribution as well). If the number of classes M were known, then the parameters β would be of primary interest. We address the issue of M being unknown below.

In order to fully account for correlation due to repeated observations and informative censoring, we specify a conditional model in addition to the marginal model. Recall that we are taking a pattern mixture modeling approach to account for dropout. We assume that the relevant information in D is captured by the latent variable S. We therefore specify a mixture distribution over these latent classes, as opposed to over D itself. Before proceeding to describe the model, however, we first make two points. First, the parameters from the conditional model are not of scientific interest, and in fact are viewed as nuisance parameters; we are not interested in estimating subject-specific effects (i.e., effects conditional on the random effects) or class-specific covariate effects (i.e., effects of covariates on Y given a particular dropout class). Second, we must specify the conditional model in a way that is compatible with the marginal model (3). As we will see below, this leads to a somewhat complicated model. Specifying this conditional model is necessary, however, in order to account for the two types of dependencies (within-subject correlation and dependency between the outcome and dropout time).

We assume the data Y_it, conditional on random effects b_i and latent class S_i, are from an exponential family with distribution

f (Y_{i t} ∣ b_{i}, S_{i}) = exp [{Y_{i t} η_{i t} - ψ (η_{i t})} ∕ (m_{i} ϕ) + h (Y_{i t}, ϕ)],

where E(Y_it|b_i, S_i) = g^–1 (η_it) = ψ′ (η_it), η_it is the linear predictor, ψ(·) is a known function, φ is a scale parameter, and m_i is the prior weight. This family includes normal (ψ(x) = x²/2), binomial (ψ(x) = log (1 + e^x)), and Poisson (ψ(x) = e^x) distributions, among others. The conditional mean is specified as

g {E (Y_{i t} ∣ b_{i}, S_{i})} = Δ_{i t} + b_{i} + \sum_{j = 1}^{M} S_{i j} Z_{i t}^{T} α^{(j)},

(4)

where, in the most general form of the model we allow the variance of b_i to depend on the latent class, that is, [b_i|S_ij = 1] ~ N(0, θ_j). For identifiability, we use a sum-to-zero constraint on the α's, namely, $α^{(M)} = - \sum_{j = 1}^{M - 1} α^{(j)}$ . In this conditional model, each subject has its own intercept, and the effect of each covariate, Z_itj (Z_it ⊂ X_it), is allowed to differ by dropout class via the regression coefficients, α^(j).

The probabilities of the latent classes given the dropout time are specified as a proportional odds model,

\begin{matrix} logit {P (\sum_{j = 1}^{k} S_{i j} = 1 ∣ D_{i})} & = λ_{0 k} + λ_{1} D_{i}, \\ k & = 1, \dots, M - 1, \end{matrix}

(5)

where λ₀₁ ≤ λ₀₂ ≤ ··· ≤ λ_0,M–1 and λ₁ are unknown parameters. From this regression (5) it is clear that the class probabilities are a monotone function of dropout time (in fact, linear on the logit scale). Finally, the dropout times, D_i, follow a multinomial distribution with mass at each of the possible dropout times, parameterized by γ.
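To make the proportional odds specification in (5) concrete, the sketch below converts the cumulative logits into individual class probabilities for a given dropout time. The function name and argument layout are hypothetical; the parameter values in the usage lines are those reported for the two-class simulation in Section 5.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def class_probabilities(lambda0, lambda1, dropout_time):
    """Class probabilities P(S_ij = 1 | D_i) implied by model (5).

    lambda0: nondecreasing intercepts (lambda_01, ..., lambda_0,M-1).
    Returns a list of M probabilities that sum to one.
    """
    # Cumulative probabilities P(class <= k | D) for k = 1, ..., M-1.
    cumulative = [expit(l0 + lambda1 * dropout_time) for l0 in lambda0]
    cumulative.append(1.0)  # P(class <= M | D) = 1
    probs = []
    previous = 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs

# Two-class values from the simulation study: lambda_01 = 4, lambda_1 = -0.7.
for d in (1, 6, 12):
    print(d, [round(p, 3) for p in class_probabilities([4.0], -0.7, d)])
```

With a negative λ₁, the probability of the first class falls as dropout time increases, which is the monotone behavior noted above.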

We point out that in the above formulation, Y_it is independent of D_i given S_i. This is a key assumption with this approach, which we will examine in Section 3.4.

The intercept Δ_it in (4) is determined by the relationship between (3) and (4), namely, the solution to

E (Y_{i t} ∣ β) = \sum_{D} \sum_{S} p (S_{i} ∣ D_{i}) p (D_{i}) \int E (Y_{i t} ∣ b_{i}, S_{i}) p (b_{i} ∣ S_{i}) d b_{i} .

The main target of inference typically will be the covariate effects averaged over classes, that is, β^(M) averaged over M. We denote this as β* = ∑_mβ^(m)p(m|y). We discuss computation of p(m|y) in Section 3.2 and the corresponding computation of $var ({\hat{β}}^{⋆})$ in Section 3.3.
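The constraint that determines Δ_it can be sketched numerically for a logit link as follows. This is only an illustration under stated assumptions: `class_probs` stands for the class probabilities already averaged over the dropout distribution, that is, ∑_D p(S_ij = 1|D)p(D); `class_shifts` stands for the values Z_it^T α^(j); and the trapezoid integration and bisection are convenient choices rather than the authors' algorithm (which is described in their Appendix). All names are hypothetical.

```python
import math

def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def mean_expit_normal(eta, theta, n_grid=801, width=8.0):
    """E[expit(eta + b)] for b ~ N(0, theta), by a trapezoid rule."""
    if theta == 0.0:
        return expit(eta)
    sd = math.sqrt(theta)
    step = 2.0 * width * sd / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        b = -width * sd + k * step
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        density = math.exp(-0.5 * (b / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        total += weight * expit(eta + b) * density
    return total * step

def marginal_mean(delta, class_probs, class_shifts, thetas):
    """Mixture over latent classes of the random-effect-averaged conditional mean."""
    return sum(
        p * mean_expit_normal(delta + a, th)
        for p, a, th in zip(class_probs, class_shifts, thetas)
    )

def solve_delta_latent_class(x_beta, class_probs, class_shifts, thetas, tol=1e-10):
    """Find delta so that the marginal mean equals expit(x_beta)."""
    target = expit(x_beta)
    lo, hi = -50.0, 50.0  # marginal_mean is increasing in delta
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if marginal_mean(mid, class_probs, class_shifts, thetas) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical two-class example with sum-to-zero shifts and class-specific variances.
delta = solve_delta_latent_class(-1.1, [0.7, 0.3], [0.3, -0.3], [4.0, 1.0])
print(delta)
```

With a single class, a zero shift, and θ = 0, the function returns X_itβ itself, so the sketch reduces to an ordinary logistic model in that special case.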

3. Computational Details

We provide details on computation of maximum likelihood (ML) estimates conditional on m, computation of the approximate posterior model probabilities, and model averaging.

3.1 The Likelihood and ML Inference

Denote the set of all parameters by ω = (β^T, α^T, θ^T, φ, λ^T, γ^T)^T. We partition the complete response data for subject i, $Y_{i}^{c}$ , into observed and missing components. Denote by Y_i the observed part of the vector (i.e., values of Y^c prior to dropout) and by $Y_{i}^{m}$ the response after dropout. In the following presentation, we condition on X_i and M throughout.

The likelihood contribution for subject i corresponding to the models described in Section 2 is

\begin{matrix} L_{i} (Y_{i}, D_{i}; ω) \propto \int \sum_{j = 1}^{M} L_{i} (Y_{i} ∣ S_{i j} = 1, b_{i}; α^{(j)}, ϕ) \\ \times p (S_{i j} = 1 ∣ D_{i}; λ) p (D_{i} ∣ γ) d F (b_{i} ∣ S_{i j}, θ_{j}), \end{matrix}

(6)

where

\begin{matrix} L_{i} (Y_{i} ∣ S_{i j} = 1, b_{i}; α^{(j)}, ϕ) \\ = exp [{Y_{i}^{T} η_{i} - ψ (η_{i})} ∕ (m_{i} ϕ) + 1^{T} h (Y_{i}, ϕ)], \end{matrix}

with $η_{i} = Δ_{i} + b_{i} 1 + \sum_{j = 1}^{M} S_{i j} Z_{i} α^{(j)}$ ; $p (S_{i j} = 1 ∣ D_{i}; λ)$ is defined in (5), and p(D_i|γ) is the distribution of D_i, which might depend on covariates, and is parameterized by γ. Proportionality in (6) holds because we assume that the missing and observed responses from subject i are independent, given S_i and b_i (i.e., $[Y_{i}^{m} ∣ Y_{i}, b_{i}, S_{i}] = [Y_{i}^{m} ∣ b_{i}, S_{i}]$ ).

Maximization of $log {\prod_{i = 1}^{n} L_{i} (Y_{i}, D_{i}; ω)}$ with respect to the parameters ω is complicated by the possibly intractable integral in (6), and the need to calculate Δ_it at each iteration in the algorithm for every record in the data set. We provide details of the maximization algorithm in the Appendix.

3.2 Posterior Model Probabilities

The models introduced in Section 2 are indexed by the number of latent classes m (m = 1,..., M, M < T). Given that our main interest is in the regression parameters β, it would be sensible to properly account for the uncertainty in the regression coefficients by averaging over the number of classes as opposed to conditioning on the most likely number of classes. To do this, we need to first specify a prior distribution on the number of latent classes, m. We recommend specifying a prior to favor parsimony and/or to be consistent with subject matter considerations (if available). A convenient specification is a truncated Poisson distribution with rate parameter μ, truncated at an integer between 1 and T. Denote this prior as p(m). The posterior probability of m classes is given by the expression,

p (m ∣ y, x) = \frac{p (y ∣ m, x) p (m)}{p (y ∣ x)},

where p(y|x) = ∑_mp(y|m, x)p(m) and p(y|m, x) is the integrated likelihood, that is,

\begin{matrix} p (y ∣ m, x) = & \int p (y ∣ m, x, β^{(m)}, α^{(m)}, λ, γ, θ) p (λ) \\ \times p (α^{(m)} ∣ m) p (β^{(m)} ∣ m) p (γ) \\ \times p (θ) d β^{(m)} d α^{(m)} d λ d γ d θ, \end{matrix}

where p(y|m, x, β^(m), α^(m), λ, γ, θ) = ∑_s ∑_Dp(y|m, x, β^(m), α^(m), θ)p(S|m, x, D, λ)p(D|m, x, γ). Unfortunately, this integral is not available in closed form. We propose to use a Laplace approximation to evaluate this integral,

\hat{p} (y ∣ m, x) = {(2 π)}^{d ∕ 2} ∣ \hat{Σ} ∣^{1 ∕ 2} p (y ∣ m, x, {\hat{β}}^{(m)}, {\hat{α}}^{(m)}, \hat{λ}, \hat{γ}, \hat{θ}),

(7)

where d = dim(β,α,λ,γ,θ) and $({\hat{β}}^{(m)}, {\hat{α}}^{(m)}, {\hat{λ}}^{(m)}, {\hat{γ}}^{(m)}, {\hat{θ}}^{(m)})$ are the joint ML estimates of (β^(m), α^(m), λ^(m), γ^(m), θ^(m)) for the model with m classes; p(y|m, x, ${\hat{β}}^{(m)}$ , ${\hat{α}}^{(m)}$ , $\hat{λ}$ , $\hat{γ}$ , $\hat{θ}$ ) is the value of the maximized integrated likelihood; and $\hat{Σ}$ is the inverse of the observed information matrix for (β, α, λ, γ, θ) based on the integrated likelihood (6). These estimates are obtained using the algorithm described in the Appendix. It is clear that in (7) we have ignored the contribution of the prior, p(λ)p(α^(m)|m)p(β^(m)|m)p(γ)p(θ), evaluated at the joint ML estimates. This is justified (asymptotically) because, on the log scale, the maximized likelihood term, p(y|m, x, ${\hat{β}}^{(m)}$ , ${\hat{α}}^{(m)}$ , $\hat{λ}$ , $\hat{γ}$ , $\hat{θ}$ ), is O_p(n) whereas the prior is typically O_p(1). Thus, the approximate posterior probabilities take the form,

\hat{p} (m ∣ y, x) = \frac{\hat{p} (y ∣ m, x) p (m)}{\sum_{m} \hat{p} (y ∣ m, x) p (m)} .

(8)
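Equations (7) and (8) are most safely evaluated on the log scale, because the maximized likelihoods are far too small to represent directly. The sketch below is one way to do this; the function names are hypothetical, and the inputs are the components reported in Table 1 of Section 4.2 together with the two priors described in Section 4.1.

```python
import math

def log_laplace_evidence(loglik, half_logdet_cov, n_params):
    """Log of (7): (d/2) log(2*pi) + (1/2) log|Sigma_hat| + maximized log likelihood."""
    return 0.5 * n_params * math.log(2.0 * math.pi) + half_logdet_cov + loglik

def posterior_model_probs(log_evidence, prior):
    """Normalize exp(log_evidence) * prior as in (8), shifting by the maximum for stability."""
    log_terms = [le + math.log(p) for le, p in zip(log_evidence, prior)]
    shift = max(log_terms)
    weights = [math.exp(t - shift) for t in log_terms]
    total = sum(weights)
    return [w / total for w in weights]

# Components reported in Table 1 for m = 1, ..., 4.
loglik = [-3571.829, -3501.31, -3489.768, -3489.751]
half_logdet = [-24.51, -44.07, -55.84, -78.22]
n_params = [9, 19, 28, 37]

log_evidence = [
    log_laplace_evidence(l, h, d) for l, h, d in zip(loglik, half_logdet, n_params)
]
uniform = posterior_model_probs(log_evidence, [0.25] * 4)
poisson = posterior_model_probs(log_evidence, [0.6076, 0.3038, 0.0759, 0.0127])
print([round(p, 4) for p in uniform])
print([round(p, 4) for p in poisson])
```

Under these inputs the probability assigned to m = 3 should come out close to the values quoted in Section 4.2 (about 0.9997 under the uniform prior and 0.9987 under the truncated Poisson prior).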

3.3 Model Averaging and Approximate Posterior Inference

Once the posterior distribution p(m|y) is estimated, we can then estimate the covariate effects averaged across class sizes. As described previously, we denote the average covariate effect over classes as β*, which can be estimated as ${\hat{β}}^{⋆} = \sum_{m} {\hat{β}}^{(m)} \hat{p} (m ∣ y)$ . The variation of ${\hat{β}}^{*}$ is

\begin{matrix} var ({\hat{β}}^{*}) & = E [var ({\hat{β}}^{*} ∣ M)] + var (E [{\hat{β}}^{*} ∣ M]) \\ = \sum_{m} var ({\hat{β}}^{*} ∣ m) p (m ∣ y) + var (E [{\hat{β}}^{(m)} ∣ M]), \end{matrix}

which can be estimated as

\begin{matrix} \hat{var} ({\hat{β}}^{⋆}) = & \sum_{m} var ({\hat{β}}^{(m)} ∣ m) \hat{p} (m ∣ y) \\ + \sum_{m} {({\hat{β}}^{(m)} - {\hat{β}}^{*})}^{\otimes 2} \hat{p} (m ∣ y) . \end{matrix}

Note that if we conditioned on the most likely value for the number of classes, m, the variance of the estimated regression coefficients would likely be too small because the second term in the variance expression above would be ignored.
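For a single coefficient, the model-averaged estimate and its estimated variance reduce to simple weighted sums, since the outer product in the second term becomes a squared difference. The following sketch assumes scalar inputs; the function name and the numbers in the usage lines are hypothetical and chosen only for illustration.

```python
def model_average(estimates, variances, post_probs):
    """Model-averaged estimate and variance for one scalar coefficient.

    estimates[m], variances[m]: estimate and variance under the model with m+1 classes.
    post_probs[m]: approximate posterior probability of that model.
    """
    beta_star = sum(b * p for b, p in zip(estimates, post_probs))
    # First term: average within-model variance.
    within = sum(v * p for v, p in zip(variances, post_probs))
    # Second term: between-model spread of the estimates around beta_star.
    between = sum((b - beta_star) ** 2 * p for b, p in zip(estimates, post_probs))
    return beta_star, within + between

# Hypothetical values for three candidate models.
beta_star, var_star = model_average(
    estimates=[-0.04, -0.02, -0.03],
    variances=[0.007 ** 2, 0.009 ** 2, 0.011 ** 2],
    post_probs=[0.1, 0.3, 0.6],
)
print(beta_star, var_star ** 0.5)
```

When one model receives posterior probability 1, the between-model term vanishes and the result equals that model's own estimate and variance.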

3.4 Model Checking

Conditional independence between Y and D given S and X is a key assumption with this modeling approach. A simple method for checking this assumption for a given class size M was originally proposed by Lin, McCulloch, and Rosenheck (2004) as a modification to the test proposed by Bandeen-Roche et al. (1997). The goal is to test the null hypothesis that model (4) holds versus the alternative that the true model is

\begin{matrix} g {E (Y_{i t} ∣ b_{i}, S_{i}, D_{i})} \\ = Δ_{i t} + b_{i} + \sum_{j = 1}^{M} S_{i j} Z_{i t}^{T} α^{(j)} + \sum_{j = 1}^{J} h_{j} (D_{i}) ϕ_{j}, \end{matrix}

(9)

where each h_j(·) is a known function and the φ's are parameters. The null hypothesis is that φ₁ = ··· = φ_J = 0. A simple example with J = 1 is h(D_i) = D_i, which would assume a linear effect of D_i. If class membership S were known, then we could simply fit both the full model (9) and reduced model (3) using ML, and carry out a likelihood ratio test with J degrees of freedom. Because S is unknown, Lin et al. (2004) proposed the following approach.

First, fit the null model (3). We can then estimate the posterior probability of class membership for each subject as

\begin{matrix} \hat{P} (S_{i j} = 1 ∣ D_{i}, Y_{i}, X_{i}; \hat{ω}) \\ = \frac{\int L_{i} (Y_{o b s, i} ∣ S_{i j} = 1, b_{i}; {\hat{α}}^{(j)}, \hat{ϕ}) p (S_{i j} = 1 ∣ D_{i}; \hat{λ}) p (D_{i} ∣ \hat{γ}) d F (b_{i} ∣ S_{i j}, {\hat{θ}}_{j})}{L_{i} (Y_{i}, D_{i}; \hat{ω})}, \end{matrix}

where L_i(Y_i, D_i; ω) was defined in (6). The next step is to create M replicate pseudo data sets for each record, setting the latent class variable equal to j for the jth replicate of that record. In other words, the entire data set will be replicated M times, and the latent class variable will be set to j for every record in the jth replicate of the data set. Each record is then assigned a case weight based on the corresponding posterior probability of S. For example, a case weight of $\hat{P} (S_{i j} = 1 ∣ D_{i}, Y_{i}, X_{i}; \hat{ω})$ will be assigned to the jth replicate of subject i's data. We can then fit models (9) and (3) using the weighted likelihood, and carry out the likelihood ratio test.
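The replication step can be sketched as follows. This is a minimal illustration of the data layout only, assuming the posterior class probabilities have already been computed; the record fields, the function name, and the toy values are hypothetical, and the weighted model fits themselves are not shown.

```python
def build_weighted_pseudo_data(records, posterior_probs):
    """Replicate every record once per latent class and attach a case weight.

    records: list of dicts, each with a 'subject' key.
    posterior_probs: dict mapping subject id to a list of M class probabilities.
    """
    n_classes = len(next(iter(posterior_probs.values())))
    pseudo = []
    for j in range(n_classes):  # j-th replicate of the whole data set
        for rec in records:
            row = dict(rec)
            row["latent_class"] = j + 1
            row["case_weight"] = posterior_probs[rec["subject"]][j]
            pseudo.append(row)
    return pseudo

# Toy data: two subjects, three records, M = 2 classes (values are made up).
records = [
    {"subject": 1, "visit": 1, "y": 0},
    {"subject": 1, "visit": 2, "y": 1},
    {"subject": 2, "visit": 1, "y": 1},
]
posterior_probs = {1: [0.7, 0.3], 2: [0.2, 0.8]}
pseudo = build_weighted_pseudo_data(records, posterior_probs)
print(len(pseudo))
```

Each original record appears M times, and its case weights across the replicates sum to one, so the weighted data set carries the same total weight as the original.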

4. Example

As briefly described in the Introduction, we were interested in analyzing data on the longitudinal course of depression of 850 HIV-infected women from the HERS. Depression was measured using the Center for Epidemiologic Studies Depression Scale (CES-D). The CES-D includes 20 questions related to mood, each of which can take a value from 0 (symptom rarely present) to 3 (symptom almost always present). Larger scores indicate the presence of more symptoms, and scores range from 0 to 60. A score of 16 or greater is frequently used as a depression cutoff (e.g., Cook et al., 2004). We therefore defined our outcome Y_it as the indicator of depression at visit t, meaning it took a value of 1 if subject i had a CES-D ≥ 16 at visit t, and took a value of 0 otherwise. Our goal was to describe changes in depression over time as a function of baseline characteristics, such as race/ethnicity, number of HIV-related symptoms, injection drug use (IDU), and number of recent adverse events (such as homelessness, violence, and death of a close person).

The observed proportion of depression decreased over time. However, the sample mean is only a valid estimate of the prevalence at each visit if the missing data were missing completely at random (MCAR); it would not be surprising if depression status was related to dropout. There was a substantial amount of dropout. By visit 12, less than half of the original sample remained in the study. We would like to account for the possibility that the prevalence of depression over time might be related to the dropout time.

4.1 Models

We first fitted a marginally specified logistic regression model under the MAR assumption. This could also be thought of as a special case of the proposed latent class model, but with M = 1 class. We assumed models (1) and (2) hold, where the covariate vector includes an intercept, indicator of black race (black), an indicator of Hispanic ethnicity (latina), an indicator of other race/ethnicity (other), number of HIV-related symptoms during the 6 months prior to the baseline visit (symptoms), an indicator that the subject has been an IDU, number of adverse events in 6 months prior to the baseline visit (adverse), and the HERS visit number (visit). Only visit was a time-varying covariate. White race was the reference category for the race/ethnicity variable.

We next fitted models (3–5), with M equal to 2, 3, and 4 classes. The covariate vector X_it was the same as used in the previous model. We also set Z_it = X_it, meaning that every covariate was allowed to have an effect that varied by dropout class. In order to carry out the model averaging, we needed to estimate the posterior probability for the number of classes. We considered two prior distributions p(m): a discrete uniform prior and a truncated Poisson prior distribution for M – 1, with mean equal to 0.5. The truncated Poisson prior placed more prior weight on smaller numbers of classes; specifically, the probabilities were 0.6076, 0.3038, 0.0759, 0.0127 for M = 1, 2, 3, and 4, respectively. The posterior distribution of the number of classes for the uniform and truncated Poisson priors was estimated using equation (8). Once the posterior probabilities of the number of classes were calculated, we were able to estimate β* as described in Section 3.3. All models were fitted using R 2.2.1 software (http://www.r-project.org). We wrote functions to calculate each type of likelihood, and used the generic optimization function optim to maximize these likelihoods. More details are given in the Appendix.
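The truncated Poisson prior probabilities quoted above can be recovered with a few lines of code. The sketch assumes that 0.5 is the rate of the Poisson distribution for M – 1 before truncation and that the distribution is renormalized over M = 1, ..., 4; the function name is hypothetical.

```python
import math

def truncated_poisson_prior(rate, max_classes):
    """Prior on m = 1..max_classes, with m - 1 ~ Poisson(rate) truncated to 0..max_classes-1."""
    raw = [math.exp(-rate) * rate ** k / math.factorial(k) for k in range(max_classes)]
    total = sum(raw)
    return [r / total for r in raw]

prior = truncated_poisson_prior(rate=0.5, max_classes=4)
print([round(p, 4) for p in prior])
```

Under this reading the rounded values agree with the probabilities 0.6076, 0.3038, 0.0759, and 0.0127 reported in the text.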

4.2 Results

The results are given in Tables 1 and 2. In Table 1, we compared the four models based on the components of the Laplace approximation of the marginal distribution (7) and the corresponding approximate posterior distribution of the number of classes. First, we examined the maximized likelihood, $p (y ∣ \hat{ω})$ . There was a substantial increase in the likelihood (relative to the increase in the number of parameters) by going from 1 to 2 classes. Similarly, there was a modest gain in the likelihood by going from 2 to 3 classes. The likelihood for the four-class model was almost identical to that in the three-class model. The four-class model provided essentially the same fit as the three-class model, but with nine extra parameters. Besides the maximized likelihood, the term (d/2)log(2π) always increases as the number of parameters (d) increases. However, the determinant of the estimated covariance matrix, $∣ \hat{Σ} ∣$ typically decreases as the number of parameters increases; this acts as a “penalty” term for adding parameters. In particular, consider the comparison between models 3 and 4. In model 4 we added nine new parameters. These parameters did little to improve the fit to the data, as the likelihood only increased by a small amount. These parameters were not well identified by the model, and tended to have large variances and high correlation with other parameters. This caused the determinant of the estimated covariance matrix to be considerably smaller than from the three-class model.

Table 1.

The components of the Laplace approximation to the marginal likelihood and the corresponding approximate posterior model probabilities under two priors for the number of classes: a discrete uniform prior and a truncated Poisson prior

	Number of classes
	1	2	3	4
Number of parameters	9	19	28	37
log likelihood	–3571.829	–3501.31	–3489.768	–3489.751
$(1 ∕ 2) \log ∣ \hat{Σ} ∣$	–24.51	–44.07	–55.84	–78.22
d/2 log (2π)	8.27	17.46	25.73	34.00
P(m|y), uniform prior	0	0	1	0
P(m|y), truncated Poisson prior	0	0	1	0

Table 2.

Estimates and standard errors of marginal coefficients β^(m). The estimated covariate effects averaged over classes, β*, are also given in the column for M = 3.

	Number of classes
	1		2		3		4
Parameter	Estimate	SE	Estimate	SE	Estimate	SE	Estimate	SE
Intercept	1.36	0.08	–1.11	0.32	–1.01	0.26	–1.01	0.28
Black	–0.79	0.28	0.67	0.35	0.56	0.25	0.56	0.25
Latina	0.39	0.30	1.37	0.33	1.26	0.27	1.26	0.30
Other	1.06	0.30	0.85	0.38	0.71	0.26	0.71	0.27
Idu	0.59	0.29	0.47	0.19	0.34	0.11	0.35	0.11
Symptom	0.20	0.13	0.45	0.06	0.48	0.05	0.48	0.05
Adverse	0.31	0.04	0.31	0.05	0.32	0.04	0.32	0.05
Visit	–0.04	0.007	–0.02	0.009	–0.03	0.011	–0.03	0.011

The posterior distribution of M was insensitive to the choice of the prior (p(M = 3|y, x) = 0.9997 with the uniform prior, and p(M = 3|y, x) = 0.9987 with the truncated Poisson prior). The three-class model was the clear “winner” based on the posterior model probabilities; no reasonable prior would change this conclusion. Due to the closeness of the posterior probability of the three-class model to 1, there was no need to carry out the model averaging. In particular, recall that ${\hat{β}}^{⋆} = \sum_{m} {\hat{β}}^{(m)} \hat{p} (m ∣ y)$ . Because p̂(M = 3|y) rounds to 1 (Table 1), the estimated parameters from the three-class model, ${\hat{β}}^{(3)}$ , were essentially equivalent to the estimated parameters averaged over the number of classes, ${\hat{β}}^{*}$ .

The marginal regression coefficient estimates are presented in Table 2 for each model. The parameter estimates from the one-class model were quite different from the models with multiple classes. For example, based on the one-class model, we might conclude that the prevalence of depression was lower for blacks. However, once we account for dropout using the latent class model, we conclude the opposite.

Because the posterior probabilities overwhelmingly favored the three-class model, we will now focus on this model for our conclusions. Blacks, Latinas, and other non-white racial and ethnic groups were estimated to have a significantly higher prevalence of depression as compared with whites. IDU, the number of adverse events, and HIV-related symptoms were associated with higher prevalence of depression. There was a significant, but somewhat gradual, decline in depression over time. We also considered interactions between race/ethnicity and visit number, but these interactions did not appear to be important in describing the data.

Table 3 displays estimated latent class probabilities as a function of dropout time, using the estimated values of λ, the ordinal regression parameters in (5). Individuals who dropped out early (after visit 1) were very likely to be in class 1. Individuals who remained in the study until the end were most likely to be in class 2. Class 3 consisted of a small subpopulation of the subjects who dropped out in the final few visits of the study.

Table 3.

Comparison of the estimated latent class probabilities as a function of the dropout time for the three-class model

	Dropout time (visit number of last observed value)
Class	1	2	3	4	5	6	7	8	9	10	11	12
1	0.89	0.86	0.81	0.75	0.68	0.61	0.53	0.44	0.36	0.29	0.22	0.17
2	0.09	0.13	0.17	0.21	0.27	0.32	0.38	0.43	0.47	0.49	0.50	0.48
3	0.01	0.02	0.03	0.04	0.05	0.07	0.09	0.12	0.17	0.22	0.28	0.35

4.3 Checking the Conditional Independence Assumption

We used the method described in Section 3.4 to test the null hypothesis of conditional independence. For each value of M (1–4), we fitted model (9), with $\sum_{j = 1}^{J} h_{j} (D_{i}) ϕ_{j} = D_{i} ϕ$ . The test statistic, which, under the null hypothesis follows an approximate $χ_{1}^{2}$ distribution, had values of 7.81, 2.64, 0.41, and 0.41 for M = 1 to M = 4, respectively. Thus, with respect to the specific alternative of a linear effect of dropout time, the conditional independence assumption appeared to be reasonable for M = 3.
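For reference, the test statistics above can be converted to approximate p-values using the fact that the upper tail of a chi-square distribution with 1 degree of freedom equals erfc(√(x/2)). The short sketch below does this with the Python standard library; the function name is hypothetical.

```python
import math

def chi2_sf_1df(x):
    """Upper tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Test statistics reported for M = 1, ..., 4.
for m, stat in zip((1, 2, 3, 4), (7.81, 2.64, 0.41, 0.41)):
    print(m, round(chi2_sf_1df(stat), 4))
```

The resulting p-values are roughly 0.005 for M = 1, 0.10 for M = 2, and 0.52 for M = 3 and M = 4.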

5. Simulation Study

We carried out a brief simulation study, primarily to examine the effectiveness of the approximation to fully Bayesian inference. For covariates, we used variables from the HIV data described in the previous section. In particular, the X matrix included an intercept, the indicator of IDU, and visit number. The true values of the β parameters were –1.1, 0.45, and –0.02 for the intercept, IDU, and visit, respectively.

We first generated the response for the case where M = 1 (where the MAR assumption holds). The missing data pattern was just the observed pattern from the HIV data. The response was generated from models (1) and (2). We also generated data for the case where M = 2. The missing data pattern was the same, but now the response depended on class membership. The latent class variable was generated from model (5) with λ₀₁ = 4 and λ₁ = –0.7. We then set α⁽¹⁾ = (0.003, – 0.16, 0.24)^T in (4) and generated the response. In each case, the variance of the random intercept was θ = 4. These parameter values are equal to their estimated values from the two-class model fitted in the previous section.

For each generated data set, we fitted a marginally specified logistic regression model under the MAR assumption (M = 1). We also fitted the latent class model proposed in the manuscript. In that case, we fitted a one-, two-, and three-class model, and carried out model averaging assuming a discrete uniform prior over the three classes. One hundred simulated data sets were analyzed under each scenario. The percentage bias, average estimated standard error (SE), the estimated standard deviation of the estimates (ESD), as well as coverage probability were recorded. For model averaging, ${\hat{β}}^{*}$ was reported. The results are given in Table 4.
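The summary measures can be computed along the following lines. This is a sketch under assumptions: percentage bias is taken relative to the absolute true value so that its sign indicates the direction of the bias, and coverage uses Wald intervals with a 1.96 multiplier; neither convention is stated explicitly in the text. The function name and the toy inputs are hypothetical.

```python
import statistics

def summarize_simulation(estimates, std_errors, true_value, z=1.96):
    """Percentage bias, average SE, empirical SD, and coverage for one coefficient."""
    mean_estimate = statistics.mean(estimates)
    covered = sum(
        1
        for est, se in zip(estimates, std_errors)
        if est - z * se <= true_value <= est + z * se
    )
    return {
        # Assumed convention: bias relative to |true value|.
        "pct_bias": 100.0 * (mean_estimate - true_value) / abs(true_value),
        "avg_se": statistics.mean(std_errors),
        "esd": statistics.stdev(estimates),
        "coverage": covered / len(estimates),
    }

# Made-up estimates of a coefficient whose true value is -0.02.
summary = summarize_simulation(
    estimates=[-0.021, -0.018, -0.025, -0.019, -0.020],
    std_errors=[0.01] * 5,
    true_value=-0.02,
)
print(summary)
```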

Table 4.

Results from simulation study. Percentage bias, average of the estimated standard errors (SE), empirical standard deviation (ESD), and 95% coverage probabilities are reported for the estimated marginal regression coefficients.

	Fitted model: MAR				Fitted model: latent class
Parameter	% bias	SE	ESD	coverage	% bias	SE	ESD	coverage
	True model: MAR
Intercept	0.1	0.07	0.08	0.86	3.5	0.07	0.08	0.87
IDU	–1.1	0.12	0.14	0.91	–1.9	0.12	0.14	0.91
Visit	0.0	0.01	0.01	0.98	–3.2	0.01	0.01	0.93
	True model: latent class (M = 2)
Intercept	7.6	0.07	0.07	0.74	8.0	0.07	0.07	0.98
IDU	–0.9	0.13	0.16	0.87	–1.6	0.12	0.12	0.94
Visit	–582	0.01	0.01	0.00	–26	0.02	0.02	0.91

When the data were generated under the MAR assumption (M = 1), both modeling approaches worked reasonably well. The estimates had very little bias. The SEs tended to be slightly underestimated. Coverage was below the nominal level for the intercept. We did not expect the coverage and SEs to be exact as we used large sample results for inference here.

When data were generated from the two-class model (MAR assumption violated), the model that relied on the MAR assumption (M = 1) no longer performed well. In general, coverage probabilities were too low. In particular, the estimated coefficient of visit number had a large negative bias (582%) and no coverage. The model averaging approach yielded better results. The coefficient of visit number had negative bias (26%) with coverage probability of 0.91. The bias comes from putting some weight on the incorrect model (MAR); the coefficient of visit number conditional on M = 2 had a bias of just 3%.

For data generated from the one-class model (MAR), the one-class model had the highest posterior probability in 44% of samples. Here, the two-class model was slightly favored, which is only an incorrect model in the sense that it has more parameters than necessary. For data generated from the two-class model, the two-class model had the highest posterior probability in 81% of samples. The one-class model (MAR) only had the highest probability in 2% of samples.
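To make the role of the posterior model probabilities concrete, the following sketch shows one way such probabilities and a model-averaged estimate could be formed. It is illustrative only: it takes as input an approximate log marginal likelihood for each candidate M (however obtained; the approximation used in this article is the one defined in Section 3.2 and is not reproduced here), and it assumes the standard Bayesian model averaging rule of weighting the per-model estimates by their posterior probabilities. The function names are hypothetical.

```python
import math


def posterior_model_probs(log_marglik, prior=None):
    """Posterior probabilities over candidate numbers of classes.

    log_marglik : approximate log marginal likelihoods, one per candidate M
    prior       : prior probabilities; defaults to a discrete uniform prior
    """
    m = len(log_marglik)
    if prior is None:
        prior = [1.0 / m] * m
    log_post = [l + math.log(p) for l, p in zip(log_marglik, prior)]
    # Subtract the maximum before exponentiating for numerical stability.
    mx = max(log_post)
    w = [math.exp(v - mx) for v in log_post]
    total = sum(w)
    return [v / total for v in w]


def model_average(estimates, variances, probs):
    """Probability-weighted estimate and its variance (standard BMA rule).

    The variance adds the between-model spread of the estimates to the
    weighted within-model variances.
    """
    avg = sum(p * b for p, b in zip(probs, estimates))
    var = sum(
        p * (v + (b - avg) ** 2)
        for p, v, b in zip(probs, variances, estimates)
    )
    return avg, var
```

Under this rule, any posterior weight placed on the one-class (MAR) model pulls the averaged coefficient toward the MAR estimate, which is the mechanism behind the residual bias in the model-averaged visit coefficient described above.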

To confirm that the model probabilities would converge to the correct values as the sample size increased, we simulated data from the same model as described above, but with a sample size of 3400 (4 copies of the covariate data from 850 subjects were used). We fitted five simulated data sets from the one-class model (MAR) and from the two-class model. In each case, the posterior model probability for the correct M was greater than 0.99.

6. Discussion

We have proposed a new model for dealing with nonignorable missing data that parsimoniously addresses data sets with many possible dropout times (in an automated fashion) and directly models the marginal covariate effects of interest. Via approximate posterior model probabilities for the number of latent classes, this approach properly takes into account uncertainty in the unknown number of classes.

We fitted the model using approximate Bayesian methods. Reversible jump Markov chain Monte Carlo methods (Green, 1995) would be required to fit a fully Bayes model because the dimension of the parameter space changes with the number of latent classes.

For the model proposed here, we have assumed a simple within-class longitudinal dependence structure through the introduction of a random intercept. More flexible specifications of the dependence structure could be obtained by replacing the scalar random effect b_i with a set of correlated random effects b_i = (b_i₁,...,b_iT) (though this would necessitate higher dimensional numerical integration) or by allowing dependence through a Markov transition structure within class (Heagerty, 2002).

Alternative methods for specifying marginal effects for correlated binary data have been proposed. Caffo, An, and Rohde (2006) proposed a model for binary data with random effects, which uses mixtures of normals. Their approach is less computationally intensive than the Heagerty (1999) approach that we implemented here. However, extending their approach to also average over the discrete latent dropout distribution would likely prove challenging. In particular, the additional step of averaging over the latent dropout classes would make it difficult to preserve the marginal probit interpretation. Wang and Louis (2003) proposed a bridge distribution function for binary random intercept models. However, extending their approach to our setting would likely pose problems similar to those of the mixture of normals approach of Caffo et al. (2006).

The model proposed here assumes conditional independence between the outcome and dropout processes, given the latent class and covariates. We tested this assumption against a very simple alternative hypothesis (linear effect of dropout time). A more complicated approach would be to leave the functional form of the dependence unspecified. Specifically, we could assume

g \{E (Y_{i t} ∣ b_{i}, S_{i}, D_{i})\} = Δ_{i t} + b_{i} + \sum_{j = 1}^{M} S_{i j} Z_{i t}^{T} α^{(j)} + f (D_{i}),

where f(·) is a smooth, but otherwise unspecified function. The null hypothesis of conditional independence would be f(D_i) = 0. We plan to explore a score-type test similar to that proposed by Zhang and Lin (2003) and Lin, Zhang, and Davidian (2006) and examine its asymptotic distribution for the models proposed here.

Acknowledgements

Dr Roy's research was supported by National Institutes of Health grant R01-HL-79457. Dr Daniels received National Institutes of Health grants R01-HL-79457 and R01-CA-85295. Data from the HER Study were collected under grant U64-CCU10675 from the U.S. Centers for Disease Control and Prevention. The authors thank the associate editor and a referee for their helpful comments and suggestions.

Appendix

Computational Details of ML

We propose the following approach to compute the ML estimates. First, we obtain initial values of the parameters. Initial values of β and θ could be obtained from ML estimates of a model that assumes an ignorable missing data mechanism. The parameters λ initially should be selected in a way that leads to marginal probabilities not too close to zero for any latent class. Initial values of α could be obtained by fitting a pattern mixture model with M groups of dropout times that have fixed boundaries. Given the data and parameters ω, we next calculate Δ_it for all i and t. We accomplish this using Newton Raphson with numerical differentiation and integration. Specifically, we solve $h (Δ_{i t}) - g^{- 1} (X_{i t}^{T} β) = 0$ for Δ_it, where

h (Δ_{i t}) = \sum_{d = 1}^{T} \sum_{j = 1}^{M} \left\{ \int g^{- 1} (Δ_{i t} + b_{i} + Z_{i t}^{T} α^{(j)}) \, p (b_{i} ∣ S_{i j} = 1) \, d b_{i} \right\} p (S_{i j} = 1 ∣ D_{i} = d) \, p (D_{i} = d)

and p(b_i|S_ij = 1) is N(0, θ_j), and p(S_ij = 1|D_i = d) can be found using equation (5). A 10-point Gauss–Hermite quadrature is used to integrate out the random effects b_i from the above equation. The derivative of h(Δ_it) with respect to Δ_it, denoted h′(Δ_it), is found using standard numerical techniques. We then find the value of Δ_it by repeatedly calculating $Δ_{i t}^{new} = Δ_{i t}^{old} - \{h (Δ_{i t}^{old}) - g^{- 1} (X_{i t}^{T} β)\} ∕ h^{′} (Δ_{i t}^{old})$ until convergence. Once we have values of Δ_it for the current set of parameters ω, we can then evaluate the likelihood (6), where again Gauss–Hermite quadrature is used to evaluate the integral. Many possible algorithms could then be used to find the ML estimates. For example, one could use a Newton Raphson approach, which would require calculating the likelihood at various points to get numerical estimates of the score and Hessian at each step. However, the log likelihood for many latent class models tends to be poorly behaved (e.g., more than one local maximum), so algorithms such as Newton Raphson or Fisher scoring may not perform well. Our recommendation is to start with a more stable, robust algorithm, such as Nelder–Mead, and then switch to a faster algorithm such as Newton Raphson for the final steps.
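The inner step of solving for Δ_it can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the analyses: it assumes a logit link for g, treats θ_j as the variance of the random intercept in class j, takes the table of p(S_ij = 1|D_i = d) from equation (5) as a supplied input, and uses a central finite difference for h′. The function names (`expit`, `h`, `solve_delta`) and argument names are hypothetical.

```python
import numpy as np


def expit(x):
    """Inverse logit link g^{-1} (assumption: g is the logit)."""
    return 1.0 / (1.0 + np.exp(-x))


def h(delta, z_alpha, theta, class_prob, nodes, weights):
    """Marginal mean implied by delta, averaging over classes and b_i.

    z_alpha    : Z_it^T alpha^(j) for each class j (one scalar per class)
    theta      : random-intercept variance theta_j for each class
    class_prob : sum over d of p(S_ij = 1 | D_i = d) p(D_i = d), per class
    """
    total = 0.0
    for za_j, theta_j, pi_j in zip(z_alpha, theta, class_prob):
        # Gauss-Hermite rule for b ~ N(0, theta_j): b = sqrt(2 theta_j) x.
        b = np.sqrt(2.0 * theta_j) * nodes
        integral = np.sum(weights * expit(delta + b + za_j)) / np.sqrt(np.pi)
        total += pi_j * integral
    return total


def solve_delta(eta, z_alpha, theta, p_class_given_d, p_d,
                n_nodes=10, tol=1e-10, max_iter=100, eps=1e-5):
    """Solve h(delta) - g^{-1}(eta) = 0 by Newton Raphson.

    eta             : marginal linear predictor X_it^T beta
    p_class_given_d : array (T, M) of p(S_ij = 1 | D_i = d), supplied as input
    p_d             : array (T,) of p(D_i = d)
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    # The integral does not depend on d, so the sum over d collapses
    # to the marginal class probabilities.
    class_prob = np.asarray(p_d) @ np.asarray(p_class_given_d)
    target = expit(eta)
    delta = float(eta)  # start at the marginal linear predictor
    for _ in range(max_iter):
        resid = h(delta, z_alpha, theta, class_prob, nodes, weights) - target
        # Central finite-difference approximation to h'(delta).
        deriv = (h(delta + eps, z_alpha, theta, class_prob, nodes, weights)
                 - h(delta - eps, z_alpha, theta, class_prob, nodes, weights)
                 ) / (2.0 * eps)
        step = resid / deriv
        delta -= step
        if abs(step) < tol:
            return delta
    raise RuntimeError("Newton iteration did not converge")
```

In a full fit this solve would be repeated for every subject and time point at each evaluation of the likelihood, so it is worth vectorizing over (i, t) in practice; the scalar version above is kept for readability.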

Contributor Information

Jason Roy, Center for Health Research, Geisinger Health System, Danville, Pennsylvania 17822, U.S.A. jaroy@geisinger.edu.

Michael J. Daniels, Departments of Epidemiology and Biostatistics and Statistics, University of Florida, Gainesville, Florida 32611-8545, U.S.A. mdaniels@stat.ufl.edu

References

Bandeen-Roche K, Miglioretti DL, Zeger SL, Rathouz PJ. Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association. 1997;92:1375–1386.
Caffo B, An M-W, Rohde CA. A flexible general class of marginal and conditional random intercept models for binary outcomes using mixtures of normals. Johns Hopkins University; 2006. Available at: http://www.bepress.com/jhubiostat/paper98. Department of Biostatistics Working Papers. Working Paper 98.
Cook JA, Grey D, Burke J, Cohen MH, Gurtman AC, Richardson JL, Wilson TE, Young MA, Hessol NA. Depressive symptoms and AIDS-related mortality among a multisite cohort of HIV-positive women. American Journal of Public Health. 2004;94:1133–1140. doi: 10.2105/ajph.94.7.1133.
Diggle PJ, Heagerty PJ, Liang K-Y, Zeger SL. The Analysis of Longitudinal Data. 2nd edition. Oxford University Press; New York: 2002.
Fitzmaurice GM, Laird NM, Shneyer L. An alternative parameterization of the general linear mixture model for longitudinal data with non-ignorable dropouts. Statistics in Medicine. 2001;20:1009–1021. doi: 10.1002/sim.718.
Green PJ. Reversible jump MCMC computation and Bayesian model determination. Biometrika. 1995;82:711–732.
Heagerty PJ. Marginally specified logistic-normal models for longitudinal binary data. Biometrics. 1999;55:688–698. doi: 10.1111/j.0006-341x.1999.00688.x.
Heagerty PJ. Marginalized transition models and likelihood inference for longitudinal categorical data. Biometrics. 2002;58:342–351. doi: 10.1111/j.0006-341x.2002.00342.x.
Hogan JW, Laird NM. Mixture models for the joint distribution of repeated measures and event times. Statistics in Medicine. 1997;16:239–257. doi: 10.1002/(sici)1097-0258(19970215)16:3<239::aid-sim483>3.0.co;2-x.
Hogan JW, Roy J, Korkontzelou C. Handling dropout in longitudinal data. Statistics in Medicine. 2004;23:1455–1497. doi: 10.1002/sim.1728.
Lin H, McCulloch CE, Rosenheck RA. Latent pattern mixture models for informative intermittent missing data in longitudinal studies. Biometrics. 2004;60:295–305. doi: 10.1111/j.0006-341X.2004.00173.x.
Lin J, Zhang D, Davidian M. Smoothing spline-based score tests for proportional hazards models. Biometrics. 2006;62:803–812. doi: 10.1111/j.1541-0420.2005.00521.x.
Little RJA. A class of pattern mixture models for normal missing data. Biometrika. 1994;81:471–483.
Little RJA. Modeling the drop-out mechanism in repeated-measures studies. Journal of the American Statistical Association. 1995;90:1112–1121.
Roy J. Modeling longitudinal data with non-ignorable dropouts using a latent dropout class model. Biometrics. 2003;59:829–836. doi: 10.1111/j.0006-341x.2003.00097.x.
Smith DK, Warren DL, Vlahov D, Schuman P, Stein MD, Greenberg BL, Holmberg SD. Design and baseline participant characteristics of human immunodeficiency virus epidemiology research (HER) study: A prospective cohort study of human immunodeficiency virus infection in US women. American Journal of Epidemiology. 1997;146:459–469. doi: 10.1093/oxfordjournals.aje.a009299.
Wang Z, Louis T. Matching conditional and marginal shapes in binary random intercept models using a bridge distribution function. Biometrika. 2003;90:765–775.
Wilkins KJ, Fitzmaurice GM. A hybrid model for nonignorable dropout in longitudinal binary responses. Biometrics. 2006;62:168–176. doi: 10.1111/j.1541-0420.2005.00402.x.
Zhang D, Lin X. Hypothesis testing in semiparametric additive mixed models. Biostatistics. 2003;4:57–74. doi: 10.1093/biostatistics/4.1.57.
