Abstract
Longitudinal ordinal data are common in many scientific studies, including those of multiple sclerosis (MS), and are frequently modeled using Markov dependency. Several authors have proposed random-effects Markov models to account for heterogeneity in the population. In this paper, we go one step further and study prediction based on random-effects Markov models. In particular, we show how to calculate the probabilities of future events and confidence intervals for those probabilities, given observed data on the ordinal outcome and a set of covariates, and how to update them over time. We discuss the usefulness of depicting these probabilities for visualization and interpretation of model results and illustrate our method using data from a phase III clinical trial that evaluated the utility of interferon beta-1a (trademark Avonex) for patients with relapsing–remitting MS.
Keywords: Markov model, Ordinal response, Prediction, Transition model
1. INTRODUCTION
Multiple sclerosis (MS) is a chronic inflammatory disease of central nervous system myelin. In time, the disease causes disability, which is measured on the ordinal expanded disability status scale (EDSS). In MS studies, disability is often evaluated semiannually with the aim of estimating the probability of progression, as defined on the EDSS. The natural assumption of Markov dependency provides a convenient framework for the estimation of probabilities of various time-dependent events that are of biological interest (prediction). Examples of such events are reaching a certain level, being in a certain level for 2 consecutive visits, and reaching a group of levels. Mandel and others (2007) developed methodology for analyzing these kinds of events using fixed-effects transition models and applied it to MS data. However, their first-order models suffered from lack of fit, partially due to the heterogeneous nature of the disease.
Transition models for longitudinal data (Diggle and others, 2002) express the joint distribution of repeated measures as a product of conditional distributions and are especially useful when the ultimate goal of the analysis is that of prediction. However, when important subject-specific covariates are not measured, the model explains only part of the true heterogeneity in the data, and moreover the Markov assumption may be violated. One remedy is to incorporate random effects into the Markov transition model. The basic assumption is that conditional on an observed set of covariates and an unobserved latent variable, the sequence of the ordinal variables follows a Markov chain.
Two-state Markov transition models with random effects have been studied by several authors (Cook and Ng, 1997; Albert and Waclawiw, 1998; Albert and Follmann, 2003). These papers assume implicitly that the distribution of the first (baseline) state is independent of the random effect given covariates. This strong and somewhat unnatural assumption is relaxed in a series of papers (e.g. Aitkin and Alfó, 1998, 2003; Alfó and Aitkin, 2000) that suggest models for the influence of the first state on the random-effects distribution. A related framework in which subjects make transitions between states according to a continuous-time Markov process but are only observed at n time points was studied by Kalbfleisch and Lawless (1985). Random effects were introduced to this model by Cook (1999) and Cook and others (2004).
The current paper focuses on the problem of prediction from random-effects Markov models. It is based on estimation results derived in earlier work to make inferences about the future of the process given its past. Specifically, let Y0,Y1,…,Yn be the sequence of the ordinal variables observed for a subject at the equally spaced time points 0,1,…,n and let X be a vector of covariates. We are interested in estimating P(Yv = y|Y0 = y0,Y1 = y1,…,Yn = yn,X = x) for some v > n, or more generally, P(Av|Y0 = y0,Y1 = y1,…,Yn = yn,X = x), where the event Av is determined by (Y0,Y1,…,Yv). Only a few papers have dealt with inferences of this kind in the framework of transition models. Albert (1994) and Mandel and others (2007) studied parameters such as mean first passage time and time-to-event probabilities but only for fixed-effects models. Albert and Waclawiw (1998) also estimated mean first passage time in random-effects models, but only at the population level, and not based on subject-specific history.
In this paper, we extend the methodology of Mandel and others (2007) by developing methods for estimation of time-to-event probabilities and associated confidence intervals under random-effects Markov models. Our estimates take into account subject-specific history and can be updated over time when more data are collected. In the presence of random effects, much more computational effort is required for deriving estimates and confidence intervals and, more importantly, careful interpretation of estimated coefficients and predicted values is necessary. However, proper interpretation of model results with the aid of the graphical tools presented below enables important insights into the longitudinal process studied, for example, into the natural history of MS.
An alternative, more direct, method for estimating P(Yv = y|Y0 = y0,Y1 = y1,…,Yn = yn,X = x) is through its empirical analog, namely, the proportion of subjects having Yv = y among those having (Y0 = y0,Y1 = y1,…,Yn = yn,X = x). It is clear, however, that this proportion can be well estimated only with a large number of independent subjects and for data containing only few covariates. A seemingly natural approach is to use survival analysis methods to predict time-to-event probabilities. However, these methods are not adequate because both time and outcome are discrete and because the event of interest is defined at several time points (see Mandel and others, 2007, for a detailed discussion). Markov models address both these concerns, and thus are attractive. Introducing random effects to Markov models has 2 advantages. First, it extends to processes that depend on the whole history, but still keeps the model parsimonious (unlike, e.g. models that increase the order of the Markov chain). Second, it accounts for and provides measures of the heterogeneity of disease courses, which is typical of many diseases such as MS. That is, heterogeneity is described through both variance component estimates and easily interpretable graphical displays, as shown below.
It is important to distinguish between our usage of the term “prediction” to refer to P(Yv = y|Y0 = y0,Y1 = y1,…,Yn = yn,X = x) and its usage in the linear and generalized linear mixed models (GLMMs) framework (Robinson, 1991; Jiang and Lahiri, 2006). The latter usage of prediction is in a subject-specific sense; either the random effect is estimated or a conditional distribution given the random effect is estimated. Thus, letting U denote the latent random effect, the analog of prediction in the generalized linear models framework is the estimation of U|{Y0 = y0,Y1 = y1,…,Yn = yn,X = x} or P(Yt = y|Y0 = y0,Y1 = y1,…,Yn = yn,X = x,U) and their associated mean-squared errors. Because we do not have many observations on each subject, and subjects are quite heterogeneous, neither of these quantities is suitable in our setting. Instead, we integrate over the random effect to obtain an estimate for future events that is based on observed data alone. This very important distinction is illustrated on the MS data in Section 4.
The paper is organized as follows. Section 2 defines the model and reviews estimation based on previous papers mentioned above. Section 3 deals with prediction under the mixed-effects model. It describes generation of probability estimators and their variances as a function of time conditional on subject-specific covariates and history. Section 4 applies the method to data from a phase III clinical trial of MS patients. It discusses several important points regarding interpretation of the models and provides important insights into the natural history of MS. Section 5 presents results of a simulation study. Section 6 completes the paper with a discussion.
2. PRELIMINARIES
This section gives a brief review of the construction and estimation of the Markov transition mixed model. It also describes an identification problem that affects estimation of model parameters, but not prediction.
2.1. The model
Consider a discrete-time Markov process over the ordinal states {1,2,…,J}. Let Yiv be the state of subject i at visit (time) v, let Ui∼G (with density g) denote a subject-specific latent variable, and let Xi denote a vector of covariates. Our inference will be conditional on the vector of covariates and the baseline state (Xi,Yi0). Using lowercase letters for realizations of random variables, the data for subject i having ni transitions are (yi0,yi1,…,yini,xi) and, omitting the subscript i, its contribution to the (conditional) likelihood, ℒi, is
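$$\mathcal{L}_i = \int \prod_{v=1}^{n} P(Y_v = y_v \mid Y_{v-1} = y_{v-1}, X = x, U = u)\, g(u \mid Y_0 = y_0, X = x)\, du. \tag{2.1}$$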
The 2 components of the likelihood, the transition probabilities and the distribution of the latent variable, should be modeled. Following Aitkin and Alfó (1998), we assume that $g(u \mid Y_0 = y_0, X = x) = \sigma^{-1} g_0([u - \beta_{0,y_0}]/\sigma)$, with $\beta_{0,1} = 0$ and for some known g0, for example, the standard normal density. Thus, given the initial state, the random effect is independent of the covariates, and the initial state affects g only through its location. Then, after changing variables, the likelihood contribution reduces to
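$$\mathcal{L}_i = \int \prod_{v=1}^{n} P(Y_v = y_v \mid Y_{v-1} = y_{v-1}, X = x, U = \sigma u + \beta_{0,y_0})\, g_0(u)\, du. \tag{2.2}$$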
To model the transition probabilities, we follow Mandel and others (2007) and use the partial proportional odds model for ordinal data
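$$\operatorname{logit} P(Y_v \le j \mid Y_{v-1} = k, X = x, U = u) = \alpha_{kj} + \beta'x + \gamma u, \qquad k = 1,\dots,J,\; j = 1,\dots,J-1, \tag{2.3}$$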
where α = (αkj) is a vector of constants that for each k satisfy the constraints of the proportional odds model (Agresti, 2002). Two important assumptions are embedded in (2.3). First, the Markov model is time homogeneous. Second, the transition probabilities depend on X and U only through their linear combination. The first assumption can be relaxed by considering time-varying models for the parameters αkj or β. The second assumption reflects our view of the random effect as representing unmeasured covariates that, if observed, would be modeled as we do the observed covariates.
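As a minimal sketch of how a partial proportional odds specification translates into a one-step transition matrix, the following Python snippet builds the matrix from a set of cut points and a single linear predictor. It assumes the cumulative-logit convention logit P(Yv ≤ j | Yv−1 = k) = αkj + w; the function names and the numerical cut points are hypothetical and are not estimates from the paper.

```python
import math

def expit(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def transition_matrix(alpha, w):
    """One-step transition matrix at linear predictor w.

    alpha[k-1] holds the J-1 nondecreasing cut points for previous state k.
    Assumed convention: logit P(Y_v <= j | Y_{v-1} = k) = alpha_kj + w.
    """
    P = []
    for cuts in alpha:
        cum = [expit(a + w) for a in cuts] + [1.0]  # P(Y_v <= j), j = 1..J
        row = [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]
        P.append(row)
    return P

# Hypothetical cut points for J = 3 states (illustration only).
alpha = [[1.0, 3.0], [-1.0, 1.5], [-3.0, -0.5]]
P = transition_matrix(alpha, w=0.5)
for row in P:
    print([round(p, 3) for p in row])
```

Because the same w is added to every cut point, covariates and the random effect shift all rows of the matrix in the same direction, while the baseline matrix itself is left unconstrained across previous states.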
Combining (2.2) and (2.3), the likelihood contribution becomes
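$$\mathcal{L}_i = \int \prod_{v=1}^{n} p_{y_{v-1},y_v}\big(\beta'x + \gamma[\sigma u + \beta_{0,y_0}];\alpha\big)\, g_0(u)\, du, \tag{2.4}$$
where $p_{k,j}(w;\alpha)$ denotes the transition probability from state k to state j implied by (2.3) at linear predictor w.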
2.2. Identifiability
It is clear from (2.4) that (γ,σ,β0) is not identifiable, where β0 = (β02,…,β0J). This, however, does not present a problem for estimation of β or for prediction, hence γ will be set to 1 in the sequel, giving the final working model
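$$\mathcal{L}_i = \int \prod_{v=1}^{n} p_{y_{v-1},y_v}\big(\beta'x + \sigma u + \beta_{0,y_0};\alpha\big)\, g_0(u)\, du. \tag{2.5}$$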
Suppose that x contains the initial state y0 as one of its elements. Then, it is clear from (2.5) that the effect of y0 as a covariate (i.e. the effect of y0 given U = u) is not identifiable. Thus, interpretation of β0 should be made with care. It seems more reasonable to tie β0 to the random-effects distribution rather than to the transition probabilities when the initial state is somewhat arbitrary, relating to the sampling time. In that case, interpretation of β0 is as the center of the random-effects distribution and not in terms of an odds ratio. Moreover, there is nothing special about y0 in the discussion above, and the same reasoning applies for any other covariate; β is not identifiable if the random-effects density takes the form $g(u \mid Y_0 = y_0, X = x) = \sigma^{-1} g_0([u - \beta_{0,y_0} - \zeta'x]/\sigma)$. This is quite a reasonable form for g(u|Y0 = y0,X = x) when viewing U as an unmeasured covariate and assuming a multivariate normal distribution for (X,U). Although interpretation of model results is problematic, for prediction purposes this identification issue raises no difficulties since prediction is based on β′x + γ[σu + β0,y0], which is identifiable (for given x,u, and y0).
2.3. Estimation
The likelihood of N independent subjects is given by
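$$\mathcal{L} = \prod_{i=1}^{N} \mathcal{L}_i, \tag{2.6}$$
where $\mathcal{L}_i$ is the contribution (2.5) of subject i.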
Estimation of nonlinear random-effects models is done via maximization of (2.6). This is a standard though difficult task. A convenient and flexible routine for maximization of the likelihood is the procedure NLMIXED in SAS, which contains several methods for integration and optimization when g0 is normal (Littell and others, 2006). The expectation–maximization algorithm is an alternative way of maximizing ℒ (Aitkin and Alfó, 1998) without requiring specification of the distribution of the random effects. However, Agresti (2002, pp. 547–548) found that the random-effects distribution has to be extremely nonnormal for the normal GLMM to suffer from bias or inefficiency. Thus, the normal model for the random-effects distribution seems reasonable in most circumstances in terms of simplicity and interpretability.
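To make the structure of the likelihood concrete, the sketch below evaluates one subject's contribution of the form (2.5) by a simple trapezoid rule over the standardized random effect, with g0 taken as the standard normal density. This is not the adaptive Gaussian quadrature used by NLMIXED; it is a plain-grid stand-in. The cumulative-logit sign convention, the cut points, the paths, and the argument `eta` (standing for β′x + β0,y0) are assumptions made for illustration, and missed visits (2-step transitions) are ignored.

```python
import math

def expit(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def transition_matrix(alpha, w):
    """Assumed convention: logit P(Y_v <= j | Y_{v-1} = k) = alpha_kj + w."""
    P = []
    for cuts in alpha:
        cum = [expit(a + w) for a in cuts] + [1.0]
        P.append([cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))])
    return P

def path_prob(path, alpha, w):
    """Probability of the observed one-step transitions at a fixed w."""
    P = transition_matrix(alpha, w)
    prob = 1.0
    for prev, cur in zip(path, path[1:]):
        prob *= P[prev - 1][cur - 1]  # states are coded 1..J
    return prob

def likelihood_contribution(path, alpha, eta, sigma, lo=-8.0, hi=8.0, m=801):
    """Trapezoid approximation of the integral over the standardized effect u."""
    h = (hi - lo) / (m - 1)
    total = 0.0
    for i in range(m):
        u = lo + i * h
        end_weight = 0.5 if i in (0, m - 1) else 1.0
        density = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
        total += end_weight * path_prob(path, alpha, eta + sigma * u) * density
    return total * h

def log_likelihood(paths, etas, alpha, sigma):
    """Log of the product over independent subjects."""
    return sum(math.log(likelihood_contribution(p, alpha, e, sigma))
               for p, e in zip(paths, etas))

# Hypothetical inputs (illustration only).
alpha = [[1.0, 3.0], [-1.0, 1.5], [-3.0, -0.5]]
paths = [[2, 3, 2, 2], [1, 1, 2]]
print(log_likelihood(paths, [0.3, 0.0], alpha, sigma=2.0))
```

In practice this function would be handed to a general-purpose optimizer; starting the search from the fixed-effects fit (σ = 0) mirrors the strategy described for the MS data.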
2.4. Notation
The following notation will be used in the sequel: θ≡(β′,β0′)′ is the vector of fixed effects, ϑ≡(α′,θ′,σ)′ is a vector of length m of all unknowns, z≡(x′,I{y0 = 2},…,I{y0 = J})′ is the vector of observed predictors, where I{·} is the indicator function, and w≡θ′z + σu is the linear predictor. Depending on the context, the transition probabilities will be denoted by $p_{y_{v-1},y_v}(\theta'z + \sigma u;\alpha)$, $p_{y_{v-1},y_v}(x,u,y_0;\vartheta)$, or $p_{y_{v-1},y_v}(w;\alpha)$. The superscript (s) will be added to denote transitions in s steps, for example, $p^{(s)}_{k,j}(x,u,y_0;\vartheta) = P(Y_{v+s} = j \mid Y_v = k, X = x, U = u, Y_0 = y_0;\vartheta)$.
3. PREDICTION
The ultimate goal of our analysis is that of prediction of a future observation given the past observations and covariates
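$$P(Y_v = y_v \mid \{Y_t = y_t\}_{0 \le t \le n}, X = x). \tag{3.1}$$
Note that here, in contrast to the fixed-effects case, $P(Y_v = y_v \mid \{Y_t = y_t\}_{0 \le t \le n}, X = x) \ne P(Y_v = y_v \mid Y_n = y_n, X = x)$ because the distribution of (Y0,Y1,…,Yv) has the Markov property only conditional on U. Thus, the process itself provides information on the latent U which, in turn, is used to predict future events.
Using
$$g(u \mid \{Y_t = y_t\}_{0 \le t \le n}, X = x) = \frac{\prod_{t=1}^{n} p_{y_{t-1},y_t}(x,u,y_0;\vartheta)\, g_0(u)}{\int \prod_{t=1}^{n} p_{y_{t-1},y_t}(x,s,y_0;\vartheta)\, g_0(s)\, ds}, \tag{3.2}$$
(3.1) is estimated by
$$\hat{P}(Y_v = y_v \mid \{Y_t = y_t\}_{0 \le t \le n}, X = x) = \frac{\int p^{(v-n)}_{y_n,y_v}(x,u,y_0;\hat{\vartheta}) \prod_{t=1}^{n} p_{y_{t-1},y_t}(x,u,y_0;\hat{\vartheta})\, g_0(u)\, du}{\int \prod_{t=1}^{n} p_{y_{t-1},y_t}(x,u,y_0;\hat{\vartheta})\, g_0(u)\, du}, \tag{3.3}$$
where $\hat{\vartheta}$ is an estimate of ϑ. This last integral can be calculated using numerical methods. A natural simple way is to conduct Monte Carlo integration with respect to the assumed known density g0. Alternatives are Markov chain Monte Carlo (MCMC), which eliminates the burden of approximating the integral in the denominator, and general numerical integration methods.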
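The ratio of integrals in (3.3) can be sketched with a fixed grid over the standardized random effect, which is one instance of the general numerical integration methods mentioned above. The snippet below is illustrative only: it assumes a standard normal g0 and the cumulative-logit convention logit P(Yv ≤ j | Yv−1 = k) = αkj + w, and the cut points and `eta` (standing for β′x + β0,y0) are hypothetical values, not estimates from the paper.

```python
import math

def expit(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def transition_matrix(alpha, w):
    """Assumed convention: logit P(Y_v <= j | Y_{v-1} = k) = alpha_kj + w."""
    P = []
    for cuts in alpha:
        cum = [expit(a + w) for a in cuts] + [1.0]
        P.append([cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))])
    return P

def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mat_pow(P, s):
    n = len(P)
    R = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(s):
        R = mat_mul(R, P)
    return R

def predict(history, target, steps, alpha, eta, sigma, lo=-8.0, hi=8.0, m=401):
    """P(Y_{n+steps} = target | history), integrating the random effect out.

    history = [y_0, ..., y_n]; a history of length 1 gives the no-transition case.
    Constants common to numerator and denominator (grid step, 1/sqrt(2*pi)) cancel.
    """
    num = den = 0.0
    for i in range(m):
        u = lo + i * (hi - lo) / (m - 1)
        P = transition_matrix(alpha, eta + sigma * u)
        lik = 1.0
        for prev, cur in zip(history, history[1:]):
            lik *= P[prev - 1][cur - 1]
        weight = (0.5 if i in (0, m - 1) else 1.0) * math.exp(-0.5 * u * u)
        num += weight * lik * mat_pow(P, steps)[history[-1] - 1][target - 1]
        den += weight * lik
    return num / den

# Hypothetical cut points (illustration only).
alpha = [[1.0, 3.0], [-1.0, 1.5], [-3.0, -0.5]]
print(predict([1], 1, 1, alpha, eta=0.0, sigma=2.0))           # no observed transitions
print(predict([1, 1, 1, 1], 1, 1, alpha, eta=0.0, sigma=2.0))  # updated after three visits
```

With σ > 0, the second call weights the grid by the likelihood of the observed history, so the prediction is updated by the subject's past; with σ = 0, only the last observed state matters.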
Of special interest for us is the prediction for a subject without any observed transitions. This represents a patient at diagnosis and is analogous to prediction in a model without random effects. For such a case, (3.3) reduces to
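$$\hat{P}(Y_v = y_v \mid Y_0 = y_0, X = x) = \int p^{(v)}_{y_0,y_v}(x,u,y_0;\hat{\vartheta})\, g_0(u)\, du. \tag{3.4}$$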
As mentioned in Section 1, it is important to distinguish (3.1) from
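$$P(Y_v = y_v \mid \{Y_t = y_t\}_{0 \le t \le n}, X = x, U) = p^{(v-n)}_{y_n,y_v}(x,U,y_0;\vartheta). \tag{3.5}$$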
The problem of estimating quantities similar to (3.5) is referred to as prediction in the mixed model literature (Jiang and Lahiri, 2006). In using (3.5), one aims at estimating the “subject-specific” transition probability, which is a random variable, even from a frequentist point of view. A point predictor for (3.5) is $p^{(v-n)}_{y_n,y_v}(x,\hat{u},y_0;\hat{\vartheta})$, where $\hat{u}$ is the mean or the mode of U|{Yt = yt}0 ≤ t ≤ n,X = x under $\hat{\vartheta}$ (Booth and Hobert, 1998). Jiang (2003) suggests the empirical best predictor 𝔼{P(Yv = yv|{Yt = yt}0 ≤ t ≤ n,X = x)|U = u} which is exactly (3.3). Thus, the point estimators of the 2 prediction problems are the same but the estimands differ. With small numbers of observations on each subject, the utility of subject-specific parameters, such as those in (3.5), is questionable. This point will be illustrated further in the data analysis in Section 4.
3.1. Prediction variance
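Letting $\hat{\vartheta}$ be the estimator of ϑ based on N independent subjects and assuming that $\sqrt{N}(\hat{\vartheta} - \vartheta)$ converges in distribution to $N(0,\Sigma)$, we can calculate the asymptotic variance of (3.3) and (3.4) by the delta method. Let P(x,u,y0;ϑ) be the transition matrix evaluated at (x,u,Y0 = y0;ϑ), and let $p^{*(v-n)}(y_n,y_v,P(x,u,y_0;\vartheta)) = p^{(v-n)}_{y_n,y_v}(x,u,y_0;\vartheta)$ be the (v − n)-step transition probability as a function of the 1-step transition matrix. We have that
$$\frac{\partial p^{*(v-n)}(y_n,y_v,P(x,u,y_0;\vartheta))}{\partial \vartheta'} = \frac{\partial p^{*(v-n)}(y_n,y_v,P)}{\partial\, \mathrm{vec}(P)'}\, \frac{\partial\, \mathrm{vec}(P(x,u,y_0;\vartheta))}{\partial \vartheta'}, \tag{3.6}$$
where vec(P) is the vector representation of the matrix P. The first term on the right-hand side of (3.6) can be calculated by a simple matrix multiplication as shown by Mandel and others (2007), and the rows of the second term are
$$\frac{\partial p_{\cdot,\cdot}(w;\alpha)}{\partial \vartheta'} = \left( \frac{\partial p_{\cdot,\cdot}(w;\alpha)}{\partial \alpha'},\; \frac{\partial p_{\cdot,\cdot}(w;\alpha)}{\partial w}\, z',\; \frac{\partial p_{\cdot,\cdot}(w;\alpha)}{\partial w}\, u \right),$$
where p·,·(w;α) is the generic form of the transition probabilities evaluated at w and α. Calculation of these derivatives for the partial proportional odds model (2.3) is given in Appendix C of the supplementary material (available at Biostatistics online, http://www.biostatistics.oxfordjournals.org). To calculate the derivative of (3.4), one should average (3.6) with respect to g0, which can be done again by numerical integration.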
Differentiation of (3.3) is more complicated but can still be carried out analytically as shown in Appendix D of the supplementary material, available at Biostatistics online. An alternative approach for estimating the variance is based on a simulation that replaces the analytic differentiation but makes use of the asymptotic properties of the parameters’ estimators:
1. Calculate $\hat{\vartheta}$ and $\hat{\Sigma}$, the estimates of ϑ and Σ.
2. Sample B vectors ϑ1,…,ϑB from the normal distribution with parameters $\hat{\vartheta}$ and $\hat{\Sigma}/N$.
3. Calculate (3.3) with ϑb instead of $\hat{\vartheta}$ (b = 1,…,B).
4. Calculate the variance of the estimates in the previous step or calculate confidence intervals using their distribution.
Note that this algorithm requires numerical integration in Step 3 for each of the simulated samples. The calculations of prediction variances for models having fixed effects only are considerably simpler (Mandel and others, 2007).
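The simulation steps above can be sketched generically as follows. The function `simulate_interval` and its arguments are hypothetical names: `estimand` is any function that maps a parameter vector to the predicted probability (in a real application it would evaluate (3.3) by numerical integration for each draw), and the toy estimand and covariance matrix below are placeholders, not quantities from the paper. Sampling uses a hand-written Cholesky factor so that only the Python standard library is needed.

```python
import math
import random

def cholesky(S):
    """Lower-triangular L with L L' = S for a symmetric positive-definite S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def simulate_interval(estimand, theta_hat, sigma_hat, n_subjects,
                      B=2000, level=0.95, seed=0):
    """Steps 2-4: draw theta_b ~ N(theta_hat, sigma_hat / N), evaluate, summarize."""
    rng = random.Random(seed)
    L = cholesky([[c / n_subjects for c in row] for row in sigma_hat])
    draws = []
    for _ in range(B):
        z = [rng.gauss(0.0, 1.0) for _ in theta_hat]
        theta_b = [t + sum(L[i][k] * z[k] for k in range(i + 1))
                   for i, t in enumerate(theta_hat)]
        draws.append(estimand(theta_b))
    draws.sort()
    mean = sum(draws) / B
    variance = sum((d - mean) ** 2 for d in draws) / (B - 1)
    tail = (1.0 - level) / 2.0
    lower = draws[int(math.floor(tail * (B - 1)))]        # percentile interval
    upper = draws[int(math.ceil((1.0 - tail) * (B - 1)))]
    return variance, (lower, upper)

# Toy estimand (placeholder): a probability that depends smoothly on two parameters.
toy = lambda th: 1.0 / (1.0 + math.exp(-(th[0] + th[1])))
var, ci = simulate_interval(toy, [0.2, -0.5], [[4.0, 1.0], [1.0, 2.0]], n_subjects=100)
print(var, ci)
```

The percentile interval in the last step does not rely on the linear approximation of the delta method, at the cost of one evaluation of the estimand per simulated parameter vector.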
3.2. Choice of parameters
Simple manipulations of the estimated transition matrix enable estimation of different parameters of interest. For example, one may be interested in estimating time until the process first visits a certain set of states 𝒮 (hitting time) or time until the first 2 consecutive visits to 𝒮. The second parameter is of great interest in MS since EDSS in one visit may indicate a temporary progression that is much less important than sustained progression (see Section 4). As an example, consider time until the first 2 consecutive visits to 𝒮 = {k:k > j}, with the aim of ultimately estimating the probability of 2 consecutive visits to states larger than j before time v. To estimate these probabilities, one should replace P = (pij) with the working (J + 1)×(J + 1) transition matrix
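$$Q_{j+} = \begin{pmatrix} P_{11} & P_{12} & 0 \\ P_{21} & 0 & P_{22}\mathbf{1} \\ 0' & 0' & 1 \end{pmatrix}, \tag{3.7}$$
where $P_{11}$, $P_{12}$, $P_{21}$, and $P_{22}$ are the blocks of P obtained by partitioning the states into {1,…,j} and 𝒮 = {j + 1,…,J}, and $\mathbf{1}$ is a vector of ones.
Thus, an additional absorbing state is added to indicate the event of interest (see Mandel and others, 2007, for other modifications). The (k,J + 1)th cell of $Q_{j+}^{v}$, the vth power of $Q_{j+}$, is the probability that 2 consecutive visits to 𝒮 occurred during the first v transitions when the process started in k.
Prediction and calculation of variances or confidence intervals are based on the modified transition matrix, $Q_{j+}$. Letting $q^{(v)}_{l,k}(x,u,y_0;\vartheta)$ denote the (l,k) element of $Q_{j+}^{v}$ under ϑ for (X = x,U = u,Y0 = y0), the probability of 2 consecutive visits in states > j before time v, given the process’ history up to time n, is estimated by
$$\frac{\int q^{(v-n)}_{y_n,J+1}(x,u,y_0;\hat{\vartheta}) \prod_{t=1}^{n} p_{y_{t-1},y_t}(x,u,y_0;\hat{\vartheta})\, g_0(u)\, du}{\int \prod_{t=1}^{n} p_{y_{t-1},y_t}(x,u,y_0;\hat{\vartheta})\, g_0(u)\, du}. \tag{3.8}$$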
Equation (3.8) assumes that the event {two consecutive visits in states > j} has not occurred before visit n (otherwise the probability is 1), or alternatively, that we are interested in the probability of that event from the present time to time v. The variance is calculated as described in Section 3.1 (see Appendix D of the supplementary material for some technical details, available at Biostatistics online).
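A small sketch of the working-matrix idea for a fixed transition matrix: transitions from a state in 𝒮 to another state in 𝒮 are redirected to an added absorbing state, and the event probability is read from a matrix power. The construction below is one reading of the description above and assumes that a starting state in 𝒮 already counts as the first of the 2 visits; the function names and the 3-state matrix are hypothetical.

```python
def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mat_pow(P, s):
    n = len(P)
    R = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(s):
        R = mat_mul(R, P)
    return R

def working_matrix(P, j):
    """(J+1)x(J+1) matrix whose last state absorbs '2 consecutive visits to states > j'."""
    J = len(P)
    Q = []
    for k in range(1, J + 1):
        if k <= j:
            # Outside S: ordinary transitions, event cannot occur in this step.
            Q.append(list(P[k - 1]) + [0.0])
        else:
            # Inside S: moving to any state in S completes the event.
            to_S = sum(P[k - 1][j:])
            Q.append(list(P[k - 1][:j]) + [0.0] * (J - j) + [to_S])
    Q.append([0.0] * J + [1.0])  # absorbing state
    return Q

def prob_sustained(P, j, start, v):
    """Probability of 2 consecutive visits to states > j within v transitions."""
    return mat_pow(working_matrix(P, j), v)[start - 1][len(P)]

# Hypothetical 3-state transition matrix (illustration only).
P = [[0.7, 0.2, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.3, 0.6]]
for v in range(1, 5):
    print(v, round(prob_sustained(P, j=1, start=1, v=v), 4))
```

For a random-effects model, the same computation would be repeated at each value of the random effect on the integration grid and then averaged as in (3.8).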
3.3. Models without random effects
A model without random effects is obtained as a special case by setting u≡0 and considering g0 as degenerate at 0. The averaging over the random effects is not needed, making the calculation much simpler. Also, (3.3) reduces to (3.4), where the last observed state carries all important information of the history of the process. Interpretation of β0 is now as the coefficients of the covariate y0 and may be omitted according to the specification of the model.
4. PROGRESSION OF MS
The data set analyzed here is part of a double-blinded phase III clinical trial that evaluated the utility of interferon beta-1a (trademark Avonex) for MS patients with relapsing–remitting disease (Jacobs and others, 1996). It includes the subset analyzed by Rudick and others (1999) of all patients who were accrued early enough to complete 2 years of follow-up by the end of the study and who had brain magnetic resonance imaging scans at baseline and yearly thereafter. As seen in Table 3 of Jacobs and others (1996), the distribution of the time to sustained progression for this subgroup was the same as that of all study subjects. Visits were scheduled to be every 6 months, but actual visits deviated slightly from the schedule. We used all visits that had a maximum discrepancy of 30 days from the schedule. This resulted in only 16 missed visits (2.5% of the total scheduled visits). In our analysis, we consider these visits to be missing at random.
Most MS clinical trials use the ordinal EDSS to define progression. The EDSS ranges from 0 (normal neurologic exam) to 10 (death due to MS) in 0.5 point steps (see http://www.mult-sclerosis.org/expandeddisabilitystatusscale.html, for a description of the scale). In the current study, the outcome of interest was time to sustained progression, defined as the time to 2 consecutive visits with EDSS of at least one point greater than baseline (Jacobs and others, 1996).
The data set contains 68 individuals with a total of 290 1-step transitions in the Avonex group and 72 individuals with 317 1-step transitions in the placebo group, with a maximum follow-up of 3 years (baseline + 6 visits). Due to the small number of transitions, we collapsed the EDSS values into 3 categories: category 1, EDSS ≤ 1.5 (no disability); category 2, EDSS of 2 or 2.5 (mild disability); and category 3, EDSS ≥ 3 (moderate to severe disability). The total number of transitions is summarized in Table 1.
Table 1. Observed transitions of MS patients between EDSS categories. Numbers in parentheses are 2-step transitions corresponding to the 16 missed visits
| | Placebo | Placebo | Placebo | Avonex | Avonex | Avonex |
| | 1 | 2 | 3 | 1 | 2 | 3 |
| 1 | 49 (1) | 23 (0) | 6 (0) | 50 (1) | 26 (1) | 3 (1) |
| 2 | 15 (0) | 45 (2) | 30 (0) | 33 (0) | 52 (1) | 25 (2) |
| 3 | 4 (0) | 21 (0) | 124 (5) | 1 (0) | 21 (0) | 79 (2) |
The partial proportional odds model (2.5) was fitted to the data with J = 3 and g0 the standard normal density. This model does not constrain the baseline transition matrix but assumes proportional odds of covariates among all transitions. Parameters were estimated by SAS procedure NLMIXED, using the dual quasi-Newton algorithm with integrals evaluated by the default adaptive Gaussian quadrature. Convergence problems were solved by first fitting models without random effects and using the resulting estimates as initial values for the mixed-effects models.
We first estimated the model with a treatment indicator as the only component of x. Estimates and their standard errors are listed in Appendix A of the supplementary material (available at Biostatistics online, http://www.biostatistics.oxfordjournals.org). The estimated coefficient for treatment is 1.17, with estimated standard error of 0.475. Thus, Avonex significantly decreases the probability of worsening in disability. This finding is consistent with the results of the original study. We then estimated the model for each arm separately, testing for the influence of patient-specific covariates: age, disease duration, sex, brain lesion volume, and brain parenchymal fraction, the first 2 of which were introduced as time-dependent covariates (see Section 6). None of these covariates showed a significant effect. We thus continued our analysis fitting 2 separate models, one for each arm, without any covariates, that is, assuming different values for the parameters α, β0, and σ² for the 2 arms. Results are listed in Appendix A of the supplementary material, available at Biostatistics online.
Using the estimated transition matrices, we calculated the probability of 2 consecutive visits with EDSS higher than the baseline value as a function of time. We used the modification of the transition matrix given in (3.7) for the analysis. In principle, probabilities of progression can be calculated for any future time point, using the appropriate power (n) in (3.8). The utility of such calculations, however, depends on the validity of the model. In the sequel, we present probabilities up to 5 years to demonstrate the usefulness of the method and discuss and compare probabilities up to 2 years, which was the end point of the original study.
Figure 1 depicts the probability of progression for a subject at the first visit as a function of time (6-month units) stratified by arm and baseline EDSS. After 2 years, the probability of sustained progression among those who had EDSS of 1 at baseline is estimated as 0.46 and 0.55 for the Avonex and placebo arms, respectively. For those having EDSS of 2 at baseline, the difference is more pronounced, being 0.30 for the Avonex and 0.60 for the placebo patients. It appears that Avonex prevents progression for people with mild disability better than for those having no disability. However, this may be related to the nature of the scale; a change from EDSS of 1 to 2 is considered a smaller step than a change from 2 to 3.
Fig. 1.
Probability of progression. Solid lines for the placebo arm and dashed lines for the Avonex arm. The 2 curves for each arm show different levels of baseline EDSS (1 and 2).
Jacobs and others (1996) estimated time to progression without conditioning on baseline EDSS. To generate a similar estimate, one can weight the curves according to the probability of baseline EDSS. For example, in the Avonex arm, there were 20 and 30 individuals with baseline EDSS of 1 and 2, respectively, and the overall estimate of the probability of progression would use the weights 2/5 and 3/5. Estimating progression of individuals with EDSS of 3 or higher is impossible because EDSS values greater than 3 were combined.
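As a purely illustrative calculation of this weighting, take the 2-year model-based estimates for the Avonex arm reported in Table 2 (0.465 for baseline EDSS of 1 and 0.298 for baseline EDSS of 2). The weighted estimate for subjects in these 2 baseline categories would then be

$$\tfrac{2}{5}(0.465) + \tfrac{3}{5}(0.298) = 0.186 + 0.1788 \approx 0.365.$$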
For comparison, models without random effects were fitted to the same data. Figure 2 presents the estimated progression curves and 95% pointwise confidence intervals for patients in the Avonex arm who had baseline EDSS of 2. The 2 curves represent models with (pluses) and without (circles) baseline EDSS as a covariate (see Section 3.3) and their estimated probabilities are quite similar. However, the probabilities are considerably larger than those predicted by the random-effects model (dashed line). Under the random-effects model, the estimated σ² values are 4.96 and 5.83 for the placebo and Avonex arms, respectively. The respective likelihood ratio statistics that contrast the models with and without the random effects are 30.7 and 29.2; these are large values for a chi-squared distribution with 1 degree of freedom (in fact, the test statistic under the null hypothesis σ² = 0 is a 50:50 mixture of a $\chi^2_0$ and a $\chi^2_1$ distribution, Self and Liang, 1987, which is stochastically smaller than $\chi^2_1$). These indicate that the heterogeneity in the data is large and support the choice of the random-effects model. Various other publications have found empirically that progression is slower than that predicted by the models without random effects (e.g. Jacobs and others, 1996; Weinshenker and others, 1989).
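For readers who want to see the size of the corresponding p-values, the following sketch evaluates the tail probability of the 50:50 mixture of a point mass at zero and a chi-squared distribution with 1 degree of freedom. The function names are hypothetical; the only inputs are the 2 likelihood ratio statistics quoted above.

```python
import math

def chi2_1_sf(x):
    """P(chi-squared with 1 df > x) for x >= 0, using P(|Z| > sqrt(x))."""
    return math.erfc(math.sqrt(x / 2.0))

def boundary_lrt_pvalue(stat):
    """P-value under the 50:50 mixture of chi2_0 (point mass at 0) and chi2_1."""
    if stat <= 0:
        return 1.0
    return 0.5 * chi2_1_sf(stat)

for stat in (30.7, 29.2):
    print(stat, boundary_lrt_pvalue(stat))
```

The mixture p-value is half of the usual chi-squared p-value, so using the unmixed reference distribution is conservative here.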
Fig. 2.
Probability of progression starting in state 2 of the Avonex arm under the fixed models. Pluses and circles denote estimates with and without baseline EDSS as a covariate. Lines are the corresponding estimates of the 95% pointwise confidence intervals. The dashed line is the estimate under the mixed model (lowest curve of Figure 1).
To estimate prediction curves for patients for whom EDSS history is available, realizations of curves can be generated using the posterior distribution of the random effect (given history and covariates), as seen in (3.2). Such realizations represent the hypothetical population of curves that the patient-specific curve comes from, and their mean is the probability of progression given the data, that is unconditional on the random effect. Depicting curves from the posterior distribution is a useful descriptive tool that helps with interpretation. To illustrate this, Figure 3 depicts 100 curves for 2 hypothetical subjects in the Avonex group. The left panel represents a subject with baseline EDSS of 2 and without follow-up data (i.e. at the first visit). The curve on the right represents a hypothetical subject after 10 visits with EDSS history (2, 2, 3, 2, 3, 2, 1, 2, 2, 2) and can be considered as a 5-year update of the progression curve for the subject on the left. Because of a short follow-up, the validity of the model cannot be assessed for 5 years, hence the curves may not reflect the real probabilities of progression for such a history. The graph is depicted to demonstrate the utility of the method for valid models. The mean curves and 95% pointwise confidence intervals are depicted too. The variability of the curves on the right is much smaller than that on the left as a result of the additional information. The confidence intervals are not much smaller since they contain the sampling variability of the coefficient estimators. With increasing follow-up data on the same individual, the variability of the gray curves disappears and the graph shows the predicted (or estimated) probability of progression of a specific subject.
Fig. 3.
Distribution of curves of the probability of progression. One hundred realizations of the estimated model for the Avonex arm (gray lines) with the estimated mean curve and 95% confidence intervals based on the simulation method. Left: baseline EDSS of 2 and no follow-up data; right: 10 visits with observed data (2, 2, 3, 2, 3, 2, 1, 2, 2, 2).
Figure 3 illustrates the heterogeneity in the course of the disease and indicates that subject-specific prediction is very difficult. It provides a nice platform for understanding the distinction between (3.1) and (3.5). The quantity (3.5) is essentially one of the gray curves appearing in the figure, while (3.1) is the average of the curves. Thus, (3.1) is a functional of the distribution of (3.5) and can be estimated consistently from the data.
To generate realizations of progression curves for Figure 3, a sampling algorithm from the posterior g(u|{Yt = yt}0 ≤ t ≤ n,X = x) is required. A natural choice is the MCMC algorithm. However, for the current purpose, a rather small number of independent realizations is needed and a simple alternative is to use direct sampling. Let $c > \sup_u P(\{Y_t = y_t\}_{0 \le t \le n} \mid X = x, u)$. In practice, it is enough to approximate c by calculating P({Yt = yt}0 ≤ t ≤ n|X = x,u) on a fixed grid. The rejection/acceptance algorithm (e.g. Evans and Swartz, 2000) for generating one realization is then as follows.
1. Generate independently u from g0 and v from U(0,1).
2. If P({Yt = yt}0 ≤ t ≤ n|X = x,u) > cv, stop and return u. Otherwise, go back to Step 1.
Independent applications of this algorithm produce independent realizations from the posterior distribution, and these are used in the graph.
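The rejection/acceptance steps above can be sketched as follows. The snippet assumes a standard normal g0 and the cumulative-logit convention logit P(Yv ≤ j | Yv−1 = k) = αkj + w; the cut points, `eta` (standing for β′x + β0,y0), the grid, and the 5% inflation of the grid maximum (a safety margin so that c exceeds the supremum) are all assumptions made for illustration.

```python
import math
import random

def expit(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def transition_matrix(alpha, w):
    """Assumed convention: logit P(Y_v <= j | Y_{v-1} = k) = alpha_kj + w."""
    P = []
    for cuts in alpha:
        cum = [expit(a + w) for a in cuts] + [1.0]
        P.append([cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))])
    return P

def path_prob(history, alpha, w):
    """P({Y_t = y_t} | x, u) for the observed transitions at linear predictor w."""
    P = transition_matrix(alpha, w)
    prob = 1.0
    for prev, cur in zip(history, history[1:]):
        prob *= P[prev - 1][cur - 1]
    return prob

def posterior_draws(history, alpha, eta, sigma, n_draws, seed=0, inflate=1.05):
    """Independent draws of the standardized random effect given the history."""
    rng = random.Random(seed)
    grid = [-8.0 + 0.05 * i for i in range(321)]
    # Approximate the bound c on a fixed grid, inflated slightly (assumption).
    c = inflate * max(path_prob(history, alpha, eta + sigma * u) for u in grid)
    draws = []
    while len(draws) < n_draws:
        u = rng.gauss(0.0, 1.0)   # Step 1: u from g0
        v = rng.random()          #         v from U(0,1)
        if path_prob(history, alpha, eta + sigma * u) > c * v:
            draws.append(u)       # Step 2: accept
    return draws

# Hypothetical inputs (illustration only).
alpha = [[1.0, 3.0], [-1.0, 1.5], [-3.0, -0.5]]
draws = posterior_draws([1, 1, 1, 1], alpha, eta=0.0, sigma=2.0, n_draws=100)
print(sum(draws) / len(draws))
```

Each accepted u defines one transition matrix and hence one progression curve; plotting the curves for all draws gives a display of the kind shown in Figure 3.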
Finally, we comment on model selection. Choosing the correct model is always a difficult task in parametric analysis. As discussed above, we contrasted the models with and without random effects, and the analysis strongly supported the inclusion of random effects. We also embedded the random-effects model into larger models, such as models with baseline and time-dependent covariates, and models with time-varying vectors of constants, α. Using likelihood ratio tests, all these extended models were rejected when compared to our final model. As pointed out by an associate editor, probabilities such as (3.1) can be read directly from the data when the sample size is very large. Unfortunately, we do not have a large sample size and thus do need to impose strong parametric assumptions. Even though the sample sizes are small, we calculated Kaplan–Meier curves for the probability of progression in order to compare informally the data to the models’ results. We imputed values for missing visits by the last observation carried forward approach. Table 2 lists the results and shows good agreement between the data and the model.
Table 2. Kaplan–Meier (KM) and model-based estimates of the probability of sustained progression during the first 2 years; CI, confidence interval
| Arm | Baseline EDSS | KM (95% CI) | Model |
|---|---|---|---|
| Avonex | 1 | 0.489 (0.117, 0.577) | 0.465 |
| Avonex | 2 | 0.338 (0.143, 0.488) | 0.298 |
| Placebo | 1 | 0.520 (0.278, 0.681) | 0.554 |
| Placebo | 2 | 0.550 (0.269, 0.723) | 0.600 |
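For readers who wish to reproduce this kind of informal check, the following is a rough sketch of last-observation-carried-forward imputation and a Kaplan–Meier estimate of the probability of an event by a given time. The function names and the small data set are hypothetical, and the derivation of the sustained-progression event from the EDSS sequence is not shown.

```python
import numpy as np

def locf(values):
    """Fill missing visits (np.nan) with the last observed value."""
    out = np.array(values, dtype=float)
    for i in range(1, len(out)):
        if np.isnan(out[i]):
            out[i] = out[i - 1]
    return out

def km_event_probability(times, events, horizon):
    """Kaplan-Meier estimate of P(event by `horizon`), i.e. 1 - S(horizon).

    times  : event or censoring time for each subject
    events : 1 if the event was observed, 0 if the subject was censored
    """
    times = np.asarray(times)
    events = np.asarray(events, dtype=bool)
    surv = 1.0
    for t in np.unique(times[events]):        # distinct event times, ascending
        if t > horizon:
            break
        at_risk = np.sum(times >= t)          # subjects still under observation
        d = np.sum((times == t) & events)     # events occurring at time t
        surv *= 1.0 - d / at_risk
    return 1.0 - surv

# Hypothetical data: visit index of the event or of censoring for six subjects.
filled = locf([1.0, np.nan, np.nan, 2.0, np.nan])
prob = km_event_probability([1, 2, 2, 3, 4, 4], [1, 1, 0, 1, 0, 0], horizon=4)
```

With these six toy subjects the survival estimate is (5/6)(4/5)(2/3) = 4/9, so `prob` equals 5/9.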
Since the event of interest (sustained increase of one point) is nonstandard in survival analysis and necessitated imputation of missing values, we further compared the model's estimates of 2-year transition probabilities to empirical data. We again found good agreement between the 2 sets of estimates (Appendix A of the supplementary material, available at Biostatistics online, http://www.biostatistics.oxfordjournals.org). These findings, and the sensitivity analysis described in Section 5, support the validity of the chosen random-effects model and the adequacy of the resulting progression curves.
5. SIMULATION
We conducted a small simulation study to examine the performance of the confidence intervals. We considered 2 settings, both of which result in 300 transitions. The first was of 100 subjects each with 3 transitions, and the second was of 20 subjects each with 15 transitions. We compared the confidence intervals based on the analytical delta method to those obtained by the simulation method described at the end of Section 3.1. We also calculated intervals for models without random effects to illustrate the consequences of model misspecification. A detailed description of the models used and tables of results appear in Appendix B.1 of the supplementary material (available at Biostatistics online, http://www.biostatistics.oxfordjournals.org).
We found that confidence intervals with 100 subjects performed well, whereas those for 20 subjects were anticonservative. This was probably a result of poor normal approximation for the distribution of the fixed-effect estimators in data sets with few subjects. Another feature of the confidence intervals was their uneven distribution on the left and right of the true value: most intervals that did not include the true parameter lay entirely below it. This was less pronounced in the percentile method, which suggests that it was mostly related to the linear approximation of the delta method. The similarity between the analytical delta method and the simulation-based approach in the 100-subject setting was remarkable. In summary, the simulation approach demands more computer time but saves the burden of calculating and programming complicated derivatives. It also has the additional merit of eliminating the linear approximation of the delta method.
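The contrast between the two kinds of interval can be illustrated with a short sketch. It assumes, as a labeled simplification, that the simulation approach draws parameter vectors from the estimated asymptotic normal distribution of the estimator and takes percentiles of the resulting probabilities; the exact algorithm of Section 3.1 is not reproduced here. The function `toy_prob`, the point estimate, and the covariance matrix are hypothetical, and the delta-method gradient is obtained numerically rather than analytically.

```python
import numpy as np

def delta_ci(f, theta_hat, cov, z=1.96, eps=1e-6):
    """Delta-method interval for f(theta), using a central-difference gradient."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    grad = np.empty_like(theta_hat)
    for j in range(theta_hat.size):
        step = np.zeros_like(theta_hat)
        step[j] = eps
        grad[j] = (f(theta_hat + step) - f(theta_hat - step)) / (2.0 * eps)
    se = np.sqrt(grad @ cov @ grad)           # linear approximation of the variance
    est = f(theta_hat)
    return est - z * se, est + z * se

def percentile_ci(f, theta_hat, cov, rng, n_draws=2000, level=0.95):
    """Simulation-based interval: percentiles of f over normal parameter draws."""
    draws = rng.multivariate_normal(theta_hat, cov, size=n_draws)
    vals = np.array([f(th) for th in draws])
    a = (1.0 - level) / 2.0
    return tuple(np.quantile(vals, [a, 1.0 - a]))

# Hypothetical probability of interest as a function of two parameters.
def toy_prob(theta):
    return 1.0 / (1.0 + np.exp(-(theta[0] + theta[1])))

theta_hat = np.array([-0.5, 0.3])
cov = np.array([[0.04, 0.01],
                [0.01, 0.09]])
rng = np.random.default_rng(0)

delta_lo, delta_hi = delta_ci(toy_prob, theta_hat, cov)
pct_lo, pct_hi = percentile_ci(toy_prob, theta_hat, cov, rng)
```

The percentile interval needs no derivatives and always stays inside (0, 1), whereas the delta interval is symmetric around the point estimate by construction.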
Confidence intervals based on models without random effects performed poorly. In the setting of 100 subjects, there was a tendency toward overestimation, which is consistent with the finding of the data analysis in Section 4 (see Figure 2).
We also tested the sensitivity of the model to time-varying transition probabilities. For that, we generated data using log(time) as a covariate but fitted time-homogeneous models. The results, given in Appendix B.2 of the supplementary material (available at Biostatistics online, http://www.biostatistics.oxfordjournals.org), show moderate sensitivity which, as expected, increases with the effect of log(time). For example, when the coefficient of log(time) is −0.5, the bias at time 4 (corresponding to 2 years) is 0.012, and the bias at time 10 (5 years) is −0.039. The true probabilities are 0.593 and 0.897, so the bias is relatively small.
6. DISCUSSION
We have presented prediction and confidence interval estimation in the mixed-effects Markov model framework and have provided several graphical tools that help to interpret model results. These graphs, and especially Figure 3, indicate that MS is indeed a heterogeneous disease and provide useful tools for better understanding the course of the disease. Although we have introduced our models for time-independent covariates, they can be extended to time-varying covariates. Letting x_v be the values of the covariates as measured at visit v, (2.5) is replaced by
| (6.1) |
and estimation of ϑ follows, similar to this extension in models without random effects (Mandel and others, 2007). If the covariate process is external to the Markov process and its future is known up to time v, then prediction can be done as in (3.3) by
![]() |
where
is the (y_n, y_v) element of the transition matrix P(x_n, u, y_0; ϑ) × P(x_{n+1}, u, y_0; ϑ) × ⋯ × P(x_{v−1}, u, y_0; ϑ). The variance can be estimated by the simulation algorithm discussed in Section 3.1.
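As a small numerical illustration of the matrix product just described, the sketch below multiplies visit-specific one-step transition matrices for a fixed value of the random effect and reads off one element. The two 3-state matrices are hypothetical and are not estimates from the trial; in the model, each matrix would be evaluated at the covariate value of its visit, and the result would then be averaged over the posterior of u.

```python
import numpy as np

def multi_step_transition(matrices):
    """Ordered product of one-step transition matrices, earliest visit first."""
    out = np.eye(matrices[0].shape[0])
    for P in matrices:
        out = out @ P
    return out

# Hypothetical one-step matrices for two consecutive visits (rows sum to 1).
P1 = np.array([[0.8, 0.2, 0.0],
               [0.1, 0.7, 0.2],
               [0.0, 0.0, 1.0]])
P2 = np.array([[0.7, 0.3, 0.0],
               [0.2, 0.6, 0.2],
               [0.0, 0.0, 1.0]])

two_step = multi_step_transition([P1, P2])
# Probability of moving from the first state to the third state in two visits.
p_first_to_third = two_step[0, 2]
```

Because the matrices differ across visits, the order of multiplication matters, which is why the product cannot be replaced by a matrix power as in the time-homogeneous case.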
Our study involves ordinal responses and assumes the partial proportional odds model (2.3), but extensions to other models and to nominal responses are quite straightforward. A general treatment of modeling and estimating random effects for categorical data is given by Hartzel and others (2001). One can specify such models for each row of the transition matrix or define one model for all the rows, similar to the approach taken in this work. Prediction is then conducted by integrating over the random effects, exactly as is done here.
MS is known to be a heterogeneous disease. Some patients experience no progression for 10 or more years (benign MS), whereas others experience a fast and continuous progression from onset (primary progressive MS). The patients analyzed in this paper are relatively homogeneous, since enrollment was subject to the strict criteria of a phase III clinical trial. Nonetheless, our results indicate that heterogeneity is present and is significant. Prediction of the random effect U would be too ambitious with the typical short follow-up data on each subject, as is well illustrated by Figure 3. In other contexts, predicting U and calculating the mean-squared error of prediction, using the approach of Booth and Hobert (1998), is of both theoretical and applied interest and is a topic of current research. When predicting U, it is also of interest to calculate “plug-in” prediction intervals (see Lawless and Fredette, 2005, and references therein). These are calculated by the pointwise percentiles of the realizations of the curves at each time point and can be included in Figure 3.
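The pointwise-percentile calculation mentioned above amounts to taking percentiles across realizations separately at each time point. A minimal sketch follows, in which the simulated curves are purely hypothetical stand-ins for progression curves generated from posterior draws of U.

```python
import numpy as np

def pointwise_band(curves, level=0.95):
    """Pointwise percentile band from an array of shape (n_realizations, n_times)."""
    curves = np.asarray(curves, dtype=float)
    a = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(curves, [a, 100.0 - a], axis=0)
    return lower, upper

# Hypothetical realizations: cumulative probability curves over 11 visits,
# each driven by its own randomly drawn rate.
rng = np.random.default_rng(1)
visits = np.arange(0, 11)
rates = rng.lognormal(mean=-2.0, sigma=0.5, size=500)
curves = 1.0 - np.exp(-np.outer(rates, visits))

lower, upper = pointwise_band(curves)
```

The arrays `lower` and `upper` can be overlaid on the plotted realizations as the band boundaries.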
FUNDING
National Institutes of Health (CA075971 to M.M., R.A.B.); Harvard Center for Neurodegeneration and Repair (to M.M., R.A.B.); Partners Multiple Sclerosis Center (to M.M.).
Acknowledgments
We thank Dr. Richard Rudick, Director of the Mellen Center for Multiple Sclerosis at the Cleveland Clinic Foundation, for providing us the Avonex trial data on behalf of the Multiple Sclerosis Collaborative Research Group (Principal Investigator, Lawrence Jacobs, MD, deceased). We are also grateful to the editor, an associate editor, and a referee for helpful comments and suggestions. Conflict of Interest: None declared.
References
- Agresti A. Categorical Data Analysis. Second Edition. Hoboken, NJ: Wiley & Sons; 2002.
- Aitkin M, Alfó M. Regression models for binary longitudinal responses. Statistics and Computing. 1998;8:289–307.
- Aitkin M, Alfó M. Longitudinal analysis of repeated binary data using autoregressive and random effect modelling. Statistical Modelling. 2003;3:291–303.
- Albert PS. A Markov model for sequences of ordinal data from a relapsing-remitting disease. Biometrics. 1994;50:51–60.
- Albert PS, Follmann DA. A random effects transition model for longitudinal binary data with informative missingness. Statistica Neerlandica. 2003;57:100–111.
- Albert PS, Waclawiw MA. A two-state Markov chain for heterogeneous transitional data: a quasi-likelihood approach. Statistics in Medicine. 1998;17:1481–1493. doi: 10.1002/(sici)1097-0258(19980715)17:13<1481::aid-sim858>3.0.co;2-h.
- Alfó M, Aitkin M. Random coefficient models for binary longitudinal responses with attrition. Statistics and Computing. 2000;10:279–287.
- Booth JG, Hobert JP. Standard errors of prediction in generalized linear mixed models. Journal of the American Statistical Association. 1998;93:262–272.
- Cook RJ. A mixed model for two-state Markov processes under panel observation. Biometrics. 1999;55:915–920. doi: 10.1111/j.0006-341x.1999.00915.x.
- Cook RJ, Ng ETM. A logistic-bivariate normal model for overdispersed two-state Markov processes. Biometrics. 1997;53:358–364.
- Cook RJ, Yi GY, Lee KA, Gladman DD. A conditional Markov model for clustered progressive multistate processes under incomplete observation. Biometrics. 2004;60:436–443. doi: 10.1111/j.0006-341X.2004.00188.x.
- Diggle P, Heagerty P, Liang KY, Zeger SL. Analysis of Longitudinal Data. Second Edition. Oxford: Oxford University Press; 2002.
- Evans M, Swartz T. Approximating Integrals via Monte Carlo and Deterministic Methods. Oxford: Oxford University Press; 2000.
- Hartzel J, Agresti A, Caffo B. Multinomial logit random effects models. Statistical Modelling. 2001;1:81–102.
- Jacobs LD, Cookfair DL, Rudick RA, Herndon RM, Richert JR, Salazar AM, Fischer JS, Goodkin DE, Granger CV, Simon JH, and others. Intramuscular interferon beta-1 alpha for disease progression in relapsing multiple sclerosis. Annals of Neurology. 1996;39:285–294. doi: 10.1002/ana.410390304.
- Jiang JM. Empirical best prediction for small-area inference based on generalized linear mixed models. Journal of Statistical Planning and Inference. 2003;111:117–127.
- Jiang J, Lahiri P. Mixed model prediction and small area estimation (with discussion). Test. 2006;15:1–96.
- Kalbfleisch JD, Lawless JF. The analysis of panel data under a Markov assumption. Journal of the American Statistical Association. 1985;80:863–871.
- Lawless JF, Fredette M. Frequentist prediction intervals and predictive distributions. Biometrika. 2005;92:529–542.
- Littell RC, Milliken GA, Stroup WW, Wolfinger RD, Schabenberger O. SAS for Mixed Models. Second Edition. Cary, NC: SAS Publishing; 2006.
- Mandel M, Gauthier SA, Guttmann CRG, Weiner HL, Betensky RA. Estimating time to event from longitudinal categorical data: an analysis of multiple sclerosis progression. Journal of the American Statistical Association. 2007;102:1254–1266. doi: 10.1198/016214507000000059.
- Robinson GK. That BLUP is a good thing: the estimation of random effects (with discussion). Statistical Science. 1991;6:15–32.
- Rudick RA, Fisher E, Lee J-C, Simon J, Jacobs L. Use of the brain parenchymal fraction to measure whole brain atrophy in relapsing-remitting MS. Neurology. 1999;53:1698–1704. doi: 10.1212/wnl.53.8.1698.
- Self SG, Liang KY. Asymptotic properties of maximum-likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association. 1987;82:605–610.
- Weinshenker BG, Bass B, Rice GPA, Noseworthy J, Carriere W, Baskerville J, Ebers GC. The natural history of multiple sclerosis: a geographically based study 2. Predictive value of the early clinical course. Brain. 1989;112:1419–1428. doi: 10.1093/brain/112.6.1419.