Author manuscript; available in PMC: 2013 Sep 22.
Published in final edited form as: Biometrics. 2009 Jan 23;65(3):728–736. doi: 10.1111/j.1541-0420.2008.01173.x

Estimation in Semiparametric Transition Measurement Error Models for Longitudinal Data

Wenqin Pan 1,*, Donglin Zeng 2,**, Xihong Lin 3,***
PMCID: PMC3779699  NIHMSID: NIHMS500730  PMID: 19173696

SUMMARY

We consider semiparametric transition measurement error models for longitudinal data, where one of the covariates is measured with error in transition models, and no distributional assumption is made for the underlying unobserved covariate. An estimating equation approach based on the pseudo conditional score method is proposed. We show the resulting estimators of the regression coefficients are consistent and asymptotically normal. We also discuss the issue of efficiency loss. Simulation studies are conducted to examine the finite-sample performance of our estimators. The longitudinal AIDS Costs and Services Utilization Survey data are analyzed for illustration.

Keywords: Asymptotic efficiency, Conditional score method, Functional modeling, Measurement error, Longitudinal data, Transition models

1. Introduction

Longitudinal data are common in health science research, where repeated measures are obtained for each subject over time. One class of longitudinal models is the transitional model, where the conditional mean of an outcome at the current time point is modeled as a function of the past outcomes and covariates (Diggle et al., 2002, Chapter 10). This class of models is particularly useful when one is interested in predicting the future response given the past history, or when past history contains important adjustor variables. The within-subject correlation is automatically accounted for by conditioning on the past responses, and the model can be easily fit within the generalized linear model framework. Transition models and their wide practical applications have been well demonstrated (e.g., Young et al. 1999, Have and Morabia 2002, Heagerty 2002, Roy and Lin 2005).

Measurement error in covariates is a common problem in longitudinal data, due to equipment limitation, longitudinal variation, or recall bias. In one study from the AIDS Costs and Services Utilization Survey (ACSUS) (Berk, Maffeo and Schur 1993), which consisted of subjects from 10 randomly selected U.S. cities with the highest AIDS rates, a series of quarterly interviews was conducted for each participant enrolled between 1991 and 1992. A question of interest was how CD4 count predicted the risk of future hospitalization given a subject’s past history of hospitalizations. Thus, a natural model for analyzing this data set is a prediction model with the outcome being whether a participant had a hospital admission (yes/no) in the past quarter. However, CD4 count is known to be subject to considerable measurement error due to its substantial variability, e.g., its coefficient of variation within the same subject was found to be 50% (Tsiatis et al. 1995). Another source of measurement error in CD4 count in this study was that CD4 count was not measured at the time of each interview but abstracted from each respondent’s most recent medical record.

The methods for handling measurement error for independent outcomes are comprehensively reviewed in Fuller (1987) and Carroll et al. (2006). For longitudinal data, Wang et al. (1998), among others, considered measurement error in mixed effects models. Schmid, Segal and Rosner (1994) and Schmid (1996) studied measurement error in first-order autoregressive models for continuous longitudinal outcomes. There is a vast amount of work in the econometrics literature on panel data with errors in variables. For example, Griliches and Hausman (1986) and Biorn and Klette (1998) proposed estimating the effect of the error-prone covariates using the generalized method of moments, but their method required that longitudinal outcomes be linearly related to covariates and that the residual terms be non-autocorrelated. In the literature on structural equation models, longitudinal covariates subject to measurement error are treated as latent variables and are modelled longitudinally and explicitly (cf. Duncan, Duncan and Strycker, 2006), and maximum likelihood estimation is used for inference. Additionally, using the same idea of latent modelling, Pan, Lin and Zeng (2006) considered estimation in generalized transitional measurement error models for general outcomes. However, these approaches require that the normality assumption and the correlation structure of the unobserved covariate be correctly specified. The normality assumption is often too strong in reality, and the correlation structure of the unobserved covariate may be difficult to specify correctly. One can show that when a first-order autoregressive structure for the unobserved covariate is misspecified as an independent structure, the effect of this covariate in the transition model is attenuated and the estimated effect of the past outcome is the same as the one obtained by ignoring the measurement error (Pan, 2002). Therefore, it is necessary to develop a method which leaves the distribution of the unobserved covariate fully unspecified. On the other hand, since the repeated measures of the unobserved covariate are usually correlated and have at least three waves, any attempt to estimate their joint distribution nonparametrically, for example, using the kernel method in Carroll and Wand (1991), breaks down due to the curse of dimensionality.

This paper aims to develop a semiparametric method for transition measurement error models without specifying the distribution of the unobserved covariate. Our approach is to construct an estimating equation based on the pseudo conditional score method, which builds on the conditional score method originally proposed for independent data by Stefanski and Carroll (1987). Its generalization to transition models is not trivial in the presence of repeatedly measured unobserved covariates. In the second part of this paper, we further discuss the efficiency of the proposed method.

The rest of the paper is structured as follows. In §2, we present the general form of the semiparametric transition measurement error model for longitudinal data. In §3, we derive the pseudo conditional score estimating equation and study the theoretical properties of the resulting estimator. In §4, we illustrate the method using simulation studies and apply the proposed method to analyze the ACSUS data. The issue of efficiency loss is also studied. Discussions are given in §5.

2. Semiparametric Transition Measurement Error Model

Suppose each of the n subjects has m repeated measures over time. Let Yij be the outcome at time j (j = 1, ⋯, m) of subject i (i = 1, ⋯, n). Let Wij be a scalar observed error-prone covariate, which measures the unobserved covariate Xij with error. Let Zij be a vector of covariates that are accurately measured. A transition model assumes the conditional distribution of Yij given the history of the outcome and the history of the covariates satisfies the (q, r)-order Markov property (Ch 10, Diggle et al., 2002) and belongs to the exponential family.

Specifically, for j > s, where s = (r − 1)⋁q = max(r − 1, q), the conditional distribution of Yij is

$$f(Y_{ij} \mid H_{ij}) = \exp\left\{\frac{Y_{ij}\eta_{ij} - b(\eta_{ij})}{a\phi} + c(H_{ij}, \phi)\right\}, \tag{1}$$

where $H_{ij} = \{Y_{i,j-1}, \ldots, Y_{i,j-q}, X_{ij}, \ldots, X_{i,j-r+1}, Z_{ij}, \ldots, Z_{i,j-r+1}\}$, f(·) denotes a density function, a is a prespecified weight, ϕ is a scale parameter, and b(·) and c(·) are specific functions associated with the exponential family. We assume a canonical generalized linear model (McCullagh and Nelder, 1989) for $\mu_{ij} = E(Y_{ij} \mid H_{ij}) = b'(\eta_{ij})$ as

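$$h(\mu_{ij}) = \eta_{ij} = \beta_0 + \sum_{k=1}^{q} \alpha_k Y_{i,j-k} + \sum_{l=1}^{r} \left\{\beta_{xl} X_{i,j-l+1} + \beta_{zl}^T Z_{i,j-l+1}\right\}, \tag{2}$$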

where h(·) is the canonical link function satisfying $h^{-1}(\cdot) = b'(\cdot)$, and $\beta_0$, $\alpha_k$ (k = 1, …, q), and $\beta_l = (\beta_{xl}, \beta_{zl}^T)^T$ (l = 1, …, r) are regression coefficients. In addition, we treat Yi1, …, Yi,s as initial states on which the subsequent inference is conditioned. Note that when the Z covariates do not change with time, we tacitly keep only one Z term in equation (2).

We assume that Xij is subject to measurement error and the measurement error is additive, i.e.,

Wij=Xij+Uij, (3)

where the measurement errors Uij are independent of the Xij and are independently and identically distributed from a normal distribution with mean zero and known variance $\sigma_u^2$. The variance $\sigma_u^2$ usually needs to be estimated beforehand, either from replications or from validation data (Carroll et al., 2006). We assume that the joint distribution of {Xi1, ⋯, Xim} is fully unspecified.

We suppose that measurement error is non-differential, i.e.,

f(Yij,Wij|Hij)=f(Yij|Hij)f(Wij|Xij),

where Hij was defined in (1). This means that conditional on the true covariates, the observed error-prone covariate does not contain additional information about Yij.
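As a concrete illustration of models (2) and (3) together with the non-differential error assumption, the following is a minimal sketch that simulates one subject from a logistic transition model with q = r = 1. The function names, the parameter values, and the true covariate path are hypothetical choices made for illustration only; the true covariate path is supplied by the caller precisely because its distribution is left unspecified.

```python
import math
import random


def expit(t: float) -> float:
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))


def simulate_subject(x, y1, beta0, alpha, beta_x, sigma_u2, rng):
    """Simulate one subject from a logistic transition model with q = r = 1.

    x       : true covariate path X_i1, ..., X_im (any distribution; caller-supplied)
    y1      : initial state Y_i1, on which inference is conditioned
    returns : (y, w), the binary outcomes and the error-prone measurements
    """
    y = [y1]
    for j in range(1, len(x)):
        # Linear predictor of model (2) with one lagged outcome and the current X.
        eta = beta0 + alpha * y[j - 1] + beta_x * x[j]
        y.append(1 if rng.random() < expit(eta) else 0)
    # Additive normal error as in (3), drawn independently of the outcomes
    # given the true covariate (non-differential measurement error).
    sd_u = math.sqrt(sigma_u2)
    w = [xj + rng.gauss(0.0, sd_u) for xj in x]
    return y, w


rng = random.Random(2009)
x_path = [0.25, 0.9, 0.4, 1.3]  # hypothetical true covariate values
y, w = simulate_subject(x_path, y1=0, beta0=-1.0, alpha=0.5, beta_x=1.0,
                        sigma_u2=0.5, rng=rng)
```

Only `y` and `w` would be available to the analyst; `x_path` plays the role of the unobserved covariate.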

3. Inference Procedures

3.1 Pseudo conditional score equation

Let θ denote $(\beta_0, \alpha_1, \ldots, \alpha_q, \beta_1^T, \ldots, \beta_r^T, \phi)^T$. In this section, we propose a pseudo conditional score method to estimate θ. The idea is to regard θ as known and to treat the Xij as fixed parameters, written as xij.

In a classical conditional score approach (Stefanski and Carroll, 1987), one would aim to derive simple sufficient summary statistics for (xi1, …, xim) and construct an estimating equation based on the conditional likelihood function of the observed data given the sufficient statistics. Unfortunately, due to the transition structure and the possibly nonlinear link function in (2), obtaining the summary sufficient statistics for xij based on the distribution of the observed data is usually difficult. For example, the likelihood function for a first-order transition model for dichotomous Y2 and Y3 with X1 = X2 = X3 = X given initial state Y1 is

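$$\exp\left\{Y_2(\beta_0 + \alpha_1 Y_1 + \beta_{x1}X) - \log\left(1 + e^{\beta_0 + \alpha_1 Y_1 + \beta_{x1}X}\right) + Y_3(\beta_0 + \alpha_1 Y_2 + \beta_{x1}X) - \log\left(1 + e^{\beta_0 + \alpha_1 Y_2 + \beta_{x1}X}\right)\right\}$$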

and it does not belong to any exponential family.

Instead, we note that for each j = s + 1, …, m, the conditional density of $(Y_{ij}, W_{ij}, \ldots, W_{i,j-r+1})$ given $(Y_{i,j-1}, \ldots, Y_{i,j-q}, Z_{ij}, \ldots, Z_{i,j-r+1})$ and $(x_{ij}, x_{i,j-1}, \ldots, x_{i,j-r+1})$ is given by

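$$\begin{aligned}\exp\Bigg[&\frac{Y_{ij}\left(\beta_0 + \sum_{k=1}^{q}\alpha_k Y_{i,j-k} + \sum_{l=1}^{r}\{\beta_{xl}x_{i,j-l+1} + \beta_{zl}^T Z_{i,j-l+1}\}\right)}{a\phi} - \frac{b\left(\beta_0 + \sum_{k=1}^{q}\alpha_k Y_{i,j-k} + \sum_{l=1}^{r}\{\beta_{xl}x_{i,j-l+1} + \beta_{zl}^T Z_{i,j-l+1}\}\right)}{a\phi} \\ &+ c(Y_{i,j-1}, \ldots, Y_{i,j-q}, x_{ij}, \ldots, x_{i,j-r+1}, Z_{ij}, \ldots, Z_{i,j-r+1}, \phi) - \sum_{l=1}^{r}\frac{(W_{i,j-l+1} - x_{i,j-l+1})^2}{2\sigma_u^2} - r\log\sqrt{2\pi\sigma_u^2}\Bigg].\end{aligned}$$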

We recognize that this conditional density still belongs to an exponential family. The sufficient statistics for $x_{i,j-l+1}$, l = 1, …, r, are

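$$T_{i1}^{(j)} = \frac{\beta_{x1}}{a\phi}Y_{ij} + \frac{1}{\sigma_u^2}W_{ij}, \quad T_{i2}^{(j)} = \frac{\beta_{x2}}{a\phi}Y_{ij} + \frac{1}{\sigma_u^2}W_{i,j-1}, \quad \ldots, \quad T_{ir}^{(j)} = \frac{\beta_{xr}}{a\phi}Y_{ij} + \frac{1}{\sigma_u^2}W_{i,j-r+1}. \tag{4}$$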
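The statistics in (4) are simple linear combinations of the current outcome and the lagged error-prone measurements. A minimal sketch of their computation follows; the function name and the numerical inputs are hypothetical and serve only to show which quantity pairs with which lag.

```python
def sufficient_statistics(y_ij, w_lags, beta_x, a_phi, sigma_u2):
    """Return [T_i1^(j), ..., T_ir^(j)] as defined in (4).

    y_ij     : current outcome Y_ij
    w_lags   : [W_ij, W_{i,j-1}, ..., W_{i,j-r+1}]
    beta_x   : [beta_x1, ..., beta_xr], coefficients of the error-prone covariate
    a_phi    : the product a * phi from model (1)
    sigma_u2 : measurement error variance
    """
    if len(w_lags) != len(beta_x):
        raise ValueError("need one W value per lag coefficient")
    return [b * y_ij / a_phi + w / sigma_u2 for b, w in zip(beta_x, w_lags)]


# Hypothetical values with r = 2 lags.
t_stats = sufficient_statistics(y_ij=1.0, w_lags=[0.4, 0.2],
                                beta_x=[2.0, 1.0], a_phi=1.0, sigma_u2=0.5)
```

Each statistic depends on θ through the coefficients and the scale, which is why the conditioning set is written as a function of θ in what follows.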

Therefore, the distribution of Yij given $(Y_{i,j-1}, \ldots, Y_{i,j-q}, Z_{ij}, \ldots, Z_{i,j-r+1})$ and $(T_{i1}^{(j)}, \ldots, T_{ir}^{(j)})$ depends only on θ and not on $(x_{ij}, \ldots, x_{i,j-r+1})$. For convenience, we abbreviate this distribution as (Yij|Vij(θ); θ), where Vij(θ) denotes the statistics on which Yij is conditioned. Clearly,

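$$E_{\theta_0}\left[\nabla_\theta \log (Y_{ij} \mid V_{ij}(\theta_0); \theta)\big|_{\theta=\theta_0}\right] = E_{\theta_0}\left[E_{\theta_0}\left\{\nabla_\theta \log (Y_{ij} \mid V_{ij}(\theta_0); \theta) \,\middle|\, V_{ij}(\theta_0)\right\}\Big|_{\theta=\theta_0}\right] = 0,$$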

where θ0 is the true value of θ, Eθ denotes the expectation given the parameter θ, and ∇θ denotes the gradient with respect to θ. We then construct the following estimating equation

$$\sum_{i=1}^{n}\sum_{j=s+1}^{m} g(Y_{ij} \mid v_{ij} = V_{ij}(\theta); \theta) = 0, \tag{5}$$

where g(yij|vij; θ) denotes the gradient of log (yij|vij; θ) with respect to θ. Note that calculations of this gradient are done by viewing vij as fixed instead of a function of θ.

Essentially, our idea is to construct conditional score functions based on the conditional density given the past history at each time, and then to sum these scores to form the estimating function. Since the above construction is not based on the full likelihood function, we call our proposed estimating equation the pseudo conditional score equation. Newton-Raphson iteration can be used to solve the equation; however, multiple solutions may exist. The following theorem therefore gives the asymptotic property of a solution to (5) in a neighborhood of θ0.

Theorem 1. Assume that with probability one, in a neighborhood of θ0, ∇θg(Yij|Vij(θ); θ) is Lipschitz continuous with respect to θ and moreover,

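$$E_{\theta_0}\left[\sum_{j=s+1}^{m}\nabla_\theta g(Y_{ij} \mid V_{ij}(\theta); \theta)\Big|_{\theta=\theta_0}\right] \text{ is non-singular.}$$

Then there exists a solution, $\hat\theta_n$, to equation (5) such that $\sqrt{n}(\hat\theta_n - \theta_0)$ converges in distribution to a normal distribution with mean zero and covariance

$$\Sigma(\theta_0) = \left\{E_{\theta_0}\left[\sum_{j=s+1}^{m}\nabla_\theta g(Y_{ij} \mid V_{ij}(\theta); \theta)\Big|_{\theta=\theta_0}\right]\right\}^{-1} \times E_{\theta_0}\left[\left\{\sum_{j=s+1}^{m} g(Y_{ij} \mid V_{ij}(\theta_0); \theta_0)\right\}\left\{\sum_{j=s+1}^{m} g(Y_{ij} \mid V_{ij}(\theta_0); \theta_0)\right\}^T\right] \times \left\{E_{\theta_0}\left[\sum_{j=s+1}^{m}\nabla_\theta g(Y_{ij} \mid V_{ij}(\theta); \theta)^T\Big|_{\theta=\theta_0}\right]\right\}^{-1}.$$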

The proof follows the usual argument for estimating equations. Clearly, a consistent estimator for Σ(θ0) is

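$$\hat\Sigma_n = n\left[\sum_{i=1}^{n}\sum_{j=s+1}^{m}\nabla_\theta g(Y_{ij} \mid V_{ij}(\theta); \theta)\Big|_{\theta=\hat\theta_n}\right]^{-1} \times \left[\sum_{i=1}^{n}\left\{\sum_{j=s+1}^{m} g(Y_{ij} \mid V_{ij}(\hat\theta_n); \hat\theta_n)\right\}\left\{\sum_{j=s+1}^{m} g(Y_{ij} \mid V_{ij}(\hat\theta_n); \hat\theta_n)\right\}^T\right] \times \left[\sum_{i=1}^{n}\sum_{j=s+1}^{m}\nabla_\theta g(Y_{ij} \mid V_{ij}(\theta); \theta)^T\Big|_{\theta=\hat\theta_n}\right]^{-1}.$$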
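The following is a generic sketch, not the authors' implementation, of how an estimating equation of the form (5) might be solved by Newton-Raphson and how the sandwich estimate above might then be formed. It assumes a user-supplied function `g(theta, subject)` returning a subject's summed score (with any dependence of the conditioning statistics on θ computed inside it) and approximates the Jacobian by central finite differences. All names are hypothetical, and the toy score for a mean and a variance is used only because its answer is known in closed form.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [u - f * v for u, v in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def total_score(g, theta, data):
    """Sum of the subject-level scores g(theta, subject)."""
    total = [0.0] * len(theta)
    for subject in data:
        total = [t + s for t, s in zip(total, g(theta, subject))]
    return total


def jacobian(g, theta, data, h=1e-6):
    """Central-difference Jacobian of the summed score; entry [r][c] = d score_r / d theta_c."""
    p = len(theta)
    cols = []
    for c in range(p):
        up, dn = list(theta), list(theta)
        up[c] += h
        dn[c] -= h
        su, sd = total_score(g, up, data), total_score(g, dn, data)
        cols.append([(u - d) / (2 * h) for u, d in zip(su, sd)])
    return [[cols[c][r] for c in range(p)] for r in range(p)]


def newton(g, theta0, data, tol=1e-10, max_iter=50):
    """Newton-Raphson iteration for total_score(theta) = 0, started at theta0."""
    theta = list(theta0)
    for _ in range(max_iter):
        score = total_score(g, theta, data)
        step = solve_linear(jacobian(g, theta, data), [-s for s in score])
        theta = [t + d for t, d in zip(theta, step)]
        if max(abs(d) for d in step) < tol:
            break
    return theta


def sandwich(g, theta, data):
    """n * A^{-1} B A^{-T}: A is the summed Jacobian, B the sum of score outer products."""
    n, p = len(data), len(theta)
    a = jacobian(g, theta, data)
    b = [[0.0] * p for _ in range(p)]
    for subject in data:
        s = g(theta, subject)
        for r in range(p):
            for c in range(p):
                b[r][c] += s[r] * s[c]
    m_cols = [solve_linear(a, [b[r][c] for r in range(p)]) for c in range(p)]
    # m_cols[c][r] is entry (r, c) of M = A^{-1} B, so m_cols[c] is column c of M.
    # Column c of M^T is row c of M, i.e. [m_cols[k][c] for k in range(p)].
    x_cols = [solve_linear(a, [m_cols[k][c] for k in range(p)]) for c in range(p)]
    # x_cols[c] is column c of A^{-1} M^T = (M A^{-T})^T, i.e. row c of M A^{-T}.
    return [[n * x_cols[r][c] for c in range(p)] for r in range(p)]


def mean_var_score(theta, y):
    """Toy subject-level score for a mean mu and a variance v."""
    mu, v = theta
    return [y - mu, (y - mu) ** 2 - v]


toy_data = [1.0, 2.0, 3.0, 4.0]
theta_hat = newton(mean_var_score, [0.0, 1.0], toy_data)
cov_hat = sandwich(mean_var_score, theta_hat, toy_data)
```

Because multiple solutions may exist, the starting value matters in practice; the iteration only targets a root near where it starts.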

3.2 Examples

We illustrate our method using two examples.

Example 1. We consider a linear transition model with r = 1 and q = 1:

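$$Y_{ij} = \beta_0 + \alpha Y_{i,j-1} + \beta_x X_{ij} + \beta_z^T Z_{ij} + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma_y^2), \quad j = 2, \ldots, m. \tag{6}$$

Then it is easy to calculate that the sufficient statistic for xij is $T_{i1}^{(j)} = \beta_x Y_{ij}/\sigma_y^2 + W_{ij}/\sigma_u^2$ and (Yij|Vij(θ); θ) is the conditional density of Yij given $T_{i1}^{(j)}$, $Y_{i,j-1}$ and Zij. This density is the same as the conditional density of Yij given $Q_{ij} = \beta_x(Y_{ij} - \beta_0 - \alpha Y_{i,j-1} - \beta_z^T Z_{ij})/\sigma_y^2 + W_{ij}/\sigma_u^2$, $Y_{i,j-1}$ and Zij, whose logarithm is equal to

$$-\log\sqrt{2\pi\sigma_y^{*2}} - (2\sigma_y^{*2})^{-1}\left(Y_{ij} - \beta_0 - \alpha Y_{i,j-1} - \beta_z^T Z_{ij} - Q_{ij}\beta_x^*\right)^2, \quad j = 2, \ldots, m,$$

where $\beta_x^* = \beta_x/(\beta_x^2/\sigma_y^2 + 1/\sigma_u^2)$ and $\sigma_y^{*2} = (\beta_x^2/\sigma_y^2 + 1/\sigma_u^2)^{-1}\sigma_y^2/\sigma_u^2$. Differentiating the above function with respect to all the parameters and then substituting the expression for Qij, we obtain the following pseudo conditional score equations

$$\begin{aligned}
0 &= \sum_{i=1}^{n}\sum_{j=2}^{m}\begin{pmatrix}1 \\ Y_{i,j-1} \\ Z_{ij}\end{pmatrix}\left\{Y_{ij} - \beta_0 - \alpha Y_{i,j-1} - \beta_z^T Z_{ij} - \beta_x W_{ij}\right\}, \\
0 &= \sum_{i=1}^{n}\sum_{j=2}^{m}\left\{(Y_{ij} - \beta_0 - \alpha Y_{i,j-1} - \beta_z^T Z_{ij})\beta_x + W_{ij}\sigma_y^2/\sigma_u^2\right\} \times \left(Y_{ij} - \beta_0 - \alpha Y_{i,j-1} - \beta_z^T Z_{ij} - \beta_x W_{ij}\right), \\
0 &= \sum_{i=1}^{n}\sum_{j=2}^{m}\left\{\left(Y_{ij} - \beta_0 - \alpha Y_{i,j-1} - \beta_z^T Z_{ij} - \beta_x W_{ij}\right)^2 - (\beta_x^2\sigma_u^2 + \sigma_y^2)\right\}.
\end{aligned}$$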

Clearly, each term for i and j is the conditional score obtained for subject i at time j given the past history. Moreover, the first equation corresponds to the parameters $(\beta_0, \alpha, \beta_z^T)$, the second equation corresponds to $\beta_x$, and the last equation corresponds to $\sigma_y^2$.
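The sketch below illustrates one way to solve the linear-model equations of Example 1 for scalar Zij. It relies on an algebraic observation of ours rather than a statement from the text: combining the second and third equations gives $\sum W_{ij}(\tilde\varepsilon_{ij} - \beta_x W_{ij}) = -N\beta_x\sigma_u^2$, where $\tilde\varepsilon_{ij} = Y_{ij} - \beta_0 - \alpha Y_{i,j-1} - \beta_z Z_{ij}$ and N is the number of (i, j) terms, so the regression coefficients solve least-squares normal equations in which $\sum W_{ij}^2$ is reduced by $N\sigma_u^2$. The code then evaluates the three displayed equations at that solution as a check. The simulation constants and all function names are hypothetical.

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [u - f * v for u, v in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def simulate_rows(n, m, sigma_u2, rng):
    """Rows (Y_ij, Y_{i,j-1}, Z_i, W_ij) from a linear transition model (hypothetical constants)."""
    rows = []
    for _ in range(n):
        z = 1.0 if rng.random() < 0.5 else 0.0
        x, y_prev = 0.25, 0.0
        for _ in range(m - 1):
            x = 0.5 + 0.8 * x + rng.gauss(0.0, 1.0)
            y = 1.0 + 0.4 * y_prev + 3.0 * x + 0.8 * z + rng.gauss(0.0, 1.0)
            w = x + rng.gauss(0.0, math.sqrt(sigma_u2))
            rows.append((y, y_prev, z, w))
            y_prev = y
    return rows


def fit_linear_pcs(rows, sigma_u2):
    """Return (beta0, alpha, beta_z, beta_x, sigma_y2) solving the three equations."""
    p, n_obs = 4, len(rows)
    ata = [[0.0] * p for _ in range(p)]
    aty = [0.0] * p
    for y, y_prev, z, w in rows:
        d = [1.0, y_prev, z, w]
        for r in range(p):
            aty[r] += d[r] * y
            for c in range(p):
                ata[r][c] += d[r] * d[c]
    ata[3][3] -= n_obs * sigma_u2  # remove the error contribution to sum of W^2
    beta0, alpha, beta_z, beta_x = solve_linear(ata, aty)
    rss = sum((y - beta0 - alpha * yp - beta_z * z - beta_x * w) ** 2
              for y, yp, z, w in rows)
    sigma_y2 = rss / n_obs - beta_x ** 2 * sigma_u2  # from the third equation
    return beta0, alpha, beta_z, beta_x, sigma_y2


def pcs_equations(theta, rows, sigma_u2):
    """Left-hand sides of the displayed equations, summed over all (i, j)."""
    beta0, alpha, beta_z, beta_x, sigma_y2 = theta
    eq1, eq2, eq3 = [0.0, 0.0, 0.0], 0.0, 0.0
    for y, yp, z, w in rows:
        e = y - beta0 - alpha * yp - beta_z * z  # epsilon-tilde
        r = e - beta_x * w
        for k, d in enumerate((1.0, yp, z)):
            eq1[k] += d * r
        eq2 += (e * beta_x + w * sigma_y2 / sigma_u2) * r
        eq3 += r * r - (beta_x ** 2 * sigma_u2 + sigma_y2)
    return eq1 + [eq2, eq3]


rows = simulate_rows(n=2000, m=6, sigma_u2=0.5, rng=random.Random(1))
theta_hat = fit_linear_pcs(rows, sigma_u2=0.5)
residuals = [v / len(rows) for v in pcs_equations(theta_hat, rows, sigma_u2=0.5)]
```

In small samples the implied estimate of $\sigma_y^2$ can be negative; the sketch does not guard against that case.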

Example 2. In this example, we consider a logistic transition model with r = q = 1, where Yij is a Bernoulli variable and satisfies

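$$\operatorname{logit} P(Y_{ij} = 1 \mid H_{ij}) = \beta_0 + \alpha Y_{i,j-1} + \beta_x X_{ij} + \beta_z^T Z_{ij}. \tag{7}$$

We can easily calculate that the sufficient statistic for xij is $T_{i1}^{(j)} = \beta_x Y_{ij} + W_{ij}/\sigma_u^2$ and that the logarithm of the conditional density $(Y_{ij} \mid T_{i1}^{(j)}, Y_{i,j-1}, Z_{ij}; \theta)$ is

$$-\frac{(T_{i1}^{(j)} - Y_{ij}\beta_x)^2\sigma_u^2}{2} + Y_{ij}(\beta_0 + \beta_z^T Z_{ij} + \alpha Y_{i,j-1}) - \log\left[\exp\left\{-\frac{(T_{i1}^{(j)} - \beta_x)^2\sigma_u^2}{2} + (\beta_0 + \beta_z^T Z_{ij} + \alpha Y_{i,j-1})\right\} + \exp\left\{-\frac{T_{i1}^{(j)2}\sigma_u^2}{2}\right\}\right].$$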

After differentiating the above function with respect to all the parameters then substituting the expression of Ti1(j), we obtain the following pseudo conditional score equations

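$$\begin{aligned}
0 &= \sum_{i=1}^{n}\sum_{j=2}^{m}\begin{pmatrix}1 \\ Y_{i,j-1} \\ Z_{ij}\end{pmatrix} \times \left[Y_{ij} - \frac{1}{1 + \exp\left\{(1/2 - Y_{ij})\beta_x^2\sigma_u^2 - \beta_x W_{ij} - (\beta_0 + \beta_z^T Z_{ij} + \alpha Y_{i,j-1})\right\}}\right], \\
0 &= \sum_{i=1}^{n}\sum_{j=2}^{m}\left[Y_{ij}W_{ij} - \frac{(Y_{ij}\beta_x + W_{ij}/\sigma_u^2 - \beta_x)\sigma_u^2}{1 + \exp\left\{(1/2 - Y_{ij})\beta_x^2\sigma_u^2 - W_{ij}\beta_x - (\beta_0 + \beta_z^T Z_{ij} + \alpha Y_{i,j-1})\right\}}\right].
\end{aligned}$$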
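As a numerical illustration of Example 2, the sketch below evaluates the average of the logistic pseudo conditional score terms at the data-generating parameter value, where it should be close to zero whatever the distribution of the true covariate. To emphasise that point, the hypothetical simulation uses a skewed, non-normal innovation for the X process. Function names and all constants are assumptions made for illustration, and Zij is taken to be scalar.

```python
import math
import random


def simulate_logistic_rows(n, m, theta, sigma_u2, rng):
    """Rows (Y_ij, Y_{i,j-1}, Z_i, W_ij) from a logistic transition model with q = r = 1."""
    beta0, alpha, beta_z, beta_x = theta
    rows = []
    for _ in range(n):
        z = 1.0 if rng.random() < 0.5 else 0.0
        x = 0.25
        y_prev = 1.0 if rng.random() < 0.5 else 0.0
        for _ in range(m - 1):
            # Centred exponential innovation: the X distribution is deliberately non-normal.
            x = 0.4 + 0.6 * x + (rng.expovariate(1.0) - 1.0)
            eta = beta0 + alpha * y_prev + beta_z * z + beta_x * x
            y = 1.0 if rng.random() < 1.0 / (1.0 + math.exp(-eta)) else 0.0
            w = x + rng.gauss(0.0, math.sqrt(sigma_u2))
            rows.append((y, y_prev, z, w))
            y_prev = y
    return rows


def pcs_logistic_score(theta, rows, sigma_u2):
    """Average of the pseudo conditional score terms of Example 2.

    Components 0-2 correspond to (beta0, alpha, beta_z); component 3 to beta_x.
    """
    beta0, alpha, beta_z, beta_x = theta
    total = [0.0, 0.0, 0.0, 0.0]
    for y, y_prev, z, w in rows:
        eta0 = beta0 + beta_z * z + alpha * y_prev
        expo = (0.5 - y) * beta_x ** 2 * sigma_u2 - beta_x * w - eta0
        p = 1.0 / (1.0 + math.exp(expo))  # P(Y_ij = 1 | T, Y_{i,j-1}, Z_ij)
        total[0] += y - p
        total[1] += y_prev * (y - p)
        total[2] += z * (y - p)
        total[3] += y * w - (y * beta_x + w / sigma_u2 - beta_x) * sigma_u2 * p
    return [t / len(rows) for t in total]


theta_true = (-1.0, 0.5, 0.8, 1.0)  # (beta0, alpha, beta_z, beta_x), hypothetical
rows = simulate_logistic_rows(10000, 5, theta_true, 0.5, random.Random(7))
avg_score = pcs_logistic_score(theta_true, rows, 0.5)
```

Passing `pcs_logistic_score` (multiplied by the number of rows, or applied per subject) to a root finder would give the estimator; the sketch stops at the unbiasedness check.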

3.3 Method for selecting transition orders

In practice, the transition orders (q, r) in the Y model (2) are often unknown. As our model is a semiparametric model, a full likelihood does not exist. Hence standard model selection methods are not directly applicable. We propose to choose (q, r) based on the pseudo log-likelihood function

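$$\sum_{i=1}^{n}\ln (Y_{im} \mid V_{im}^{(q,r)}(\theta_0); \theta),$$

where $(Y_{im} \mid V_{im}^{(q,r)}(\theta_0); \theta)$ is defined right after equation (4), i.e.,

$$V_{im}^{(q,r)}(\theta_0) = \left(Y_{i,m-1}, \ldots, Y_{i,m-q}, Z_{im}, \ldots, Z_{i,m-r+1}, T_{i1}^{(m)}(\theta_0), \ldots, T_{ir}^{(m)}(\theta_0)\right).$$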

Here θ0 denotes the parameter value under the true model and $V_{im}^{(q,r)}(\theta_0)$ is Vim(·) evaluated at the true value θ0 under the model with transition orders (q, r).

The function $(Y_{im} \mid V_{im}^{(q,r)}(\theta_0); \theta)$ is the true density when (q, r) is equal to the true transition orders. Therefore, we are able to transform the selection of (q, r) in the original model (2) into model selection in the new regression model given by $(Y_{im} \mid V_{im}^{(q,r)}(\theta_0); \theta)$. Note that using $V_{im}^{(q,r)}(\theta_0)$ instead of $V_{im}^{(q,r)}(\theta)$ in the new model ensures that the covariate values do not vary with different (q, r). However, since $V_{im}^{(q,r)}(\theta_0)$ is unknown, we propose to estimate it by $V_{im}^{(q,r)}(\hat\theta_F)$, where $\hat\theta_F$ is the parameter estimator under the full model with q = m − 1 and r = m using the conditional score approach. Since $\hat\theta_F$ is consistent, $V_{im}^{(q,r)}(\hat\theta_F)$ is a good approximation of $V_{im}^{(q,r)}(\theta_0)$.

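Finally, we treat the pseudo log-likelihood function $\sum_{i=1}^{n}\ln (Y_{im} \mid V_{im}^{(q,r)}(\hat\theta_F); \theta)$ like a “likelihood,” and select (q, r) by minimizing the pseudo Akaike information criterion (P_AIC) defined as

$$-2\sum_{i=1}^{n}\ln (Y_{im} \mid V_{im}^{(q,r)}(\hat\theta_F); \hat\theta) + 2\,\mathrm{Card}(\hat\theta),$$

or the pseudo Bayesian information criterion (P_BIC) defined as

$$-2\sum_{i=1}^{n}\ln (Y_{im} \mid V_{im}^{(q,r)}(\hat\theta_F); \hat\theta) + \mathrm{Card}(\hat\theta)\log n,$$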

where θ̂ is the estimate maximizing the pseudo likelihood function and Card(θ̂) denotes the number of parameters in the model.

The proposed method has been demonstrated to perform well in our numerical studies. However, it is not fully theoretically justified.
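Once the maximised pseudo log-likelihood and the parameter count are available for each candidate (q, r), the selection step itself is a small computation. The sketch below assumes those quantities have already been obtained; the function name and the numerical values are hypothetical and are not results from the paper.

```python
import math


def select_orders(pseudo_loglik, n_params, n):
    """Pick (q, r) minimising P_AIC and P_BIC.

    pseudo_loglik : dict mapping (q, r) to the maximised pseudo log-likelihood
    n_params      : dict mapping (q, r) to Card(theta-hat)
    n             : number of subjects
    """
    p_aic = {k: -2.0 * ll + 2.0 * n_params[k] for k, ll in pseudo_loglik.items()}
    p_bic = {k: -2.0 * ll + n_params[k] * math.log(n) for k, ll in pseudo_loglik.items()}
    return min(p_aic, key=p_aic.get), min(p_bic, key=p_bic.get)


# Hypothetical fitted values for three candidate models.
loglik = {(1, 1): -120.0, (2, 1): -118.5, (2, 2): -118.0}
cards = {(1, 1): 3, (2, 1): 4, (2, 2): 5}
best_aic, best_bic = select_orders(loglik, cards, n=200)
```

With these made-up inputs the two criteria disagree, with P_BIC choosing the smaller model because its penalty per parameter is log n rather than 2.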

4. Numerical Results

4.1 Simulation studies

Corresponding to the two examples illustrated in the previous section, two simulation studies are conducted to examine the finite-sample performance of the proposed pseudo conditional score approach. Specifically, in the first simulation study, the longitudinal response Yij is generated from

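$$Y_{ij} = 1 + 0.4Y_{i,j-1} + 3X_{ij} + 0.8Z_i + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, 1), \quad i = 1, \ldots, n, \ j = 2, \ldots, m,$$

where Zi is a Bernoulli variable with P(Zi = 1) = 0.5 and Xij follows the first order transition model

$$X_{ij} = 0.5 + 0.8X_{i,j-1} + \varepsilon_{xij}, \quad \varepsilon_{xij} \sim N(0, 1), \quad i = 1, \ldots, n, \ j = 2, \ldots, m. \tag{8}$$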

Here we assume the number of repeated measures per subject m = 6. We use Xi1 = 0.25 and Yi1 = −5/12 + 5Zi/3 as values at time one. The measurement error distribution in (3) has a variance 0.5. In the second simulation study, we generate binary responses from a logistic transition model with mean

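$$\frac{\exp\{1 + 0.5Y_{i,j-1} + X_{ij} + 0.8Z_i\}}{1 + \exp\{1 + 0.5Y_{i,j-1} + X_{ij} + 0.8Z_i\}}, \quad i = 1, \ldots, n, \ j = 2, \ldots, m,$$

where Zi is generated from a Bernoulli distribution with P(Zi = 1) = 0.5 and Xij follows

$$X_{ij} = 0.4 + 0.5Z_i + 0.6X_{i,j-1} + \varepsilon_{xij}, \quad \varepsilon_{xij} \sim N(0, 0.5), \quad i = 1, \ldots, n, \ j = 2, \ldots, m.$$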

The measurement error variance is set to 0.5. The initial states are Xi1 = 0.25 and Yi1 drawn from the Bernoulli distribution with probability 0.5. In both simulation studies, we solve the pseudo conditional score equations given in Examples 1 and 2 to obtain the estimators, and their asymptotic variances are estimated using the formula for $\hat\Sigma_n$. Table 1 summarizes the results from 1000 repetitions with sample sizes n = 100 or 200. The results show that in finite samples, the pseudo conditional score estimators have virtually no bias and the estimated standard errors agree well with the empirical standard errors.

Table 1.

Simulation results for the pseudo conditional score estimates based on 1000 repetitions

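| Model | Sample Size | Parameter | True Value | EST | EST_SE | EMP_SE | CP | MSE |
|---|---|---|---|---|---|---|---|---|
| Linear transition model | 100 | βx | 3.0 | 3.023 | 0.217 | 0.225 | 0.94 | 0.051 |
| | | βz | 0.8 | 0.804 | 0.322 | 0.329 | 0.95 | 0.108 |
| | | α | 0.4 | 0.396 | 0.039 | 0.039 | 0.94 | 0.0016 |
| | 200 | βx | 3.0 | 3.017 | 0.152 | 0.150 | 0.95 | 0.023 |
| | | βz | 0.8 | 0.797 | 0.227 | 0.226 | 0.95 | 0.052 |
| | | α | 0.4 | 0.397 | 0.027 | 0.027 | 0.95 | 0.0007 |
| Logistic transition model | 100 | βx | 1.0 | 1.067 | 0.283 | 0.283 | 0.97 | 0.084 |
| | | βz | 0.8 | 0.796 | 0.384 | 0.398 | 0.95 | 0.158 |
| | | α | 0.5 | 0.455 | 0.311 | 0.319 | 0.94 | 0.103 |
| | 200 | βx | 1.0 | 1.024 | 0.185 | 0.186 | 0.96 | 0.035 |
| | | βz | 0.8 | 0.812 | 0.262 | 0.258 | 0.96 | 0.067 |
| | | α | 0.5 | 0.481 | 0.216 | 0.214 | 0.95 | 0.046 |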

Note: EST is the mean of the estimates; EST_SE is the mean of the estimated standard errors; EMP_SE is the empirical standard error of the estimators; MSE is the mean square error; CP denotes the coverage proportion of the 95% confidence intervals.
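The summary columns of Table 1 can be computed from replicate-level output as sketched below. The function name, the use of a normal-quantile 95% interval (estimate ± 1.96 × estimated standard error), and the four made-up replicates are assumptions for illustration; they are not the paper's simulation output.

```python
import math
import statistics


def summarize(estimates, std_errors, true_value, z=1.96):
    """Compute EST, EST_SE, EMP_SE, CP and MSE for one parameter across replicates."""
    covered = sum(1 for e, s in zip(estimates, std_errors)
                  if abs(e - true_value) <= z * s)
    return {
        "EST": statistics.fmean(estimates),          # mean of the estimates
        "EST_SE": statistics.fmean(std_errors),      # mean estimated standard error
        "EMP_SE": statistics.stdev(estimates),       # empirical standard error
        "CP": covered / len(estimates),              # coverage of nominal 95% intervals
        "MSE": statistics.fmean([(e - true_value) ** 2 for e in estimates]),
    }


# Four made-up replicates for a parameter whose true value is 3.
summary = summarize([2.9, 3.1, 3.0, 3.2], [0.1, 0.1, 0.1, 0.1], true_value=3.0)
```

As a consistency check on the table itself, MSE is approximately the squared bias plus the squared empirical standard error; for βx in the linear model with n = 100, 0.023² + 0.225² ≈ 0.051.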

We next conduct a simulation study to compare the robustness of the semiparametric pseudo conditional score method with the parametric maximum likelihood method given in Pan et al. (2006) when the X model is misspecified. We use the same setting as in the first simulation study with m = 6. We consider three distribution scenarios for X: (a) Xij follows the first order transition model (8) with error εxij following a normal mixture, 0.5N(−0.5, 1) + 0.5N(0.5, 1); (b) Xij follows the first order transition model (8) with error εxij following the extreme-value distribution; (c) Xij follows a second-order transition model Xij = 0.5 + 0.8Xi,j−2 + N(0, 1). For all three scenarios, the parametric maximum likelihood estimation (MLE) method treats Xij as arising from a first-order transition model with a normal error distribution. We hence expect the parametric MLE method to be biased because it misspecifies either the transition pattern or the error distribution.

Table 2 summarizes the robustness simulation results from 1000 repetitions with n = 100 and 200. The results show that the parametric MLE approach gives biased estimates of the regression coefficients, especially α, for which the relative bias is between roughly 4% and 10%. When the error distribution in the X model deviates slightly from normality as a normal mixture, the bias is small but the coverage probability can be poor. However, when the transition order in the X model is misspecified, the bias is more pronounced and is close to 10%, and the coverage probability becomes very poor. In contrast, the pseudo conditional score approach always yields small bias and accurate coverage.

Table 2.

Robustness comparison between the pseudo conditional score method and the parametric maximum likelihood method when the X model is mis-specified

| X model | n | Par. | True | Pseudo conditional score: Rel. Bias (%) | EMP_SE | CP | Parametric method: Rel. Bias (%) | EMP_SE | CP |
|---|---|---|---|---|---|---|---|---|---|
| 1st-order transition model with mixture normal error | 100 | βx | 3 | 0.47 | 0.116 | 0.95 | −2.50 | 0.099 | 0.88 |
| | | βz | 0.8 | −0.87 | 0.231 | 0.96 | −3.87 | 0.217 | 0.95 |
| | | α | 0.4 | −0.75 | 0.025 | 0.94 | 3.75 | 0.022 | 0.89 |
| | 200 | βx | 3 | 0.23 | 0.082 | 0.95 | −2.63 | 0.069 | 0.79 |
| | | βz | 0.8 | 0.62 | 0.166 | 0.94 | −2.37 | 0.158 | 0.95 |
| | | α | 0.4 | −0.25 | 0.018 | 0.95 | 4.00 | 0.016 | 0.82 |
| 1st-order transition model with extreme-value error | 100 | βx | 3 | 1.53 | 0.221 | 0.94 | −2.83 | 0.127 | 0.93 |
| | | βz | 0.8 | 2.75 | 0.251 | 0.94 | −3.75 | 0.229 | 0.95 |
| | | α | 0.4 | −1.75 | 0.049 | 0.93 | 7.75 | 0.033 | 0.58 |
| | 200 | βx | 3 | 0.53 | 0.160 | 0.96 | −3.27 | 0.094 | 0.88 |
| | | βz | 0.8 | 0.25 | 0.173 | 0.95 | −5.75 | 0.158 | 0.94 |
| | | α | 0.4 | −0.50 | 0.034 | 0.94 | 8.25 | 0.023 | 0.31 |
| 2nd-order transition model with normal error | 100 | βx | 3 | 0.40 | 0.102 | 0.97 | 1.87 | 0.105 | 0.89 |
| | | βz | 0.8 | 0.00 | 0.235 | 0.95 | −6.37 | 0.238 | 0.93 |
| | | α | 0.4 | 0.00 | 0.023 | 0.95 | 9.75 | 0.024 | 0.84 |
| | 200 | βx | 3 | 0.20 | 0.071 | 0.95 | 1.80 | 0.073 | 0.80 |
| | | βz | 0.8 | −0.12 | 0.169 | 0.95 | −6.62 | 0.170 | 0.94 |
| | | α | 0.4 | 0.00 | 0.016 | 0.95 | 9.75 | 0.017 | 0.72 |

Note: see Table 1.

To evaluate the method in selecting transition orders as proposed in Section 3.3, we conduct another simulation study with dichotomous outcome. The setting is similar to our second simulation study except that the mean probability is

$$\frac{\exp\{1 + Y_{i,j-2} + X_{i,j-1} + 0.8Z_i\}}{1 + \exp\{1 + Y_{i,j-2} + X_{i,j-1} + 0.8Z_i\}}, \quad i = 1, \ldots, n, \ j = 3, 4.$$

That is, the true transition order is q = r = 2. We apply the proposed method to fit models for all possible combinations of transition orders (q, r) with q = 1, 2, 3 and r = 1, 2, 3, 4. The pseudo AIC and the pseudo BIC are used for selecting the final orders. The result from 1000 repetitions with sample sizes 200 and 400 is given in Table 3. The result shows that the proposed method works well. Overall, the pseudo BIC outperforms the pseudo AIC, especially when sample size is large.

Table 3.

Frequency table of transition orders selected using the pseudo-likelihood function from 1000 repetitions

| n | q | P_AIC: r = 1 | r = 2 | r = 3 | r = 4 | P_BIC: r = 1 | r = 2 | r = 3 | r = 4 |
|---|---|---|---|---|---|---|---|---|---|
| 200 | 1 | 5 | 87 | 42 | 55 | 43 | 340 | 64 | 35 |
| | 2 | 28 | 318 | 129 | 161 | 90 | 334 | 39 | 35 |
| | 3 | 9 | 79 | 37 | 50 | 2 | 14 | 1 | 3 |
| 400 | 1 | 0 | 14 | 10 | 19 | 3 | 160 | 28 | 16 |
| | 2 | 1 | 462 | 200 | 151 | 12 | 678 | 61 | 18 |
| | 3 | 0 | 83 | 30 | 30 | 0 | 14 | 2 | 2 |

4.2 Numerical study on efficiency loss

The pseudo conditional score equation approach relies on the conditional likelihood function, so it does not utilize the full data information. Hence it may not give the efficient estimators. It is useful to know how much efficiency is lost when using such an approach. Since deriving the asymptotic efficiency bound for model (1) is generally difficult, we focus our discussion on the situation where Yij is a normal outcome and r = 1 and q = 1 as in (6). Furthermore, we assume Zij and Xij are independent but allow the repeated measures of Xij to be correlated.

From Example 1, we know that $Q_{ij} = \beta_x(Y_{ij} - \beta_0 - \alpha Y_{i,j-1} - \beta_z^T Z_{ij})/\sigma_y^2 + W_{ij}/\sigma_u^2$, j = 2, …, m, are sufficient statistics for xij, j = 2, …, m. In fact, they are also complete. Therefore, following Bickel et al. (1993, Chap. 4, p. 130), one can explicitly calculate the semiparametric efficiency bound (see the appendix). Thus, the efficiency loss of the pseudo conditional score estimator can be evaluated by comparing this efficiency bound with Σ as given in Theorem 1.

We utilize a concrete example to illustrate the efficiency loss. Suppose that (Yij, Wij) follows

$$Y_{ij} = 1 + 0.5Y_{i,j-1} + X_{ij} + 0.6Z_i + N(0, 2), \quad W_{ij} = X_{ij} + N(0, 0.5),$$

where Zi is a Bernoulli variable with P(Zi = 1) = 0.5 and Xij is generated from the following transition model

$$X_{ij} = 0.4 + 0.5X_{i,j-1} + N(0, \sigma_x^2).$$

For different choices of $\sigma_x^2$ (0.3 or 0.15) and different cluster sizes m (3 or 4), we compute the asymptotic efficiency of the pseudo conditional score estimators for βx, βz, α relative to the semiparametric efficiency bound. The results show that the efficiency loss increases as $\sigma_x^2$ decreases; it varies from 10% to 20% in estimating βx and α as m increases from 3 to 4; however, no efficiency is lost in estimating βz.

4.3 Application to the ACSUS data

We apply our method to analyze the ACSUS data. Specifically, we restricted our attention to 533 patients who completed the first year of interviews. The participants were interviewed every 3 months, four times in total. The outcome was whether they had hospital admissions (yes/no) during the three months between two consecutive interviews. It is of scientific interest to study the effect of CD4 counts in predicting future hospitalization given the past history of hospitalization. As discussed in the introduction, CD4 counts were subject to considerable measurement error. Thus, a natural model for analyzing this data set is a prediction model that accounts for measurement error in CD4 counts. A logistic transition model is used to fit the data with covariate W = log(CD4/100), a transformed variable that reduces the marked skewness of CD4 counts (Figure 1). We note that even after a log-transformation, the commonly used transformation for CD4 counts, CD4 counts still do not look normally distributed. This motivates us to leave the distribution of the true CD4 counts fully unspecified by using the pseudo conditional score method. Other covariates include age (10 categories coded as 1–10), antiretroviral drug use, HIV-symptomatic at baseline, race, and gender. Additionally, the past hospitalization history is adjusted for in the analysis. The measurement error variance for W, $\sigma_u^2 = 0.38$, is set to be 1/3 of the variance of baseline W. This value is close to the estimated measurement error variance 0.39 reported by Wulfsohn and Tsiatis (1997), using data from another AIDS study conducted by Burroughs-Wellcome. In addition, we also fit the model using $\sigma_u^2 = 0.18$ to obtain parameter estimates under a more conservative measurement error setting.

Figure 1.

Histogram of log-transformed CD4 counts in the ACSUS data

To select the best transition order (q, r), we apply the pseudo BIC method proposed in Section 3.3. The result shows that q = 1 and r = 1 give the smallest value of the pseudo BIC. This finding agrees with the result obtained from testing the significance of the extra terms when the highest order transitional model is fit: specifically, we fit the largest transition model with q = 3 and r = 4 and test for the significance of the terms of higher than first order, and we find that they are far from significant. Hence our final model has the transition order q = 1 and r = 1. The parameter estimation result is given in Table 4, where the reported estimates are the estimated log-odds ratios of the covariates. Women have a significantly higher risk of future hospitalization than men. The effect of CD4 counts on the risk of future hospitalization is significant, given the previous hospitalization status. Subjects who had a previous hospital admission and who had lower CD4 counts were more likely to be hospitalized in the future.

Table 4.

Application of the pseudo conditional score method to analysis of the ACSUS data

| Parameter | $\sigma_u^2 = 0.38$: Estimate | SE | $\sigma_u^2 = 0.18$: Estimate | SE | Naive: Estimate | SE |
|---|---|---|---|---|---|---|
| log(CD4/100) (βx) | −0.460 | 0.072 | −0.416 | 0.067 | −0.383 | 0.063 |
| age | 0.030 | 0.056 | 0.031 | 0.055 | 0.032 | 0.054 |
| antiretroviral drug use | 0.051 | 0.235 | 0.077 | 0.232 | 0.097 | 0.230 |
| HIV symptomatic | 0.086 | 0.191 | 0.069 | 0.188 | 0.058 | 0.187 |
| race | 0.208 | 0.214 | 0.209 | 0.211 | 0.209 | 0.210 |
| sex (female vs. male) | 0.621 | 0.243 | 0.577 | 0.239 | 0.545 | 0.237 |
| previous hospitalization (α) | 1.838 | 0.253 | 1.865 | 0.250 | 1.885 | 0.248 |

We also fit the model with the measurement error variance $\sigma_u^2$ set to 0.18, which corresponds to the situation in which the coefficient of variation for the baseline W is 50%. The findings are similar, but the estimated effect of W is slightly attenuated. We also present in Table 4 the naive estimates obtained by ignoring measurement error. The naive estimate of the CD4 count effect is closer to zero than either corrected estimate, consistent with attenuation.

5. Discussion

We consider in this paper transition measurement error models for longitudinal data. We propose a pseudo conditional score approach that does not require specifying the distribution of the unobserved covariate. Both numerical calculations and simulation studies show that the estimator using the pseudo conditional score method performs well.

The approach extends the classical conditional score approach of Stefanski and Carroll (1987) in the following respects. First, the classical conditional score approach relies on extracting sufficient statistics for the error-prone covariates from the full likelihood function, which is not feasible for transition models; instead, our approach works with the conditional likelihood at each time point. Second, because the conditional scores from different time points are correlated, a sandwich variance estimator must be used for inference. Third, a question specific to transition models is how to choose the transition orders, and we have provided a way to do so based on the pseudo-likelihood function. Furthermore, we note that the proposed approach is applicable whenever $X_{i,j-k}$ enters expression (2) linearly, no matter how $Y_{i,j-k}$ or its transformed value enters expression (2). Therefore, our approach can also be used for other transition models such as the ones proposed for count data in Diggle et al. (2002). Assigning different weights to the conditional scores from different time points might improve efficiency, but we have not yet explored this refinement.

One important issue in fitting a transition model is the selection of transition orders (q, r). If one is willing to assume a parametric model for X, (q, r) can be selected using various model selection criteria, such as AIC and BIC. However, under the semiparametric model considered in this paper, we are not aware of any literature on choosing q and r. In this paper, we propose to select (q, r) using the pseudo likelihood function, and a small simulation study indicates that the method works well. Theoretical justification of the proposed method needs more work.

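Another important issue is to determine the size of the measurement error, $\sigma_u^2$, which can be estimated using replication or validation data. In this case, Theorem 1 needs to be slightly modified to account for the variability due to estimating $\sigma_u^2$. In particular, following the same proof as for Theorem 1, we obtain that the asymptotic variance equals the variance of

$$\begin{aligned}
&E_{\theta_0}\left[\sum_{j=s+1}^{m}\nabla_\theta g_{\sigma_{0u}^2}(Y_{ij} \mid V_{ij}(\theta); \theta)\Big|_{\theta=\theta_0}\right]^{-1}\sum_{j=s+1}^{m} g_{\sigma_{0u}^2}(Y_{ij} \mid V_{ij}(\theta_0); \theta_0) \\
&\quad + E_{\theta_0}\left[\sum_{j=s+1}^{m}\nabla_\theta g_{\sigma_u^2}(Y_{ij} \mid V_{ij}(\theta); \theta)\Big|_{\theta=\theta_0}\right]^{-1}\frac{\partial}{\partial\sigma_u^2}\Big|_{\sigma_u^2=\sigma_{0u}^2} E_{\theta_0}\left[\sum_{j=s+1}^{m} g_{\sigma_u^2}(Y_{ij} \mid V_{ij}(\theta_0); \theta_0)\right] S_{\sigma_{0u}^2},
\end{aligned}$$

where $g_{\sigma_u^2}(Y_{ij} \mid \cdot)$ is the same as defined in Theorem 1 but indexed by $\sigma_u^2$, $\sigma_{0u}^2$ denotes the true value of $\sigma_u^2$, and $S_{\sigma_{0u}^2}$ is the influence function from estimating $\sigma_{0u}^2$ using the validation sample. Clearly, the second part of the above expression reveals the influence on estimating θ0 when $\sigma_{0u}^2$ is estimated. When neither validation data nor replications are available, one possible strategy is to conduct a sensitivity analysis (e.g., Li and Lin, 2000) by varying the size of the measurement error over a reasonable range.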

ACKNOWLEDGMENT

Lin’s research was supported by National Cancer Institute grant CA–76404 and National Heart, Lung and Blood Institute grant HL–58611.

APPENDIX

Calculation of semiparametric efficiency bound in (6)

From Bickel et al. (1993), the semiparametric efficiency bound in (6) is given by $\Sigma_e = \{E[\dot\ell_\theta^*(Y_i, W_i, Z_i; \theta, G)^{\otimes 2}]\}^{-1}$, where $a^{\otimes 2} = aa^T$ and

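$$
\dot\ell_\theta^*(Y_i, W_i, Z_i; \theta, G) = E[\dot\ell_\theta^c(Y_i, W_i, Z_i, X_i; \theta) \mid Y_i, W_i, Z_i] - E[\dot\ell_\theta^c(Y_i, W_i, X_i, Z_i; \theta) \mid Q_i, Z_i].
$$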

Here, $Y_i = (Y_{i2}, \ldots, Y_{im})^T$, $W_i = (W_{i2}, \ldots, W_{im})^T$, $Z_i = (Z_{i2}, \ldots, Z_{im})^T$, $X_i = (X_{i2}, \ldots, X_{im})^T$, $Q_i = (Q_{i2}, \ldots, Q_{im})^T$, $\dot\ell_\theta^c$ is the score function for $\theta$ with the complete data $(Y_i, X_i, Z_i)$, and $G(\cdot)$ denotes the joint distribution of $X_i$. In particular, direct calculations give $\dot\ell_\theta^*(Y_i, W_i, Z_i; \theta, G)$ equal to

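$$
\frac{1}{\sigma_y^2}\sum_{j=2}^{m}
\begin{pmatrix}
\tilde\varepsilon_{ij} - E[\tilde\varepsilon_{ij}\mid Q_{ij}] \\
Z_{ij}\,(\tilde\varepsilon_{ij} - E[\tilde\varepsilon_{ij}\mid Q_{ij}]) \\
Y_{i,j-1}\tilde\varepsilon_{ij} - E[Y_{i,j-1}\tilde\varepsilon_{ij}\mid Q_{ij}] - \beta_x (Y_{i,j-1} - E[Y_{i,j-1}\mid Q_{ij}])\,E[X_{ij}\mid Q_{ij}] \\
(\tilde\varepsilon_{ij} - E[\tilde\varepsilon_{ij}\mid Q_{ij}])\,E[X_{ij}\mid Q_{ij}] \\
\left(\tilde\varepsilon_{ij}^2 - E[\tilde\varepsilon_{ij}^2\mid Q_{ij}] - 2\beta_x(\tilde\varepsilon_{ij} - E[\tilde\varepsilon_{ij}\mid Q_{ij}])\,E[X_{ij}\mid Q_{ij}]\right)/(2\sigma_y^2)
\end{pmatrix},
$$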

where $\tilde\varepsilon_{ij} = Y_{ij} - \beta_0 - \alpha Y_{i,j-1} - \beta_z^T Z_{ij}$.
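To indicate how the "direct calculations" proceed, the following sketch works out the first (intercept) component only. It assumes a normal linear transition model, so that the complete-data score for $\beta_0$ is $\sigma_y^{-2}\sum_j(\tilde\varepsilon_{ij} - \beta_x X_{ij})$, and it assumes that the conditional expectation of $X_{ij}$ given the observed data depends on the data only through $Q_{ij}$; both are assumptions of this illustration rather than statements taken from the derivation above.

$$
\begin{aligned}
E[\dot\ell^c_{\beta_0} \mid Y_i, W_i, Z_i] &= \frac{1}{\sigma_y^2}\sum_{j=2}^{m}\left(\tilde\varepsilon_{ij} - \beta_x E[X_{ij}\mid Q_{ij}]\right), \\
E[\dot\ell^c_{\beta_0} \mid Q_i, Z_i] &= \frac{1}{\sigma_y^2}\sum_{j=2}^{m}\left(E[\tilde\varepsilon_{ij}\mid Q_{ij}] - \beta_x E[X_{ij}\mid Q_{ij}]\right), \\
\dot\ell^*_{\beta_0} &= \frac{1}{\sigma_y^2}\sum_{j=2}^{m}\left(\tilde\varepsilon_{ij} - E[\tilde\varepsilon_{ij}\mid Q_{ij}]\right).
\end{aligned}
$$

The terms involving $E[X_{ij}\mid Q_{ij}]$ cancel in the difference, which is why the first component does not depend on the distribution of $X_i$. The remaining components follow from the same two-step projection applied to the scores for $\beta_z$, $\alpha$, $\beta_x$, and $\sigma_y^2$.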

For specific examples, the above semiparametric efficiency bound can be calculated explicitly in terms of the first two moments of $E[Y_i\mid Q_i]$, $E[X_i\mid Q_i]$, and $E[\tilde\varepsilon_{ij}\mid Q_i]$. For example, assume

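  • (M.1) $(Y_i, W_i)$ follows $Y_{ij} = \beta_0 + \beta_z Z_{ij} + \beta_x X_{ij} + \alpha Y_{i,j-1} + \varepsilon_{ij}$, $W_{ij} = X_{ij} + U_{ij}$;

  • (M.2) $X$ is generated from the transition model $X_{ij} = \gamma_0 + \gamma_x X_{i,j-1} + \varepsilon_{xij}$;

  • (M.3) $Z_{ij} = \cdots = Z_{i1}$ has mean $m_z$ and variance $\upsilon_z$ and is independent of $X_i$;

  • (M.4) $Y_{i1}$ has mean $m_y$ and variance $\upsilon_y$, and $X_{i1}$ has mean $m_x$ and variance $\upsilon_x$;

  • (M.5) $(\varepsilon_{ij}, U_{ij}, \varepsilon_{xij})$ are independent normal random variables with mean zero and variances $\sigma_y^2$, $\sigma_u^2$, and $\sigma_x^2$, respectively.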

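Then under conditions (M.1) to (M.5), $X_i$ given $Q_i$ follows a multivariate normal distribution with mean $[\Sigma_x^{-1} + (\beta_x^2/\sigma_y^2 + 1/\sigma_u^2) I_{m\times m}]^{-1}(\Sigma_x^{-1}\mu_x + Q_i)$, where $\mu_x = E[X_i]$ and $\Sigma_x$ is the covariance matrix of $X_i$; both can be calculated from condition (M.2). Additionally, $E[\tilde\varepsilon_{ij}\mid Q_i] = \beta_x(\beta_x^2/\sigma_y^2 + 1/\sigma_u^2)^{-1} Q_{ij}$ and $E[Y_{i,j-1}\mid Q_i] = \sum_{k=1}^{j-1} \alpha^{j-1-k}\left(\beta_0 + \beta_z m_z + \beta_x(\beta_x^2/\sigma_y^2 + 1/\sigma_u^2)^{-1} Q_{ik}\right) + \alpha^{j-1} m_y$. Therefore, the moments of $E[Y_i\mid Q_i]$, $E[X_i\mid Q_i]$, and $E[\tilde\varepsilon_{ij}\mid Q_i]$ can be further calculated from the fact that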

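$$
Q_i \sim \text{Multinormal}\left((\beta_x^2/\sigma_y^2 + 1/\sigma_u^2)\,\mu_x,\; (\beta_x^2/\sigma_y^2 + 1/\sigma_u^2)\, I_{m\times m} + (\beta_x^2/\sigma_y^2 + 1/\sigma_u^2)^2\, \Sigma_x\right).
$$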
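The quantities $\mu_x$ and $\Sigma_x$ used above follow from the first-order transition model in (M.2) by a simple recursion. The sketch below is one way to compute them, assuming the innovations $\varepsilon_{xij}$ are independent of past values of $X$ and that the initial value $X_{i1}$ has the mean and variance given in (M.4). The function name `x_moments` and the parameter values are hypothetical.

```python
def x_moments(gamma0, gamma_x, sigma_x2, m_x, v_x, m):
    """Mean vector and covariance matrix of (X_i2, ..., X_im) under
    X_ij = gamma0 + gamma_x * X_i,j-1 + eps_xij, with Var(eps_xij) = sigma_x2,
    E[X_i1] = m_x and Var(X_i1) = v_x.
    """
    means = [m_x]      # means[t] = E[X_{i,t+1}]
    variances = [v_x]  # variances[t] = Var(X_{i,t+1})
    for _ in range(2, m + 1):
        means.append(gamma0 + gamma_x * means[-1])
        variances.append(gamma_x ** 2 * variances[-1] + sigma_x2)

    idx = range(1, m)  # zero-based positions of time points 2, ..., m
    mu_x = [means[t] for t in idx]
    # For s <= t, Cov(X_s, X_t) = gamma_x^(t - s) * Var(X_s).
    sigma_x = [
        [gamma_x ** abs(s - t) * variances[min(s, t)] for t in idx]
        for s in idx
    ]
    return mu_x, sigma_x


# Hypothetical parameter values, chosen so the process starts in its
# stationary distribution (mean 2, variance 4/3).
mu_x, sigma_x = x_moments(gamma0=1.0, gamma_x=0.5, sigma_x2=1.0,
                          m_x=2.0, v_x=4.0 / 3.0, m=4)
```

With these moments in hand, the mean and covariance of $Q_i$ in the display above are obtained by scaling with $(\beta_x^2/\sigma_y^2 + 1/\sigma_u^2)$.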

REFERENCES

  1. Berk ML, Maffeo C, Schur CL. AIDS Cost and Services Utilization Survey Report No. 1. Rockville, MD: Agency for Health Care Policy and Research; 1993. Research Design and Analysis Objectives.
  2. Bickel PJ, Klaassen CAI, Ritov Y, Wellner JA. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press; 1993.
  3. Biorn E, Klette TJ. Panel data with errors-in-variables: essential and redundant orthogonality conditions in GMM-estimation. Economics Letters. 1998;59:275–282.
  4. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu C. Measurement Error in Nonlinear Models. Second edition. London: Chapman and Hall; 2006.
  5. Carroll RJ, Wand MP. Semiparametric estimation in logistic measurement error models. Journal of the Royal Statistical Society B. 1991;53:573–585.
  6. Cook JR, Stefanski LA. Simulation-extrapolation estimation in parametric measurement error models. Journal of the American Statistical Association. 1994;89:1314–1328.
  7. Diggle PJ, Liang K, Heagerty P, Zeger SL. Analysis of Longitudinal Data. Oxford Statistical Science; 2002.
  8. Duncan TE, Duncan SC, Strycker LA. An Introduction to Latent Variable Growth Curve Modelling. Routledge; 2006.
  9. Fuller WA. Measurement Error Models. New York: John Wiley & Sons; 1987.
  10. Griliches Z, Hausman JA. Errors in variables in panel data. Journal of Econometrics. 1986;31:93–118.
  11. Have TR, Morabia A. An assessment of non-randomized medical treatment of long-term schizophrenia relapse using bivariate binary response transition models. Biostatistics. 2002;3:119–131. doi: 10.1093/biostatistics/3.1.119.
  12. Heagerty PJ. Marginalized transition models and likelihood inference for longitudinal categorical data. Biometrics. 2002;58:342–351. doi: 10.1111/j.0006-341x.2002.00342.x.
  13. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
  14. Li Y, Lin X. Covariate measurement errors in frailty models for clustered survival data. Biometrika. 2000;87:849–866.
  15. McCullagh P, Nelder JA. Generalized Linear Models. 2nd edition. London: Chapman and Hall; 1989.
  16. Pan W. Transition Measurement Error Models for Longitudinal Data. Ph.D. Dissertation, University of Michigan; 2002.
  17. Pan W, Lin X, Zeng D. Structural inference in transition measurement error models for longitudinal data. Biometrics. 2006;62:402–412. doi: 10.1111/j.1541-0420.2005.00446.x.
  18. Roy J, Lin X. Missing covariates in longitudinal data with informative dropouts: Bias analysis and inference. Biometrics. 2005;61:837–846. doi: 10.1111/j.1541-0420.2005.00340.x.
  19. Schmid CH. An EM algorithm fitting first-order conditional autoregressive models to longitudinal data. Journal of the American Statistical Association. 1996;91:1322–1330.
  20. Schmid CH, Segal MR, Rosner B. Incorporating measurement error in the estimation of autoregressive models for longitudinal data. Journal of Statistical Planning and Inference. 1994;42:1–18.
  21. Spiegelman D, Rosner B, Logan R. Estimation and inference for logistic regression with covariates misclassification and measurement error in main study/validation study design. Journal of the American Statistical Association. 2000;95:51–61.
  22. Stefanski LA, Carroll RJ. Conditional scores and optimal scores for generalized linear measurement-error models. Biometrika. 1987;74:703–716.
  23. Tsiatis AA, De Gruttola V, Wulfsohn MS. Modeling the relationship of survival to longitudinal data measured with error: applications to survival and CD4 counts in patients with AIDS. Journal of the American Statistical Association. 1995;90:27–37.
  24. Wang N, Lin X, Gutierrez RG, Carroll RJ. Bias analysis and SIMEX approach in generalized linear mixed measurement error models. Journal of the American Statistical Association. 1998;93:249–261.
  25. Wulfsohn MS, Tsiatis AA. A joint model for survival and longitudinal data measured with error. Biometrics. 1997;53:330–339.
  26. Young PJ, Weeden S, Kirwan JR. The analysis of a bivariate multi-state Markov transition model for rheumatoid arthritis with an incomplete disease history. Statistics in Medicine. 1999;18:1677–1690. doi: 10.1002/(sici)1097-0258(19990715)18:13<1677::aid-sim154>3.0.co;2-n.
