Biostatistics (Oxford, England). 2008 Jan 16;9(3):501–512. doi: 10.1093/biostatistics/kxm054

A simulation-based marginal method for longitudinal data with dropout and mismeasured covariates

Grace Y Yi 1
PMCID: PMC3294321  PMID: 18199691

Abstract

Longitudinal data often contain missing observations and error-prone covariates. Extensive attention has been directed to analysis methods to adjust for the bias induced by missing observations. There is relatively little work on investigating the effects of covariate measurement error on estimation of the response parameters, especially on simultaneously accounting for the biases induced by both missing values and mismeasured covariates. It is not clear what the impact of ignoring measurement error is when analyzing longitudinal data with both missing observations and error-prone covariates. In this article, we study the effects of covariate measurement error on estimation of the response parameters for longitudinal studies. We develop an inference method that adjusts for the biases induced by measurement error as well as by missingness. The proposed method does not require the full specification of the distribution of the response vector but only requires modeling its mean and variance structures. Furthermore, the proposed method employs the so-called functional modeling strategy to handle the covariate process, with the distribution of covariates left unspecified. These features, plus the simplicity of implementation, make the proposed method very attractive. In this paper, we establish the asymptotic properties for the resulting estimators. With the proposed method, we conduct sensitivity analyses on a cohort data set arising from the Framingham Heart Study. Simulation studies are carried out to evaluate the impact of ignoring covariate measurement error and to assess the performance of the proposed method.

Keywords: Estimating equations, Longitudinal data, Measurement error, Missing data, Simulation and extrapolation method

1. INTRODUCTION

Longitudinal studies are commonly conducted in the health sciences, biochemistry, and epidemiology. Although longitudinal studies are designed to collect data on every individual at each assessment, missing observations often arise for various reasons. There has been increasing interest in valid inference methods for longitudinal data with missing values. Yet, there is relatively little work on investigating the effects of covariate measurement error on estimation of the response parameters, especially on simultaneously accounting for the biases induced by both missing values and mismeasured covariates. Measurement error in covariates is, however, a typical feature of longitudinal data. Sometimes, covariates of interest may be difficult to observe precisely due to physical location or cost. Sometimes, it is impossible to measure covariates accurately due to the nature of the covariates. In other situations, a covariate may represent an average of a certain quantity over time (e.g. cholesterol level (CHOL)) and any practical way of measuring such a quantity necessarily features measurement error.

It has been recognized and well documented that, in other contexts, ignoring covariate measurement error may lead to severely biased results. For example, Fuller (1987) pointed out that the slope in a simple linear regression model may be attenuated if covariate measurement error is ignored. For survival data analysis, Prentice (1982), Li and Lin (2003), Yi and He (2006), and Yi and Lawless (2007), among others, investigated measurement error effects and developed inference methods to correct for the bias resulting from measurement error in covariates. For an overview of measurement error problems, see Carroll and others (2006).

In this paper, we investigate the impact of covariate measurement error on longitudinal data analysis. This work is motivated by the need for methods that simultaneously address missingness and measurement error, both of which often arise in longitudinal data. For example, a data set arising from the Framingham Heart Study contains error-prone covariates and a portion of subjects who drop out of the study during the follow-up period. An objective of this study is to understand how obesity is associated with covariates such as age, blood pressure, and CHOL. It is well known that individual measurements of blood pressure and CHOL involve substantial measurement error. Here, the true measurements of these covariates are defined as their long-term average values. The measurement at a specific time point would fluctuate with time, seasonal variation, and other confounding factors. The features of measurement error in covariates and dropout present a challenge to the existing inference methods. In this paper, we develop an inference method for analyzing longitudinal data that have both dropout and error-contaminated covariates. We utilize marginal methods to model the response process. A functional method for the measurement error process is employed. Such a method is appealing because it does not require the specification of the covariate distribution.

The remainder is organized as follows. Notation and model setup are introduced in Section 2. In Section 3, we discuss a simulation–extrapolation (SIMEX) method to account for both dropout and covariate measurement error. A data set arising from the Framingham Heart Study is analyzed with the proposed method and the results are reported in Section 4. In Section 5, we conduct simulation studies to assess the performance of the proposed method as well as the impact of ignoring measurement error in covariates. General discussion is included in Section 6.

2. NOTATION AND MODEL SETUP

2.1. Response process

Longitudinal data analysis may typically be conducted based on marginal, random-effects, and transitional models (Diggle and others, 2002). In this paper, we focus on marginal analysis with the primary interest centered on the marginal mean parameters. Let Yij be the response variable for subject i at time point j, xij be the covariate vector subject to error, and zij be the vector of error-free covariates, i=1,2,…,n and j=1,2,…,m. Denote Yi=(Yi1,Yi2,…,Yim)′, xi=(xi1′,xi2′,…,xim′)′, and zi=(zi1′,zi2′,…,zim′)′. Let μij=E(Yij|xi,zi) and vij=var(Yij|xi,zi) be the conditional expectation and variance of Yij, respectively, given the covariates xi and zi.

We model the influence of the covariates on the marginal response mean by means of a regression model

$$g(\mu_{ij}) = x_{ij}'\beta_x + z_{ij}'\beta_z, \tag{2.1}$$

where β=(βx′,βz′)′ is the vector of regression parameters of dimension p, say, and g(·) is a known monotone function. If necessary, the intercept may be included in βz by adding a unit vector to covariates zi. Furthermore, assume vij=h(μij;φ), where h(·;·) is a known function and φ is the dispersion parameter that is known or may be estimated. We treat φ as known here with emphasis on estimation of β.

Here, we assume that the dependence of mean μij on the subject-level covariates xi and zi is completely reflected by the time-specific covariates xij and zij, that is, E(Yij|xi,zi)=E(Yij|xij,zij). This assumption has been widely adopted in modeling longitudinal data, see Diggle and Kenward (1994), Robins and others (1995), Cook and others (2004), and Yi and Thompson (2005), for example. This assumption was noted in Pepe and Anderson (1994) and was justified from the viewpoint of formulating unbiased estimating functions. Model (2.1) may consist of baseline covariates such as gender, age, and treatment status or time-varying covariates. With an exogenous covariate process (i.e. a time-varying covariate that is not predicted by past outcomes), properly including current or lagged values of the covariates may meet this assumption (e.g. Miglioretti and Heagerty, 2004). Both cross-sectional and longitudinal effects of time-varying covariates may be featured in model (2.1). See Diggle and others (2002, Chapter 12) for more detailed discussion.

2.2. Missing data process

Let Rij be 1 if Yij is observed and 0 otherwise. Let Ri=(Ri1,Ri2,…,Rim)′ be the vector of (non)missing data indicators, i=1,2,…,n. Dropouts or monotone missing data patterns are considered here. That is, Rij=0 implies Rij′=0 for all j′>j. Without loss of generality, assume that Ri1=1 for every subject i. According to the dependence of the missing data process on the response process, missing data mechanisms may be classified as missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR) (e.g. Kenward, 1998).

In this paper, we assume an MAR mechanism for the dropout process. That is, given the covariates, the conditional distribution f(ri|xi,zi,yi) depends on the observed response components $y_i^{\mathrm{obs}}$ and the covariates only. Let λij=P(Rij=1|Ri,j−1=1,xi,zi,yi) and πij=P(Rij=1|xi,zi,yi). Note that $\pi_{ij}=\prod_{t=2}^{j}\lambda_{it}$. Let $H_{ij}^{y}=\{y_{i1},\ldots,y_{i,j-1}\}$ denote the response history up to (but not including) time point j.

Logistic regression models are commonly used to model the dropout process (e.g. Diggle and Kenward, 1994; Robins and others, 1995), namely,

$$\operatorname{logit}\lambda_{ij} = u_{ij}'\alpha, \tag{2.2}$$

where uij is the vector consisting of the information of the covariates xi and zi and the observed responses Hijy, and α is the vector of regression parameters.

Let Mi be the random dropout time for subject i and mi be a realization, i=1,2,…,n. Define $L_i(\alpha)=(1-\lambda_{im_i})\prod_{t=2}^{m_i-1}\lambda_{it}$, where λit is determined by model (2.2). Let $S_i(\alpha)=\partial\log L_i(\alpha)/\partial\alpha$ be the vector of score functions contributed from subject i. Denote θ=(α′,β′)′ and q=dim(θ).
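As an illustration of the quantities defined in this subsection, the following minimal Python sketch computes the conditional probabilities λij from a logistic dropout model, the cumulative probabilities πij as running products, and the log of the dropout likelihood contribution Li(α). The function name `dropout_quantities`, the array layout, and the convention that a completer is coded with dropout time m+1 (so that no dropout factor enters) are assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np

def expit(t):
    """Inverse logit function."""
    return 1.0 / (1.0 + np.exp(-t))

def dropout_quantities(u, alpha, m_i):
    """Illustrative helper (hypothetical name and layout).

    u     : (m, k) array; row j-1 holds u_ij for visit j (row 0 is unused
            because R_i1 = 1 is assumed for every subject).
    alpha : (k,) vector of dropout-model coefficients.
    m_i   : dropout time, i.e. the first visit with a missing response;
            m + 1 is used here (an assumption) to code a completer.
    Returns (lambda_i, pi_i, log L_i).
    """
    m = u.shape[0]
    lam = np.ones(m)
    lam[1:] = expit(u[1:] @ alpha)       # lambda_ij for j = 2, ..., m
    pi = np.cumprod(lam)                 # pi_ij = prod_{t=2}^{j} lambda_it
    # Subject stayed at visits 2, ..., m_i - 1 and dropped out at visit m_i.
    stay = np.sum(np.log(lam[1:m_i - 1]))
    drop = np.log(1.0 - lam[m_i - 1]) if m_i <= m else 0.0
    return lam, pi, stay + drop
```

The inverse of πij is the weight that enters the estimating functions of Section 3.1 for an observed response.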

2.3. Measurement error process

Let Wij be an observed measurement of the covariate xij, i=1,2,…,n,j=1,2,…,m. xij and Wij are assumed to follow a classical additive measurement error model. That is, conditional on xi and zi,

$$W_{ij} = x_{ij} + e_{ij}, \tag{2.3}$$

where the error terms eij are assumed to follow N(0,Σe) with Σe being the covariance matrix (e.g. Wang and others, 1998).

It is known that nonidentifiability is often a problem if model (2.3) is employed. For identifiability of model parameters, one needs a validation data set consisting of {Yij,xij,Wij,zij} or repeated measurements Wij to estimate the parameters associated with Σe. If neither validation data nor repeated measurements Wij are available, then one may conduct sensitivity analyses based on background information about the measurement process to assess the impact of different degrees of measurement error on estimation of β (Yi and Lawless, 2007). In this paper, error distribution parameters are assumed known.

3. INFERENCE PROCEDURES

3.1. Weighted estimation functions

The inverse probability weighted generalized estimating equation (IPWGEE) method is often employed to account for the bias induced by the incompleteness of data (Robins and others, 1995) when primary interest lies in the marginal mean parameters β in model (2.1). For i=1,2,…,n, let $D_i=\partial\mu_i'/\partial\beta$ be the matrix of the derivatives of the mean vector μi with respect to β and $\Delta_i=\mathrm{diag}(I(R_{ij}=1)/\pi_{ij},\ j=1,2,\ldots,m)$ be the weight matrix accommodating missingness, where I(·) is the indicator function. Let $V_i=A_i^{1/2}C_iA_i^{1/2}$ be the covariance matrix of Yi, where $A_i=\mathrm{diag}(v_{ij},\ j=1,2,\ldots,m)$ and $C_i=[\rho_{i;jk}]$ is the correlation matrix with $\rho_{i;jk}$ being the correlation coefficient of response components Yij and Yik, for j≠k, and $\rho_{i;jj}=1$. For i=1,2,…,n, define $U_i(\theta)=D_iV_i^{-1}\Delta_i(Y_i-\mu_i)$ and $H_i(\theta)=(U_i'(\theta),S_i'(\alpha))'$.

In the absence of measurement error, that is, when covariates xij are precisely observed, E[Hi(θ)]=0; hence, $H(\theta)=\sum_{i=1}^{n}H_i(\theta)$ are unbiased estimation functions for θ. A consistent estimator $\hat\theta$ of θ can be obtained by solving

$$H(\theta) = 0, \tag{3.1}$$

where moment estimates may be used for the correlation matrix Ci or, alternatively, a working independence matrix Ai may be used to replace Vi.
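To make the weighted estimating equation concrete, the sketch below solves the β-part of (3.1) for a binary response with a logit link and a working independence matrix. Under these assumptions (with φ=1), $D_iV_i^{-1}$ reduces to the transposed design matrix, so the estimating function becomes a weighted logistic score. The function name `ipw_gee_logistic` and the stacked data layout are hypothetical, and the weights $I(R_{ij}=1)/\pi_{ij}$ are taken as given here, whereas in the method above they are estimated jointly through model (2.2).

```python
import numpy as np

def ipw_gee_logistic(y, X, w, n_iter=25, tol=1e-10):
    """Solve sum_ij w_ij * X_ij * (y_ij - mu_ij) = 0 by Newton's method.

    y : (N,) stacked binary responses; entries with w == 0 must still be
        finite (e.g. set to 0), since they are multiplied by zero.
    X : (N, p) stacked design rows (x_ij', z_ij').
    w : (N,) weights I(R_ij = 1) / pi_ij.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        score = X.T @ (w * (y - mu))                       # weighted estimating function
        info = (X * (w * mu * (1.0 - mu))[:, None]).T @ X  # minus its derivative
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

With all weights equal to one, the routine reduces to ordinary logistic regression, which provides a simple check of an implementation.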

3.2. SIMEX approach

When measurement error is present in covariates xij, H(θ) is no longer unbiased if replacing xij with its observed measurement Wij. A proper adjustment is needed to account for the bias induced by using Wij. In the sequel, we describe the SIMEX method for the adjustment. Let B be a given positive integer and Λ={λ1,λ2,…,λM} be a sequence of nonnegative numbers taken from [0,λM] with λ1=0.

1. Simulation step

For i=1,2,…,n and j=1,2,…,m, generate $e_{ijb}\sim N(0,\Sigma_e)$ for b=1,2,…,B. Given λ∈Λ, set $W_{ij}(b,\lambda)=W_{ij}+\sqrt{\lambda}\,e_{ijb}$.

2. Estimation step

For given λ and b, we obtain an estimate $\hat\theta(b,\lambda)$ by solving (3.1) with xij replaced by Wij(b,λ). This step can be quickly implemented by applying the SAS GENMOD procedure to the data set {Yi,Wi(b,λ),zi:i=1,2,…,n}. The model-based covariance matrix for $\hat\theta(b,\lambda)$ is given by

3.1.

where Inline graphic and Inline graphic.

Denote by $\hat\tau_r(b,\lambda)$ the rth diagonal element of the model-based covariance matrix of $\hat\theta(b,\lambda)$ and by $\hat\theta_r(b,\lambda)$ the rth component of $\hat\theta(b,\lambda)$, r=1,2,…,q. Define $\hat\theta_r(\lambda)=B^{-1}\sum_{b=1}^{B}\hat\theta_r(b,\lambda)$, $\hat\tau_r(\lambda)=B^{-1}\sum_{b=1}^{B}\hat\tau_r(b,\lambda)$, $\hat s_r(\lambda)=(B-1)^{-1}\sum_{b=1}^{B}\{\hat\theta_r(b,\lambda)-\hat\theta_r(\lambda)\}^2$, and $\tilde\tau_r(\lambda)=\hat\tau_r(\lambda)-\hat s_r(\lambda)$.

3. Extrapolation step

For r=1,2,…,q, fit a regression model to each of the sequences $\{(\lambda,\hat\theta_r(\lambda)):\lambda\in\Lambda\}$ and $\{(\lambda,\tilde\tau_r(\lambda)):\lambda\in\Lambda\}$, respectively, and extrapolate it to λ=−1 with $\hat\theta_{r,\mathrm{SIMEX}}$ and $\hat\tau_{r,\mathrm{SIMEX}}$ denoting the corresponding predicted values. Then, $\hat\theta_{\mathrm{SIMEX}}=(\hat\theta_{1,\mathrm{SIMEX}},\hat\theta_{2,\mathrm{SIMEX}},\ldots,\hat\theta_{q,\mathrm{SIMEX}})'$ is the SIMEX estimator of θ and $\sqrt{\hat\tau_{r,\mathrm{SIMEX}}}$ is the associated standard error for the estimator $\hat\theta_{r,\mathrm{SIMEX}}$ (r=1,2,…,q).
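The three steps above can be sketched generically as follows. This is a minimal illustration for point estimates only, assuming a user-supplied function `estimate` that maps a matrix of error-contaminated covariates to a parameter vector (for the proposed method, this would be a solver of (3.1)). The function name `simex`, its arguments, and the use of a polynomial extrapolant fitted with `numpy.polyfit` are choices made for this sketch; the extrapolation of the variance components for standard errors is omitted.

```python
import numpy as np

def simex(estimate, W, Sigma_e, lambdas, B=50, degree=2, seed=0):
    """Illustrative SIMEX point estimation (hypothetical helper).

    estimate : callable taking an (N, k) covariate array and returning a
               scalar or 1-d parameter estimate.
    W        : (N, k) observed error-prone covariates.
    Sigma_e  : (k, k) measurement error covariance, assumed known.
    lambdas  : grid of nonnegative values that includes 0.
    """
    rng = np.random.default_rng(seed)
    W = np.asarray(W, dtype=float)
    N, k = W.shape
    lambdas = np.asarray(lambdas, dtype=float)
    path = []
    for lam in lambdas:
        if lam == 0.0:
            reps = [np.atleast_1d(estimate(W))]          # naive estimate
        else:
            reps = []
            for _ in range(B):
                # Simulation step: add extra error with covariance lam * Sigma_e.
                e = rng.multivariate_normal(np.zeros(k), Sigma_e, size=N)
                # Estimation step on the further-contaminated covariates.
                reps.append(np.atleast_1d(estimate(W + np.sqrt(lam) * e)))
        path.append(np.mean(reps, axis=0))               # average over b
    path = np.asarray(path)                              # shape (M, q)
    # Extrapolation step: fit one polynomial per component, evaluate at -1.
    coefs = np.polyfit(lambdas, path, deg=degree)        # shape (degree + 1, q)
    theta_simex = np.array(
        [np.polyval(coefs[:, r], -1.0) for r in range(path.shape[1])]
    )
    return theta_simex, path
```

The returned `path` holds the averaged estimates at each grid value, which can be plotted against λ to inspect how well the chosen extrapolant fits before trusting the value at λ=−1.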

The SIMEX approach is a simulation-based method that was proposed by Cook and Stefanski (1994) for parametric measurement error models. Its idea can be illustrated with simple linear regression. Suppose that the regression model is given by $Y=\beta_0+\beta_x x+\epsilon$, where ϵ has mean 0. If x is replaced with its observed measurement W, modeled by W=x+e with e having mean 0 and variance $\sigma^2$, then the resulting least squares estimator $\hat\beta_x$ for βx converges in probability to $\beta_x^*=\{\sigma_x^2/(\sigma_x^2+\sigma^2)\}\beta_x$ (Fuller, 1987), where $\sigma_x^2$ is the variance of x. If x is instead replaced with $W+\sqrt{\lambda}\,\sigma e_b$, where $e_b$ is generated from N(0,1), then the resulting estimator $\hat\beta_x(b,\lambda)$ converges in probability to $\beta_x^*(b,\lambda)=[\sigma_x^2/\{\sigma_x^2+(1+\lambda)\sigma^2\}]\beta_x$. If λ=0, $\hat\beta_x(b,0)$ is just the naive estimator $\hat\beta_x$. If λ=−1, the limit $\beta_x^*(b,-1)$ is identical to the true parameter βx.
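As a worked illustration of this limit, take the assumed values $\sigma_x^2=1$ and $\sigma^2=0.25$ (chosen only for illustration; they do not come from the paper). The limit of the estimator as a function of λ is then

$$\beta_x^*(\lambda)=\frac{\sigma_x^2}{\sigma_x^2+(1+\lambda)\sigma^2}\,\beta_x=\frac{4}{5+\lambda}\,\beta_x,$$

so that

$$\beta_x^*(0)=0.8\,\beta_x,\qquad \beta_x^*(1)=\tfrac{2}{3}\,\beta_x,\qquad \beta_x^*(2)=\tfrac{4}{7}\,\beta_x,\qquad \beta_x^*(-1)=\beta_x.$$

The exact extrapolation function here is rational in λ. If a quadratic is instead passed through the three points at λ=0, 1, 2 and evaluated at λ=−1, the result is

$$\Big(\tfrac{4}{5}+\tfrac{2}{15}+\tfrac{4}{105}\Big)\beta_x=\tfrac{102}{105}\,\beta_x\approx 0.971\,\beta_x,$$

which removes most, but not all, of the attenuation seen in the naive limit $0.8\,\beta_x$. This suggests why an approximate extrapolant may leave some residual bias when measurement error is large.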

For univariate parametric models, Carroll and others (1996) established the asymptotic normality for the SIMEX estimator. However, their results cannot directly apply here because the current development involves multiple response outcomes along with an additional process concerning the missing data indicators. If the exact extrapolation function is used in Step 3 above, we may establish the following asymptotic distribution for the SIMEX estimator $\hat\theta_{\mathrm{SIMEX}}$. The proof is outlined in the Appendix.

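THEOREM:

Under regularity conditions,

$$\sqrt{n}\,(\hat\theta_{\mathrm{SIMEX}}-\theta)\to_d N\big(0,\ G_\gamma(\gamma;-1)\,Q(\gamma)\,G_\gamma'(\gamma;-1)\big),$$

where Gγ(γ; −1) and Q(γ) are defined in the Appendix. Hence, $\sqrt{n}\,(\hat\beta_{\mathrm{SIMEX}}-\beta)$ has an asymptotic normal distribution with mean 0 and covariance matrix being the upper p × p matrix of $G_\gamma(\gamma;-1)\,Q(\gamma)\,G_\gamma'(\gamma;-1)$.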

4. AN EXAMPLE

As an illustration, we apply the proposed method to analyze the cohort 2 subset of the GAW13 (Genetic Analysis Workshops) data arising from the Framingham Heart Study. The data set consists of the measurements for 1672 patients from a series of exams with 5 assessments designed for each individual. Measurements such as height, weight, age, systolic blood pressure (SBP), and CHOL are collected at each assessment. About 24% of patients dropped out of the study.

It is of interest to study how an individual's obesity changes with age and how it is associated with SBP and CHOL. Practically, it is convenient and cost-effective to use body mass index (BMI), which is defined as weight (kg)/height² (m²), to estimate adiposity that correlates well with more direct and invasive measures of percentage body fat (Strug and others, 2003). Here, following Yoo and others (2003), we let Y be the binary response variable indicating obesity status of a subject, which takes value 1 if his/her maximum BMI (Max BMI) over all ages is no less than the 90th percentile of the Max BMI values observed in each replicate being analyzed and 0 otherwise. The responses and the covariates are postulated by the logistic regression model

$$\operatorname{logit}\mu_{ij}=\beta_0+\beta_{x1}x_{ij1}+\beta_{x2}x_{ij2}+\beta_z z_{ij},$$

where xij1 represents SBP, rescaled as log(SBP − 50) as in Carroll and others (2006), xij2 is the standardized CHOL, and zij is AGE for subject i at time point j, respectively.
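The construction of the obesity indicator described above can be sketched as follows. The function name `obesity_indicator` and the long-format input arrays are hypothetical, and the percentile is computed with the default linear interpolation of `numpy.percentile`, which may differ from the convention used in the original analysis.

```python
import numpy as np

def obesity_indicator(weight_kg, height_m, subject_id, pct=90):
    """Return subject ids and a 0/1 indicator of Max BMI >= pct-th percentile.

    Inputs are long-format arrays with one entry per subject-visit record.
    """
    weight_kg = np.asarray(weight_kg, dtype=float)
    height_m = np.asarray(height_m, dtype=float)
    bmi = weight_kg / height_m ** 2                  # BMI = kg / m^2
    ids, inverse = np.unique(subject_id, return_inverse=True)
    max_bmi = np.full(ids.shape[0], -np.inf)
    np.maximum.at(max_bmi, inverse, bmi)             # per-subject maximum BMI
    cutoff = np.percentile(max_bmi, pct)             # percentile of the Max BMI values
    return ids, (max_bmi >= cutoff).astype(int)
```

The covariate transformations mentioned in the text, log(SBP − 50) and standardization of CHOL, would be applied separately before fitting the response model.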

It is well known that both SBP and CHOL are subject to substantial measurement error. We are concerned with how measurement error in SBP and CHOL impacts estimation of parameter β = (β0, βx1, βx2, βz)′, and hence, we conduct sensitivity analyses here. Let Wij = (Wij1, Wij2)′ and xij = (xij1, xij2)′. Assume that the error model is given by (2.3) with $\Sigma_e=\begin{pmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2\\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix}$. σ1 and σ2 are specified as 0, 0.5, and 1.0 to feature scenarios with different degrees of measurement error in SBP and CHOL. Distinct values for ρ are considered to represent different strengths of correlation. The missing data process is characterized by the logistic regression model

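[Display equation (4.1), not recovered from the source: a logistic regression model for λij with intercept α0 and coefficients α1 (previous obesity status), α2 (SBP), α3 (CHOL), and α4 (AGE).] (4.1)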

Three analyses are conducted here. Analysis 1 ignores measurement error in SBP and CHOL with Xi naively replaced by Wi when using (3.1), Analysis 2 accounts for measurement error in the response model but not in the missing data model, while Analysis 3 addresses measurement error in both the response and the missing data models. In implementing the SIMEX method, we choose B = 200, M = 9, and a quadratic regression for each extrapolation step.

The analyses show that only α4 in model (4.1) is statistically significant under various situations considered for error model (2.3). Other coefficients such as α1, α2, and α3 are all not statistically significant. The results suggest that the dropout rate increases as the subjects become older. Dropout probability does not depend on the previous obesity status, SBP, or CHOL.

We conduct the analyses for ρ = 0 and ρ = 0.5. Table 1 reports the results for the case with ρ = 0. It is not surprising that the 3 analyses give rise to very similar results when there is no measurement error present in SBP and CHOL. When measurement error does exist, it can be seen that the estimates and associated standard errors may be considerably impacted by different degrees of measurement error in SBP or CHOL. If there is no error in SBP (i.e. σ1 = 0), both CHOL and AGE are not statistically significant, whereas SBP has a significant positive effect no matter what degree of measurement error is involved in CHOL.

Table 1.

Sensitivity analyses of the data from the Framingham Heart Study

| σ1 | σ2 | Analysis | βx1 Estimate | βx1 SE | βx1 p-value | βx2 Estimate | βx2 SE | βx2 p-value | βz Estimate | βz SE | βz p-value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.00 | 0.00 | 1 | 2.9465 | 0.3103 | <0.0001 | 0.0904 | 0.0852 | 0.2886 | −0.0067 | 0.0057 | 0.2427 |
| 0.00 | 0.00 | 2 | 2.9465 | 0.3119 | <0.0001 | 0.0904 | 0.0854 | 0.2897 | −0.0067 | 0.0057 | 0.2450 |
| 0.00 | 0.00 | 3 | 2.9465 | 0.3103 | <0.0001 | 0.0904 | 0.0852 | 0.2886 | −0.0067 | 0.0057 | 0.2427 |
| 0.00 | 0.50 | 1 | 2.9827 | 0.3085 | <0.0001 | 0.0419 | 0.0721 | 0.5614 | −0.0060 | 0.0057 | 0.2937 |
| 0.00 | 0.50 | 2 | 2.9736 | 0.3119 | <0.0001 | 0.0541 | 0.0871 | 0.5341 | −0.0061 | 0.0057 | 0.2820 |
| 0.00 | 0.50 | 3 | 2.9737 | 0.3102 | <0.0001 | 0.0541 | 0.0868 | 0.5334 | −0.0061 | 0.0057 | 0.2792 |
| 0.00 | 1.00 | 1 | 3.0069 | 0.3068 | <0.0001 | 0.0072 | 0.0503 | 0.8859 | −0.0055 | 0.0057 | 0.3372 |
| 0.00 | 1.00 | 2 | 3.0016 | 0.3100 | <0.0001 | 0.0140 | 0.0706 | 0.8434 | −0.0056 | 0.0057 | 0.3285 |
| 0.00 | 1.00 | 3 | 3.0017 | 0.3083 | <0.0001 | 0.0140 | 0.0704 | 0.8426 | −0.0056 | 0.0057 | 0.3262 |
| 0.50 | 0.00 | 1 | 0.2828 | 0.0897 | 0.0016 | 0.1751 | 0.0797 | 0.0280 | 0.0121 | 0.0053 | 0.0232 |
| 0.50 | 0.00 | 2 | 0.5050 | 0.1346 | 0.0002 | 0.1654 | 0.0802 | 0.0391 | 0.0106 | 0.0053 | 0.0455 |
| 0.50 | 0.00 | 3 | 0.5051 | 0.1343 | 0.0002 | 0.1654 | 0.0799 | 0.0385 | 0.0106 | 0.0052 | 0.0441 |
| 0.50 | 0.50 | 1 | 0.2316 | 0.0968 | 0.0167 | 0.0797 | 0.0728 | 0.2737 | 0.0144 | 0.0053 | 0.0063 |
| 0.50 | 0.50 | 2 | 0.5182 | 0.1335 | 0.0001 | 0.1277 | 0.0820 | 0.1194 | 0.0112 | 0.0053 | 0.0337 |
| 0.50 | 0.50 | 3 | 0.5183 | 0.1332 | <0.0001 | 0.1276 | 0.0817 | 0.1181 | 0.0112 | 0.0052 | 0.0324 |
| 0.50 | 1.00 | 1 | 0.2599 | 0.1018 | 0.0107 | 0.0088 | 0.0538 | 0.8701 | 0.0157 | 0.0053 | 0.0030 |
| 0.50 | 1.00 | 2 | 0.5331 | 0.1333 | <0.0001 | 0.0703 | 0.0676 | 0.2988 | 0.0123 | 0.0052 | 0.0188 |
| 0.50 | 1.00 | 3 | 0.5331 | 0.1330 | <0.0001 | 0.0703 | 0.0674 | 0.2966 | 0.0123 | 0.0052 | 0.0180 |
| 1.00 | 0.00 | 1 | 0.0412 | 0.0454 | 0.3648 | 0.1852 | 0.0794 | 0.0196 | 0.0137 | 0.0054 | 0.0107 |
| 1.00 | 0.00 | 2 | 0.0801 | 0.0693 | 0.2477 | 0.1830 | 0.0797 | 0.0216 | 0.0135 | 0.0054 | 0.0121 |
| 1.00 | 0.00 | 3 | 0.0802 | 0.0692 | 0.2464 | 0.1831 | 0.0793 | 0.0210 | 0.0135 | 0.0053 | 0.0116 |
| 1.00 | 0.50 | 1 | 0.0073 | 0.0488 | 0.8809 | 0.1074 | 0.0719 | 0.1351 | 0.0156 | 0.0053 | 0.0035 |
| 1.00 | 0.50 | 2 | 0.0858 | 0.0688 | 0.2128 | 0.1433 | 0.0817 | 0.0794 | 0.0142 | 0.0054 | 0.0080 |
| 1.00 | 0.50 | 3 | 0.0858 | 0.0686 | 0.2114 | 0.1432 | 0.0813 | 0.0780 | 0.0142 | 0.0053 | 0.0076 |
| 1.00 | 1.00 | 1 | 0.0112 | 0.0517 | 0.8285 | 0.0414 | 0.0535 | 0.4388 | 0.0169 | 0.0053 | 0.0015 |
| 1.00 | 1.00 | 2 | 0.0917 | 0.0688 | 0.1828 | 0.0811 | 0.0675 | 0.2296 | 0.0154 | 0.0053 | 0.0036 |
| 1.00 | 1.00 | 3 | 0.0917 | 0.0686 | 0.1814 | 0.0811 | 0.0671 | 0.2271 | 0.0154 | 0.0053 | 0.0034 |

If there is moderate error in SBP (i.e. σ1 = 0.5), the 3 analyses still suggest that SBP has a significant positive effect on obesity. In contrast to the case with no error in SBP, AGE is found to be statistically significant by the 3 analyses and the evidence tends to become stronger as error in CHOL becomes more substantial. However, the significance of the CHOL effect depends on whether or not there is error in CHOL. If there is no error in CHOL, there is moderate evidence to support that CHOL has a positive effect on obesity; otherwise, CHOL is not statistically significant.

When measurement error in SBP becomes more severe (i.e. σ1 = 1.0), the effect of SBP is no longer significant in any of the 3 analyses. Again, AGE has a positive effect and the evidence tends to become stronger as error in CHOL increases. CHOL tends to be statistically significant if there is no error or moderate error in CHOL; if the error in CHOL becomes larger, there is no evidence to support the effect of CHOL.

To save space, we do not display the results for ρ = 0.5 but just comment on the findings here. It seems that moderate correlation ρ tends to decrease the estimates for the effects of both SBP and CHOL but to increase associated standard errors, hence leading to increasing p-values. However, the impact of correlation ρ on AGE effect is different. Moderate correlation ρ tends to increase the estimates of AGE effect while maintaining very stable standard errors, thus the resulting p-values become smaller.

5. SIMULATION STUDIES

In this section, we conduct simulation studies to investigate the impact of ignoring measurement error on estimation and to compare the performance of the 3 analyses discussed in Section 4. The same configurations as those in Section 4 are used when implementing the SIMEX method.

In the following simulation study, we set n = 200 and m = 3 and generate 200 simulations for each parameter configuration. Consider the logistic regression

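[Display equation not recovered from the source: a logistic regression model for μij in terms of xij1, xij2, and zij with coefficients βx1, βx2, and βz.]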

where zij takes values 0 or 1 with probability 1/2 representing that each subject is randomized to a control or treatment group. Independent of zij, xij = (xij1, xij2)′ is generated from N(μx, Σx), where μx = (μx1, μx2)′ and $\Sigma_x=\begin{pmatrix}\sigma_{x1}^2 & \rho_x\sigma_{x1}\sigma_{x2}\\ \rho_x\sigma_{x1}\sigma_{x2} & \sigma_{x2}^2\end{pmatrix}$ with μxr = 0.5 and σxr = 1.0 (r = 1, 2). Set βx1 = log(1.5), βx2 = log(1.5), and βz = log(0.75). The surrogate value Wij = (Wij1, Wij2)′ is generated from the normal distribution N(xij, Σe) with $\Sigma_e=\begin{pmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2\\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix}$. Various configurations are considered to feature distinct scenarios of measurement error in covariate xij. Specifically, we consider σ1, σ2 = 0.15, 0.50, and 0.75 to feature minor, moderate, and severe marginal measurement errors. ρx and ρ are specified as 0.5 to represent the cases with moderate correlations. The missing data indicator is generated from model (4.1), where we set α0 = α1 = 0.5, α2 = α3 = 0.1, and αz = 0.2.
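A data-generating step of this kind might be sketched as below. Several details are assumptions made only for illustration and are not stated in the text: the intercept `beta0` (defaulting to 0), a treatment indicator that is constant over time within a subject, responses drawn independently across time points, and the names `cov2` and `generate_data`. The dropout step based on model (4.1) is omitted.

```python
import numpy as np

def cov2(s1, s2, rho):
    """2 x 2 covariance matrix from two standard deviations and a correlation."""
    return np.array([[s1 ** 2, rho * s1 * s2],
                     [rho * s1 * s2, s2 ** 2]])

def generate_data(n=200, m=3, sigma1=0.5, sigma2=0.5, rho=0.5, rho_x=0.5,
                  beta0=0.0, seed=0):
    """Simulate true covariates, surrogates, and binary responses.

    beta0 is a hypothetical intercept; the paper's value is not recoverable.
    """
    rng = np.random.default_rng(seed)
    bx1, bx2, bz = np.log(1.5), np.log(1.5), np.log(0.75)
    # True covariates x_ij ~ N(mu_x, Sigma_x), drawn for every subject and visit.
    x = rng.multivariate_normal([0.5, 0.5], cov2(1.0, 1.0, rho_x), size=(n, m))
    # Classical additive error: W_ij = x_ij + e_ij with e_ij ~ N(0, Sigma_e).
    e = rng.multivariate_normal([0.0, 0.0], cov2(sigma1, sigma2, rho), size=(n, m))
    W = x + e
    # Randomized group indicator, assumed fixed over time for each subject.
    z = np.repeat(rng.integers(0, 2, size=(n, 1)), m, axis=1)
    eta = beta0 + bx1 * x[..., 0] + bx2 * x[..., 1] + bz * z
    mu = 1.0 / (1.0 + np.exp(-eta))
    y = rng.binomial(1, mu)          # independence across visits is an assumption
    return {"x": x, "W": W, "z": z, "y": y}
```

Each of the 200 simulation runs would call such a routine with a different seed and then apply the three analyses to the resulting data.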

In Table 2, we report on the results of the difference of the average of the estimates and the true value (Bias), the empirical standard error (SE), and the coverage rate (CR in percent) for 95% confidence intervals. If measurement error is minor, for instance, when both σ1 and σ2 are 0.15, even Analysis 1 may give rise to reasonable results with fairly small finite-sample biases and CRs that are close to the nominal level 95%. The 3 analyses provide fairly comparable results.
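The three summary measures can be computed from the replicate estimates along the following lines. This sketch assumes Wald-type intervals (estimate ± 1.96 × standard error), which the text does not state explicitly; the function name `simulation_summary` is hypothetical.

```python
import numpy as np

def simulation_summary(estimates, std_errors, truth, z=1.959964):
    """Return (Bias, empirical SE, coverage rate in percent).

    estimates  : parameter estimates, one per simulation run.
    std_errors : the corresponding estimated standard errors.
    truth      : the true parameter value used to generate the data.
    """
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    bias = est.mean() - truth                     # average estimate minus true value
    emp_se = est.std(ddof=1)                      # spread of the estimates across runs
    covered = np.abs(est - truth) <= z * se       # does the 95% interval contain truth?
    return bias, emp_se, 100.0 * covered.mean()
```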

Table 2.

Simulation results

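| σ1 | σ2 | Analysis | βx1 Bias | βx1 SE | βx1 CR | βx2 Bias | βx2 SE | βx2 CR | βz Bias | βz SE | βz CR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.15 | 0.15 | 1 | −0.0175 | 0.1323 | 95.5 | 0.0000 | 0.1241 | 95.5 | 0.0044 | 0.2315 | 94.0 |
| 0.15 | 0.15 | 2 | −0.0073 | 0.1357 | 96.0 | 0.0094 | 0.1277 | 97.5 | 0.0038 | 0.2320 | 94.5 |
| 0.15 | 0.15 | 3 | −0.0073 | 0.1358 | 95.5 | 0.0094 | 0.1278 | 97.0 | 0.0038 | 0.2321 | 94.0 |
| 0.15 | 0.50 | 1 | 0.0223 | 0.1303 | 94.5 | −0.1030 | 0.1098 | 87.5 | 0.0068 | 0.2314 | 94.0 |
| 0.15 | 0.50 | 2 | 0.0012 | 0.1366 | 95.0 | −0.0135 | 0.1389 | 96.0 | 0.0050 | 0.2341 | 94.5 |
| 0.15 | 0.50 | 3 | 0.0011 | 0.1367 | 94.5 | −0.0135 | 0.1393 | 95.5 | 0.0053 | 0.2344 | 94.5 |
| 0.15 | 0.75 | 1 | 0.0579 | 0.1282 | 91.0 | −0.1839 | 0.0957 | 54.5 | 0.0080 | 0.2305 | 94.0 |
| 0.15 | 0.75 | 2 | 0.0253 | 0.1365 | 94.0 | −0.0728 | 0.1376 | 91.5 | 0.0061 | 0.2343 | 94.0 |
| 0.15 | 0.75 | 3 | 0.0252 | 0.1365 | 94.0 | −0.0727 | 0.1381 | 90.5 | 0.0065 | 0.2347 | 94.0 |
| 0.50 | 0.15 | 1 | −0.1199 | 0.1175 | 79.0 | 0.0389 | 0.1233 | 97.5 | 0.0093 | 0.2316 | 94.5 |
| 0.50 | 0.15 | 2 | −0.0327 | 0.1472 | 95.5 | 0.0179 | 0.1301 | 97.0 | 0.0077 | 0.2338 | 94.5 |
| 0.50 | 0.15 | 3 | −0.0326 | 0.1475 | 95.0 | 0.0178 | 0.1307 | 97.0 | 0.0079 | 0.2342 | 94.5 |
| 0.50 | 0.50 | 1 | −0.0970 | 0.1180 | 83.0 | −0.0798 | 0.1113 | 89.5 | 0.0129 | 0.2314 | 94.0 |
| 0.50 | 0.50 | 2 | −0.0259 | 0.1458 | 95.5 | −0.0068 | 0.1386 | 96.0 | 0.0094 | 0.2364 | 94.5 |
| 0.50 | 0.50 | 3 | −0.0258 | 0.1463 | 95.0 | −0.0067 | 0.1394 | 95.0 | 0.0098 | 0.2370 | 94.5 |
| 0.50 | 0.75 | 1 | −0.0641 | 0.1173 | 92.5 | −0.1754 | 0.0982 | 58.5 | 0.0148 | 0.2303 | 93.5 |
| 0.50 | 0.75 | 2 | −0.0066 | 0.1451 | 96.5 | −0.0683 | 0.1387 | 91.0 | 0.0109 | 0.2369 | 94.5 |
| 0.50 | 0.75 | 3 | −0.0067 | 0.1456 | 96.0 | −0.0681 | 0.1395 | 90.0 | 0.0114 | 0.2375 | 94.5 |
| 0.75 | 0.15 | 1 | −0.1976 | 0.1028 | 48.5 | 0.0730 | 0.1219 | 93.0 | 0.0118 | 0.2314 | 94.5 |
| 0.75 | 0.15 | 2 | −0.0908 | 0.1458 | 84.5 | 0.0411 | 0.1312 | 96.5 | 0.0107 | 0.2343 | 94.5 |
| 0.75 | 0.15 | 3 | −0.0906 | 0.1461 | 84.5 | 0.0410 | 0.1320 | 96.5 | 0.0111 | 0.2348 | 94.5 |
| 0.75 | 0.50 | 1 | −0.1899 | 0.1041 | 54.5 | −0.0478 | 0.1109 | 93.5 | 0.0161 | 0.2311 | 94.0 |
| 0.75 | 0.50 | 2 | −0.0865 | 0.1454 | 86.0 | 0.0117 | 0.1386 | 98.0 | 0.0127 | 0.2370 | 94.5 |
| 0.75 | 0.50 | 3 | −0.0864 | 0.1459 | 85.5 | 0.0118 | 0.1396 | 96.5 | 0.0133 | 0.2377 | 94.5 |
| 0.75 | 0.75 | 1 | −0.1637 | 0.1041 | 63.0 | −0.1498 | 0.0985 | 68.5 | 0.0184 | 0.2300 | 93.5 |
| 0.75 | 0.75 | 2 | −0.0698 | 0.1451 | 89.0 | −0.0521 | 0.1389 | 92.5 | 0.0144 | 0.2376 | 94.0 |
| 0.75 | 0.75 | 3 | −0.0697 | 0.1456 | 88.5 | −0.0518 | 0.1399 | 92.5 | 0.0152 | 0.2384 | 93.5 |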

When there is moderate or substantial measurement error in covariates xij, the performance of Analysis 1 deteriorates remarkably in estimation of error-prone covariate effects. Analysis 1 may lead to considerably biased estimates for βx1 and βx2. For example, see the entries with σ1 = 0.75 and σ2 = 0.15 in Table 2. The CR for 95% confidence intervals for βx1 can be as low as 49%. By accounting for measurement error in the response model, both Analyses 2 and 3 remarkably improve the performance, providing much smaller biases and much higher CRs for the 95% confidence intervals. Analysis 2 gives rise to results very comparable to those produced by Analysis 3, though Analysis 2 seems to yield slightly larger finite-sample biases. The simulation study considered here suggests that the impact of ignoring measurement error in modeling the missing data process is not as remarkable as that in modeling the response process.

In terms of estimation of βz, Analysis 1 produces larger biases than Analyses 2 and 3 do, though the magnitude is not as striking as that for the estimates of βx. Among the 3 analyses, Analysis 1 provides the smallest standard errors while Analysis 3 yields the largest but the differences between Analyses 2 and 3 are not considerable. The CRs for the 95% confidence intervals obtained from the 3 analyses agree reasonably well with the nominal value.

In summary, ignoring measurement error may lead to substantially biased results. Properly addressing covariate measurement error in estimation procedures is necessary. The proposed method (i.e. Analysis 3) performs reasonably well under various configurations. Its performance may become less satisfactory when measurement error becomes substantial. However, the proposed SIMEX method does significantly improve the performance of the naive analysis (i.e. Analysis 1).

6. DISCUSSION

In this paper, we propose a simulation-based marginal method to analyze longitudinal data with both missing observations and error-contaminated covariates. This work is of particular interest because missingness and measurement error in covariates arise commonly in longitudinal studies, and to date, there is little work to address both features (Liu and Wu, 2007). Yi (2005) discussed inference approaches to handle continuous or count data arising from longitudinal studies, but those methods cannot apply to binary responses due to the nature of the logistic regression. The proposed method may, however, handle binary responses, in addition to continuous responses or count data. Moreover, in contrast to the models of Yi (2005) where only precisely observed covariates may enter model (2.2), the proposed method allows the dependence of the missing data process on error-prone covariates. The proposed method is simple but flexible. Its implementation is straightforward by slightly modifying standard statistical software such as PROC GENMOD in SAS. The proposed method does not require the complete specification of the full distribution of the response process but only requires the specification of the structures of marginal means and variances. Also, the method does not require modeling the underlying covariate process, which is desirable for many practical problems.

The proposed methods may apply to handle clustered or correlated data as well. In some situations, the interest may also concern the association strength among response components within clusters. We may, following the lines of Yi and Cook (2002), construct a second set of estimating equations for association parameters. In that formulation, proper adjustments should be introduced to account for biases induced by both missing observations and measurement error in covariates.

In this paper, we focus the discussion on the IPWGEE method, for which an MAR missing data mechanism is assumed. One may, however, employ other modeling frameworks such as random-effects models to accommodate NMAR mechanisms as well. Without considering missing observations, Wang and others (1998) studied the random-effects models to account for measurement error in covariates. It would be interesting to develop methods to simultaneously adjust for the biases resulting from missingness and measurement error in this context.

When modeling the missing data process, we consider the case that the true but error-prone covariates Xi enter the model to govern the missingness probability. In some instances, it could be more feasible to facilitate the dependence of dropout on the observed covariates Wi. In this case, the proposed method can apply with a minor modification. See Carroll and others (2006, Chapter 2, Section 11.8) for general discussion on the issue of building a model by conditioning on the true underlying covariates or the observed data.

As seen in Section 4, there is no additional information, such as repeated measurements of SBP and CHOL, available to estimate variance parameters σ1 and σ2; we therefore undertake sensitivity analyses by specifying a sequence of values of σ1 and σ2 to assess the impact of measurement error on estimation of the response parameters β. Sometimes, there exists additional information on the measurement error process and the associated parameters may be estimated. In these circumstances, we need to accommodate the resulting variation induced by estimating error parameters. With replicate measurements Wi available, for example, we may modify the proposed method by adapting the arguments in Devanarayan and Stefanski (2002) to accommodate measurement error models with unknown variance parameters.

FUNDING

Natural Sciences and Engineering Research Council of Canada.

Acknowledgments

The author acknowledges referees' helpful comments. The author thanks Boston University and the National Heart, Lung, and Blood Institute (NHLBI) for providing the data set from the Framingham Heart Study (No. N01-HC-25195) in the illustration. The Framingham Heart Study is conducted and supported by the NHLBI in collaboration with Boston University. This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or NHLBI. Conflict of Interest: None declared.

APPENDIX

Adapting the arguments in Carroll and others (1996), we outline the proof of the Theorem as follows. Let Ui(θ; b, λ), Si(α; b, λ), and Hi(θ; b, λ) be Ui(θ), Si(α), and Hi(θ), respectively, with xij replaced by Wij(b, λ). By standard estimating equation theory, under some regularity conditions, θ̂(b, λ) →p θ(λ) as B → ∞, where θ(λ) is the solution of E[Hi(θ; 1, λ)] = 0.

Let Inline graphic. For each given b and λ, a Taylor series expansion leads to

graphic file with name biostskxm054fx28_ht.jpg

therefore, for very large B,

graphic file with name biostskxm054fx29_ht.jpg
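The expansion step can be sketched as follows. This is the generic estimating-equation argument, not a reconstruction of the displayed formulas, and the notation is assumed here: $\hat\theta(b,\lambda)$ denotes the root of $\sum_i H_i(\theta; b, \lambda) = 0$, $\hat\theta(\lambda)$ its average over $b$, and $A(\lambda)$ the expected derivative matrix.

```latex
\begin{align*}
0 &= n^{-1/2}\sum_{i=1}^{n} H_i\{\hat\theta(b,\lambda); b, \lambda\} \\
  &\approx n^{-1/2}\sum_{i=1}^{n} H_i\{\theta(\lambda); b, \lambda\}
     + \Big[ n^{-1}\sum_{i=1}^{n}
       \frac{\partial H_i\{\theta(\lambda); b, \lambda\}}{\partial\theta^{T}} \Big]
       \sqrt{n}\,\{\hat\theta(b,\lambda) - \theta(\lambda)\}, \\[4pt]
\sqrt{n}\,\{\hat\theta(b,\lambda) - \theta(\lambda)\}
  &\approx -A^{-1}(\lambda)\, n^{-1/2}\sum_{i=1}^{n} H_i\{\theta(\lambda); b, \lambda\},
  \qquad
  A(\lambda) = E\Big[\frac{\partial H_i\{\theta(\lambda); 1, \lambda\}}{\partial\theta^{T}}\Big], \\[4pt]
\sqrt{n}\,\{\hat\theta(\lambda) - \theta(\lambda)\}
  &\approx -A^{-1}(\lambda)\, n^{-1/2}\sum_{i=1}^{n}
     \frac{1}{B}\sum_{b=1}^{B} H_i\{\theta(\lambda); b, \lambda\},
  \qquad
  \hat\theta(\lambda) = \frac{1}{B}\sum_{b=1}^{B}\hat\theta(b,\lambda).
\end{align*}
```

The last line is a normalized sum of independent terms indexed by $i$, which is the form needed for a central limit theorem argument.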

Let Inline graphic be qM × 1 vectors. Let Inline graphic. Then, by the central limit theorem, as n → ∞,

graphic file with name biostskxm054fx32_ht.jpg

where Inline graphic.

Assume that the exact extrapolation functions, say, G(γ; λ), are available in the extrapolation step to fit Inline graphic, where γ is a vector of parameters of dimension d, say. Fit Inline graphic to Inline graphic. Define Inline graphic. Let Inline graphic be the d × qM matrix and Inline graphic be a d × d matrix. Then, by an argument similar to that in Carroll and others (1996), we obtain, as n → ∞,

graphic file with name biostskxm054fx39_ht.jpg

where Inline graphic. Letting λ = −1 leads to the SIMEX estimator Inline graphic. Therefore, the asymptotic distribution of the SIMEX estimator is

graphic file with name biostskxm054fx42_ht.jpg
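The extrapolation step can be sketched with the delta method. This is a generic version that assumes unweighted least squares fitting of the extrapolant, so the lost definitions above may differ in detail (for example, through a weight matrix). The symbols $\hat\theta_*$ (the stacked $qM \times 1$ vector of averaged estimates at $\lambda_1, \ldots, \lambda_M$), $\theta_*$, $G_*$, $\Sigma$, $C(\gamma)$, $D(\gamma)$, and $G_\gamma$ are notation assumed here.

```latex
\begin{align*}
&\sqrt{n}\,(\hat\theta_* - \theta_*) \to_d N(0, \Sigma),
  \qquad
  G_*(\gamma) = \{G(\gamma;\lambda_1)^{T}, \ldots, G(\gamma;\lambda_M)^{T}\}^{T} = \theta_*, \\[4pt]
&\hat\gamma = \arg\min_{\gamma}\, \{\hat\theta_* - G_*(\gamma)\}^{T}\{\hat\theta_* - G_*(\gamma)\}
  \;\Longrightarrow\;
  C(\hat\gamma)\,\{\hat\theta_* - G_*(\hat\gamma)\} = 0,
  \qquad
  C(\gamma) = \frac{\partial G_*(\gamma)^{T}}{\partial\gamma} \;\; (d \times qM), \\[4pt]
&\sqrt{n}\,(\hat\gamma - \gamma) \approx D^{-1}(\gamma)\, C(\gamma)\, \sqrt{n}\,(\hat\theta_* - \theta_*),
  \qquad
  D(\gamma) = C(\gamma)\, C(\gamma)^{T} \;\; (d \times d), \\[4pt]
&\sqrt{n}\,(\hat\gamma - \gamma) \to_d N\{0, \Sigma(\gamma)\},
  \qquad
  \Sigma(\gamma) = D^{-1}(\gamma)\, C(\gamma)\, \Sigma\, C(\gamma)^{T} D^{-1}(\gamma), \\[4pt]
&\sqrt{n}\,\{G(\hat\gamma; -1) - \theta\} \to_d
  N\{0,\; G_\gamma\, \Sigma(\gamma)\, G_\gamma^{T}\},
  \qquad
  G_\gamma = \frac{\partial G(\gamma; -1)}{\partial\gamma^{T}}, \quad G(\gamma; -1) = \theta.
\end{align*}
```

The identity $G(\gamma; -1) = \theta$ relies on the assumption stated in the text that the exact extrapolation function is available; with an approximate extrapolant the limit is centered at the extrapolated value instead.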

References

  1. Carroll RJ, Küchenhoff H, Lombard F, Stefanski LA. Asymptotics for the SIMEX estimator in nonlinear measurement error models. Journal of the American Statistical Association. 1996;91:242–250.
  2. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models. 2nd edition. Boca Raton (FL): Chapman & Hall; 2006.
  3. Cook J, Stefanski LA. A simulation extrapolation method for parametric measurement error models. Journal of the American Statistical Association. 1994;89:464–467.
  4. Cook RJ, Zeng L, Yi GY. Marginal analysis of incomplete longitudinal binary data: a cautionary note on LOCF imputation. Biometrics. 2004;60:820–828. doi: 10.1111/j.0006-341X.2004.00234.x.
  5. Devanarayan V, Stefanski LA. Empirical simulation extrapolation for measurement error models with replicate measurements. Statistics and Probability Letters. 2002;59:219–225.
  6. Diggle P, Heagerty P, Liang K-Y, Zeger S. Analysis of Longitudinal Data. 2nd edition. New York: Oxford University Press; 2002.
  7. Diggle P, Kenward MG. Informative drop-out in longitudinal data analysis (with discussion). Applied Statistics. 1994;43:49–93.
  8. Fuller WA. Measurement Error Models. New York: Wiley; 1987.
  9. Kenward MG. Selection models for repeated measurements with nonrandom dropout: an illustration of sensitivity. Statistics in Medicine. 1998;17:2723–2732. doi: 10.1002/(sici)1097-0258(19981215)17:23<2723::aid-sim38>3.0.co;2-5.
  10. Li Y, Lin X. Functional inference in frailty measurement error models for clustered survival data using the SIMEX approach. Journal of the American Statistical Association. 2003;98:191–203.
  11. Liu W, Wu L. Simultaneous inference for semiparametric nonlinear mixed-effects models with covariate measurement errors and missing responses. Biometrics. 2007;63:342–350. doi: 10.1111/j.1541-0420.2006.00687.x.
  12. Miglioretti DL, Heagerty PJ. Marginal modeling of multilevel binary data with time-varying covariates. Biostatistics. 2004;5:381–398. doi: 10.1093/biostatistics/5.3.381.
  13. Pepe MS, Anderson GL. A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics - Simulation and Computation. 1994;23:939–951.
  14. Prentice RL. Covariate measurement errors and parameter estimation in a failure time regression model. Biometrika. 1982;69:331–342.
  15. Robins JM, Rotnitzky A, Zhao LP. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association. 1995;90:106–121.
  16. Strug L, Sun L, Corey M. The genetics of cross-sectional and longitudinal body mass index. BMC Genetics. 2003;4(Suppl 1):S14. doi: 10.1186/1471-2156-4-S1-S14.
  17. Wang N, Lin X, Gutierrez RG, Carroll RJ. Bias analysis and SIMEX approach in generalized linear mixed measurement error models. Journal of the American Statistical Association. 1998;93:249–261.
  18. Yi GY. Robust methods for incomplete longitudinal data with mismeasured covariates. Far East Journal of Theoretical Statistics. 2005;16:205–234.
  19. Yi GY, Cook RJ. Marginal methods for incomplete longitudinal data arising in clusters. Journal of the American Statistical Association. 2002;97:1071–1080.
  20. Yi GY, He W. Methods for bivariate survival data with mismeasured covariates under an accelerated failure time model. Communications in Statistics - Theory and Methods. 2006;35:1539–1554.
  21. Yi GY, Lawless JF. A corrected likelihood method for the proportional hazards model with covariates subject to measurement error. Journal of Statistical Planning and Inference. 2007;137:1816–1828.
  22. Yi GY, Thompson ME. Marginal and association regression models for longitudinal binary data with drop-outs: a likelihood-based approach. The Canadian Journal of Statistics. 2005;33:3–20.
  23. Yoo YJ, Huo Y, Ning Y, Gordon D, Finch S, Mendell NR. Power of maximum HLOD tests to detect linkage to obesity genes. BMC Genetics. 2003;4(Suppl 1):S16. doi: 10.1186/1471-2156-4-S1-S16.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
