A simulation-based marginal method for longitudinal data with dropout and mismeasured covariates

Grace Y Yi

Biostatistics. 2008 Jan 16;9(3):501–512. doi: 10.1093/biostatistics/kxm054

PMCID: PMC3294321 PMID: 18199691

Abstract

Longitudinal data often contain missing observations and error-prone covariates. Extensive attention has been directed to analysis methods to adjust for the bias induced by missing observations. There is relatively little work on investigating the effects of covariate measurement error on estimation of the response parameters, especially on simultaneously accounting for the biases induced by both missing values and mismeasured covariates. It is not clear what the impact of ignoring measurement error is when analyzing longitudinal data with both missing observations and error-prone covariates. In this article, we study the effects of covariate measurement error on estimation of the response parameters for longitudinal studies. We develop an inference method that adjusts for the biases induced by measurement error as well as by missingness. The proposed method does not require the full specification of the distribution of the response vector but only requires modeling its mean and variance structures. Furthermore, the proposed method employs the so-called functional modeling strategy to handle the covariate process, with the distribution of covariates left unspecified. These features, plus the simplicity of implementation, make the proposed method very attractive. In this paper, we establish the asymptotic properties for the resulting estimators. With the proposed method, we conduct sensitivity analyses on a cohort data set arising from the Framingham Heart Study. Simulation studies are carried out to evaluate the impact of ignoring covariate measurement error and to assess the performance of the proposed method.

Keywords: Estimating equations, Longitudinal data, Measurement error, Missing data, Simulation and extrapolation method

1. INTRODUCTION

Longitudinal studies are commonly conducted in the health sciences, biochemistry, and epidemiology. Although such studies are designed to collect data on every individual at each assessment, missing observations often arise for various reasons. There has been increasing interest in valid inference methods for longitudinal data with missing values. Yet there is relatively little work on the effects of covariate measurement error on estimation of the response parameters, and especially on simultaneously accounting for the biases induced by both missing values and mismeasured covariates. Measurement error in covariates is, however, a typical feature of longitudinal data. Sometimes covariates of interest are difficult to observe precisely because of physical location or cost. Sometimes it is impossible to measure covariates accurately because of their nature. In other situations, a covariate may represent an average of a certain quantity over time (e.g. cholesterol level (CHOL)), and any practical way of measuring such a quantity necessarily features measurement error.

It has been recognized and well documented in other contexts that ignoring covariate measurement error may lead to severely biased results. For example, Fuller (1987) pointed out that the slope in a simple linear regression model may be attenuated if covariate measurement error is ignored. For survival data analysis, Prentice (1982), Li and Lin (2003), Yi and He (2006), and Yi and Lawless (2007), among others, investigated measurement error effects and developed inference methods to correct for the bias resulting from measurement error in covariates. For an overview of measurement error problems, see Carroll and others (2006).

In this paper, we investigate the impact of covariate measurement error on longitudinal data analysis. This work is motivated by the need for methods that simultaneously address missingness and measurement error, both of which are common in longitudinal data. For example, a data set arising from the Framingham Heart Study contains error-prone covariates and a portion of subjects who drop out of the study during the follow-up period. An objective of this study is to understand how obesity is associated with covariates such as age, blood pressure, and CHOL. It is well known that individual measurements of blood pressure and CHOL involve substantial measurement error. Here, the true values of these covariates are defined as their long-term averages. The measurement at a specific time point fluctuates with time, seasonal variation, and other confounding factors. The combination of covariate measurement error and dropout presents a challenge to existing inference methods. In this paper, we develop an inference method for analyzing longitudinal data that have both dropout and error-contaminated covariates. We use marginal methods to model the response process and a functional method for the measurement error process. Such a method is appealing because it does not require the specification of the covariate distribution.

The remainder of the paper is organized as follows. Notation and model setup are introduced in Section 2. In Section 3, we discuss a simulation–extrapolation (SIMEX) method to account for both dropout and covariate measurement error. A data set arising from the Framingham Heart Study is analyzed with the proposed method, and the results are reported in Section 4. In Section 5, we conduct simulation studies to assess the performance of the proposed method as well as the impact of ignoring measurement error in covariates. General discussion is included in Section 6.

2. NOTATION AND MODEL SETUP

2.1. Response process

Longitudinal data analysis may typically be conducted based on marginal, random-effects, and transitional models (Diggle and others, 2002). In this paper, we focus on marginal analysis with the primary interest centered on the marginal mean parameters. Let Y_ij be the response variable for subject i at time point j, x_ij be the covariate vector subject to error, and z_ij be the vector of error-free covariates, i=1,2,…,n and j=1,2,…,m. Denote Y_i=(Y_i1,Y_i2,…,Y_im)′, x_i=(x_i1′,x_i2′,…,x_im′)′, and z_i=(z_i1′,z_i2′,…,z_im′)′. Let μ_ij=E(Y_ij|x_i,z_i) and v_ij=var(Y_ij|x_i,z_i) be the conditional expectation and variance of Y_ij, respectively, given the covariates x_i and z_i.

We model the influence of the covariates on the marginal response mean by means of a regression model

$$g(\mu_{ij}) = x_{ij}'\beta_x + z_{ij}'\beta_z, \tag{2.1}$$

where β=(β_x′,β_z′)′ is the vector of regression parameters of dimension p, say, and g(·) is a known monotone function. If necessary, the intercept may be included in β_z by adding a unit vector to covariates z_i. Furthermore, assume v_ij=h(μ_ij;φ), where h(·;·) is a known function and φ is the dispersion parameter that is known or may be estimated. We treat φ as known here with emphasis on estimation of β.
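As a minimal sketch of the mean and variance structure just described, the following assumes a logit link for g, a linear predictor in x_ij and z_ij, and the binomial-type variance function h(μ; φ) = φμ(1 − μ). The function names `marginal_mean` and `variance_function` and the numerical inputs are illustrative only and do not come from the paper.

```python
import math

def marginal_mean(x, z, beta_x, beta_z):
    """Mean mu_ij under an assumed logit link: g(mu) = x'beta_x + z'beta_z."""
    eta = sum(b * v for b, v in zip(beta_x, x)) + sum(b * v for b, v in zip(beta_z, z))
    return 1.0 / (1.0 + math.exp(-eta))

def variance_function(mu, phi=1.0):
    """Assumed variance function h(mu; phi) = phi * mu * (1 - mu)."""
    return phi * mu * (1.0 - mu)

# Hypothetical covariate values; the leading 1.0 in z carries the intercept.
mu = marginal_mean(x=[0.2, -0.1], z=[1.0, 0.5], beta_x=[0.4, 0.4], beta_z=[-0.3, 0.1])
v = variance_function(mu)
```

Other links (identity, log) and variance functions would slot into the same two functions without changing anything downstream.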

Here, we assume that the dependence of mean μ_ij on the subject-level covariates x_i and z_i is completely reflected by the time-specific covariates x_ij and z_ij, that is, E(Y_ij|x_i,z_i)=E(Y_ij|x_ij,z_ij). This assumption has been widely adopted in modeling longitudinal data, see Diggle and Kenward (1994), Robins and others (1995), Cook and others (2004), and Yi and Thompson (2005), for example. This assumption was noted in Pepe and Anderson (1994) and was justified from the viewpoint of formulating unbiased estimating functions. Model (2.1) may consist of baseline covariates such as gender, age, and treatment status or time-varying covariates. With an exogenous covariate process (i.e. a time-varying covariate that is not predicted by past outcomes), properly including current or lagged values of the covariates may meet this assumption (e.g. Miglioretti and Heagerty, 2004). Both cross-sectional and longitudinal effects of time-varying covariates may be featured in model (2.1). See Diggle and others (2002, Chapter 12) for more detailed discussion.

2.2. Missing data process

Let R_ij be 1 if Y_ij is observed and 0 otherwise. Let R_i=(R_i1,R_i2,…,R_im)′ be the vector of (non)missing data indicators, i=1,2,…,n. Dropouts or monotone missing data patterns are considered here. That is, R_ij=0 implies R_ij′=0 for all j′>j. Without loss of generality, assume that R_i1=1 for every subject i. According to the dependence of the missing data process on the response process, missing data mechanisms may be classified as missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR) (e.g. Kenward, 1998).

In this paper, we assume an MAR mechanism for the dropout process. That is, given the covariates, the conditional distribution f(r_i|x_i,z_i,y_i) depends on the observed response components y_i^obs and the covariates only. Let λ_ij=P(R_ij=1|R_{i,j−1}=1,x_i,z_i,y_i) and π_ij=P(R_ij=1|x_i,z_i,y_i). Note that π_ij=∏_{t=2}^{j} λ_it. Let H_ij^y={y_i1,…,y_{i,j−1}} denote the response history up to (but not including) time point j.

Logistic regression models are commonly used to model the dropout process (e.g. Diggle and Kenward, 1994; Robins and others, 1995), namely,

$$\operatorname{logit}\lambda_{ij} = u_{ij}'\alpha, \tag{2.2}$$

where u_ij is the vector consisting of the information of the covariates x_i and z_i and the observed responses H_ij^y, and α is the vector of regression parameters.
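The link between the conditional probabilities λ_ij and the marginal observation probabilities π_ij can be sketched as follows. The sketch assumes the logistic form logit λ_ij = u_ij′α for the dropout model and uses R_i1 = 1, so that π_i1 = 1; the names `conditional_obs_prob` and `marginal_obs_probs` are hypothetical.

```python
import math

def conditional_obs_prob(u, alpha):
    """lambda_ij = expit(u_ij' alpha), an assumed logistic dropout model."""
    eta = sum(a * v for a, v in zip(alpha, u))
    return 1.0 / (1.0 + math.exp(-eta))

def marginal_obs_probs(u_by_time, alpha):
    """Return [pi_i1, ..., pi_im] with pi_i1 = 1 and pi_ij = prod_{t=2}^{j} lambda_it.

    u_by_time[k] holds the vector u_it for time point t = k + 2.
    """
    pis = [1.0]
    for u in u_by_time:
        pis.append(pis[-1] * conditional_obs_prob(u, alpha))
    return pis

# With alpha = 0 every lambda_it equals 0.5, so the pi_ij halve at each visit.
pis = marginal_obs_probs([[1.0, 0.0], [1.0, 1.0]], alpha=[0.0, 0.0])
```

Because π_ij is a running product, it is non-increasing in j, which is what makes the inverse weights 1/π_ij grow for later visits.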

Let M_i be the random dropout time for subject i and m_i be a realization, i=1,2,…,n. Define L_i(α)=(1−λ_{i,m_i})∏_{t=2}^{m_i−1} λ_it, where λ_it is determined by model (2.2). Let S_i(α)=∂log L_i(α)/∂α be the vector of score functions contributed by subject i. Denote θ=(α′,β′)′ and q=dim(θ).
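To make the score contribution concrete, the following is a sketch of S_i(α) under the assumption that the dropout model has the logistic form logit λ_it = u_it′α. It also assumes that subject i actually drops out at m_i, so that the factor 1 − λ_{i,m_i} is present; for a subject who completes the study that term would be absent.

```latex
\begin{align*}
\log L_i(\alpha) &= \sum_{t=2}^{m_i-1}\log\lambda_{it} + \log(1-\lambda_{i m_i}),\\
\frac{\partial \log\lambda_{it}}{\partial\alpha} &= (1-\lambda_{it})\,u_{it},\qquad
\frac{\partial \log(1-\lambda_{it})}{\partial\alpha} = -\lambda_{it}\,u_{it},\\
S_i(\alpha) &= \sum_{t=2}^{m_i-1}(1-\lambda_{it})\,u_{it} - \lambda_{i m_i}\,u_{i m_i}.
\end{align*}
```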

2.3. Measurement error process

Let W_ij be an observed measurement of the covariate x_ij, i=1,2,…,n,j=1,2,…,m. x_ij and W_ij are assumed to follow a classical additive measurement error model. That is, conditional on x_i and z_i,

$$W_{ij} = x_{ij} + e_{ij}, \tag{2.3}$$

where the error terms e_ij are assumed to follow N(0,Σ_e) with Σ_e being the covariance matrix (e.g. Wang and others, 1998).

It is known that nonidentifiability is often a problem if model (2.3) is employed. For identifiability of model parameters, one needs a validation data set consisting of {Y_ij,x_ij,W_ij,z_ij} or repeated measurements W_ij to estimate the parameters associated with Σ_e. If neither validation data nor repeated measurements W_ij are available, then one may conduct sensitivity analyses based on background information about the measurement process to assess the impact of different degrees of measurement error on estimation of β (Yi and Lawless, 2007). In this paper, error distribution parameters are assumed known.
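The classical additive error model can be illustrated with a short sketch for a bivariate error-prone covariate. It assumes Σ_e is parameterized by two standard deviations and a correlation (called `sigma1`, `sigma2`, and `rho` here, names chosen for illustration) and draws the errors through the Cholesky factor of Σ_e.

```python
import math
import random

def draw_error(sigma1, sigma2, rho, rng):
    """Draw one bivariate N(0, Sigma_e) error via the Cholesky factor of Sigma_e."""
    g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    e1 = sigma1 * g1
    e2 = sigma2 * (rho * g1 + math.sqrt(1.0 - rho ** 2) * g2)
    return e1, e2

def surrogate(x, sigma1, sigma2, rho, rng):
    """Classical additive model: W_ij = x_ij + e_ij."""
    e1, e2 = draw_error(sigma1, sigma2, rho, rng)
    return x[0] + e1, x[1] + e2

rng = random.Random(2008)
errors = [draw_error(0.5, 1.0, 0.5, rng) for _ in range(20000)]
# Sample covariance of the two error components; its target is rho*sigma1*sigma2.
sample_cov = sum(a * b for a, b in errors) / len(errors)
```

In a sensitivity analysis, the same generator would be rerun over a grid of assumed values for the error parameters.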

3. INFERENCE PROCEDURES

3.1. Weighted estimation functions

The inverse probability weighted generalized estimating equation (IPWGEE) method is often employed to account for the bias induced by the incompleteness of data (Robins and others, 1995) when primary interest lies in the marginal mean parameters β in model (2.1). For i=1,2,…,n, let D_i=∂μ_i′/∂β be the matrix of the derivatives of the mean vector μ_i with respect to β and Δ_i=diag(I(R_ij=1)/π_ij, j=1,2,…,m) be the weight matrix accommodating missingness, where I(·) is the indicator function. Let V_i=A_i^{1/2}C_iA_i^{1/2} be the covariance matrix of Y_i, where A_i=diag(v_ij, j=1,2,…,m) and C_i=[ρ_i;jk] is the correlation matrix with ρ_i;jk being the correlation coefficient of response components Y_ij and Y_ik, for j≠k, and ρ_i;jj=1. For i=1,2,…,n, define U_i(θ)=D_iV_i⁻¹Δ_i(Y_i−μ_i) and H_i(θ)=(U_i′(θ),S_i′(α))′.

In the absence of measurement error, that is, when covariates x_ij are precisely observed, E[H_i(θ)]=0; hence, H(θ)=∑_{i=1}^{n} H_i(θ) is a vector of unbiased estimating functions for θ. A consistent estimator θ̂ of θ can be obtained by solving

$$H(\theta) = 0, \tag{3.1}$$

where moment estimates may be used for the correlation matrix C_i or, alternatively, a working independence matrix A_i may be used to replace V_i.
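The weighted estimating function for β can be sketched for one simple case: a logit link with the working independence choice V_i = A_i, for which D_iV_i⁻¹ reduces to the covariate vector at each visit. The sketch treats the observation probabilities π_ij as given rather than estimated from the dropout model, and the data layout (a list of visits per subject) is an assumption made for illustration.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def ipw_estimating_function(beta, subjects):
    """Sum over subjects of U_i(beta) under working independence and a logit link.

    Each subject is a list of visits (cov, y, r, pi), where r is the
    observation indicator R_ij and pi is the observation probability pi_ij.
    """
    p = len(beta)
    total = [0.0] * p
    for visits in subjects:
        for cov, y, r, pi in visits:
            if r == 0:
                continue  # unobserved responses receive weight zero
            mu = expit(sum(b * c for b, c in zip(beta, cov)))
            weight = 1.0 / pi  # inverse probability of being observed
            for k in range(p):
                total[k] += weight * cov[k] * (y - mu)
    return total

# Intercept-only toy data: the weighted mean of the observed y is 0.5,
# so beta = logit(0.5) = 0 solves the estimating equation.
subjects = [
    [([1.0], 1, 1, 1.0), ([1.0], 0, 1, 0.5)],
    [([1.0], 1, 1, 1.0), ([1.0], 0, 0, 0.5)],
]
u_at_zero = ipw_estimating_function([0.0], subjects)
```

A root finder such as Newton–Raphson applied to this function would give the estimate of β; jointly estimating α requires stacking the dropout score functions as in H_i(θ).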

3.2. SIMEX approach

When measurement error is present in covariates x_ij, H(θ) is no longer unbiased if x_ij is replaced by its observed measurement W_ij. A proper adjustment is needed to account for the bias induced by using W_ij. In the sequel, we describe the SIMEX method for this adjustment. Let B be a given positive integer and Λ={λ₁,λ₂,…,λ_M} be a sequence of nonnegative numbers taken from [0,λ_M] with λ₁=0.

1. Simulation step

For i=1,2,…,n and j=1,2,…,m, generate e_ijb ∼ N(0,Σ_e) for b=1,2,…,B. Given λ ∈ Λ, set W_ij(b,λ)=W_ij+√λ e_ijb.

2. Estimation step

For given λ and b, we obtain an estimate θ̂(b,λ) by solving (3.1) with x_ij replaced by W_ij(b,λ). This step can be implemented quickly by applying the SAS GENMOD procedure to the data set {Y_i,W_i(b,λ),z_i:i=1,2,…,n}. The model-based covariance matrix for θ̂(b,λ), written Ω̂(b,λ) here, is given by

[display equation and the definitions of its two component matrices are missing from the source text]

Denote by Ω̂_r(b,λ) the rth diagonal element of Ω̂(b,λ) and by θ̂_r(b,λ) the rth component of θ̂(b,λ), r=1,2,…,q. Define θ̂_r(λ)=B⁻¹∑_{b=1}^{B} θ̂_r(b,λ), Ω̂_r(λ)=B⁻¹∑_{b=1}^{B} Ω̂_r(b,λ), Ŝ_r(λ)=(B−1)⁻¹∑_{b=1}^{B}(θ̂_r(b,λ)−θ̂_r(λ))², and τ̂_r(λ)=Ω̂_r(λ)−Ŝ_r(λ).

3. Extrapolation step

For r=1,2,…,q, fit a regression model to each of the sequences {(λ,θ̂_r(λ)): λ ∈ Λ} and {(λ,τ̂_r(λ)): λ ∈ Λ}, respectively, and extrapolate it to λ=−1, with θ̂_r and τ̂_r denoting the corresponding predicted values. Then, θ̂=(θ̂₁,θ̂₂,…,θ̂_q)′ is the SIMEX estimator of θ and √τ̂_r is the associated standard error for the estimator θ̂_r (r=1,2,…,q).
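The averaging and extrapolation steps above can be sketched for a single component r. The sketch assumes the usual SIMEX variance convention (the average model-based variance minus the sample variance of the B estimates at each λ) and a quadratic extrapolant fitted by least squares; the function names are hypothetical.

```python
def simex_components(estimates, variances):
    """At a fixed lambda, average over b and return (theta_bar, tau).

    estimates[b] is the r-th component of the b-th estimate and
    variances[b] is its model-based variance.
    """
    B = len(estimates)
    theta_bar = sum(estimates) / B
    omega_bar = sum(variances) / B
    s2 = sum((e - theta_bar) ** 2 for e in estimates) / (B - 1)
    return theta_bar, omega_bar - s2

def fit_quadratic(lams, values):
    """Least-squares fit of values ~ c0 + c1*lam + c2*lam^2 via normal equations."""
    rows = [[1.0, l, l * l] for l in lams]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(3)]
    # Gaussian elimination with partial pivoting on the 3x3 system.
    for col in range(3):
        piv = max(range(col, 3), key=lambda k: abs(a[k][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for k in range(col + 1, 3):
            f = a[k][col] / a[col][col]
            for j in range(col, 3):
                a[k][j] -= f * a[col][j]
            b[k] -= f * b[col]
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        s = b[i] - sum(a[i][j] * coef[j] for j in range(i + 1, 3))
        coef[i] = s / a[i][i]
    return coef

def extrapolate(lams, values, target=-1.0):
    """Evaluate the fitted quadratic at the target value of lambda."""
    c0, c1, c2 = fit_quadratic(lams, values)
    return c0 + c1 * target + c2 * target * target

# An exactly quadratic sequence is recovered and extended to lambda = -1.
grid = [0.0, 0.5, 1.0, 1.5, 2.0]
curve = [1.0 - 0.2 * l + 0.03 * l * l for l in grid]
value_at_minus_one = extrapolate(grid, curve)
```

The same `extrapolate` call would be applied once to the sequence of averaged estimates and once to the sequence of variance components.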

The SIMEX approach is a simulation-based method that was proposed by Cook and Stefanski (1994) for parametric measurement error models. Its idea can be illustrated intuitively with simple linear regression. Suppose that the regression model is given by Y=β₀+β_x x+ϵ, where ϵ has mean 0. If x is replaced by its observed measurement W, modeled by W=x+e with e having mean 0 and variance σ², then the resulting least squares estimator β̂_x for β_x converges in probability to β_x^*=(σ_x²/(σ_x²+σ²))β_x (Fuller, 1987). Here σ_x² is the variance of x. Intuitively, if x is replaced by W+√λ σe_b, where e_b is generated from N(0,1), then the added noise raises the error variance to (1+λ)σ², and the resulting estimator β̂_x(b,λ) converges in probability to β_x^*(b,λ)=(σ_x²/(σ_x²+(1+λ)σ²))β_x. If λ=0, β̂_x(b,0) is just the naive estimator β̂_x. However, if λ=−1, then the limit β_x^*(b,−1) is identical to the true parameter β_x.
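The linear regression illustration can be tried numerically. The following is a minimal sketch, not the procedure used in the paper: it assumes σ_x² = 1, σ² = 0.25, β_x = 1, a three-point grid λ ∈ {0, 1, 2}, and a quadratic extrapolant. With these values the naive slope tends to 0.8 in large samples, and because the true extrapolation curve is rational rather than quadratic, the SIMEX value moves most of the way toward 1 without reaching it exactly.

```python
import math
import random

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def simex_slope(w, y, sigma, B, rng):
    """SIMEX slope using lambda in {0, 1, 2} and a quadratic extrapolant."""
    means = []
    for lam in (0.0, 1.0, 2.0):
        total = 0.0
        for _ in range(B):
            # Remeasured covariate W + sqrt(lambda) * sigma * e_b, e_b ~ N(0, 1).
            wb = [wi + math.sqrt(lam) * sigma * rng.gauss(0.0, 1.0) for wi in w]
            total += ols_slope(wb, y)
        means.append(total / B)
    f0, f1, f2 = means
    # Quadratic through (0, f0), (1, f1), (2, f2), evaluated at lambda = -1.
    return 3.0 * f0 - 3.0 * f1 + f2

rng = random.Random(2008)
n, beta0, beta_x, sigma = 5000, 1.0, 1.0, 0.5
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [beta0 + beta_x * xi + rng.gauss(0.0, 0.5) for xi in x]
w = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]

naive = ols_slope(w, y)  # attenuated toward zero
simex = simex_slope(w, y, sigma, B=20, rng=rng)
```

A richer grid Λ and a larger B, as used in the analyses in the paper, would replace the closed-form three-point extrapolation with a fitted regression.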

For univariate parametric models, Carroll and others (1996) established the asymptotic normality of the SIMEX estimator. However, their results cannot be applied directly here because the current development involves multiple response outcomes along with an additional process concerning the missing data indicators. If the exact extrapolation function is used in Step 3 above, we may establish the following asymptotic distribution for the SIMEX estimator θ̂. The proof is outlined in the Appendix.

THEOREM:

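Under regularity conditions, √n(θ̂−θ) is asymptotically normal with mean 0 and a covariance matrix [expression missing from the source text] that involves G_γ(γ; −1) and Q(γ), both defined in the Appendix. Hence, √n(β̂−β) has an asymptotic normal distribution with mean 0 and covariance matrix given by the upper p × p block of that matrix.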

4. AN EXAMPLE

As an illustration, we apply the proposed method to analyze the cohort 2 subset of the GAW13 (Genetic Analysis Workshops) data arising from the Framingham Heart Study. The data set consists of measurements for 1672 patients from a series of exams, with 5 assessments designed for each individual. Measurements such as height, weight, age, systolic blood pressure (SBP), and CHOL are collected at each assessment. About 24% of patients dropped out of the study.

It is of interest to study how an individual's obesity changes with age and how it is associated with SBP and CHOL. In practice, it is convenient and cost-effective to use body mass index (BMI), defined as weight (kg)/height² (m²), to estimate adiposity; BMI correlates well with more direct and invasive measures of percentage body fat (Strug and others, 2003). Here, following Yoo and others (2003), we let Y be the binary response variable indicating the obesity status of a subject, which takes value 1 if his/her maximum BMI (Max BMI) over all ages is no less than the 90th percentile of the Max BMI values observed in each replicate being analyzed and 0 otherwise. The relationship between the responses and the covariates is postulated by the logistic regression model

$$\operatorname{logit}\mu_{ij} = \beta_0 + \beta_{x1}x_{ij1} + \beta_{x2}x_{ij2} + \beta_z z_{ij},$$

where x_ij1 represents SBP, rescaled as log(SBP − 50) as in Carroll and others (2006), x_ij2 is the standardized CHOL, and z_ij is AGE for subject i at time point j, respectively.
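The variable definitions used in this example can be sketched as simple transformations. The sketch makes two assumptions that the text does not spell out: "standardized" is taken to mean centering and scaling by the sample standard deviation, and the 90th percentile is computed with the nearest-rank rule. All function names are illustrative.

```python
import math

def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by squared height (m^2)."""
    return weight_kg / height_m ** 2

def rescale_sbp(sbp):
    """log(SBP - 50), the rescaling described in the text."""
    return math.log(sbp - 50.0)

def standardize(values):
    """Center and scale to unit sample standard deviation (assumed meaning)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def obesity_indicator(max_bmis, pct=90):
    """1 if a subject's maximum BMI is at or above the pct-th percentile.

    Uses the nearest-rank percentile with integer arithmetic (an assumption).
    """
    ordered = sorted(max_bmis)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceiling of pct*n/100
    cutoff = ordered[rank - 1]
    return [1 if b >= cutoff else 0 for b in max_bmis]

# Hypothetical maximum BMI values 1, 2, ..., 10 purely to exercise the rule.
flags = obesity_indicator([float(v) for v in range(1, 11)])
```

Other percentile conventions would shift the cutoff slightly, so the rule actually used in the original analysis should be checked before reproducing the response variable.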

It is well known that both SBP and CHOL are subject to substantial measurement error. We are concerned with how measurement error in SBP and CHOL affects estimation of the parameter β = (β₀, β_x1, β_x2, β_z)′, and hence we conduct sensitivity analyses here. Let W_ij = (W_ij1, W_ij2)′ and x_ij = (x_ij1, x_ij2)′. Assume that the error model is given by (2.3) with $\Sigma_e = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$. The standard deviations σ₁ and σ₂ are specified as 0, 0.5, and 1.0 to feature scenarios with different degrees of measurement error in SBP and CHOL. Distinct values of ρ are considered to represent different strengths of correlation. The missing data process is characterized by the logistic regression model

(4.1)

Three analyses are conducted here. Analysis 1 ignores measurement error in SBP and CHOL with X_i naively replaced by W_i when using (3.1), Analysis 2 accounts for measurement error in the response model but not in the missing data model, while Analysis 3 addresses measurement error in both the response and the missing data models. In implementing the SIMEX method, we choose B = 200, M = 9, and a quadratic regression for each extrapolation step.

The analyses show that only α₄ in model (4.1) is statistically significant under the various situations considered for error model (2.3). The other coefficients, α₁, α₂, and α₃, are not statistically significant. The results suggest that the dropout rate increases as the subjects become older, and that dropout probability does not depend on the previous obesity status, SBP, or CHOL.

We conduct the analyses for ρ = 0 and ρ = 0.5. Table 1 reports the results for the case with ρ = 0. It is not surprising that the 3 analyses give very similar results when there is no measurement error in SBP and CHOL. When measurement error does exist, the estimates and associated standard errors may be considerably affected by the degree of measurement error in SBP or CHOL. If there is no error in SBP (i.e. σ₁ = 0), neither CHOL nor AGE is statistically significant, whereas SBP has a significant positive effect no matter what degree of measurement error is involved in CHOL.

Table 1.

Sensitivity analyses of the data from the Framingham Heart Study

σ₁	σ₂	Analysis	β_x₁			β_x₂			β_z
			Estimate	SE	p-value	Estimate	SE	p-value	Estimate	SE	p-value
0.00	0.00	1	2.9465	0.3103	< 0.0001	0.0904	0.0852	0.2886	− 0.0067	0.0057	0.2427
		2	2.9465	0.3119	< 0.0001	0.0904	0.0854	0.2897	− 0.0067	0.0057	0.2450
		3	2.9465	0.3103	< 0.0001	0.0904	0.0852	0.2886	− 0.0067	0.0057	0.2427
0.00	0.50	1	2.9827	0.3085	< 0.0001	0.0419	0.0721	0.5614	− 0.0060	0.0057	0.2937
		2	2.9736	0.3119	< 0.0001	0.0541	0.0871	0.5341	− 0.0061	0.0057	0.2820
		3	2.9737	0.3102	< 0.0001	0.0541	0.0868	0.5334	− 0.0061	0.0057	0.2792
0.00	1.00	1	3.0069	0.3068	< 0.0001	0.0072	0.0503	0.8859	− 0.0055	0.0057	0.3372
		2	3.0016	0.3100	< 0.0001	0.0140	0.0706	0.8434	− 0.0056	0.0057	0.3285
		3	3.0017	0.3083	< 0.0001	0.0140	0.0704	0.8426	− 0.0056	0.0057	0.3262
0.50	0.00	1	0.2828	0.0897	0.0016	0.1751	0.0797	0.0280	0.0121	0.0053	0.0232
		2	0.5050	0.1346	0.0002	0.1654	0.0802	0.0391	0.0106	0.0053	0.0455
		3	0.5051	0.1343	0.0002	0.1654	0.0799	0.0385	0.0106	0.0052	0.0441
0.50	0.50	1	0.2316	0.0968	0.0167	0.0797	0.0728	0.2737	0.0144	0.0053	0.0063
		2	0.5182	0.1335	0.0001	0.1277	0.0820	0.1194	0.0112	0.0053	0.0337
		3	0.5183	0.1332	< 0.0001	0.1276	0.0817	0.1181	0.0112	0.0052	0.0324
0.50	1.00	1	0.2599	0.1018	0.0107	0.0088	0.0538	0.8701	0.0157	0.0053	0.0030
		2	0.5331	0.1333	< 0.0001	0.0703	0.0676	0.2988	0.0123	0.0052	0.0188
		3	0.5331	0.1330	< 0.0001	0.0703	0.0674	0.2966	0.0123	0.0052	0.0180
1.00	0.00	1	0.0412	0.0454	0.3648	0.1852	0.0794	0.0196	0.0137	0.0054	0.0107
		2	0.0801	0.0693	0.2477	0.1830	0.0797	0.0216	0.0135	0.0054	0.0121
		3	0.0802	0.0692	0.2464	0.1831	0.0793	0.0210	0.0135	0.0053	0.0116
1.00	0.50	1	0.0073	0.0488	0.8809	0.1074	0.0719	0.1351	0.0156	0.0053	0.0035
		2	0.0858	0.0688	0.2128	0.1433	0.0817	0.0794	0.0142	0.0054	0.0080
		3	0.0858	0.0686	0.2114	0.1432	0.0813	0.0780	0.0142	0.0053	0.0076
1.00	1.00	1	0.0112	0.0517	0.8285	0.0414	0.0535	0.4388	0.0169	0.0053	0.0015
		2	0.0917	0.0688	0.1828	0.0811	0.0675	0.2296	0.0154	0.0053	0.0036
		3	0.0917	0.0686	0.1814	0.0811	0.0671	0.2271	0.0154	0.0053	0.0034

If there is moderate error in SBP (i.e. σ₁ = 0.5), the 3 analyses still suggest that SBP has a significant positive effect on obesity. In contrast to the case with no error in SBP, AGE is found to be statistically significant by the 3 analyses, and the evidence tends to become stronger as error in CHOL becomes more substantial. However, the significance of CHOL depends on whether or not there is error in CHOL. If there is no error in CHOL, there is moderate evidence that CHOL has a positive effect on obesity; otherwise, CHOL is not statistically significant.

When measurement error in SBP becomes more severe (i.e. σ₁ = 1.0), the 3 analyses indicate that the effect of SBP is no longer significant. Again, AGE has a positive effect, and the evidence tends to become stronger as error in CHOL increases. CHOL tends to be statistically significant if error in CHOL is absent or moderate; if the error in CHOL becomes larger, there is no evidence to support an effect of CHOL.

To save space, we do not display the results for ρ = 0.5 but just comment on the findings here. It seems that moderate correlation ρ tends to decrease the estimates for the effects of both SBP and CHOL but to increase associated standard errors, hence leading to increasing p-values. However, the impact of correlation ρ on AGE effect is different. Moderate correlation ρ tends to increase the estimates of AGE effect while maintaining very stable standard errors, thus the resulting p-values become smaller.

5. SIMULATION STUDIES

In this section, we conduct simulation studies to investigate the impact of ignoring measurement error on estimation and to compare the performance of the 3 analyses discussed in Section 4. The same configurations as those in Section 4 are used when implementing the SIMEX method.

In the following simulation study, we set n = 200 and m = 3 and generate 200 simulations for each parameter configuration. Consider the logistic regression

where z_ij takes values 0 or 1 with probability 1/2, representing that each subject is randomized to a control or treatment group. Independent of z_ij, x_ij = (x_ij1, x_ij2)′ is generated from N(μ_x, Σ_x), where μ_x = (μ_x1, μ_x2)′ and $\Sigma_x = \begin{pmatrix} \sigma_{x1}^2 & \rho_x\sigma_{x1}\sigma_{x2} \\ \rho_x\sigma_{x1}\sigma_{x2} & \sigma_{x2}^2 \end{pmatrix}$ with μ_xr = 0.5 and σ_xr = 1.0 (r = 1, 2). Set β_x1 = log(1.5), β_x2 = log(1.5), and β_z = log(0.75). The surrogate value W_ij = (W_ij1, W_ij2)′ is generated from the normal distribution N(x_ij, Σ_e) with $\Sigma_e = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$. Various configurations are considered to feature distinct scenarios of measurement error in covariate x_ij. Specifically, we consider σ₁, σ₂ = 0.15, 0.50, and 0.75 to feature minor, moderate, and severe marginal measurement errors. ρ_x and ρ are specified as 0.5 to represent the cases with moderate correlations. The missing data indicator is generated from model (4.1), where we set α₀ = α₁ = 0.5, α₂ = α₃ = 0.1, and α_z = 0.2.

In Table 2, we report the difference between the average of the estimates and the true value (Bias), the empirical standard error (SE), and the coverage rate (CR, in percent) of 95% confidence intervals. If measurement error is minor, for instance, when both σ₁ and σ₂ are 0.15, even Analysis 1 gives reasonable results, with fairly small finite-sample biases and CRs close to the nominal level of 95%. The 3 analyses provide fairly comparable results.
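The three summaries reported in Table 2 can be computed from a set of simulated estimates as in the following sketch. It assumes Wald-type intervals of the form estimate ± 1.96 × standard error for the coverage rate; the function name `summarize` and the toy inputs are hypothetical.

```python
import math

def summarize(estimates, std_errors, truth, z=1.96):
    """Return (bias, empirical SE, coverage rate in percent)."""
    n = len(estimates)
    mean = sum(estimates) / n
    bias = mean - truth
    # Empirical SE: sample standard deviation of the estimates across replicates.
    emp_se = math.sqrt(sum((e - mean) ** 2 for e in estimates) / (n - 1))
    # A replicate covers the truth if it lies within z standard errors.
    covered = sum(1 for e, s in zip(estimates, std_errors) if abs(e - truth) <= z * s)
    return bias, emp_se, 100.0 * covered / n

# Toy replicates: three intervals cover the true value 1.0 and one does not.
bias, emp_se, cr = summarize([0.9, 1.1, 1.0, 1.6], [0.1, 0.1, 0.1, 0.1], truth=1.0)
```

With 200 replicates per configuration, as in the study above, a coverage rate is resolved only to the nearest 0.5 percentage point.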

Table 2.

Simulation results

σ₁	σ₂	Analysis	β_x₁			β_x₂			β_z
			Bias	SE	CR	Bias	SE	CR	Bias	SE	CR
0.15	0.15	1	− 0.0175	0.1323	95.5	0.0000	0.1241	95.5	0.0044	0.2315	94.0
		2	− 0.0073	0.1357	96.0	0.0094	0.1277	97.5	0.0038	0.2320	94.5
		3	− 0.0073	0.1358	95.5	0.0094	0.1278	97.0	0.0038	0.2321	94.0
0.15	0.50	1	0.0223	0.1303	94.5	− 0.1030	0.1098	87.5	0.0068	0.2314	94.0
		2	0.0012	0.1366	95.0	− 0.0135	0.1389	96.0	0.0050	0.2341	94.5
		3	0.0011	0.1367	94.5	− 0.0135	0.1393	95.5	0.0053	0.2344	94.5
0.15	0.75	1	0.0579	0.1282	91.0	− 0.1839	0.0957	54.5	0.0080	0.2305	94.0
		2	0.0253	0.1365	94.0	− 0.0728	0.1376	91.5	0.0061	0.2343	94.0
		3	0.0252	0.1365	94.0	− 0.0727	0.1381	90.5	0.0065	0.2347	94.0
0.50	0.15	1	− 0.1199	0.1175	79.0	0.0389	0.1233	97.5	0.0093	0.2316	94.5
		2	− 0.0327	0.1472	95.5	0.0179	0.1301	97.0	0.0077	0.2338	94.5
		3	− 0.0326	0.1475	95.0	0.0178	0.1307	97.0	0.0079	0.2342	94.5
0.50	0.50	1	− 0.0970	0.1180	83.0	− 0.0798	0.1113	89.5	0.0129	0.2314	94.0
		2	− 0.0259	0.1458	95.5	− 0.0068	0.1386	96.0	0.0094	0.2364	94.5
		3	− 0.0258	0.1463	95.0	− 0.0067	0.1394	95.0	0.0098	0.2370	94.5
0.50	0.75	1	− 0.0641	0.1173	92.5	− 0.1754	0.0982	58.5	0.0148	0.2303	93.5
		2	− 0.0066	0.1451	96.5	− 0.0683	0.1387	91.0	0.0109	0.2369	94.5
		3	− 0.0067	0.1456	96.0	− 0.0681	0.1395	90.0	0.0114	0.2375	94.5
0.75	0.15	1	− 0.1976	0.1028	48.5	0.0730	0.1219	93.0	0.0118	0.2314	94.5
		2	− 0.0908	0.1458	84.5	0.0411	0.1312	96.5	0.0107	0.2343	94.5
		3	− 0.0906	0.1461	84.5	0.0410	0.1320	96.5	0.0111	0.2348	94.5
0.75	0.50	1	− 0.1899	0.1041	54.5	− 0.0478	0.1109	93.5	0.0161	0.2311	94.0
		2	− 0.0865	0.1454	86.0	0.0117	0.1386	98.0	0.0127	0.2370	94.5
		3	− 0.0864	0.1459	85.5	0.0118	0.1396	96.5	0.0133	0.2377	94.5
0.75	0.75	1	− 0.1637	0.1041	63.0	− 0.1498	0.0985	68.5	0.0184	0.2300	93.5
		2	− 0.0698	0.1451	89.0	− 0.0521	0.1389	92.5	0.0144	0.2376	94.0
		3	− 0.0697	0.1456	88.5	− 0.0518	0.1399	92.5	0.0152	0.2384	93.5

When there is moderate or substantial measurement error in covariates x_ij, the performance of Analysis 1 deteriorates markedly in estimating the effects of the error-prone covariates. Analysis 1 may lead to considerably biased estimates for β_x1 and β_x2. For example, for the entries with σ₁ = 0.75 and σ₂ = 0.15 in Table 2, the CR of the 95% confidence intervals for β_x1 is as low as 48.5%. By accounting for measurement error in the response model, both Analyses 2 and 3 improve the performance markedly, yielding much smaller biases and much higher CRs for the 95% confidence intervals. Analysis 2 gives results very comparable to those produced by Analysis 3, though Analysis 2 seems to yield slightly larger finite-sample biases. The simulation study considered here suggests that the impact of ignoring measurement error in modeling the missing data process is not as pronounced as that in modeling the response process.

In terms of estimation of β_z, Analysis 1 produces larger biases than Analyses 2 and 3 do, though the magnitude is not as striking as that for the estimates of β_x. Among the 3 analyses, Analysis 1 provides the smallest standard errors while Analysis 3 yields the largest but the differences between Analyses 2 and 3 are not considerable. The CRs for the 95% confidence intervals obtained from the 3 analyses agree reasonably well with the nominal value.

In summary, ignoring measurement error may lead to substantially biased results, so properly addressing covariate measurement error in estimation procedures is necessary. The proposed method (i.e. Analysis 3) performs reasonably well under various configurations. Its performance may become less satisfactory when measurement error becomes substantial. However, the proposed SIMEX method does improve considerably on the naive analysis (i.e. Analysis 1).

6. DISCUSSION

In this paper, we propose a simulation-based marginal method to analyze longitudinal data with both missing observations and error-contaminated covariates. This work is of particular interest because missingness and measurement error in covariates arise commonly in longitudinal studies, and to date, there is little work that addresses both features (Liu and Wu, 2007). Yi (2005) discussed inference approaches to handle continuous or count data arising from longitudinal studies, but those methods cannot be applied to binary responses due to the nature of the logistic regression. The proposed method may, however, handle binary responses, in addition to continuous responses or count data. Moreover, in contrast to the models of Yi (2005), where only precisely observed covariates may enter model (2.2), the proposed method allows the missing data process to depend on error-prone covariates. The proposed method is simple but flexible. Its implementation is straightforward and requires only slight modification of standard statistical software such as PROC GENMOD in SAS. The proposed method does not require the complete specification of the full distribution of the response process but only requires the specification of the structures of marginal means and variances. Also, the method does not require modeling the underlying covariate process, which is desirable for many practical problems.

The proposed methods may apply to handle clustered or correlated data as well. In some situations, the interest may also concern the association strength among response components within clusters. We may, following the lines of Yi and Cook (2002), construct a second set of estimating equations for association parameters. In that formulation, proper adjustments should be introduced to account for biases induced by both missing observations and measurement error in covariates.

In this paper, we focus the discussion on the IPWGEE method, for which an MAR missing data mechanism is assumed. One may, however, employ other modeling frameworks such as random-effects models to accommodate NMAR mechanisms as well. Without considering missing observations, Wang and others (1998) studied random-effects models to account for measurement error in covariates. It would be interesting to develop methods to simultaneously adjust for the biases resulting from missingness and measurement error in this context.

When modeling the missing data process, we consider the case that the true but error-prone covariates X_i enter the model to govern the missingness probability. In some instances, it could be more feasible to facilitate the dependence of dropout on the observed covariates W_i. In this case, the proposed method can apply with a minor modification. See Carroll and others (2006, Chapter 2, Section 11.8) for general discussion on the issue of building a model by conditioning on the true underlying covariates or the observed data.

As seen in Section 4, there is no additional information, such as repeated measurements of SBP and CHOL, available to estimate the variance parameters σ₁ and σ₂. We therefore undertake sensitivity analyses by specifying a sequence of values of σ₁ and σ₂ to assess the impact of measurement error on estimation of the response parameters β. Sometimes, there exists additional information on the measurement error process and the associated parameters may be estimated. In these circumstances, we need to accommodate the additional variation induced by estimating the error parameters. With replicate measurements W_i available, for example, we may modify the proposed method by adapting the arguments in Devanarayan and Stefanski (2002) to accommodate measurement error models with unknown variance parameters.

FUNDING

Natural Sciences and Engineering Research Council of Canada.

Acknowledgments

The author acknowledges referees' helpful comments. The author thanks Boston University and the National Heart, Lung, and Blood Institute (NHLBI) for providing the data set from the Framingham Heart Study (No. N01-HC-25195) in the illustration. The Framingham Heart Study is conducted and supported by the NHLBI in collaboration with Boston University. This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or NHLBI. Conflict of Interest: None declared.

APPENDIX

Adapting the arguments in Carroll and others (1996), we outline the proof of the Theorem as follows. Let U_i(θ; b, λ), S_i(α; b, λ), and H_i(θ; b, λ) be U_i(θ), S_i(α), and H_i(θ), respectively, with x_ij replaced by W_ij(b, λ). By standard estimating equation theory, under some regularity conditions, θ̂(b, λ) → θ(λ) in probability as B → ∞, where θ(λ) is the solution of E[H_i(θ; 1, λ)] = 0.

Let Inline graphic . For each given b and λ, the Taylor Series expansion leads to

therefore, for very large B,

Let Inline graphic be qM × 1 vectors. Let . Then, by the central limit theorem, as n → ∞,

where Inline graphic .

Assume that the exact extrapolation functions, say, G(γ; λ), are available in the extrapolation step to fit Inline graphic , where γ is a vector of parameters of dimension d, say. Fit to . Define . Let be the d × qM matrix and be a d × d matrix. Then, by the similar argument to that in Carroll and others (1996), we obtain, as n → ∞,

where Inline graphic . Letting λ = −1 leads to the SIMEX estimator . Therefore, the asymptotic distribution of the SIMEX estimator is

References

Carroll RJ, Küchenhoff H, Lombard F, Stefanski LA. Asymptotics for the SIMEX estimator in nonlinear measurement error models. Journal of the American Statistical Association. 1996;91:242–250.
Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models. 2nd edition. Boca Raton (FL): Chapman & Hall; 2006.
Cook J, Stefanski LA. A simulation extrapolation method for parametric measurement error models. Journal of the American Statistical Association. 1994;89:464–467.
Cook RJ, Zeng L, Yi GY. Marginal analysis of incomplete longitudinal binary data: a cautionary note on LOCF imputation. Biometrics. 2004;60:820–828. doi: 10.1111/j.0006-341X.2004.00234.x.
Devanarayan V, Stefanski LA. Empirical simulation extrapolation for measurement error models with replicate measurements. Statistics and Probability Letters. 2002;59:219–225.
Diggle P, Heagerty P, Liang K-Y, Zeger S. Analysis of Longitudinal Data. 2nd edition. New York: Oxford University Press; 2002.
Diggle P, Kenward MG. Informative drop-out in longitudinal data analysis (with discussion). Applied Statistics. 1994;43:49–93.
Fuller WA. Measurement Error Models. New York: Wiley; 1987.
Kenward MG. Selection models for repeated measurements with nonrandom dropout: an illustration of sensitivity. Statistics in Medicine. 1998;17:2723–2732. doi: 10.1002/(sici)1097-0258(19981215)17:23<2723::aid-sim38>3.0.co;2-5.
Li Y, Lin X. Functional inference in frailty measurement error models for clustered survival data using the SIMEX approach. Journal of the American Statistical Association. 2003;98:191–203.
Liu W, Wu L. Simultaneous inference for semiparametric nonlinear mixed-effects models with covariate measurement errors and missing responses. Biometrics. 2007;63:342–350. doi: 10.1111/j.1541-0420.2006.00687.x.
Miglioretti DL, Heagerty PJ. Marginal modeling of multilevel binary data with time-varying covariates. Biostatistics. 2004;5:381–398. doi: 10.1093/biostatistics/5.3.381.
Pepe MS, Anderson GL. A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics - Simulation and Computation. 1994;23:939–951.
Prentice RL. Covariate measurement errors and parameter estimation in a failure time regression model. Biometrika. 1982;69:331–342.
Robins JM, Rotnitzky A, Zhao LP. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association. 1995;90:106–121.
Strug L, Sun L, Corey M. The genetics of cross-sectional and longitudinal body mass index. BMC Genetics. 2003;4(Suppl 1):S14. doi: 10.1186/1471-2156-4-S1-S14.
Wang N, Lin X, Gutierrez RG, Carroll RJ. Bias analysis and SIMEX approach in generalized linear mixed measurement error models. Journal of the American Statistical Association. 1998;93:249–261.
Yi GY. Robust methods for incomplete longitudinal data with mismeasured covariates. Far East Journal of Theoretical Statistics. 2005;16:205–234.
Yi GY, Cook RJ. Marginal methods for incomplete longitudinal data arising in clusters. Journal of the American Statistical Association. 2002;97:1071–1080.
Yi GY, He W. Methods for bivariate survival data with mismeasured covariates under an accelerated failure time model. Communications in Statistics - Theory and Methods. 2006;35:1539–1554.
Yi GY, Lawless JF. A corrected likelihood method for the proportional hazards model with covariates subject to measurement error. Journal of Statistical Planning and Inference. 2007;137:1816–1828.
Yi GY, Thompson ME. Marginal and association regression models for longitudinal binary data with drop-outs: a likelihood-based approach. The Canadian Journal of Statistics. 2005;33:3–20.
Yoo YJ, Huo Y, Ning Y, Gordon D, Finch S, Mendell NR. Power of maximum HLOD tests to detect linkage to obesity genes. BMC Genetics. 2003;4(Suppl 1):S16. doi: 10.1186/1471-2156-4-S1-S16.
