Author manuscript; available in PMC: 2017 May 10.
Published in final edited form as: Stat Med. 2015 Nov 22;35(10):1676–1688. doi: 10.1002/sim.6812

Methods to Adjust for Misclassification in the Quantiles for the Generalized Linear Model with Measurement Error in Continuous Exposures

Ching-Yun Wang 1, Jean De Dieu Tapsoba 1, Catherine Duggan 1, Kristin L Campbell 2, Anne McTiernan 1
PMCID: PMC4826813  NIHMSID: NIHMS736504  PMID: 26593772

SUMMARY

In many biomedical studies, covariates of interest may be measured with error. In regression analysis, however, the quantiles of the exposure variable are often used as the covariates. Because of measurement error in the continuous exposure variable, its quantiles may be misclassified. Misclassification in the quantiles can lead to biased estimation of the association between the exposure variable and the outcome variable. Adjusting for misclassification is challenging when gold standard variables are not available. In this paper, we develop two regression calibration estimators to reduce bias in effect estimation. The first estimator is normal likelihood-based. The second estimator is linearization-based, and it provides a simple and practical correction. Finite sample performance is examined via a simulation study. We apply the methods to a 4-arm randomized clinical trial that tested exercise and weight loss interventions in women aged 50–75 years.

Keywords: Measurement error, Misclassification, Regression calibration

1 Introduction

Covariate measurement error occurs in many research studies when the covariate variable is continuous. Covariate measurement error can result in misclassification if the exposure variable is analyzed as quantiles. One motivating example of our methodology research is covariate measurement error associated with measurement of aerobic fitness in the Nutrition and Exercise in Women (NEW) study, conducted from 2005 to 2009. The NEW study is a 12-month randomized, controlled trial using a 4-arm design to compare the effect of three lifestyle-change interventions (dietary weight loss, moderate-to-vigorous intensity aerobic exercise, or both interventions combined) versus control (no lifestyle change) [1][2]. Ancillary studies have evaluated the effects of physical activity on disease-related biomarkers, such as C-reactive protein (CRP); see Imayama, et al. [3]. VO2max is a measure of aerobic fitness or exercise capacity through determination of the maximum rate of oxygen consumption as measured during incremental exercise, typically on a motorized treadmill. Inflammation markers have been shown to be inversely associated with VO2max (Kullo et al. [4]). In regression analysis, the quantiles of VO2max are often used as a covariate. Because of biological variation and factors related to measurement of VO2max (e.g., individual’s effort, variations in testing protocols), VO2max may be associated with errors, and hence the quantiles for VO2max may be misclassified. Misclassification of exposure can cause bias; see Gustafson [5] for a review.

When there exists a subset that contains all covariates, measured without error, then methods for missing covariates generally can be applied to the misclassification in covariate problem. This subset is often called the complete-case set, or the validation set; see for example Qi et al. [6] and Natarajan [7]. Many important methods for covariate measurement errors have been well reviewed by Carroll et al. [8] and Buonacorsi [9]. However, overall there are considerably fewer methods for correcting for covariate misclassification when a validation set does not exist. A significant challenge in this situation is that the misclassification probabilities could not be identified without additional information. Other than the availability of a validation set, the misclassification probabilities could be identified if multiple surrogate variables are available for the true unobserved misclassified covariates; see for example Kosinski and Flanders [10], Huang and Bandeen-Roche [11], and Wang and Song [12]. In general, multiple surrogate variables are required to be conditionally independent given the latent covariates.

Our main goal in this paper is to develop methods to correct for measurement error in a continuous exposure variable when its quantiles are used as the covariates. Flegal et al. [13] observed that misclassification from the dichotomization of a continuous error-prone exposure variable is differential; that is, the misclassification probability of the error-prone binary variable given the true binary variable is not independent of the outcome variable. Under the assumption that the outcome variable follows a linear regression on the true unobserved continuous exposure, Gustafson and Le [14] developed expressions for the effect of misclassification from the dichotomization of a continuous error-prone exposure variable. However, it is restrictive to assume that the outcome variable follows either a linear or a logistic regression on the true unobserved continuous exposure. In some applications the outcome variable may not be linearly associated with the underlying unobserved continuous exposure variable. For example, among women with breast cancer, Duggan et al. [15] found that the risk of all-cause mortality is higher among women in the first and fifth quintiles of insulin-like growth factor.

In this paper we propose two regression calibration estimators for misclassification in the quantiles due to measurement error in the underlying continuous exposure variable. In the development of our methodology, we do not need to assume a validation subset with gold standard measurements for the exposure variable, as is assumed in Natarajan [7], Lyles [16], and Kuchenhoff et al. [17]. Furthermore, we do not need to assume that the outcome variable follows a linear regression (Gustafson and Le [14] and Natarajan [7]) or a logistic regression (Dalen et al. [18]) on the true unobserved continuous exposure variable. In our main analysis, the outcome variable follows a generalized linear model in the true unobserved quantiles (such as median-split binary covariates, quartiles, or quintiles), but the outcome variable may not have a linear association with the unobserved continuous exposure variable. Section 2 describes the regression models for our problem. A likelihood-based approach is described. However, the likelihood-based approach is in general not practical because it requires additional model assumptions and often involves technical computations. This leads us to the development of a normality-based regression calibration (NRC) estimator (Section 3). The NRC estimator is consistent under a normality assumption when the primary model is linear. An estimator for a vector of parameters of interest (say, β) is consistent if it converges to the true β as the sample size increases to infinity. To understand whether the effect of misclassification results in attenuation bias under some situations, we further develop a linearization-based regression calibration (LRC) estimator in Section 4. By attenuation bias, we mean that the effect of the categorical covariate is under-estimated (or biased towards 0) when the misclassified covariate is treated as the true covariate. The LRC estimator can be treated as a simple replacement estimator for NRC. Although it does not perform as well as NRC in general, the LRC estimator has a simple correction-for-attenuation interpretation. Finite sample performance of the proposed NRC and LRC estimators is investigated in Section 5. We apply the methods to real data in Section 6. Some concluding remarks are given in Section 7.

2 Statistical Models

Assume that the study cohort consists of $n$ subjects. For $i = 1, \ldots, n$, let $X_i$ be a continuous exposure variable that cannot be precisely measured, such as physical activity data. In the analysis, its quantiles are often used as the covariate of interest. Given that a variable split by quantiles can be represented by binary variables in regression analysis, it is generally sufficient for us to first consider a dichotomous variable as a covariate in regression. In the Discussion section, we will describe the details to adjust for misclassification in the quantiles (such as quartiles or quintiles). Let $X_i^*$ be a dichotomous variable of $X_i$ with a cutoff point $x_0$ such that $X_i^* = I[X_i \ge x_0]$, where $I[\cdot]$ is an indicator function. For example, if $x_0$ is the median of $X_i$ then $X_i^*$ is a median-split binary covariate. In the development of the method, we assume $X_i^*$ is the covariate of interest. Let $\mu_x$ be the mean of $X_i$, and $\sigma_x^2$ be the variance of $X_i$. We will need to estimate $\mu_x$ and $\sigma_x^2$ in the development of the proposed methods to be described later.

We assume that a surrogate variable $W$ for $X$ is available and that it follows the classical additive measurement error model

$$W_{ij} = X_i + U_{ij}, \quad \text{where } E(U_{ij} \mid X_i) = 0,$$

for $i = 1, \ldots, n$, $j = 1, \ldots, k_i$, in which $U_{ij}$ is the unobserved error. The use of replicates for $W$ is for the purpose of identifying the measurement error variance ($\sigma_u^2$). The model given above includes the situation when a reliability sample for $W$ is available. A reliability sample with $k_i = 2$ is often conducted in a study to assess the within-subject variation of $W$. For example, in the Women’s Health Initiative (WHI), investigators were interested in the association between sedentary activity and mortality [19], and the data from a subset of WHI participants were used for a reliability analysis of physical activity data. Let $W_{ij}^*$ be binary such that $W_{ij}^* = 1$ if $W_{ij} \ge x_0$ and 0 otherwise (for $j = 1, \ldots, k_i$). In the data analysis without an adjustment, $W^*$ is used as a binary covariate. Let $Z$ be the vector of other observed covariates (other than $W$) such as age and gender. The outcome variable $Y_i$, $i = 1, \ldots, n$, follows a generalized linear model such that

$$E(Y_i \mid X_i^*, Z_i) = g(\beta_0 + \beta_1 X_i^* + \beta_2 Z_i), \tag{1}$$

where $g(\cdot)$ is a specified function, and our goal is to estimate $\beta \equiv (\beta_0, \beta_1, \beta_2)$. For example, $g(u) = u$ in linear regression, $g(u) = \exp(u)$ in Poisson regression, while $g(u) = (1 + e^{-u})^{-1}$ in logistic regression.
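The role of the replicates in identifying $\sigma_u^2$ can be illustrated with a small simulation. The sketch below is not the estimating procedure of the Appendix; it is a standard method-of-moments calculation under the additional assumptions that every subject has exactly two replicates and that the errors are independent with equal variance. The function name `estimate_nuisance` and the simulated values are hypothetical.

```python
import random
import statistics

def estimate_nuisance(w1, w2):
    """Method-of-moments estimates of (mu_x, sigma_x^2, sigma_u^2) from two
    replicates per subject under W_ij = X_i + U_ij (illustrative assumption:
    independent errors with a common variance)."""
    # W_i1 - W_i2 = U_i1 - U_i2, so its variance is 2 * sigma_u^2.
    diffs = [a - b for a, b in zip(w1, w2)]
    sigma_u2 = statistics.variance(diffs) / 2.0
    # The subject-level mean has variance sigma_x^2 + sigma_u^2 / 2.
    wbar = [(a + b) / 2.0 for a, b in zip(w1, w2)]
    mu_x = statistics.fmean(wbar)
    sigma_x2 = statistics.variance(wbar) - sigma_u2 / 2.0
    return mu_x, sigma_x2, sigma_u2

# Simulated data: X ~ N(0, 1), errors with standard deviation 0.5.
rng = random.Random(2015)
x = [rng.gauss(0.0, 1.0) for _ in range(20000)]
w1 = [xi + rng.gauss(0.0, 0.5) for xi in x]
w2 = [xi + rng.gauss(0.0, 0.5) for xi in x]
mu_x_hat, sigma_x2_hat, sigma_u2_hat = estimate_nuisance(w1, w2)
print(mu_x_hat, sigma_x2_hat, sigma_u2_hat)
```

With a reliability subsample only, the same differences would be computed on the subjects who have two measurements, while the mean and total variance could use all subjects.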

The problem of interest is a combination of measurement error and misclassification. It is a measurement error problem since $X$ is not available while its surrogate $W$ is available. It is a misclassification problem since the true covariate $X^*$ is not available while its surrogate $W^*$ is available. Although the problem is a mixture of measurement error and misclassification, it has not been well addressed in the literature (Gustafson [5]; Carroll et al. [8]; Buonacorsi [9]). To address this issue, we first consider an approach using estimating equations. It is easily seen that if $X^*$ were known, then an estimating equation for $\beta$ is $\sum_{i=1}^{n} (1, X_i^*, Z_i)^T \{Y_i - g(\beta_0 + \beta_1 X_i^* + \beta_2 Z_i)\} = 0$. This estimating equation is normally called the full data estimating equation; that is, the estimating equation for the case when all the data are observed without measurement error. Hence when $X^*$ is not available, a consistent estimating approach is to take the conditional expectation of the full data estimating equation given the observed data (Wang et al. [20], Wu et al. [21], Yi et al. [22]). For simplicity of notation, unless otherwise stated $W_i$ denotes $(W_{i1}, \ldots, W_{ik_i})$. The likelihood score-based expected estimating equation (EEE) for estimating $\beta$ can be written as

$$\sum_{i=1}^{n} E\left[ \begin{pmatrix} 1 \\ X_i^* \\ Z_i \end{pmatrix} \{Y_i - g(\beta_0 + \beta_1 X_i^* + \beta_2 Z_i)\} \,\middle|\, Y_i, W_i, Z_i \right] = 0. \tag{2}$$

The conditional expectation given in (2) can be calculated by assuming the distribution of $X$ and measurement error $U$. For example, let $f(V)$ denote the probability density function of any random variable $V$, then $f(Y \mid W, Z) = \sum_{x^*=0}^{1} f(Y \mid x^*, W, Z) f(x^* \mid W, Z)$ and

$$f(X^* = 1 \mid W, Z) = \frac{\int_x I[x \ge x_0]\, f(W \mid x)\, f(x, Z)\, dx}{\int_x f(W \mid x)\, f(x, Z)\, dx}.$$

3 Normality-based Regression Calibration Method

The EEE approach discussed in the previous section would require calculation via the use of likelihood functions. In this section, we will propose a new NRC estimator. The NRC estimator is a replacement estimator that replaces an unobserved X* by a quantity that can be obtained by simple calculation.

3.1 NRC for Linear Regression

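In this subsection, we consider linear regression such that $E(Y_i \mid X_i^*, Z_i) = \beta_0 + \beta_1 X_i^* + \beta_2 Z_i$. Under this model, it is seen that $E(Y_i \mid X_i, Z_i) = \beta_0 + \beta_1 I[X_i \ge x_0] + \beta_2 Z_i$. However, like $X_i^*$, $X_i$ is not observed and hence the mean model will need to be further developed. By simple calculation, we have $E(Y_i \mid \bar W_i, Z_i) = \beta_0 + \beta_1\, \mathrm{pr}(X_i \ge x_0 \mid \bar W_i, Z_i) + \beta_2 Z_i$, where $\bar W_i = \sum_{j=1}^{k_i} W_{ij}/k_i$. From this association, the regression coefficients can be consistently estimated if we replace the unobserved $X_i^*$ by $\mathrm{pr}(X_i \ge x_0 \mid \bar W_i, Z_i)$. The conditional probability in general can be well estimated under a multivariate normality assumption of $(X_i, \bar W_i, Z_i)$. When $(X_i, \bar W_i, Z_i)$ is multivariate-normal, it is easily seen that

$$
\begin{aligned}
\mathrm{pr}(X_i \ge x_0 \mid \bar W_i, Z_i)
&= \mathrm{pr}\left( \frac{X_i - E(X_i \mid \bar W_i, Z_i)}{\{\mathrm{var}(X_i \mid \bar W_i, Z_i)\}^{1/2}} \ge \frac{x_0 - E(X_i \mid \bar W_i, Z_i)}{\{\mathrm{var}(X_i \mid \bar W_i, Z_i)\}^{1/2}} \,\middle|\, \bar W_i, Z_i \right) \\
&= \Phi\left( \frac{ (\sigma_x^2, \Sigma_{xz}) \begin{pmatrix} \sigma_x^2 + (\sigma_u^2/k_i) & \Sigma_{xz} \\ \Sigma_{xz}^T & \Sigma_z \end{pmatrix}^{-1} \begin{pmatrix} \bar W_i - \mu_x \\ Z_i - \mu_z \end{pmatrix} - (x_0 - \mu_x) }{ \left\{ \sigma_x^2 - (\sigma_x^2, \Sigma_{xz}) \begin{pmatrix} \sigma_x^2 + (\sigma_u^2/k_i) & \Sigma_{xz} \\ \Sigma_{xz}^T & \Sigma_z \end{pmatrix}^{-1} \begin{pmatrix} \sigma_x^2 \\ \Sigma_{xz}^T \end{pmatrix} \right\}^{1/2} } \right),
\end{aligned}
\tag{3}
$$

where $\Phi(\cdot)$ is the cumulative distribution function of a standard normal distribution, $\mu_z$ is the mean of $Z$, $\Sigma_{xz} = \mathrm{cov}(X_i, Z_i)$ and $\Sigma_z = \mathrm{cov}(Z_i)$. The NRC estimator is to replace $X^*$ by the conditional probability given above in (3). The (nuisance) parameters involved in (3) can be consistently estimated since replicates for $W$ are available (see the Appendix). Let $\hat X_i$ be the conditional probability $\mathrm{pr}(X \ge x_0 \mid \bar W, Z)$ in (3), which is often called the calibration function. When there is no $Z$ or when $X$ and $Z$ are independent, the replacement variable (3) for the unobserved $X_i^*$ can be written as

$$\hat X_i = \Phi\left( \frac{\sigma_x (\bar W_i - \mu_x)}{(\sigma_u/\sqrt{k_i}) \sqrt{\sigma_x^2 + (\sigma_u^2/k_i)}} + \frac{\sqrt{\sigma_x^2 + (\sigma_u^2/k_i)}\, (\mu_x - x_0)}{\sigma_x (\sigma_u/\sqrt{k_i})} \right). \tag{4}$$

However, the expression still does not provide a simple correction for attenuation in our problem. That is, from the development of this section, we cannot see directly a method to correct the bias from the naive estimator that uses covariates $(\bar W_i^*, Z_i)$. The argument given above has shown that when $Y$ given $(X^*, Z)$ is linear, a consistent estimator for $\beta$ can be obtained by using covariates $(\hat X_i, Z_i)$. In the Appendix, we provide details on how to obtain the standard error of the NRC estimator. The idea is based on stacking the estimating equations for $\beta$ and the nuisance parameters discussed above, and then the asymptotic variance can be obtained by a sandwich estimator. The asymptotic variance takes into consideration the uncertainty due to the replacement of $X_i^*$ with $\hat X_i$.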
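As an illustration of the calibration function (4), the following sketch evaluates it for a single subject and compares it with a direct numerical evaluation of the ratio of integrals given at the end of Section 2 (without $Z$). It assumes normal $X$ and normal error with known nuisance parameters; the function names, the integration range, and the input values are hypothetical choices made for this example.

```python
import math
from statistics import NormalDist

def nrc_calibration(wbar, k, x0, mu_x, sigma_x, sigma_u):
    """Calibration function (4): pr(X >= x0 | Wbar) for normal X and U, no Z."""
    s_err = sigma_u / math.sqrt(k)             # SD of the averaged error
    s_tot = math.sqrt(sigma_x**2 + s_err**2)   # SD of Wbar
    arg = (sigma_x * (wbar - mu_x) / (s_err * s_tot)
           + s_tot * (mu_x - x0) / (sigma_x * s_err))
    return NormalDist().cdf(arg)

def normal_pdf(v, mean, sd):
    return math.exp(-0.5 * ((v - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def trapezoid(f, a, b, m=4000):
    """Composite trapezoidal rule with m equal subintervals."""
    h = (b - a) / m
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + j * h) for j in range(1, m)))

def nrc_by_integration(wbar, k, x0, mu_x, sigma_x, sigma_u):
    """Same probability as a ratio of integrals of f(Wbar | x) f(x)."""
    s_err = sigma_u / math.sqrt(k)
    joint = lambda x: normal_pdf(wbar, x, s_err) * normal_pdf(x, mu_x, sigma_x)
    lo, hi = mu_x - 8.0 * sigma_x, mu_x + 8.0 * sigma_x  # truncation is an approximation
    return trapezoid(joint, x0, hi) / trapezoid(joint, lo, hi)

closed_form = nrc_calibration(0.3, 2, 0.0, 0.0, 1.0, 0.5)
numeric = nrc_by_integration(0.3, 2, 0.0, 0.0, 1.0, 0.5)
print(closed_form, numeric)
```

In practice the nuisance parameters would be replaced by their estimates, and the standard error would need to account for that substitution as described in the text.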

3.2 NRC for Logistic Regression

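In this subsection, we consider logistic regression such that $E(Y_i \mid X_i^*, Z_i) = H(\beta_0 + \beta_1 X_i^* + \beta_2 Z_i)$, where $H(v) = \{1 + \exp(-v)\}^{-1}$. The mean function in logistic regression is not linear. We note that by a Taylor expansion,

$$
\begin{aligned}
E\{H(\beta_0 + \beta_1 X_i^* + \beta_2 Z_i) \mid \bar W_i, Z_i\}
&\approx H\{\beta_0 + \beta_1 E(X_i^* \mid \bar W_i, Z_i) + \beta_2 Z_i\} \\
&\quad + 0.5\, \beta_1^2\, H^{(2)}\{\beta_0 + \beta_1 E(X_i^* \mid \bar W_i, Z_i) + \beta_2 Z_i\}\, E[\{X_i^* - E(X_i^* \mid \bar W_i, Z_i)\}^2 \mid \bar W_i, Z_i],
\end{aligned}
$$

where $H^{(2)}(v) = (\partial^2/\partial v^2) H(v)$. If the second term of the above equation is relatively small, then we have the following approximation: $E(Y_i \mid \bar W_i, Z_i) \approx H\{\beta_0 + \beta_1 E(X_i^* \mid \bar W_i, Z_i) + \beta_2 Z_i\}$. Hence, following a similar approximation for $\beta_0 + \beta_1 E(X_i^* \mid \bar W_i, Z_i)$ and the approximation given in the previous subsection, we have that

$$E(Y_i \mid \bar W_i, Z_i) \approx H\{\beta_0 + \beta_1 \hat X_i + \beta_2 Z_i\},$$

where $\hat X_i$ is from (3) if the calibration function is a function of both $\bar W_i$ and $Z_i$, or from (4) if the calibration function is independent of $Z_i$.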
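To give a sense of how small the neglected second-order term can be, the sketch below compares the exact conditional mean with the first-order approximation in the case without $Z$. Because $X^*$ is binary, the exact mean is a two-point mixture. The grid of calibration values and the coefficients ($\beta_0 = 0$, $\beta_1 = \ln 3$, as in the logistic simulations of Section 5) are illustrative choices, and the function names are hypothetical.

```python
import math

def expit(v):
    """Logistic mean function H(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

def exact_mean(p, beta0, beta1):
    # X* is binary with pr(X* = 1 | Wbar) = p, so the conditional mean of
    # H(beta0 + beta1 X*) is a mixture of H at X* = 0 and at X* = 1.
    return (1.0 - p) * expit(beta0) + p * expit(beta0 + beta1)

def rc_mean(p, beta0, beta1):
    # Regression calibration approximation: plug p into the linear predictor.
    return expit(beta0 + beta1 * p)

beta0, beta1 = 0.0, math.log(3.0)
gaps = [abs(exact_mean(j / 100, beta0, beta1) - rc_mean(j / 100, beta0, beta1))
        for j in range(101)]
max_gap = max(gaps)
print(max_gap)
```

For these coefficients the largest gap on the probability scale stays below 0.01; it would grow with larger $|\beta_1|$, which is consistent with the condition that the second term be relatively small.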

4 Linearization-based Regression Calibration Method

The proposed NRC approach discussed in the previous section has nice performance in general, and is consistent under a multivariate normality assumption. In this section, we aim at developing a new approximation approach that can provide an interpretation of the correction for attenuation from the naive estimator.

4.1 LRC for Linear Regression

We first consider linear regression in this subsection such that $E(Y \mid X^*, Z) = \beta_0 + \beta_1 X^* + \beta_2 Z$. Under this model, we have that $E(Y \mid X, Z) = \beta_0 + \beta_1 I[X \ge x_0] + \beta_2 Z$. For any fixed $Z$, we let $\beta_0 + \beta_2 Z$ be denoted by $\gamma_0$. The idea of the approximation is to approximate the step function $Y = \gamma_0 + \beta_1 I[X \ge x_0]$ by a straight line in $X$. Figure 1 provides a visual demonstration of the linearization. The idea is that among the points with $X < x_0$, most points are in the range $(\mu_x - 2\sigma_x, x_0)$. Hence, the straight line to approximate the step function would pass through $((x_0 + \mu_x - 2\sigma_x)/2, \gamma_0)$. Similarly, among the points with $X \ge x_0$, most points are in the range $(x_0, \mu_x + 2\sigma_x)$. Therefore, the straight line to approximate the step function would also pass through $((x_0 + \mu_x + 2\sigma_x)/2, \gamma_0 + \beta_1)$. This indicates that $E(Y \mid X, Z) \approx \theta_0 + \{\beta_1/(2\sigma_x)\} X + \beta_2 Z$, where $\theta_0 = \beta_0 + (\beta_1/2) - (x_0 + \mu_x)\beta_1/(4\sigma_x)$ is a new intercept. Note that because $X_i$ is not observed, the above model cannot be directly applied to data analysis. From this approximation, we can further express the conditional expectation of $Y$ given the observed $\bar W$, $Z$ as:

$$E(Y \mid \bar W, Z) \approx \theta_0 + \{\beta_1/(2\sigma_x)\}\, E(X \mid \bar W, Z) + \beta_2 Z. \tag{5}$$

Figure 1. Approximation of a step function with a jump at $x_0$ in the range of $(\mu_x - 2\sigma_x, \mu_x + 2\sigma_x)$.
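For readers who want the intermediate step, the slope and intercept of the approximating line follow from the two points it passes through. The short derivation below only restates that calculation, with $\gamma_0 = \beta_0 + \beta_2 Z$ as defined in this subsection.

```latex
\[
\text{slope}
= \frac{(\gamma_0 + \beta_1) - \gamma_0}
       {\tfrac{1}{2}(x_0 + \mu_x + 2\sigma_x) - \tfrac{1}{2}(x_0 + \mu_x - 2\sigma_x)}
= \frac{\beta_1}{2\sigma_x},
\]
\[
\text{intercept}
= \gamma_0 - \frac{\beta_1}{2\sigma_x}\cdot\frac{x_0 + \mu_x - 2\sigma_x}{2}
= \gamma_0 + \frac{\beta_1}{2} - \frac{(x_0 + \mu_x)\beta_1}{4\sigma_x}
= \theta_0 + \beta_2 Z .
\]
```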

The above approximation (5) has an idea similar to the RC approach for the linear model (Carroll et al. [8], Chapter 3), but our problem is that the unobserved covariate in our model is $X^*$ (binary covariate split by median), not $X$. From the above approximation, we can correct for measurement error in $X$, or misclassification in $X^*$. This is normally done by assuming a bivariate-normal model for the joint distribution of $(X, U)$. If $E(X \mid \bar W, Z) = E(X \mid \bar W)$, then we can further express the above association by:

$$E(Y_i \mid \bar W_i, Z_i) \approx \theta_0^* + \frac{\beta_1 \sigma_x}{2\{\sigma_x^2 + (\sigma_u^2/k_i)\}} \bar W_i + \beta_2 Z_i,$$

where $\theta_0^* = \beta_0 + (\beta_1/2) - \{(x_0 + \mu_x)/(4\sigma_x)\}\beta_1 + [\mu_x \sigma_u^2 / \{2 k_i \sigma_x (\sigma_x^2 + (\sigma_u^2/k_i))\}]\beta_1$. Note that for a given $Z_i$, the next step for the approximation is to estimate $E(\bar W_i \mid \bar W_i^*)$, where $\bar W_i^* = I[\bar W_i \ge x_0]$. If $\bar W_i$ is approximately normally distributed, then by some calculation we have that $E(\bar W_i \mid \bar W_i^*) \approx c(x_0) + 1.6 \sqrt{\sigma_x^2 + (\sigma_u^2/k_i)}\, \bar W_i^*$, where $c(x_0)$ depends on $x_0$. For example, if $x_0$ is the median, then $c(x_0) \approx \mu_x - 0.8 \sqrt{\sigma_x^2 + (\sigma_u^2/k_i)}$. If $x_0$ is the first quartile, then $c(x_0) \approx \mu_x - 1.2 \sqrt{\sigma_x^2 + (\sigma_u^2/k_i)}$. If $x_0$ is the third quartile, then $c(x_0) \approx \mu_x - 0.4 \sqrt{\sigma_x^2 + (\sigma_u^2/k_i)}$.
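The constants quoted above can be checked against the mean of a truncated normal distribution. The sketch below assumes that the averaged surrogate is exactly normal with mean $\mu$ and standard deviation $s$ and that the cut point is an exact quantile of that distribution; the function name `dichotomized_mean_coeffs` is hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()

def dichotomized_mean_coeffs(p):
    """For Wbar ~ N(mu, s^2) cut at its p-th quantile, return (a, b) with
    E(Wbar | Wbar* = 0) = mu + a * s and
    E(Wbar | Wbar* = 1) - E(Wbar | Wbar* = 0) = b * s."""
    z = STD.inv_cdf(p)
    phi = STD.pdf(z)
    below = -phi / p           # standardized mean of the part below the cut
    above = phi / (1.0 - p)    # standardized mean of the part at or above the cut
    return below, above - below

for label, p in [("median", 0.5), ("first quartile", 0.25), ("third quartile", 0.75)]:
    a, b = dichotomized_mean_coeffs(p)
    print(f"{label}: c(x0) = mu {a:+.3f} s, slope = {b:.3f} s")
```

Under these assumptions the intercept coefficients come out close to the rounded values −0.8, −1.2, and −0.4 used in the text, and the slope coefficient is close to 1.6 at the median and somewhat larger (about 1.7) at the quartiles.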

Hence, by further simple calculations, we can obtain the following approximation:

$$E(Y_i \mid \bar W_i^*, Z_i) \approx \theta_0^+ + \frac{0.8\, \sigma_x}{\sqrt{\sigma_x^2 + (\sigma_u^2/k_i)}}\, \beta_1 \bar W_i^* + \beta_2 Z_i, \tag{6}$$

where $\theta_0^+ = \beta_0 + (\beta_1/2) - \{(x_0 + \mu_x)/(4\sigma_x)\}\beta_1 + [\mu_x \sigma_u^2 / \{2 k_i \sigma_x (\sigma_x^2 + (\sigma_u^2/k_i))\}]\beta_1 + \{\sigma_x \beta_1 c(x_0)\} / [2\{\sigma_x^2 + (\sigma_u^2/k_i)\}]$. From the above approximation, our LRC estimator is to replace $X_i^*$ by the following calibration function

$$\hat X_i = \frac{0.8\, \sigma_x}{\sqrt{\sigma_x^2 + (\sigma_u^2/k_i)}}\, \bar W_i^*. \tag{7}$$

From the approximation given above and if $E(X \mid \bar W, Z) \approx E(X \mid \bar W)$, then naively using the observed quantiles for a variable that is error-prone would likely lead to an attenuation bias. Under the linear model, a correction for attenuation for $\beta_1$ can be obtained based on the approximation given above. By (6), if there are $k$ replicates for $W$ for all subjects ($k_i = k$ for all $i$), then the attenuation factor is $0.8\sigma_x / \sqrt{\sigma_x^2 + (\sigma_u^2/k)}$. That is, let $\hat\beta_{N,1}$ be the naive estimate of $\beta_1$ obtained by using the observed quantile $\bar W_i^*$; then the LRC estimator for $\beta_1$ can be obtained by dividing the naive estimator $\hat\beta_{N,1}$ by the attenuation factor $0.8\sigma_x / \sqrt{\sigma_x^2 + (\sigma_u^2/k)}$. However, the standard error of the LRC estimator cannot be obtained by the use of the attenuation factor directly. It can be obtained by a sandwich variance estimator where the vector of the estimating equations is obtained by stacking the estimating equations for $\beta$ and the nuisance parameters discussed above (see the Appendix).
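The correction for attenuation described above amounts to a single division. The sketch below shows that arithmetic for the case of a common number of replicates $k$; the function names and the input values (a naive slope of 0.75 with $\sigma_x = 1$, $\sigma_u = 0.5$, $k = 2$) are hypothetical and chosen only to illustrate the calculation.

```python
import math

def lrc_attenuation_factor(sigma_x, sigma_u, k):
    """Attenuation factor 0.8 * sigma_x / sqrt(sigma_x^2 + sigma_u^2 / k)."""
    return 0.8 * sigma_x / math.sqrt(sigma_x**2 + sigma_u**2 / k)

def lrc_slope(naive_slope, sigma_x, sigma_u, k):
    """LRC point estimate of beta_1: naive slope divided by the factor."""
    return naive_slope / lrc_attenuation_factor(sigma_x, sigma_u, k)

# Hypothetical inputs for illustration only.
factor = lrc_attenuation_factor(sigma_x=1.0, sigma_u=0.5, k=2)
corrected = lrc_slope(naive_slope=0.75, sigma_x=1.0, sigma_u=0.5, k=2)
print(factor, corrected)
```

This gives only the point estimate; as noted in the text, the standard error requires the stacked estimating equations rather than a simple rescaling.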

4.2 LRC for Logistic Regression

We consider logistic regression such that $E(Y_i \mid X_i^*, Z_i) = H(\beta_0 + \beta_1 X_i^* + \beta_2 Z_i)$. As discussed in the previous section, by a Taylor expansion, if the disease is rare, then $E(Y_i \mid \bar W_i, Z_i)$ can be approximated by $H\{\beta_0 + \beta_1 E(X_i^* \mid \bar W_i, Z_i) + \beta_2 Z_i\}$. Hence, following an approximation that is similar to that for $E(\beta_0 + \beta_1 I[X_i \ge x_0] \mid \bar W_i^*, Z_i)$ given in (6) of the previous subsection, we have that

$$E(Y_i \mid \bar W_i^*, Z_i) \approx H\left\{ \theta_0^+ + \frac{0.8\, \sigma_x}{\sqrt{\sigma_x^2 + (\sigma_u^2/k_i)}}\, \beta_1 \bar W_i^* + \beta_2 Z_i \right\}.$$

Hence, if we naively use the observed quantiles for a variable that is error-prone, then the naive estimator would likely have an attenuation bias. If there are $k$ replicates for $W$ for all subjects ($k_i = k$ for all $i$), then the attenuation factor is $0.8\sigma_x / \sqrt{\sigma_x^2 + (\sigma_u^2/k)}$. Let $\hat\beta_{N,1}$ be the naive estimate of $\beta_1$ obtained by using the observed dichotomized $\bar W_i^*$. To adjust for bias, the LRC estimator for $\beta_1$ can be obtained by dividing the naive estimator $\hat\beta_{N,1}$ by the attenuation factor $0.8\sigma_x / \sqrt{\sigma_x^2 + (\sigma_u^2/k)}$.

5 Simulation Study

We conducted a simulation study to compare the naive estimator and the proposed NRC and LRC estimators. The naive estimator applied the usual regression analysis by replacing the unobserved $X_i^*$ by $\bar W_i^*$. The NRC estimator replaced $X_i^*$ by (4), and the LRC estimator replaced $X_i^*$ by $\{0.8\sigma_x / \sqrt{\sigma_x^2 + (\sigma_u^2/k_i)}\}\bar W_i^*$ as in (7), and then we corrected the intercept by adjusting for the difference between $\theta_0^+$ and $\beta_0$. In Table 1, we considered the situation when $Y_i$ given $X_i^*$ was linear. We first generated covariates $X_i$, $i = 1, \ldots, n$, from a standard normal distribution. The true binary $X_i^*$ was 1 if $X_i \ge \mu_x$, or 0 otherwise. Then $Y_i$, $i = 1, \ldots, n$, were generated by $Y_i = \beta_0 + \beta_1 X_i^* + e_i$, where $e_i$ was from a standard normal distribution. The surrogate variables $W_{ij}$, $j = 1, \ldots, k_i$, were generated by $W_{ij} = X_i + U_{ij}$, where $U_{ij}$ was normal with mean 0 and standard deviation $\sigma_u = 0.5$ and 1.0, respectively. Half of the individuals had $k_i = 2$, and the other half had $k_i = 1$. This mimicked a situation in which a reliability sample was available with replicates for $W$. The binary surrogate $W_{ij}^*$ was 1 if $W_{ij} \ge \mu_x$, or 0 otherwise. The sample size in the simulation was $n = 500$ and $n = 1000$, respectively. In the tables, the “biases” were calculated by taking the average of $\hat\beta - \beta$ from 500 replicates, “SD” denotes the sample standard deviation of the estimators, and “ASE” denotes the average of the estimated standard errors of the estimators. The 95% confidence interval coverage probabilities (“CP”) are also included. All the parameters are given in the tables.
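The linear-model setup described above can be mimicked in a few lines. The sketch below is a simplified, single-run illustration and not the simulation code behind the tables: it plugs in the true nuisance parameters instead of estimating them, assigns $k_i$ by alternating subjects, uses a larger sample (n = 20000) to reduce noise, does not apply the LRC intercept correction, and reports slopes only. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

def ols_slope(x, y):
    """Slope of the simple least-squares regression of y on x."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate_slopes(n=20000, beta0=0.5, beta1=1.0, mu_x=0.0,
                    sigma_x=1.0, sigma_u=0.5, seed=1):
    rng = random.Random(seed)
    cdf = NormalDist().cdf
    x0 = mu_x  # median split of a normal X
    y, naive, nrc, lrc = [], [], [], []
    for i in range(n):
        x = rng.gauss(mu_x, sigma_x)
        k = 2 if i % 2 == 0 else 1  # assumed assignment: half with 2 replicates
        wbar = sum(x + rng.gauss(0.0, sigma_u) for _ in range(k)) / k
        x_star = 1.0 if x >= x0 else 0.0
        w_star = 1.0 if wbar >= x0 else 0.0
        s_err = sigma_u / math.sqrt(k)
        s_tot = math.sqrt(sigma_x**2 + s_err**2)
        y.append(beta0 + beta1 * x_star + rng.gauss(0.0, 1.0))
        naive.append(w_star)
        # NRC covariate: calibration function (4) with true nuisance parameters.
        nrc.append(cdf(sigma_x * (wbar - mu_x) / (s_err * s_tot)
                       + s_tot * (mu_x - x0) / (sigma_x * s_err)))
        # LRC covariate: calibration function (7).
        lrc.append(0.8 * sigma_x / s_tot * w_star)
    return ols_slope(naive, y), ols_slope(nrc, y), ols_slope(lrc, y)

naive_b1, nrc_b1, lrc_b1 = simulate_slopes()
print(naive_b1, nrc_b1, lrc_b1)
```

In a run of this kind the naive slope should fall clearly below the true value of 1, while the NRC and LRC slopes should be close to it, in line with the pattern reported for $\sigma_u = 0.5$.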

Table 1. Simulation study when Y given X* was linear and X was normal

| σu | Parameter | Statistic | Naive (n = 500) | NRC (n = 500) | LRC (n = 500) | Naive (n = 1000) | NRC (n = 1000) | LRC (n = 1000) |
|---|---|---|---|---|---|---|---|---|
| 0.5 | β0 | Bias | 0.127 | 0.000 | −0.007 | 0.126 | −0.003 | −0.009 |
| | | SD | 0.068 | 0.077 | 0.082 | 0.051 | 0.055 | 0.058 |
| | | ASE | 0.067 | 0.074 | 0.080 | 0.047 | 0.054 | 0.057 |
| | | CP | 0.508 | 0.958 | 0.944 | 0.232 | 0.950 | 0.952 |
| | β1 | Bias | −0.259 | −0.005 | 0.009 | −0.255 | 0.004 | 0.016 |
| | | SD | 0.096 | 0.118 | 0.131 | 0.069 | 0.078 | 0.089 |
| | | ASE | 0.094 | 0.120 | 0.128 | 0.067 | 0.080 | 0.091 |
| | | CP | 0.222 | 0.954 | 0.944 | 0.060 | 0.946 | 0.960 |
| 1.0 | β0 | Bias | 0.222 | −0.002 | 0.041 | 0.221 | −0.004 | 0.040 |
| | | SD | 0.069 | 0.091 | 0.092 | 0.047 | 0.065 | 0.068 |
| | | ASE | 0.069 | 0.088 | 0.096 | 0.048 | 0.064 | 0.067 |
| | | CP | 0.088 | 0.960 | 0.920 | 0.002 | 0.946 | 0.900 |
| | β1 | Bias | −0.430 | 0.005 | −0.083 | −0.443 | 0.009 | −0.080 |
| | | SD | 0.097 | 0.151 | 0.159 | 0.070 | 0.112 | 0.118 |
| | | ASE | 0.097 | 0.156 | 0.162 | 0.068 | 0.104 | 0.114 |
| | | CP | 0.006 | 0.944 | 0.922 | 0.000 | 0.946 | 0.868 |

NOTE: The naive estimator replaced X* by W*, NRC was the normality-based RC estimator, and LRC was the linearization-based RC estimator. The regression parameters were β = (0.5,1). The nuisance parameters were μx = 0, σx = 1. The results were obtained from 500 replicates.

From Table 1, it was seen that the naive estimator had large biases. The NRC estimator performed very well since it was a consistent estimator when X was normal and the measurement error was normal. The LRC estimator was not as good as the NRC estimator in terms of bias and efficiency. The LRC’s performance was in general acceptable when σu = 0.5, but the bias was larger when σu = 1.0. If we consider the relative bias versus the true slope, then for σu = 1.0 the relative bias was about 43% to 44% from the naive estimator while about 8% from the LRC estimator. Hence, although the LRC estimator was not consistent (unlike NRC), it did serve well in terms of bias correction.

The data used for Table 2 were generated similarly to those for Table 1 except that the distribution of X was skewed. The underlying exposure variables X were generated from a mixture of two normals with means $(1/5^{1/2}, -2/5^{1/2})$, variances $(4/5, 1/5)$, and mixture percentages $(2/3, 1/3)$. The purpose of the simulation setup was to understand whether the proposed NRC and LRC estimators were sensitive to the normality assumption of X. We note that when X was not normal, the NRC estimator was no longer consistent. From Table 2, it was seen that the performance of NRC and LRC was still satisfactory.
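As a quick check of this design, the mixture has mean 0 and variance 1, so it matches the nuisance parameters $\mu_x = 0$ and $\sigma_x = 1$ reported with the tables while being skewed. The calculation below uses only the stated component means, variances, and weights.

```latex
\[
E(X) = \tfrac{2}{3}\cdot\tfrac{1}{\sqrt{5}} + \tfrac{1}{3}\cdot\left(-\tfrac{2}{\sqrt{5}}\right) = 0,
\]
\[
\mathrm{var}(X) = E(X^2)
= \tfrac{2}{3}\left(\tfrac{4}{5} + \tfrac{1}{5}\right)
+ \tfrac{1}{3}\left(\tfrac{1}{5} + \tfrac{4}{5}\right) = 1 .
\]
```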

Table 2. Simulation study when Y given X* was linear and X was mixture-normal

| σu | Parameter | Statistic | Naive (n = 500) | NRC (n = 500) | LRC (n = 500) | Naive (n = 1000) | NRC (n = 1000) | LRC (n = 1000) |
|---|---|---|---|---|---|---|---|---|
| 0.5 | β0 | Bias | 0.096 | −0.019 | −0.047 | 0.098 | −0.017 | −0.044 |
| | | SD | 0.066 | 0.075 | 0.078 | 0.047 | 0.052 | 0.056 |
| | | ASE | 0.065 | 0.072 | 0.077 | 0.046 | 0.051 | 0.054 |
| | | CP | 0.688 | 0.930 | 0.896 | 0.402 | 0.918 | 0.872 |
| | β1 | Bias | −0.213 | 0.018 | 0.071 | −0.217 | 0.013 | 0.066 |
| | | SD | 0.094 | 0.114 | 0.128 | 0.066 | 0.078 | 0.090 |
| | | ASE | 0.094 | 0.111 | 0.127 | 0.066 | 0.079 | 0.090 |
| | | CP | 0.400 | 0.936 | 0.910 | 0.102 | 0.956 | 0.884 |
| 1.0 | β0 | Bias | 0.180 | −0.050 | 0.014 | 0.181 | −0.049 | 0.012 |
| | | SD | 0.071 | 0.096 | 0.101 | 0.046 | 0.063 | 0.064 |
| | | ASE | 0.067 | 0.089 | 0.094 | 0.047 | 0.063 | 0.066 |
| | | CP | 0.242 | 0.894 | 0.934 | 0.038 | 0.884 | 0.958 |
| | β1 | Bias | −0.409 | 0.046 | −0.023 | −0.404 | 0.053 | −0.015 |
| | | SD | 0.098 | 0.155 | 0.166 | 0.067 | 0.107 | 0.112 |
| | | ASE | 0.096 | 0.151 | 0.161 | 0.068 | 0.107 | 0.114 |
| | | CP | 0.006 | 0.930 | 0.926 | 0.000 | 0.928 | 0.962 |

NOTE: The naive estimator replaced X* by W*, NRC was the normality-based RC estimator, and LRC was the linearization-based RC estimator. The regression parameters were β = (0.5,1.0). The nuisance parameters were μx = 0, σx = 1. The results were obtained from 500 replicates.

In Table 3, we studied simulations when the regression model was logistic such that $\mathrm{pr}(Y_i = 1 \mid X_i^*) = \{1 + \exp(-\beta_0 - \beta_1 X_i^*)\}^{-1}$, where $\beta_0 = 0$ and $\beta_1 = \ln(3)$. The variables $X_i$, $W_i$, $X_i^*$, $\bar W_i^*$ were all generated similarly to those in Table 1. The results under logistic regression were very similar to those in Table 1 (linear regression). The NRC estimator did the best in terms of bias correction. The LRC estimator was not as good as NRC, but it was a satisfactory estimator with a good interpretation as a correction for attenuation.

Table 3. Simulation study when Y given X* was logistic and X was normal

| σu | Parameter | Statistic | Naive (n = 500) | NRC (n = 500) | LRC (n = 500) | Naive (n = 1000) | NRC (n = 1000) | LRC (n = 1000) |
|---|---|---|---|---|---|---|---|---|
| 0.5 | β0 | Bias | 0.143 | 0.005 | 0.000 | 0.032 | −0.007 | −0.013 |
| | | SD | 0.118 | 0.135 | 0.141 | 0.094 | 0.108 | 0.113 |
| | | ASE | 0.127 | 0.146 | 0.152 | 0.090 | 0.103 | 0.108 |
| | | CP | 0.820 | 0.968 | 0.968 | 0.682 | 0.936 | 0.948 |
| | β1 | Bias | −0.313 | −0.029 | −0.028 | −0.296 | −0.010 | −0.006 |
| | | SD | 0.181 | 0.233 | 0.245 | 0.142 | 0.181 | 0.193 |
| | | ASE | 0.190 | 0.240 | 0.259 | 0.134 | 0.169 | 0.183 |
| | | CP | 0.616 | 0.962 | 0.958 | 0.430 | 0.938 | 0.946 |
| 1.0 | β0 | Bias | 0.219 | −0.032 | 0.017 | 0.225 | −0.014 | 0.032 |
| | | SD | 0.134 | 0.181 | 0.188 | 0.086 | 0.121 | 0.120 |
| | | ASE | 0.128 | 0.174 | 0.177 | 0.090 | 0.122 | 0.124 |
| | | CP | 0.598 | 0.944 | 0.940 | 0.292 | 0.954 | 0.952 |
| | β1 | Bias | −0.476 | 0.032 | −0.069 | −0.498 | 0.014 | −0.111 |
| | | SD | 0.198 | 0.315 | 0.326 | 0.132 | 0.215 | 0.218 |
| | | ASE | 0.188 | 0.307 | 0.313 | 0.133 | 0.215 | 0.219 |
| | | CP | 0.296 | 0.946 | 0.936 | 0.032 | 0.954 | 0.926 |

NOTE: The naive estimator replaced X* by W*, NRC was the normality-based RC estimator, and LRC was the linearization-based RC estimator. The regression parameters were β = (0,ln (3)). The nuisance parameters were μx = 0, σx = 1. The results were obtained from 500 replicates.

The data used for Table 4 were generated similarly to those for Table 3 except that the distribution of X was skewed. The underlying exposure variables X were generated from a mixture of two normals with means $(1/5^{1/2}, -2/5^{1/2})$, variances $(4/5, 1/5)$, and mixture percentages $(2/3, 1/3)$. The simulation setup aimed to understand whether the proposed NRC and LRC estimators were sensitive to the normality assumption of X. From the results in Table 4, it was seen that the performance of NRC and LRC was still satisfactory.

Table 4. Simulation study when Y given X* was logistic and X was mixture-normal

| σu | Parameter | Statistic | Naive (n = 500) | NRC (n = 500) | LRC (n = 500) | Naive (n = 1000) | NRC (n = 1000) | LRC (n = 1000) |
|---|---|---|---|---|---|---|---|---|
| 0.5 | β0 | Bias | 0.091 | −0.030 | −0.063 | 0.102 | −0.024 | −0.049 |
| | | SD | 0.131 | 0.143 | 0.155 | 0.088 | 0.098 | 0.104 |
| | | ASE | 0.124 | 0.140 | 0.149 | 0.087 | 0.099 | 0.105 |
| | | CP | 0.868 | 0.948 | 0.928 | 0.780 | 0.952 | 0.914 |
| | β1 | Bias | −0.313 | −0.029 | −0.028 | −0.262 | −0.006 | −0.040 |
| | | SD | 0.181 | 0.233 | 0.245 | 0.129 | 0.156 | 0.175 |
| | | ASE | 0.190 | 0.240 | 0.259 | 0.135 | 0.166 | 0.183 |
| | | CP | 0.616 | 0.962 | 0.958 | 0.484 | 0.968 | 0.952 |
| 1.0 | β0 | Bias | 0.184 | −0.058 | −0.025 | 0.181 | −0.063 | −0.025 |
| | | SD | 0.128 | 0.169 | 0.178 | 0.087 | 0.119 | 0.119 |
| | | ASE | 0.126 | 0.171 | 0.175 | 0.089 | 0.120 | 0.123 |
| | | CP | 0.692 | 0.954 | 0.954 | 0.442 | 0.916 | 0.946 |
| | β1 | Bias | −0.455 | 0.032 | −0.037 | −0.464 | 0.026 | −0.054 |
| | | SD | 0.190 | 0.300 | 0.318 | 0.130 | 0.214 | 0.216 |
| | | ASE | 0.188 | 0.303 | 0.313 | 0.133 | 0.212 | 0.219 |
| | | CP | 0.332 | 0.966 | 0.960 | 0.056 | 0.952 | 0.946 |

NOTE: The naive estimator replaced X* by W*, NRC was the normality-based RC estimator, and LRC was the linearization-based RC estimator. The regression parameters were β = (0,ln (3)). The nuisance parameters were μx = 0, σx = 1. The results were obtained from 500 replicates.

6 Data Analysis of the NEW Trial

The NEW study was briefly described in the introduction. In the data analysis in this section, we investigated the association between CRP and VO2max. We primarily considered the baseline data except that we used the VO2max values at the 12th month from the control group as the 2nd measurement of VO2max replicates. That is, the VO2max values at baseline and 12-month from the control group (n = 87) were used as a reliability subsample to estimate the measurement error variance. We did not use the 12-month VO2max values from the three intervention groups since the values were slightly higher than their baseline values. In our analysis the unit of VO2max was ml/(kg · min), and hence the intervention effect on weight change has also contributed to the increase of VO2max among the intervention individuals. In this data application, we were interested in the effect of median-split binary VO2max and body mass index (BMI) on CRP. BMI is a measure of relative size based on the mass and height of an individual, calculated by dividing their weight in kilograms by the square of their height in meters.

Among the 439 individuals of the study, we excluded one individual with missing baseline CRP and 32 individuals who had CRP values higher than 10 mg/L which could be due to acute illness [1]. In addition, we also excluded three individuals with baseline VO2max less than 10 ml/(kg · min) which was a threshold typically required to complete basic activities of daily living [23]. As a result, 403 individuals of the NEW study were included. We first examined an association between VO2max and CRP at baseline. The upper portion of Figure 2 showed the scatter plot and a fitted kernel smoother of VO2max and CRP at baseline. To understand the measurement error property of VO2max, the lower portion of Figure 2 was for VO2max at baseline and 12-month from the control group. The 12-month VO2max measurements in the control group in general were not systematically different from the baseline measurements. The differences were likely due to variations in physical conditions (or efforts) at different time points, and can be treated as measurement errors.

Figure 2. The top figure was for baseline VO2max and CRP (n = 403); the bottom figure was for VO2max at baseline and 12-month (from the control group, n = 87). The curve in each scatter plot was from a kernel smoother. The 12-month VO2max measurements in the control group in general were not systematically different from the baseline measurements. The differences were likely due to variations in physical conditions (or efforts) at different time points, and can be treated as measurement errors.

The application in this section was to apply our methods to the regression association for the effects of median-split VO2max and BMI on CRP. Our work was focused on an application of our statistical methodology, and hence the clinical implications of the application here should be further investigated in clinical research. We estimated the regression coefficients using the naive, NRC, and LRC estimators. The results are given in Table 5. All three estimators showed that VO2max was significantly associated with the inflammatory marker CRP, but with different magnitudes. From the naive estimator, individuals with higher than median VO2max had on average about 0.45 mg/L lower CRP than the individuals with lower than median VO2max. From the NRC and LRC estimators, the difference in CRP between the two median-split groups increased to 0.87 mg/L and 0.68 mg/L, respectively. The effects of VO2max from the NRC and LRC estimators were larger than the corresponding naive estimate, which can be explained by the correction for attenuation discussed in Section 4. The standard errors from the NRC and LRC estimates were both larger than those of the naive estimates, which is consistent with the general bias-efficiency trade-off seen in the measurement error literature. In addition, all three estimates showed a similar significant effect of BMI on CRP: on average, an increase of 1 kg/m2 in BMI is associated with an increase of about 0.16 mg/L in CRP.

Table 5.

Modelling the relationship between VO2max (ml/(kg · min)) and CRP (mg/L) using naive, NRC and LRC in the NEW trial (n = 403). The outcome variable was CRP. The covariate VO2max was the true underlying binary median-split variable that may be misclassified due to measurement error in the continuous VO2max.

| Parameter | Naive | NRC | LRC |
|---|---|---|---|
| $\hat{\beta}_0$ (intercept) | −2.072 | −1.447 | −1.940 |
| SE | 0.955 | 1.094 | 1.037 |
| $\hat{\beta}_1$ (VO2max) | −0.446 | −0.874 | −0.684 |
| SE | 0.220 | 0.366 | 0.336 |
| $\hat{\beta}_2$ (BMI) | 0.166 | 0.153 | 0.166 |
| SE | 0.029 | 0.032 | 0.031 |
| $\hat{\mu}_x$ | | 24.960 | 24.960 |
| SE | | 0.254 | 0.254 |
| $\hat{\sigma}_x$ | | 3.132 | 3.132 |
| SE | | 0.131 | 0.131 |
| $\hat{\sigma}_u$ | | 3.512 | 3.512 |
| SE | | 0.003 | 0.003 |

Note: The naive estimator replaced X* by W*, NRC was the normality-based RC estimator, and LRC was the linearity-based RC estimator.
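As a quick arithmetic check on Table 5, the following minimal sketch computes Wald statistics (estimate divided by standard error) for the VO2max coefficient under each estimator, with two-sided p-values from a standard normal reference. It also computes the single-measurement reliability ratio $\sigma_x^2/(\sigma_x^2+\sigma_u^2)$ implied by the reported $\hat{\sigma}_x$ and $\hat{\sigma}_u$. The normal reference distribution and the reliability summary are illustrative assumptions of this sketch, not quantities reported in the paper.

```python
import math

# Estimates and standard errors for beta_1 (VO2max), copied from Table 5.
table5 = {
    "naive": (-0.446, 0.220),
    "NRC": (-0.874, 0.366),
    "LRC": (-0.684, 0.336),
}


def wald_z(estimate, se):
    """Wald statistic: the estimate divided by its standard error."""
    return estimate / se


def two_sided_p(z):
    """Two-sided p-value under a standard normal reference (an assumption)."""
    return math.erfc(abs(z) / math.sqrt(2.0))


z_stats = {name: wald_z(b, se) for name, (b, se) in table5.items()}
p_values = {name: two_sided_p(z) for name, z in z_stats.items()}

# Reliability of a single VO2max measurement implied by Table 5:
# sigma_x^2 / (sigma_x^2 + sigma_u^2).
sigma_x, sigma_u = 3.132, 3.512
reliability = sigma_x**2 / (sigma_x**2 + sigma_u**2)

for name in table5:
    print(f"{name}: z = {z_stats[name]:.2f}, p = {p_values[name]:.3f}")
print(f"reliability = {reliability:.3f}")
```

Under these assumptions all three Wald tests fall below the 0.05 level, in line with the statement that all three estimators found a significant VO2max effect, and the implied reliability below one half indicates substantial measurement error in a single VO2max measurement.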

7 Discussion

Most of the details of the methods in the paper are based on dichotomization of an error-prone exposure variable. The methods can be applied to situations in which the covariates are q-quantiles of an error-prone exposure variable. For example, q = 4 for quartiles and q = 5 for quintiles. For simplicity, we consider linear regression in which the q-quantiles are covariates. We can write the regression model as $Y_i = \beta_0 + \sum_{j=1}^{q-1} \beta_j I[X_i > Q_j] + e_i$, in which $Q_j$ is the jth q-quantile. If q = 4, then in the model, $Q_1$ is the first quartile, $Q_2$ is the median, and $Q_3$ is the third quartile. In this model, if we denote $I[X_i > Q_j]$ by $X_{i,j}$, then the methods in Section 3 (NRC) and Section 4 (LRC) can be applied to estimate $X_{i,j}$. Extension of the modeling of quantiles as covariates from linear regression to logistic regression is straightforward using the approaches in Sections 3.2 and 4.2. For survival analysis with censored outcomes, the methods could be extended, but further methodological research is needed.
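The q-quantile coding described above can be sketched as follows. This is a minimal illustration, assuming the covariates are the cumulative indicators $I[X_i > Q_j]$ with sample quantiles as cut points. The simulated data, the parameter values, and the function names `quantile_indicators` and `ols` are hypothetical. The sketch fits only the naive model (indicators built from the error-prone W) next to the model using the true X; it does not implement the NRC or LRC corrections.

```python
import numpy as np


def quantile_indicators(x, q):
    """Return an (n, q-1) 0/1 matrix whose j-th column is I[x_i > Q_j],
    where Q_j is the j-th sample q-quantile of x, plus the cut points."""
    x = np.asarray(x, dtype=float)
    probs = np.arange(1, q) / q
    cutpoints = np.quantile(x, probs)
    return (x[:, None] > cutpoints[None, :]).astype(float), cutpoints


def ols(indicators, y):
    """Least-squares fit of y on an intercept plus the indicator columns."""
    design = np.column_stack([np.ones(len(y)), indicators])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


rng = np.random.default_rng(0)
n = 2000
x = rng.normal(25.0, 3.0, size=n)      # hypothetical true exposure
w = x + rng.normal(0.0, 3.5, size=n)   # hypothetical error-prone measurement

x_ind, _ = quantile_indicators(x, q=4)  # true quartile indicators
w_ind, _ = quantile_indicators(w, q=4)  # possibly misclassified indicators

beta = np.array([1.0, -0.5, -0.5, -0.5])  # hypothetical (beta_0, ..., beta_3)
y = beta[0] + x_ind @ beta[1:] + rng.normal(0.0, 1.0, size=n)

beta_true_cov = ols(x_ind, y)  # fit using the true quartile indicators
beta_naive = ols(w_ind, y)     # naive fit using indicators built from W

# Share of subjects whose observed quartile differs from the true quartile.
misclassified = np.mean(x_ind.sum(axis=1) != w_ind.sum(axis=1))
print(beta_true_cov, beta_naive, misclassified)
```

With this cumulative coding, the sum of the slope coefficients is the contrast between the highest and lowest quartile groups, which gives one convenient summary for comparing the naive fit with the fit based on the true exposure.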

We have proposed two bias correction methods for regression analysis when quantiles of an error-prone exposure variable are used as covariates. Although the proposed estimators may not be consistent, they are useful for the analysis of real data. For example, in biomedical research, results are often presented by comparing the risk across ranges (such as quartiles) of the exposure. The proposed estimators have limited bias, especially when the exposure effect is not too large, which is often the case for environmental factors. Hence, the methods are likely to be useful for the analysis of biomedical data.

In many studies, a calibration sample (or reliability sample) is used to estimate the within-subject and between-subject variations (Guo et al. [24], White and Xie [25]). If there are two replicates in the calibration sample, then ki = 2 for individuals in the calibration sample, while ki = 1 for the rest of the cohort. The statistical model for W in Section 2 can be treated as an internal reliability sample, say with ki = 2. The simulation study and data analysis can be considered as the setup with an internal reliability sample. If, in an application, the reliability sample is not a subset of the main cohort, then it is an external reliability sample. The main difference in analysis between internal and external reliability samples is in the standard error estimation of the estimators. For an internal reliability sample, there is a need to adjust for uncertainty due to the estimation of the measurement error parameters, as was done in this paper. For an external reliability sample, there is no need to adjust for uncertainty due to estimation of the measurement error parameters.

Acknowledgments

This research was partially supported by National Institutes of Health grants AI112341 (Wang), CA53996 (Wang), ES017030 (Wang and Tapsoba), HL121347 (Wang, Tapsoba, Duggan and McTiernan), CA 105204 (Wang, Duggan, Campbell and McTiernan), CA 116847 (Wang, Duggan, Campbell and McTiernan), the Breast Cancer Research Foundation (Wang, Duggan and McTiernan) and a travel award from the Mathematics Research Promotion Center of National Science Council of Taiwan (Wang).

Appendix: Estimating Equations and Standard Error Estimation for NRC and LRC

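The NRC or LRC estimator can be obtained by solving a set of estimating equations, and the asymptotic variance can be established by a sandwich estimator [26]. Recall that the mean function of $Y_i$ given $(X_i, Z_i)$ is $g(\beta_0 + \beta_1 X_i + \beta_2 Z_i)$. Let the replacement value (calibration function) of $X_i$ for NRC or LRC be denoted by $\hat{X}_i$. When X and Z are independent (or when Z is not in the main regression model), $\mu_x$, $\sigma_x$ and $\sigma_u$ are the nuisance parameters involved in the calculation of $\hat{X}_i$ via (4) or (7). The estimating equations for $\beta$ and $(\mu_x, \sigma_x, \sigma_u)$ can be expressed as below.

$$
\left\{
\begin{array}{l}
\sum_{i=1}^{n} (1, \hat{X}_i, Z_i)^{T} \left\{ Y_i - g(\beta_0 + \beta_1 \hat{X}_i + \beta_2 Z_i) \right\} = 0; \\
\sum_{i=1}^{n} (\bar{W}_i - \mu_x) = 0; \\
\sum_{i=1}^{n} \left\{ \sum_{j=1}^{k_i} (W_{ij} - \bar{W}_{i\cdot})^2 - (k_i - 1)\sigma_u^2 \right\} = 0; \\
\sum_{i=1}^{n} \left\{ (\bar{W}_i - \mu_x)^2 - \sigma_x^2 - \sigma_u^2 / k_i \right\} = 0.
\end{array}
\right.
$$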
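The last three estimating equations, for $(\mu_x, \sigma_u^2, \sigma_x^2)$, have closed-form solutions. The sketch below solves them directly, assuming the replicate measurements are stored as one array per subject and that at least one subject has $k_i \ge 2$. The function name `nuisance_estimates` and the simulated data are hypothetical, and the sketch does not compute $\hat{X}_i$ or the regression coefficients.

```python
import numpy as np


def nuisance_estimates(replicates):
    """Solve the moment equations for (mu_x, sigma_x^2, sigma_u^2).

    replicates: list of 1-D arrays, one per subject, holding W_i1, ..., W_ik_i.
    """
    k = np.array([len(w) for w in replicates], dtype=float)
    wbar = np.array([np.mean(w) for w in replicates])
    within_ss = np.array(
        [np.sum((np.asarray(w) - np.mean(w)) ** 2) for w in replicates]
    )
    # sum_i (Wbar_i - mu_x) = 0
    mu_x = wbar.mean()
    # sum_i {sum_j (W_ij - Wbar_i)^2 - (k_i - 1) sigma_u^2} = 0
    sigma_u2 = within_ss.sum() / (k - 1.0).sum()
    # sum_i {(Wbar_i - mu_x)^2 - sigma_x^2 - sigma_u^2 / k_i} = 0
    sigma_x2 = np.mean((wbar - mu_x) ** 2) - sigma_u2 * np.mean(1.0 / k)
    return mu_x, sigma_x2, sigma_u2


# Hypothetical internal reliability sample: the first n_rep subjects have
# two replicates (k_i = 2) and the remaining subjects have one (k_i = 1).
rng = np.random.default_rng(1)
n, n_rep = 20000, 5000
x = rng.normal(25.0, 3.0, size=n)
replicates = [
    x[i] + rng.normal(0.0, 3.5, size=(2 if i < n_rep else 1)) for i in range(n)
]

mu_x, sigma_x2, sigma_u2 = nuisance_estimates(replicates)
print(mu_x, sigma_x2, sigma_u2)
```

In small samples the moment estimate of $\sigma_x^2$ can be negative; handling that case (for example by truncation at zero) is left out of this sketch.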

If X and Z are dependent, then the calibration function $E(X \mid \bar{W}, Z)$ (as in (3) of Section 3) is slightly more complicated, since a few more nuisance parameters must be estimated [26]. See also Wang et al. [20, Section 3.2] for a similar sandwich covariance estimator.
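For orientation, the generic sandwich form for an estimator defined by stacked estimating equations can be written as follows. This is the standard M-estimation expression, stated here as a sketch; the symbols $\theta$, $\psi_i$, $A_n$ and $B_n$ are notation introduced for this illustration and are not taken from the paper. Let $\theta = (\beta^{T}, \mu_x, \sigma_x, \sigma_u)^{T}$ and let $\psi_i(\theta)$ denote the contribution of individual $i$ to the stacked estimating equations, so that $\sum_{i=1}^{n} \psi_i(\hat{\theta}) = 0$. Then

```latex
\widehat{\mathrm{Var}}(\hat{\theta}) = \frac{1}{n} A_n^{-1} B_n \left( A_n^{-1} \right)^{T},
\qquad
A_n = -\frac{1}{n} \sum_{i=1}^{n}
      \left. \frac{\partial \psi_i(\theta)}{\partial \theta^{T}} \right|_{\theta = \hat{\theta}},
\qquad
B_n = \frac{1}{n} \sum_{i=1}^{n} \psi_i(\hat{\theta}) \, \psi_i(\hat{\theta})^{T}.
```

Because the nuisance-parameter equations are stacked with the equations for $\beta$, the off-diagonal blocks of $A_n$ carry the uncertainty from estimating $(\mu_x, \sigma_x, \sigma_u)$ into the standard errors of $\hat{\beta}$.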

Footnotes

Supporting information

Additional supporting information may be found in the online version of this article at the publisher’s web site.

References

  • 1. Foster-Schubert KE, Alfano CM, Duggan CR, Xiao L, Campbell KL, Kong A, Bain C, Wang CY, Blackburn G, McTiernan A. Effect of exercise and diet, alone or combined, on weight and body composition in overweight-to-obese post-menopausal women. Obesity. 2012;20:1628–1638. doi: 10.1038/oby.2011.76.
  • 2. Campbell KL, Foster-Schubert KE, Alfano CM, Wang C, Wang CY, Duggan CR, Mason C, Imayama I, Kong A, Xiao L, Bain CE, Blackburn GL, Stanczyk FZ, McTiernan A. Reduced-calorie dietary weight loss, exercise, and sex hormones in postmenopausal women: randomized controlled trial. Journal of Clinical Oncology. 2012;30:2314–2326. doi: 10.1200/JCO.2011.37.9792.
  • 3. Imayama I, Ulrich CM, Alfano CM, Wang C, Xiao L, Wener MH, Campbell KL, Duggan CR, Foster-Schubert KE, Kong A, Mason CE, Wang CY, Blackburn GL, Bain CE, Thompson HJ, McTiernan A. Effects of a caloric restriction weight loss diet and exercise on inflammatory biomarkers in overweight/obese postmenopausal women: a randomized controlled trial. Cancer Research. 2012;72:2314–2326. doi: 10.1158/0008-5472.CAN-11-3092.
  • 4. Kullo IJ, Khaleghi M, Hensrud DD. Markers of inflammation are inversely associated with VO2 max in asymptomatic men. Journal of Applied Physiology. 2007;102:1374–1379. doi: 10.1152/japplphysiol.01028.2006.
  • 5. Gustafson P. Measurement Error and Misclassification in Statistics and Epidemiology: Impacts and Bayesian Adjustments. Chapman and Hall; New York: 2004.
  • 6. Qi L, Wang CY, Prentice RL. Weighted estimators for proportional hazards regression with missing covariates. J Amer Statist Assoc. 2005;100:1250–1263.
  • 7. Natarajan L. Regression calibration for dichotomized mismeasured predictors. International Journal of Biostatistics. 2009;5(1):1143. doi: 10.2202/1557-4679.1143.
  • 8. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective. 2nd ed. Chapman and Hall; London: 2006.
  • 9. Buonaccorsi J. Measurement Error: Models, Methods, and Applications. Chapman and Hall/CRC; Boca Raton: 2010.
  • 10. Kosinski A, Flanders WD. Evaluating the exposure and disease relationship with adjustment for different types of exposure misclassification: a regression approach. Stat Med. 1999;18:2795–2808. doi: 10.1002/(sici)1097-0258(19991030)18:20<2795::aid-sim192>3.0.co;2-s.
  • 11. Huang GH, Bandeen-Roche K. Building an identifiable latent variable model with covariate effects on underlying and measured variables. Psychometrika. 2004;69:5–32.
  • 12. Wang CY, Song X. Expected estimating equations via EM for proportional hazards regression with covariate misclassification. Biostatistics. 2013;14:351–365. doi: 10.1093/biostatistics/kxs046.
  • 13. Flegal KM, Keyl PM, Nieto FJ. Differential misclassification arising from nondifferential errors in exposure measurement. Amer J Epi. 1991;134:1233–1244. doi: 10.1093/oxfordjournals.aje.a116026.
  • 14. Gustafson P, Le Nhu D. Comparing the effects of continuous and discrete covariate mismeasurement, with emphasis on the dichotomization of mismeasured predictors. Biometrics. 2002;58(4):878–887. doi: 10.1111/j.0006-341x.2002.00878.x.
  • 15. Duggan C, Wang CY, Neuhouser M, Xiao L, Smith AW, Reding K, Baumgartner RN, Baumgartner KB, Bernstein L, Ballard-Barbash R, McTiernan A. Associations of insulin-like growth factor and insulin-like growth factor binding protein-3 with mortality in women with breast cancer. International Journal of Cancer. 2013;132:1191–1200. doi: 10.1002/ijc.27753.
  • 16. Lyles R. A note on estimating crude odds ratios in case-control studies with differentially misclassified exposure. Biometrics. 2002;58:1034–1036. doi: 10.1111/j.0006-341x.2002.1034_1.x.
  • 17. Küchenhoff H, Mwalili SM, Lesaffre E. A general method for dealing with misclassification in regression: the misclassification SIMEX. Biometrics. 2006;62:85–96. doi: 10.1111/j.1541-0420.2005.00396.x.
  • 18. Dalen I, Buonaccorsi JP, Sexton JA, Laake P, Thoresen M. Correction for misclassification of a categorized exposure in binary regression using replication data. Stat Med. 2009;28:3386–3410. doi: 10.1002/sim.3712.
  • 19. Seguin RA, Buchner D, Lui J, Messina C, Manson J, Moreland L, Allison M, Wang CY, Patel M, LaCroix AZ. Sedentary behavior and mortality in older women: the Women's Health Initiative Observational and Extension Studies. American Journal of Preventive Medicine. 2014;46:122–135. doi: 10.1016/j.amepre.2013.10.021.
  • 20. Wang CY, Huang Y, Chao EC, Jeffcoat MK. Expected estimating equations for missing data, measurement error, and misclassification, with application to longitudinal nonignorably missing data. Biometrics. 2008;64:85–95. doi: 10.1111/j.1541-0420.2007.00839.x.
  • 21. Wu L, Liu W, Hu XJ. Joint inference on HIV viral dynamics and immune suppression in presence of measurement errors. Biometrics. 2010;66:327–335. doi: 10.1111/j.1541-0420.2009.01308.x.
  • 22. Yi GY, Liu W, Wu L. Simultaneous inference and bias analysis for longitudinal data with covariate measurement error and missing responses. Biometrics. 2011;67:67–75. doi: 10.1111/j.1541-0420.2010.01437.x.
  • 23. Shephard RJ. Maximal oxygen intake and independence in old age. Br J Sports Med. 2009;43:342–346. doi: 10.1136/bjsm.2007.044800.
  • 24. Guo Y, Little RJ, McConnell DS. On using summary statistics from an external calibration sample to correct for covariate measurement error. Epidemiology. 2012;23:165–174. doi: 10.1097/EDE.0b013e31823a4386.
  • 25. White MT, Xie SX. Adjustment for measurement error in evaluating diagnostic biomarkers by using an internal reliability sample. Stat Med. 2013;32:4709–4725. doi: 10.1002/sim.5878.
  • 26. Wang CY. Robust sandwich covariance estimation for regression calibration estimator in Cox regression with measurement error. Statistics and Probability Letters. 1999;45:371–378.
