Methods to Adjust for Misclassification in the Quantiles for the Generalized Linear Model with Measurement Error in Continuous Exposures

Ching-Yun Wang; Jean De Dieu Tapsoba; Catherine Duggan; Kristin L Campbell; Anne McTiernan

doi:10.1002/sim.6812

Author manuscript; available in PMC: 2017 May 10.

Published in final edited form as: Stat Med. 2015 Nov 22;35(10):1676–1688. doi: 10.1002/sim.6812

PMCID: PMC4826813 NIHMSID: NIHMS736504 PMID: 26593772

SUMMARY

In many biomedical studies, covariates of interest may be measured with error. In regression analyses, however, the quantiles of the exposure variable are often used as the covariates. Because of measurement error in the continuous exposure variable, the quantiles of the exposure variable may be misclassified. Misclassification in the quantiles can lead to biased estimation of the association between the exposure variable and the outcome variable. Adjustment for misclassification is challenging when gold standard variables are not available. In this paper, we develop two regression calibration estimators to reduce bias in effect estimation. The first estimator is normal likelihood-based. The second estimator is linearization-based, and it provides a simple and practical correction. Finite sample performance is examined via a simulation study. We apply the methods to a 4-arm randomized clinical trial that tested exercise and weight loss interventions in women aged 50–75 years.

Keywords: Measurement error, Misclassification, Regression calibration

1 Introduction

Covariate measurement error occurs in many research studies when the covariate variable is continuous. Covariate measurement error can result in misclassification if the exposure variable is analyzed as quantiles. One motivating example of our methodology research is covariate measurement error associated with measurement of aerobic fitness in the Nutrition and Exercise in Women (NEW) study, conducted from 2005 to 2009. The NEW study is a 12-month randomized, controlled trial using a 4-arm design to compare the effect of three lifestyle-change interventions (dietary weight loss, moderate-to-vigorous intensity aerobic exercise, or both interventions combined) versus control (no lifestyle change) [1][2]. Ancillary studies have evaluated the effects of physical activity on disease-related biomarkers, such as C-reactive protein (CRP); see Imayama, et al. [3]. VO₂max is a measure of aerobic fitness or exercise capacity through determination of the maximum rate of oxygen consumption as measured during incremental exercise, typically on a motorized treadmill. Inflammation markers have been shown to be inversely associated with VO₂max (Kullo et al. [4]). In regression analysis, the quantiles of VO₂max are often used as a covariate. Because of biological variation and factors related to measurement of VO₂max (e.g., individual’s effort, variations in testing protocols), VO₂max may be associated with errors, and hence the quantiles for VO₂max may be misclassified. Misclassification of exposure can cause bias; see Gustafson [5] for a review.

When a subset of subjects has all covariates measured without error, methods for missing covariates can generally be applied to the covariate misclassification problem. This subset is often called the complete-case set, or the validation set; see for example Qi et al. [6] and Natarajan [7]. Many important methods for covariate measurement errors have been well reviewed by Carroll et al. [8] and Buonacorsi [9]. However, overall there are considerably fewer methods for correcting for covariate misclassification when a validation set does not exist. A significant challenge in this situation is that the misclassification probabilities cannot be identified without additional information. Apart from the availability of a validation set, the misclassification probabilities can be identified if multiple surrogate variables are available for the true unobserved misclassified covariates; see for example Kosinski and Flanders [10], Huang and Bandeen-Roche [11], and Wang and Song [12]. In general, multiple surrogate variables are required to be conditionally independent given the latent covariates.

Our main goal in this paper is to develop methods to correct for measurement error in a continuous exposure variable when its quantiles are used as the covariates. Flegal et al. [13] observed that misclassification from the dichotomization of a continuous error-prone exposure variable is differential. Under differential misclassification, the misclassification probability of the error-prone binary variable given the true binary variable is not independent of the outcome variable. Under the assumption that the outcome variable follows a linear regression on the true unobserved continuous exposure, Gustafson and Le [14] developed expressions for the effect of misclassification from the dichotomization of a continuous error-prone exposure variable. However, it is restrictive to assume that the outcome variable follows either a linear or a logistic regression on the true unobserved continuous exposure. In some applications the outcome variable may not be linearly associated with the underlying unobserved continuous exposure variable. For example, among women with breast cancer, Duggan et al. [15] found that the risk of all-cause mortality is higher among women in the first and fifth quintiles of insulin-like growth factor.

In this paper we propose two regression calibration estimators for misclassification in the quantiles due to measurement error in the continuous exposure variable. In the development of our methodology, we do not need to assume a validation subset with gold standard measurements for the exposure variable, which is the case in Natarajan [7], Lyles [16], and Kuchenhoff et al. [17]. Furthermore, we do not need to assume that the outcome variable is linear (Gustafson and Le [14] and Natarajan [7]) or logistic regression (Dalen et al. [18]) of the true unobserved continuous exposure variable. In our main analysis, the outcome variable is a generalized linear model of the true unobserved quantiles (such as median-split binary covariates, quartiles, or quintiles), but the outcome variable may not have a linear association with the unobserved continuous exposure variable. Section 2 describes the regression models for our problem. A likelihood-based approach is described. However, the likelihood-based approach is in general not practical because it requires additional model assumptions and often involves technical computations. This leads us to the development of a normality-based regression calibration (NRC) estimator (Section 3). The NRC estimator is consistent under a normality assumption when the primary model is linear. An estimator for a vector of parameters of interest (say, β) is consistent if it converges to the true β as the sample size increases to infinity. To understand whether the effect of misclassification results in attenuation bias under some situations, we further develop a linearization-based regression calibration (LRC) estimator in Section 4. Attenuation bias means that the effect of the categorical covariate is under-estimated (biased towards 0) when the misclassified covariate is treated as the true covariate. The LRC estimator can be treated as a simple replacement estimator for NRC. Although it does not perform as well as NRC in general, the LRC estimator has a simple correction-for-attenuation interpretation. Finite sample performance of the proposed NRC and LRC estimators is investigated in Section 5. We apply the methods to real data in Section 6. Some concluding remarks are given in Section 7.

2 Statistical Models

Assume that the study cohort consists of n subjects. For i = 1, …, n, let X_i be a continuous exposure variable that cannot be precisely measured, such as physical activity data. In the analysis, its quantiles are often used as the covariate of interest. Given that a variable split by quantiles can be represented by binary variables in regression analysis, it is generally sufficient for us to first consider a dichotomous variable as a covariate in regression. In the Discussion section, we will describe the details of adjusting for misclassification in the quantiles (such as quartiles or quintiles). Let $X_{i}^{*}$ be a dichotomous version of X_i with a cutoff point x₀ such that $X_{i}^{*} = I [X_{i} \geq x_{0}]$, where I[·] is an indicator function. For example, if x₀ is the median of X_i then $X_{i}^{*}$ is a median-split binary covariate. In the development of the method, we assume $X_{i}^{*}$ is the covariate of interest. Let μ_x be the mean of X_i, and $σ_{x}^{2}$ be the variance of X_i. We will need to estimate μ_x and $σ_{x}^{2}$ in the development of the proposed methods to be described later.

We assume that a surrogate variable W for X is available and that it follows the classical additive measurement error model

$$W_{ij} = X_{i} + U_{ij}, \quad \text{where } E (U_{ij} \mid X_{i}) = 0,$$

for i = 1, …, n, j = 1, …, k_i, in which U_ij is the unobserved error. The replicates of W serve to identify the measurement error variance $σ_{u}^{2}$. The model given above includes the situation in which a reliability sample for W is available. A reliability sample with k_i = 2 is often collected in a study to assess the within-subject variation of W. For example, in the Women’s Health Initiative (WHI), investigators were interested in the association between sedentary activity and mortality [19], and data from a subset of WHI participants were used for a reliability analysis of the physical activity data. Let $W_{ij}^{*}$ be binary such that $W_{ij}^{*} = 1$ if $W_{ij} \geq x_{0}$ and 0 otherwise (for j = 1, …, k_i). In a data analysis without adjustment, W* is used as a binary covariate. Let Z be the vector of other observed covariates (other than W) such as age and gender. The outcome variable Y_i, i = 1, …, n, follows a generalized linear model such that

$$E(Y_{i} \mid X_{i}^{*}, Z_{i}) = g (β_{0} + β_{1} X_{i}^{*} + β_{2}^{'} Z_{i}), \tag{1}$$

where g(·) is a specified function, and our goal is to estimate ${(β_{0}, β_{1}, β_{2}^{'})}^{'} \equiv β$. For example, g(u) = u in linear regression, g(u) = exp(u) in Poisson regression, while g(u) = {1 + exp(−u)}⁻¹ in logistic regression.

The problem of interest is a combination of measurement error and misclassification. It is a measurement error problem since X is not available while its surrogate W is available. It is a misclassification problem since the true covariate X* is not available while its surrogate W* is available. Although the problem is a mixture of measurement error and misclassification, it has not been well addressed in the literature (Gustafson [5]; Carroll et al. [8]; Buonacorsi [9]). To address this issue, we first consider an approach using estimating equations. It is easily seen that if X* were known, then an estimating equation for β is $\sum_{i = 1}^{n} {(1, X_{i}^{*}, Z_{i}^{'})}^{'} \{ Y_{i} - g (β_{0} + β_{1} X_{i}^{*} + β_{2}^{'} Z_{i}) \} = 0$. This estimating equation is normally called the full data estimating equation; that is, the estimating equation for the case when all the data are observed without measurement error. Hence when X* is not available, a consistent estimating approach is to take the conditional expectation of the full data estimating equation given the observed data (Wang et al. [20], Wu et al. [21], Yi et al. [22]). For simplicity of notation, unless otherwise stated ${\tilde{W}}_{i}$ denotes $(W_{i 1}, \dots, W_{i k_{i}})$. The likelihood score-based expected estimating equation (EEE) for estimating β can be written as

$$\sum_{i = 1}^{n} E \left[ \begin{pmatrix} 1 \\ X_{i}^{*} \\ Z_{i} \end{pmatrix} \{ Y_{i} - g (β_{0} + β_{1} X_{i}^{*} + β_{2}^{'} Z_{i}) \} \,\middle|\, Y_{i}, {\tilde{W}}_{i}, Z_{i} \right] = 0. \tag{2}$$

The conditional expectation given in (2) can be calculated by assuming distributions for X and the measurement error U. For example, let f(V) denote the probability density function of any random variable V; then $f (Y | \tilde{W}, Z) = \sum_{x^{*} = 0}^{1} f (Y | x^{*}, \tilde{W}, Z) f (x^{*} | \tilde{W}, Z)$ and

$$f (X^{*} = 1 \mid W, Z) = \frac{\int_{x} I [x \geq x_{0}] f (W \mid x) f (x, Z) \, d x}{\int_{x} f (W \mid x) f (x, Z) \, d x} .$$
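As a rough illustration of the last display, the following sketch evaluates pr(X* = 1 | W) by numerical integration in the simplest setting: no Z, a normal X, and a normal error in W. The distributional choices, the function names (`prob_numeric`, `prob_closed_form`), the symbol `tau` for the error standard deviation, and the parameter values are assumptions made for illustration only. Under these assumptions the integral ratio can be compared with the standard closed form for the conditional distribution of X given W in a bivariate normal model.

```python
import math
from statistics import NormalDist


def prob_numeric(w, x0, mu_x, sigma_x, tau, n_grid=4000):
    """pr(X >= x0 | W = w) by midpoint-rule integration of the ratio above.

    Assumes X ~ N(mu_x, sigma_x^2), W | X = x ~ N(x, tau^2), and no Z.
    """
    lower = mu_x - 8.0 * sigma_x
    step = 16.0 * sigma_x / n_grid
    numerator = denominator = 0.0
    for i in range(n_grid):
        x = lower + (i + 0.5) * step
        # f(W | x) f(x), up to constants that cancel in the ratio
        weight = math.exp(-0.5 * ((w - x) / tau) ** 2
                          - 0.5 * ((x - mu_x) / sigma_x) ** 2)
        denominator += weight
        if x >= x0:
            numerator += weight
    return numerator / denominator


def prob_closed_form(w, x0, mu_x, sigma_x, tau):
    """Same probability from the normal conditional distribution of X given W."""
    shrink = sigma_x ** 2 / (sigma_x ** 2 + tau ** 2)
    cond_mean = mu_x + shrink * (w - mu_x)
    cond_sd = math.sqrt(shrink) * tau
    return NormalDist().cdf((cond_mean - x0) / cond_sd)


# Hypothetical values: standard normal X, error SD 0.5, median cut-off.
p_numeric = prob_numeric(w=0.4, x0=0.0, mu_x=0.0, sigma_x=1.0, tau=0.5)
p_closed = prob_closed_form(w=0.4, x0=0.0, mu_x=0.0, sigma_x=1.0, tau=0.5)
```

When the distributions are not normal, only the numerical route remains available, which is one way to see why the likelihood-based calculation becomes technical in general.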

3 Normality-based Regression Calibration Method

The EEE approach discussed in the previous section would require calculation via the use of likelihood functions. In this section, we will propose a new NRC estimator. The NRC estimator is a replacement estimator that replaces an unobserved X* by a quantity that can be obtained by simple calculation.

3.1 NRC for Linear Regression

In this subsection, we consider linear regression such that $E (Y_{i} | X_{i}^{*}, Z_{i}) = β_{0} + β_{1} X_{i}^{*} + β_{2}^{'} Z_{i}$ . Under this model, it is seen that $E (Y_{i} | X_{i}, Z_{i}) = β_{0} + β_{1} I [X_{i} \geq x_{0}] + β_{2}^{'} Z_{i}$ . However, like $X_{i}^{*}$ , X_i is not observed and hence the mean model will need to be further developed. By simple calculation, we have $E (Y_{i} | {\bar{W}}_{i}, Z_{i}) = β_{0} + β_{1} pr (X_{i} \geq x_{0} | {\bar{W}}_{i}, Z_{i}) + β_{2}^{'} Z_{i}$ , where ${\bar{W}}_{i} = \sum_{j = 1}^{k_{i}} W_{ij} / k_{i}$ . From this association, the regression coefficients can be consistently estimated if we replace the unobserved $X_{i}^{*}$ by $pr (X_{i} \geq x_{0} | {\bar{W}}_{i}, Z_{i})$ . The conditional probability in general can be well estimated under a multivariate normality assumption of $(X_{i}, {\bar{W}}_{i}, Z_{i})$ . When $(X_{i}, {\bar{W}}_{i}, Z_{i})$ is multivariate-normal, it is easily seen that

$$\begin{aligned} pr(X_{i} \geq x_{0} \mid {\bar{W}}_{i}, Z_{i}) &= pr \left( \frac{X_{i} - E (X_{i} \mid {\bar{W}}_{i}, Z_{i})}{\{var (X_{i} \mid {\bar{W}}_{i}, Z_{i})\}^{1 / 2}} \geq \frac{x_{0} - E(X_{i} \mid {\bar{W}}_{i}, Z_{i})}{\{var (X_{i} \mid {\bar{W}}_{i}, Z_{i})\}^{1 / 2}} \,\middle|\, {\bar{W}}_{i}, Z_{i} \right) \\ &= Φ \left( \frac{(σ_{x}^{2}, Σ_{x z}) \begin{pmatrix} σ_{x}^{2} + (σ_{u}^{2} / k_{i}) & Σ_{x z} \\ Σ_{x z} & Σ_{z} \end{pmatrix}^{- 1} \begin{pmatrix} {\bar{W}}_{i} - μ_{x} \\ Z_{i} - μ_{z} \end{pmatrix} - (x_{0} - μ_{x})}{\left\{ σ_{x}^{2} - (σ_{x}^{2}, Σ_{x z}) \begin{pmatrix} σ_{x}^{2} + (σ_{u}^{2} / k_{i}) & Σ_{x z} \\ Σ_{x z} & Σ_{z} \end{pmatrix}^{- 1} \begin{pmatrix} σ_{x}^{2} \\ Σ_{x z} \end{pmatrix} \right\}^{1 / 2}} \right), \end{aligned} \tag{3}$$

where Φ(·) is the cumulative distribution function of a standard normal distribution, μ_z is the mean of Z, $Σ_{xz} = cov (X_{i}, Z_{i})$ and $Σ_{z} = cov (Z_{i})$. The NRC estimator replaces X* by the conditional probability given above in (3). The (nuisance) parameters involved in (3) can be consistently estimated since replicates for W are available (see the Appendix). Let ${\hat{X}}_{i}^{*}$ be the conditional probability $pr (X \geq x_{0} | \bar{W}, Z)$ in (3), which is often called the calibration function. When there is no Z or when X and Z are independent, the replacement variable (3) for the unobserved $X_{i}^{*}$ can be written as

$${\hat{X}}_{i}^{*} = Φ \left( \frac{σ_{x} ({\bar{W}}_{i} - μ_{x})}{(σ_{u} / \sqrt{k_{i}}) \sqrt{σ_{x}^{2} + (σ_{u}^{2} / k_{i})}} + \frac{\sqrt{σ_{x}^{2} + (σ_{u}^{2} / k_{i})} \, (μ_{x} - x_{0})}{σ_{x} (σ_{u} / \sqrt{k_{i}})} \right) . \tag{4}$$

However, the expression still does not provide a simple correction for attenuation in our problem. That is, from the development of this section, we cannot directly see how to correct the bias of the naive estimator that uses covariates $({\bar{W}}_{i}^{*}, Z_{i})$. The argument given above has shown that when Y given (X*, Z) is linear, a consistent estimator for β can be obtained by using covariates $({\hat{X}}_{i}^{*}, Z)$. In the Appendix, we provide details on how to obtain the standard error of the NRC estimator. The idea is to stack the estimating equations for β and the nuisance parameters discussed above, and then to obtain the asymptotic variance by a sandwich estimator. The asymptotic variance takes into consideration the uncertainty due to the replacement of $X_{i}^{*}$ with ${\hat{X}}_{i}^{*}$.
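A minimal sketch of how the calibration function (4) could be computed from replicate data is given below. The moment estimators in `estimate_nuisance` are generic method-of-moments choices assumed here for illustration; they are not necessarily the estimators described in the Appendix. The function names, the toy data, and the use of the estimated mean as the cut-off are likewise hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def estimate_nuisance(replicates):
    """Moment estimates of (mu_x, sigma_x^2, sigma_u^2).

    `replicates` is a list of per-subject lists [W_i1, ..., W_ik_i].
    Generic method-of-moments estimators, assumed for illustration.
    """
    wbar = [mean(w) for w in replicates]
    # Pooled within-subject variance; only subjects with k_i >= 2 contribute.
    ss_within = sum(sum((x - m) ** 2 for x in w)
                    for w, m in zip(replicates, wbar))
    df_within = sum(len(w) - 1 for w in replicates)
    sigma_u2 = ss_within / df_within
    # var(Wbar_i) = sigma_x^2 + sigma_u^2 / k_i, averaged over subjects.
    mean_inv_k = mean([1.0 / len(w) for w in replicates])
    sigma_x2 = max(variance(wbar) - sigma_u2 * mean_inv_k, 1e-12)
    return mean(wbar), sigma_x2, sigma_u2


def nrc_calibration(wbar_i, k_i, x0, mu_x, sigma_x2, sigma_u2):
    """Equation (4): pr(X_i >= x0 | Wbar_i) when there is no Z."""
    tau2 = sigma_u2 / k_i
    total = sigma_x2 + tau2
    cond_mean = mu_x + sigma_x2 / total * (wbar_i - mu_x)
    cond_sd = math.sqrt(sigma_x2 * tau2 / total)
    return NormalDist().cdf((cond_mean - x0) / cond_sd)


# Toy data: three subjects with two replicates, two with one (hypothetical).
data = [[1.2, 0.8], [-0.3], [0.5, 0.9], [-1.1, -0.7], [0.1]]
mu, sx2, su2 = estimate_nuisance(data)
# Cut-off taken as the estimated mean, purely for illustration.
x_hat = [nrc_calibration(mean(w), len(w), mu, mu, sx2, su2) for w in data]
```

The values in `x_hat` would then take the place of the unobserved binary covariate in the regression. Standard errors from an ordinary regression fit on `x_hat` would ignore the estimation of the nuisance parameters, which is why a sandwich estimator based on stacked estimating equations is needed.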

3.2 NRC for Logistic Regression

In this subsection, we consider logistic regression such that $E (Y_{i} | X_{i}^{*}, Z_{i}) = H (β_{0} + β_{1} X_{i}^{*} + β_{2}^{'} Z_{i})$ , where H(v) = {1 + exp(−v)}⁻¹. The mean function in logistic regression is not linear. We note that by a Taylor expansion,

$$\begin{aligned} E \{ H (β_{0} + β_{1} X_{i}^{*} + β_{2}^{'} Z_{i}) \mid {\bar{W}}_{i}, Z_{i} \} \approx{} & H \{ β_{0} + β_{1} E (X_{i}^{*} \mid {\bar{W}}_{i}, Z_{i}) + β_{2}^{'} Z_{i} \} \\ & + 0.5 β_{1}^{2} H^{(2)} \{ β_{0} + β_{1} E (X_{i}^{*} \mid {\bar{W}}_{i}, Z_{i}) + β_{2}^{'} Z_{i} \} E [ \{ X_{i}^{*} - E (X_{i}^{*} \mid {\bar{W}}_{i}, Z_{i}) \}^{2} \mid {\bar{W}}_{i}, Z_{i} ], \end{aligned}$$

where H⁽²⁾(v) = (∂²/∂v²)H(v). If the second term of the above equation is relatively small, then we have the following approximation: $E (Y_{i} | {\bar{W}}_{i}, Z_{i}) \approx H \{ β_{0} + β_{1} E (X_{i}^{*} | {\bar{W}}_{i}, Z_{i}) + β_{2}^{'} Z_{i} \}$. Hence, following a similar approximation for $β_{0} + β_{1} E (X_{i}^{*} | {\bar{W}}_{i}, Z_{i})$ and the approximation given in the previous subsection, we have that

$$E (Y_{i} \mid {\bar{W}}_{i}, Z_{i}) \approx H \{ β_{0} + β_{1} {\hat{X}}_{i}^{*} + β_{2}^{'} Z_{i} \},$$

where ${\hat{X}}_{i}^{*}$ is from (3) if the calibration function is a function of both ${\bar{W}}_{i}$ and Z_i, or from (4) if the calibration function is independent of Z_i.
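To get a feel for the size of the neglected second-order term, the sketch below compares the exact conditional mean with the first-order approximation when there is no Z (equivalently, when the Z term is absorbed into β₀). Because X* is binary, the exact value is simply a two-point average of H. The parameter values β₀ = 0 and β₁ = ln(3) and the function names are illustrative assumptions.

```python
import math


def H(v):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-v))


def exact_mean(p, beta0, beta1):
    """E{H(beta0 + beta1 X*) | Wbar} for binary X* with pr(X* = 1 | Wbar) = p."""
    return p * H(beta0 + beta1) + (1.0 - p) * H(beta0)


def first_order_mean(p, beta0, beta1):
    """Leading Taylor term: H evaluated at the conditional mean of X*."""
    return H(beta0 + beta1 * p)


beta0, beta1 = 0.0, math.log(3.0)
grid = [i / 20 for i in range(21)]  # values of pr(X* = 1 | Wbar)
max_gap = max(abs(exact_mean(p, beta0, beta1) - first_order_mean(p, beta0, beta1))
              for p in grid)
```

For these particular values the largest gap over the grid is on the order of 0.01 on the probability scale. The gap vanishes at p = 0 and p = 1 and grows with |β₁|, so the approximation should be rechecked for stronger effects.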

4 Linearization-based Regression Calibration Method

The proposed NRC approach discussed in the previous section has nice performance in general, and is consistent under a multivariate normality assumption. In this section, we aim at developing a new approximation approach that can provide an interpretation of the correction for attenuation from the naive estimator.

4.1 LRC for Linear Regression

We first consider linear regression in this subsection such that $E (Y | X^{*}, Z) = β_{0} + β_{1} X^{*} + β_{2}^{'} Z$. Under this model, we have that $E (Y | X, Z) = β_{0} + β_{1} I [X \geq x_{0}] + β_{2}^{'} Z$. For any fixed Z, we let $β_{0} + β_{2}^{'} Z$ be denoted by γ₀. The idea is to approximate the step function Y = γ₀ + β₁I[X ≥ x₀] by a straight line in X. Figure 1 provides a visual demonstration of the linearization. Among the points with X ≤ x₀, most points are in the range (μ_x − 2σ_x, x₀). Hence, the straight line approximating the step function is taken to pass through ((x₀ + μ_x − 2σ_x)/2, γ₀), the midpoint of this range. Similarly, among the points with X ≥ x₀, most points are in the range (x₀, μ_x + 2σ_x). Therefore, the straight line is also taken to pass through ((x₀ + μ_x + 2σ_x)/2, γ₀ + β₁). The line through these two points has slope β₁/(2σ_x), so that $E (Y | X, Z) \approx θ_{0} + \{ β_{1} / (2 σ_{x}) \} X + β_{2}^{'} Z$, where θ₀ = β₀ + (β₁/2) − (x₀ + μ_x)β₁/(4σ_x) is a new intercept. Note that because X_i is not observed, the above model cannot be directly applied to data analysis. From this approximation, we can further express the conditional expectation of Y given the observed $\bar{W}$, Z as:

$$E(Y \mid \bar{W}, Z) \approx θ_{0} + \{ β_{1} / (2 σ_{x}) \} E(X \mid \bar{W}, Z) + β_{2}^{'} Z . \tag{5}$$

Figure 1. Approximation of a step function with a jump at x₀ in the range (μ_x − 2σ_x, μ_x + 2σ_x).
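The slope and intercept of the linearization can be checked with a few lines of arithmetic. The sketch below builds the line through the two anchor points described above (with no Z, so that γ₀ = β₀) and compares it with the stated expressions for the slope and for θ₀. The function name `lrc_line` and the numerical values are hypothetical.

```python
def lrc_line(beta0, beta1, x0, mu_x, sigma_x):
    """Slope and intercept of the line through the two anchor points (no Z)."""
    x_low, y_low = (x0 + mu_x - 2 * sigma_x) / 2, beta0
    x_high, y_high = (x0 + mu_x + 2 * sigma_x) / 2, beta0 + beta1
    slope = (y_high - y_low) / (x_high - x_low)
    intercept = y_low - slope * x_low
    return slope, intercept


# Hypothetical parameter values.
beta0, beta1, x0, mu_x, sigma_x = -1.0, 2.0, 0.3, 0.1, 2.0
slope, intercept = lrc_line(beta0, beta1, x0, mu_x, sigma_x)

# Expressions stated in the text.
slope_formula = beta1 / (2 * sigma_x)
theta0_formula = beta0 + beta1 / 2 - (x0 + mu_x) * beta1 / (4 * sigma_x)
```

The two anchor points are always 2σ_x apart horizontally and β₁ apart vertically, which is why the slope does not depend on x₀ or μ_x.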

The above approximation (5) is similar in spirit to the RC approach for linear models (Carroll et al. [8], Chapter 3), but in our problem the unobserved covariate is X* (the binary covariate split by the median), not X. From the above approximation, we can correct for measurement error in X, or misclassification in X*. This is normally done by assuming a bivariate-normal model for the joint distribution of (X, U). If $E (X | \bar{W}, Z) = E (X | \bar{W})$, then we can further express the above association as:

$$E (Y_{i} \mid {\bar{W}}_{i}, Z_{i}) \approx θ_{0}^{*} + \frac{β_{1} σ_{x}}{2 \{ σ_{x}^{2} + (σ_{u}^{2} / k_{i}) \}} {\bar{W}}_{i} + β_{2}^{'} Z_{i},$$

where $θ_{0}^{*} = β_{0} + (β_{1} / 2) - \{ (x_{0} + μ_{x}) / (4 σ_{x}) \} β_{1} + [μ_{x} σ_{u}^{2} / \{ 2 k_{i} σ_{x} (σ_{x}^{2} + (σ_{u}^{2} / k_{i})) \}] β_{1}$. Note that for a given Z_i, the next step in the approximation is to estimate $E ({\bar{W}}_{i} | {\bar{W}}_{i}^{*})$. If ${\bar{W}}_{i}$ is approximately normally distributed, then by some calculation we have that $E ({\bar{W}}_{i} | {\bar{W}}_{i}^{*}) \approx c (x_{0}) + 1.6 \sqrt{σ_{x}^{2} + (σ_{u}^{2} / k_{i})} \, {\bar{W}}_{i}^{*}$, where c(x₀) depends on x₀. For example, if x₀ is the median, then $c (x_{0}) \approx μ_{x} - 0.8 \sqrt{σ_{x}^{2} + (σ_{u}^{2} / k_{i})}$. If x₀ is the first quartile, then $c (x_{0}) \approx μ_{x} - 1.2 \sqrt{σ_{x}^{2} + (σ_{u}^{2} / k_{i})}$. If x₀ is the third quartile, then $c (x_{0}) \approx μ_{x} - 0.4 \sqrt{σ_{x}^{2} + (σ_{u}^{2} / k_{i})}$.
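The constants in this step can be reproduced from the means of a truncated standard normal variable. The sketch below assumes exact normality and returns the conditional means below and above a cut-off placed at the q-th quantile; the function name is hypothetical. For a variable with mean μ and standard deviation s, the lower conditional mean is μ plus s times the first returned value, and the coefficient of the binary indicator is s times the difference of the two values.

```python
from statistics import NormalDist


def dichotomized_means(q):
    """(E[V | V < cut], E[V | V >= cut]) for standard normal V cut at quantile q."""
    std = NormalDist()
    cut = std.inv_cdf(q)
    density = std.pdf(cut)
    below = -density / q
    above = density / (1.0 - q)
    return below, above


for q in (0.25, 0.5, 0.75):
    below, above = dichotomized_means(q)
    print(f"q={q}: lower mean {below:.3f}, jump {above - below:.3f}")
```

Under these assumptions the lower means come out at roughly −1.27, −0.80 and −0.42 for the first quartile, median and third quartile, and the jump is about 1.60 at the median and about 1.70 at the quartiles. These are close to, but not identical with, the rounded constants quoted above.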

Hence, by further simple calculations, we can obtain the following approximation:

$$E (Y_{i} \mid {\bar{W}}_{i}^{*}, Z_{i}) \approx θ_{0}^{+} + 0.8 \frac{σ_{x}}{\sqrt{σ_{x}^{2} + (σ_{u}^{2} / k_{i})}} β_{1} {\bar{W}}_{i}^{*} + β_{2}^{'} Z_{i}, \tag{6}$$

where $θ_{0}^{+} = β_{0} + (β_{1} / 2) - \{ (x_{0} + μ_{x}) / (4 σ_{x}) \} β_{1} + [μ_{x} σ_{u}^{2} / \{ 2 k_{i} σ_{x} (σ_{x}^{2} + (σ_{u}^{2} / k_{i})) \}] β_{1} + \{ σ_{x} β_{1} c (x_{0}) \} / [2 \{ σ_{x}^{2} + (σ_{u}^{2} / k_{i}) \}]$. From the above approximation, our LRC estimator replaces $X_{i}^{*}$ by the following calibration function

$${\hat{X}}_{i}^{*} = 0.8 \frac{σ_{x}}{\sqrt{σ_{x}^{2} + (σ_{u}^{2} / k_{i})}} {\bar{W}}_{i}^{*} . \tag{7}$$

From the approximation given above, if $E (X | \bar{W}, Z) \approx E (X | \bar{W})$, then naively using the observed quantiles of an error-prone variable would likely lead to attenuation bias. Under the linear model, a correction for attenuation for β₁ can be obtained from the approximation given above. By (6), if there are k replicates of W for all subjects (k_i = k for all i), then the attenuation factor is $0.8 σ_{x} / \sqrt{σ_{x}^{2} + (σ_{u}^{2} / k)}$. That is, let ${\hat{β}}_{N, 1}$ be the naive estimate of β₁ obtained by using the observed quantile ${\bar{W}}_{i}^{*}$; then the LRC estimator for β₁ can be obtained by dividing the naive estimator ${\hat{β}}_{N, 1}$ by the attenuation factor $0.8 σ_{x} / \sqrt{σ_{x}^{2} + (σ_{u}^{2} / k)}$. However, the standard error of the LRC estimator cannot be obtained directly from the attenuation factor. It can be obtained by a sandwich variance estimator where the vector of estimating equations is obtained by stacking the estimating equations for β and the nuisance parameters discussed above (see the Appendix).
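The correction for attenuation can be tried on simulated data. The sketch below is a simplified, assumed setup (standard normal X, median cut-off, no Z, the same k for every subject, known σ_x and σ_u), not a reproduction of the simulation study. With a single binary covariate, the least-squares slope equals the difference in mean outcome between the two groups, so no regression library is needed. All function names and settings are hypothetical.

```python
import math
import random


def attenuation_factor(sigma_x, sigma_u, k):
    """LRC attenuation factor 0.8 * sigma_x / sqrt(sigma_x^2 + sigma_u^2 / k)."""
    return 0.8 * sigma_x / math.sqrt(sigma_x ** 2 + sigma_u ** 2 / k)


def simulate_naive_slope(n, beta0, beta1, sigma_u, k, seed=1):
    """Naive slope: difference in mean Y between Wbar* = 1 and Wbar* = 0.

    X ~ N(0, 1), cut-off x0 = 0, Y linear in X* with N(0, 1) noise.
    """
    rng = random.Random(seed)
    sums = [0.0, 0.0]
    counts = [0, 0]
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = beta0 + beta1 * (x >= 0.0) + rng.gauss(0.0, 1.0)
        wbar = x + sum(rng.gauss(0.0, sigma_u) for _ in range(k)) / k
        group = int(wbar >= 0.0)  # observed, possibly misclassified, indicator
        sums[group] += y
        counts[group] += 1
    return sums[1] / counts[1] - sums[0] / counts[0]


naive = simulate_naive_slope(n=20000, beta0=0.5, beta1=1.0, sigma_u=0.5, k=2)
corrected = naive / attenuation_factor(sigma_x=1.0, sigma_u=0.5, k=2)
```

In this setup the naive slope falls well short of the true value of 1, and dividing by the attenuation factor moves it much closer. The corrected value is not expected to hit 1 exactly, because the LRC correction rests on a linear approximation. In practice σ_x and σ_u would have to be estimated from the replicates.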

4.2 LRC for Logistic Regression

We consider logistic regression such that $E (Y_{i} | X_{i}^{*}, Z_{i}) = H (β_{0} + β_{1} X_{i}^{*} + β_{2}^{'} Z_{i})$. As discussed in the previous section, by a Taylor expansion, if the disease is rare, then $E (Y_{i} | {\bar{W}}_{i}, Z_{i})$ can be approximated by $H \{ β_{0} + β_{1} E (X_{i}^{*} | {\bar{W}}_{i}, Z_{i}) + β_{2}^{'} Z_{i} \}$. Hence, following an approximation similar to that for $E (β_{0} + β_{1} I [X_{i} \geq x_{0}] | {\bar{W}}_{i}, Z_{i})$ given in (6) of the previous subsection, we have that

$$E (Y_{i} \mid {\bar{W}}_{i}^{*}, Z_{i}) \approx H \left\{ θ_{0}^{+} + 0.8 \frac{σ_{x}}{\sqrt{σ_{x}^{2} + (σ_{u}^{2} / k_{i})}} β_{1} {\bar{W}}_{i}^{*} + β_{2}^{'} Z_{i} \right\} .$$

Hence, if we naively use the observed quantiles for a variable that is error-prone, then the naive estimator would likely have an attenuation bias. If there are k replicates for W for all subjects (k_i = k for all i), then the attenuation factor is $0.8 σ_{x} / \sqrt{σ_{x}^{2} + (σ_{u}^{2} / k)}$ . Let ${\hat{β}}_{N, 1}$ be the naive estimate of β₁ by using the observed dichotomized ${\bar{W}}_{i}^{*}$ . To adjust for bias, the LRC estimator for β₁ can be obtained by dividing the naive estimator ${\hat{β}}_{N, 1}$ by the attenuation factor $0.8 σ_{x} / \sqrt{σ_{x}^{2} + (σ_{u}^{2} / k)}$ .
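A large-sample version of this correction can be worked out without simulation in one special case: X normal, x₀ equal to the median, no Z, and k replicates for everyone. The sketch below relies on the standard bivariate-normal orthant probability, pr(X ≥ μ, W̄ ≥ μ) = 1/4 + arcsin(ρ)/(2π), which is not taken from this paper and is used here only as a checking device. Function names and parameter values are hypothetical.

```python
import math


def H(v):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-v))


def logit(p):
    return math.log(p / (1.0 - p))


def naive_log_odds_ratio(beta0, beta1, sigma_x, sigma_u, k):
    """Large-sample naive slope for a median split of normal X (no Z)."""
    rho = sigma_x / math.sqrt(sigma_x ** 2 + sigma_u ** 2 / k)  # corr(X, Wbar)
    agree = 0.5 + math.asin(rho) / math.pi  # pr(X* = 1 | Wbar* = 1)
    p1 = agree * H(beta0 + beta1) + (1.0 - agree) * H(beta0)   # pr(Y=1 | Wbar*=1)
    p0 = (1.0 - agree) * H(beta0 + beta1) + agree * H(beta0)   # pr(Y=1 | Wbar*=0)
    return logit(p1) - logit(p0)


beta0, beta1 = 0.0, math.log(3.0)
sigma_x, sigma_u, k = 1.0, 0.5, 2
naive = naive_log_odds_ratio(beta0, beta1, sigma_x, sigma_u, k)
factor = 0.8 * sigma_x / math.sqrt(sigma_x ** 2 + sigma_u ** 2 / k)
corrected = naive / factor
```

For these values the naive log odds ratio is visibly attenuated relative to ln(3), and the corrected value lies much closer to it, although slightly above. This is in line with the LRC estimator being an approximate rather than a consistent correction.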

5 Simulation Study

We conducted a simulation study to compare the naive estimator and the proposed NRC and LRC estimators. The naive estimator applied the usual regression analysis by replacing the unobserved $X_{i}^{*}$ by ${\bar{W}}_{i}^{*}$. The NRC estimator replaced $X_{i}^{*}$ by (4), and the LRC estimator replaced $X_{i}^{*}$ by $0.8 σ_{x} / \sqrt{σ_{x}^{2} + (σ_{u}^{2} / k_{i})} \, {\bar{W}}_{i}^{*}$ as in (7), and then we corrected the intercept by adjusting for the difference between $θ_{0}^{+}$ and β₀. In Table 1, we considered the situation when Y_i given $X_{i}^{*}$ was linear. We first generated covariates X_i, i = 1, …, n, from a standard normal distribution. The true binary $X_{i}^{*}$ was 1 if X_i ≥ μ_x, or 0 otherwise. Then Y_i, i = 1, …, n, were generated by $Y_{i} = β_{0} + β_{1} X_{i}^{*} + e_{i}$ where e_i was from a standard normal distribution. The surrogate variables W_ij, j = 1, …, k_i, were generated by W_ij = X_i + U_ij, where U_ij was normal with mean 0 and standard deviation σ_u = 0.5 or 1.0. Half of the individuals had k_i = 2, and the other half had k_i = 1. This mimicked a situation in which a reliability sample with replicates for W was available. The binary surrogate $W_{ij}^{*}$ was 1 if W_ij ≥ μ_x, or 0 otherwise. The sample size in the simulation was n = 500 and n = 1000, respectively. In the tables, the “biases” were calculated by taking the average of $\hat{β} - β$ from 500 replicates, “SD” denotes the sample standard deviation of the estimators, and “ASE” denotes the average of the estimated standard errors of the estimators. The 95% confidence interval coverage probabilities (“CP”) are also included. All the parameters are given in the tables.

Table 1.

Simulation study when Y given X* was linear and X was normal

		n = 500			n = 1000
		Naive	NRC	LRC	Naive	NRC	LRC
σ_u = 0.5
β₀	Bias	0.127	0.000	−0.007	0.126	−0.003	−0.009
	SD	0.068	0.077	0.082	0.051	0.055	0.058
	ASE	0.067	0.074	0.080	0.047	0.054	0.057
	CP	0.508	0.958	0.944	0.232	0.950	0.952
β₁	Bias	−0.259	−0.005	0.009	−0.255	0.004	0.016
	SD	0.096	0.118	0.131	0.069	0.078	0.089
	ASE	0.094	0.120	0.128	0.067	0.080	0.091
	CP	0.222	0.954	0.944	0.060	0.946	0.960

		Naive	NRC	LRC	Naive	NRC	LRC
σ_u = 1.0
β₀	Bias	0.222	−0.002	0.041	0.221	−0.004	0.040
	SD	0.069	0.091	0.092	0.047	0.065	0.068
	ASE	0.069	0.088	0.096	0.048	0.064	0.067
	CP	0.088	0.960	0.920	0.002	0.946	0.900
β₁	Bias	−0.430	0.005	−0.083	−0.443	0.009	−0.080
	SD	0.097	0.151	0.159	0.070	0.112	0.118
	ASE	0.097	0.156	0.162	0.068	0.104	0.114
	CP	0.006	0.944	0.922	0.000	0.946	0.868

NOTE: The naive estimator replaced X* by W*, NRC was the normality-based RC estimator, and LRC was the linearization-based RC estimator. The regression parameters were β = (0.5,1). The nuisance parameters were μ_x = 0, σ_x = 1. The results were obtained from 500 replicates.

From Table 1, it was seen that the naive estimator had large biases. The NRC estimator performed very well since it was a consistent estimator when X was normal and the measurement error was normal. The LRC estimator was not as good as the NRC estimator in terms of bias and efficiency. The LRC’s performance was in general acceptable when σ_u = 0.5, but the bias was larger when σ_u = 1.0. If we consider the bias relative to the true slope, then for σ_u = 1.0 the relative bias was about 43% to 44% for the naive estimator but only about 8% for the LRC estimator. Hence, although the LRC estimator was not consistent (unlike the NRC estimator), it did serve well in terms of bias correction.
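The size of the naive bias in Table 1 can be cross-checked against a large-sample calculation. The sketch below uses the bivariate-normal orthant probability pr(X ≥ μ, W̄ ≥ μ) = 1/4 + arcsin(ρ)/(2π), a standard result that is not part of this paper, together with the design in which half of the subjects have one replicate and half have two. The function names are hypothetical, and the calculation covers only the slope of the naive estimator in the linear model with normal X.

```python
import math


def naive_slope_fraction(sigma_x, sigma_u, k):
    """Limit of (naive slope) / beta1 for a median split of normal X."""
    rho = sigma_x / math.sqrt(sigma_x ** 2 + sigma_u ** 2 / k)  # corr(X, Wbar)
    return 2.0 * math.asin(rho) / math.pi


def predicted_naive_bias(beta1, sigma_u, sigma_x=1.0):
    """Half the subjects have k_i = 1 and half have k_i = 2, as in Table 1."""
    fraction = 0.5 * (naive_slope_fraction(sigma_x, sigma_u, 1)
                      + naive_slope_fraction(sigma_x, sigma_u, 2))
    return beta1 * (fraction - 1.0)


bias_small_error = predicted_naive_bias(beta1=1.0, sigma_u=0.5)
bias_large_error = predicted_naive_bias(beta1=1.0, sigma_u=1.0)
```

Under these assumptions the predicted biases are roughly −0.26 and −0.45, close to the simulated naive biases for β₁ reported in Table 1 for σ_u = 0.5 and σ_u = 1.0.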

The data used for Table 2 were generated similarly to those for Table 1 except that the distribution of X was skewed. The underlying exposure variables X were generated from a mixture of two normals with means $(1/\sqrt{5}, -2/\sqrt{5})$, variances (4/5, 1/5), and mixture proportions (2/3, 1/3). The purpose of the simulation setup was to understand whether the proposed NRC and LRC estimators were sensitive to the normality assumption on X. We note that when X was not normal, the NRC estimator was no longer consistent. From Table 2, it was seen that the performance of NRC and LRC was still satisfactory.
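A quick calculation confirms that this mixture keeps the mean at 0 and the variance at 1, so that only the shape of the distribution changes relative to Table 1. The sketch below computes the first three moments of the mixture from the component parameters; the variable names are illustrative.

```python
import math

weights = (2 / 3, 1 / 3)
means = (1 / math.sqrt(5), -2 / math.sqrt(5))
variances = (4 / 5, 1 / 5)

# Mixture mean and variance from the component moments.
mix_mean = sum(p * m for p, m in zip(weights, means))
mix_var = sum(p * (v + m ** 2)
              for p, m, v in zip(weights, means, variances)) - mix_mean ** 2

# Third central moment: for a N(m, v) component, E[(X - c)^3] = d^3 + 3 d v
# with d = m - c.
third = sum(p * ((m - mix_mean) ** 3 + 3 * (m - mix_mean) * v)
            for p, m, v in zip(weights, means, variances))
skewness = third / mix_var ** 1.5
```

The resulting skewness is about 0.36, so the departure from normality in this setup is moderate rather than extreme.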

Table 2.

Simulation study when Y given X* was linear and X was mixture-normal

		n = 500			n = 1000
		Naive	NRC	LRC	Naive	NRC	LRC
σ_u = 0.5
β₀	Bias	0.096	−0.019	−0.047	0.098	−0.017	−0.044
	SD	0.066	0.075	0.078	0.047	0.052	0.056
	ASE	0.065	0.072	0.077	0.046	0.051	0.054
	CP	0.688	0.930	0.896	0.402	0.918	0.872
β₁	Bias	−0.213	0.018	0.071	−0.217	0.013	0.066
	SD	0.094	0.114	0.128	0.066	0.078	0.090
	ASE	0.094	0.111	0.127	0.066	0.079	0.090
	CP	0.400	0.936	0.910	0.102	0.956	0.884

		Naive	NRC	LRC	Naive	NRC	LRC
σ_u = 1.0
β₀	Bias	0.180	−0.050	0.014	0.181	−0.049	0.012
	SD	0.071	0.096	0.101	0.046	0.063	0.064
	ASE	0.067	0.089	0.094	0.047	0.063	0.066
	CP	0.242	0.894	0.934	0.038	0.884	0.958
β₁	Bias	−0.409	0.046	−0.023	−0.404	0.053	−0.015
	SD	0.098	0.155	0.166	0.067	0.107	0.112
	ASE	0.096	0.151	0.161	0.068	0.107	0.114
	CP	0.006	0.930	0.926	0.000	0.928	0.962

NOTE: The naive estimator replaced X* by W*, NRC was the normality-based RC estimator, and LRC was the linearization-based RC estimator. The regression parameters were β = (0.5,1.0). The nuisance parameters were μ_x = 0, σ_x = 1. The results were obtained from 500 replicates.

In Table 3, we studied simulations when the regression model was logistic such that $pr (Y_{i} = 1 | X_{i}^{*}) = \{ 1 + exp (- β_{0} - β_{1} X_{i}^{*}) \}^{- 1}$, where β₀ = 0 and β₁ = ln(3). The variables $X_{i}, W_{i}, X_{i}^{*}, {\bar{W}}_{i}^{*}$ were all generated similarly to those in Table 1. The results under logistic regression were very similar to those in Table 1 (linear regression). The NRC estimator did the best in terms of bias correction. The LRC estimator was not as good as NRC, but it was a satisfactory estimator with a good interpretation as a correction for attenuation.

Table 3.

Simulation study when Y given X* was logistic and X was normal

		n = 500			n = 1200
		Naive	NRC	LRC	Naive	NRC	LRC
σ_u = 0.5
β₀	Bias	0.143	0.005	0.000	0.032	−0.007	−0.013
	SD	0.118	0.135	0.141	0.094	0.108	0.113
	ASE	0.127	0.146	0.152	0.090	0.103	0.108
	CP	0.820	0.968	0.968	0.682	0.936	0.948
β₁	Bias	−0.313	−0.029	−0.028	−0.296	−0.010	−0.006
	SD	0.181	0.233	0.245	0.142	0.181	0.193
	ASE	0.190	0.240	0.259	0.134	0.169	0.183
	CP	0.616	0.962	0.958	0.430	0.938	0.946

		Naive	NRC	LRC	Naive	NRC	LRC
σ_u = 1.0
β₀	Bias	0.219	−0.032	0.017	0.225	−0.014	0.032
	SD	0.134	0.181	0.188	0.086	0.121	0.120
	ASE	0.128	0.174	0.177	0.090	0.122	0.124
	CP	0.598	0.944	0.940	0.292	0.954	0.952
β₁	Bias	−0.476	0.032	−0.069	−0.498	0.014	−0.111
	SD	0.198	0.315	0.326	0.132	0.215	0.218
	ASE	0.188	0.307	0.313	0.133	0.215	0.219
	CP	0.296	0.946	0.936	0.032	0.954	0.926

NOTE: The naive estimator replaced X* by W*, NRC was the normality-based RC estimator, and LRC was the linearization-based RC estimator. The regression parameters were β = (0,ln (3)). The nuisance parameters were μ_x = 0, σ_x = 1. The results were obtained from 500 replicates.

The data used for Table 4 were generated similarly to those for Table 3 except that the distribution of X was skewed. The underlying exposure variables X were generated from a mixture of two normals with means $(1/\sqrt{5}, -2/\sqrt{5})$, variances (4/5, 1/5), and mixture proportions (2/3, 1/3). The simulation setup aimed to assess whether the proposed NRC and LRC estimators were sensitive to the normality assumption on X. From the results in Table 4, it was seen that the performance of NRC and LRC was still satisfactory.

Table 4.

Simulation study when Y given X* was logistic and X was mixture-normal

		n = 500			n = 1200
		Naive	NRC	LRC	Naive	NRC	LRC
σ_u = 0.5
β₀	Bias	0.091	−0.030	−0.063	0.102	−0.024	−0.049
	SD	0.131	0.143	0.155	0.088	0.098	0.104
	ASE	0.124	0.140	0.149	0.087	0.099	0.105
	CP	0.868	0.948	0.928	0.780	0.952	0.914
β₁	Bias	−0.313	−0.029	−0.028	−0.262	−0.006	−0.040
	SD	0.181	0.233	0.245	0.129	0.156	0.175
	ASE	0.190	0.240	0.259	0.135	0.166	0.183
	CP	0.616	0.962	0.958	0.484	0.968	0.952

		Naive	NRC	LRC	Naive	NRC	LRC
σ_u = 1.0
β₀	Bias	0.184	−0.058	−0.025	0.181	−0.063	−0.025
	SD	0.128	0.169	0.178	0.087	0.119	0.119
	ASE	0.126	0.171	0.175	0.089	0.120	0.123
	CP	0.692	0.954	0.954	0.442	0.916	0.946
β₁	Bias	−0.455	0.032	−0.037	−0.464	0.026	−0.054
	SD	0.190	0.300	0.318	0.130	0.214	0.216
	ASE	0.188	0.303	0.313	0.133	0.212	0.219
	CP	0.332	0.966	0.960	0.056	0.952	0.946


6 Data Analysis of the NEW Trial

The NEW study was briefly described in the introduction. In the data analysis in this section, we investigated the association between CRP and VO₂max. We primarily considered the baseline data, except that we used the 12-month VO₂max values from the control group as the second replicate measurement of VO₂max. That is, the VO₂max values at baseline and 12 months from the control group (n = 87) were used as a reliability subsample to estimate the measurement error variance. We did not use the 12-month VO₂max values from the three intervention groups since those values were slightly higher than their baseline values. In our analysis the unit of VO₂max was ml/(kg · min), and hence the intervention effect on weight change also contributed to the increase of VO₂max among the intervention individuals. In this data application, we were interested in the effect of median-split binary VO₂max and body mass index (BMI) on CRP. BMI is a measure of relative size based on the mass and height of an individual, calculated by dividing weight in kilograms by the square of height in meters.

Among the 439 individuals of the study, we excluded one individual with missing baseline CRP and 32 individuals who had CRP values higher than 10 mg/L which could be due to acute illness [1]. In addition, we also excluded three individuals with baseline VO₂max less than 10 ml/(kg · min) which was a threshold typically required to complete basic activities of daily living [23]. As a result, 403 individuals of the NEW study were included. We first examined an association between VO₂max and CRP at baseline. The upper portion of Figure 2 showed the scatter plot and a fitted kernel smoother of VO₂max and CRP at baseline. To understand the measurement error property of VO₂max, the lower portion of Figure 2 was for VO₂max at baseline and 12-month from the control group. The 12-month VO₂max measurements in the control group in general were not systematically different from the baseline measurements. The differences were likely due to variations in physical conditions (or efforts) at different time points, and can be treated as measurement errors.

Figure 2. The top panel shows baseline *VO₂max* and *CRP* (n = 403); the bottom panel shows *VO₂max* at baseline and 12 months (from the control group, n = 87). The curve in each scatter plot is from a kernel smoother.

The application in this section applied our methods to the regression of CRP on median-split VO₂max and BMI. Our work was focused on an application of our statistical methodology, and hence the clinical implications of the application here should be further investigated in clinical research. We estimated the regression coefficients using the naive, NRC, and LRC estimators. The results are given in Table 5. All three estimators showed that VO₂max was significantly associated with the inflammatory marker CRP, but with different magnitudes. From the naive estimator, individuals with higher than median VO₂max had on average about 0.45 mg/L lower CRP than individuals with lower than median VO₂max. From the NRC and LRC estimators, the difference in CRP between the two median-split groups increased to 0.87 mg/L and 0.68 mg/L, respectively. The effects of VO₂max from the NRC and LRC estimators were larger than that from the naive estimator, which can be explained by the correction for attenuation discussed in Section 4. The standard errors of the NRC and LRC estimates were both larger than those of the naive estimates, which is consistent with the bias-efficiency trade-off commonly seen in the measurement error literature. In addition, all three estimators showed a similar significant effect of BMI on CRP: on average, an increase of 1 kg/m² in BMI was associated with an increase of about 0.16 mg/L in CRP.

Table 5.

Modelling the relationship between VO₂max (ml/(kg · min)) and CRP (mg/L) using naive, NRC and LRC in the NEW trial (n = 403). The outcome variable was CRP. The covariate VO₂max was the true underlying binary median-split variable that may be misclassified due to measurement error in the continuous VO₂max.

		Naive	NRC	LRC
$\hat{β}_{0}$ (intercept)	Estimate	−2.072	−1.447	−1.940
	SE	0.955	1.094	1.037
$\hat{β}_{1}$ (VO₂max)	Estimate	−0.446	−0.874	−0.684
	SE	0.220	0.366	0.336
$\hat{β}_{2}$ (BMI)	Estimate	0.166	0.153	0.166
	SE	0.029	0.032	0.031
Nuisance parameters
$\hat{μ}_{x}$	Estimate	24.960
	SE	0.254
$\hat{σ}_{x}$	Estimate	3.132
	SE	0.131
$\hat{σ}_{u}$	Estimate	3.512
	SE	0.003


Note: The naive estimator replaced X* by W*, NRC was the normality-based RC estimator, and LRC was the linearity-based RC estimator.
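The significance statements for VO₂max can be checked directly from Table 5. The sketch below takes the second number reported for each coefficient as its standard error and uses a normal (Wald) approximation; both are assumptions, and `wald_summary` is a hypothetical helper rather than part of the authors' analysis.

```python
import math

# (estimate, standard error) for the VO2max coefficient, read from Table 5.
TABLE5_VO2MAX = {
    "Naive": (-0.446, 0.220),
    "NRC": (-0.874, 0.366),
    "LRC": (-0.684, 0.336),
}

def wald_summary(estimate, se, z_crit=1.959964):
    """Return the Wald z statistic, two-sided normal p-value, and 95% CI."""
    z = estimate / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    ci = (estimate - z_crit * se, estimate + z_crit * se)
    return z, p_value, ci

if __name__ == "__main__":
    for name, (est, se) in TABLE5_VO2MAX.items():
        z, p, (lo, hi) = wald_summary(est, se)
        print(f"{name}: z = {z:.2f}, p = {p:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Under these assumptions all three intervals exclude zero, while the NRC and LRC intervals are wider than the naive one, which is the bias-efficiency trade-off noted in the text.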

7 Discussion

Most of the details of the methods in this paper are based on dichotomization of an error-prone exposure variable. The methods can also be applied when the covariates are q-quantiles of an error-prone exposure variable; for example, q = 4 for quartiles and q = 5 for quintiles. For simplicity, we consider linear regression in which the q-quantiles are covariates. We can write the regression model as $Y_{i} = β_{0} + \sum_{j = 1}^{q - 1} β_{j} I [X_{i} \geq Q_{j}] + e_{i}$, in which Q_j is the j-th q-quantile. If q = 4, then Q₁ is the first quartile, Q₂ is the median, and Q₃ is the third quartile. If we denote I[X_i ≥ Q_j] by $X_{i, j}^{*}$, then the methods in Section 3 (NRC) and Section 4 (LRC) can be applied to estimate $X_{i, j}^{*}$. The extension of modeling quantiles as covariates from linear regression to logistic regression is straightforward using the approaches in Sections 3.2 and 4.2. For survival analysis with censored outcomes, the methods could be extended, but further methodological research is needed.
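The indicator covariates in the q-quantile model can be sketched as follows. This only builds the design columns I[X_i ≥ Q_j] from whatever variable is supplied; it is not the NRC or LRC correction itself. Using sample quantiles for Q_j is an assumption, and `quantile_indicators` is a hypothetical name.

```python
import numpy as np

def quantile_indicators(x, q):
    """Build the n x (q - 1) matrix of indicators I[x_i >= Q_j], j = 1..q-1.

    Q_j is taken to be the sample j/q quantile of x (an assumption).
    Returns the indicator matrix and the cutoffs.
    """
    x = np.asarray(x, dtype=float)
    probs = np.arange(1, q) / q
    cutoffs = np.quantile(x, probs)
    indicators = (x[:, None] >= cutoffs[None, :]).astype(float)
    return indicators, cutoffs

if __name__ == "__main__":
    x = np.arange(1, 101)
    ind, cuts = quantile_indicators(x, q=4)
    print(cuts)             # the three quartile cutoffs
    print(ind.sum(axis=0))  # number of subjects at or above each cutoff
```

Because the indicators are cumulative, each β_j in the model is the change in mean outcome between adjacent quantile groups rather than a contrast with the lowest group. Applying the function to an error-prone W instead of X yields the naive covariates that the proposed methods aim to correct.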

We have proposed two bias correction methods for regression analysis in which quantiles of an error-prone exposure variable are used as covariates. Although the proposed estimators may not be consistent, they remain practical for the analysis of real data. For example, in biomedical research, analysis results are often presented by comparing the risk at different ranges (such as quartiles) of the exposures. The proposed estimators have limited bias, especially when the exposure effect is not too large, which is often the case for environmental factors. Hence, the methods are likely to be useful for biomedical data.

In many studies, a calibration sample (or reliability sample) is used to estimate the within-subject and between-subject variations (Guo et al. [24], White and Xie [25]). If there are two replicates in the calibration sample, then k_i = 2 for individuals in the calibration sample, while k_i = 1 for the rest of the cohort. The statistical model for W in Section 2 can be treated as one with an internal reliability sample, say with k_i = 2. The simulation study and data analysis can be considered as the setup with an internal reliability sample. If, in an application, the reliability sample is not a subset of the main cohort, then it is an external reliability sample. The main difference in analysis between internal and external reliability samples is in the standard error estimation of the estimators. For an internal reliability sample, there is a need to adjust for uncertainty due to the estimation of the measurement error parameters, as was done in this paper. For an external reliability sample, there is no need to adjust for uncertainty due to estimation of the measurement error parameters.

Supplementary Material

Supp info S1: NIHMS736504-supplement-Supp_infoS1.pdf (110.2 KB, pdf)

Acknowledgments

This research was partially supported by National Institutes of Health grants AI112341 (Wang), CA53996 (Wang), ES017030 (Wang and Tapsoba), HL121347 (Wang, Tapsoba, Duggan and McTiernan), CA 105204 (Wang, Duggan, Campbell and McTiernan), CA 116847 (Wang, Duggan, Campbell and McTiernan), the Breast Cancer Research Foundation (Wang, Duggan and McTiernan) and a travel award from the Mathematics Research Promotion Center of National Science Council of Taiwan (Wang).

Appendix: Estimating Equations and Standard Error Estimation for NRC and LRC

The NRC or LRC estimator can be obtained by solving a set of estimating equations, and the asymptotic variance can be established by a sandwich estimator [26]. Recall that the mean function of Y given $(X_{i}^{*}, Z_{i})$ is $g (β_{0} + β_{1} X_{i}^{*} + β_{2}^{'} Z_{i})$ . Let the replacement value (calibration function) of $X_{i}^{*}$ for NRC or LRC be denoted by $\hat{X_{i}^{*}}$ . When X and Z are independent (or when Z is not in the main regression model), μ_x, σ_x and σ_u are the nuisance parameters involved in the calculation of $\hat{X_{i}^{*}}$ via (4) or (7). The estimating equations for β and (μ_x, σ_x, σ_u) can be expressed as below.

$$
\begin{cases}
\sum_{i = 1}^{n} \begin{pmatrix} 1 \\ \hat{X}_{i}^{*} \\ Z_{i} \end{pmatrix} \left\{ Y_{i} - g (β_{0} + β_{1} \hat{X}_{i}^{*} + β_{2}^{'} Z_{i}) \right\} = 0; \\
\sum_{i = 1}^{n} (\bar{W}_{i} - μ_{x}) = 0; \\
\sum_{i = 1}^{n} \left\{ \sum_{j = 1}^{k_{i}} (W_{ij} - \bar{W}_{i \cdot})^{2} - (k_{i} - 1) σ_{u}^{2} \right\} = 0; \\
\sum_{i = 1}^{n} \left\{ (\bar{W}_{i} - μ_{x})^{2} - σ_{x}^{2} - σ_{u}^{2} / k_{i} \right\} = 0.
\end{cases}
$$

If X and Z are dependent, then the calibration function $E (X^{*} | \bar{W}, Z)$ (as in (3) of Section 3) is slightly more complicated since there are a few more nuisance parameters to be estimated [26]. See also Wang, et al. [20, Section 3.2] for a similar sandwich covariance estimator.
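When X and Z are independent, the last three estimating equations in the system above do not involve β and can be solved in closed form. The following sketch implements those moment solutions for replicate data with unequal k_i. The function name is hypothetical, and truncating a negative variance estimate at zero is an added safeguard that is not part of the equations.

```python
import numpy as np

def nuisance_moments(replicates):
    """Solve the estimating equations for (mu_x, sigma_x, sigma_u).

    replicates: list with one sequence of measurements W_i1..W_ik_i per
    subject; at least one subject must have k_i >= 2.
    """
    k = np.array([len(w) for w in replicates], dtype=float)
    w_bar = np.array([np.mean(w) for w in replicates])
    # sum_i (W_bar_i - mu_x) = 0
    mu_x = w_bar.mean()
    # sum_i { sum_j (W_ij - W_bar_i)^2 - (k_i - 1) sigma_u^2 } = 0
    within_ss = sum(
        np.sum((np.asarray(w, dtype=float) - m) ** 2)
        for w, m in zip(replicates, w_bar)
    )
    sigma_u2 = within_ss / np.sum(k - 1.0)
    # sum_i { (W_bar_i - mu_x)^2 - sigma_x^2 - sigma_u^2 / k_i } = 0
    sigma_x2 = np.mean((w_bar - mu_x) ** 2 - sigma_u2 / k)
    # Safeguard (not in the equations): truncate a negative estimate at 0.
    return mu_x, np.sqrt(max(sigma_x2, 0.0)), np.sqrt(sigma_u2)

if __name__ == "__main__":
    # Two subjects with two replicates and one subject with a single value.
    print(nuisance_moments([[1.0, 3.0], [2.0, 2.0], [5.0]]))
```

Only subjects with k_i ≥ 2 contribute to the estimate of σ_u, which is why a reliability subsample with replicates is needed. The sandwich variance would additionally require stacking these equations with the equation for β, which is not shown here.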

Footnotes

Supporting information

Additional supporting information may be found in the online version of this article at the publisher’s web site.

References

1. Foster-Schubert KE, Alfano CM, Duggan CR, Xiao L, Campbell KL, Kong A, Bain C, Wang CY, Blackburn G, McTiernan A. Effect of exercise and diet, alone or combined, on weight and body composition in overweight-to-obese post-menopausal women. Obesity. 2012;20:1628–1638. doi: 10.1038/oby.2011.76.
2. Campbell KL, Foster-Schubert KE, Alfano CM, Wang C, Wang CY, Duggan CR, Mason C, Imayama I, Kong A, Xiao L, Bain CE, Blackburn GL, Stanczyk FZ, McTiernan A. Reduced-calorie dietary weight loss, exercise, and sex hormones in postmenopausal women: randomized controlled trial. Journal of Clinical Oncology. 2012;30:2314–2326. doi: 10.1200/JCO.2011.37.9792.
3. Imayama I, Ulrich CM, Alfano CM, Wang C, Xiao L, Wener MH, Campbell KL, Duggan CR, Foster-Schubert KE, Kong A, Mason CE, Wang CY, Blackburn GL, Bain CE, Thompson HJ, McTiernan A. Effects of a caloric restriction weight loss diet and exercise on inflammatory biomarkers in overweight/obese postmenopausal women: a randomized controlled trial. Cancer Research. 2012;72:2314–2326. doi: 10.1158/0008-5472.CAN-11-3092.
4. Kullo IJ, Khaleghi M, Hensrud DD. Markers of inflammation are inversely associated with VO2 max in asymptomatic men. Journal of Applied Physiology. 2007;102:1374–1379. doi: 10.1152/japplphysiol.01028.2006.
5. Gustafson P. Measurement Error and Misclassification in Statistics and Epidemiology: Impacts and Bayesian Adjustments. Chapman and Hall; New York: 2004.
6. Qi L, Wang CY, Prentice RL. Weighted estimators for proportional hazards regression with missing covariates. J Amer Statist Assoc. 2005;100:1250–1263.
7. Natarajan L. Regression calibration for dichotomized mismeasured predictors. International Journal of Biostatistics. 2009;5(1):1143. doi: 10.2202/1557-4679.1143.
8. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective. 2nd ed. Chapman and Hall; London: 2006.
9. Buonaccorsi J. Measurement Error: Models, Methods, and Applications. Chapman and Hall/CRC; Boca Raton: 2010.
10. Kosinski A, Flanders WD. Evaluating the exposure and disease relationship with adjustment for different types of exposure misclassification: a regression approach. Stat Med. 1999;18:2795–2808. doi: 10.1002/(sici)1097-0258(19991030)18:20<2795::aid-sim192>3.0.co;2-s.
11. Huang GH, Bandeen-Roche K. Building an identifiable latent variable model with covariate effects on underlying and measured variables. Psychometrika. 2004;69:5–32.
12. Wang CY, Song X. Expected Estimating Equations via EM for Proportional Hazards Regression with Covariate Misclassification. Biostatistics. 2013;14:351–365. doi: 10.1093/biostatistics/kxs046.
13. Flegal KM, Keyl PM, Nieto FJ. Differential misclassification arising from nondifferential errors in exposure measurement. Amer J Epi. 1991;134:1233–1244. doi: 10.1093/oxfordjournals.aje.a116026.
14. Gustafson P, Le Nhu D. Comparing the effects of continuous and discrete covariate mismeasurement, with emphasis on the dichotomization of mismeasured predictors. Biometrics. 2002;4:878–87. doi: 10.1111/j.0006-341x.2002.00878.x.
15. Duggan C, Wang CY, Neuhouser M, Xiao L, Smith AW, Reding K, Baumgartner RN, Baumgartner KB, Bernstein L, Ballard-Barbash R, McTiernan A. Associations of insulin-like growth factor and insulin-like growth factor binding protein-3 with mortality in women with breast cancer. International Journal of Cancer. 2013;132:1191–1200. doi: 10.1002/ijc.27753.
16. Lyles R. A note on estimating crude odds ratios in case-control studies with differentially misclassified exposure. Biometrics. 2002;58:1034–1036. doi: 10.1111/j.0006-341x.2002.1034_1.x.
17. Küchenhoff H, Mwalili SM, Lesaffre E. A general method for dealing with misclassification in regression: the misclassification SIMEX. Biometrics. 2006;62:85–96. doi: 10.1111/j.1541-0420.2005.00396.x.
18. Dalen I, Buonaccorsi JP, Sexton JA, Laake P, Thoresen M. Correction for misclassification of a categorized exposure in binary regression using replication data. Stat Med. 2009;28:3386–3410. doi: 10.1002/sim.3712.
19. Seguin RA, Buchner D, Lui J, Messina C, Manson J, Moreland L, Allison M, Wang CY, Patel M, LaCroix AZ. Sedentary Behavior and Mortality in Older Women: The Women's Health Initiative Observational and Extension Studies. American Journal of Preventive Medicine. 2014;46:122–135. doi: 10.1016/j.amepre.2013.10.021.
20. Wang CY, Huang Y, Chao EC, Jeffcoat MK. Expected estimating equations for missing data, measurement error, and misclassification, with application to longitudinal nonignorably missing data. Biometrics. 2008;64:85–95. doi: 10.1111/j.1541-0420.2007.00839.x.
21. Wu L, Liu W, Hu XJ. Joint Inference on HIV Viral Dynamics and Immune Suppression in Presence of Measurement Errors. Biometrics. 2010;66:327–335. doi: 10.1111/j.1541-0420.2009.01308.x.
22. Yi GY, Liu W, Wu L. Simultaneous inference and bias analysis for longitudinal data with covariate measurement error and missing responses. Biometrics. 2011;67:67–75. doi: 10.1111/j.1541-0420.2010.01437.x.
23. Shephard RJ. Maximal oxygen intake and independence in old age. Br J Sports Med. 2009;43:342–346. doi: 10.1136/bjsm.2007.044800.
24. Guo Y, Little RJ, McConnell DS. On using summary statistics from an external calibration sample to correct for covariate measurement error. Epidemiology. 2012;23:165–174. doi: 10.1097/EDE.0b013e31823a4386.
25. White MT, Xie SX. Adjustment for measurement error in evaluating diagnostic biomarkers by using an internal reliability sample. Stat Med. 2013;32:4709–4725. doi: 10.1002/sim.5878.
26. Wang CY. Robust sandwich covariance estimation for regression calibration estimator in Cox regression with measurement error. Statistics and Probability Letters. 1999;45:371–378.
