Expected Estimating Equation using Calibration Data for Generalized Linear Models with a Mixture of Berkson and Classical Errors in Covariates

Jean de Dieu Tapsoba; Shen-Ming Lee; Ching-Yun Wang

doi:10.1002/sim.5966

. Author manuscript; available in PMC: 2015 Feb 20.

Published in final edited form as: Stat Med. 2013 Sep 6;33(4):675–692. doi: 10.1002/sim.5966

Expected Estimating Equation using Calibration Data for Generalized Linear Models with a Mixture of Berkson and Classical Errors in Covariates

Jean de Dieu Tapsoba ^1,^*,^†, Shen-Ming Lee ², Ching-Yun Wang ¹

PMCID: PMC3947110 NIHMSID: NIHMS528416 PMID: 24009099

Abstract

Data collected in many epidemiological or clinical research studies are often contaminated with measurement errors that may be of classical or Berkson error type. The measurement error may also be a combination of both classical and Berkson errors and failure to account for both errors could lead to unreliable inference in many situations. We consider regression analysis in generalized linear models when some covariates are prone to a mixture of Berkson and classical errors and calibration data are available only for some subjects in a subsample. We propose an expected estimating equation approach to accommodate both errors in generalized linear regression analyses. The proposed method can consistently estimate the classical and Berkson error variances based on the available data, without knowing the mixture percentage. Its finite-sample performance is investigated numerically. Our method is illustrated by an application to real data from an HIV vaccine study.

Keywords: Berkson error, calibration subsample, classical error, expected estimating equation, generalized linear model, instrumental variable

1 Introduction

Measurement error is a recurring issue in medical and epidemiological studies aiming to characterize the relationship between a response variable and a vector of covariates. The problem arises due to the fact that values of some covariates cannot be accurately obtained and are measured with errors instead. Some examples include the CD4 cell count in the AIDS clinical trial ACTG 175 [1] and dosimetry data [2]. Two different types of measurement errors exist in the literature, namely the classical and Berkson measurement errors [3]. A major difference between the two error types is that a classical-type error is independent of the unobservable covariates, while a Berkson-type error is positively correlated with the unobservable covariates. The CD4 cell counts are well-known to be prone to classical measurement error. Radiation dose measures on the other hand are often believed to be subject to a mixture of both Berkson and classical errors. For example, DS02 radiation dose estimates for atomic-bomb survivors, who were followed up by the Radiation Effects Research Foundation (RERF) are contaminated with both errors due to uncertainties related to averaging among survivors who generally shared the same location and survivors’ individual recollections of location and shielding [4]. If ignored, measurement error could lead to loss of power for hypothesis testing and biased estimation in many situations [5, 6].

Substantial effort has been devoted in the literature to developing methods to accommodate covariate measurement error in linear and nonlinear models in the past decades, especially when the error is classified as Berkson error or purely classical error. A simple method adjusting for classical error in covariates is the regression calibration method, which is to replace the unobserved true covariates by their conditional expectations given observed covariates in regression [7]. Other important methods dealing with the problem of classical error in covariates include the instrumental variable approach, the conditional score approach [3], the corrected score method [8, 9] and the simulation-extrapolation (SIMEX) method [10]. For the problem of nonlinear regression with Berkson error in the covariates, Whitemore and Keller [11] suggested an approximation method that reduces the bias induced by the error. Also, a minimum distance method is developed in [12] and maximum likelihood based methods are discussed in [13, 14] for the same problem.

Methods simply accounting for classical or Berkson error may not be applicable to a situation where the covariates are subject to a combination of both classical and Berkson errors and the relationship between the response and true covariates is nonlinear. Recently, several methods have been proposed to simultaneously adjust for the presence of mixture of Berkson and classical errors in covariates for a few generalized regression models. For example, Reeves et al. [2] studied a mixture of Berkson and classical errors model and suggested a regression calibration method for logistic regression. Considering the same problem, Mallick et al. [15] proposed a Bayesian method using Markov chain Monte-Carlo (MCMC) techniques and Li et al. [16] developed a Monte Carlo expectation-maximization (MCEM) approach. Kukush et al. [17] investigated a different measurement error model that also incorporates errors of both types and provided maximum likelihood-based methods for logistic regression. The available methods accounting for the effect of both classical and Berkson measurement errors in regression analysis generally require that the variances of the errors be known or related through a known function. However, these assumptions are hardly justifiable and may not be appropriate in many situations.

We are concerned with the problem of parameter estimation in generalized linear models when the covariates are possibly subject to a mixture of classical and Berkson errors and there is no replication or validation data for the mismeasured covariates. We assume that in a subset of the study cohort, an instrumental variable and another surrogate for the unobserved covariates are available. This subset is called the calibration sample. To our knowledge, this problem has not been well-discussed yet in the literature. It is practically very important to address this issue as we illustrate with the VAX004 real data example. The VAX004 study is a double-blind randomized trial of a vaccine to protect against HIV-1 infection. It involved 5403 adults volunteers (5095 men and 308 women) and was conducted in the United States, Canada, Puerto Rico and the Netherlands between 1998 and 2002. The study participants were either men who have sex with men or women at high risk for heterosexual HIV-1 transmission. During the trial, data were collected on variables including the occurrences of sexually transmitted infections, the number of male partners, HIV sero-status of partners in unprotected anal, vaginal or oral sex acts through questionnaires. More details regarding the study can be found in Flynn et al. [18]. In a regression analysis to study the effect of the number of HIV positive male partners and vaccine treatment on HIV infection, a naive method is to use the reported number of HIV positive male partners as a true covariate value. A problem is that the reported number of HIV positive male partners is potentially subject to recall errors (classical type). Also, misclassification in HIV sero-status of a male partner may lead to an error in the total number of HIV positive male partners. However, the potential misclassification error may be because the subject did not know his partners’ HIV sero-status, and hence is likely to be independent of the reported number of HIV positive partners. This leads to the concern that in addition to classical error due to recall, Berkson error may be involved. To address the concern of Berkson error, the use of a mixture of errors in the analysis may serve as a tool to test if Berkson error is involved. Another complication is the fact that there are no replicates for the number of HIV positive male partners. The development of new methods to adjust for the measurement errors in this situation is then appealing. We treat the number of unprotected anal or oral sex with HIV positive male partners as an instrumental variable. This variable is likely to be correlated with the number of HIV positive male partners and could serve as an instrument for the true underlying number of HIV positive male partners. Data on the number of HIV positive male partners were not available for some participants because they did not respond to the questions regarding the number of times they had unprotected anal or oral sex with their male partners. We allow the measurement error to include features of both Berkson error and classical error and develop an expected estimating equation (EEE) approach to account for such a feature of the measurement error. The proposed method needs no assumption regarding the mixture proportion of the error variances.

The rest of this paper is structured as follows. Section 2 describes the model for the mixture of classical and Berkson errors and the general form of the primary regression model. Section 3 provides a brief review of the naive approach, the regression calibration and SIMEX methods for the estimation of the regression parameters. This section further presents our proposed approach to accommodating the mixture of errors in the covariates. Section 4 shows the results of a simulation study investigating the finite-sample performance of the proposed method. In Section 5, we illustrate our method with an application to the VAX004 data. Section 6 concludes this work with a summary and discussion.

2 Model formulations

Let n be the number of study individuals. For individual i, let Y_i denote the response variable, X_i be the primary covariate that cannot be measured precisely and Z_i represent a vector of error-free covariates in a generalized linear model. For notational simplicity, we consider the case when X_i is univariate. Letting W_i be the observed version of X_i, we assume that W_i and X_i are related through the following mixture of Berkson and classical errors model.

{\begin{matrix} X_{i} = L_{i} + U_{b i}, \\ W_{i} = L_{i} + U_{c i}, \end{matrix}

(1)

where L_i is a latent variable with mean μ_l and variance $σ_{l}^{2}$ , U_bi and U_ci are independent zero-mean measurement errors with variances $σ_{b}^{2}$ and $σ_{c}^{2}$ , respectively. The errors U_bi and U_ci are assumed to be independent of L_i and Z_i. Note that model (1), which was also studied by Mallick et al. [15], Li et al. [16], Carroll et al. [19] and Apanasovich et al. [20] can be written under the form W_i = X_i + U_ci − U_bi. It embodies features of both Berkson error and classical error structures and reduces to a Berkson error model when $σ_{c}^{2} = 0$ and a classical error model when $σ_{b}^{2} = 0$ . Moreover, the relationship between the response variable Y_i and covariates X_i and Z_i is specified as follows.

E (Y_{i} ∣ X_{i}, Z_{i}) = ϕ (β_{0} + β_{1} X_{i} + β_{2}^{'} Z_{i}),

(2)

where $β = {(β_{0}, β_{1}, β_{2}^{'})}^{'}$ is a vector of unknown parameters of interest and φ(.) is a known function. In particular, φ(u) = u for linear regression, φ(u) = exp(u) for Poisson regression and φ(u) = {1 + exp(−u)}⁻¹ for logistic regression. Furthermore, for each subject i in the calibration sample, let M_i denote an instrumental variable for X_i. As noted in [3, 21], an instrumental variable for X_i is essentially correlated with X_i and uncorrelated with the measurement error U_ci − U_bi. The instrumental variable M_i is modeled as follows.

M_{i} = α_{0} + α_{1} L_{i} + α_{2}^{'} Z_{i} + V_{i},

(3)

where $α = {(α_{0}, α_{1}, α_{2}^{'})}^{'}$ is a vector of unknown parameters and V_i is zero-mean random variable, which is independent of L_i, U_bi, U_ci and Z_i, i = 1, … , n. In addition, we assume the availability in the calibration sample of another surrogate variable Q_i for X_i satisfying

Q_{i} = γ_{0} + γ_{1} X_{i} + γ_{2}^{'} Z_{i} + ∊_{i},

(4)

where $γ = {(γ_{0}, γ_{1}, γ_{2}^{'})}^{'}$ is a vector of unknown parameters and ε_i has mean zero and is independent of L_i, U_bi, U_ci, Z_i and V_i, i = 1, … , n. Let η_i indicate whether subject i is in the calibration sample or not and θ = P(η_i = 1). Hence, η_i = 1 if M_i and Q_i are available and η_i = 0 otherwise. We assume that given X_i and Z_i, the response Y_i is independent of W_i, M_i and Q_i, i = 1, … , n. It is further assumed that η_i is independent of (Y_i, Z_i, L_i, U_bi, U_ci, V_i, ε_i) and that (Y_i, Z_i, L_i, U_bi, U_ci, V_i, ε_i, η_i), i = 1, … , n, are independent. Our main interest lies in the estimation of the vector of parameters $β = {(β_{0}, β_{1}, β_{2}^{'})}^{'}$ based on all the observed data in three common generalized linear models, which are linear, logistic and Poisson regression models. In the following, we denote the entire available data for the ith individual by O_1i = (Y_i, W_i, M_i, Q_i, Z_i) if η_i = 1 and O_2i = (Y_i, W_i, Z_i) if η_i = 0, i = 1, … , n.

3 Estimation methods

In this section we first review briefly the naive, regression calibration (RC) and simulation-extrapolation (SIMEX) methods for the estimation of the parameter of interest β. Afterwards, we present our approach to accounting for the errors in the estimation of this parameter using all the observed data.

3.1 Naive regression

In a situation where the covariate X_i is observed, a consistent estimator of β solves the following equation:

\sum_{i = 1}^{n} (\begin{matrix} 1 \\ X_{i} \\ Z_{i} \end{matrix}) {Y_{i} - ϕ (β_{0} + β_{1} X_{i} + β_{2}^{'} Z_{i})} = 0 .

(5)

Since X_i is not observed, a naive approach to estimating β in our context is to substitute X_i with W_i in (5) and to solve the resulting equation for β. Let ${\hat{β}}_{n a}$ denote the estimator of β obtained through the naive method. In linear regression, it is easily seen that $E (Y_{i} ∣ W_{i}, Z_{i}) = β_{0} + β_{1} W_{i} + β_{2}^{'} Z_{i}$ , if $σ_{c}^{2} = 0$ in which case ${\hat{β}}_{n a}$ is unbiased. Hence, pure Berkson error is not a major concern in linear regression. However, ${\hat{β}}_{n a}$ is biased whenever classical error is present in linear regression. For example, if Z_i is univariate and (L_i, U_bi, U_ci, Z_i) is a multivariate normal random variable, it can be shown that $E (X_{i} ∣ W_{i}, Z_{i}) = (1 - λ_{1}) μ_{l} - λ_{2} μ_{z} + λ_{1} W_{i} + λ_{2} Z_{i}$ and $Var (X_{i} ∣ W_{i}, Z_{i}) = σ_{b}^{2} + (1 - λ_{1}) σ_{l}^{2} - λ_{2} σ_{l z}$ , where $λ_{1} = (σ_{l}^{2} σ_{z}^{2} - σ_{l z}^{2}) ∕ {σ_{z}^{2} (σ_{l}^{2} + σ_{c}^{2}) - σ_{l z}^{2}}$ and $λ_{2} = σ_{c}^{2} σ_{l z} ∕ {σ_{z}^{2} (σ_{l}^{2} + σ_{c}^{2}) - σ_{l z}^{2}}$ . Here, σ_lz = cov(L_i, Z_i), and μ_z and $σ_{z}^{2}$ are the mean and variance of Z_i, respectively. As a result, $E (Y_{i} ∣ W_{i}, Z_{i}) = β_{0} + β_{1} {(1 - λ_{1}) μ_{l} - λ_{2} μ_{z}} + β_{1} λ_{1} W_{i} + (β_{2} + λ_{2} β_{1}) Z_{i}$ , and it is clear that ${\hat{β}}_{n a}$ is biased in linear regression if $σ_{c}^{2} > 0$ under normality assumption. In logistic, it is well-known that ${\hat{β}}_{n a}$ is not consistent in a Bekson error setting [11] or classical error situation [3, 22]. Hence, it is obvious that the naive estimation may not work in logistic regression when errors of both types are present in the covariates. Furthermore, it can be shown for Poisson regression that $E (Y_{i} ∣ W_{i}, Z_{i}) = \exp (β_{0} + σ_{b}^{2} β_{1}^{2} ∕ 2 + β_{1} W_{i} + β_{2}^{'} Z_{i})$ in a normal Berkson error case, suggesting that the naive estimators for β₁ is unbiased in this case though, the estimation of the intercept is affected by the error. Also, $E (Y_{i} ∣ W_{i}, Z_{i}) = \exp (β_{0} + β_{2}^{'} Z_{i}) E {\exp (β_{1} X_{i}) ∣ W_{i}, Z_{i}}$ indicating that ${\hat{β}}_{n a}$ is biased when the error is classical as noted in [21]. Therefore, the naive method could result in a biased estimator of β in Poisson regression when the covariates are subject to a mixture of errors of both types. In general, the combination of Berkson and classical errors in the covariates may accentuate the attenuation phenomenon in nonlinear regression.

3.2 RC and SIMEX methods

A simple alternative method that could be applied to reduce the attenuation bias in the estimation of β is the RC approach, which was investigated by Reeves et al. [2], Wang et al. [7] and Kuha [23]. The main idea of the RC method is to replace the unobserved covariates with their conditional expectations given the observed covariates in the estimating equations for β. The method is implemented in our problem by substituting E(X_i|W_i, Z_i) for X_i in (5). It is here referred to as the RC1 method and the corresponding estimator is denoted by ${\hat{β}}_{r c 1}$ . Another RC approach that we termed as RC2 method consists to replace X_i in (5) with E(X_i|W_i, M_i, Q_i, Z_i) if η_i = 1 and E(X_i|W_i, Z_i) otherwise. Let ${\hat{β}}_{r c 2}$ be the estimator, which is obtained based on the RC2 method. ${\hat{β}}_{r c 2}$ is expected to perform better than ${\hat{β}}_{r c 1}$ since it uses additional information provided by the available data in the calibration sample. When the primary regression model is linear, it is clear that $E (Y_{i} ∣ W_{i}, Z_{i}) = β_{0} + β_{1} E (X_{i} ∣ W_{i}, Z_{i}) + β_{2}^{'} Z_{i}$ . This shows that both ${\hat{β}}_{r c 1}$ and ${\hat{β}}_{r c 2}$ are consistent estimators for β in linear regression when the covariates are subject to both classical and Berkson errors. However, they may be biased in logistic or Poisson regression. In fact, both estimators rely on an approximation to $E {ϕ (β_{0} + β_{1} X_{i} + β_{2}^{'} Z_{i}) ∣ W_{i}, Z_{i}}$ based on a first order Taylor series expansion of this expression about E(X_i|W_i, Z_i), where φ(.) was defined in (2). The biasedness of ${\hat{β}}_{r c 1}$ or ${\hat{β}}_{r c 2}$ in nonlinear regression becomes obvious when a second order Taylor series expansion is considered. For example, it can be obtained from (2) that $E (Y_{i} ∣ W_{i}, Z_{i}) \approx ϕ {β_{0} + β_{1} E (X_{i} ∣ W_{i}, Z_{i}) + β_{2}^{'} Z_{i}} + 0.5 β_{1}^{2} [\partial^{2} ϕ {β_{0} + β_{1} E (X_{i} ∣ W_{i}, Z_{i}) + β_{2}^{'} Z_{i}} ∕ \partial X^{2}] Var (X_{i} ∣ W_{i}, Z_{i})$ . Therefore, it is not difficult to see that the approximation, which is based on a first order Taylor series expansion may not be satisfactory when $β_{1}^{2} Var (X_{i} ∣ W_{i}, Z_{i})$ is large. Also, it can be noted that this term is an increasing function of |β₁|, $σ_{b}^{2}$ and $σ_{c}^{2}$ under normality assumption for all the variables. Therefore, ${\hat{β}}_{r c 1}$ and ${\hat{β}}_{r c 2}$ are not consistent in logistic or Poisson regression when the covariates are prone to a mixture of Berkson and classical errors.

Another approximate bias-correction method that has gained substantial attention in the literature is the SIMEX approach, which was first suggested by Cook and Stefanski [10] to deal with the problem of classical measurement error in covariates for linear and nonlinear regression models. It was further discussed in [3, 24, 25]. Moreover, Apanasovich et al. [20] adapted the method to a situation where the mismeasured covariates are modeled purely nonparametrically, purely parametrically or have components that are modeled both parametrically and nonparametrically. They used a kernel-based method to estimate the model parameters. Generally speaking, SIMEX is a simulation-based method that adjusts for the measurement error through the use of a two-step procedure consisting of a simulation step followed by an extrapolation step. The simulation step involves the specification of a naive estimation procedure that would lead to a consistent estimator of the model parameter in the absence of measurement error and the construction of a large number of naive estimates based on additionally simulated data sets of errors with gradually increasing variance. The extrapolation step consists to extrapolate back to the situation of no measurement error using an extrapolant function. The application of the method to our situation is described as follows. Assuming that the conditional distribution of X_i given L_i and Z_i does not involve and that U_b is normally distributed, it can be noted that if L_i in (1) could be observed, the maximum likelihood estimation of β would be to solve

\sum_{i = 1}^{n} \frac{\partial}{\partial β} \log {L (Y_{i} ∣ L_{i}, Z_{i}, β)} = 0,

(6)

where $L (Y_{i} ∣ L_{i}, Z_{i}, β) = \int L (Y_{i} ∣ X_{i} = x, Z_{i}, β) L (X_{i} = x ∣ L_{i}, Z_{i})$ is the conditional distribution of Y_i given L_i, Z_i, and $L (.)$ denotes likelihood function. We recall that L_i is not observable and replacing it by W_i in (6) will lead to a biased estimation of β. In the simulation step of the SIMEX procedure to reduce the bias, for a non-negative real value, we generate new data $W_{ζ, r, i} = W_{i} + σ_{c} \sqrt{ζ} U_{r, i}$ , i = 1, … ,n, r = 1, … , R, where U_r,i are generated as independent and identically distributed random variables following the standard normal distribution. Here R represents the number of simulated data sets. Let ${\hat{β}}_{r} (ζ_{j})$ denote the estimate of β obtained by replacing L_i with W_ζ,r,i in (6) and denote $\hat{β} (ζ_{j}) = R^{- 1} \sum_{r = 1}^{R} {\hat{β}}_{r} (ζ_{j})$ , where 0 = ζ₀ < ζ₁ < … < ζ_J and J > 1. The extrapolation step consists to fit a regression model of $\hat{β} (ζ_{j})$ on ζ_j, j = 1, … , J, using the ordinary least squares estimation method. The SIMEX estimate of β, denoted by ${\hat{β}}_{s i m e x}$ is then obtained by extrapolating back to the case when ζ = −1, which represents the situation of no measurement error. A routinely used extrapolant function is the quadratic function. The covariance matrix of ${\hat{β}}_{s i m e x}$ can be estimated by means of standard bootstrap method. A problem associated with the SIMEX procedure is that it is an approximate method, subject to the choice of the extrapolant function. Furthermore, similar to RC, the procedure may not work well in nonlinear regression when the variance of the measurement error is large or when |β₁| is large [25]. We pursue a more reliable estimation approach, which is based on all the observed data in the following subsection.

3.3 Expected estimating equation approach

The expected estimating equation (EEE) method was investigated by Wang et al. [26] in a situation where the response variable and some covariates may be missing, misclassified or subject to classical measurement error and multiple unbiased surrogates for the error-prone variables are available. In our problem, the covariates are possibly subject to both Berkson and classical errors and there is only one unbiased surrogate. Moreover, a calibration subsample data is available only for some individuals. We propose the EEE estimator for β that solves the following estimating equation:

\sum_{i = 1}^{n} η_{i} E {S (Y_{i}, X_{i}, Z_{i}) ∣ O_{1 i}} + (1 - η_{i}) E {S (Y_{i}, X_{i}, Z_{i}) ∣ O_{2 i}} = 0,

(7)

where $S (Y_{i}, X_{i}, Z_{i}) = {(1, X_{i}, Z_{i}^{'})}^{'} {Y_{i} - ϕ (β_{0} + β_{1} X_{i} + β_{2}^{'} Z_{i})}$ . We denote the solution to (7) by ${\hat{β}}_{e e e}$ and refer to it as the EEE estimator for β. The calculations of conditional expectations involved in (7) and the evaluations of E(X_i|W_i, Z_i) and E(X_i|W_i, M_i, Q_i, Z_i) for RC1 and RC2 methods require specifications of the distribution functions of L_i, U_bi, U_ci, V_i and ε_i which do not have to be necessarily normal. Moreover, letting O_i = O_1i if η_i = 1 and O_i = O_2i otherwise, the conditional expectation $E {S (Y_{i}, X_{i}, Z_{i}) ∣ O_{i}}$ can be evaluated as

\frac{\int_{L} \int_{U} S (Y_{i}, X_{i} = L + U, Z_{i}) L (Y_{i} ∣ X_{i} = L + U, Z_{i}) L (W_{i} ∣ L) F_{i} {(L, U)}^{η_{i}} L (L, Z_{i}) L (U) d L d U}{\int_{L} \int_{U} L (Y_{i} ∣ X_{i} = L + U, Z_{i}) L (W_{i} ∣ L) F_{i} {(L, U)}^{η_{i}} L (L, Z_{i}) L (U) d L d U},

where $L (.)$ denotes likelihood function and $F_{i} (L, U) = L (Q_{i} ∣ X_{i} = L + U, Z_{i}) L (M_{i} ∣ L, Z_{i})$ . The expectations E(X_i|W_i, Z_i) and E(X_i|W_i, M_i, Q_i, Z_i) can be calculated similarly. The integrals in the above expression can be evaluated by means of numerical integration techniques including the Gauss-Hermite quadrature rule and the trapezoidal integration rule, which involves uniformly partitioning acceptable ranges of the L and U_b axes into a specified number of intervals and applying Riemann summation techniques.

A tacit assumption that has been made so far is the knowledge of the nuisance parameters including μ_l, α, γ, $σ_{l}^{2}$ , $σ_{b}^{2}$ , $σ_{c}^{2}$ , $σ_{v}^{2}$ and $σ_{∊}^{2}$ which are needed for the calculation of conditional expectations. In practice, these parameters may need to be estimated from the data. If $σ_{b}^{2}$ is known or is linked to $σ_{c}^{2}$ through a known bijective function, then the assumption of the availability of data on Q_i can be relaxed. In this case O_1i = (Y_i, W_i, M_i, Z_i) and all the nuisance parameters can be identified based on the observations O_1i such that η_i = 1, i = 1, … , n. Moreover, conditional expectations for the estimation of β by the RC2 method will not involve Q_i. In a more general situation when $σ_{b}^{2}$ is unknown and there is no assumed relationship between the error variances, the data on Q_i will serve as an additional information for the estimation of the nuisance parameters. In such a situation, estimating equations for the vector of nuisance parameters, denoted by ν, can be obtained based on moment calculations using the observed data on Y_i, M_i, Q_i, W_i and Z_i for subjects in the calibration subsample. For example, estimating equations based on moment calculations are given in Appendix B when the primary regression model is assumed to be linear and does not involve Z. Let Ω = (β′, ν′)′, write (7) as $\sum_{i = 1}^{n} Ψ_{1 i} (β) = 0$ and define $Ψ_{i} (Ω) \equiv {Ψ_{1 i}^{'} (β), Ψ_{2 i}^{'} (ν)}^{'}$ , where $\sum_{i = 1}^{n} Ψ_{2 i} (ν) = 0$ is an estimating equation for ν. Furthermore, let Ω₀ be the true value of Ω and $\hat{Ω}$ denote the solution to $\sum_{i = 1}^{n} Ψ_{i} (Ω) = 0$ . The asymptotic properties of $\sqrt{n} = (\hat{Ω} - Ω_{0})$ are given by the following result.

Proposition: Under the regular conditions (C1)-(C6) in Appendix A, $\sqrt{n} = (\hat{Ω} - Ω_{0})$ is asymptotically normally distributed with mean zero and covariance matrix $A^{- 1} (Ω_{0}) B (Ω_{0}) {A^{- 1} (Ω_{0})}^{'}$ , where $A (Ω_{0})$ and $B (Ω_{0})$ are the limits in probability of ${\hat{A}}_{n} (Ω_{0}) = n^{- 1} \sum_{i = 1}^{n} \partial Ψ_{i} (Ω_{0}) ∕ \partial Ω$ and ${\hat{B}}_{n} (Ω_{0}) = n^{- 1} \sum_{i = 1}^{n} Ψ_{i} (Ω_{0}) Ψ_{i}^{'} (Ω_{0})$ , respectively.

The proof of the proposition is sketched in Appendix A. The asymptotic covariance matrix of $\sqrt{n} (\hat{Ω} - Ω_{0})$ can be consistently estimated by ${\hat{A}}_{n}^{- 1} (\hat{Ω}) {\hat{B}}_{n} (\hat{Ω}) {{\hat{A}}_{n}^{- 1} (\hat{Ω})}^{'}$ .

4 Simulation study

We investigated the finite-sample performances of the methods discussed in the previous section through a simulation study. We considered linear regression, logistic regression and Poisson regression models for the response Y with a single explanatory variable X, which is scalar. In the simulations, the latent variable L was generated from $N (μ_{l}, σ_{l}^{2})$ with μ_l = 0.5 and $σ_{l}^{2} = 1$ . The Berkson error U_b and the classical error U_c followed zero-mean normal distributions with variances $σ_{b}^{2}$ and $σ_{c}^{2}$ , respectively. Also, X was simulated from the model X = L + U_b and the observed version of X was drawn as W = L + U_c. We set $σ_{b}^{2} = 0.2$ or 0.4, and $σ_{c}^{2} = 0.3$ or 0.5 to study the separate and combined effects of the errors on the performances of the methods. Moreover, the proportion of the calibration data was chosen as θ = 0.5 or 0.7 to investigate how the results evolve with the size of the available data in the calibration sample. The instrumental variable was simulated following the model M = α₀ + α_1L + V, where V was normal with mean 0 and variance $σ_{v}^{2} = 1$ . We took α₀ = 1 and α₁ = −2. Furthermore, the additional variable in the calibration sample was simulated as Q = γ₀ + γ₁X + ε, where γ₀ = −1, γ₁ = 2 and ε was normally distributed with mean zero and variance $σ_{∊}^{2} = 1$ . The variable η, indicating whether M and Q are available or not was generated from the Bernoulli distribution with probability of success P (η = 1) = θ. For the linear regression, the response variable followed the model Y = β₀ + β₁X + e, where e is normal with mean zero and variance $σ_{e}^{2} = 1$ . In the logistic regression case, Y was simulated from the Bernoulli trial with probability of success {1 + exp(−β₀ − β₁X)}⁻¹. We generated Y as Poisson random variable with mean exp(β₀ + β₁X) for Poisson regression.

A total of 400 Monte Carlo samples of size n = 500 were generated in the simulations for each case. We estimated the parameter of interest β = (β₀, β₁)′ based on the naive method, the RC methods (RC1 and RC2), the SIMEX method and the EEE approach. The naive estimator simply replaces X by W in (5) with no covariates Z. The RC1 estimator uses the conditional expectation E(X|W) in place of X in (5) without Z. The RC2 substitutes E(X|W, M, Q) for X in (5) if η = 1 and replaces X by E(X|W) in (5) when there is no calibration data (η = 0). The EEE estimator, which solves (7) without Z involved in the equation uses all observed data and accounts for the mixture of Berkson and classical errors. For the SIMEX method, we created R = 50 additional data sets of measurement error at each point ζ ∈ {0, 0.5, 1, 1.5, 2} in the simulation step and used the quadratic model in the extrapolation step. The estimators were evaluated with regard to their biases (Bias), sample standard deviation of the estimates (SD), average of the estimated standard erros (ASE) of the estimators and coverage probabilities (CP) of their 95% Wald confidence intervals. The standard errors of the RC1, RC2 and EEE estimators were computed using the sandwich standard error estimation approach. The estimation of the standard error for the SIMEX estimator was based on the bootstrap method, resampling 30 times. Integrals involved in the evaluation of condition expectations for the implementation of RC2, SIMEX or EEE method were computed using Gauss-Hermite integration techniques with 10 quadrature points.

Table 1 displays the simulation results for the estimation of in the linear regression case, where we set β = (−2, 2)′. It can be seen from this table that the naive estimator for β performs poorly as expected. It shows large biases and very low coverage probabilities of less than 50%. The biases of the naive estimator increase as the magnitude of the classical error increases. However, these biases appear unaffected to some extent by the Berkson error, indicating that the naive estimator would be unbiased if the classical errors were absent. The RC1, RC2, SIMEX and EEE estimators on the other hand work well, showing small biases and coverage probabilities that are close to the nominal level of 95%. Larger variance of Berkson or classical error have the effect of increasing the standard errors of the RC1, RC2, SIMEX and EEE estimators. Furthermore, the performances of these estimators in terms of efficiencies are better when the proportion of calibration data is larger (θ = 0.7). The SIMEX estimator appears to be more efficient than the regression calibration and EEE estimators in these simulations. It can also be seen that the RC2 performs better than the RC1 with regards to biases and standard errors. This is probably due to the fact that RC2 estimator uses more information from the data than RC1 estimator does. Moreover, the EEE estimator for β₁ is more efficient than the RC2 estimator when the proportion of calibration data is small (θ = 0.5) and the magnitude of the classical error is greater than that of the Berkson error. The advantage seems to vanish as θ gets larger (θ = 0.7). A plausible explanation is that the EEE estimator makes more efficient use of the data over RC2 when using the non-calibration data, while they perform equally well in the calibration data. In Table S1 in the Supplementary materials we have demonstrated the phenomenon that in general when the proportion of the calibration sample is smaller the efficiency gain of the EEE over RC2 is larger. The simulation settings for this table were similar to those for Table 1 with the difference that we set $σ_{b}^{2} = 0.2$ and θ = 0.3, 0.5, 0.7 or 0.9 to examine the effect of the proportion of the calibration data on the parameter estimation. The results in this table indicated that the efficiency gain of EEE over RC2 decreases as the proportion of calibration data gets larger.

Table 1.

Simulation results for linear regression: Y = β₀ + β₁X + e, where X = L + U_b; W = L + U_c; β = (−2, 2)′; M = 1 − 2L + V; Q = −1 + 2X + ∊; θ is the proportion of the calibration data; L ~ N(0.5,1); U_b ~ N(0, $σ_{b}^{2}$ ) is the Berkson error; U_c ~ N(0, $σ_{c}^{2}$ ) is the classical error; V ~ N(0,1); e ~ N(0,1); ∊ ~ N(0,1); n = 500; Naive, “naive” regression replacing X by W; RC1, Regression calibration approach replacing X by its conditional expectation given W; RC2, Regression calibration approach replacing X by its conditional expectation given W and Q, M in the calibration sample; SIMEX, simulation extrapolation procedure; EEE, Expected estimating equation method.

				$σ_{c}^{2}$ = 0.3					$σ_{c}^{2}$ = 0.5
θ	$σ_{b}^{2}$	β		Naive	RC1	RC2	SIMEX	EEE	Naive	RC1	RC2	SIMEX	EEE
0.5	0.2	β ₀	Bias	0.237	0.006	0.006	0.009	0.003	0.330	−0.001	−0.002	0.010	−0.007
			SD	0.082	0.104	0.104	0.094	0.107	0.083	0.112	0.110	0.096	0.111
			ASE	0.080	0.099	0.098	0.091	0.099	0.085	0.117	0.115	0.102	0.116
			CP	0.182	0.948	0.950	0.945	0.950	0.025	0.958	0.960	0.962	0.960
		β ₁	Bias	−0.464	0.003	0.000	−0.007	0.005	−0.666	−0.003	−0.003	−0.028	0.006
			SD	0.061	0.114	0.110	0.096	0.108	0.064	0.137	0.131	0.102	0.122
			ASE	0.065	0.112	0.108	0.097	0.105	0.065	0.137	0.130	0.108	0.122
			CP	0.000	0.965	0.955	0.950	0.942	0.000	0.960	0.948	0.948	0.958
	0.4	β ₀	Bias	0.236	0.002	0.002	0.001	−0.001	0.340	−0.001	0.000	−0.002	−0.004
			SD	0.089	0.109	0.108	0.106	0.114	0.093	0.129	0.126	0.116	0.132
			ASE	0.092	0.110	0.109	0.106	0.114	0.096	0.128	0.126	0.116	0.129
			CP	0.292	0.960	0.958	0.940	0.948	0.052	0.952	0.950	0.938	0.960
		β ₁	Bias	−0.466	0.001	0.001	0.003	0.008	−0.672	0.007	0.006	0.008	0.013
			SD	0.072	0.126	0.122	0.118	0.126	0.072	0.154	0.145	0.130	0.142
			ASE	0.073	0.125	0.121	0.118	0.123	0.072	0.151	0.142	0.129	0.138
			CP	0.000	0.955	0.948	0.952	0.955	0.000	0.955	0.948	0.952	0.955

0.7	0.2	β ₀	Bias	0.241	0.010	0.010	0.014	0.010	0.334	0.001	0.002	0.014	0.003
			SD	0.079	0.092	0.092	0.089	0.094	0.085	0.108	0.107	0.102	0.109
			ASE	0.081	0.093	0.092	0.089	0.094	0.085	0.108	0.106	0.100	0.108
			CP	0.160	0.962	0.960	0.938	0.970	0.030	0.962	0.962	0.948	0.960
		β ₁	Bias	−0.463	0.005	0.001	−0.007	0.003	−0.662	0.004	0.003	−0.023	0.004
			SD	0.065	0.098	0.097	0.090	0.097	0.067	0.119	0.115	0.101	0.112
			ASE	0.064	0.098	0.095	0.090	0.095	0.064	0.116	0.111	0.099	0.108
			CP	0.000	0.970	0.955	0.942	0.952	0.000	0.955	0.945	0.922	0.952
	0.4	β ₀	Bias	0.236	0.001	0.002	0.000	0.003	0.340	0.000	0.002	−0.003	0.004
			SD	0.089	0.102	0.101	0.100	0.103	0.093	0.116	0.114	0.111	0.116
			ASE	0.092	0.105	0.104	0.102	0.107	0.096	0.118	0.117	0.113	0.120
			CP	0.355	0.968	0.958	0.945	0.960	0.052	0.962	0.960	0.945	0.952
		β ₁	Bias	−0.466	0.004	0.001	0.004	0.002	−0.672	0.007	0.003	0.009	0.003
			SD	0.072	0.106	0.102	0.103	0.103	0.072	0.125	0.117	0.114	0.115
			ASE	0.073	0.110	0.107	0.106	0.108	0.072	0.128	0.122	0.118	0.121
			CP	0.000	0.960	0.960	0.958	0.965	0.000	0.955	0.962	0.952	0.965

Open in a new tab