Inference on data with both multiplicative and additive measurement errors

Yuxiang Zong; Yinfu Liu; Yanyuan Ma; Ingrid Van Keilegom

doi:10.1111/sjos.70009

Author manuscript; available in PMC: 2026 Mar 5.

Published in final edited form as: Scand Stat Theory Appl. 2025 Aug 3;52(4):1763–1785. doi: 10.1111/sjos.70009

PMCID: PMC12959484 NIHMSID: NIHMS2149920 PMID: 41789348

Abstract

Measurement errors are omnipresent in many fields and can lead to serious problems in statistical analysis. In the literature, measurement errors are often assumed to be either additive or multiplicative. We consider the case where a variable is subject to both additive and multiplicative errors. We establish the identifiability and propose a moment-based estimator for the variances of these two types of errors, which is shown to be consistent. We further derive the asymptotic distribution of the estimator and conduct hypothesis tests to examine the existence of the two types of errors. We also develop a likelihood-based method to approximate the density of the error-prone variable. We apply our strategy in the context of linear regression and study its effect on the estimation of regression parameters in combination with Regression Calibration and Simulation Extrapolation. The proposed methodology is numerically investigated through simulations and is implemented in a real data application.

Keywords: Bernstein polynomial, Measurement error, Method of moments, Regression calibration, Simulation extrapolation

1. Introduction

In practice, it is common to encounter situations where a variable cannot be directly observed and is measured with noise, leading to the measurement error problem or the errors-in-variable problem. Measurement error problems are pervasive in many fields due to various reasons, such as inaccurate measuring devices, sampling errors, and imprecise data collection methods. For example, long-term systolic blood pressure may not be directly observable and is often approximated by blood pressure measured during a clinic visit. Ignoring measurement errors in statistical analysis can lead to serious problems, such as biased parameter estimation, loss of statistical power, or masking of important features in the data. Therefore, it is crucial to develop methods to account for measurement errors.

To address measurement errors, it is essential to accurately capture the relation between the true variable and its errored observation. In the literature, it is often assumed that the measurement error is additive or multiplicative. Specifically, let $X$ be the unobserved variable, and $W$ the corresponding observed variable. The classical additive measurement error model is defined as

W = X + ε,

(1)

where $E (ε) = 0$ and $X$ and $ε$ are independent. For example, in the National Cancer Institute’s OPEN study (Subar et al., 2003), the true energy intake was measured by the food frequency questionnaire (FFQ), and the log measured energy intake is assumed to contain additive errors.

The multiplicative measurement error model is defined as

W = X η,

(2)

where $X$ and $η$ are assumed to be independent and $η$ is usually assumed to follow a Log-Normal distribution with $E (log η) = 0$ . An example of the multiplicative error model is the analysis of A-bomb survivor data from the Hiroshima and Nagasaki explosions by Pierce et al. (1992). The true explosion dose $X$ is not observable, but estimates $W$ can be obtained and are assumed to be contaminated by multiplicative errors. Extensive research on these two types of measurement errors has been conducted and is comprehensively reviewed in Carroll et al. (2006), Buonaccorsi (2010), and Yi et al. (2021), among others.

Several studies have compared the performance of additive and multiplicative measurement error models in real-life data analysis; see, for instance, Marques (2004), Hunter et al. (2011), Tian et al. (2013), and Tang et al. (2015). However, very few studies consider both multiplicative and additive measurement errors. Specifically, Rocke and Durbin (2001) considered an error model of the form $X η + ε$ where $η$ follows a Log-Normal distribution and $ε$ follows a Normal distribution, and applied it to gene expression measurements with cDNA arrays. However, their approach requires data from a control group in which the true variable $X$ takes the value 0, so that pure additive errors are observed and the variance of $ε$ can be estimated. This limits its applicability. Berkson errors with a mixture of additive and multiplicative errors have been studied by Stram and Kopecky (2003). However, this model has not been applied to the error-in-variable regression problem. Therefore, it is necessary to further investigate the model with both additive and multiplicative errors.

Two main approaches to estimating the error structure are likelihood-based methods and moment-based methods. Likelihood-based methods are among the earliest techniques applied to address measurement error issues, and they have been extensively discussed in the literature (Yi et al., 2021). However, likelihood-based methods have drawbacks, including assumptions regarding the true variable $X$ and computational complexity. In our setting, where both types of errors coexist, the likelihood function involves a double integration, leading to a significant computational burden. Hence, to estimate the error variances, we adopt the method of moments, which does not require full knowledge of the unobserved variable. We show the identifiability of the error variances, construct estimating equations to estimate the error variances based on replicated data, and prove the consistency of the estimator. With the estimated error variances, we employ a likelihood-based method to approximate the density of the true variable, which proves to be a more efficient approach.

Once the error variances have been estimated, we further derive the asymptotic distribution of the estimator. The difficulty is that the true parameter possibly lies on the boundary of the parameter space. Andrews (2002) established the asymptotic distribution of the generalized method of moments estimator in such cases. Zhang et al. (2008) focused on conducting variance component tests within the generalized linear mixed models framework based on the derived null asymptotic distribution. In this study, we use Andrews (2002) to derive the asymptotic distribution of the error variance estimator when the true parameter is on the boundary.

We also estimate the distribution of the true variable $X$ . In the literature regarding the classical additive measurement error problem, nonparametric deconvolution methods have been applied to estimate the density of $X$ (Butucea and Matias, 2005). However, when both multiplicative and additive errors coexist, the deconvolution method becomes ineffective in recovering the true variable distribution. Specifically, when both types of errors are present, two steps of deconvolution are required to recover the true distribution of $X$ . First, deconvolution is applied to recover the distribution of $X η$ , and then it is applied again after a log transformation to recover the distribution of $X$ . This process may lead to a loss of information due to the error accumulation from the two deconvolution procedures. Therefore, in this case, we turn to a likelihood-based approach introduced by Bertrand et al. (2019) to approximate the density function of $X$ , assuming that $X$ has a compact support. Based on the idea of Bertrand et al. (2019), we apply Bernstein polynomials to approximate the unknown density function of $X$ incorporating the estimated error variances, and the tuning parameter is selected using the Akaike information criterion (AIC).

In the context of linear regression, measurement errors can lead to biased and inconsistent estimates of the regression coefficients, invalid inference results, and inaccurate predictions. We investigate the issue caused by the existence of both types of measurement errors in the simple linear regression problem and propose a regression calibration method and a SIMEX method to correct it.

The paper is organized as follows. In Section 2, we introduce the measurement error model with both multiplicative and additive errors, propose a method to estimate the unknown error variances, and conduct hypothesis tests to examine the existence of measurement errors. Section 3 presents a method to estimate the density of the true variable. The error-in-variable regression problem with both types of errors is investigated in Section 4. Simulation studies and applications are shown in Section 5.

2. Methodology

2.1. Model and Assumptions

Let $X$ denote the true variable and $W$ be the surrogate variable measured with both additive and multiplicative errors. Let $η$ and $ε$ be the multiplicative error and additive error respectively. Specifically, we assume

W = X η + ε,

(3)

where $η ~ Log-Normal (0, σ_{η}^{2})$ , $ε ~ N (0, σ_{ε}^{2})$ , and $X$ , $η$ and $ε$ are mutually independent. We denote $μ \equiv E (X)$ , $σ_{x}^{2} \equiv var (X)$ , and $X$ is either positive or negative, but we do not assume $X$ to belong to any parametric distribution family.

Remark 2.1. The parametric assumptions on the two error distributions are essential for model identifiability. In addition, the assumptions that $η ~ Log-Normal (0, σ_{η}^{2})$ for multiplicative errors and $ε ~ N (0, σ_{ε}^{2})$ for additive errors are widely adopted in the measurement error literature due to their broad applicability and technical convenience. See, for example, Iturria et al. (1999); Lyles and Kupper (1997); Brenner Miguel et al. (2023); Bertrand et al. (2019). These error distribution assumptions can be validated using appropriate validation data or other external information. The estimation procedure in Section 2.2 relies on the error distributions through the first three moments, hence is more sensitive to the distribution assumption than in the pure additive error case.

2.2. Error Variance Estimation

We first propose a method to estimate the error variances based on replicate data. Suppose that there are $n$ objects with two replicates $(W_{i 1}, W_{i 2})$ , $i = 1, \dots, n$ . Suppressing the index $i$ , we can write

W_{j} = X η_{j} + ε_{j}, j = 1, 2,

where $η_{1}$ , $η_{2}$ , $ε_{1}$ , $ε_{2}$ , $X$ are mutually independent and $ε_{j} ~ N (0, σ_{ε}^{2})$ , $log η_{j} ~ N (0, σ_{η}^{2})$ .

Note that $E (η) = e^{σ_{η}^{2} / 2}$ and $Var (η) = e^{σ_{η}^{2}} (e^{σ_{η}^{2}} - 1)$ . For notational simplicity, let $μ_{k} \equiv E (X^{k})$ and $a \equiv e^{σ_{η}^{2} / 2}$ . We obtain

E (W_{1}) = a μ_{1}, E (W_{1} W_{2}) = a^{2} μ_{2}, E (W_{1}^{2}) = a^{4} μ_{2} + σ_{ε}^{2}, E (W_{1}^{2} W_{2}) = a^{5} μ_{3} + a μ_{1} σ_{ε}^{2}, E (W_{1}^{3}) = a^{9} μ_{3} + 3 a μ_{1} σ_{ε}^{2} .

Eliminating $μ_{1}$ , $μ_{2}$ , $μ_{3}$ and $σ_{ε}^{2}$ from the above leads to a cubic equation in $a^{2}$ :

0 = E (W_{1}) E (W_{1} W_{2}) a^{6} + (E (W_{1}^{2} W_{2}) - E (W_{1}) E (W_{1}^{2})) a^{4} - 3 E (W_{1}) E (W_{1} W_{2}) a^{2} + 3 E (W_{1}) E (W_{1}^{2}) - E (W_{1}^{3}) .

(4)
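The elimination that produces (4) can be traced in a few lines. The following is a worked sketch that uses only the five moment identities stated above; the intermediate arrangement is ours and is given for illustration only.

```latex
\begin{aligned}
μ_{2} &= E (W_{1} W_{2}) / a^{2}, \\
σ_{ε}^{2} &= E (W_{1}^{2}) - a^{4} μ_{2} = E (W_{1}^{2}) - a^{2} E (W_{1} W_{2}), \\
a^{5} μ_{3} &= E (W_{1}^{2} W_{2}) - E (W_{1}) σ_{ε}^{2}
  \quad \text{(using } a μ_{1} = E (W_{1}) \text{)}, \\
E (W_{1}^{3}) &= a^{4} \{E (W_{1}^{2} W_{2}) - E (W_{1}) σ_{ε}^{2}\} + 3 E (W_{1}) σ_{ε}^{2} .
\end{aligned}
```

Substituting $σ_{ε}^{2} = E (W_{1}^{2}) - a^{2} E (W_{1} W_{2})$ into the last line and collecting powers of $a^{2}$ gives the cubic (4).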

We next establish that (4) has a unique root for $a$ in the region $(1, \infty)$ .

Lemma 2.1. When $X > 0$ , $cov (X^{2}, X) > 0$ . When $X < 0$ , $cov (X^{2}, X) < 0$ .

Theorem 2.1 (Identifiability). When $X$ is positive or negative and $E (| X |^{3}) < \infty$ , there exists a unique $a^{2} > 1$ that satisfies (4). Hence, $a$ and subsequently $μ_{1}$ , $μ_{2}$ , $μ_{3}$ , $σ_{ε}^{2}$ are all unique, i.e., the model is identifiable.

The proof of Lemma 2.1 and Theorem 2.1 can be found in Supplement S1. Based on the identifiability result and its proof, it is straightforward to construct an estimator for $a^{2}$ and $σ_{ε}^{2}$ using the empirical approximation. Specifically, let $θ \equiv {(a^{2}, σ_{ε}^{2})}^{T}$ , then the estimator $\hat{θ} \equiv {({\hat{a}}^{2}, {\hat{σ}}_{ε}^{2})}^{T}$ can be obtained by solving the estimating equation $Ψ_{n} (θ) = n^{- 1} \sum_{i = 1}^{n} ψ_{θ} (W_{i}) = 0$ where

ψ_{θ} (W_{i}) = [\begin{array}{l} \hat{E} (W_{i 1}) a^{4} σ_{ε}^{2} - 3 \hat{E} (W_{i 1}) σ_{ε}^{2} - a^{4} \hat{E} (W_{i 1}^{2} W_{i 2}) + \hat{E} (W_{i 1}^{3}) \\ σ_{ε}^{2} - \hat{E} (W_{i 1}^{2}) + a^{2} W_{i 1} W_{i 2} \end{array}],

(5)

where $\hat{E} (W_{i 1}) = \sum_{j = 1}^{2} W_{i j} / 2$ , $\hat{E} (W_{i 1}^{2}) = \sum_{j = 1}^{2} W_{i j}^{2} / 2$ , $\hat{E} (W_{i 1}^{3}) = \sum_{j = 1}^{2} W_{i j}^{3} / 2$ , and $\hat{E} (W_{i 1}^{2} W_{i 2}) = (W_{i 1}^{2} W_{i 2} + W_{i 1} W_{i 2}^{2}) / 2$ . Note that the estimating function $Ψ_{n} (θ)$ has mean 0 when evaluated at the true parameter $θ_{0}$ , which, under suitable conditions, leads to the consistency of the estimator as we establish in Theorem 2.2, with the proof in Supplement S1.
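As an illustration of how the estimating equation can be solved in practice, here is a minimal Python sketch. It relies on the observation that, because (5) is linear in the per-subject quantities, solving $Ψ_{n} (θ) = 0$ amounts to plugging symmetrised sample moments into the cubic (4) and then recovering $σ_{ε}^{2}$ from the second component. The function names, the bisection solver, and the fallbacks to $a^{2} = 1$ and $σ_{ε}^{2} = 0$ are assumptions of this sketch, not part of the paper.

```python
import math


def sample_moments(pairs):
    """Symmetrised sample moments from replicate pairs (w1, w2)."""
    n = len(pairs)
    m1 = sum(w1 + w2 for w1, w2 in pairs) / (2 * n)          # E(W1)
    m11 = sum(w1 * w2 for w1, w2 in pairs) / n               # E(W1 W2)
    m2 = sum(w1**2 + w2**2 for w1, w2 in pairs) / (2 * n)    # E(W1^2)
    m21 = sum(w1**2 * w2 + w1 * w2**2 for w1, w2 in pairs) / (2 * n)  # E(W1^2 W2)
    m3 = sum(w1**3 + w2**3 for w1, w2 in pairs) / (2 * n)    # E(W1^3)
    return m1, m11, m2, m21, m3


def solve_theta(m1, m11, m2, m21, m3, tol=1e-12):
    """Return (a2, sigma_eps2) by solving the cubic (4) in u = a^2 on [1, inf)."""
    def p(u):
        return (m1 * m11 * u**3 + (m21 - m1 * m2) * u**2
                - 3 * m1 * m11 * u + 3 * m1 * m2 - m3)

    lo, hi = 1.0, 2.0
    # Expand the bracket until p changes sign, up to a generous cap.
    while p(lo) * p(hi) > 0 and hi < 1e6:
        hi *= 2.0
    if p(lo) * p(hi) > 0:
        a2 = 1.0  # assumption of this sketch: fall back to the boundary
    else:
        while hi - lo > tol:  # plain bisection
            mid = 0.5 * (lo + hi)
            if p(lo) * p(mid) <= 0:
                hi = mid
            else:
                lo = mid
        a2 = 0.5 * (lo + hi)
    # Second component of (5); truncation at 0 is an assumption of this sketch.
    sigma_eps2 = max(m2 - a2 * m11, 0.0)
    return a2, sigma_eps2
```

Since $a = e^{σ_{η}^{2} / 2}$ , an estimate of the multiplicative error variance follows as `math.log(a2)`.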

Theorem 2.2 (Consistency). When $X$ is positive or negative, the estimator $\hat{θ}$ is consistent.

Remark 2.2. The above idea is further extended to cases where more than 2 replicates are available, see Supplement S5 for details. The assumptions on the positiveness or negativeness of $X$ in Theorem 2.1 can be relaxed. An alternative condition, along with its analysis, is presented in the Supplement S1.4.

Remark 2.3. In practice, even though the model is theoretically identifiable, it can be practically unidentifiable. For example, one might have $\hat{E} (W_{1} W_{2}) < 0$ but $\hat{E} (W_{1}^{2} W_{2}) > 0$ where $\hat{E} (W_{1}^{2} W_{2}) = \sum_{i = 1}^{n} (W_{i 1}^{2} W_{i 2} + W_{i 1} W_{i 2}^{2}) / (2 n)$ and $\hat{E} (W_{1} W_{2}) = \sum_{i = 1}^{n} W_{i 1} W_{i 2} / n$ . When this rare situation occurs, we propose using truncation at the value 0 to avoid practical unidentifiability.

2.3. Asymptotic Distribution of Estimator

Upon obtaining the error variance estimator, we proceed to derive the asymptotic distribution of the estimator. Given that the estimator is derived from estimating equations, when the true parameter lies in an open set, the estimator is usually asymptotically normal. However, in our setting, the parameter space $Θ$ is compact, and the true parameter is possibly positioned on the boundary, such as when one of the error variances equals 0, hence the estimator may no longer be asymptotically normal. We follow the theory proposed in Van der Vaart (2000) and Andrews (2002) to derive the asymptotic distribution of the estimator. To simplify notation, let $Ψ (θ) \equiv E (Ψ_{n} (θ))$ , $Γ \equiv \frac{\partial}{\partial θ^{T}} Ψ (θ)|_{θ = θ_{0}}$ , $𝒥 \equiv Γ^{T} Γ$ , and $𝒱 \equiv n E (Ψ_{n} (θ_{0}) Ψ_{n} {(θ_{0})}^{T})$ . Theorem 2.3 establishes the asymptotic distribution for $\hat{θ}$ when the true parameter lies in the interior of the parameter space.

Theorem 2.3. When $θ_{0}$ lies in the interior of the parameter space, that is, $θ_{0} > (1, 0)^{T}$ elementwise, $n^{\frac{1}{2}} ({\hat{θ}}_{n} - θ_{0}) \to_{d} N (0, Γ^{- 1} 𝒱 Γ^{- T})$ .

Next, we consider the scenario where both parameters are situated on the boundary. To derive the asymptotic distribution of the estimator when the true parameter lies on the boundary of parameter space, we use the method proposed by Andrews (2002). The idea involves approximating the objective functions by quadratic functions, approximating the restricted parameter spaces by cones, and determining the asymptotic distribution of the estimators. The following Theorem 2.4 derives the asymptotic distribution of $\hat{θ}$ when $θ_{0} = (1, 0)^{T}$ and the proof is presented in Supplement S1.

Theorem 2.4. When $θ_{0}$ lies on the boundary of the parameter space, that is, $θ_{0} = (1, 0)^{T}$ , $n^{\frac{1}{2}} ({\hat{θ}}_{n} - θ_{0}) \to_{d} \tilde{λ}$ , where $\tilde{λ} \equiv 𝒵 I (𝒵 ⩾ 0)$ , and $𝒵 ~ N (0, Γ^{- 1} 𝒱 Γ^{- T})$ .

We now consider the scenario where only one of the true parameters lies on the boundary, that is, only one of the multiplicative and additive errors is present. The vectors and matrices are partitioned as

Ψ = (\binom{Ψ_{a^{2}}}{Ψ_{σ_{ε}^{2}}}), 𝒵 = (\binom{𝒵_{a^{2}}}{𝒵_{σ_{ε}^{2}}}), 𝒥 = [\begin{matrix} 𝒥_{a^{2}} & 𝒥_{a^{2} σ_{ε}^{2}} \\ 𝒥_{σ_{ε}^{2} a^{2}} & 𝒥_{σ_{ε}^{2}} \end{matrix}], \tilde{λ} = (\binom{{\tilde{λ}}_{a^{2}}}{{\tilde{λ}}_{σ_{ε}^{2}}}) .

The following Theorem 2.5 and Theorem 2.6 show the asymptotic distribution of subvectors of $\hat{θ}$ .

Theorem 2.5. Assume $θ_{0} = {(1, σ_{ε 0}^{2})}^{T}$ , $σ_{ε 0}^{2} > 0$ . Then

n^{\frac{1}{2}} ({\hat{a}}^{2} - 1) \to_{d} {\tilde{λ}}_{a^{2}},

where ${\tilde{λ}}_{a^{2}} \equiv 𝒵_{a^{2}} I (𝒵_{a^{2}} ⩾ 0)$ is the mixture of the point mass distribution at 0 and the truncated normal distribution $N (0, H_{1} Γ^{- 1} 𝒱 Γ^{- T} H_{1}^{T})$ with $H_{1} = (1, 0)$ and truncation interval $[0, \infty)$ . Further, when $n \to \infty$ , ${\hat{σ}}_{ε}^{2}$ satisfies

n^{\frac{1}{2}} ({\hat{σ}}_{ε}^{2} - σ_{ε 0}^{2}) \to_{d} 𝒥_{σ_{ε}^{2}}^{- 1} Ψ_{σ_{ε}^{2}} - 𝒥_{σ_{ε}^{2}}^{- 1} 𝒥_{a^{2} σ_{ε}^{2}} {\tilde{λ}}_{a^{2}} .

Theorem 2.6. Assume $θ_{0} = {(a_{0}^{2}, 0)}^{T}$ , $a_{0}^{2} > 1$ . Then

n^{\frac{1}{2}} ({\hat{σ}}_{ε}^{2} - 0) \to_{d} {\tilde{λ}}_{σ_{ε}^{2}},

where ${\tilde{λ}}_{σ_{ε}^{2}} \equiv 𝒵_{σ_{ε}^{2}} I (𝒵_{σ_{ε}^{2}} ⩾ 0)$ is the mixture of the point mass distribution at 0 and the truncated normal distribution $N (0, H_{2} Γ^{- 1} 𝒱 Γ^{- T} H_{2}^{T})$ with $H_{2} = (0, 1)$ and truncation interval $[0, \infty)$ . Further, when $n \to \infty$ , ${\hat{a}}^{2}$ satisfies

n^{\frac{1}{2}} ({\hat{a}}^{2} - a_{0}^{2}) \to_{d} 𝒥_{a^{2}}^{- 1} Ψ_{a^{2}} - 𝒥_{a^{2}}^{- 1} 𝒥_{a^{2} σ_{ε}^{2}} {\tilde{λ}}_{σ_{ε}^{2}} .
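One way to read the boundary results in Theorems 2.5 and 2.6 is as the basis of a one-sided test for the presence of an error component. The sketch below assumes a scalar limit of the form $𝒵 I (𝒵 ⩾ 0)$ with $𝒵$ centred normal, and assumes that a plug-in estimate `sd` of the standard deviation of $𝒵$ is supplied by the user. The function names are hypothetical.

```python
from statistics import NormalDist


def boundary_limit_draw(z):
    """Map a draw z of Z to a draw of the limit Z * I(Z >= 0)."""
    return z if z >= 0 else 0.0


def boundary_pvalue(t_stat, sd):
    """One-sided p-value under the null that the parameter is on the boundary.

    t_stat : sqrt(n) * (estimate - boundary value)
    sd     : standard deviation of Z (assumed to be estimated elsewhere)
    """
    if t_stat <= 0:
        # The limit puts mass 1/2 at zero, so an estimate on the boundary
        # carries no evidence against the null.
        return 1.0
    # For t > 0, P(Z * I(Z >= 0) >= t) = P(Z >= t).
    return 1.0 - NormalDist().cdf(t_stat / sd)
```

For a positive statistic the p-value coincides with the usual upper-tail normal probability; the point mass at 0 only affects estimates that fall on the boundary.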

3. Probability Density Function Estimation

After estimating the error variances, we proceed to estimate the density function of $X$ . We adapt the likelihood-based method of Bertrand et al. (2019), which was developed within the framework of the classical additive error model and uses Bernstein polynomials to approximate the density function of a variable with compact support.

Let $V \equiv X η$ , then $log V = log X + log η$ where $log η ~ N (0, σ_{η}^{2})$ . We assume that $X$ is continuous and has a compact support $[c_{1}, c_{1} + c_{2}]$ , which does not include 0. Let $T \equiv log X$ and write $T = A S + B$ , where $A \equiv log (c_{1} + c_{2}) - log (c_{1})$ , $B \equiv log (c_{1})$ , $S \in [0, 1]$ , and $T \in [log c_{1}, log (c_{1} + c_{2})]$ . The probability density function of $W$ is given by

f_{W} (w) = \frac{1}{σ_{ε}} \int_{0}^{+ \infty} f_{V} (v) ϕ (\frac{w - v}{σ_{ε}}) d v = \frac{1}{σ_{ε}} \int_{0}^{+ \infty} \frac{1}{A σ_{η} v} \int_{B}^{A + B} f_{S} (\frac{t - B}{A}) ϕ (\frac{log v - t}{σ_{η}}) d t ϕ (\frac{w - v}{σ_{ε}}) d v = \frac{1}{A} \int_{B}^{A + B} f_{S} (\frac{t - B}{A}) \int_{0}^{+ \infty} \frac{1}{v σ_{ε} σ_{η}} ϕ (\frac{log v - t}{σ_{η}}) ϕ (\frac{w - v}{σ_{ε}}) dvdt,

(6)

where $f_{S}$ is the density of $S$ , and $ϕ (\cdot)$ is the density of the standard normal distribution.
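To make the inner integral over $v$ in (6) concrete, the sketch below evaluates it numerically for fixed $w$ and $t$ . It uses the substitution $z = (log v - t) / σ_{η}$ , which turns the integrand into $ϕ (z) ϕ ((w - e^{t + σ_{η} z}) / σ_{ε}) / σ_{ε}$ , followed by a plain trapezoid rule. This is an illustrative quadrature with hypothetical function names and grid settings; it is not the Laplace approximation used in the paper.

```python
import math


def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def inner_kernel(w, t, sigma_eps, sigma_eta, half_width=8.0, n_grid=801):
    """Trapezoid approximation of the inner integral in (6):

        int_0^inf (v * s_eps * s_eta)^{-1} phi((log v - t)/s_eta)
                                          phi((w - v)/s_eps) dv,

    computed on a z-grid after substituting z = (log v - t) / s_eta.
    """
    h = 2.0 * half_width / (n_grid - 1)
    total = 0.0
    for j in range(n_grid):
        z = -half_width + j * h
        v = math.exp(t + sigma_eta * z)
        weight = 0.5 if j in (0, n_grid - 1) else 1.0  # trapezoid end weights
        total += weight * phi(z) * phi((w - v) / sigma_eps) / sigma_eps
    return total * h
```

For fixed $t$ the kernel is the density at $w$ of $e^{t} η + ε$ , so it integrates to 1 over $w$ , which gives a simple sanity check on the grid settings.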

We intend to use a Bernstein polynomial to estimate the unknown density $f_{S}$ . A Bernstein polynomial of degree $m$ can be expressed as:

B_{m} (s) = \sum_{k = 0}^{m} α_{k, m} b_{k, m} (s), s \in [0, 1],

(7)

where $b_{k, m} (s) = (\binom{m}{k}) s^{k} (1 - s)^{m - k}$ , for $k = 0, \dots, m$ . The method is based on the property that any continuous function $f (s)$ defined on [0, 1] can be uniformly approximated by such a polynomial, by taking $α_{k, m} = f (\frac{k}{m})$ , that is

lim_{m \to \infty} sup_{0 \leq s \leq 1} |\sum_{k = 0}^{m} f (\frac{k}{m}) b_{k, m} (s) - f (s)| = 0 .

We first approximate the density $f_{S}$ by a Bernstein polynomial, which is equivalent to a mixture of $m + 1$ densities of $Beta (k + 1, m - k + 1)$ distributions,

{\tilde{f}}_{S, m} (s; θ_{m}) = \sum_{k = 0}^{m} f_{S} (\frac{k}{m}) b_{k, m} (s) = \sum_{k = 0}^{m} θ_{k, m} {Beta}_{k + 1, m - k + 1} (s),

where ${Beta}_{α, β} (\cdot)$ is the pdf of a Beta distribution with parameters $α$ and $β$ and $θ_{m} = {(θ_{0, m}, \dots, θ_{m, m})}^{T}$ , $θ_{k, m} = \frac{1}{m + 1} f_{S} (\frac{k}{m})$ , and $θ_{k, m} \geq 0$ . Since ${\tilde{f}}_{S, m} (\cdot; θ_{m})$ must be a density for each $m$ , we impose the constraint $\sum_{k = 0}^{m} θ_{k, m} = 1$ . The density $f_{W}$ is estimated by

{\tilde{f}}_{W, m} (w; σ_{ε}, σ_{η}, θ_{m}) = \frac{1}{A σ_{ε} σ_{η}} \sum_{k = 0}^{m} θ_{k, m} \int_{B}^{A + B} {Beta}_{k + 1, m - k + 1} (\frac{t - B}{A}) \int_{0}^{+ \infty} \frac{1}{v} ϕ (\frac{log v - t}{σ_{η}}) ϕ (\frac{w - v}{σ_{ε}}) dvdt .

(8)
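The Beta-mixture form of the Bernstein approximation rests on the identity ${Beta}_{k + 1, m - k + 1} (s) = (m + 1) b_{k, m} (s)$ . The following minimal sketch checks that identity and evaluates a mixture density for a given weight vector. The function names and the example weights are hypothetical.

```python
import math


def bernstein_basis(k, m, s):
    """b_{k,m}(s) = C(m, k) s^k (1 - s)^(m - k)."""
    return math.comb(m, k) * s**k * (1 - s) ** (m - k)


def beta_pdf(s, alpha, beta):
    """Density of Beta(alpha, beta) at s in (0, 1)."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(log_norm) * s ** (alpha - 1) * (1 - s) ** (beta - 1)


def mixture_density(s, theta):
    """Bernstein-type density: sum_k theta_k * Beta_{k+1, m-k+1}(s).

    theta has length m + 1, with nonnegative entries summing to 1.
    """
    m = len(theta) - 1
    return sum(t * beta_pdf(s, k + 1, m - k + 1) for k, t in enumerate(theta))
```

With nonnegative weights that sum to 1, `mixture_density` is itself a density on $[0, 1]$ , which is why the constraint on $θ_{m}$ is imposed.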

In Supplement S3, we prove that ${\tilde{f}}_{W, m} (\cdot)$ uniformly converges to $f_{W} (\cdot)$ , i.e.,

lim_{m \to \infty} sup_{w} |{\tilde{f}}_{W, m} (w) - f_{W} (w)| = 0 .

(9)

When an iid sample of $W$ , $W_{1}, \dots, W_{n}$ , is available, the log-likelihood function of the parameter set $(σ_{ε}, σ_{η}, θ_{m})$ given the observed data is

ℒ (θ_{m}; σ_{ε}, σ_{η}) = \sum_{i = 1}^{n} log \{\frac{1}{A σ_{ε} σ_{η}} \sum_{k = 0}^{m} θ_{k, m} \int_{B}^{A + B} {Beta}_{k + 1, m - k + 1} (\frac{t - B}{A}) \int_{0}^{+ \infty} \frac{1}{v} ϕ (\frac{log v - t}{σ_{η}}) ϕ (\frac{w_{i} - v}{σ_{ε}}) dvdt\} .

(10)

Given the estimated error standard deviations $({\hat{σ}}_{ε}, {\hat{σ}}_{η})$ , the estimator of the Bernstein polynomial parameters can be obtained by maximizing the log-likelihood function with respect to $θ_{m}$ given the degree of the Bernstein polynomial.

Our proposed estimation procedure is to (1) estimate the error variances through the method proposed in Section 2.2 and insert $({\hat{σ}}_{ε}, {\hat{σ}}_{η})$ in the log-likelihood function; (2) obtain the estimated parameters by maximizing the log-likelihood function; (3) select the degree of the Bernstein polynomial based on the AIC. Since the log-likelihood function contains double integrals, the computation of the maximum likelihood estimator is extremely time-consuming. To reduce the time cost, we apply the Laplace approximation to estimate the integral in the log-likelihood function (10). Details regarding the Laplace approximation can be found in Supplement S3.

Remark 3.1. For convenience, we have used Bernstein polynomials as basis functions, which requires $X$ to have a compact support. This approach is feasible when the support of $X$ can be easily determined. When it is difficult to know the support of $X$ , we suggest using alternative basis functions such as Laguerre polynomials, instead of estimating the support of $X$ . Indeed, estimation of the support of an unobserved random variable is challenging and can induce complexity in the subsequent analysis (Kneip et al., 2015; Florens et al., 2020).

4. Error-in-variable Regression Problem

4.1. Error-in-variable Linear Regression Problem

Let $\{(Y_{i}, X_{i}, Z_{i}), i = 1, 2, \dots, n\}$ be a sample of $n$ independent and identically distributed triplets, where $Y_{i}$ is the response variable, $X_{i}$ is univariate, $Z_{i}$ is multivariate and $n$ denotes the sample size. We consider the case where the covariate $X_{i}$ is error-prone and replaced by the surrogate measurement $W_{i j}$ , and $Z_{i}$ is measured without error. The linear model is given by

Y_{i} = β_{0} + β_{x} X_{i} + β_{z}^{T} Z_{i} + e_{i}

with

W_{i j} = X_{i} η_{i j} + ε_{i j}, i = 1, 2, \dots, n, j = 1, 2, \dots, k

where $k$ is the number of replicates for each $X_{i}$ , $β_{0}$ , $β_{x}$ , $β_{z}$ are regression parameters to be estimated, $(e_{i}, η_{i j}, ε_{i j})$ are mutually independent and $e_{i}$ satisfies $E (e_{i} ∣ X_{i}, Z_{i}) = 0$ and it has constant variance $σ_{e}^{2}$ .

When the covariates are contaminated by classical additive or multiplicative errors, traditional least-squares methods will produce biased, inconsistent estimators and invalid statistical inference results. In simple linear regression, when covariates are measured with classical additive error, the estimator established based on contaminated data will be biased towards 0. This phenomenon is called attenuation. It can be proved that with both multiplicative and additive errors, the ordinary least-squares estimator ${({\hat{β}}_{w *}, {\hat{β}}_{z *}^{T})}^{T}$ will also not consistently estimate ${(β_{x}, β_{z}^{T})}^{T}$ and will attenuate to 0. The proof of the attenuation effect can be found in Supplement S4.
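The attenuation effect can be illustrated with a small simulation in the simplest case of a single surrogate per subject and no covariate $Z$ . The distribution of $X$ , the parameter values, and the function names below are arbitrary choices made for this sketch; they are not taken from the paper's simulation study.

```python
import math
import random


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate_attenuation(n=5000, beta_x=2.0, sigma_eta=0.45, sigma_eps=0.7, seed=1):
    """Return (slope using the true X, naive slope using the surrogate W)."""
    rng = random.Random(seed)
    x = [rng.uniform(1.0, 3.0) for _ in range(n)]                # positive X
    y = [1.0 + beta_x * xi + rng.gauss(0.0, 0.5) for xi in x]    # linear model
    # W = X * eta + eps with log(eta) ~ N(0, sigma_eta^2), eps ~ N(0, sigma_eps^2)
    w = [xi * math.exp(rng.gauss(0.0, sigma_eta)) + rng.gauss(0.0, sigma_eps)
         for xi in x]
    return ols_slope(x, y), ols_slope(w, y)
```

In this setting the naive slope converges to $β_{x} cov (X, W) / Var (W)$ , which is smaller in magnitude than $β_{x}$ , so the second returned value is pulled towards 0 relative to the first.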

4.2. Regression Calibration

Regression calibration (RC) is a statistical method commonly used to address the issue of error-in-variable regression. It was developed by Gleser (1990), Carroll and Stefanski (1990), and others, and is widely used in nutritional epidemiology to correct measurement error bias. The idea of regression calibration is to replace unobserved $X$ by the conditional mean of $X$ given $(W, Z)$ .

The general approach of regression calibration involves three steps. First, a calibration model $f_{X} (W, Z)$ is constructed by regressing the true covariate $X$ on $\{W, Z\}$ . Second, the unknown $X$ is replaced by the fitted value $\hat{X}$ using $f_{X} (W, Z)$ , and parameter estimates are obtained by regressing the response $Y$ on $(\hat{X}, Z)$ . Finally, standard errors of the parameter estimates are adjusted to account for uncertainties in both the regression model and the calibration model. The key justification for regression calibration is that the estimates obtained from the regression model of $\{Y, \hat{X}, Z\}$ are consistent for the parameters of the true model $\{Y, X, Z\}$ for linear models and additive normal errors.

Based on the idea of the best linear approximation proposed by Carroll et al. (2006), we propose the best linear approximation to $X$ given $\{\bar{W}, Z\}$ in our setting, which is given by

E (X∣ \bar{W}, Z) \approx E (X) + {(\binom{σ_{w x}}{Σ_{z x}})}^{T} {(\begin{matrix} σ_{\bar{w w}}^{2} & Σ_{w z} \\ Σ_{z w} & Σ_{z z} \end{matrix})}^{- 1} (\binom{\bar{W} - μ_{w}}{Z - μ_{z}}),

where

σ_{\bar{w w}}^{2} = \frac{1}{k} E (X^{2}) (e^{2 σ_{η}^{2}} - e^{σ_{η}^{2}}) + \{E (X^{2}) - E^{2} (X)\} e^{σ_{η}^{2}} + \frac{1}{k} σ_{ε}^{2}

denotes the variance of $\bar{W}$ , $Σ_{a b}$ denotes the covariance matrix between random vectors $A$ and $B$ , and $Σ_{z z}$ is the covariance matrix of $Z$ .

Let ${\hat{X}}_{i}$ denote the estimate of $E (X_{i} ∣ {\bar{W}}_{i}, Z_{i})$ . We have

{\hat{X}}_{i} = \hat{E} (X) + {(\binom{{\hat{σ}}_{w x}}{{\hat{Σ}}_{z x}})}^{T} {(\begin{matrix} {\hat{σ}}_{\bar{w w}}^{2} & {\hat{Σ}}_{w z} \\ {\hat{Σ}}_{z w} & {\hat{Σ}}_{z z} \end{matrix})}^{- 1} (\binom{{\bar{W}}_{i} - {\hat{μ}}_{w}}{Z_{i} - {\hat{μ}}_{z}}),

where

{\bar{W}}_{i \cdot} \equiv \frac{1}{k} \sum_{j = 1}^{k} W_{i j}, {\hat{σ}}_{w x} \equiv \hat{E} (η) (\hat{E} (X^{2}) - \hat{E} (X)^{2}), {\hat{μ}}_{w} \equiv \frac{1}{n} \sum_{i = 1}^{n} {\bar{W}}_{i \cdot}, {\hat{μ}}_{z} \equiv \frac{1}{n} \sum_{i = 1}^{n} Z_{i}, {\hat{Σ}}_{z x} \equiv \hat{E} (η)^{- 1} {\hat{Σ}}_{w z}, \hat{E} (X) \equiv \hat{E} (η)^{- 1} {\hat{μ}}_{w}, \hat{E} (X^{2}) \equiv (\hat{E} (W_{1}^{2}) - {\hat{σ}}_{ε}^{2}) / {\hat{a}}^{4}, \hat{E} (W_{1}^{2}) \equiv \frac{1}{n k} \sum_{j = 1}^{k} \sum_{i = 1}^{n} W_{i j}^{2}, \hat{E} (η) \equiv \sqrt{{\hat{a}}^{2}}, {\hat{Σ}}_{z w} \equiv {\hat{Σ}}_{w z}^{T},

${\hat{σ}}_{\bar{w w}}^{2}$ is the sample variance of ${\bar{W}}_{i \cdot}$ , ${\hat{Σ}}_{w z}$ is the sample covariance matrix between $W$ and $Z$ , ${\hat{Σ}}_{z z}$ is the sample variance of $Z$ , $k$ is the number of the replicates, and ${\hat{a}}^{2}$ and ${\hat{σ}}_{ε}^{2}$ are from the estimator $\hat{θ} = {({\hat{a}}^{2}, {\hat{σ}}_{ε}^{2})}^{T}$ obtained in Section 2.2.
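For intuition, the calibration formula can be sketched in the special case without the error-free covariate $Z$ , where it reduces to $\hat{X}_{i} = \hat{E} (X) + ({\hat{σ}}_{w x} / {\hat{σ}}_{\bar{w w}}^{2}) ({\bar{W}}_{i \cdot} - {\hat{μ}}_{w})$ . The code below takes ${\hat{a}}^{2}$ and ${\hat{σ}}_{ε}^{2}$ as given inputs. The function name is hypothetical, and the use of the divisor-$n$ variance for the sample variance of ${\bar{W}}_{i \cdot}$ is an assumption of this sketch.

```python
from statistics import fmean, pvariance


def regression_calibration(replicates, a2_hat, sigma_eps2_hat):
    """Calibrated values X-hat_i from replicates (W_i1, ..., W_ik), no Z.

    replicates     : list of equal-length tuples of surrogate measurements
    a2_hat         : estimate of a^2 = exp(sigma_eta^2)
    sigma_eps2_hat : estimate of the additive error variance
    """
    w_bar = [fmean(r) for r in replicates]            # subject-level means
    mu_w = fmean(w_bar)
    e_eta = a2_hat ** 0.5                             # E-hat(eta) = sqrt(a2-hat)
    e_x = mu_w / e_eta                                # E-hat(X)
    e_w2 = fmean(w * w for r in replicates for w in r)
    e_x2 = (e_w2 - sigma_eps2_hat) / a2_hat ** 2      # a^4 = (a^2)^2
    sigma_wx = e_eta * (e_x2 - e_x ** 2)              # cov-hat(W-bar, X)
    var_w_bar = pvariance(w_bar)                      # divisor n (assumption)
    slope = sigma_wx / var_w_bar
    return [e_x + slope * (wb - mu_w) for wb in w_bar]
```

When both error variances are set to their boundary values ( ${\hat{a}}^{2} = 1$ , ${\hat{σ}}_{ε}^{2} = 0$ ) and the replicates agree, the slope equals 1 and the calibrated values reproduce the data; a positive ${\hat{σ}}_{ε}^{2}$ shrinks them towards the mean.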

Replacing the unknown $X_{i}$ in the linear regression function by the estimates ${\hat{X}}_{i}$ leads to

Y_{i} = β_{0} + β_{x} {\hat{X}}_{i} + β_{z}^{T} Z_{i} + e_{i}^{*}, i = 1, \dots, n,

then the estimator of regression parameters ${({\hat{β}}_{x *}, {\hat{β}}_{z *}^{T})}^{T}$ is obtained by the ordinary least-squares method based on $\{(Y_{i}, {\hat{X}}_{i}, Z_{i}), i = 1, 2, \dots, n\}$ .

Theorem 4.1. The regression calibration estimator ${({\hat{β}}_{x *}, {\hat{β}}_{z *}^{T})}^{T}$ is consistent.

The proof of Theorem 4.1 is presented in Supplement S4.

The issue here is that the linear regression based on the calibrated data $\{(Y_{i}, {\hat{X}}_{i}, Z_{i}), i = 1, 2, \dots, n\}$ is heteroscedastic due to the non-constant conditional variance of $X$ . As a result, the usual standard errors are not accurate, even though the ordinary least-squares estimator is consistent (Carroll et al., 2006). To address this issue, two methods, the bootstrap method and the sandwich method, have been proposed to estimate the standard errors. The bootstrap method requires knowledge of the conditional expectation $E (X ∣ Z, W)$ and the conditional variance $Var (X ∣ Z, W)$ , which are unknown in this case. Hence, we apply the sandwich method to construct the standard errors (Huber, 1967). Let ${\hat{X}}^{*} = (1, \hat{X})^{T}$ and $β = {(β_{0}, β_{x}, β_{z}^{T})}^{T}$ . In the regression calibration model, the estimates ${\hat{X}}_{i}$ depend on the nuisance parameter $\tilde{θ} \equiv {(θ^{T}, μ_{w}, μ_{z}, σ_{\bar{w w}}^{2}, vec {(Σ_{z x})}^{T}, vec {(Σ_{w z})}^{T}, vec {(Σ_{z z})}^{T})}^{T}$ . If the nuisance parameter happens to be known, the ordinary least-squares estimator can be obtained based on the estimating equation $n^{- 1} \sum_{i = 1}^{n} Ψ_{1 i} (β) = 0$ where

Ψ_{1 i} (β) = [\begin{array}{l} Y_{i} - β_{0} - β_{x} {\hat{X}}_{i} - β_{z}^{T} Z_{i} \\ (Y_{i} - β_{0} - β_{x} {\hat{X}}_{i} - β_{z}^{T} Z_{i}) {\hat{X}}_{i} \\ (Y_{i} - β_{0} - β_{x} {\hat{X}}_{i} - β_{z}^{T} Z_{i}) Z_{i} \end{array}] .

More realistically, if the nuisance parameter is unknown, according to Section 2.2, the nuisance parameter is estimated based on the estimating function $n^{- 1} \sum_{i = 1}^{n} Ψ_{2 i} (\tilde{θ}) = 0$ where $Ψ_{2 i}$ is defined in Supplement S4. Based on the idea of joint estimating equations (Wang, 1999), $δ \equiv {(β^{T}, θ^{T}, μ_{w}, μ_{z}, σ_{\bar{w w}}^{2}, vec {(Σ_{z x})}^{T}, vec {(Σ_{w z})}^{T}, vec {(Σ_{z z})}^{T})}^{T}$ can be estimated through solving $n^{- 1} \sum_{i = 1}^{n} {\tilde{Ψ}}_{i} (δ) = 0$ where ${\tilde{Ψ}}_{i} (δ) = {(Ψ_{1 i}^{T} (δ), Ψ_{2 i}^{T} (δ))}^{T}$ . Let $\hat{δ}$ denote the resulting estimator. The estimating function $n^{- 1} \sum_{i = 1}^{n} {\tilde{Ψ}}_{i} (δ)$ has mean 0 when evaluated at the true parameter $δ_{0}$ , that is, $E (n^{- 1} \sum_{i = 1}^{n} {\tilde{Ψ}}_{i} (δ_{0})) = 0$ .

We can prove that $\hat{δ}$ is a consistent estimator of $δ_{0}$ and $n^{1 / 2} (\hat{δ} - δ_{0}) \overset{d}{\to} N (0, Σ_{R C})$ where $Σ_{R C} \equiv {({\tilde{Ψ}}^{*} (δ))}^{- 1} E ({\tilde{Ψ}}_{i} (δ) {\tilde{Ψ}}_{i} (δ)^{T}) {({\tilde{Ψ}}^{*} (δ))}^{- T}$ and ${\tilde{Ψ}}^{*} (δ) \equiv \partial / \partial δ^{T} E (\tilde{Ψ} (δ))$ . We apply the sandwich estimator (Carroll et al., 2006; Huber, 1967) to consistently estimate the variance. Specifically, the sandwich estimator of the covariance matrix is $n^{- 1} {\hat{A}}_{n}^{- 1} {\hat{B}}_{n} {\hat{A}}_{n}^{- T}$ where

{\hat{A}}_{n} = n^{- 1} \sum_{i = 1}^{n} \frac{\partial}{\partial δ^{T}} {\tilde{Ψ}}_{i} (\hat{δ}), {\hat{B}}_{n} = n^{- 1} \sum_{i = 1}^{n} {\tilde{Ψ}}_{i} (\hat{δ}) {\tilde{Ψ}}_{i}^{T} (\hat{δ}) .

The proof of consistency and asymptotic normality is presented in Supplement S4, along with the derivation of the sandwich estimator.

4.3. Simulation Extrapolation

We further consider the SIMEX method, which consists of three steps: simulation, estimation, and extrapolation. In the simulation step, we generate remeasured data that incorporate additional multiplicative and additive errors, constructed so that they satisfy the basic requirement of SIMEX. In the estimation and extrapolation steps, we explore various commonly employed extrapolation functions. We also discuss how to use SIMEX when a data set contains replicates. In that case, minor adjustments are needed when constructing the remeasured data in order to fulfill the fundamental requirement of SIMEX; failure to do so may introduce bias into the analysis (Carroll et al., 2006).

In the simulation step, given the measurement error model (3), we simulate the new data indexed by $ξ > 0$ as

W_{b, i} (ξ) = (W_{i} + \sqrt{ξ} ε_{b, i}) e^{\sqrt{ξ} log (η_{b, i})}, i = 1, \dots, n, b = 1, \dots, B,

(11)

where $η_{b, i}$ and $ε_{b, i}$ are mutually independent errors that are independent of all the observed data, and are identically distributed following $log η_{b, i} ~ N (0, σ_{η}^{2})$ , $ε_{b, i} ~ N (0, σ_{ε}^{2})$ . Note that $W_{b, i} (0) = W_{i}$ and in Supplement S4 we show that $MSE (W_{b, i} (- 1) ∣ X_{i}) = 0$ . Thus, intuitively, when $ξ = 0$ , $W_{b, i} (ξ)$ represents the observed data. As $ξ$ increases, measurement errors are inflated, while at $ξ = - 1$ , the errors are reduced to zero.

For each $ξ$ , the estimated model parameters are given by

\hat{β} (ξ) = B^{- 1} \sum_{b = 1}^{B} {\hat{β}}_{b} (ξ),

where ${\hat{β}}_{b} (ξ)$ is obtained using the least-squares method based on the data ${\{W_{b, i} (ξ)\}}_{i = 1}^{n}$ .

In the extrapolation step, the extrapolation function models $\hat{β} (ξ)$ as a function of $ξ$ . In this study, quadratic extrapolation is used. Setting $ξ = - 1$ in the extrapolation function yields the SIMEX estimator, which we denote ${\hat{β}}_{simex}$ .
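The simulation, estimation, and extrapolation steps can be sketched for a simple linear regression of $Y$ on a single error-prone $W$ as follows. This is a minimal illustration, not the authors' implementation: the function name `simex_linear`, the grid of $ξ$ values, the inclusion of the naive fit at $ξ = 0$ in the extrapolation, and the default $B$ are assumptions made here. The remeasurement follows equation (11), and the extrapolant is quadratic.

```python
import numpy as np

def simex_linear(y, w, sigma_eta, sigma_eps,
                 xis=(0.5, 1.0, 1.5, 2.0), B=100, seed=0):
    """SIMEX sketch for Y = b0 + bx * X + e with W observed instead of X.

    sigma_eta and sigma_eps are the (known or estimated) standard
    deviations of log(eta) and epsilon; the xi grid is an assumption.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(y)

    def ls_fit(w_vec):
        # Ordinary least squares of y on (1, w_vec).
        design = np.column_stack([np.ones(n), w_vec])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        return coef

    grid = [0.0]            # xi = 0 corresponds to the observed data
    estimates = [ls_fit(w)]
    for xi in xis:
        fits = np.empty((B, 2))
        for b in range(B):
            eps_b = rng.normal(0.0, sigma_eps, n)
            log_eta_b = rng.normal(0.0, sigma_eta, n)
            # Remeasured data as in equation (11).
            w_b = (w + np.sqrt(xi) * eps_b) * np.exp(np.sqrt(xi) * log_eta_b)
            fits[b] = ls_fit(w_b)
        grid.append(xi)
        estimates.append(fits.mean(axis=0))   # average over b = 1..B

    grid = np.asarray(grid)
    estimates = np.asarray(estimates)          # (len(grid), 2)

    # Quadratic extrapolation of each coefficient, evaluated at xi = -1.
    coefs = np.polyfit(grid, estimates, deg=2)  # (3, 2)
    return np.array([np.polyval(coefs[:, k], -1.0) for k in range(2)])
```

When both standard deviations are zero, the remeasured data coincide with the observed data for every $ξ$, so the sketch returns the naive least-squares fit; this is a convenient deterministic check.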

In Supplement S4.5, we prove that $\sqrt{n} ({\hat{β}}_{simex} - β_{0})$ follows an asymptotic normal distribution. However, the variance of the SIMEX estimator involves derivations based on high-dimensional matrices. To address this, we apply the Jackknife-type method to estimate the variance of the SIMEX estimator. Following page 392 in Carroll et al. (2006),

Var ({\hat{β}}_{simex}) \approx Var ({\hat{β}}_{true}) + Var ({\hat{β}}_{simex} - {\hat{β}}_{true}),

(12)

where ${\hat{β}}_{true}$ is the least-squares estimator based on ${\{Y_{i}, Z_{i}, X_{i}\}}_{i = 1}^{n}$ . To simplify notation, let $τ^{2} = Var ({\hat{β}}_{true})$ and ${\hat{τ}}_{b}^{2} (ξ) = \hat{Var} ({\hat{β}}_{b} (ξ))$ , ${\hat{τ}}^{2} (ξ) = {lim}_{B \to \infty} B^{- 1} \sum_{b = 1}^{B} {\hat{τ}}_{b}^{2} (ξ)$ .

To estimate the first term $Var ({\hat{β}}_{true})$ on the right side of equation (12), an extrapolant model is fit to the components of ${\{{\hat{τ}}^{2} (ξ_{m}), ξ_{m}\}}_{1}^{M}$ and the estimate of $Var ({\hat{β}}_{true})$ is the modeled value at $ξ = - 1$ . For the second term $Var ({\hat{β}}_{simex} - {\hat{β}}_{true})$ , it is proved that

Var ({\hat{β}}_{simex} - {\hat{β}}_{true}) = - lim_{ξ \to - 1} E \{s_{Δ}^{2} (ξ)\},

where

s_{Δ}^{2} (ξ) = (B - 1)^{- 1} \sum_{b = 1}^{B} ({\hat{β}}_{b} (ξ) - \hat{β} (ξ)) {({\hat{β}}_{b} (ξ) - \hat{β} (ξ))}^{T} .

Hence, the estimate of $Var ({\hat{β}}_{simex})$ can be obtained by extrapolating the components of the differences, ${\hat{τ}}^{2} (ξ) - s_{Δ}^{2} (ξ)$ , to $ξ = - 1$ . It is important to emphasize that the entire procedure is an approximation and is typically only valid in situations where the sample size is large and the measurement error is small (Carroll et al., 2006).
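The extrapolation of ${\hat{τ}}^{2} (ξ) - s_{Δ}^{2} (ξ)$ to $ξ = - 1$ can be sketched as below. This is an illustration under assumptions, not the authors' code: the function name and array layout are hypothetical, the inputs ${\hat{β}}_{b} (ξ_{m})$ and ${\hat{τ}}_{b}^{2} (ξ_{m})$ are taken as already computed, the limit in $B$ is replaced by a finite average, and a quadratic extrapolant is assumed for every matrix component.

```python
import numpy as np

def simex_variance(xis, beta_b, tau2_b, degree=2):
    """Jackknife-type variance sketch for the SIMEX estimator.

    xis    : (M,) grid of xi values
    beta_b : (M, B, p) estimates beta_b(xi_m)
    tau2_b : (M, B, p, p) estimated covariance matrices of beta_b(xi_m)
    Returns the (p, p) extrapolated value of tau^2(xi) - s_Delta^2(xi)
    at xi = -1.
    """
    xis = np.asarray(xis, dtype=float)
    beta_b = np.asarray(beta_b, dtype=float)
    tau2_b = np.asarray(tau2_b, dtype=float)
    M, B, p = beta_b.shape

    # tau^2(xi_m): average of the B model-based covariance estimates.
    tau2 = tau2_b.mean(axis=1)                               # (M, p, p)

    # s_Delta^2(xi_m): sample covariance of beta_b(xi_m) over b.
    centered = beta_b - beta_b.mean(axis=1, keepdims=True)
    s2 = np.einsum("mbi,mbj->mij", centered, centered) / (B - 1)

    # Fit one polynomial per matrix component and evaluate at xi = -1.
    diff = (tau2 - s2).reshape(M, p * p)
    coefs = np.polyfit(xis, diff, deg=degree)                # (degree+1, p*p)
    powers = (-1.0) ** np.arange(degree, -1, -1)
    return (powers @ coefs).reshape(p, p)
```

Because the result is an extrapolation, nothing forces the diagonal entries to be positive, so a caller should check them before taking square roots.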

In the case of $k$ replicates, incorporating the mean of the replicates into equation (11) becomes challenging due to the fact that the distribution of $\overline{η}$ no longer follows a Log-Normal distribution. To address this issue, we can utilize the estimator $\hat{β} (ξ)$ to improve the accuracy of the extrapolation function. We define ${\hat{β}}_{b, j} (ξ)$ as the estimator obtained based on the remeasured data from the $j$ th replicates, denoted by ${\{W_{b, i, j} (ξ)\}}_{i = 1}^{n}$ . For each $ξ$ and the $j$ th replicate, the estimated parameters are given by

\hat{β} (ξ)_{j} = B^{- 1} \sum_{b = 1}^{B} {\hat{β}}_{b, j} (ξ) .

We then conduct a quadratic regression of $\hat{β} (ξ)$ on $ξ$ based on the set of data points $\{ξ_{i}, \hat{β} {(ξ_{i})}_{j}\}$ , $i = 1, \dots, m$ , $j = 1, \dots, k$ using nonlinear least-squares method to get the extrapolation function, where $k$ denotes the number of replicates and $m$ denotes the number of chosen values of $ξ$ . Setting $ξ = - 1$ in the extrapolation function gives the SIMEX estimator ${\hat{β}}_{simex}$ .

This procedure allows us to obtain more accurate coefficients by considering all the replicates instead of just one. A similar approach is applied to extrapolate the variance as well.

5. Simulation

5.1. Error Variance Estimation

We conduct simulations to evaluate the performance of the estimator $({\hat{σ}}_{η}^{2}, {\hat{σ}}_{ε}^{2})$ obtained in Section 2.2. Since estimating the error variances requires $X$ to be either positive or negative, we generated $X$ from different distributions with positive supports. We define the noise-to-signal ratio as $\sqrt{σ_{w}^{2} / σ_{x}^{2} - 1}$ . We specify the noise-to-signal ratio to take values from the set {0, 0.25, 0.50, 0.75}, and then we examine different combinations of standard deviations for two types of errors while controlling the noise-to-signal ratio. For each simulation setting, 500 datasets were generated independently for three sample sizes $n = {300, 500, 1000}$ . We present the bias, relative bias, empirical standard deviations, and empirical mean squared error as evaluation criteria for the estimates. Figures 1 and 2 present the simulation results for the estimation when $X ~ LogNormal (0, 0.5)$ and $X ~ LogNormal (0, 1)$ , respectively, with sample sizes of 500 (gray line) and 1000 (black line). The subfigures in each column show the estimation results under different combinations of $σ_{η}$ and $σ_{ε}$ , corresponding to the noise-to-signal ratios 0.25, 0.5, 0.75 respectively. Since we control the ratio to be these three levels, when $σ_{η}$ decreases, $σ_{ε}$ increases correspondingly. The true values of $σ_{η}$ and $σ_{ε}$ are shown respectively in the x-axis of the plots. More comprehensive simulation results can be found in Table S4 – Table S21 of Supplement S7.
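As a small worked check of the noise-to-signal ratio, consider the boundary case $σ_{η} = 0$ with an additive error independent of $X$, so that $σ_{w}^{2} = σ_{x}^{2} + σ_{ε}^{2}$ and the ratio equals $σ_{ε} / σ_{x}$. The sketch below makes two assumptions that are not stated in the text: the helper names are hypothetical, and the second argument of $LogNormal (0, 0.5)$ is read as the standard deviation on the log scale.

```python
import math

def lognormal_sd(mu, s):
    """Standard deviation of X when log(X) ~ N(mu, s^2)."""
    return math.sqrt((math.exp(s ** 2) - 1.0) * math.exp(2.0 * mu + s ** 2))

def noise_to_signal(var_w, var_x):
    """Noise-to-signal ratio sqrt(var_w / var_x - 1)."""
    return math.sqrt(var_w / var_x - 1.0)

def additive_sd_for_ratio(ratio, sd_x):
    """sigma_eps giving the target ratio when sigma_eta = 0,
    assuming Var(W) = Var(X) + sigma_eps^2."""
    return ratio * sd_x

sd_x = lognormal_sd(0.0, 0.5)
for ratio in (0.25, 0.5, 0.75):
    print(ratio, round(additive_sd_for_ratio(ratio, sd_x), 3))
```

Under these assumptions the three ratios give $σ_{ε} \approx 0.151$, $0.302$, and $0.453$, which agree with the rows of Table 1 for LogNorm(0, 0.5) in which $σ_{η} = 0$.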

Figure 1: — Simulation results for estimating $σ_{η}$ and $σ_{ε}$ when $X ~ LogNormal (0, 0.5)$ , based on sample sizes of 500 and 1000, under various combinations of $σ_{η}$ and $σ_{ε}$ corresponding to noise-to-signal ratios of 0.25, 0.5 and 0.75.

Figure 2: — Simulation results for estimating $σ_{η}$ and $σ_{ε}$ when $X ~ LogNormal (0, 1)$ , based on sample sizes of 500 and 1000, under various combinations of $σ_{η}$ and $σ_{ε}$ corresponding to noise-to-signal ratios of 0.25, 0.5 and 0.75.

We observe that, in the majority of cases, the estimators effectively capture the underlying parameters with minimal bias and low standard deviations. The biases and standard deviations of ${\hat{σ}}_{η}$ tend to increase as the noise-to-signal ratio increases, but they remain stable across different true values when the same noise-to-signal ratio is maintained. The performance of ${\hat{σ}}_{ε}$ deteriorates with higher noise-to-signal ratios and increasing $σ_{η} / σ_{ε}$ ratios. Notably, when the true value of $σ_{ε}$ is zero, the estimator tends to overestimate the parameter.

We conducted additional simulations to separately investigate the impact of $σ_{η}$ with $σ_{ε}$ fixed and the impact of $σ_{ε}$ with $σ_{η}$ fixed. The results, provided in Supplement S7, show a pattern similar to that of the fixed noise-to-signal ratio scenarios. The bias and the standard deviation of ${\hat{σ}}_{η}$ increase as $σ_{η}$ increases but are not affected by changes in $σ_{ε}$ . On the other hand, the performance of ${\hat{σ}}_{ε}$ is highly influenced by changes in $σ_{η}$ . As $σ_{η}$ increases, the estimator ${\hat{σ}}_{ε}$ becomes more biased and unstable regardless of the true $σ_{ε}$ . Given $σ_{η}$ , the bias and empirical standard deviation of ${\hat{σ}}_{ε}$ tend to decrease as $σ_{ε}$ increases, but this improvement becomes smaller as $σ_{η}$ increases.

The reduced performance of ${\hat{σ}}_{ε}$ may be attributed to the inflation of the variance of $X η$ when the variance of $η$ increases, while the variance of the additive error $ε$ is relatively low compared to $X η$ , which makes it difficult to estimate precisely.

Comparing the estimation results across different distributions of the true measurement $X$ , among the unconstrained distributions, $Log-Normal (0, 0.5)$ yields the best performance, with the lowest bias and standard deviation. The estimators based on a true measurement $X$ with constrained support outperform those based on $X$ with unconstrained distributions.

Overall, when the true value of $σ_{ε}$ is not zero, the estimator $({\hat{σ}}_{η}, {\hat{σ}}_{ε})$ demonstrates high accuracy. Our simulation results suggest that the performance of these estimators is influenced by several factors, including the distribution of the true measurement variable $X$ , the sample size, the noise-to-signal ratio, and the ratio between $σ_{η}$ and $σ_{ε}$ . Specifically, an increase in the noise-to-signal ratio and in the variance of variable $X$ has a detrimental effect on the estimator’s performance. Conversely, the estimator performs better as the sample size increases.

We further perform hypothesis tests to examine the presence of measurement error, specifically testing whether $a^{2} = 1$ or $σ_{ε}^{2} = 0$ . Test 1 and Test 2 correspond to tests for multiplicative and additive errors, respectively, which are given by

Test 1 : H_{0} : a^{2} = 1 versus H_{1} : a^{2} > 1,

(13)

Test 2 : H_{0} : σ_{ε}^{2} = 0 versus H_{1} : σ_{ε}^{2} > 0 .

(14)

Instead of using the asymptotic variance expressions, we adopted a bootstrap version because it provides a better approximation. The bootstrap details are described in Supplement S2. For each dataset, 1000 bootstrap samples were used. Three significance levels, $α = 0.01, 0.05, 0.1$ , were examined. The results are shown in Table 1 and Tables S22–S23. For Test 1, the size of the test is close to the corresponding significance level, and the difference between the size and the significance level decreases as the sample size increases. The power of Test 1 is relatively high, and it increases as the noise-to-signal ratio decreases and as the significance level rises. Additionally, as the sample size increases, the performance of the hypothesis tests improves. Note that in Table 1 the sample size is set to 5000 because, compared to the error variance estimation, the hypothesis tests, specifically Test 2, require a larger sample size to achieve reasonable performance. For Test 2, the observed sizes exceed the nominal significance levels, but they align more closely when the sample size is increased to 5000 and 10000. This trend suggests that the test statistics follow the asymptotic distribution only at very large sample sizes. The requirement of a very large sample size might be caused by the complex calculations involving higher-order moments and by the combined impact of multiplicative and additive errors.

Table 1:

Simulation results regarding the hypothesis tests for samples of size 5000 with two replicates

				Test 1			Test 2
Distribution	NS ratio	$σ_{η}$	$σ_{ε}$	$α = 0.01$	$α = 0.05$	$α = 0.1$	$α = 0.01$	$α = 0.05$	$α = 0.1$
LogNorm(0,0.5)	0.25	0.106	0.000	1.000	1.000	1.000	0.046	0.102	0.149
		0.102	0.038	1.000	1.000	1.000	0.271	0.392	0.455
		0.092	0.075	1.000	1.000	1.000	0.944	0.965	0.970
		0.070	0.113	1.000	1.000	1.000	0.999	0.999	0.999
		0.000	0.151	0.004	0.045	0.093	1.000	1.000	1.000
	0.5	0.207	0.000	1.000	1.000	1.000	0.058	0.110	0.181
		0.201	0.075	1.000	1.000	1.000	0.299	0.424	0.500
		0.180	0.151	0.999	1.000	1.000	0.947	0.962	0.969
		0.139	0.226	1.000	1.000	1.000	1.000	1.000	1.000
		0.000	0.302	0.013	0.049	0.098	1.000	1.000	1.000
	0.75	0.301	0.000	1.000	1.000	1.000	0.063	0.118	0.181
		0.292	0.113	1.000	1.000	1.000	0.305	0.418	0.490
		0.264	0.226	1.000	1.000	1.000	0.970	0.978	0.981
		0.205	0.340	1.000	1.000	1.000	1.000	1.000	1.000
		0.000	0.453	0.011	0.055	0.096	1.000	1.000	1.000
Exp(0.5)	0.25	0.143	0.000	0.998	1.000	1.000	0.046	0.106	0.152
		0.138	0.125	1.000	1.000	1.000	0.464	0.563	0.618
		0.124	0.250	1.000	1.000	1.000	0.985	0.989	0.992
		0.095	0.375	0.999	1.000	1.000	1.000	1.000	1.000
		0.000	0.500	0.010	0.044	0.089	1.000	1.000	1.000
	0.5	0.276	0.000	1.000	1.000	1.000	0.064	0.127	0.194
		0.268	0.250	0.999	1.000	1.000	0.436	0.546	0.612
		0.242	0.500	1.000	1.000	1.000	0.974	0.986	0.987
		0.187	0.750	1.000	1.000	1.000	1.000	1.000	1.000
		0.000	1.000	0.008	0.043	0.097	1.000	1.000	1.000
	0.75	0.395	0.000	1.000	1.000	1.000	0.080	0.155	0.222
		0.384	0.375	1.000	1.000	1.000	0.459	0.559	0.617
		0.349	0.750	0.999	1.000	1.000	0.971	0.980	0.983
		0.274	1.125	0.999	1.000	1.000	1.000	1.000	1.000
		0.000	1.500	0.014	0.051	0.116	1.000	1.000	1.000
$χ_{3}^{2}$	0.25	0.132	0.000	1.000	1.000	1.000	0.043	0.096	0.143
		0.128	0.153	0.999	1.000	1.000	0.441	0.538	0.609
		0.115	0.306	1.000	1.000	1.000	0.989	0.992	0.993
		0.088	0.459	1.000	1.000	1.000	1.000	1.000	1.000
		0.000	0.612	0.015	0.059	0.101	1.000	1.000	1.000
	0.5	0.257	0.000	0.999	1.000	1.000	0.055	0.117	0.164
		0.249	0.306	1.000	1.000	1.000	0.411	0.535	0.594
		0.224	0.612	1.000	1.000	1.000	0.984	0.987	0.990
		0.174	0.919	1.000	1.000	1.000	1.000	1.000	1.000
		0.000	1.225	0.009	0.045	0.100	1.000	1.000	1.000
	0.75	0.369	0.000	1.000	1.000	1.000	0.075	0.137	0.198
		0.359	0.459	1.000	1.000	1.000	0.462	0.553	0.602
		0.325	0.919	1.000	1.000	1.000	0.977	0.986	0.987
		0.255	1.378	1.000	1.000	1.000	0.999	0.999	0.999
		0.000	1.837	0.014	0.042	0.097	1.000	1.000	1.000


5.2. Density Estimation

Simulations were conducted to evaluate the model fitting based on Bernstein polynomials. Since the use of Bernstein polynomials requires $X$ to have compact support and the error variance estimation requires $X$ to be positive or negative, we generated $X$ from various distributions with positive, compact support, including 2 Beta(1, 2) + 2, 2 Beta(3, 2) + 1, 2 Beta(2, 2) + 1, Normal(3, 1.5) truncated at (2, 4), Normal(4, 0.5) truncated at (3, 5), and Exp(2) truncated at (1, 2), with a sample size of 1000. For each setting, we generated 300 replications. The true variables were contaminated by both multiplicative and additive errors with a noise-to-signal ratio of 0.25. Under this ratio, $σ_{ε}$ is set to 50% of its maximum achievable value, and $σ_{η}$ is set to the corresponding value. The detailed values of $σ_{η}$ and $σ_{ε}$ are presented in Figure 3. For each subject $X$ , two replicates $W_{i}$ were generated, and multiple data sets were generated for each simulation setting.

Figure 3: — Simulation results for density estimation. In each plot, the gray lines show the estimated densities from 300 replications, the black solid line is the true density, the red dotted-dashed line is the pointwise median curve, and the blue dashed lines are the 5% and 95% pointwise quantile curves.

The estimator $({\hat{σ}}_{η}, {\hat{σ}}_{ε})$ was obtained based on the method in Section 2.2, and then plugged into the model to estimate the density function of the variable $X$ . Different degrees $m = {0, 1, 2, 3, 4, 5, 6}$ of Bernstein polynomials were used, and the AIC was used to select the degree. Following Kekeç and Van Keilegom (2022), we evaluate the density estimation via the mean integrated absolute error (MIAE),

MIAE = \frac{1}{N} \sum_{r = 1}^{N} \int |f_{X} (x) - {\hat{f}}_{X, m, r} (x)| d x,

where ${\hat{f}}_{X, m, r} (x)$ denotes the estimated density using the $r$ th replication and the selected degree $m$ , and $N$ is the total number of replications. The estimated density functions and the MIAE values are presented in Figure 3. The settings with Beta distributions and the exponential distribution yield the best results. In each setting, both the estimated curves and the pointwise median curves closely align with the true density, and the MIAE values are relatively low. The results indicate that the density of $X$ can be properly estimated through the Bernstein polynomial method.
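The MIAE criterion can be approximated numerically as in the following sketch. The function name, the grid size, and the use of the trapezoidal rule are assumptions made for illustration; the true and estimated densities are passed in as hypothetical vectorized callables on the compact support `[lower, upper]`.

```python
import numpy as np

def miae(true_pdf, est_pdfs, lower, upper, num=2001):
    """Mean integrated absolute error over a list of estimated densities.

    true_pdf and each element of est_pdfs are vectorized callables;
    the integral over [lower, upper] uses the trapezoidal rule.
    """
    grid = np.linspace(lower, upper, num)
    widths = np.diff(grid)
    f_true = true_pdf(grid)
    errors = []
    for f_hat in est_pdfs:
        abs_diff = np.abs(f_true - f_hat(grid))
        # Trapezoidal approximation of the integrated absolute error.
        errors.append(np.sum((abs_diff[1:] + abs_diff[:-1]) * widths) / 2.0)
    return float(np.mean(errors))
```

For instance, with a uniform true density on $[0, 1]$ and two constant estimates equal to 1.1 and 0.8, the integrated absolute errors are 0.1 and 0.2, so the MIAE is 0.15.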

5.3. Error-in-variable Regression Problem

In this section, simulations were conducted to evaluate the performance of the regression calibration method and simulation extrapolation. We compared traditional least-squares (LS) estimation, RC, and SIMEX based on true parameters and estimators.

The values of $X$ were generated independently from three different distributions: exponential distribution, chi-square distribution, and Log-Normal distribution, and we varied the sample size to test the performance of different methods. For each subject, two repeated measurements $W_{i j}$ were obtained for the unobserved $X_{i}$ . 100 replicates were generated for each simulation configuration. Estimators of regression parameters are compared: the least-squares estimator based on $\{Y_{i}, {\bar{W}}_{i \cdot}\}$ , the regression calibration estimators, and the SIMEX estimators using the true nuisance parameter and the estimate of the nuisance parameter respectively. The empirical bias, empirical standard error, mean of the estimated standard errors, and standard deviation of estimated standard errors of the five estimators are presented, respectively.

Two linear regression models were considered, $Y = β_{0} + β_{x} X + β_{z} Z + e$ and $Y = β_{0} + β_{x} X + e$ , where $e ~ N (0, σ_{e})$ . The values of $σ_{e}$ for the error term $e$ in the linear regression models were set to 0.4 and 0.04. The true parameter values were set to $\{β_{0} = 1, β_{x} = 0.8, β_{z} = 0.6\}$ and $\{β_{0} = 1, β_{x} = 0.8\}$ , respectively. The sample size $n$ was set to $n = {100, 500, 2000}$ , and two noise-to-signal ratios, 0.25 and 0.5, were considered. Under each ratio, $σ_{ε}$ is set to 50% of its maximum achievable value, and $σ_{η}$ is set to the corresponding value. The results are presented in Table 2 and Tables S24–S30 in Supplement S7.

Table 2:

Simulation results with $Y = 1 + 0.8 X + e$ where $e ~ N (0, 0.04)$ and the noise-to-signal ratio 0.25

			Least-squares $\{Y_{i}, {\bar{W}}_{i \cdot}\}$		Regression Calibration $σ_{η}$ and $σ_{ε}$		Regression Calibration ${\hat{σ}}_{η}$ and ${\hat{σ}}_{ε}$		SIMEX $σ_{η}$ and $σ_{ε}$		SIMEX ${\hat{σ}}_{η}$ and ${\hat{σ}}_{ε}$
Distribution	Sample Size		$β_{0}$	$β_{x}$	$β_{0}$	$β_{x}$	$β_{0}$	$β_{x}$	$β_{0}$	$β_{x}$	$β_{0}$	$β_{x}$
LogNorm(0,0.5)	n=100	Bias	0.020	−0.020	−0.013	0.014	−0.013	0.013	−0.002	0.003	−0.002	0.004
		S.D.	0.025	0.024	0.026	0.025	0.026	0.025	0.027	0.026	0.027	0.025
		$Mean (\hat{S.E.})$	0.137	0.122	0.021	0.021	0.023	0.022	0.030	0.029	0.028	0.028
		$S.D. (\hat{S.E.})$	0.008	0.009	0.005	0.006	0.005	0.006	0.006	0.006	0.007	0.008
	n=500	Bias	0.023	−0.024	−0.003	0.003	−0.003	0.003	0.000	0.000	−0.000	0.000
		S.D.	0.011	0.011	0.011	0.011	0.011	0.012	0.011	0.012	0.012	0.013
		$Mean (\hat{S.E.})$	0.091	0.080	0.011	0.010	0.011	0.011	0.014	0.014	0.014	0.014
		$S.D. (\hat{S.E.})$	0.002	0.003	0.002	0.002	0.002	0.002	0.003	0.003	0.003	0.004
	n=2000	Bias	0.023	−0.024	−0.001	0.001	−0.001	0.001	0.000	−0.001	0.000	−0.000
		S.D.	0.004	0.005	0.005	0.005	0.005	0.005	0.005	0.005	0.005	0.006
		$Mean (\hat{S.E.})$	0.064	0.057	0.006	0.005	0.006	0.006	0.007	0.007	0.007	0.007
		$S.D. (\hat{S.E.})$	0.001	0.001	0.001	0.001	0.001	0.001	0.001	0.001	0.001	0.002
Exp(0.5)	n=100	Bias	0.036	−0.023	−0.017	0.010	−0.017	0.010	0.004	−0.001	0.004	−0.001
		S.D.	0.041	0.024	0.044	0.026	0.044	0.026	0.048	0.029	0.049	0.029
		$Mean (\hat{S.E.})$	0.185	0.110	0.037	0.023	0.040	0.025	0.054	0.034	0.053	0.033
		$S.D. (\hat{S.E.})$	0.012	0.007	0.011	0.007	0.011	0.008	0.016	0.009	0.018	0.012
	n=500	Bias	0.033	−0.022	−0.008	0.005	−0.008	0.005	−0.000	0.001	−0.002	0.002
		S.D.	0.018	0.010	0.018	0.011	0.018	0.011	0.019	0.011	0.018	0.011
		$Mean (\hat{S.E.})$	0.124	0.073	0.018	0.011	0.020	0.012	0.025	0.016	0.026	0.016
		$S.D. (\hat{S.E.})$	0.004	0.002	0.004	0.002	0.004	0.003	0.007	0.004	0.007	0.003
	n=2000	Bias	0.037	−0.024	0.000	0.000	0.000	0.000	0.001	−0.000	0.001	−0.000
		S.D.	0.011	0.006	0.011	0.007	0.011	0.007	0.012	0.007	0.012	0.007
		$Mean (\hat{S.E.})$	0.088	0.052	0.010	0.006	0.011	0.007	0.013	0.008	0.014	0.008
		$S.D. (\hat{S.E.})$	0.001	0.001	0.002	0.001	0.002	0.001	0.003	0.002	0.003	0.002
$χ_{3}^{2}$	n=100	Bias	0.058	−0.024	−0.026	0.009	−0.026	0.009	0.006	−0.002	0.005	−0.002
		S.D.	0.060	0.023	0.065	0.025	0.065	0.026	0.067	0.025	0.066	0.026
		$Mean (\hat{S.E.})$	0.215	0.110	0.050	0.020	0.055	0.022	0.075	0.031	0.077	0.030
		$S.D. (\hat{S.E.})$	0.013	0.007	0.014	0.006	0.015	0.007	0.019	0.007	0.024	0.010
	n=500	Bias	0.059	−0.025	−0.003	0.001	−0.003	0.001	0.006	−0.003	0.004	−0.001
		S.D.	0.028	0.011	0.030	0.012	0.030	0.012	0.031	0.012	0.031	0.013
		$Mean (\hat{S.E.})$	0.146	0.073	0.026	0.010	0.027	0.011	0.035	0.014	0.036	0.015
		$S.D. (\hat{S.E.})$	0.004	0.002	0.006	0.002	0.006	0.002	0.007	0.003	0.009	0.004
	n=2000	Bias	0.055	−0.024	−0.004	0.001	−0.004	0.001	−0.000	−0.000	−0.002	0.000
		S.D.	0.011	0.005	0.012	0.005	0.012	0.005	0.013	0.005	0.012	0.005
		$Mean (\hat{S.E.})$	0.103	0.052	0.013	0.005	0.014	0.006	0.018	0.007	0.018	0.007
		$S.D. (\hat{S.E.})$	0.002	0.001	0.002	0.001	0.002	0.001	0.003	0.001	0.004	0.002


Here “Bias” denotes the average of $\hat{β} - β$ , “S.D.” is the standard deviation of the 1000 estimates, “$Mean (\hat{S.E.})$” denotes the average of the 1000 standard error estimates, and “$S.D. (\hat{S.E.})$” is the standard deviation of the 1000 standard error estimates.

The traditional least-squares method yields seriously biased estimates. The biases of the estimators of $β_{x}$ based on the LS method are negative, which verifies the attenuation symptom. Both the RC method and SIMEX we proposed can effectively correct the bias caused by measurement errors. The bias and the empirical standard deviation decrease with a larger sample size. The estimation methods’ performance depends on the distribution of X. However, as the sample size increases, the biases of the estimators converge to zero for all distributions.

For both RC and SIMEX, the estimator of $β_{x}$ obtained based on $({\hat{σ}}_{η}, {\hat{σ}}_{ε})$ yields similar biases to the one based on the true parameters. This suggests that using the estimator $({\hat{σ}}_{η}, {\hat{σ}}_{ε})$ will not cause significant loss. Comparing the results from these two methods, when the sample size is small, the SIMEX estimator performs better. However, as the sample size increases, the RC estimator outperforms the SIMEX estimator slightly.

For the RC method, the simulation results show that the differences between the estimates using the sandwich method and the empirical standard deviations are small, indicating that the sandwich standard error estimator can estimate the standard error well. In particular, whether or not $({\hat{σ}}_{η}, {\hat{σ}}_{ε})$ is used, the estimates of $β_{0}$ obtained through the RC method are identical. We explain the reason in Supplement S4. In the case of the SIMEX method, during the estimation of variances in the extrapolation step, it is possible to encounter negative extrapolated variances. To address this issue, we repeat the SIMEX method until a positive extrapolated variance is obtained. As shown by the simulation results, the extrapolated standard errors are similar to the empirical standard deviations, indicating that the SIMEX method accurately estimates the standard error. To conclude, the performance of both RC and SIMEX improves as the sample size increases and is negatively associated with the variances of the measurement errors and the standard error of the regression noise.

6. Analysis of a genetic data set

We proceed by applying the proposed methods to genetic data, GeneRepeat, provided by the R package augSIMEX (Zhang and Yi, 2019). The GeneRepeat dataset is adapted from the outbred Carworth Farms White (CFW) data (Parker et al., 2016). The original data were analyzed to explore the relationship between genotype and behavioral, physiological, and gene expression traits in outbred CFW mice.

The dataset consists of two parts: the main study data, which includes 672 observations, and the validation data, which includes 339 observations. Here, we only use the main study data. The main study data include measurements for the genotype of the SNP rs223979909, which serves as the response variable $Y$ in this context. The genotype is a continuous variable ranging from 0 to 2. The covariates in the main study data include error-prone measurements $W$ of the tibia length $X$ , collected repeatedly at 5-minute intervals over a period of 30 minutes $\{W_{1}, W_{2}, W_{3}, \dots, W_{6}\}$ , and the body weight $Z$ of the mice.

Prior to applying the proposed method, it is necessary to study the distributions of the repeated measurements $\{W_{1}, W_{2}, W_{3}, \dots, W_{6}\}$ since it is assumed that the replicates are identically distributed. Supplement S6 shows the descriptive statistics and density plots. The analysis indicates that the distributions of the six replicates are not identical. The replicates $W_{4}$ , $W_{5}$ , $W_{6}$ exhibit the most similar densities among the six replicates. Hence, we use $\{W_{4}, W_{5}, W_{6}\}$ as the replicates of the true tibia length. We apply the methodology for three replicates presented in Supplement S5 to $\{W_{4}, W_{5}, W_{6}\}$ . We also apply the methodology for two replicates to $\{W_{4}, W_{5}\}$ and $\{W_{5}, W_{6}\}$ and the results are in Supplement S6. The covariate $W$ ranges from 2000 to 5000. To enhance accuracy, we adjust their scale by dividing by 1000.

We first estimate the corresponding standard deviations by the estimation method described in Section 2.2. Subsequently, we apply the two bootstrap hypothesis tests to the repeated measurements to investigate the existence of measurement errors in the data. As we aim to test two hypotheses simultaneously, it is essential to adjust the significance level to account for multiple tests. Various multiple-testing correction methods can be employed, depending on the specific practical scenario. In this case, we employ the Bonferroni correction and set the significance level for each test to 0.025. The estimates and the equal-tail 95% bootstrap confidence intervals are given in Table 3. Notably, the lower bounds of the 95% equal-tail confidence intervals for both Test 1 and Test 2 exceed the null values $a_{0}^{2} = 1$ and $σ_{ε 0}^{2} = 0$ for $\{W_{4}, W_{5}, W_{6}\}$ , which suggests that the true measurement is contaminated by both multiplicative and additive measurement errors.

Table 3:

Results of the estimation of standard deviations and the confidence intervals for tibia length

	${\hat{σ}}_{η}$	${\hat{σ}}_{ε}$	${\hat{a}}^{2}$	${\hat{σ}}_{ε}^{2}$	95% CI for ${\hat{a}}^{2}$	95% CI for ${\hat{σ}}_{ε}^{2}$
$\{W_{4}, W_{5}, W_{6}\}$	0.087	0.377	1.008	0.142	[1.004, 1.011]	[0.096, 0.188]


With the estimated standard deviations, we proceed with the linear regression using RC and SIMEX on the response variable genotype $Y$ and the replicated observations of tibia length, while also incorporating the instrumental variable body weight $Z$ . We also employ the ordinary least-squares (LS) method, which ignores measurement errors, to estimate the regression parameters. The estimated regression coefficients and their corresponding standard errors based on $\{W_{4}, W_{5}, W_{6}\}$ are presented in Table 4, and the results based on $\{W_{4}, W_{5}\}$ and $\{W_{5}, W_{6}\}$ are shown in Supplement S6.

Table 4:

Regression coefficient estimates and estimated standard deviations based on $\{W_{4}, W_{5}, W_{6}\}$ .

		$β_{0}$	$β_{x}$	$β_{z}$
RC	Estimate	0.604	−0.0000341
RC	$\hat{S.E.}$	0.077	0.0000189
SIMEX	Estimate	0.606	−0.0000348
SIMEX	$\hat{S.E.}$	0.075	0.0000198
LS	Estimate	0.599	−0.0000327
LS	$\hat{S.E.}$	0.074	0.0000182
RC	Estimate	0.809	−0.0000350	−0.00808
RC	$\hat{S.E.}$	0.257	0.0000188	0.00950
SIMEX	Estimate	0.807	−0.0000333	−0.00806
SIMEX	$\hat{S.E.}$	0.243	0.0000196	0.00910
LS	Estimate	0.805	−0.0000321	−0.00810
LS	$\hat{S.E.}$	0.242	0.0000182	0.00909


As shown in Table 4, the RC method and SIMEX method yield similar estimates of regression parameters and standard errors. Comparing the results of different methods on different combinations of replicates, we find that the RC and SIMEX estimates of the coefficient $β_{x}$ are both smaller than the estimates obtained through the ordinary least squares method. In other words, the absolute values of the estimated ${\hat{β}}_{x}$ obtained through the RC and SIMEX methods are greater than the estimates obtained using the LS method. This finding confirms that regression calibration and SIMEX can effectively correct the attenuation effect caused by measurement errors, as discussed in Section 2.1. The estimates for the coefficients $β_{0}$ and $β_{z}$ obtained through RC, SIMEX, and LS methods exhibit similar results. This consistency aligns with the conclusion that the presence of both types of errors has no significant impact on the estimation of $β_{0}$ and $β_{z}$ .

7. Conclusion

In this work, we studied the measurement error problem when the true measurement variable is subject to both additive and multiplicative errors. We proposed a method to estimate the standard deviations of additive error and multiplicative error based on replicated data. We proved the identifiability of the proposed model and the consistency of the obtained estimator. We conducted hypothesis tests to test the significance of the variances of the two types of errors, which enabled us to determine the type of measurement errors.

Further, we applied approximate MLE on Bernstein polynomials to estimate the density function of the true measurement $X$ with compact support. We then investigated the effect of both types of errors on the estimation of regression parameters in the error-in-variable regression problem. We also adjusted the correction methods, RC and SIMEX, to correct the bias caused by both types of errors. We combined the correction methods with the variance estimator to correct the bias. Compared to previous studies, which relied on extra assumptions or specific application scenarios, our method is more versatile and can be applied in various situations. Moreover, we are the first to apply the mixture of two types of errors in regression, investigate the effect, and adjust the correction methods to make them suitable for this type of error.

The simulation results showed that our estimator performed well in various cases. The hypothesis tests correctly identified the type of measurement error present in the data, and the density of the variable can be estimated well through the combination of the estimator and the Bernstein polynomials. In the simple linear regression model, the correction methods significantly reduced the bias when the covariate is prone to two types of errors, and the combination with the proposed estimator does not cause significant loss. The method is also applied to analyze a genetic data set.

Supplementary Material

NIHMS2149920-supplement-1.pdf (473.2 KB, pdf)

References

Andrews DWK (2002). Generalized method of moments estimation when a parameter is on a boundary. Journal of Business & Economic Statistics, 20(4):530–544.
Bertrand A, Van Keilegom I, and Legrand C (2019). Flexible parametric approach to classical measurement error variance estimation without auxiliary data. Biometrics, 75(1):297–307.
Brenner Miguel S, Comte F, and Johannes J (2023). Linear functional estimation under multiplicative measurement error. Bernoulli, 29(3):2247–2271.
Buonaccorsi JP (2010). Measurement error: models, methods, and applications. Chapman and Hall/CRC.
Butucea C and Matias C (2005). Minimax estimation of the noise level and of the deconvolution density in a semiparametric convolution model. Bernoulli, 11(2):309–340.
Carroll R, Ruppert D, Stefanski L, and Crainiceanu C (2006). Measurement error in nonlinear models: A modern perspective, second edition. Chapman and Hall/CRC.
Carroll RJ and Stefanski LA (1990). Approximate quasi-likelihood estimation in models with surrogate predictors. Journal of the American Statistical Association, 85(411):652–663.
Florens J-P, Simar L, and Van Keilegom I (2020). Estimation of the boundary of a variable observed with symmetric error. Journal of the American Statistical Association, 115(529):425–441.
Gleser LJ (1990). Improvements of the naive approach to estimation in nonlinear errors-in-variables regression models. Contemporary Mathematics, 112:99–114.
Huber P (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, 1:221–233.
Hunter N, Muirhead CR, and Miles JC (2011). Two error components model for measurement error: application to radon in homes. Journal of Environmental Radioactivity, 102(9):799–805.
Iturria SJ, Carroll RJ, and Firth D (1999). Polynomial regression and estimating functions in the presence of multiplicative measurement error. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):547–561.
Kekeç E and Van Keilegom I (2022). Estimation of the variance matrix in bivariate classical measurement error models. Electronic Journal of Statistics, 16(1):1831–1854.
Kneip A, Simar L, and Van Keilegom I (2015). Frontier estimation in the presence of measurement error with unknown variance. Journal of Econometrics, 184(2):379–393.
Lyles RH and Kupper LL (1997). A detailed evaluation of adjustment methods for multiplicative measurement error in linear regression with applications in occupational epidemiology. Biometrics, 53:1008–1025.
Marques TA (2004). Predicting and correcting bias caused by measurement error in line transect sampling using multiplicative error models. Biometrics, 60(3):757–763.
Parker CC, Gopalakrishnan S, Carbonetto P, Gonzales NM, Leung E, Park YJ, Aryee E, Davis J, Blizard DA, Ackert-Bicknell CL, et al. (2016). Genome-wide association study of behavioral, physiological and gene expression traits in outbred CFW mice. Nature Genetics, 48(8):919–926.
Pierce DA, Stram DO, Vaeth M, and Schafer DW (1992). The errors-in-variables problem: considerations provided by radiation dose-response analyses of the A-bomb survivor data. Journal of the American Statistical Association, 87(418):351–359.
Rocke DM and Durbin B (2001). A model for measurement error for gene expression arrays. Journal of Computational Biology, 8(6):557–569.
Royden HL (1968). Real analysis. Macmillan, New York, 2nd edition.
Stram DO and Kopecky KJ (2003). Power and uncertainty analysis of epidemiological studies of radiation-related disease risk in which dose estimates are based on a complex dosimetry system: some observations. Radiation Research, 160(4):408–417.
Subar AF, Kipnis V, Troiano RP, Midthune D, Schoeller DA, Bingham S, Sharbaugh CO, Trabulsi J, Runswick S, Ballard-Barbash R, et al. (2003). Using intake biomarkers to evaluate the extent of dietary misreporting in a large sample of adults: the OPEN study. American Journal of Epidemiology, 158(1):1–13.
Tang L, Tian Y, Yan F, and Habib E (2015). An improved procedure for the validation of satellite-based precipitation estimates. Atmospheric Research, 163:61–73.
Tian Y, Huffman GJ, Adler RF, Tang L, Sapiano M, Maggioni V, and Wu H (2013). Modeling errors in daily precipitation measurements: Additive or multiplicative? Geophysical Research Letters, 40(10):2060–2065.
Van der Vaart AW (2000). Asymptotic statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge.
Vinberg ĖB (2003). A course in algebra. American Mathematical Soc.
Wang C (1999). Robust sandwich covariance estimation for regression calibration estimator in Cox regression with measurement error. Statistics & Probability Letters, 45(4):371–378.
Yi GY, Delaigle A, and Gustafson P (2021). Handbook of Measurement Error Models. Chapman and Hall/CRC.
Zhang D, Lin X, and Dunson DB (2008). Variance component testing in generalized linear mixed models for longitudinal/clustered data and other related topics. In Random Effect and Latent Variable Model Selection, pages 19–36. Springer New York, New York, NY.
Zhang Q and Yi GY (2019). R package for analysis of data with mixed measurement error and misclassification in covariates: augsimex. Journal of Statistical Computation and Simulation, 89(12):2293–2315.

