Author manuscript; available in PMC: 2026 Mar 5.
Published in final edited form as: Scand Stat Theory Appl. 2025 Aug 3;52(4):1763–1785. doi: 10.1111/sjos.70009

Inference on data with both multiplicative and additive measurement errors

Yuxiang Zong 1, Yinfu Liu 2, Yanyuan Ma 3, Ingrid Van Keilegom 1
PMCID: PMC12959484  NIHMSID: NIHMS2149920  PMID: 41789348

Abstract

Measurement errors are omnipresent in many fields and can lead to serious problems in statistical analysis. In the literature, measurement errors are often assumed to be either additive or multiplicative. We consider the case where a variable is subject to both additive and multiplicative errors. We establish the identifiability and propose a moment-based estimator for the variances of these two types of errors, which is shown to be consistent. We further derive the asymptotic distribution of the estimator and conduct hypothesis tests to examine the existence of the two types of errors. We also develop a likelihood-based method to approximate the density of the error-prone variable. We apply our strategy in the context of linear regression and study its effect on the estimation of regression parameters in combination with Regression Calibration and Simulation Extrapolation. The proposed methodology is numerically investigated through simulations and is implemented in a real data application.

Keywords: Bernstein polynomial, Measurement error, Method of moments, Regression calibration, Simulation extrapolation

1. Introduction

In practice, it is common to encounter situations where a variable cannot be directly observed and is measured with noise, leading to the measurement error problem or the errors-in-variable problem. Measurement error problems are pervasive in many fields due to various reasons, such as inaccurate measuring devices, sampling errors, and imprecise data collection methods. For example, long-term systolic blood pressure may not be directly observable and is often approximated by blood pressure measured during a clinic visit. Ignoring measurement errors in statistical analysis can lead to serious problems, such as biased parameter estimation, loss of statistical power, or masking of important features in the data. Therefore, it is crucial to develop methods to account for measurement errors.

To address measurement errors, it is essential to accurately capture the relation between the true variable and its errored observation. In the literature, it is often assumed that the measurement error is additive or multiplicative. Specifically, let X be the unobserved variable, and W the corresponding observed variable. The classical additive measurement error model is defined as

W=X+ε, (1)

where E(ε)=0 and X and ε are independent. For example, in the National Cancer Institute’s OPEN study (Subar et al., 2003), the true energy intake was measured by the food frequency questionnaire (FFQ), and the log measured energy intake is assumed to contain additive errors.

The multiplicative measurement error model is defined as

W=Xη, (2)

where X and η are assumed to be independent and η is usually assumed to follow a Log-Normal distribution with E(logη)=0. An example of the multiplicative error model is the analysis of A-bomb survivor data from the Hiroshima and Nagasaki explosions by Pierce et al. (1992). The true explosion dose X is not observable, but estimates W can be obtained and are assumed to be contaminated by multiplicative errors. Extensive research on these two types of measurement errors has been conducted and is comprehensively explained in Carroll et al. (2006), Buonaccorsi (2010), and Yi et al. (2021), among others.

Several studies have compared the performance of additive and multiplicative measurement error models in real-life data analysis; see, for instance, Marques (2004), Hunter et al. (2011), Tian et al. (2013), and Tang et al. (2015). However, very little research considers both multiplicative and additive measurement errors. Specifically, Rocke and Durbin (2001) considered an error model of the form Xη+ε, where η follows a Log-Normal distribution and ε follows a Normal distribution, and applied it to gene expression measurements with cDNA arrays. However, they require data from a control group in which the true variable X takes the value 0, so that pure additive errors are observed repeatedly and the variance of ε can be estimated from them. This requirement limits the applicability of their approach. Berkson errors with a mixture of additive and multiplicative errors have been studied by Stram and Kopecky (2003). However, this model has not been applied to the error-in-variable regression problem. Therefore, it is necessary to further investigate the model with both additive and multiplicative errors.

Two main approaches to estimating the error structure are likelihood-based methods and moment-based methods. Likelihood-based methods are among the earliest techniques applied to address measurement error issues, and they have been extensively discussed in the literature (Yi et al., 2021). However, likelihood-based methods have drawbacks, including the assumptions they require on the true variable X and their computational complexity. In our setting, where both types of errors coexist, the likelihood function involves a double integration, leading to a significant computational burden. Hence, to estimate the error variances, we adopt the method of moments, which does not require full knowledge of the unobserved variable. We show the identifiability of the error variances, construct estimating equations to estimate the error variances based on replicated data, and prove the consistency of the estimator. With the estimated error variances, we employ a likelihood-based method to approximate the density of the true variable, which proves to be a more efficient approach.

Once the error variances have been estimated, we further derive the asymptotic distribution of the estimator. The difficulty is that the true parameter possibly lies on the boundary of the parameter space. Andrews (2002) established the asymptotic distribution of the generalized method of moments estimator in such a case. Zhang et al. (2008) focused on conducting variance component tests within the generalized linear mixed models framework based on the derived null asymptotic distribution. In this study, we use Andrews (2002) to derive the asymptotic distribution of the error variance estimator when the true parameter is on the boundary.

We also estimate the distribution of the true variable X. In the literature regarding the classical additive measurement error problem, nonparametric deconvolution methods have been applied to estimate the density of X (Butucea and Matias, 2005). However, when both multiplicative and additive errors coexist, the deconvolution method becomes ineffective in recovering the true variable distribution. Specifically, when both types of errors are present, two steps of deconvolution are required to recover the true distribution of X. First, deconvolution is applied to recover the distribution of Xη, and then it is applied again after a log transformation to recover the distribution of X. This process may lead to a loss of information due to the error accumulation from deconvolution procedures. Therefore, in this case, we turn to a likelihood-based approach introduced by Bertrand et al. (2019) to approximate the density function of X, assuming that X has a compact support. Based on the idea of Bertrand et al. (2019), we apply Bernstein polynomials to approximate the unknown density function of X incorporating the estimated error variances, and the tuning parameter is selected based on AIC criteria.

In the context of linear regression, measurement errors can lead to biased and inconsistent estimates of the regression coefficients, invalid inference results, and inaccurate predictions. We investigate the problems that the presence of both types of measurement errors causes in the linear regression setting, and propose a regression calibration method and a SIMEX method to correct them.

The paper is organized as follows. In Section 2, we introduce the measurement error model with both multiplicative and additive errors, propose a method to estimate the unknown error variances, and conduct hypothesis tests to examine the existence of measurement errors. Section 3 presents a method to estimate the density of the true measurement variable. Errors-in-variable regression problem with both types of errors is investigated in Section 4. Simulation studies and applications are shown in Section 5.

2. Methodology

2.1. Model and Assumptions

Let X denote the true variable and W be the surrogate variable measured with both additive and multiplicative errors. Let η and ε be the multiplicative error and additive error respectively. Specifically, we assume

W=Xη+ε, (3)

where $\eta \sim \text{Log-Normal}(0, \sigma_\eta^2)$, $\varepsilon \sim N(0, \sigma_\varepsilon^2)$, and $X$, $\eta$ and $\varepsilon$ are mutually independent. We denote $\mu \equiv E(X)$ and $\sigma_x^2 \equiv \mathrm{var}(X)$. We assume that $X$ is either positive or negative, but we do not assume that $X$ belongs to any parametric distribution family.

Remark 2.1. The parametric assumptions on the two error distributions are essential for model identifiability. In addition, the assumptions that $\eta \sim \text{Log-Normal}(0, \sigma_\eta^2)$ for multiplicative errors and $\varepsilon \sim N(0, \sigma_\varepsilon^2)$ for additive errors are widely adopted in the measurement error literature due to their broad applicability and technical convenience. See, for example, Iturria et al. (1999); Lyles and Kupper (1997); Brenner Miguel et al. (2023); Bertrand et al. (2019). These error distribution assumptions can be validated using appropriate validation data or other external information. The estimation procedure in Section 2.2 relies on the error distributions through the first three moments, and hence is more sensitive to the distributional assumptions than in the pure additive error case.

2.2. Error Variance Estimation

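We first propose a method to estimate the error variances based on replicate data. Suppose that there are $n$ objects with two replicates $(W_{i1}, W_{i2})$, $i = 1, \dots, n$. Then we can write

$$W_j = X\eta_j + \varepsilon_j, \quad j = 1, 2,$$

where $\eta_1$, $\eta_2$, $\varepsilon_1$, $\varepsilon_2$, $X$ are mutually independent and $\varepsilon_j \sim N(0, \sigma_\varepsilon^2)$, $\log\eta_j \sim N(0, \sigma_\eta^2)$.

Note that $E(\eta) = e^{\sigma_\eta^2/2}$ and $\mathrm{Var}(\eta) = e^{\sigma_\eta^2}(e^{\sigma_\eta^2} - 1)$. For notational simplicity, let $\mu_k \equiv E(X^k)$ and $a \equiv e^{\sigma_\eta^2/2}$. We obtain

$$E(W_1) = a\mu_1, \quad E(W_1W_2) = a^2\mu_2, \quad E(W_1^2) = a^4\mu_2 + \sigma_\varepsilon^2, \quad E(W_1^2W_2) = a^5\mu_3 + a\mu_1\sigma_\varepsilon^2, \quad E(W_1^3) = a^9\mu_3 + 3a\mu_1\sigma_\varepsilon^2.$$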

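Eliminating $\mu_1$, $\mu_2$, $\mu_3$ and $\sigma_\varepsilon^2$ from the above leads to a cubic equation in $a^2$:

$$0 = E(W_1)E(W_1W_2)\,a^6 + \{E(W_1^2W_2) - E(W_1)E(W_1^2)\}\,a^4 - 3E(W_1)E(W_1W_2)\,a^2 + 3E(W_1)E(W_1^2) - E(W_1^3).$$ (4)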
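The elimination behind (4) can be spelled out in three substitutions. The following is a reader's sketch that uses only the five moment identities stated above; it is not taken from the supplement.

```latex
\begin{align*}
\sigma_\varepsilon^2 &= E(W_1^2) - a^2 E(W_1W_2)
  && \text{since } a^4\mu_2 = a^2 \cdot a^2\mu_2 = a^2 E(W_1W_2),\\
a^5\mu_3 &= E(W_1^2W_2) - E(W_1)\,\sigma_\varepsilon^2
  && \text{since } a\mu_1 = E(W_1),\\
E(W_1^3) &= a^4\{E(W_1^2W_2) - E(W_1)\,\sigma_\varepsilon^2\} + 3E(W_1)\,\sigma_\varepsilon^2
  && \text{since } a^9\mu_3 = a^4 \cdot a^5\mu_3 .
\end{align*}
% Substituting the first line into the third and collecting powers of a^2:
\begin{align*}
E(W_1^3) ={}& a^6 E(W_1)E(W_1W_2) + a^4\{E(W_1^2W_2) - E(W_1)E(W_1^2)\} \\
 & - 3a^2 E(W_1)E(W_1W_2) + 3E(W_1)E(W_1^2),
\end{align*}
% which is (4) after moving E(W_1^3) to the right-hand side.
```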

We next establish that (4) has a unique root for $a$ in the region $(1, \infty)$.

Lemma 2.1. When $X > 0$, $\mathrm{cov}(X^2, X) > 0$. When $X < 0$, $\mathrm{cov}(X^2, X) < 0$.

Theorem 2.1 (Identifiability). When $X$ is positive or negative and $E|X|^3 < \infty$, there exists a unique $a^2 > 1$ that satisfies (4). Hence, $a$ and subsequently $\mu_1$, $\mu_2$, $\mu_3$, $\sigma_\varepsilon^2$ are all unique, i.e., the model is identifiable.

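The proof of Lemma 2.1 and Theorem 2.1 can be found in Supplement S1. Based on the identifiability result and its proof, it is straightforward to construct an estimator for $a^2$ and $\sigma_\varepsilon^2$ using the empirical approximation. Specifically, let $\theta \equiv (a^2, \sigma_\varepsilon^2)^T$; then the estimator $\hat\theta \equiv (\hat a^2, \hat\sigma_\varepsilon^2)^T$ can be obtained by solving the estimating equation $\Psi_n(\theta) = n^{-1}\sum_{i=1}^n \psi_\theta(W_i) = 0$, where

$$\psi_\theta(W_i) = \begin{pmatrix} \hat E(W_{i1})\,a^4\sigma_\varepsilon^2 - 3\hat E(W_{i1})\,\sigma_\varepsilon^2 - a^4\hat E(W_{i1}^2W_{i2}) + \hat E(W_{i1}^3) \\ \sigma_\varepsilon^2 - \hat E(W_{i1}^2) + a^2W_{i1}W_{i2} \end{pmatrix},$$ (5)

where $\hat E(W_{i1}) = \sum_{j=1}^2 W_{ij}/2$, $\hat E(W_{i1}^2) = \sum_{j=1}^2 W_{ij}^2/2$, $\hat E(W_{i1}^3) = \sum_{j=1}^2 W_{ij}^3/2$, and $\hat E(W_{i1}^2W_{i2}) = (W_{i1}^2W_{i2} + W_{i1}W_{i2}^2)/2$. Note that the estimating function $\Psi_n(\theta)$ has mean 0 when evaluated at the true parameter $\theta_0$, which, under suitable conditions, leads to the consistency of the estimator as we establish in Theorem 2.2, with the proof in Supplement S1.

Theorem 2.2 (Consistency). When $X$ is positive or negative, the estimator $\hat\theta$ is consistent.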
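As an illustration of how the moment equations can be turned into numbers, the sketch below plugs empirical moments into the cubic (4), takes the real root of $a^2$ that is at least 1, and backs out $\sigma_\varepsilon^2 = E(W_1^2) - a^2E(W_1W_2)$. The function names, the use of `numpy.roots`, the choice of the smallest admissible root, and the truncation to $a^2 \ge 1$, $\sigma_\varepsilon^2 \ge 0$ are assumptions made for this example; the paper's own implementation solves $\Psi_n(\theta) = 0$ and may differ.

```python
import numpy as np

def solve_from_moments(m1, m11, m2, m21, m3, tol=1e-8):
    """Solve cubic (4) for a^2 given E(W1), E(W1W2), E(W1^2), E(W1^2 W2), E(W1^3).

    Returns (a2, sig_eps2, sig_eta2). Hypothetical helper, not the authors' code.
    """
    # Coefficients of (4), viewed as a cubic in u = a^2.
    coef = [m1 * m11, m21 - m1 * m2, -3.0 * m1 * m11, 3.0 * m1 * m2 - m3]
    roots = np.roots(coef)
    real = roots[np.abs(roots.imag) < tol].real
    cand = real[real >= 1.0 - tol]
    # Assumption: take the smallest admissible root; fall back to the boundary a^2 = 1.
    a2 = max(float(cand.min()), 1.0) if cand.size else 1.0
    sig_eps2 = max(m2 - a2 * m11, 0.0)   # from E(W1^2) = a^2 E(W1W2) + sigma_eps^2
    sig_eta2 = float(np.log(a2))         # a = exp(sigma_eta^2 / 2)
    return a2, sig_eps2, sig_eta2

def estimate_error_variances(w1, w2):
    """Plug-in version using the symmetrised empirical moments of two replicates."""
    w1, w2 = np.asarray(w1, float), np.asarray(w2, float)
    m1 = np.mean((w1 + w2) / 2)
    m11 = np.mean(w1 * w2)
    m2 = np.mean((w1**2 + w2**2) / 2)
    m21 = np.mean((w1**2 * w2 + w1 * w2**2) / 2)
    m3 = np.mean((w1**3 + w2**3) / 2)
    return solve_from_moments(m1, m11, m2, m21, m3)
```

With exact population moments the root recovers the true $a^2$; with sample moments the third-order terms are noisy, which is in line with the sensitivity noted in Remark 2.1.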

Remark 2.2. The above idea is further extended to cases where more than 2 replicates are available, see Supplement S5 for details. The assumptions on the positiveness or negativeness of X in Theorem 2.1 can be relaxed. An alternative condition, along with its analysis, is presented in the Supplement S1.4.

Remark 2.3. In practice, even though the model is theoretically identifiable, it can be practically unidentifiable. For example, one might have $\hat E(W_1W_2) < 0$ but $\hat E(W_1^2W_2) > 0$, where $\hat E(W_1^2W_2) = \sum_{i=1}^n (W_{i1}^2W_{i2} + W_{i1}W_{i2}^2)/(2n)$ and $\hat E(W_1W_2) = \sum_{i=1}^n W_{i1}W_{i2}/n$. When this rare situation occurs, we propose using truncation at the value 0 to avoid practical unidentifiability.

2.3. Asymptotic Distribution of Estimator

Upon obtaining the error variance estimator, we proceed to derive its asymptotic distribution. Because the estimator is derived from estimating equations, it is usually asymptotically normal when the true parameter lies in an open set. However, in our setting, the parameter space $\Theta$ is compact, and the true parameter is possibly positioned on the boundary, such as when one of the error variances equals 0; hence the estimator may no longer be asymptotically normal. We follow the theory proposed in Van der Vaart (2000) and Andrews (2002) to derive the asymptotic distribution of the estimator. To simplify notation, let $\Psi(\theta) \equiv E\{\Psi_n(\theta)\}$, $\Gamma \equiv \partial\Psi(\theta)/\partial\theta^T|_{\theta=\theta_0}$, $\mathcal{J} \equiv \Gamma^T\Gamma$, and $\mathcal{V} \equiv nE\{\Psi_n(\theta_0)\Psi_n(\theta_0)^T\}$. Theorem 2.3 establishes the asymptotic distribution of $\hat\theta$ when the true parameter lies in the interior of the parameter space.

Theorem 2.3. When $\theta_0$ lies in the interior of the parameter space, that is, $\theta_0 > (1, 0)^T$ elementwise, $n^{1/2}(\hat\theta_n - \theta_0) \to_d N(0, \Gamma^{-1}\mathcal{V}\Gamma^{-T})$.

Next, we consider the scenario where both parameters are situated on the boundary. To derive the asymptotic distribution of the estimator when the true parameter lies on the boundary of the parameter space, we use the method proposed by Andrews (2002). The idea involves approximating the objective functions by quadratic functions, approximating the restricted parameter spaces by cones, and determining the asymptotic distribution of the estimators. The following Theorem 2.4 derives the asymptotic distribution of $\hat\theta$ when $\theta_0 = (1, 0)^T$, and the proof is presented in Supplement S1.

Theorem 2.4. When $\theta_0$ lies on the boundary of the parameter space, that is, $\theta_0 = (1, 0)^T$, $n^{1/2}(\hat\theta_n - \theta_0) \to_d \tilde\lambda$, where $\tilde\lambda \equiv \mathcal{Z}I(\mathcal{Z} \ge 0)$, and $\mathcal{Z} \sim N(0, \Gamma^{-1}\mathcal{V}\Gamma^{-T})$.

We now proceed to consider the scenario where only one of the true parameters lies on the boundary, specifically, either the multiplicative or additive error exists. The vectors and matrices are partitioned as

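$$\Psi = \begin{pmatrix}\Psi_{a^2}\\ \Psi_{\sigma_\varepsilon^2}\end{pmatrix},\quad \mathcal{Z} = \begin{pmatrix}\mathcal{Z}_{a^2}\\ \mathcal{Z}_{\sigma_\varepsilon^2}\end{pmatrix},\quad \mathcal{J} = \begin{pmatrix}\mathcal{J}_{a^2} & \mathcal{J}_{a^2\sigma_\varepsilon^2}\\ \mathcal{J}_{\sigma_\varepsilon^2a^2} & \mathcal{J}_{\sigma_\varepsilon^2}\end{pmatrix},\quad \tilde\lambda = \begin{pmatrix}\tilde\lambda_{a^2}\\ \tilde\lambda_{\sigma_\varepsilon^2}\end{pmatrix}.$$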

The following Theorem 2.5 and Theorem 2.6 show the asymptotic distribution of subvectors of θ^.

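Theorem 2.5. Assume $\theta_0 = (1, \sigma_{\varepsilon 0}^2)^T$, $\sigma_{\varepsilon 0}^2 > 0$. Then

$$n^{1/2}(\hat a^2 - 1) \to_d \tilde\lambda_{a^2},$$

where $\tilde\lambda_{a^2} \equiv \mathcal{Z}_{a^2}I(\mathcal{Z}_{a^2} \ge 0)$ is the mixture of the point mass distribution at 0 and the truncated normal distribution $N(0, H_1\Gamma^{-1}\mathcal{V}\Gamma^{-T}H_1^T)$ with $H_1 = (1, 0)$ and truncation interval $[0, \infty)$. Further, when $n \to \infty$, $\hat\sigma_\varepsilon^2$ satisfies

$$n^{1/2}(\hat\sigma_\varepsilon^2 - \sigma_{\varepsilon 0}^2) \to_d \mathcal{J}_{\sigma_\varepsilon^2}^{-1}\Psi_{\sigma_\varepsilon^2} - \mathcal{J}_{\sigma_\varepsilon^2}^{-1}\mathcal{J}_{a^2\sigma_\varepsilon^2}\tilde\lambda_{a^2}.$$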

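Theorem 2.6. Assume $\theta_0 = (a_0^2, 0)^T$, $a_0^2 > 1$. Then

$$n^{1/2}(\hat\sigma_\varepsilon^2 - 0) \to_d \tilde\lambda_{\sigma_\varepsilon^2},$$

where $\tilde\lambda_{\sigma_\varepsilon^2} \equiv \mathcal{Z}_{\sigma_\varepsilon^2}I(\mathcal{Z}_{\sigma_\varepsilon^2} \ge 0)$ is the mixture of the point mass distribution at 0 and the truncated normal distribution $N(0, H_2\Gamma^{-1}\mathcal{V}\Gamma^{-T}H_2^T)$ with $H_2 = (0, 1)$ and truncation interval $[0, \infty)$. Further, when $n \to \infty$, $\hat a^2$ satisfies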

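$$n^{1/2}(\hat a^2 - a_0^2) \to_d \mathcal{J}_{a^2}^{-1}\Psi_{a^2} - \mathcal{J}_{a^2}^{-1}\mathcal{J}_{\sigma_\varepsilon^2a^2}\tilde\lambda_{\sigma_\varepsilon^2}.$$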
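To make the boundary limit in Theorems 2.4 to 2.6 concrete, the following sketch simulates a scalar $\tilde\lambda = \mathcal{Z}I(\mathcal{Z} \ge 0)$: half of the mass sits at 0 and the rest follows a half-normal. The standard deviation `sd` stands in for the square root of $H\Gamma^{-1}\mathcal{V}\Gamma^{-T}H^T$ and is a made-up number here; the function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def simulate_boundary_limit(sd, size, seed=0):
    """Draw from lambda = Z * I(Z >= 0) with Z ~ N(0, sd^2)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, sd, size)
    return np.where(z >= 0, z, 0.0)

def one_sided_critical_value(sd, alpha):
    """Upper critical value c with P(lambda > c) = alpha, valid for alpha < 0.5.

    For c > 0 the events {lambda > c} and {Z > c} coincide, so the usual
    one-sided normal quantile applies despite the point mass at 0.
    """
    return sd * NormalDist().inv_cdf(1.0 - alpha)

draws = simulate_boundary_limit(sd=2.0, size=200_000)
share_at_zero = float(np.mean(draws == 0.0))           # close to 0.5
crit = one_sided_critical_value(sd=2.0, alpha=0.05)
rejection_rate = float(np.mean(draws > crit))          # close to 0.05
```

This is why a one-sided test of $a^2 = 1$ or $\sigma_\varepsilon^2 = 0$ can still use a normal quantile for small significance levels, even though the limit law itself is not normal.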

3. Probability Density Function Estimation

After estimating the error variances, we proceed to estimate the density function of X. We propose a likelihood based method (Bertrand et al., 2019) which uses Bernstein polynomials to approximate the density function of X with a compact support within the framework of the classical additive error model.

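Let $V \equiv X\eta$; then $\log V = \log X + \log\eta$, where $\log\eta \sim N(0, \sigma_\eta^2)$. We assume that $X$ is continuous and has a compact support $[c_1, c_1 + c_2]$, which does not include 0. Let $T \equiv \log X$ and define $T \equiv AS + B$, where $A \equiv \log(c_1 + c_2) - \log(c_1)$, $B \equiv \log(c_1)$, $S \in [0, 1]$, and $T \in [\log c_1, \log(c_1 + c_2)]$. The probability density function of $W$ is given by

$$\begin{aligned} f_W(w) &= \frac{1}{\sigma_\varepsilon}\int_0^{+\infty} f_V(v)\,\phi\!\left(\frac{w - v}{\sigma_\varepsilon}\right)dv \\ &= \frac{1}{\sigma_\varepsilon}\int_0^{+\infty} \frac{1}{A\sigma_\eta v}\int_B^{A+B} f_S\!\left(\frac{t - B}{A}\right)\phi\!\left(\frac{\log v - t}{\sigma_\eta}\right)dt\;\phi\!\left(\frac{w - v}{\sigma_\varepsilon}\right)dv \\ &= \frac{1}{A}\int_B^{A+B} f_S\!\left(\frac{t - B}{A}\right)\int_0^{+\infty} \frac{1}{v\sigma_\varepsilon\sigma_\eta}\,\phi\!\left(\frac{\log v - t}{\sigma_\eta}\right)\phi\!\left(\frac{w - v}{\sigma_\varepsilon}\right)dv\,dt, \end{aligned}$$ (6)

where $f_S$ is the density of $S$, and $\phi(\cdot)$ is the density of the standard normal distribution.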

We intend to use a Bernstein polynomial to estimate the unknown density fS. A Bernstein polynomial of degree m can be expressed as:

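$$B_m(s) = \sum_{k=0}^m \alpha_{k,m}\,b_{k,m}(s), \quad s \in [0, 1],$$ (7)

where $b_{k,m}(s) = \binom{m}{k}s^k(1 - s)^{m-k}$, for $k = 0, \dots, m$. The method is based on the property that any continuous function $f(s)$ defined on $[0, 1]$ can be uniformly approximated by such a polynomial, by taking $\alpha_{k,m} = f(k/m)$, that is

$$\lim_{m\to\infty}\sup_{0\le s\le 1}\left|\sum_{k=0}^m f\!\left(\frac{k}{m}\right)b_{k,m}(s) - f(s)\right| = 0.$$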

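We first approximate the density $f_S$ by a Bernstein polynomial, which is equivalent to a mixture of $m + 1$ densities of Beta$(k + 1, m - k + 1)$ distributions,

$$\tilde f_{S,m}(s; \theta_m) = \sum_{k=0}^m f_S\!\left(\frac{k}{m}\right)b_{k,m}(s) = \sum_{k=0}^m \theta_{k,m}\,\mathrm{Beta}_{k+1,m-k+1}(s),$$

where $\mathrm{Beta}_{\alpha,\beta}(\cdot)$ is the pdf of a Beta distribution with parameters $\alpha$ and $\beta$, $\theta_m = (\theta_{0,m}, \dots, \theta_{m,m})^T$, $\theta_{k,m} = \frac{1}{m+1}f_S(k/m)$, and $\theta_{k,m} \ge 0$. Since $\tilde f_{S,m}(\cdot; \theta_m)$ must be a density for each $m$, we impose the constraint $\sum_{k=0}^m \theta_{k,m} = 1$. The density $f_W$ is estimated by

$$\tilde f_{W,m}(w; \sigma_\varepsilon, \sigma_\eta, \theta_m) = \frac{1}{A\sigma_\varepsilon\sigma_\eta}\sum_{k=0}^m \theta_{k,m}\int_B^{A+B}\mathrm{Beta}_{k+1,m-k+1}\!\left(\frac{t - B}{A}\right)\int_0^{+\infty}\frac{1}{v}\,\phi\!\left(\frac{\log v - t}{\sigma_\eta}\right)\phi\!\left(\frac{w - v}{\sigma_\varepsilon}\right)dv\,dt.$$ (8)

In Supplement S3, we prove that $\tilde f_{W,m}(\cdot)$ uniformly converges to $f_W(\cdot)$, i.e.,

$$\lim_{m\to\infty}\sup_w\left|\tilde f_{W,m}(w) - f_W(w)\right| = 0.$$ (9)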
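The two facts used in this construction, uniform convergence of the Bernstein approximant and the identity $\mathrm{Beta}_{k+1,m-k+1}(s) = (m+1)\,b_{k,m}(s)$, are easy to check numerically. The sketch below does so for the hypothetical test density $f(s) = 6s(1-s)$; the helper names are illustrative and not part of the paper.

```python
import numpy as np
from math import comb, gamma

def bernstein_basis(k, m, s):
    """b_{k,m}(s) = C(m, k) s^k (1 - s)^(m - k)."""
    s = np.asarray(s, dtype=float)
    return float(comb(m, k)) * s**k * (1.0 - s)**(m - k)

def beta_pdf(s, a, b):
    """Density of Beta(a, b), written out with gamma functions."""
    s = np.asarray(s, dtype=float)
    return gamma(a + b) / (gamma(a) * gamma(b)) * s**(a - 1) * (1.0 - s)**(b - 1)

def bernstein_approx(f, m, s):
    """Degree-m Bernstein approximant with coefficients alpha_{k,m} = f(k/m)."""
    return sum(f(k / m) * bernstein_basis(k, m, s) for k in range(m + 1))

def sup_error(f, m, grid):
    """Largest absolute approximation error over the supplied grid."""
    return float(np.max(np.abs(bernstein_approx(f, m, grid) - f(grid))))

grid = np.linspace(0.0, 1.0, 201)
f = lambda s: 6.0 * s * (1.0 - s)       # Beta(2, 2) density, chosen for illustration
errors = {m: sup_error(f, m, grid) for m in (10, 100)}
identity_holds = np.allclose(beta_pdf(grid, 3, 7), 9 * bernstein_basis(2, 8, grid))
```

For this particular $f$ the approximant equals $(1 - 1/m)f$, so the sup error is $1.5/m$. Convergence is slow, which is one reason the degree $m$ is treated as a tuning parameter. Note also that the raw approximant integrates to $(m+1)^{-1}\sum_k f(k/m)$ rather than exactly 1, hence the sum-to-one constraint on $\theta_m$.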

When an iid sample of $W$, $W_1, \dots, W_n$, is available, the log-likelihood function of the parameter set $(\sigma_\varepsilon, \sigma_\eta, \theta_m)$ given the observed data is

$$\ell(\theta_m; \sigma_\varepsilon, \sigma_\eta) = \sum_{i=1}^n \log\left\{\frac{1}{A\sigma_\varepsilon\sigma_\eta}\sum_{k=0}^m \theta_{k,m}\int_B^{A+B}\mathrm{Beta}_{k+1,m-k+1}\!\left(\frac{t - B}{A}\right)\int_0^{+\infty}\frac{1}{v}\,\phi\!\left(\frac{\log v - t}{\sigma_\eta}\right)\phi\!\left(\frac{w_i - v}{\sigma_\varepsilon}\right)dv\,dt\right\}.$$ (10)

Given the estimated error standard deviations $(\hat\sigma_\varepsilon, \hat\sigma_\eta)$, the estimator of the Bernstein polynomial parameters can be obtained by maximizing the log-likelihood function with respect to $\theta_m$ for a given degree of the Bernstein polynomial.

Our proposed estimation procedure is to (1) estimate the error variances through the method proposed in Section 2.2 and insert $(\hat\sigma_\varepsilon, \hat\sigma_\eta)$ into the log-likelihood function; (2) obtain the estimated parameters by maximizing the log-likelihood function; (3) select the degree of the Bernstein polynomial based on the AIC criterion. Since the log-likelihood function contains double integrals, the computation of the maximum likelihood estimator is extremely time-consuming. To reduce the time cost, we apply the Laplace approximation to the integral in the log-likelihood function (10). Details regarding the Laplace approximation can be found in Supplement S3.
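For readers who want to evaluate (8) and (10) directly, the sketch below uses plain numerical quadrature instead of the Laplace approximation that the paper adopts. After the substitutions $s = (t - B)/A$ and $v = e^{t + \sigma_\eta z}$, the double integral becomes an expectation over $s \in [0,1]$ and $z \sim N(0,1)$, handled here with Gauss-Legendre and Gauss-Hermite rules from NumPy. The function names, node counts and parameter values are assumptions for illustration only, and no claim is made about pointwise accuracy when $\sigma_\varepsilon$ is small.

```python
import numpy as np
from math import comb

def beta_mixture_density(s, theta):
    """sum_k theta_k * Beta_{k+1, m-k+1}(s), using Beta = (m + 1) * b_{k,m}."""
    theta = np.asarray(theta, dtype=float)
    m = theta.size - 1
    s = np.asarray(s, dtype=float)
    basis = np.array([(m + 1) * float(comb(m, k)) * s**k * (1.0 - s)**(m - k)
                      for k in range(m + 1)])
    return theta @ basis

def f_W_tilde(w, theta, sig_eps, sig_eta, c1, c2, n_t=20, n_z=40):
    """Quadrature evaluation of (8) for X supported on [c1, c1 + c2], c1 > 0."""
    A, B = np.log(c1 + c2) - np.log(c1), np.log(c1)
    x, wx = np.polynomial.legendre.leggauss(n_t)
    s, ws = 0.5 * (x + 1.0), 0.5 * wx                  # nodes and weights on [0, 1]
    z, wz = np.polynomial.hermite_e.hermegauss(n_z)
    wz = wz / np.sqrt(2.0 * np.pi)                     # weights now sum to 1
    v = np.exp((A * s + B)[:, None] + sig_eta * z[None, :])   # shape (n_t, n_z)
    w = np.atleast_1d(np.asarray(w, dtype=float))
    resid = (w[:, None, None] - v[None, :, :]) / sig_eps
    kern = np.exp(-0.5 * resid**2) / (np.sqrt(2.0 * np.pi) * sig_eps)
    inner = kern @ wz                                  # integrate over z
    return inner @ (ws * beta_mixture_density(s, theta))      # integrate over s

def log_likelihood(w, theta, sig_eps, sig_eta, c1, c2):
    """Log-likelihood (10) for fixed error standard deviations."""
    return float(np.sum(np.log(f_W_tilde(w, theta, sig_eps, sig_eta, c1, c2))))
```

In practice one would maximise `log_likelihood` over the simplex of weights `theta` for each candidate degree and compare the fits by AIC.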

Remark 3.1. For convenience, we have used Bernstein polynomials as basis functions, which requires X to have a compact support. This approach is feasible when the support of X can be easily determined. When the support of X is difficult to determine, we suggest using alternative basis functions such as Laguerre polynomials, rather than estimating the support of X. Indeed, estimation of the support of an unobserved random variable is challenging and can induce complexity in the subsequent analysis (Kneip et al., 2015; Florens et al., 2020).

4. Error-in-variable Regression Problem

4.1. Error-in-variable Linear Regression Problem

Let $\{(Y_i, X_i, Z_i)\}$, $i = 1, 2, \dots, n$, be a sample of $n$ independent and identically distributed triplets, where $Y_i$ is the response variable, $X_i$ is univariate, $Z_i$ is multivariate and $n$ denotes the sample size. We consider the case where the covariate $X_i$ is error-prone and replaced by the surrogate measurement $W_{ij}$, and $Z_i$ is measured without error. The linear model is given by

$$Y_i = \beta_0 + \beta_xX_i + \beta_z^TZ_i + e_i$$

with

$$W_{ij} = X_i\eta_{ij} + \varepsilon_{ij}, \quad i = 1, 2, \dots, n, \quad j = 1, 2, \dots, k,$$

where $k$ is the number of replicates for each $X_i$; $\beta_0$, $\beta_x$, $\beta_z$ are regression parameters to be estimated; $e_i$, $\eta_{ij}$, $\varepsilon_{ij}$ are mutually independent; and $e_i$ satisfies $E(e_i \mid X_i, Z_i) = 0$ and has constant variance $\sigma_e^2$.

When the covariates are contaminated by classical additive or multiplicative errors, traditional least-squares methods will produce biased, inconsistent estimators and invalid statistical inference results. In simple linear regression, when covariates are measured with classical additive error, the estimator based on the contaminated data is biased towards 0. This phenomenon is called attenuation. It can be proved that with both multiplicative and additive errors, the ordinary least-squares estimator $(\hat\beta_w^*, \hat\beta_z^{*T})^T$ will also not consistently estimate $(\beta_x, \beta_z^T)^T$ and will attenuate towards 0. The proof of the attenuation effect can be found in Supplement S4.
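A small simulation makes the attenuation visible in the simplest case without $Z$ and with a single replicate. Under model (3), direct moment calculation gives $\mathrm{cov}(X, W) = a\sigma_x^2$ and $\mathrm{var}(W) = a^4\mu_2 - a^2\mu_1^2 + \sigma_\varepsilon^2$, so the naive slope converges to $\beta_x$ times the ratio of the two. This ratio is worked out here for illustration and is not quoted from Supplement S4; the uniform distribution for $X$, the parameter values and the function names are likewise assumptions of the example.

```python
import numpy as np

def attenuation_factor(mu1, mu2, sig_eta2, sig_eps2):
    """Limit of (naive slope) / beta_x when Y is regressed on a single W = X*eta + eps."""
    a2 = np.exp(sig_eta2)                      # a^2 = exp(sigma_eta^2)
    var_x = mu2 - mu1**2
    var_w = a2**2 * mu2 - a2 * mu1**2 + sig_eps2
    return np.sqrt(a2) * var_x / var_w

def naive_slope(n, beta0, beta_x, sig_eta, sig_eps, seed=0):
    """Simulate X ~ Uniform(1, 3), contaminate it, and fit Y on W by least squares."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(1.0, 3.0, n)
    w = x * rng.lognormal(0.0, sig_eta, n) + rng.normal(0.0, sig_eps, n)
    y = beta0 + beta_x * x + rng.normal(0.0, 1.0, n)
    return float(np.polyfit(w, y, 1)[0])

# Uniform(1, 3) has E(X) = 2 and E(X^2) = 13/3.
lam = attenuation_factor(2.0, 13.0 / 3.0, sig_eta2=0.3**2, sig_eps2=0.5**2)
slope = naive_slope(200_000, beta0=1.0, beta_x=2.0, sig_eta=0.3, sig_eps=0.5)
```

With these settings the factor is roughly one third, so a true slope of 2 is estimated at well under 1 when the measurement error is ignored.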

4.2. Regression Calibration

Regression calibration (RC) is a statistical method commonly used to address the issue of error-in-variable regression. It was developed by Gleser (1990), Carroll and Stefanski (1990), and others, and is widely used in nutritional epidemiology to correct measurement error bias. The idea of regression calibration is to replace unobserved X by the conditional mean of X given (W,Z).

The general approach of regression calibration involves three steps. First, a calibration model $f_X(W, Z)$ is constructed by regressing the true covariate $X$ on $(W, Z)$. Second, the unknown $X$ is replaced by the fitted value $\hat X$ from $f_X(W, Z)$, and parameter estimates are obtained by regressing the response $Y$ on $(\hat X, Z)$. Finally, standard errors of the parameter estimates are adjusted to account for uncertainties in both the regression model and the calibration model. The key justification for regression calibration is that the estimates obtained from the regression model of $\{Y, \hat X, Z\}$ are consistent for the parameters of the true model $\{Y, X, Z\}$ for linear models and additive normal errors.

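Based on the idea of the best linear approximation proposed by Carroll et al. (2006), we propose the best linear approximation to $X$ given $(\bar W, Z)$ in our setting, which is given by

$$E(X \mid \bar W, Z) \approx E(X) + \left(\sigma_{wx}, \Sigma_{zx}^T\right)\begin{pmatrix}\sigma_{\overline{ww}}^2 & \Sigma_{wz}\\ \Sigma_{zw} & \Sigma_{zz}\end{pmatrix}^{-1}\begin{pmatrix}\bar W - \mu_w\\ Z - \mu_z\end{pmatrix},$$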

where

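$$\sigma_{\overline{ww}}^2 = \frac{1}{k}E(X^2)\left(e^{2\sigma_\eta^2} + e^{\sigma_\eta^2}\right) + E^2(X)\,e^{\sigma_\eta^2} + \frac{1}{k}\sigma_\varepsilon^2$$

denotes the variance of $\bar W$, $\Sigma_{ab}$ denotes the covariance matrix between random vectors $A$ and $B$, and $\Sigma_{zz}$ is the covariance matrix of $Z$.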

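Let $\hat X_i$ denote the estimate of $E(X_i \mid \bar W_i, Z_i)$. We have

$$\hat X_i = \hat E(X) + \left(\hat\sigma_{wx}, \hat\Sigma_{zx}^T\right)\begin{pmatrix}\hat\sigma_{\overline{ww}}^2 & \hat\Sigma_{wz}\\ \hat\Sigma_{zw} & \hat\Sigma_{zz}\end{pmatrix}^{-1}\begin{pmatrix}\bar W_i - \hat\mu_w\\ Z_i - \hat\mu_z\end{pmatrix},$$

where

$$\bar W_i \equiv \frac{1}{k}\sum_{j=1}^k W_{ij},\quad \hat\sigma_{wx} \equiv \hat E(\eta)\{\hat E(X^2) - \hat E(X)^2\},\quad \hat\mu_w \equiv \frac{1}{n}\sum_{i=1}^n \bar W_i,\quad \hat\mu_z \equiv \frac{1}{n}\sum_{i=1}^n Z_i,\quad \hat\Sigma_{zx} \equiv \hat E(\eta)^{-1}\hat\Sigma_{wz},$$

$$\hat E(X) \equiv \hat E(\eta)^{-1}\hat\mu_w,\quad \hat E(X^2) \equiv \{\hat E(W_1^2) - \hat\sigma_\varepsilon^2\}/\hat a^4,\quad \hat E(W_1^2) \equiv \frac{1}{nk}\sum_{j=1}^k\sum_{i=1}^n W_{ij}^2,\quad \hat E(\eta) \equiv \sqrt{\hat a^2},\quad \hat\Sigma_{zw} \equiv \hat\Sigma_{wz}^T,$$

$\hat\sigma_{\overline{ww}}^2$ is the sample variance of $\bar W_i$, $\hat\Sigma_{wz}$ is the sample covariance matrix between $W$ and $Z$, $\hat\Sigma_{zz}$ is the sample variance of $Z$, $k$ is the number of replicates, and $\hat a^2$ and $\hat\sigma_\varepsilon^2$ are from the estimator $\hat\theta = (\hat a^2, \hat\sigma_\varepsilon^2)^T$ obtained in Section 2.2.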

Replacing the unknown $X_i$ in the linear regression function by the estimates $\hat X_i$ leads to

$$Y_i = \beta_0 + \beta_x\hat X_i + \beta_z^TZ_i + e_i^*, \quad i = 1, \dots, n.$$

The estimator of the regression parameters, $(\hat\beta_x^*, \hat\beta_z^{*T})^T$, is then obtained by the ordinary least-squares method based on $\{(Y_i, \hat X_i, Z_i)\}$, $i = 1, 2, \dots, n$.

Theorem 4.1. The regression calibration estimator $(\hat\beta_x^*, \hat\beta_z^{*T})^T$ is consistent.

The proof of Theorem 4.1 is presented in Supplement S4.
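The calibration step can be illustrated in the reduced case with no error-free covariate $Z$, where the best linear approximation collapses to $\hat X_i = \hat E(X) + (\hat\sigma_{wx}/\hat\sigma_{\overline{ww}}^2)(\bar W_i - \hat\mu_w)$. The sketch below follows the plug-in definitions given above but takes $a^2$ and $\sigma_\varepsilon^2$ as inputs, so that it can be combined with any estimator of them. The function name and the simulated setting in the usage lines are hypothetical.

```python
import numpy as np

def regression_calibration_fit(y, w, a2, sig_eps2):
    """Regression calibration without Z.

    y: responses, shape (n,); w: replicates, shape (n, k);
    a2, sig_eps2: (estimated) a^2 = exp(sigma_eta^2) and sigma_eps^2.
    Returns (slope, intercept) from least squares of y on the calibrated X-hat.
    """
    y, w = np.asarray(y, float), np.asarray(w, float)
    w_bar = w.mean(axis=1)
    mu_w = w_bar.mean()
    e_eta = np.sqrt(a2)                          # E(eta) = a
    e_x = mu_w / e_eta                           # E(X) = E(W) / a
    e_x2 = (np.mean(w**2) - sig_eps2) / a2**2    # E(X^2) = {E(W^2) - sigma_eps^2} / a^4
    sigma_wx = e_eta * (e_x2 - e_x**2)           # cov(W-bar, X) = a * var(X)
    var_w_bar = w_bar.var(ddof=1)                # sample variance of W-bar
    x_hat = e_x + sigma_wx / var_w_bar * (w_bar - mu_w)
    slope, intercept = np.polyfit(x_hat, y, 1)
    return float(slope), float(intercept)

# Usage on simulated data with two replicates and the true error variances plugged in.
rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(1.0, 3.0, n)
w = x[:, None] * rng.lognormal(0.0, 0.3, (n, 2)) + rng.normal(0.0, 0.5, (n, 2))
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)
rc_slope, rc_intercept = regression_calibration_fit(y, w, np.exp(0.09), 0.25)
```

Because $\mathrm{cov}(X, \bar W) = \sigma_{wx}$, the slope of $Y$ on $\hat X$ targets $\beta_x$ itself, which is the mechanism behind Theorem 4.1 in this reduced setting.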

The issue here is that the linear regression based on the calibration data $\{(Y_i, \hat X_i, Z_i)\}$, $i = 1, 2, \dots, n$, is heteroscedastic due to the non-constant conditional variance of $X$. As a result, the usual standard errors are not accurate, even though the ordinary least-squares estimator is consistent (Carroll et al., 2006). To address this issue, two methods, the bootstrap method and the sandwich method, have been proposed to estimate the standard errors. The bootstrap method requires knowledge of the conditional expectation $E(X \mid Z, W)$ and the conditional variance $\mathrm{Var}(X \mid Z, W)$, which are unknown in this case. Hence, we apply the sandwich method to construct the standard errors (Huber, 1967). Let $\hat X^* = (1, \hat X)^T$ and $\beta = (\beta_0, \beta_x, \beta_z^T)^T$. In the regression calibration model, the estimates $\hat X_i$ depend on the nuisance parameter $\tilde\theta \equiv (\theta^T, \mu_w, \mu_z, \sigma_{\overline{ww}}^2, \mathrm{vec}(\Sigma_{zx})^T, \mathrm{vec}(\Sigma_{wz})^T, \mathrm{vec}(\Sigma_{zz})^T)^T$. If the nuisance parameter happens to be known, the ordinary least-squares estimator can be obtained based on the estimating equation $n^{-1}\sum_{i=1}^n \Psi_{1i}(\beta) = 0$, where

$$\Psi_{1i}(\beta) = \begin{pmatrix} Y_i - \beta_0 - \beta_x\hat X_i - \beta_z^TZ_i \\ (Y_i - \beta_0 - \beta_x\hat X_i - \beta_z^TZ_i)\hat X_i \\ (Y_i - \beta_0 - \beta_x\hat X_i - \beta_z^TZ_i)Z_i \end{pmatrix}.$$

More realistically, if the nuisance parameter is unknown, according to Section 2.2, the nuisance parameter is estimated based on the estimating equation $n^{-1}\sum_{i=1}^n \Psi_{2i}(\tilde\theta) = 0$, where $\Psi_{2i}$ is defined in Supplement S4. Based on the idea of joint estimating equations (Wang, 1999), $\delta \equiv (\beta^T, \theta^T, \mu_w, \mu_z, \sigma_{\overline{ww}}^2, \mathrm{vec}(\Sigma_{zx})^T, \mathrm{vec}(\Sigma_{wz})^T, \mathrm{vec}(\Sigma_{zz})^T)^T$ can be estimated by solving $n^{-1}\sum_{i=1}^n \tilde\Psi_i(\delta) = 0$, where $\tilde\Psi_i(\delta) = (\Psi_{1i}^T(\delta), \Psi_{2i}^T(\delta))^T$. Let $\hat\delta$ denote the resulting estimator. The estimating function $n^{-1}\sum_{i=1}^n \tilde\Psi_i(\delta)$ has mean 0 when evaluated at the true parameter $\delta_0$, that is, $E\{n^{-1}\sum_{i=1}^n \tilde\Psi_i(\delta_0)\} = 0$.

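We can prove that $\hat\delta$ is a consistent estimator of $\delta_0$ and $n^{1/2}(\hat\delta - \delta_0) \to_d N(0, \Sigma_{RC})$, where $\Sigma_{RC} \equiv \tilde\Psi^*(\delta)^{-1}E\{\tilde\Psi_i(\delta)\tilde\Psi_i(\delta)^T\}\tilde\Psi^*(\delta)^{-T}$ and $\tilde\Psi^*(\delta) \equiv \partial E\{\tilde\Psi(\delta)\}/\partial\delta^T$. We apply the sandwich estimator (Carroll et al., 2006; Huber, 1967) to consistently estimate the variance. Specifically, the sandwich estimator of the covariance matrix is $n^{-1}\hat A_n^{-1}\hat B_n\hat A_n^{-T}$, where

$$\hat A_n = n^{-1}\sum_{i=1}^n \frac{\partial}{\partial\delta^T}\tilde\Psi_i(\hat\delta), \quad \hat B_n = n^{-1}\sum_{i=1}^n \tilde\Psi_i(\hat\delta)\tilde\Psi_i^T(\hat\delta).$$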

The proof of consistency and asymptotic normality is presented in Supplement S4, along with the derivation of the sandwich estimator.
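The form $n^{-1}\hat A_n^{-1}\hat B_n\hat A_n^{-T}$ is easiest to see for the least-squares block $\Psi_{1i}$ alone. The sketch below treats the calibrated covariate as given, that is, it pretends the nuisance parameters are known, and therefore does not reproduce the joint sandwich over $\delta$ used in the paper; it only shows how $\hat A_n$ and $\hat B_n$ are assembled for a generic design matrix. The function name is hypothetical.

```python
import numpy as np

def ols_sandwich(design, y):
    """Least-squares fit with the sandwich covariance n^{-1} A^{-1} B A^{-T}.

    design: (n, p) matrix whose rows are (1, X-hat_i, Z_i^T); y: (n,) responses.
    Here psi_i(beta) = (y_i - x_i^T beta) x_i, so d psi_i / d beta^T = -x_i x_i^T.
    """
    X, y = np.asarray(design, float), np.asarray(y, float)
    n = X.shape[0]
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    A = -(X.T @ X) / n                              # average derivative of psi_i
    B = (X * resid[:, None] ** 2).T @ X / n         # average outer product of psi_i
    A_inv = np.linalg.inv(A)
    cov = A_inv @ B @ A_inv.T / n
    return beta, cov

# Usage on a small heteroscedastic example.
rng = np.random.default_rng(0)
x_hat = rng.uniform(1.0, 3.0, 50)
design = np.column_stack([np.ones(50), x_hat])
y = 1.0 + 2.0 * x_hat + x_hat * rng.normal(0.0, 1.0, 50)
beta_hat, cov_hat = ols_sandwich(design, y)
std_errors = np.sqrt(np.diag(cov_hat))
```

For this block the expression reduces to the familiar heteroscedasticity-robust form $(X^TX)^{-1}X^T\mathrm{diag}(\hat e_i^2)X(X^TX)^{-1}$; the joint version additionally propagates the uncertainty from estimating the nuisance parameters.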

4.3. Simulation Extrapolation

We further consider the SIMEX method, which consists of three steps: simulation, estimation, and extrapolation. In the simulation step, we generate remeasured data that incorporate additional multiplicative and additive errors, constructed so that they satisfy the basic requirement of SIMEX. In the estimation and extrapolation steps, we explore various commonly employed extrapolation functions. We also discuss how to use SIMEX when a data set contains replicates. In that case, minor adjustments are needed when constructing the remeasured data in order to fulfill the basic requirement of SIMEX; failing to make them may introduce bias into the analysis (Carroll et al., 2006).

In the simulation step, given the measurement error model (3), we simulate the new data indexed by $\xi > 0$ as

$$W_{b,i}(\xi) = \left(W_i + \sqrt{\xi}\,\varepsilon_{b,i}\right)e^{\sqrt{\xi}\log\eta_{b,i}}, \quad i = 1, \dots, n, \quad b = 1, \dots, B,$$ (11)

where $\eta_{b,i}$ and $\varepsilon_{b,i}$ are mutually independent errors that are independent of all the observed data, and are identically distributed following $\log\eta_{b,i} \sim N(0, \sigma_\eta^2)$, $\varepsilon_{b,i} \sim N(0, \sigma_\varepsilon^2)$. Note that $W_{b,i}(0) = W_i$, and in Supplement S4 we show that $\mathrm{MSE}\{W_{b,i}(-1) \mid X_i\} = 0$. Thus, intuitively, when $\xi = 0$, $W_{b,i}(\xi)$ represents the observed data. As $\xi$ increases, measurement errors are inflated, while at $\xi = -1$, the errors are reduced to zero.

For each $\xi$, the estimated model parameters are given by

$$\hat\beta(\xi) = B^{-1}\sum_{b=1}^B \hat\beta_b(\xi),$$

where $\hat\beta_b(\xi)$ is obtained using the least-squares method based on the data $\{W_{b,i}\}_{i=1}^n$.

In the extrapolation step, the extrapolation function models $\hat\beta(\xi)$ as a function of $\xi$. In this study, quadratic extrapolation is used. Setting $\xi = -1$ in the extrapolation function yields the SIMEX estimator, which we denote by $\hat\beta_{simex}$.
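The three SIMEX steps can be sketched for a single replicate and a simple linear regression as follows. The remeasurement rule is read here as $(W_i + \sqrt{\xi}\,\varepsilon_{b,i})\exp(\sqrt{\xi}\log\eta_{b,i})$; the grid of $\xi$ values, the number of remeasurements `B`, the function name and the simulated data are assumptions made for the example, and the error standard deviations are taken as known rather than estimated.

```python
import numpy as np

def simex_slope(y, w, sig_eta, sig_eps, xis=(0.5, 1.0, 1.5, 2.0), B=50, seed=0):
    """SIMEX with quadratic extrapolation for the slope of y on an error-prone w.

    Returns (simex_estimate, naive_estimate).
    """
    rng = np.random.default_rng(seed)
    y, w = np.asarray(y, float), np.asarray(w, float)
    n = w.size
    grid = [0.0]
    means = [float(np.polyfit(w, y, 1)[0])]          # xi = 0 is the naive fit
    for xi in xis:
        slopes = []
        for _ in range(B):
            # Simulation step: add extra additive and multiplicative error.
            eps_b = rng.normal(0.0, sig_eps, n)
            log_eta_b = rng.normal(0.0, sig_eta, n)
            w_b = (w + np.sqrt(xi) * eps_b) * np.exp(np.sqrt(xi) * log_eta_b)
            # Estimation step: refit on the remeasured covariate.
            slopes.append(np.polyfit(w_b, y, 1)[0])
        grid.append(xi)
        means.append(float(np.mean(slopes)))
    # Extrapolation step: quadratic in xi, evaluated at xi = -1.
    coef = np.polyfit(grid, means, 2)
    return float(np.polyval(coef, -1.0)), means[0]

rng = np.random.default_rng(1)
n = 20_000
x = rng.uniform(1.0, 3.0, n)
w = x * rng.lognormal(0.0, 0.2, n) + rng.normal(0.0, 0.3, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)
simex_est, naive_est = simex_slope(y, w, sig_eta=0.2, sig_eps=0.3)
```

Quadratic extrapolation is only an approximation to the true dependence of $\hat\beta(\xi)$ on $\xi$, so the corrected slope typically moves most of the way back towards the truth without fully removing the bias.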

In Supplement S4.5, we prove that $\sqrt{n}(\hat\beta_{simex} - \beta_0)$ is asymptotically normally distributed. However, the variance of the SIMEX estimator involves derivations based on high-dimensional matrices. To address this, we apply a jackknife-type method to estimate the variance of the SIMEX estimator. Following page 392 in Carroll et al. (2006),

$$\mathrm{Var}(\hat\beta_{simex}) \approx \mathrm{Var}(\hat\beta_{true}) + \mathrm{Var}(\hat\beta_{simex} - \hat\beta_{true}),$$ (12)

where $\hat\beta_{true}$ is the least-squares estimator based on $\{(Y_i, Z_i, X_i)\}_{i=1}^n$. To simplify notation, let $\tau^2 = \mathrm{Var}(\hat\beta_{true})$, $\hat\tau_b^2(\xi) = \widehat{\mathrm{Var}}\{\hat\beta_b(\xi)\}$, and $\hat\tau^2(\xi) = \lim_{B\to\infty}B^{-1}\sum_{b=1}^B \hat\tau_b^2(\xi)$.

To estimate the first term $\mathrm{Var}(\hat\beta_{true})$ on the right side of equation (12), an extrapolant model is fit to the components of $\{\hat\tau^2(\xi_m), \xi_m\}_1^M$, and the estimate of $\mathrm{Var}(\hat\beta_{true})$ is the modeled value at $\xi = -1$. For the second term $\mathrm{Var}(\hat\beta_{simex} - \hat\beta_{true})$, it is proved that

$$\mathrm{Var}(\hat\beta_{simex} - \hat\beta_{true}) = -\lim_{\xi\to-1}E\{s_\Delta^2(\xi)\},$$

where

$$s_\Delta^2(\xi) = (B - 1)^{-1}\sum_{b=1}^B \{\hat\beta_b(\xi) - \hat\beta(\xi)\}\{\hat\beta_b(\xi) - \hat\beta(\xi)\}^T.$$

Hence, the estimate of $\mathrm{Var}(\hat\beta_{simex})$ can be obtained by extrapolating the components of the differences, $\hat\tau^2(\xi) - s_\Delta^2(\xi)$, to $\xi = -1$. It is important to emphasize that the entire procedure is an approximation and is typically only valid in situations where the sample size is large and the measurement error is small (Carroll et al., 2006).

In the case of $k$ replicates, incorporating the mean of the replicates into equation (11) becomes challenging because the averaged multiplicative error no longer follows a Log-Normal distribution. To address this issue, we instead apply the remeasurement to each replicate separately and use all the resulting estimates to fit the extrapolation function. We define $\hat\beta_{b,j}(\xi)$ as the estimator obtained based on the remeasured data from the $j$th replicate, denoted by $\{W_{b,i,j}(\xi)\}_{i=1}^n$. For each $\xi$ and the $j$th replicate, the estimated parameters are given by

$$\hat\beta(\xi)_j = B^{-1}\sum_{b=1}^B \hat\beta_{b,j}(\xi).$$

We then conduct a quadratic regression of $\hat\beta(\xi)$ on $\xi$ based on the set of data points $\{\xi_i, \hat\beta(\xi_i)_j\}$, $i = 1, \dots, m$, $j = 1, \dots, k$, using the nonlinear least-squares method to get the extrapolation function, where $k$ denotes the number of replicates and $m$ denotes the number of chosen values of $\xi$. Setting $\xi = -1$ in the extrapolation function gives the SIMEX estimator $\hat\beta_{simex}$.

This procedure allows us to obtain more accurate coefficients by considering all the replicates instead of just one. A similar approach is applied to extrapolate the variance as well.

5. Simulation

5.1. Error Variance Estimation

We conduct simulations to evaluate the performance of the estimator $(\hat\sigma_\eta^2, \hat\sigma_\varepsilon^2)$ obtained in Section 2.2. Since estimating the error variances requires $X$ to be either positive or negative, we generated $X$ from different distributions with positive supports. We define the noise-to-signal ratio as $\sigma_w^2/\sigma_x^2 - 1$. We specify the noise-to-signal ratio to take values from the set {0, 0.25, 0.50, 0.75}, and then we examine different combinations of standard deviations for the two types of errors while controlling the noise-to-signal ratio. For each simulation setting, 500 datasets were generated independently for each of three sample sizes, $n \in \{300, 500, 1000\}$. We present the bias, relative bias, empirical standard deviation, and empirical mean squared error as evaluation criteria for the estimates. Figures 1 and 2 present the simulation results for the estimation when X~LogNormal(0,0.5) and X~LogNormal(0,1), respectively, with sample sizes of 500 (gray line) and 1000 (black line). The subfigures in each column show the estimation results under different combinations of ση and σε, corresponding to the noise-to-signal ratios 0.25, 0.5, 0.75 respectively. Since we control the ratio at these three levels, when ση decreases, σε increases correspondingly. The true values of ση and σε are shown on the x-axis of the plots. More comprehensive simulation results can be found in Tables S4 to S21 of Supplement S7.
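To see how a target noise-to-signal ratio ties $\sigma_\eta$ and $\sigma_\varepsilon$ together, note that under model (3) $\sigma_w^2 = a^4\mu_2 - a^2\mu_1^2 + \sigma_\varepsilon^2$ with $a^2 = e^{\sigma_\eta^2}$. The sketch below computes the ratio and solves for the $\sigma_\varepsilon$ that hits a target for a given $\sigma_\eta$. The helper names are hypothetical, and reading LogNormal(0, 0.5) as a log-scale standard deviation of 0.5 is an assumption; the paper does not state the parametrisation in this section.

```python
import math

def noise_to_signal(mu1, mu2, sig_eta, sig_eps):
    """sigma_w^2 / sigma_x^2 - 1 for W = X * eta + eps with E(X) = mu1, E(X^2) = mu2."""
    a2 = math.exp(sig_eta**2)
    var_x = mu2 - mu1**2
    var_w = a2**2 * mu2 - a2 * mu1**2 + sig_eps**2
    return var_w / var_x - 1.0

def sig_eps_for_target(mu1, mu2, sig_eta, target):
    """Additive error sd that yields the target ratio for a given multiplicative sd."""
    base = noise_to_signal(mu1, mu2, sig_eta, 0.0)
    if base > target:
        raise ValueError("multiplicative error alone already exceeds the target ratio")
    return math.sqrt((target - base) * (mu2 - mu1**2))

# X ~ LogNormal with log-scale mean 0 and (assumed) log-scale sd 0.5.
s = 0.5
mu1, mu2 = math.exp(s**2 / 2), math.exp(2 * s**2)
sig_eps = sig_eps_for_target(mu1, mu2, sig_eta=0.1, target=0.25)
check = noise_to_signal(mu1, mu2, 0.1, sig_eps)
```

This makes explicit why, at a fixed ratio, a smaller $\sigma_\eta$ has to be paired with a larger $\sigma_\varepsilon$.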

Figure 1: Simulation results for estimating ση and σε when X~LogNormal(0,0.5), based on sample sizes of 500 and 1000, under various combinations of ση and σε corresponding to noise-to-signal ratios of 0.25, 0.5 and 0.75.

Figure 2: Simulation results for estimating ση and σε when X~LogNormal(0,1), based on sample sizes of 500 and 1000, under various combinations of ση and σε corresponding to noise-to-signal ratios of 0.25, 0.5 and 0.75.

We observe that, in the majority of cases, the estimators effectively capture the underlying parameters with minimal bias and low standard deviations. The biases and standard deviations of σ^η tend to increase as the noise-to-signal ratio increases, but they remain stable across different true values when the same noise-to-signal ratio is maintained. The performance of σ^ε deteriorates with higher noise-to-signal ratios and increasing ση/σε ratios. Notably, when the true value of σε is zero, the estimator tends to overestimate the parameter.

We conducted additional simulations to separately investigate the impact of ση with σε held fixed and the impact of σε with ση held fixed. The results, provided in Supplement S7, show a similar pattern to the fixed noise-to-signal ratio scenarios. The bias and the standard deviation of σ^η increase as ση increases but are not affected by changes in σε. On the other hand, the performance of σ^ε is highly influenced by changes in ση. As ση increases, the estimator σ^ε becomes more biased and unstable regardless of the true σε. Given ση, the bias and empirical standard deviation of σ^ε tend to decrease as σε increases, but this improvement becomes smaller as ση increases.

The reduced performance of σ^ε may be attributed to the inflation of the variance of Xη when the variance of η increases, while the variance of the additive error ε is relatively low compared to Xη, which makes it difficult to estimate precisely.

Comparing the estimation results of different distributions of true measurement X, for unconstrained distribution, the Log-Normal(0,0.5) yields the best performance with the lowest bias and standard deviation. The estimators obtained based on the true measurement X with constrained support outperform those based on X with unconstrained distributions.

Overall, when the true value of σε is not zero, the estimator σ^η,σ^ε demonstrates high accuracy. Our simulation results suggest that the performance of these estimators is influenced by several factors, including the distribution of the true measurement variable X, the sample size, the noise-to-signal ratio, and the ratio between ση and σε. Specifically, an increase in the noise-to-signal ratio and in the variance of variable X has a detrimental effect on the estimator’s performance. Conversely, the estimator performs better as the sample size increases.

We further perform hypothesis tests to examine the presence of measurement error, specifically testing whether $a^2 = 1$ or $\sigma_\varepsilon^2 = 0$. Test 1 and Test 2 correspond to tests for multiplicative and additive errors, respectively, which are given by

Test 1: $H_0: a^2 = 1$ versus $H_1: a^2 > 1$, (13)
Test 2: $H_0: \sigma_\varepsilon^2 = 0$ versus $H_1: \sigma_\varepsilon^2 > 0$. (14)

Instead of using the asymptotic variance expressions, we adopted a bootstrap version of the tests because it provides a better approximation. The bootstrap details are described in Supplement S2. For each dataset, 1000 bootstrap samples were used. Three significance levels, α = 0.01, 0.05 and 0.1, were examined. The results are shown in Table 1 and Tables S22–S23. For Test 1, the size of the test is close to the corresponding significance level, and the difference between the size and the significance level decreases as the sample size increases. The power of Test 1 is relatively high, and it increases as the noise-to-signal ratio decreases and as the significance level rises. Additionally, as the sample size increases, the performance of the hypothesis tests improves. Note that in Table 1 the sample size is set to 5000 because, compared to the estimation of the error variances, the hypothesis tests, and Test 2 in particular, require a larger sample size to achieve reasonable performance. For Test 2, the observed sizes exceed the nominal significance levels, but they align more closely when the sample size is increased to 5000 and 10000. This trend suggests that the test statistic follows its asymptotic distribution only at very large sample sizes. The need for very large samples might be caused by the complex calculations involving higher-order moments and by the combined impact of multiplicative and additive errors.
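The size and power figures discussed above are empirical rejection rates over simulated data sets. As a minimal sketch, and assuming that each simulated data set yields one p-value from the test (the bootstrap procedure of Supplement S2 is not reproduced here, and the function names are hypothetical), they could be computed as follows:

```python
import math


def rejection_rate(p_values, alpha):
    """Fraction of simulated data sets whose p-value is at or below alpha.

    Under H0 this estimates the size of the test; under H1 it estimates power.
    """
    if not p_values:
        raise ValueError("p_values must be non-empty")
    return sum(p <= alpha for p in p_values) / len(p_values)


def monte_carlo_se(rate, n_replications):
    """Binomial standard error of an empirical rejection rate."""
    return math.sqrt(rate * (1.0 - rate) / n_replications)


# Hypothetical p-values from four simulated data sets.
example_p_values = [0.001, 0.03, 0.2, 0.6]
example_rate = rejection_rate(example_p_values, alpha=0.05)
```

The Monte Carlo standard error gives a rough sense of how far an observed size can deviate from the nominal level by simulation noise alone.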

Table 1:

Simulation results regarding the hypothesis tests for samples of size 5000 with two replicates

| Distribution | NS ratio | ση | σε | Test 1, α=0.01 | Test 1, α=0.05 | Test 1, α=0.1 | Test 2, α=0.01 | Test 2, α=0.05 | Test 2, α=0.1 |
|---|---|---|---|---|---|---|---|---|---|
| LogNorm(0,0.5) | 0.25 | 0.106 | 0.000 | 1.000 | 1.000 | 1.000 | 0.046 | 0.102 | 0.149 |
| | | 0.102 | 0.038 | 1.000 | 1.000 | 1.000 | 0.271 | 0.392 | 0.455 |
| | | 0.092 | 0.075 | 1.000 | 1.000 | 1.000 | 0.944 | 0.965 | 0.970 |
| | | 0.070 | 0.113 | 1.000 | 1.000 | 1.000 | 0.999 | 0.999 | 0.999 |
| | | 0.000 | 0.151 | 0.004 | 0.045 | 0.093 | 1.000 | 1.000 | 1.000 |
| | 0.5 | 0.207 | 0.000 | 1.000 | 1.000 | 1.000 | 0.058 | 0.110 | 0.181 |
| | | 0.201 | 0.075 | 1.000 | 1.000 | 1.000 | 0.299 | 0.424 | 0.500 |
| | | 0.180 | 0.151 | 0.999 | 1.000 | 1.000 | 0.947 | 0.962 | 0.969 |
| | | 0.139 | 0.226 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | 0.000 | 0.302 | 0.013 | 0.049 | 0.098 | 1.000 | 1.000 | 1.000 |
| | 0.75 | 0.301 | 0.000 | 1.000 | 1.000 | 1.000 | 0.063 | 0.118 | 0.181 |
| | | 0.292 | 0.113 | 1.000 | 1.000 | 1.000 | 0.305 | 0.418 | 0.490 |
| | | 0.264 | 0.226 | 1.000 | 1.000 | 1.000 | 0.970 | 0.978 | 0.981 |
| | | 0.205 | 0.340 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | 0.000 | 0.453 | 0.011 | 0.055 | 0.096 | 1.000 | 1.000 | 1.000 |
| Exp(0.5) | 0.25 | 0.143 | 0.000 | 0.998 | 1.000 | 1.000 | 0.046 | 0.106 | 0.152 |
| | | 0.138 | 0.125 | 1.000 | 1.000 | 1.000 | 0.464 | 0.563 | 0.618 |
| | | 0.124 | 0.250 | 1.000 | 1.000 | 1.000 | 0.985 | 0.989 | 0.992 |
| | | 0.095 | 0.375 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | 0.000 | 0.500 | 0.010 | 0.044 | 0.089 | 1.000 | 1.000 | 1.000 |
| | 0.5 | 0.276 | 0.000 | 1.000 | 1.000 | 1.000 | 0.064 | 0.127 | 0.194 |
| | | 0.268 | 0.250 | 0.999 | 1.000 | 1.000 | 0.436 | 0.546 | 0.612 |
| | | 0.242 | 0.500 | 1.000 | 1.000 | 1.000 | 0.974 | 0.986 | 0.987 |
| | | 0.187 | 0.750 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | 0.000 | 1.000 | 0.008 | 0.043 | 0.097 | 1.000 | 1.000 | 1.000 |
| | 0.75 | 0.395 | 0.000 | 1.000 | 1.000 | 1.000 | 0.080 | 0.155 | 0.222 |
| | | 0.384 | 0.375 | 1.000 | 1.000 | 1.000 | 0.459 | 0.559 | 0.617 |
| | | 0.349 | 0.750 | 0.999 | 1.000 | 1.000 | 0.971 | 0.980 | 0.983 |
| | | 0.274 | 1.125 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | 0.000 | 1.500 | 0.014 | 0.051 | 0.116 | 1.000 | 1.000 | 1.000 |
| $\chi^2_3$ | 0.25 | 0.132 | 0.000 | 1.000 | 1.000 | 1.000 | 0.043 | 0.096 | 0.143 |
| | | 0.128 | 0.153 | 0.999 | 1.000 | 1.000 | 0.441 | 0.538 | 0.609 |
| | | 0.115 | 0.306 | 1.000 | 1.000 | 1.000 | 0.989 | 0.992 | 0.993 |
| | | 0.088 | 0.459 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | 0.000 | 0.612 | 0.015 | 0.059 | 0.101 | 1.000 | 1.000 | 1.000 |
| | 0.5 | 0.257 | 0.000 | 0.999 | 1.000 | 1.000 | 0.055 | 0.117 | 0.164 |
| | | 0.249 | 0.306 | 1.000 | 1.000 | 1.000 | 0.411 | 0.535 | 0.594 |
| | | 0.224 | 0.612 | 1.000 | 1.000 | 1.000 | 0.984 | 0.987 | 0.990 |
| | | 0.174 | 0.919 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | 0.000 | 1.225 | 0.009 | 0.045 | 0.100 | 1.000 | 1.000 | 1.000 |
| | 0.75 | 0.369 | 0.000 | 1.000 | 1.000 | 1.000 | 0.075 | 0.137 | 0.198 |
| | | 0.359 | 0.459 | 1.000 | 1.000 | 1.000 | 0.462 | 0.553 | 0.602 |
| | | 0.325 | 0.919 | 1.000 | 1.000 | 1.000 | 0.977 | 0.986 | 0.987 |
| | | 0.255 | 1.378 | 1.000 | 1.000 | 1.000 | 0.999 | 0.999 | 0.999 |
| | | 0.000 | 1.837 | 0.014 | 0.042 | 0.097 | 1.000 | 1.000 | 1.000 |

5.2. Density Estimation

Simulations were conducted to evaluate the model fitting based on Bernstein polynomials. Since the use of Bernstein polynomials requires X to have compact support and the error variance estimation requires X to be positive or negative, we generated X from various distributions with positive, compact support: 2 Beta(1, 2) + 2, 2 Beta(3, 2) + 1, 2 Beta(2, 2) + 1, Normal(3, 1.5) truncated at (2, 4), Normal(4, 0.5) truncated at (3, 5), and Exp(2) truncated at (1, 2). The sample size was 1000, and 300 replications were generated for each setting. The true variables were contaminated by both multiplicative and additive errors with a noise-to-signal ratio of 0.25. Under this ratio, σε is set to 50% of its maximum achievable value, and ση is set to the corresponding value. The detailed values of ση and σε are presented in Figure 3. For each subject, two replicates of W were generated.
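One of the settings above can be sketched in code as follows. This is an illustration under assumptions not stated in this section: each replicate is taken to be W = Xη + ε with normally distributed η (mean 1) and ε (mean 0), and the error standard deviations passed in the example are hypothetical placeholders, not the values reported in Figure 3.

```python
import random


def simulate_replicates(n, sigma_eta, sigma_eps, n_rep=2, seed=0):
    """Draw X = 2 * Beta(1, 2) + 2 and n_rep error-prone replicates per subject.

    Assumed error model (illustrative): W_ij = X_i * eta_ij + eps_ij with
    eta_ij ~ Normal(1, sigma_eta) and eps_ij ~ Normal(0, sigma_eps).
    """
    rng = random.Random(seed)
    # X has compact, positive support [2, 4].
    x = [2.0 * rng.betavariate(1, 2) + 2.0 for _ in range(n)]
    w = [
        [xi * rng.gauss(1.0, sigma_eta) + rng.gauss(0.0, sigma_eps)
         for _ in range(n_rep)]
        for xi in x
    ]
    return x, w


# Hypothetical error levels for a sample of size 1000 with two replicates.
x_true, w_obs = simulate_replicates(1000, sigma_eta=0.05, sigma_eps=0.1)
```

Only the replicates in `w_obs` would be available to the analyst; `x_true` is kept solely for evaluating the estimated density.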

Figure 3:

Simulation results for density estimation. In each plot, the gray lines show the estimated densities from 300 replications, the black solid line is the true density, the red dotted-dashed line is the pointwise median curve, and the blue dashed lines are the 5% and 95% pointwise quantile curves.

The estimator $(\hat\sigma_\eta, \hat\sigma_\varepsilon)$ was obtained using the method in Section 2.2 and then plugged into the model to estimate the density function of the variable X. Bernstein polynomials of degrees m ∈ {0, 1, 2, 3, 4, 5, 6} were considered, and the degree was selected by AIC. Following Kekeç and Van Keilegom (2022), we evaluate the density estimation via the mean integrated absolute error (MIAE), presented in Figure 3, where

$$\mathrm{MIAE} = \frac{1}{N} \sum_{r=1}^{N} \int \left| f_X(x) - \hat f_{X,m,r}(x) \right| dx,$$

$\hat f_{X,m,r}(x)$ denotes the estimated density using the rth replication and the selected degree m, and N is the total number of replications. The results of all estimated density functions are presented in Figure 3. The settings with Beta distributions and the exponential distribution yield the best results. In each setting, both the estimated curves and the pointwise median curves closely align with the true density, and the MIAE values are relatively low. The results indicate that the density of X can be properly estimated through the Bernstein polynomial method.
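The MIAE criterion is easy to compute numerically once densities can be evaluated on a grid. The sketch below is illustrative only: it uses the standard representation of a Bernstein polynomial density as a mixture of Beta(k + 1, m − k + 1) densities rescaled to the support, which may differ from the exact parameterization used in the paper, and all function names are hypothetical.

```python
from math import comb


def bernstein_density(x, weights, lower, upper):
    """Bernstein polynomial density of degree m = len(weights) - 1 on [lower, upper].

    The weights are assumed non-negative and summing to one, so that the
    function is a mixture of rescaled Beta(k + 1, m - k + 1) densities.
    """
    if not lower <= x <= upper:
        return 0.0
    m = len(weights) - 1
    t = (x - lower) / (upper - lower)
    basis = sum(w * comb(m, k) * t**k * (1.0 - t) ** (m - k)
                for k, w in enumerate(weights))
    return (m + 1) * basis / (upper - lower)


def integrate_trapezoid(values, grid):
    """Trapezoidal rule for values tabulated on an increasing grid."""
    return sum(0.5 * (values[i] + values[i + 1]) * (grid[i + 1] - grid[i])
               for i in range(len(grid) - 1))


def miae(true_density, estimated_densities, grid):
    """Average over replications of the integrated absolute error."""
    f_true = [true_density(x) for x in grid]
    total = 0.0
    for f_hat in estimated_densities:
        abs_err = [abs(ft - f_hat(x)) for ft, x in zip(f_true, grid)]
        total += integrate_trapezoid(abs_err, grid)
    return total / len(estimated_densities)


# Example on the support [2, 4]: a degree-2 "truth" against a uniform fit.
grid = [2.0 + 0.001 * i for i in range(2001)]
truth = lambda x: bernstein_density(x, [0.2, 0.5, 0.3], 2.0, 4.0)
uniform_fit = lambda x: bernstein_density(x, [1.0], 2.0, 4.0)
example_miae = miae(truth, [uniform_fit], grid)
```

In a simulation study, `estimated_densities` would hold one fitted density per replication, each evaluated at the AIC-selected degree.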

5.3. Error-in-variable Regression Problem

In this section, simulations were conducted to evaluate the performance of the regression calibration method and simulation extrapolation. We compared traditional least-squares (LS) estimation, RC, and SIMEX based on true parameters and estimators.

The values of X were generated independently from three different distributions (the exponential, chi-square, and Log-Normal distributions), and we varied the sample size to assess the performance of the different methods. For each subject, two repeated measurements $W_{ij}$ were obtained for the unobserved $X_i$. 100 replicates were generated for each simulation configuration. Five estimators of the regression parameters are compared: the least-squares estimator based on $(Y_i, \bar W_i)$, the regression calibration estimators using the true and the estimated nuisance parameters, and the SIMEX estimators using the true and the estimated nuisance parameters. For each of the five estimators, we report the empirical bias, the empirical standard error, the mean of the estimated standard errors, and the standard deviation of the estimated standard errors.

Two linear regression models were considered, $Y = \beta_0 + \beta_x X + \beta_z Z + e$ and $Y = \beta_0 + \beta_x X + e$, where $e \sim N(0, \sigma_e)$. The values of $\sigma_e$ for the error term e in the linear regression models were set to 0.4 and 0.04. The true parameter values were set to $\beta_0 = 1, \beta_x = 0.8, \beta_z = 0.6$ for the first model and $\beta_0 = 1, \beta_x = 0.8$ for the second. The sample size n was set to n ∈ {100, 500, 2000}, and two noise-to-signal ratios, 0.25 and 0.5, were considered. Under each ratio, σε is set to 50% of its maximum achievable value, and ση is set to the corresponding value. The results are presented in Table 2 and Tables S24–S30 in Supplement S7.

Table 2:

Simulation results with Y=1+0.8X+e where e~N(0,0.04) and the noise-to-signal ratio 0.25

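| Distribution | Sample size | Statistic | Least-squares $(Y_i, \bar W_i)$: β0 | Least-squares $(Y_i, \bar W_i)$: βx | Regression Calibration, ση and σε: β0 | Regression Calibration, ση and σε: βx | Regression Calibration, $\hat\sigma_\eta$ and $\hat\sigma_\varepsilon$: β0 | Regression Calibration, $\hat\sigma_\eta$ and $\hat\sigma_\varepsilon$: βx | SIMEX, ση and σε: β0 | SIMEX, ση and σε: βx | SIMEX, $\hat\sigma_\eta$ and $\hat\sigma_\varepsilon$: β0 | SIMEX, $\hat\sigma_\eta$ and $\hat\sigma_\varepsilon$: βx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LogNorm(0,0.5) | n=100 | Bias | 0.020 | −0.020 | −0.013 | 0.014 | −0.013 | 0.013 | −0.002 | 0.003 | −0.002 | 0.004 |
| | | S.D. | 0.025 | 0.024 | 0.026 | 0.025 | 0.026 | 0.025 | 0.027 | 0.026 | 0.027 | 0.025 |
| | | MeanS.E.^ | 0.137 | 0.122 | 0.021 | 0.021 | 0.023 | 0.022 | 0.030 | 0.029 | 0.028 | 0.028 |
| | | S.D.S.E.^ | 0.008 | 0.009 | 0.005 | 0.006 | 0.005 | 0.006 | 0.006 | 0.006 | 0.007 | 0.008 |
| | n=500 | Bias | 0.023 | −0.024 | −0.003 | 0.003 | −0.003 | 0.003 | 0.000 | 0.000 | −0.000 | 0.000 |
| | | S.D. | 0.011 | 0.011 | 0.011 | 0.011 | 0.011 | 0.012 | 0.011 | 0.012 | 0.012 | 0.013 |
| | | MeanS.E.^ | 0.091 | 0.080 | 0.011 | 0.010 | 0.011 | 0.011 | 0.014 | 0.014 | 0.014 | 0.014 |
| | | S.D.S.E.^ | 0.002 | 0.003 | 0.002 | 0.002 | 0.002 | 0.002 | 0.003 | 0.003 | 0.003 | 0.004 |
| | n=2000 | Bias | 0.023 | −0.024 | −0.001 | 0.001 | −0.001 | 0.001 | 0.000 | −0.001 | 0.000 | −0.000 |
| | | S.D. | 0.004 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.006 |
| | | MeanS.E.^ | 0.064 | 0.057 | 0.006 | 0.005 | 0.006 | 0.006 | 0.007 | 0.007 | 0.007 | 0.007 |
| | | S.D.S.E.^ | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.002 |
| Exp(0.5) | n=100 | Bias | 0.036 | −0.023 | −0.017 | 0.010 | −0.017 | 0.010 | 0.004 | −0.001 | 0.004 | −0.001 |
| | | S.D. | 0.041 | 0.024 | 0.044 | 0.026 | 0.044 | 0.026 | 0.048 | 0.029 | 0.049 | 0.029 |
| | | MeanS.E.^ | 0.185 | 0.110 | 0.037 | 0.023 | 0.040 | 0.025 | 0.054 | 0.034 | 0.053 | 0.033 |
| | | S.D.S.E.^ | 0.012 | 0.007 | 0.011 | 0.007 | 0.011 | 0.008 | 0.016 | 0.009 | 0.018 | 0.012 |
| | n=500 | Bias | 0.033 | −0.022 | −0.008 | 0.005 | −0.008 | 0.005 | −0.000 | 0.001 | −0.002 | 0.002 |
| | | S.D. | 0.018 | 0.010 | 0.018 | 0.011 | 0.018 | 0.011 | 0.019 | 0.011 | 0.018 | 0.011 |
| | | MeanS.E.^ | 0.124 | 0.073 | 0.018 | 0.011 | 0.020 | 0.012 | 0.025 | 0.016 | 0.026 | 0.016 |
| | | S.D.S.E.^ | 0.004 | 0.002 | 0.004 | 0.002 | 0.004 | 0.003 | 0.007 | 0.004 | 0.007 | 0.003 |
| | n=2000 | Bias | 0.037 | −0.024 | 0.000 | 0.000 | 0.000 | 0.000 | 0.001 | −0.000 | 0.001 | −0.000 |
| | | S.D. | 0.011 | 0.006 | 0.011 | 0.007 | 0.011 | 0.007 | 0.012 | 0.007 | 0.012 | 0.007 |
| | | MeanS.E.^ | 0.088 | 0.052 | 0.010 | 0.006 | 0.011 | 0.007 | 0.013 | 0.008 | 0.014 | 0.008 |
| | | S.D.S.E.^ | 0.001 | 0.001 | 0.002 | 0.001 | 0.002 | 0.001 | 0.003 | 0.002 | 0.003 | 0.002 |
| $\chi^2_3$ | n=100 | Bias | 0.058 | −0.024 | −0.026 | 0.009 | −0.026 | 0.009 | 0.006 | −0.002 | 0.005 | −0.002 |
| | | S.D. | 0.060 | 0.023 | 0.065 | 0.025 | 0.065 | 0.026 | 0.067 | 0.025 | 0.066 | 0.026 |
| | | MeanS.E.^ | 0.215 | 0.110 | 0.050 | 0.020 | 0.055 | 0.022 | 0.075 | 0.031 | 0.077 | 0.030 |
| | | S.D.S.E.^ | 0.013 | 0.007 | 0.014 | 0.006 | 0.015 | 0.007 | 0.019 | 0.007 | 0.024 | 0.010 |
| | n=500 | Bias | 0.059 | −0.025 | −0.003 | 0.001 | −0.003 | 0.001 | 0.006 | −0.003 | 0.004 | −0.001 |
| | | S.D. | 0.028 | 0.011 | 0.030 | 0.012 | 0.030 | 0.012 | 0.031 | 0.012 | 0.031 | 0.013 |
| | | MeanS.E.^ | 0.146 | 0.073 | 0.026 | 0.010 | 0.027 | 0.011 | 0.035 | 0.014 | 0.036 | 0.015 |
| | | S.D.S.E.^ | 0.004 | 0.002 | 0.006 | 0.002 | 0.006 | 0.002 | 0.007 | 0.003 | 0.009 | 0.004 |
| | n=2000 | Bias | 0.055 | −0.024 | −0.004 | 0.001 | −0.004 | 0.001 | −0.000 | −0.000 | −0.002 | 0.000 |
| | | S.D. | 0.011 | 0.005 | 0.012 | 0.005 | 0.012 | 0.005 | 0.013 | 0.005 | 0.012 | 0.005 |
| | | MeanS.E.^ | 0.103 | 0.052 | 0.013 | 0.005 | 0.014 | 0.006 | 0.018 | 0.007 | 0.018 | 0.007 |
| | | S.D.S.E.^ | 0.002 | 0.001 | 0.002 | 0.001 | 0.002 | 0.001 | 0.003 | 0.001 | 0.004 | 0.002 |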

Here “Bias” denotes the average of β^-β, “S.D.” is the standard deviation of the 1000 estimates, “MeanS.E.^” denotes the average of 1000 standard error estimates, and S.D.S.E.^ is the standard deviation of 1000 standard error estimates.
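The four summary statistics defined in this note can be computed from the per-replication output as in the following minimal sketch. The function name and the toy inputs are hypothetical and serve only to illustrate the definitions.

```python
import statistics


def summarize_replications(estimates, se_estimates, true_value):
    """Summaries reported per estimator in the simulation tables.

    estimates: one point estimate per simulated data set.
    se_estimates: the matching estimated standard errors.
    """
    return {
        "bias": statistics.fmean(estimates) - true_value,
        "sd": statistics.stdev(estimates),           # empirical S.D.
        "mean_se": statistics.fmean(se_estimates),   # mean estimated S.E.
        "sd_se": statistics.stdev(se_estimates),     # S.D. of estimated S.E.
    }


# Toy example with three replications and a true slope of 0.8.
toy_summary = summarize_replications([0.7, 0.8, 0.9], [0.1, 0.1, 0.1], 0.8)
```

Comparing "sd" with "mean_se" indicates whether the standard error estimator is calibrated: the two should be close when it is.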

The traditional least-squares method yields seriously biased estimates. The biases of the LS estimators of βx are negative, which confirms the attenuation effect. Both the proposed RC and SIMEX methods effectively correct the bias caused by measurement errors. The bias and the empirical standard deviation decrease as the sample size grows. The performance of the estimation methods depends on the distribution of X. However, as the sample size increases, the biases of the estimators converge to zero for all distributions.
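The attenuation of the least-squares slope and its correction can be reproduced in a small simulation. The sketch below is not the authors' adjusted RC procedure: it assumes the error model W = Xη + ε with normal η (mean 1) and ε (mean 0), plugs in the true error standard deviations, and uses a generic linear calibration that divides the naive slope by an estimated reliability ratio. All names and parameter values are illustrative.

```python
import random


def ols_slope(x, y):
    """Slope of the simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def attenuation_demo(n=20000, beta0=1.0, beta_x=0.8, sigma_e=0.2,
                     sigma_eta=0.2, sigma_eps=0.15, seed=1):
    """Return (naive slope, linearly calibrated slope) for one simulated data set."""
    rng = random.Random(seed)
    x = [rng.lognormvariate(0.0, 0.5) for _ in range(n)]
    y = [beta0 + beta_x * xi + rng.gauss(0.0, sigma_e) for xi in x]
    # Two replicates under the assumed model W = X * eta + eps.
    w1 = [xi * rng.gauss(1.0, sigma_eta) + rng.gauss(0.0, sigma_eps) for xi in x]
    w2 = [xi * rng.gauss(1.0, sigma_eta) + rng.gauss(0.0, sigma_eps) for xi in x]
    w_bar = [(a + b) / 2.0 for a, b in zip(w1, w2)]

    naive = ols_slope(w_bar, y)

    # E[W1 * W2] = E[X^2] because the errors are independent with E[eta] = 1.
    ex2 = sum(a * b for a, b in zip(w1, w2)) / n
    var_x = ex2 - (sum(w_bar) / n) ** 2
    # Error variance of the mean of two replicates.
    err_var = (sigma_eta ** 2 * ex2 + sigma_eps ** 2) / 2.0
    reliability = var_x / (var_x + err_var)
    return naive, naive / reliability


naive_slope, calibrated_slope = attenuation_demo()
```

Dividing by the reliability ratio is equivalent to regressing Y on the best linear predictor of X given the replicate mean, which is the simplest form of regression calibration.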

For both RC and SIMEX, the estimator of βx obtained based on σ^η,σ^ε yields similar biases to the one based on the true parameters. This suggests that using the estimator σ^η,σ^ε will not cause significant loss. Comparing the results from these two methods, when the sample size is small, the SIMEX estimator performs better. However, as the sample size increases, the RC estimator outperforms the SIMEX estimator slightly.

For the RC method, the simulation results show that the differences between the sandwich standard error estimates and the empirical standard deviations are small, indicating that the sandwich estimator estimates the standard error well. In particular, the estimates of β0 obtained through the RC method are identical whether or not the estimator (σ^η, σ^ε) is used. We explain the reason in Supplement S4. In the case of the SIMEX method, negative extrapolated variances may be encountered when estimating the variances in the extrapolation step. To address this issue, we repeat the SIMEX method until a positive extrapolated variance is obtained. As the simulation results show, the extrapolated standard errors are similar to the empirical standard deviations, indicating that the SIMEX method accurately estimates the standard error. To conclude, the performance of both RC and SIMEX improves as the sample size increases and is negatively associated with the variances of the measurement errors and the standard error of the regression noise.

6. Analysis of a genetic data set

We proceed by applying the proposed methods to genetic data, GeneRepeat, provided by the R package augSIMEX (Zhang and Yi, 2019). The GeneRepeat dataset is adapted from the outbred Carworth Farms White (CFW) data (Parker et al., 2016). The original data were analyzed to explore the relationship between genotype and behavioral, physiological, and gene expression traits in outbred CFW mice.

The dataset consists of two parts: the main study data, which includes 672 observations, and the validation data, which includes 339 observations. Here, we only use the main study data. The main study data include measurements for the genotype of the SNP rs223979909, which serves as the response variable Y in this context. The genotype is a continuous variable ranging from 0 to 2. The covariates in the main study data include error-prone measurements W of the tibia length X, collected repeatedly at 5-minute intervals over a period of 30 minutes (W1, W2, W3, …, W6), and the body weight Z of the mice.

Prior to applying the proposed method, it is necessary to study the distributions of the repeated measurements W1, W2, W3, …, W6, since the replicates are assumed to be identically distributed. Supplement S6 shows the descriptive statistics and density plots. The analysis indicates that the distributions of the six replicates are not identical. The replicates W4, W5, W6 exhibit the most similar densities among the six. Hence, we use W4, W5, W6 as the replicates of the true tibia length. We apply the methodology for three replicates presented in Supplement S5 to W4, W5, W6. We also apply the methodology for two replicates to (W4, W5) and (W5, W6); the results are in Supplement S6. The covariate W ranges from 2000 to 5000. To enhance accuracy, we rescale it by dividing by 1000.

We first estimate the corresponding standard deviations by the estimation method described in Section 2.2. Subsequently, we apply the two bootstrap hypothesis tests to the repeated measurements to investigate the existence of measurement errors in the data. As we aim to test two hypotheses simultaneously, the significance level must be adjusted to account for multiple testing. Various multiple-testing correction methods can be employed, depending on the practical scenario. Here, we employ the Bonferroni correction and set the significance level for each test to 0.025. The estimates and the equal-tail 95% bootstrap confidence intervals are given in Table 3. Notably, the lower bounds of the 95% equal-tail confidence intervals for both Test 1 and Test 2 exceed the null values $a_0^2 = 1$ and $\sigma_{\varepsilon 0}^2 = 0$ for W4, W5, W6, which suggests that the true measurement is contaminated by both multiplicative and additive measurement errors.
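The decision rule used here, comparing the lower bound of an equal-tail bootstrap interval with the null value, can be sketched as follows. The percentile interval shown is one common bootstrap construction and is an assumption of this example; the authors' bootstrap procedure is described in Supplement S2 and may differ. The function names and the toy input are hypothetical.

```python
def equal_tail_ci(bootstrap_estimates, level=0.95):
    """Equal-tail percentile interval from bootstrap replicates of an estimator."""
    values = sorted(bootstrap_estimates)

    def quantile(q):
        # Linear interpolation between neighbouring order statistics.
        pos = q * (len(values) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        return values[lo] + (pos - lo) * (values[hi] - values[lo])

    tail = (1.0 - level) / 2.0
    return quantile(tail), quantile(1.0 - tail)


def lower_bound_exceeds(bootstrap_estimates, null_value, level=0.95):
    """True when the interval's lower bound lies above the null value."""
    lower, _ = equal_tail_ci(bootstrap_estimates, level)
    return lower > null_value


# Toy input: 1000 hypothetical bootstrap replicates taking values 1, 2, ..., 1000.
toy_replicates = [float(v) for v in range(1, 1001)]
toy_lower, toy_upper = equal_tail_ci(toy_replicates, level=0.95)
```

With a Bonferroni adjustment for two tests, the same function can be called with a `level` chosen to match the per-test significance level.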

Table 3:

Results of the estimation of standard deviations and the confidence intervals for tibia length

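| Replicates | $\hat\sigma_\eta$ | $\hat\sigma_\varepsilon$ | $\hat a^2$ | $\hat\sigma_\varepsilon^2$ | 95% CI for $\hat a^2$ | 95% CI for $\hat\sigma_\varepsilon^2$ |
|---|---|---|---|---|---|---|
| W4, W5, W6 | 0.087 | 0.377 | 1.008 | 0.142 | [1.004, 1.011] | [0.096, 0.188] |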

With the standard deviations, we proceed with the linear regression using RC and SIMEX on the response variable genotype Y and the replicated observations of tibia length, while also incorporating the instrumental variable body weight Z. Additionally, we also employ the ordinary least-squares method (LS) without considering measurement errors, to estimate the regression parameters. The results of the estimated regression coefficients and their corresponding standard errors based on W4,W5,W6 are presented in Table 4, and the results based on W4,W5 and W5,W6 are shown in Supplement S6.

Table 4:

Regression coefficient estimates and estimated standard deviations based on W4,W5,W6.

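| Method | Statistic | β0 | βx | βz |
|---|---|---|---|---|
| RC | Estimate | 0.604 | −0.0000341 | |
| | S.E.^ | 0.077 | 0.0000189 | |
| SIMEX | Estimate | 0.606 | −0.0000348 | |
| | S.E.^ | 0.075 | 0.0000198 | |
| LS | Estimate | 0.599 | −0.0000327 | |
| | S.E.^ | 0.074 | 0.0000182 | |
| RC | Estimate | 0.809 | −0.0000350 | −0.00808 |
| | S.E.^ | 0.257 | 0.0000188 | 0.00950 |
| SIMEX | Estimate | 0.807 | −0.0000333 | −0.00806 |
| | S.E.^ | 0.243 | 0.0000196 | 0.00910 |
| LS | Estimate | 0.805 | −0.0000321 | −0.00810 |
| | S.E.^ | 0.242 | 0.0000182 | 0.00909 |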

As shown in Table 4, the RC method and SIMEX method yield similar estimates of regression parameters and standard errors. Comparing the results of different methods on different combinations of replicates, we find that the RC and SIMEX estimates of the coefficient βx are both smaller than the estimates obtained through the ordinary least squares method. In other words, the absolute values of the estimated β^x obtained through the RC and SIMEX methods are greater than the estimates obtained using the LS method. This finding confirms that regression calibration and SIMEX can effectively correct the attenuation effect caused by measurement errors, as discussed in Section 2.1. The estimates for the coefficients β0 and βz obtained through RC, SIMEX, and LS methods exhibit similar results. This consistency aligns with the conclusion that the presence of both types of errors has no significant impact on the estimation of β0 and βz.

7. Conclusion

In this work, we studied the measurement error problem when the true measurement variable is subject to both additive and multiplicative errors. We proposed a method to estimate the standard deviations of additive error and multiplicative error based on replicated data. We proved the identifiability of the proposed model and the consistency of the obtained estimator. We conducted hypothesis tests to test the significance of the variances of the two types of errors, which enabled us to determine the type of measurement errors.

Further, we applied approximate MLE on Bernstein polynomials to estimate the density function of the true measurement X with compact support. We then investigated the effect of both types of errors on the estimation of regression parameters in the error-in-variable regression problem. We also adjusted the correction methods, RC and SIMEX, to correct the bias caused by both types of errors. We combined the correction methods with the variance estimator to correct the bias. Compared to previous studies, which relied on extra assumptions or specific application scenarios, our method is more versatile and can be applied in various situations. Moreover, we are the first to apply the mixture of two types of errors in regression, investigate the effect, and adjust the correction methods to make them suitable for this type of error.

The simulation results showed that our estimator performed well in various cases. The hypothesis tests correctly identified the type of measurement error present in the data, and the density of the variable can be estimated well through the combination of the estimator and the Bernstein polynomials. In the simple linear regression model, the correction methods significantly reduced the bias when the covariate is prone to two types of errors, and the combination with the proposed estimator does not cause significant loss. The method is also applied to analyze a genetic data set.

References

  1. Andrews DWK (2002). Generalized method of moments estimation when a parameter is on a boundary. Journal of Business & Economic Statistics, 20(4):530–544.
  2. Bertrand A, Van Keilegom I, and Legrand C (2019). Flexible parametric approach to classical measurement error variance estimation without auxiliary data. Biometrics, 75(1):297–307.
  3. Brenner Miguel S, Comte F, and Johannes J (2023). Linear functional estimation under multiplicative measurement error. Bernoulli, 29(3):2247–2271.
  4. Buonaccorsi JP (2010). Measurement error: models, methods, and applications. Chapman and Hall/CRC.
  5. Butucea C and Matias C (2005). Minimax estimation of the noise level and of the deconvolution density in a semiparametric convolution model. Bernoulli, 11(2):309–340.
  6. Carroll R, Ruppert D, Stefanski L, and Crainiceanu C (2006). Measurement error in nonlinear models: A modern perspective, second edition. Chapman and Hall/CRC.
  7. Carroll RJ and Stefanski LA (1990). Approximate quasi-likelihood estimation in models with surrogate predictors. Journal of the American Statistical Association, 85(411):652–663.
  8. Florens J-P, Simar L, and Van Keilegom I (2020). Estimation of the boundary of a variable observed with symmetric error. Journal of the American Statistical Association, 115(529):425–441.
  9. Gleser LJ (1990). Improvements of the naive approach to estimation in nonlinear errors-in-variables regression models. Contemporary Mathematics, 112:99–114.
  10. Huber P (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, 1:221–233.
  11. Hunter N, Muirhead CR, and Miles JC (2011). Two error components model for measurement error: application to radon in homes. Journal of Environmental Radioactivity, 102(9):799–805.
  12. Iturria SJ, Carroll RJ, and Firth D (1999). Polynomial regression and estimating functions in the presence of multiplicative measurement error. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):547–561.
  13. Kekeç E and Van Keilegom I (2022). Estimation of the variance matrix in bivariate classical measurement error models. Electronic Journal of Statistics, 16(1):1831–1854.
  14. Kneip A, Simar L, and Van Keilegom I (2015). Frontier estimation in the presence of measurement error with unknown variance. Journal of Econometrics, 184(2):379–393.
  15. Lyles RH and Kupper LL (1997). A detailed evaluation of adjustment methods for multiplicative measurement error in linear regression with applications in occupational epidemiology. Biometrics, 53:1008–1025.
  16. Marques TA (2004). Predicting and correcting bias caused by measurement error in line transect sampling using multiplicative error models. Biometrics, 60(3):757–763.
  17. Parker CC, Gopalakrishnan S, Carbonetto P, Gonzales NM, Leung E, Park YJ, Aryee E, Davis J, Blizard DA, Ackert-Bicknell CL, et al. (2016). Genome-wide association study of behavioral, physiological and gene expression traits in outbred CFW mice. Nature Genetics, 48(8):919–926.
  18. Pierce DA, Stram DO, Vaeth M, and Schafer DW (1992). The errors-in-variables problem: considerations provided by radiation dose-response analyses of the A-bomb survivor data. Journal of the American Statistical Association, 87(418):351–359.
  19. Rocke DM and Durbin B (2001). A model for measurement error for gene expression arrays. Journal of Computational Biology, 8(6):557–569.
  20. Royden HL (1968). Real analysis. Macmillan, New York, 2nd edition.
  21. Stram DO and Kopecky KJ (2003). Power and uncertainty analysis of epidemiological studies of radiation-related disease risk in which dose estimates are based on a complex dosimetry system: some observations. Radiation Research, 160(4):408–417.
  22. Subar AF, Kipnis V, Troiano RP, Midthune D, Schoeller DA, Bingham S, Sharbaugh CO, Trabulsi J, Runswick S, Ballard-Barbash R, et al. (2003). Using intake biomarkers to evaluate the extent of dietary misreporting in a large sample of adults: the OPEN study. American Journal of Epidemiology, 158(1):1–13.
  23. Tang L, Tian Y, Yan F, and Habib E (2015). An improved procedure for the validation of satellite-based precipitation estimates. Atmospheric Research, 163:61–73.
  24. Tian Y, Huffman GJ, Adler RF, Tang L, Sapiano M, Maggioni V, and Wu H (2013). Modeling errors in daily precipitation measurements: Additive or multiplicative? Geophysical Research Letters, 40(10):2060–2065.
  25. Van der Vaart AW (2000). Asymptotic statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge.
  26. Vinberg ĖB (2003). A course in algebra. American Mathematical Soc.
  27. Wang C (1999). Robust sandwich covariance estimation for regression calibration estimator in Cox regression with measurement error. Statistics & Probability Letters, 45(4):371–378.
  28. Yi GY, Delaigle A, and Gustafson P (2021). Handbook of Measurement Error Models. Chapman and Hall/CRC.
  29. Zhang D, Lin X, and Dunson DB (2008). Variance component testing in generalized linear mixed models for longitudinal/clustered data and other related topics. In Random Effect and Latent Variable Model Selection, pages 19–36. Springer New York, New York, NY.
  30. Zhang Q and Yi GY (2019). R package for analysis of data with mixed measurement error and misclassification in covariates: augsimex. Journal of Statistical Computation and Simulation, 89(12):2293–2315.
