Abstract
This paper extends the line-segment parametrization of the structural measurement error model to situations in which the error variance on both variables is not constant over all observations. Under these conditions, we develop a method-of-moments estimate of the slope, and derive its asymptotic variance. We further derive an accurate estimator of the variability of the slope estimate based on sample data in a rather general setting. We perform simulations which validate our results and demonstrate that our estimates are more precise than estimates under a different model when the measurement error variance is not small. Lastly, we illustrate our estimation approach using real data involving heteroscedastic measurement error, and compare its performance to that of earlier models.
Keywords: Delta method, heteroscedasticity, measurement error, method of moments, slope estimation
1 Introduction
Consider an observational study in which information on two variables is collected from random samples of n distinct groups within a population. Suppose a researcher is given only a set of summary statistics on the observed variables for each sample, along with the corresponding sampling errors. The researcher wishes to determine whether there is a significant linear association between the two variables and, if so, to model that association with an accurate slope estimate. The sampling error in each variable rules out simple linear regression and calls for a measurement error model. Moreover, the variability of the error on each observation for each variable is likely to differ, because of different sample sizes and other influences, so a model that accounts for heteroscedastic measurement error is required.
Heteroscedastic measurement-error (ME) models have been developed for estimating the slope in such scenarios. Kulathinal et al. apply this approach in [1] to the data collected in the WHO MONICA Project (2000) on cardiovascular disease and its risk factors. Patriota et al. also examine these data in [2], as well as astronomical data obtained from the Chandra observatory, accounting for the heteroscedastic measurement error that characterizes each data set. Under this approach, each observation (xi, yi) is modeled as
where are the independently-distributed heteroscedastic measurement errors, γi ~ 𝒩(0, ζ²) is the independently-distributed equation error, and all three errors are mutually independent. To assure identifiability, it is assumed that the measurement error variances (σi, τi), i = 1, …, n, are known or can be estimated independently. Under the structural model the χi are independent and distributed as , while the φi are also independent and distributed as . Both maximum-likelihood and method-of-moments estimators of the slope have been derived under the structural heteroscedastic ME model, as presented in [1, 2, 3] using this conventional equation error model. However, as will be demonstrated in an analysis of the WHO MONICA data, this model is not robust against misspecification: the inclusion of the equation error component makes it susceptible to underestimation of the slope when the unknown data-generation mechanism does not actually have the equation-error structure.
In this paper, we present an extension of the structural line-segment model for homoscedastic measurement error, as introduced by Davidov in [4], to this heteroscedastic ME scenario. This alternative approach omits the equation error component and is symmetric in the two variables. These features make the line-segment model more robust against misspecification, and thus preferable when the data-generation mechanism is completely unknown. We assume that the points (ℰXi, ℰYi) are randomly distributed on a line segment having latent endpoints at (ηX, ηY) and (ξX, ξY), with ηX ≤ ξX. Let δX = ξX − ηX ≥ 0 and δY = ξY − ηY. We then have the model
(1) |
where λ1, …, λn are independently drawn from a common distribution G and take values in the unit interval, and for each i. We also assume that σ1, …, σn are independently drawn from a common distribution, and likewise for τ1, …, τn. Again, to assure identifiability, we assume that these error variances are known for both variables. In practice, the error variance is rarely known, but may be accurately estimated using repeated measurements, or, when the observations are statistics based on samples, by using sampling errors. Let μλ and denote the mean and variance of λi, respectively. Since λi is bounded, these and all higher moments must be finite. Finally, assume that λi, σi and τi are mutually independent for all i.
Figure 1 provides a diagram of these two structural models. In this diagram, (xi, yi) is the observed data pair, while the linear association between X and Y is represented by the dashed trendline Y = α + βX, upon which the line segment having endpoints at (ηX, ηY) and (ξX, ξY) rests. The conditional expectation of (xi, yi), given χi, onto the line/segment is the point (χi, α + βχi) under the equation error model. This point is equivalent to the point (ηX + δX λi, ηY + δY λi), which is the conditional expectation of (xi, yi), given λi, under the line-segment model. This point lies on the dotted line between (xi, yi) and the line segment from the perspective of the line-segment model. But under the equation error model, it lies on the vertical dotted line through the unobserved point (χi, φi).
Figure 1.
Diagram of the equation-error and line-segment structural models with heteroscedastic measurement error.
Trivially, the path from the observed pair (xi, yi) to the line segment, corresponding to the line-segment model, is always shorter than the path that goes from (xi, yi) to (χi, φi), and then to the line, corresponding to the equation error model. When there is a strong correlation between the two variables, this difference in path lengths is almost negligible when using either model to compute an estimate of the slope, since the errors are small. In such cases, the dominating influence on the variance of the slope estimate is the overall dispersion in X: the larger the spread, the smaller the variance. Since the equation error is a vertical displacement, its inclusion in the model attenuates the effect of the dispersion in X, giving it an advantage over the line-segment model. However, when the errors become larger, the path difference becomes the dominating influence on the variance of the slope estimate under the two models. Hence in any scenario where the variances of the measurement errors are not small, the line-segment model will provide a more precise estimate of the slope. This improvement is demonstrated using simulation studies.
The line-segment model provides an additional benefit, whether or not there is measurement error or heteroscedasticity. The equation-error model assumes a specific structure for the underlying data-generation mechanism, a structure which may not properly explain the association between the two variables. In this case, points that lie far from the line and toward the horizontal extremes exert an excessive influence on the slope estimate in equation-error models due to the effect of including the equation errors. In the line-segment model no such effect occurs, so the influence of such points on the slope estimate is attenuated. Therefore the line-segment model is more robust against outliers at the horizontal extremes, and is better able to explain the association between the two variables in situations in which the unknown data-generation mechanism does not have the equation-error structure. This benefit will be illustrated in the application of our model to the same WHO MONICA data, which contain influential points that have caused incorrect slope estimates under the equation-error models.
In the next section we derive a method-of-moments estimate of the slope under our extension of the line-segment model, and obtain its asymptotic variance in Section 3 using the delta method. We derive a large-sample estimate of the variance of our slope estimate in Section 4 which may be used in real data analysis. In Section 5 we perform simulation studies which verify the accuracy and precision of our estimates when the data-generation mechanism conforms to the line-segment model, and we show that the precision of our slope estimate is superior in this setting to that of a method-of-moments estimate obtained through the equation-error model, using assorted ranges of measurement error variances. In Section 6 we illustrate the application of our slope estimation to real data, and compare our estimates with those derived using several equation-error methods, and using the line-segment model when homoscedastic errors are naïvely assumed. We summarize and discuss our results briefly in Section 7.
2 Point estimation of the slope
Given the structural line-segment model described above, we have
so that
Hence
and
where
We then solve for δX (which we take to be nonnegative) and δY, and use the sign of the mean covariance to determine the sign of δY, to get
and
where [·]+ = max(0, ·). Then the slope of the line segment is β = δY/δX, provided δX > 0. Note that in the ratio δY/δX the common factor drops out, so that knowing the moments of λ is unnecessary for slope estimation. However, estimates of are required.
Following the approach described in [5] and [4], we employ method-of-moments estimators, by equating sample moments with theoretical moments:
Hence
(2) |
so that the estimated slope of the line segment is
(3) |
When the measurement errors are homoscedastic, (3) agrees with (2.12) in [4].
3 Variance of the slope estimate
We continue to follow [4] in our derivation of the variance of our estimate. Since the slope estimate β̂ is location invariant, we may without loss of generality set ηX = ηY = 0 in our derivation of Var(β̂). We define
and let
This structure on Tn is more complex than its five-dimensional counterpart in the homoscedastic case. Nevertheless, with , it is straightforward to show that
based on the model assumptions in (1), and the assumed existence of . The central limit theorem then assures us that the distribution of converges to the 𝒩(0, Σ) distribution as n → ∞, where Σ is the 7 × 7 covariance matrix of Zi with entries Σij = Cov(zi, zj), 1 ≤ i, j ≤ 7.
Now define . Expanding each of these components and applying some algebraic manipulation, we find that Sn may be expressed as a function H of Tn = (t1, t2, t3, t4, t5, t6, t7), namely,
so that, when H is applied to ℰ(Tn) = μ, we have
Applying the delta method and Slutsky’s theorem to Sn, we have
where
The law of large numbers assures that , and the continuous mapping theorem then implies that , so that β̂ is a consistent estimator of β. A second application of the delta method then gives us
where
Hence
is the desired asymptotic variance of .
We substitute
into this expression, where , and, assuming the existence of each, we define
and . Note that (στ)⋆ = σ⋆τ⋆ by the independence of σi and τi for all i.
After much labor, we obtain
(4) |
where . The Cauchy–Schwarz inequality guarantees that σ⋆⋆ ≥ (σ⋆)² and τ⋆⋆ ≥ (τ⋆)². Because (στ)⋆ = σ⋆τ⋆, the expression σ⋆⋆/(σ⋆)² + τ⋆⋆/(τ⋆)² − 2(στ)⋆/(σ⋆τ⋆) equals [σ⋆⋆/(σ⋆)² − 1] + [τ⋆⋆/(τ⋆)² − 1], which is nonnegative and vanishes in the case of homoscedasticity. Although the ratio (στ)⋆/(σ⋆τ⋆) equals one, we keep it in the form of (4) because it will need to be estimated in the next section. Therefore, the variance of the estimated slope based on n independent observed pairs (xi, yi) with heteroscedastic measurement error is Var(β̂) = ω/n.
We note that in the case of homoscedastic error variance, (4) reduces to (3.3) given in [4].
4 Estimating the variance of the slope estimate
To obtain an estimate of Var(β̂), we replace each occurrence of δX and δY in (4) with the corresponding estimates given in (2). We also replace σ⋆, τ⋆, σ⋆⋆, τ⋆⋆ and (στ)⋆ with , respectively. After some simplification we obtain
(5) |
To derive the estimate , we first require an estimate of λi for i = 1, …, n. Suppose we observe (xi, yi), while ℰ(Xi, Yi) = (ηX + λiδX, ηY + λiδY). In our model, λi must be the unique value in [0, 1] that minimizes the Euclidean distance between (xi, yi) and the line segment whose endpoints are at (ηX, ηY) and (ξX, ξY). Simple calculus gives us
so that
We then have
and
Thus
so
(6) |
We may then substitute (6) into (5) to compute .
However, we have sidestepped the issue of estimating (ηX, ηY), which is necessary for computing ψi, i = 1, …, n. One remote possibility is that ηX and ηY are known. Another approach, given in [5] under the assumption that μλ and the variance of λi are known, uses the method of moments to obtain η̂X = x̅ − μλδ̂X and η̂Y = y̅ − μλδ̂Y. A third option is to construct consistent estimators of ηX and ηY using nonparametric estimation of G, as discussed in [5]. When information about G cannot be determined, we suggest a heuristic alternative.
The line segment in our model rests on the line having slope β and passing through the point (ηX + μλδX, ηY + μλδY). We estimate the location of this segment with a line having slope β̂ and passing through the point (x̅, y̅), whose equation is thus y = β̂(x − x̅) + y̅. Now, for i = 1, …, n, consider the perpendicular line having slope −1/β̂ which passes through the point (xi, yi), whose equation is thus y = −1/β̂(x − xi) + yi. Then both lines contain the orthogonal projection of (xi, yi) onto the estimated line segment. We substitute into each equation in place of (x, y), and set their right-hand sides equal, to get . Solving for and simplifying gives us
This value can then be used to compute . If we let , we have . If β̂ ≥ 0, we let . Otherwise, we let . In either case, . Hence is a biased but consistent estimator of (ηX, ηY), so we propose the use of in the computation of ψi, i = 1, …, n when nothing is known about the distribution of λi.
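The heuristic just described can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function names `project_onto_line` and `estimate_left_endpoint` are hypothetical, and taking the projected point with the smallest x-coordinate as the estimate of (ηX, ηY) is one reading of the rule above, motivated by the constraint ηX ≤ ξX. The projection formula is obtained by equating the right-hand sides of the two line equations and solving for x.

```python
def project_onto_line(points, beta_hat):
    """Orthogonally project each (x_i, y_i) onto the line with slope beta_hat
    that passes through the sample mean (x_bar, y_bar)."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    projected = []
    for x_i, y_i in points:
        # Solve beta*(x - x_bar) + y_bar = -(x - x_i)/beta + y_i for x.
        # Multiplying through by beta gives a form that is also valid at beta = 0.
        x_t = (beta_hat ** 2 * x_bar + x_i + beta_hat * (y_i - y_bar)) / (beta_hat ** 2 + 1.0)
        y_t = beta_hat * (x_t - x_bar) + y_bar
        projected.append((x_t, y_t))
    return projected


def estimate_left_endpoint(points, beta_hat):
    """ASSUMPTION: estimate (eta_X, eta_Y) by the projected point with the
    smallest x-coordinate; its y-coordinate is the smallest projected y when
    beta_hat >= 0 and the largest when beta_hat < 0."""
    return min(project_onto_line(points, beta_hat), key=lambda p: p[0])


# Points lying exactly on y = 2x + 1 project onto themselves.
exact = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
endpoint_up = estimate_left_endpoint(exact, 2.0)

# Points on a decreasing line, y = -2x + 5.
falling = [(0.0, 5.0), (1.0, 3.0), (2.0, 1.0)]
endpoint_down = estimate_left_endpoint(falling, -2.0)
```

Because the extreme projected point is pulled outward by measurement error in finite samples, an endpoint estimate of this kind is biased, which is consistent with the remark above.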
5 Simulation study
To confirm the accuracy of these estimates, we generate 50 data pairs using (1) with ηX = 0, ηY = 0, δX = 9, and δY = 6, so that β = 2/3. For i = 1, …, n, λi is drawn from a beta distribution with both parameters equal to 2, σi is drawn from a uniform distribution on (a, 2a), and τi is drawn from a uniform distribution on (b, 2b), where a and b are fixed but arbitrary positive numbers. Hence Γλ can be computed from the known moments of a beta distribution, while σ⋆, τ⋆, σ⋆⋆ and τ⋆⋆ are derived from the known moments of a uniform distribution. We choose a range of values for a and b, then use (3), (4), (5) and (6) to compute β̂, its variance Var(β̂), and the estimate of this variance. For this first simulation we assume ηX and ηY are known. We repeat 500 times, and record the estimates at each iteration.
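The data-generation step of this simulation can be sketched as follows. This is a minimal sketch, not the authors' code: the function name `simulate_line_segment` and the seed are hypothetical, and two points are assumptions that the text does not settle, namely that the measurement errors are Gaussian and that σi and τi enter as standard deviations. The slope estimate (3) and the variance expressions (4) to (6) are not reproduced here.

```python
import random


def simulate_line_segment(n, eta_x, eta_y, delta_x, delta_y, a, b, rng):
    """Draw n observed pairs whose latent means lie on the segment from
    (eta_x, eta_y) to (eta_x + delta_x, eta_y + delta_y)."""
    rows = []
    for _ in range(n):
        lam = rng.betavariate(2.0, 2.0)   # lambda_i ~ Beta(2, 2), values in [0, 1]
        sigma = rng.uniform(a, 2.0 * a)   # error scale for x_i, Uniform(a, 2a)
        tau = rng.uniform(b, 2.0 * b)     # error scale for y_i, Uniform(b, 2b)
        # ASSUMPTION: Gaussian errors, with sigma and tau used as standard deviations.
        x = eta_x + delta_x * lam + rng.gauss(0.0, sigma)
        y = eta_y + delta_y * lam + rng.gauss(0.0, tau)
        rows.append({"x": x, "y": y, "sigma": sigma, "tau": tau, "lam": lam})
    return rows


rng = random.Random(2024)  # fixed seed so the draw is reproducible
sample = simulate_line_segment(50, 0.0, 0.0, 9.0, 6.0, a=0.35, b=0.25, rng=rng)
```

Repeating the call 500 times with a fresh draw each time, and applying the slope and variance estimators to each sample, would mirror the structure of the study summarized in Table 1.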
For each run of the simulation, Table 1 presents the selected values of a and b, which control the magnitude of the error variances, in the first two columns. Column 3 gives the median value of β̂, computed using (3), over 500 iterations, along with the sample variance of this estimate in column 4. Column 5 provides the expected value of Var(β̂), based on (4) divided by n. Column 6 provides the median value of the estimated variance of β̂, computed using (5), over 500 iterations. Ideally, the value in column 3 matches the true value of the slope, i.e., 2/3, and the values in columns 4 and 6 are both close to the value in column 5 for each run. For all choices of a and b, our estimate of the line-segment slope is centered very near the true value of 0.667. Moreover, the observed variance among 500 computed values of β̂ (in column 4) is consistently close to the expected variance of β̂ (in column 5), and the median estimate of the variance of β̂ over 500 iterations (in column 6) also proves to be quite accurate, with a gradual loss of accuracy as the error variances grow. To the extent that our estimates of the variance are off-target, they are consistently a bit high, and hence give us more conservative estimates. The precision of our estimates even as the error variance grows is remarkable given our need to estimate many parameters. Hence our first simulation confirms the reliability of our estimates when ηX and ηY are known.
Table 1.
Simulation results for several choices of a and b, based on 500 iterations, with (ηX, ηY) known and β = 2/3. (S.V. = sample variance)
a | b | median(β̂) | S.V. of β̂ | Var(β̂) | median estimated Var(β̂) |
---|---|---|---|---|---|
0.05 | 0.03 | 0.669 | (0.00086) | 0.00087 | 0.00083 |
0.35 | 0.25 | 0.662 | (0.00224) | 0.00204 | 0.00219 |
0.50 | 0.55 | 0.663 | (0.00564) | 0.00550 | 0.00610 |
0.75 | 0.70 | 0.666 | (0.01077) | 0.00982 | 0.01085 |
0.85 | 0.90 | 0.657 | (0.02037) | 0.01618 | 0.01801 |
In our second simulation, we set ηX = 2, ηY = 3, δX = 9, and δY = −6, so that β = −2/3. This time we assume ηX and ηY are unknown and employ our heuristic estimates. All other conditions are the same as above. Table 2, which has the same structure as Table 1, displays a summary of our results. Despite our need to use biased estimates of ηX and ηY to compute the values in column 6, these estimates of the variance of β̂ are centered at values only slightly larger than the expected values listed in column 5. Our sample variance of β̂ over 500 iterations, given in column 4, also corresponds well with the expected variance in every run given in column 5, although it becomes inflated as the error variances grow. Moreover, the sample median of β̂ is consistently accurate even as the variance of the measurement errors increases. Hence our derivations are strongly validated by these simulations.
Table 2.
Simulation results for several choices of a and b, based on 500 iterations, with (ηX, ηY) unknown and β = −2/3. (S.V. = sample variance)
a | b | median(β̂) | S.V. of β̂ | Var(β̂) | median estimated Var(β̂) |
---|---|---|---|---|---|
0.05 | 0.04 | −0.667 | (0.00086) | 0.00087 | 0.00085 |
0.30 | 0.33 | −0.665 | (0.00228) | 0.00239 | 0.00257 |
0.46 | 0.52 | −0.660 | (0.00581) | 0.00490 | 0.00528 |
0.71 | 0.65 | −0.673 | (0.00976) | 0.00849 | 0.00968 |
0.98 | 0.93 | −0.664 | (0.02160) | 0.01871 | 0.02178 |
We then compare the performance of our slope estimate under the line-segment model to the performance of the method-of-moments slope estimate proposed recently in [2], based on the equation-error structural heteroscedastic ME model described in the introduction. An EM algorithm to compute maximum-likelihood estimates for these model parameters was proposed in [1], and additional estimation methods were presented in [3]. In [2], Patriota, et al., derive both method-of-moments and maximum-likelihood estimates for the parameters and provide performance comparisons with the earlier approaches. We will focus on their method-of-moments estimate, which we designate here as MM-P, as a contrast to ours, since that estimate is provided in closed form, and the authors found its performance to be superior to that of alternate equation-error models when the Gaussian error assumption is valid.
Using our notational equivalents, the MM-P slope estimate is , where , and (needed below) . The asymptotic variance corresponding to the MM-P slope estimate is , where . For our line-segment model, and the equation error ζ2 equals zero. Hence we may compute the expected large-sample variance of β̂ under MM-P for our simulation scenario using the known moments of the beta and uniform distributions, once a and b are specified. Moreover, using the proposed estimates of these parameters, the MM-P estimate of Var(β̂) is
(7) |
When we repeat the first simulation using the MM-P estimates, with the same choices for a and b, the median of the 500 point estimates of β is consistently close to the true value of 2/3. Columns 3 and 4 of Table 3 show the expected value of Var(β̂) under the MM-P model for each pair (a, b), along with the corresponding median of the 500 estimates of Var(β̂) computed using (7). As the magnitude of the error variance grows, the difference between these two values grows, but remains within reason. When we compare these values with those in the last two columns, which are the corresponding columns imported from Table 1, we find that the MM-P expected and estimated variances of β̂ are smaller than ours when a and b are small (first two rows, and markedly so in the first row), about the same when a and b are close to 0.5 (middle row), and larger when a and b exceed 0.5 (last two rows, where the MM-P expected variance exceeds ours by about 26% to 27%). In other words, the MM-P estimates are more precise when the error variances are quite small, but our estimates have greater precision when the error variances become appreciable. Hence we have evidence that the line-segment model will provide a better estimate of the slope when the measurement error variance is not small, as is frequently the case.
Table 3.
Simulation variance estimates for several choices of a and b, based on 500 iterations, using the MM-P model and our model, with β = 2/3.
a | b | MM-P: Var(β̂) | MM-P: median estimated Var(β̂) | Our model: Var(β̂) | Our model: median estimated Var(β̂) |
---|---|---|---|---|---|
0.05 | 0.03 | 0.00002 | 0.00002 | 0.00087 | 0.00083 |
0.35 | 0.25 | 0.00150 | 0.00151 | 0.00204 | 0.00219 |
0.50 | 0.55 | 0.00569 | 0.00555 | 0.00550 | 0.00610 |
0.75 | 0.70 | 0.01248 | 0.01259 | 0.00982 | 0.01085 |
0.85 | 0.90 | 0.02043 | 0.02003 | 0.01618 | 0.01801 |
Of course, our estimates from Table 1 are based on the assumption that ηX and ηY are known. However, even in the second simulation scenario, in which we must estimate (ηX, ηY), our estimate of Var(β̂) still returns a smaller value than the MM-P version once the error variances become appreciable, as Table 4 shows. In this scenario the variance estimate is indeed smaller under the MM-P model when the σi and the τi are restricted to small values, but when the σi are drawn from the interval (0.46, 0.92) and the τi are drawn from (0.52, 1.04), the estimates of the variance of β̂ are approximately the same under both methods. When the σi are drawn from the interval (0.71, 1.42) and the τi are drawn from (0.65, 1.30), the median variance estimate under our method is about 12% smaller (0.00968 versus 0.01102), and when the σi are drawn from the interval (0.98, 1.96) while the τi are drawn from (0.93, 1.86), it is about 19% smaller (0.02178 versus 0.02685). The corresponding expected variances are about 20% and 30% smaller under our model. Hence we recommend our approach to slope estimation when the heteroscedastic measurement error is not small, as our method will yield narrower confidence intervals for the slope.
Table 4.
Simulation variance estimates for several choices of a and b, based on 500 iterations, using the MM-P model and our model, with β = −2/3.
a | b | MM-P: Var(β̂) | MM-P: median estimated Var(β̂) | Our model: Var(β̂) | Our model: median estimated Var(β̂) |
---|---|---|---|---|---|
0.05 | 0.04 | 0.00003 | 0.00003 | 0.00087 | 0.00085 |
0.30 | 0.33 | 0.00184 | 0.00179 | 0.00239 | 0.00257 |
0.46 | 0.52 | 0.00488 | 0.00472 | 0.00490 | 0.00528 |
0.71 | 0.65 | 0.01057 | 0.01102 | 0.00849 | 0.00968 |
0.98 | 0.93 | 0.02660 | 0.02685 | 0.01871 | 0.02178 |
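The comparison in Tables 3 and 4 can be summarized by the ratio of the line-segment variance to the MM-P variance in each row. The short sketch below only re-expresses the tabulated values; the variable and function names are hypothetical, and because the entries are rounded to five decimals, ratios in the first row of each table are only rough.

```python
# Rows: (a, b, MM-P Var, MM-P median estimate, line-segment Var, line-segment median estimate),
# transcribed from Tables 3 and 4.
table3 = [
    (0.05, 0.03, 0.00002, 0.00002, 0.00087, 0.00083),
    (0.35, 0.25, 0.00150, 0.00151, 0.00204, 0.00219),
    (0.50, 0.55, 0.00569, 0.00555, 0.00550, 0.00610),
    (0.75, 0.70, 0.01248, 0.01259, 0.00982, 0.01085),
    (0.85, 0.90, 0.02043, 0.02003, 0.01618, 0.01801),
]
table4 = [
    (0.05, 0.04, 0.00003, 0.00003, 0.00087, 0.00085),
    (0.30, 0.33, 0.00184, 0.00179, 0.00239, 0.00257),
    (0.46, 0.52, 0.00488, 0.00472, 0.00490, 0.00528),
    (0.71, 0.65, 0.01057, 0.01102, 0.00849, 0.00968),
    (0.98, 0.93, 0.02660, 0.02685, 0.01871, 0.02178),
]


def variance_ratios(rows):
    """Return (a, b, expected-variance ratio, median-estimate ratio), where each
    ratio is line-segment / MM-P; values below 1 favor the line-segment model."""
    return [(a, b, ls_var / mmp_var, ls_med / mmp_med)
            for a, b, mmp_var, mmp_med, ls_var, ls_med in rows]


ratios3 = variance_ratios(table3)
ratios4 = variance_ratios(table4)
for a, b, r_var, r_med in ratios4:
    print(f"a={a:.2f} b={b:.2f} expected ratio={r_var:.2f} median ratio={r_med:.2f}")
```

In both tables the ratios fall below 1 only in the last two rows, which is where the error scales exceed roughly 0.5.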
It should be noted that in this simulation the data-generation mechanism was specified according to the line-segment model, while the MM-P procedure is based on the equation-error model. Thus our estimation procedure had a built-in advantage. When the data-generation mechanism is unknown, one must take care in choosing a model, as the results from disparate models can be quite different. The next section demonstrates this issue.
6 Data application
The World Health Organization (WHO) established the Multinational MONItoring of trends and determinants of CArdiovascular disease (MONICA) during the 1980s to study the association between known risk factors, like smoking and obesity, and trends in cardiovascular disease. The linear association between data on the average annual change in the observed risk score (X) and the average annual change in event rate (Y), both given as percentages, was modeled in [1] and [2] using equation-error measurement error procedures, with the sampling error in the trend estimates taken as the heteroscedastic measurement error. Figure 2 displays the data for males (N = 38) and for females (N = 36) separately, with the magnitude of the measurement errors indicated by the crosshairs.
Figure 2.
Scatterplot of change in event rate versus change in risk score, with standard errors, from WHO MONICA project, and lines having estimated slopes under three models, for males and females.
Table 5 displays the estimated slope and its estimated standard error for each gender computed under several different models. The ordinary least squares (OLS) method disregards measurement error. K-2000 and K-2002 represent maximum likelihood estimates provided in [1], while MM-P and ML-P represent the method-of-moments and maximum-likelihood estimates reported in [2], using measurement error models. Finally, the MM-LS estimates are those obtained using our line-segment model. We add the lines having the estimated slopes under our model and under the ML-P and MM-P models to the plots in Figure 2. For the MM-LS results, we pass the lines through the means (X̅, Y̅) for each gender, while for the ML-P and MM-P results we use the estimated intercepts provided in [2].
Table 5.
Estimates of the slope (β̂) and standard errors (SE) of the estimates for the WHO MONICA data on males and females, based on six models.
Model | Males: β̂ | Males: SE | Females: β̂ | Females: SE |
---|---|---|---|---|
OLS | 0.31 | 0.20 | 0.51 | 0.33 |
K-2000 | 0.43 | 0.22 | 0.57 | 0.33 |
K-2002 | 0.47 | 0.23 | 0.68 | 0.24 |
MM-P | 0.35 | 0.22 | 0.58 | 0.38 |
ML-P | 0.47 | 0.23 | 0.68 | 0.41 |
MM-LS | 1.76 | 0.23 | 2.10 | 0.38 |
It is quite surprising that the estimated slopes under the line-segment model are dramatically steeper than those obtained under all the other models, more than three times the size, while the standard errors on the estimates are about the same for all models. While this initially makes the MM-LS results appear to be in error, inspection of the data scatter in each plot of Figure 2 reveals that these results are more consistent with the observable trend. Indeed, the steeper slopes send an even stronger message to the public about the urgency of maintaining cardiovascular health. While it is difficult to ascertain which model is most appropriate for the latent WHO MONICA data-generation mechanism, many tools are available for assessing model fit.
For example, when we apply diagnostic tools designed for identification of influential points under the ordinary least-squares framework, such as Cook’s distance, DFFITS, DFBETA and the covariance ratio, we discover that three points are designated as influential among the male data, and four points among the female data. While these diagnostic tools are not intended for ME models, they imply that methods based on models which incorporate equation error are very vulnerable to the influence of outliers at the horizontal extremes. Under models that rely on such methods, the combined effect of these influential points is to rotate the line in the clockwise direction, resulting in a flatter slope. In contrast, the line-segment model does not involve equation error and is thus robust against such influences. This illustrates the effect of using equation error in ME models when support for this assumption is lacking, as discussed in the introduction. In a case such as that considered here, when the latent mechanism which has generated the data is unknown, it is safer to rely on a model which, like the line-segment model, is symmetric and less rigid.
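As an illustration of one of the diagnostics named above, the following sketch computes Cook's distance for a simple least-squares line using the standard closed form D_i = e_i² h_i / (p s² (1 − h_i)²) with p = 2. The data are synthetic values chosen for illustration, not the WHO MONICA data, and the function name `cooks_distance` is hypothetical; DFFITS, DFBETA and the covariance ratio are not shown.

```python
def cooks_distance(xs, ys):
    """Cook's distance for each observation under simple OLS (intercept + slope)."""
    n = len(xs)
    p = 2  # number of fitted coefficients
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s2 = sum(e * e for e in residuals) / (n - p)               # residual variance
    leverage = [1.0 / n + (x - x_bar) ** 2 / sxx for x in xs]  # hat-matrix diagonal
    return [e * e * h / (p * s2 * (1.0 - h) ** 2)
            for e, h in zip(residuals, leverage)]


# Synthetic example: five points near y = x plus one point at the horizontal
# extreme that lies well below that trend.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 6.0]
ys = [-1.9, -1.1, 0.1, 0.9, 2.1, 1.0]
distances = cooks_distance(xs, ys)
most_influential = max(range(len(xs)), key=lambda i: distances[i])
```

In this toy example the point at x = 6 has by far the largest distance, and it flattens the fitted slope, which is the kind of clockwise rotation described above.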
We may demonstrate the robustness of the line-segment model by deleting each of the identified influential points one at a time and computing the slope estimate on each subset of the data. We do likewise for the MM-P model, and display the results in Table 6. Deletion of influential points alters the slope under our line-segment model by 5% to 14% for males and by −8% to 14% for females. But under the MM-P model the slope is altered by −34% to 29% for males and by −24% to 24% for females. This illustrates the robustness of the line-segment model to influential points and helps explain the disparity between the slope estimates from the two models. Given that the precision of the slope estimates is equivalent between the two models, this robustness property recommends the line-segment ME model over the others considered.
Table 6.
Estimates of the slope and standard errors of the estimates for the WHO MONICA data on males and females, based on two models, when no points are deleted, and when individual influential points are deleted.
Sex | Model | Full Data | Point 1 | Point 2 | Point 3 | Point 4 |
---|---|---|---|---|---|---|
Males | MM-P | 0.35 (0.22) | 0.45 (0.24) | 0.31 (0.25) | 0.23 (0.25) | |
Males | MM-LS | 1.76 (0.23) | 1.85 (0.25) | 1.86 (0.26) | 2.01 (0.27) | |
Females | MM-P | 0.58 (0.38) | 0.72 (0.35) | 0.65 (0.41) | 0.63 (0.35) | 0.44 (0.44) |
Females | MM-LS | 2.10 (0.38) | 1.93 (0.38) | 2.22 (0.41) | 2.10 (0.37) | 2.39 (0.50) |
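The percentage changes quoted in the text follow directly from the slope estimates in Table 6. The sketch below recomputes them; the dictionary layout and the function name `percent_changes` are hypothetical, and only the tabulated slopes are used.

```python
# Slope estimates from Table 6: (full-data slope, slopes after deleting each influential point).
table6 = {
    ("males", "MM-P"): (0.35, [0.45, 0.31, 0.23]),
    ("males", "MM-LS"): (1.76, [1.85, 1.86, 2.01]),
    ("females", "MM-P"): (0.58, [0.72, 0.65, 0.63, 0.44]),
    ("females", "MM-LS"): (2.10, [1.93, 2.22, 2.10, 2.39]),
}


def percent_changes(full, deleted):
    """Relative change in the slope, in percent, after each single-point deletion."""
    return [100.0 * (d - full) / full for d in deleted]


change_ranges = {}
for key, (full, deleted) in table6.items():
    changes = percent_changes(full, deleted)
    change_ranges[key] = (round(min(changes)), round(max(changes)))
    print(key, change_ranges[key])
```

The resulting ranges are 5% to 14% (males) and −8% to 14% (females) for MM-LS, against −34% to 29% (males) and −24% to 24% (females) for MM-P.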
7 Discussion
Many scientific investigations involve assessing the association between two variables when measurements recorded on both variables are subject to random error. When the variance of this error differs from one subject to another, a heteroscedastic measurement error model is appropriate. Conventional ME models incorporate an equation error component, which involves making potentially insupportable assumptions about the unknown data-generation mechanism. Misspecification of the chosen model can lead to significant underestimation of the slope, regardless of the estimation procedure used.
In this paper we have provided an alternative heteroscedastic ME model based on a line-segment parametrization. This model does not incorporate equation error and is symmetric in both variables. For any setting in which this model corresponds well with the underlying data-generation mechanism, we have provided an accurate estimate of the linear association between the two variables, signified by the slope of the line segment, along with a reliable estimate of its precision. We have demonstrated through simulations that, when the line-segment model is properly specified, the corresponding variance estimate yields narrower confidence intervals than equation-error models do when the error variances are not small. This estimation procedure enables an investigator to make precise inferences about the slope when heteroscedastic ME models are applied to scientific data, and thereby to draw conclusions that will generally be more trustworthy than those derived using other ME models. Moreover, because the line-segment parametrization is robust against influential points, which may plague equation-error models when they are misspecified, the advantages of our model are further reinforced.
Acknowledgment
The authors thank Kari Kuulasmaa for making the WHO MONICA data available, and Alexandre Patriota for sharing the data.
This research is supported in part by the NSF under Grant No: BCS 0527766 and HSD 0826844, and by the NIH under Grant No: 1R01AG025218-01A2.
References
- 1. Kulathinal SB, Kuulasmaa K, Gasbarra D. Estimation of an errors-in-variables regression model when the variances of the measurement errors vary between the observations. Statistics in Medicine. 2002;21:1089–1101. DOI: 10.1002/sim.1062.
- 2. Patriota AG, Bolfarine H, de Castro M. A heteroscedastic structural errors-in-variables model with equation error. Statistical Methodology. 2009;6:408–423. DOI: 10.1016/j.stamet.2009.02.003.
- 3. Cheng CL, Riu J. On estimating linear relationships when both variables are subject to heteroscedastic measurement errors. Technometrics. 2006;48:511–519. DOI: 10.1198/004017006000000237.
- 4. Davidov O. Estimating the slope in measurement error models: a different perspective. Statistics & Probability Letters. 2005;71:215–223. DOI: 10.1016/j.spl.2004.11.011.
- 5. Davidov O, Goldenshluger A. Fitting a line segment to noisy data. Journal of Statistical Planning and Inference. 2004;119:191–206. PII: S0378-3758(02)00409-3.