Abstract
This paper investigated the consequences of measurement error in the pretest for the estimate of the treatment effect in a pretest–posttest design analyzed with the analysis of covariance (ANCOVA) model, focusing on both the direction and the magnitude of the resulting bias. Some prior studies have examined the magnitude of the bias due to measurement error and suggested ways to correct it. However, none of them clarified how the direction of the bias is affected by measurement error. This study analytically derived a formula for the asymptotic bias in the treatment effect. The derived formula is a function of the reliability of the pretest, the standardized population group mean difference on the pretest, and the correlation between pretest and posttest true scores. It revealed a concerning consequence of ignoring measurement error in pretest scores: treatment effects can be overestimated or underestimated, and positive treatment effects can be estimated as negative effects under certain conditions. A simulation study was also conducted to verify the derived bias formula.
Keywords: analysis of covariance, structural equation model, measurement error in pretest, bias in treatment effect
The pretest–posttest design with treatment and control groups is a simple and popular program evaluation design in psychology and education, and its goal is to estimate the treatment effect accurately. One of the best strategies for achieving this goal is random assignment. For example, What Works Clearinghouse (2020) gives its highest rating, "Meets Standards," to well-designed and executed randomized controlled trials (RCTs). Under a successfully implemented RCT, a simple analysis such as an independent-samples t-test produces a valid, unbiased, and consistent estimate of the treatment effect. Random assignment, however, can be imperfect, and it is often infeasible for logistical and/or ethical reasons in educational, social, and psychological research contexts. In such cases, the study becomes a quasi-experiment.
In quasi-experimental studies, preexisting differences between groups on the pretest must be taken into account to obtain an unbiased and consistent estimate of the treatment effect. Along with other choices (e.g., propensity score analysis and regression discontinuity analysis), quasi-experimental studies often rely on the analysis of covariance (ANCOVA) to adjust for preexisting differences in covariates, such as pretest scores.
An ANCOVA can produce an unbiased and consistent estimate of the treatment effect in the pretest–posttest design with treatment and control groups when certain assumptions are met, such as no other confounding variables, no measurement error in the pretest, and homogeneity of regression slopes. Suppose the model meets the homogeneity-of-regression-slopes assumption and has no other confounding variables, but its pretest variable contains measurement error. It is generally known that measurement error in the pretest produces bias in the estimated treatment effect; however, the nature of that bias, such as its direction and size, is not well known. Yet for practitioners who design studies and interpret their results, and for policy makers who evaluate the efficacy of treatments and base public policy on those results, understanding this bias is essential for making good judgments. Thus, it is important to describe in detail the direction and size of the bias caused by unreliability of the pretest scores under various conditions and research designs.
Several studies have investigated issues related to bias due to measurement error in predictors in regression-type models, such as linear, generalized linear, and multilevel models. For example, some studies attempted to correct the bias in parameter estimates (Battauz et al., 2011; Battistin & Chesher, 2014; Devanarayan & Stefanski, 2002; Hong et al., 2019; Lockwood & McCaffrey, 2014; Rabe-Hesketh, Skrondal, & Pickles, 2003; Sengewald et al., 2019). Also, Culpepper and Aguinis (2011) conducted detailed investigations of the magnitude of bias due to measurement error in predictors. However, none of these studies investigated the direction of the bias. Therefore, this study investigated both the direction and the magnitude of the bias caused by measurement error in a covariate in an ANCOVA under various conditions.
The rest of the present article is organized as follows. We first summarize basic facts about measurement error in a linear regression model, then review the literature on measurement error in predictors of regression models and summarize its major points. Next, we analytically derive a formula for the asymptotic bias of the treatment effect and use the derived formula to consider the direction and size of the bias. We then use a simulation study to confirm that the derivation is correct; in the simulation study, we also show that a structural equation model (SEM)-based method for correcting the bias works well. Finally, we state conclusions and recommendations for practitioners.
Impact of Measurement Errors in Linear Regression
In a multiple regression model, measurement error in the dependent variable has far less serious consequences than measurement error in the independent variables (Kmenta, 1986). Measurement error in the dependent variable is absorbed into the residual term, which becomes the sum of the disturbance, comprising the many unobserved factors that influence the dependent variable, and the random measurement error in the dependent variable. Therefore, measurement error in the dependent variable does not bias the regression parameters; it only makes the estimates less precise by increasing the standard errors (Wooldridge, 2020). On the other hand, measurement error in the independent variables has serious consequences, producing bias in the parameter estimates. This bias cannot be eliminated no matter how much we increase the sample size (Greene, 2008).
Measurement error is ubiquitous in the social and behavioral sciences, in which human behaviors and social phenomena are the objects of study. Many characteristics of those behaviors and phenomena are modeled by unobserved constructs, such as intelligence, personality, and social class. When we try to measure such unobserved constructs, measurement error is unavoidable, since the constructs are latent and elusive and thus hard to define explicitly and objectively. Therefore, if variables representing these unobserved constructs are used as independent variables, even in a simple bivariate regression analysis, bias arises in the parameter estimates. This problem is well known as attenuation (Crocker & Algina, 1986) or attenuation bias (Wooldridge, 2020). The procedure for correcting the bias using an externally obtained reliability coefficient is referred to as the correction for attenuation.
To illustrate attenuation in a linear regression with one independent variable, we write the regression model below. Here, we assume that the independent variable X is measured with error. Although it would be realistic to assume that the dependent variable Y also has measurement error, we assume for simplicity that Y has none, since measurement error in the dependent variable does not bias the regression parameters, as mentioned earlier. The regression equation is
$$Y = \beta_1 + \beta_2 T + \varepsilon \qquad (1)$$
and
$$X = T + U \qquad (2)$$
where the prediction error is represented by ε. Here, Y can be an observed posttest score and X an observed pretest score in a simple pretest–posttest design. Since X is measured with error, X consists of its true score T and random measurement error U in Equation 2, just like the true-score model of classical test theory (Crocker & Algina, 1986). Note that in Equation 1, Y is regressed on the true score T, not the observed score X. We assume that T and ε, T and U, and U and ε are uncorrelated. Furthermore, we assume that ε has mean 0 and variance σ²ε (i.e., E(ε) = 0, Var(ε) = σ²ε), and that the measurement error U has mean 0 and variance σ²U (i.e., E(U) = 0, Var(U) = σ²U). Based on classical test theory, the reliability of X (the pretest scores), denoted λ, can be expressed as
$$\lambda = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_U} \qquad (3)$$
where σ²X is the variance of the observed pretest score (Var(X) = σ²X) and σ²T is the variance of the true pretest score (Var(T) = σ²T). The reliability λ has a restricted range between 0 and 1. As an example of a rule of thumb, if at least 80% of the observed score variance consists of true score variance (λ ≥ 0.8), the test score is considered to have good reliability (Nunnally & Bernstein, 1994).
When the above model represents the phenomenon occurring in the population, but we regress Y on the observed pretest score X rather than the true score T, that is,
$$Y = \beta'_1 + \beta'_2 X + \varepsilon^* \qquad (4)$$
then the population regression coefficient β′2 on X becomes
$$\beta'_2 = \lambda \beta_2 \qquad (5)$$
indicating that the absolute value of β′2 is smaller than that of β2 and is pulled toward zero by the factor λ, given that 0 ≤ λ ≤ 1. This is attenuation, a bias toward zero.
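To make the attenuation concrete, the short simulation below (a minimal sketch of ours, not from the original article; all parameter values are arbitrary) generates data according to Equations 1 through 3 and recovers the attenuated slope of Equation 5:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

n = 200_000            # large n so sampling error is negligible
beta2 = 0.6            # true slope on the true score T (Equation 1)
lam = 0.8              # target reliability of X (Equation 3)

T = rng.normal(5.0, 1.0, n)                   # true pretest scores, Var(T) = 1
sigma2_U = (1 - lam) / lam * 1.0              # Var(U) chosen so Var(T)/Var(X) = lam
U = rng.normal(0.0, np.sqrt(sigma2_U), n)     # random measurement error
X = T + U                                     # observed pretest (Equation 2)
Y = 1.0 + beta2 * T + rng.normal(0, 0.5, n)   # posttest generated from T

# OLS slope of Y on the fallible X: Cov(X, Y) / Var(X)
slope_X = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)
print(f"slope on X: {slope_X:.3f}  vs  lambda * beta2 = {lam * beta2:.3f}")
# Both values are close to 0.48, illustrating the attenuation in Equation 5.
```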
If there is more than one independent variable in the model and some or all of the independent variables contain measurement error, the literature indicates that the effects of measurement error become more complicated and difficult to predict. Even if only one of the independent variables has measurement error, the influence spreads to all the regression parameters. This is an important point to remember, and Greene (2008) described the phenomenon as follows: "a badly measured variable contaminates all the least squares estimates" (p. 327). The coefficient on the badly measured variable is still biased toward zero, and the other coefficients are all biased as well, although in unknown directions. It is also very important to keep in mind that, unlike the bivariate regression case, the direction of the bias can be either positive or negative depending on conditions and is not easy to predict (Greene, 2008). Darlington (1990) described this malicious influence of measurement error as the most serious weakness of multiple regression.
Literature on Measurement Error in Predictors in Regression
There are comprehensive accounts of the issues associated with measurement error in linear models (Fuller, 1987) and in nonlinear models (Carroll et al., 2006), the latter of which also covers linear models. Both provide rigorous, technical treatments of a wide range of measurement error problems and solutions. Beyond these, most recent studies on the impact of measurement error in independent variables focus on technical methodology for obtaining unbiased, or at least consistent, estimators, comparing the proposed method with available alternatives. For example, Schafer and Purdy (1996) formulated a model in which covariates contain measurement error and estimated the parameters via maximum likelihood with the expectation–maximization (EM) algorithm; their approach was similar in spirit to the SEM framework, but it used a simpler classical test theory (i.e., true-score-plus-error) measurement model rather than the factor-analytic measurement model used in SEM. Devanarayan and Stefanski (2002) presented a variation of the simulation–extrapolation (SIMEX) algorithm, originally developed by Cook and Stefanski (1994), for the case in which the measurement error variance(s) are unknown but replicate measurements are available. Rabe-Hesketh, Pickles, and Skrondal (2003) and Rabe-Hesketh, Skrondal, and Pickles (2003) considered covariate measurement error in the generalized linear model context, which includes logistic regression; the model was estimated both by maximum likelihood with numerical integration via quadrature and by nonparametric maximum likelihood estimation (NPMLE), which is useful when the measurement error is non-normally distributed. Lockwood and McCaffrey (2014) considered conditional standard errors of measurement of the pretest scores' plausible values, instead of the regular standard errors of measurement, to better adjust for measurement error in pretests. Battauz et al. (2011) considered measurement error in achievement test scores in a multilevel repeated-measures context for estimating the value added by teachers and classes, with initial achievement as the baseline predictor and achievement in subsequent periods as the dependent variable. There are also works that consider the covariate measurement error problem within the Rubin causal inference framework. For example, Battistin and Chesher (2014) investigated the effect of covariate measurement error on treatment effect analysis. Similarly, Sengewald et al. (2019) investigated the impact of measurement error on the estimation of the average treatment effect (ATE) in quasi-experiments. Hong et al. (2019) illustrated the impact of multiple correlated error-prone covariates on propensity score methods for causal inference in nonexperimental studies. Thus, all the references listed above focused on how to correct the bias due to measurement error in covariates in various scenarios, such as when the dependent variable is not continuous, or when computing propensity scores or plausible values while taking into account the measurement error in the covariates that constitute them.
The one exception, which focuses more on the nature of the bias, namely its direction and size, is Culpepper and Aguinis (2011). They explicitly derived a formula for the exact squared bias of the treatment effect and presented useful figures depicting the Type I error rate as a function of the covariate (e.g., pretest) reliability, the group mean difference on the covariate, and the correlation between the covariate and the dependent variable of interest. They also compared the relative absolute bias introduced by different methods of correcting, or not correcting, measurement error in the covariate. Their article is quite useful for understanding the nature of the bias, but it left unclear how the direction of the baseline imbalance (i.e., which group has the higher covariate mean; Egbewale et al., 2014) interacts with the direction of the bias, because the baseline imbalance was represented as a squared part correlation capturing the unique effect of treatment beyond the covariate; the sign of the bias was therefore suppressed.
For practitioners of evaluation studies, however, it is important to understand and be able to predict, first, in which direction the bias occurs and, second, how large it can be. Despite this importance, the literature reviewed above focused on how to correct covariate measurement error bias, and, to the best of our knowledge, the direction and size of the bias are not well understood or addressed. This is the focus of the present article. In a two-group pretest–posttest design, a typical research design in evaluation studies, we investigate in which direction and by how much the treatment effect estimate is biased under various conditions when the analysis ignores the measurement error in the pretest. As factors, we include the pretest reliability, the baseline imbalance, and the correlation between pretest and posttest, which are key factors according to the literature. The remainder of the article proceeds as follows. We first set up the two-group pretest–posttest scenario and specify a data generation model in which the pretest covariate contains measurement error. Next, we derive the asymptotic bias introduced when an ANCOVA is conducted ignoring the measurement error in the pretest; based on the derived formula, we examine the direction and size of the bias and depict them in tables and figures. Third, to confirm the accuracy of the analytically derived bias formula, we conduct a simulation study to obtain the bias empirically. As a comparison, we also analyze the same datasets with an SEM approach, which accounts for the measurement error in the pretest by incorporating an externally provided reliability coefficient, such as Cronbach's alpha; the SEM approach is a popular way to correct measurement error bias in the social and behavioral sciences. When examining the bias of the ANCOVA and SEM approaches, we also compare their efficiency, overall estimation accuracy (root mean squared error), power, and Type I error rate as supplemental analyses. Finally, we draw conclusions and provide guidelines for practitioners.
Analysis of Covariance
The ANCOVA model can be used to evaluate treatment effects in the pretest–posttest two-group design in experimental or quasi-experimental studies. Since ANCOVA is a type of multiple regression model with two or more independent variables, it assumes that all independent variables are measured without error. If this assumption does not hold, a consistent and unbiased estimate of the treatment effect is not guaranteed.
The ANCOVA for the pretest–posttest two-group design is a multiple linear regression model with two independent variables: a categorical variable for group membership (control vs. treatment) and a continuous variable (the pretest score). Assuming the pretest scores are measured with error, the ANCOVA model can be expressed as
$$Y = \beta_0 + \beta_1 D + \beta_2 T + \varepsilon \qquad (6)$$
and
$$X = T + U \qquad (7)$$
where Y is the observed posttest score and D is a dummy variable for group membership (D = 0 for the control group, D = 1 for the treatment group). The ANCOVA is a model for estimating the treatment effect β1 while adjusting for the true pretest score T as a covariate, because E(Y|D = 1, T) − E(Y|D = 0, T) = β1 if Equation 6 holds in the population.
However, X is the observed pretest score, which contains measurement error; thus X consists of its true score T and random measurement error U, as in Equation 7. In Equation 6, Y is regressed on the true pretest score T, not on the observed score X, just like Equation 1 in the previous section. Here, D involves no measurement error, as treatment status is typically a known fact in a research setting. We also make the standard assumptions of multiple linear regression: the exogeneity assumptions Cov(D, ε) = 0 and Cov(T, ε) = 0, and the assumptions on the disturbance ε that E(ε) = 0 and Var(ε) = σ²ε. Furthermore, T and U, and ε and U, are assumed to be independent. The mean of the true pretest score is denoted E(T|D = 1) = μ1 for the treatment group (group 1) and E(T|D = 0) = μ2 for the control group (group 2). We assume the within-group variances of the true score T are the same in both groups, that is, Var(T|D = 1) = Var(T|D = 0) = σ²T|D. Finally, the measurement error of the pretest is assumed to have mean zero (E(U) = 0) and variance Var(U) = σ²U.
If Equations 6 and 7 hold in the population (i.e., the pretest is the only predictor that explains the difference between groups before the intervention), and if random assignment is performed and working, then the mean true pretest difference is zero (μ1 − μ2 = 0) and there is no correlation between T and the dummy variable D (ρTD = 0). On the other hand, if there was no random assignment, or if it is not working properly, then μ1 − μ2 ≠ 0 and, therefore, ρTD ≠ 0.
Now suppose that we regress Y on D and the observed pretest score X rather than the true score T, ignoring its measurement error. That is, we run the following multiple regression model:
$$Y = \beta'_0 + \beta'_1 D + \beta'_2 X + \varepsilon^* \qquad (8)$$
This regression yields the coefficient β′1, denoted differently from β1 because we now control for the observed pretest score X instead of the true pretest score T. Natural questions then arise: Does bias emerge because the fallible measure X was used as the covariate? If so, is the bias positive or negative? Under what conditions does it appear? How large is it? To the best of our knowledge, little is known to date about when and under what conditions positive or negative bias occurs and how large its magnitude is. Therefore, a systematic study is needed to address these questions. In the present study, we address them via algebraic derivation and a simulation study.
Analytical Results of Bias
Using the general formulas for the probability limit of the ordinary least squares (OLS) estimators of the regression coefficients provided by Greene (2008), we derived the bias in the treatment effect induced by running the ANCOVA model in Equation 8. The detailed derivation is provided in Online Appendix B; the end result for the asymptotic bias in the treatment effect is
$$\text{plim}\,\hat{\beta}'_1 - \beta_1 = \beta_2\,\frac{\sigma_T}{\sigma_D}\cdot\frac{(1-\lambda)\,\rho_{TD}}{1-\lambda\,\rho^2_{TD}} \qquad (9)$$
which is Equation B-5 in Online Appendix B, where plim β̂′1 is the probability limit of the OLS estimate of the coefficient on the group-membership dummy D when the fallible pretest measure X is used as the covariate, as in Equation 8; β2 is the regression coefficient on the true pretest score T in Equation 6; σT and σD are the population standard deviations of T and D; λ is the reliability of the pretest score (0 ≤ λ ≤ 1); and ρTD is the population correlation between T and D. The key steps of the derivation in Online Appendix B are to consider the probability limit of the OLS estimators for Equation 8, find the inverse of the probability limit of the design-matrix cross-product Q* = plim (1/n) X*′X*, and multiply it by the probability limit of (1/n) X*′y, so that plim β̂ = (plim (1/n) X*′X*)⁻¹ (plim (1/n) X*′y).
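For readers without access to the appendix, the essence of the argument can be sketched compactly (our own rendering, using centered variables so the intercept drops out). Because Var(X) = σ²T/λ, Cov(X, D) = Cov(T, D), and Cov(X, Y) = β1 Cov(T, D) + β2 σ²T,

$$\text{plim}\begin{pmatrix}\hat{\beta}'_1\\ \hat{\beta}'_2\end{pmatrix} = \begin{pmatrix}\sigma^2_D & \rho_{TD}\sigma_T\sigma_D\\ \rho_{TD}\sigma_T\sigma_D & \sigma^2_T/\lambda\end{pmatrix}^{-1}\begin{pmatrix}\beta_1\sigma^2_D + \beta_2\rho_{TD}\sigma_T\sigma_D\\ \beta_1\rho_{TD}\sigma_T\sigma_D + \beta_2\sigma^2_T\end{pmatrix}$$

Inverting the 2 × 2 matrix and simplifying the first element yields Equation 9.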
Equation 9 indicates that the sign of the bias in the treatment effect is determined by the sign of the slope on the true pretest score, β2, and the sign of the correlation between the true pretest score T and the grouping variable D (ρTD). In other words, if β2 and ρTD are both positive, the bias is positive; that is, the treatment effect will be overestimated. β2 is likely to be positive in educational achievement settings, since students who perform well on a pretest are likely to perform well on the posttest too. As for ρTD, a positive value indicates that the treatment group (D = 1) tends to have higher pretest scores than the control group (D = 0). On the other hand, if ρTD is negative, indicating that the treatment group tends to have lower pretest scores than the control group, the bias is negative; that is, the treatment effect will be underestimated, assuming β2 is positive. Note that ρTD can be negative in many real-world settings, because assignment to the treatment group may consist primarily of individuals with lower pretest scores when the treatment is a remediation program, such as Head Start (Office of Head Start, 2020).
Another important implication of Equation 9 is that the estimated treatment effect is biased, in either direction, only if the pretest is measured with error (i.e., λ < 1) and there is a nonzero correlation between the true pretest score T and treatment group membership D (i.e., ρTD ≠ 0). If either of these two conditions fails, there is no bias regardless of the other condition: if the pretest is perfectly reliable (λ = 1), there is no bias even if ρTD ≠ 0, and if ρTD = 0, the bias is zero even if the pretest is not perfectly reliable (λ < 1).
It should be emphasized that the direction of the bias matches the sign of ρTD, assuming β2 > 0. This has an important practical implication, especially when ρTD is negative. For example, for a remediation program like Head Start, the treatment effect will be underestimated if the pretest is not perfectly reliable. The underestimated program effect may affect the evaluation of the program negatively.
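As a quick numerical check of the sign behavior just described, the helper below evaluates Equation 9 as reconstructed above (a minimal sketch of ours; the function name and the illustrative parameter values are not from the article):

```python
def asymptotic_bias_eq9(beta2, sigma_T, sigma_D, lam, rho_TD):
    """Asymptotic bias of the ANCOVA treatment effect (Equation 9)."""
    return beta2 * (sigma_T / sigma_D) * (1 - lam) * rho_TD / (1 - lam * rho_TD**2)

# Positive rho_TD (treatment group starts higher): overestimation
print(asymptotic_bias_eq9(beta2=0.9, sigma_T=1.0, sigma_D=0.5, lam=0.8, rho_TD=0.3))   # > 0
# Negative rho_TD (remediation program, e.g., Head Start): underestimation
print(asymptotic_bias_eq9(beta2=0.9, sigma_T=1.0, sigma_D=0.5, lam=0.8, rho_TD=-0.3))  # < 0
# Perfect reliability: no bias even with baseline imbalance
print(asymptotic_bias_eq9(beta2=0.9, sigma_T=1.0, sigma_D=0.5, lam=1.0, rho_TD=0.3))   # 0.0
```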
Although Equation 9 is a general formula that works for any type of variables T and D, we can specialize it to a dichotomous D and a continuous T, because the correlation ρTD then translates into a mean difference between the groups: with equal group sizes, ρTD = (dPre/2)/√(1 + (dPre/2)²). Thus, using the standardized mean difference in true pretest scores between the treatment and control groups (dPre), the asymptotic bias formula in Equation 9 can be re-expressed as follows:
$$\text{plim}\,\hat{\beta}'_1 - \beta_1 = (1-\lambda)\, d_{Pre}\, \rho_{YT|D}\, \sigma_{Y|D}\cdot\frac{1+(d_{Pre}/2)^2}{1+(1-\lambda)(d_{Pre}/2)^2} \qquad (10)$$
which is Equation B-7 in Online Appendix B, where dPre is the standardized mean difference in true pretest scores between the treatment and control groups, as just mentioned, ρYT|D is the within-group correlation between the true pretest score and the posttest score, and σY|D is the within-group standard deviation of the posttest. The quantities ρTD, β2, and σT no longer appear in Equation 10. The detailed derivation of Equation 10 is shown in Online Appendix B.
The two plots based on Equation 10, shown in Figure 1A and 1B, illustrate how the value of plim β̂′1 − β1 changes with different values of dPre and λ, when ρYT|D = 0.9 and σY|D = 1.
Figure 1.
Graphical Illustrations of the Theoretical Asymptotic Bias of the Treatment Effect Estimate.
The plot in the left panel (Figure 1A) demonstrates that the asymptotic bias is positive when dPre is positive and negative when dPre is negative. Within the range of −0.8 through 0.8 in dPre, the bias is almost linearly related to dPre, with a line passing through the origin; that is, the size of the bias is roughly proportional to the mean difference in true pretest scores, with the same sign. When the pretest score has perfect reliability (λ = 1), the line is flat with zero bias, indicating that with a perfectly reliable pretest there is no bias regardless of which group has the higher pretest mean or how large the difference is. As the reliability λ decreases from 1.0 to 0.7, that is, as the unreliability (1 − λ) grows, the slope becomes steeper, as can be seen from Equation 10.
The plot in the right panel (Figure 1B) shows the relation between the asymptotic bias and the reliability of the pretest score (λ) over the range 0.7 ≤ λ ≤ 1.0 for seven values of dPre: −0.8, −0.5, −0.2, 0, 0.2, 0.5, and 0.8. The horizontal axis is the reliability of the pretest score (λ) and the vertical axis is the asymptotic bias (plim β̂′1 − β1). The seven nearly straight lines converge to the single point (λ, bias) = (1.0, 0.0), which indicates that no matter how large the preexisting difference between groups (dPre) is, as long as the pretest score is perfectly reliable there is no asymptotic bias. The near-straightness of the lines means that the asymptotic bias is almost proportional to the unreliability (1 − λ), as can be seen from Equation 10; in fact, the factor {1 + (dPre/2)²}/{1 + (1 − λ)(dPre/2)²} stays close to 1, ranging from 1 at dPre = 0 to about 1.16 at |dPre| = 0.8, across all the combinations of λ and dPre stated before. The flat line at zero bias in the middle of the fan means that whenever dPre = 0, the results are asymptotically unbiased no matter how unreliable the pretest score is. Three lines lie above the zero-bias line, corresponding to dPre = 0.8, 0.5, and 0.2 from the top, and three lie below it, corresponding to dPre = −0.8, −0.5, and −0.2 from the bottom. The sign of the asymptotic bias is the same as the sign of dPre when ρYT|D > 0, which is typically the case. In other words, assuming a positive treatment effect (β1 > 0), the treatment effect is overestimated if the treatment group had the higher pretest mean and underestimated if the treatment group had the lower pretest mean. Furthermore, the bias is larger, the larger the magnitude of the difference in true pretest means (|dPre|).
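The following helper evaluates Equation 10 directly and reproduces the values read from Figure 1 (a sketch of ours, with the default arguments set to the ρYT|D = 0.9, σY|D = 1 scenario of the figure):

```python
def asymptotic_bias_eq10(d_pre, lam, rho_yt=0.9, sigma_y=1.0):
    """Asymptotic bias of the ANCOVA treatment effect (Equation 10)."""
    h = (d_pre / 2.0) ** 2
    return (1.0 - lam) * d_pre * rho_yt * sigma_y * (1.0 + h) / (1.0 + (1.0 - lam) * h)

print(f"{asymptotic_bias_eq10(0.5, 0.8):+.3f}")   # +0.094, the 'about 0.1' seen in Figure 1A
print(f"{asymptotic_bias_eq10(-0.5, 0.8):+.3f}")  # -0.094, sign flips with d_pre
print(f"{asymptotic_bias_eq10(0.8, 1.0):+.3f}")   # +0.000, perfect reliability -> no bias
```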
Simulation Study
To demonstrate that the formula for the asymptotic bias derived in the previous section is accurate, we conducted a simulation study covering various conditions that reflect realistic situations. Specifically, we generated data following the model in Equations 6 and 7 but conducted the ANCOVA analysis of Equation 8, which ignores the measurement error in the pretest, to obtain the estimate of the treatment effect. Replicating each condition 2,000 times, we computed the bias empirically as the average of the estimates minus the true parameter value. In addition, we compared the results from the ANCOVA regression model that ignores the measurement error in the pretest (Equation 8) with those from an SEM analysis that takes the measurement error into account (Equations 6 and 7). Given an externally provided reliability for the pretest score, we can use an SEM with a single indicator to estimate the parameters in Equations 6 and 7 by fixing the error variance of U in Equation 7. Such a model is presented as a path diagram in Figure 2.
Figure 2.
Path Diagram for the Structural Equation Model That Involves Measurement Error.
In the diagram, rectangles (Y, D, and X) are observed variables, and circles (ε, T, and U) are unobserved latent variables. The regression coefficients (β1 and β2) reflect the model in Equations 6 and 7. σ²T and σ²D are the variances of T and D, respectively, and ρTD is the correlation between T and D. The variance of U (σ²U) is computed from an externally provided λ value as (1 − λ)σ²X, as suggested in Little (2013, pp. 87–88); this formula follows from well-known classical test theory.
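Asymptotically, fixing Var(U) = (1 − λ)σ²X in the single-indicator SEM is equivalent to the classical errors-in-variables moment correction, in which the implied error variance is subtracted from Var(X) before solving the normal equations. The sketch below illustrates that correction (our own moment-based stand-in for the SEM, not the software the authors used; the function name is ours):

```python
import numpy as np

def corrected_ancova(y, d, x, lam):
    """Treatment-effect estimate with the covariate corrected for unreliability.

    Replaces Var(X) with lam * Var(X) (= the implied Var(T), per Equation 3)
    in the normal equations; Cov(D, X) and Cov(X, Y) are left alone because
    the random error U is uncorrelated with D and Y.
    """
    S = np.cov(np.vstack([d, x]))               # 2x2 covariance matrix of (D, X)
    S[1, 1] *= lam                              # substitute Var(T) for Var(X)
    c = np.array([np.cov(d, y)[0, 1], np.cov(x, y)[0, 1]])
    b1, b2 = np.linalg.solve(S, c)              # slopes for D and T
    return b1                                   # corrected treatment effect
```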
We evaluated the performance of each model in terms of bias and compared the simulation results with the analytically derived formula presented in the previous section. We also examined other evaluation criteria, including squared standard error, mean squared error (MSE), power, and coverage; results for these criteria are reported in the Supplemental Materials.
Simulation Design
We considered four simulation factors that may influence the estimates of the treatment effect: (a) the reliability of the pretest scores, (b) the standardized mean difference in true pretest scores between the treatment and control groups, (c) the sample size, and (d) the correlation between the posttest score and the true pretest score. The specifications are summarized in Table 1.
Table 1.
Summary of the Simulation Design.
| Factor | Levels |
|---|---|
| Reliability coefficient for pretest (λ) | 0.7, 0.8, 0.9, 1.0 |
| Standardized mean difference on true pretest scores between treatment and control group (dPre) | − 0.8, − 0.5, − 0.2, 0, 0.2, 0.5, 0.8 |
| Sample size per group (N/2) | 35, 70, 150 |
| Correlation between posttest score and true pretest score (ρ YT|D ) | 0.6, 0.75, 0.9 |
| Standardized mean true score change within each group (d) | Control group 0, treatment group 0.5 (fixed) |
First, we considered λ = 0.7, 0.8, 0.9, and 1.0. In the psychometric literature, λ = 0.7 is minimally acceptable as a reliable measure for research purposes, λ = 0.8 is good, λ = 0.9 is excellent, and λ = 1.0 is perfect reliability. Second, we considered scenarios for the standardized mean difference in true pretest scores between the treatment and control groups (dPre), defined as
$$d_{Pre} = \frac{\mu_1 - \mu_2}{\sigma_{T|D}} \qquad (11)$$
where μ1 is the true pretest score mean for the treatment group (group 1), μ2 is the true pretest score mean for the control group (group 2), and σT|D is the common within-group true pretest standard deviation (SD), with Var(T|D = 1) = Var(T|D = 0) = σ²T|D; we set σT|D = 1. Following Cohen's (1988) convention, dPre was set to three levels: 0.2 (small effect), 0.5 (medium effect), and 0.8 (large effect). Since the direction of the bias can differ depending on which group's mean is higher, we also considered the negative signs of these values. In addition, we considered the case of no difference in mean true pretest scores between groups (dPre = 0); a successfully implemented random assignment falls into this scenario.
Although the two factors above were our main interest, we also considered three sample sizes, since SEM is known to generally work well in large samples: 35, 70, and 150 per group, and thus 70, 140, and 300 in total, referred to as small, medium, and large. Finally, the last factor was the within-group correlation between the posttest score and the true pretest score (ρYT|D), since this correlation can affect the size of the treatment effect β1 in Equation 6.¹ Typically, this correlation is positive and relatively high for educational achievement tests, since high achievers at one time point tend to be high achievers at later time points as well. We considered three conditions: 0.60, 0.75, and 0.90.
Finally, as for the treatment effect, we specified it through the true mean score change rather than by setting the parameter β1 in Equation 6 directly, so as not to give the SEM an advantage by making it the data-generating model. Specifically, we set the standardized true-score mean change to d1 = 0.5 for the treatment group (group 1) and d2 = 0 for the control group (group 2), which corresponds to a medium treatment effect size according to Cohen's (1988) rule of thumb. To standardize the mean change, we assumed for simplicity that the variance does not change between pretest and posttest and is homogeneous across groups. This setup implies that the variance of the true test scores stays the same before and after the treatment for both groups, and that only the mean of the true test score increases for the treatment group while it stays the same for the control group. The symbols for the means and the within-group variances (hence standard deviations) and the assumptions about the standard deviations are summarized in Table 2. In the simulation, we set the value of this common standard deviation to 1, that is, σT|D = σY|D = τ = 1. Thus, we assumed d1 = (μY1 − μ1)/τ = 0.5 for the treatment group and d2 = (μY2 − μ2)/τ = 0 for the control group, where the standardized mean change from pretest to posttest is defined in Equation 12 below.
Table 2.
Summary of Assumptions for the Means and Variances of Pretest and Posttest.
| Group | Pretest true score T | Posttest score |
|---|---|---|
| Treatment group (group 1) | E(T|D=1) = μ1, Var(T|D=1) = σ²T|D | E(Y|D=1) = μY1, Var(Y|D=1) = σ²Y|D |
| Control group (group 2) | E(T|D=0) = μ2, Var(T|D=0) = σ²T|D | E(Y|D=0) = μY2, Var(Y|D=0) = σ²Y|D |
$$d_g = \frac{\mu_{Yg} - \mu_g}{\tau}, \quad g = 1, 2 \qquad (12)$$
As for the grouping dummy variable D, which takes the value D = 0 for the control group and D = 1 for the treatment group, it was assumed to follow a Bernoulli distribution with parameter π, that is, D ∼ Bernoulli(π), where π is the probability that D takes the value 1. Then the mean is
$$\mu_D = E(D) = \pi \qquad (13)$$
and the variance
$$\sigma^2_D = \text{Var}(D) = \pi(1-\pi) \qquad (14)$$
In our simulation, we considered a scenario that splits the sample in half. Then π = 1/2 and, therefore, μD = 1/2 and σD = 1/2.
Finally, we defined the following standardized mean difference between the two groups at the posttest stage:
$$d_{Post} = \frac{\mu_{Y1} - \mu_{Y2}}{\sigma_{Y|D}} \qquad (15)$$
which is analogous to the standardized true mean difference between groups for the pretest, dPre, defined in Equation 11.
Data Generation
Given the setup described in the previous section, we obtained the key quantities required for the simulation study, such as the means and variances of the pre- and posttest scores for each group. We fixed μ2 = 5.0, μY2 = 5.0, and σT|D = σY|D = τ = 1, while varying μ1 (with μY1 = μ1 + 0.5 so that d1 = 0.5) to produce the desired standardized true pretest mean differences (dPre). For example, if μ1 = 5.8, then μY1 = 6.3, dPre = (μ1 − μ2)/σT|D = (5.8 − 5.0)/1 = 0.8, the standardized true-score change from pretest to posttest was d1 = (μY1 − μ1)/τ = (6.3 − 5.8)/1 = 0.5 for group 1 (treatment) and d2 = (μY2 − μ2)/τ = (5 − 5)/1 = 0 for group 2 (control), and the standardized (true) posttest mean difference between groups was dPost = (6.3 − 5.0)/1 = 1.3.
Finally, the regression coefficients in Equation 6 were obtained by
$$\beta_1 = \frac{\rho_{YD} - \rho_{YT}\,\rho_{TD}}{1 - \rho^2_{TD}}\cdot\frac{\sigma_Y}{\sigma_D} \qquad (16)$$
and
$$\beta_2 = \frac{\rho_{YT} - \rho_{YD}\,\rho_{TD}}{1 - \rho^2_{TD}}\cdot\frac{\sigma_Y}{\sigma_T} \qquad (17)$$
where ρTD is the correlation between T and D, ρYD is the correlation between Y and D, ρYT is the correlation between Y and T, and σY, σD, and σT are the standard deviations of Y, D, and T, respectively. These quantities were derived as functions of the quantities specified for the simulation study. The derivations of Equations 16 and 17 and of their required inputs are presented in Online Appendix A (Part 1). Note that β1 is the true treatment effect, which is the parameter of interest.
Data generation was replicated 2,000 times for each condition, and the treatment effect was estimated by the two methods (SEM and ANCOVA) for each replication.
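To make the data generation concrete, the sketch below runs one cell of the design as we read it (parameter choices from Table 1; the variable names and the direct conditional-mean generation of Y are ours). The empirical bias of the naive ANCOVA should approach the value given by Equation 10, about 0.09 for this cell:

```python
import numpy as np

rng = np.random.default_rng(2022)

# One cell of the design: d_pre = 0.5, lambda = 0.8, rho_YT|D = 0.9, n = 150/group
n, d_pre, lam, rho = 150, 0.5, 0.8, 0.9
mu2, muY2 = 5.0, 5.0                         # control-group pre/post true means
mu1, muY1 = mu2 + d_pre, mu2 + d_pre + 0.5   # treatment gains d1 = 0.5 (Equation 12)
beta1 = (muY1 - muY2) - rho * (mu1 - mu2)    # implied conditional treatment effect

# Reliability is defined on the whole sample, so Var(T) includes the
# between-group part (d_pre / 2)^2 (an assumption of our reading of the design)
var_U = (1.0 + (d_pre / 2.0) ** 2) * (1.0 - lam) / lam

est = []
for _ in range(2000):
    D = np.repeat([0.0, 1.0], n)
    T = rng.normal(np.where(D == 1, mu1, mu2), 1.0)            # true pretest
    Y = (np.where(D == 1, muY1, muY2)                          # posttest, built so that
         + rho * (T - np.where(D == 1, mu1, mu2))              # corr(Y, T | D) = rho and
         + rng.normal(0.0, np.sqrt(1.0 - rho**2), 2 * n))      # Var(Y | D) = 1
    X = T + rng.normal(0.0, np.sqrt(var_U), 2 * n)             # fallible pretest
    Z = np.column_stack([np.ones(2 * n), D, X])                # naive ANCOVA design
    est.append(np.linalg.lstsq(Z, Y, rcond=None)[0][1])        # coefficient on D
print(f"empirical bias: {np.mean(est) - beta1:.3f}")           # close to 0.09
```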
Results
We present results only for the case in which the correlation between posttest and pretest was 0.9 (ρYT|D = 0.9), since the other two levels of this factor (ρYT|D = 0.75 and 0.6) exhibited nearly identical patterns on all evaluation criteria. The ρYT|D = 0.9 case exhibited the largest differences in the results across the combinations of the other factors. For example, the slope of the bias as a function of the standardized mean difference in true pretest scores between the treatment and control groups (dPre) became shallower as ρYT|D decreased; similarly, the spread of the bias across levels of dPre shrank as ρYT|D decreased.
Figure 3A summarizes the simulation results for the bias in the regression coefficient of the treatment-group dummy variable D (β1 in Equation 6 and β′1 in Equation 8, respectively), which represents the treatment effect in the correctly specified model of Equation 6.
Figure 3.
(a) Bias in the Treatment Effect: ρYT|D = 0.9. (b) Bias in Treatment Effect: ρYT|D = 0.9 and N/2 = 35 (Reliability in the Horizontal Axis).
The upper panel, labeled Model.Reg, shows the bias for the ANCOVA regression model, and the lower panel, labeled Model.SEM, shows the bias for the SEM. The values 35, 70, and 150 appearing in the three columns of graphs indicate the sample size per group. The vertical axis of each of the six graphs is the bias in the estimate of β1, and the horizontal axis is the standardized mean difference between groups in true pretest scores (dPre). Finally, the four lines in each graph show the relationship between dPre and the bias for the four pretest reliability coefficients (λ = 0.7, 0.8, 0.9, and 1.0). From this figure, we observe the following four characteristics.
First, when the SEM is used for estimation, there is no bias regardless of the pretest reliability, the true-score mean difference between groups, or the sample size. This is reflected in the nearly flat lines at Bias = 0 in the three bottom panels of Figure 3A.
Second, when the pretest reliability is 1, that is, when the covariate is measured without error, the ANCOVA regression estimates the treatment effect without bias. This can be seen in the three upper panels, where the line for reliability = 1 is flat at Bias = 0.
Third, when there is no group difference in true pretest means (dPre = 0), all the lines pass through the origin (see the three upper panels of Figure 3A). This indicates that with no group mean difference in the pretest, measurement error in the pretest does not bias the estimate of the treatment effect.
To visualize the third observation more explicitly, Figure 3B presents the same results with pretest reliability on the horizontal axis for the case of 35 per group. The left-hand panel is the ANCOVA regression and the right-hand panel is the SEM. Focusing on the lines for dPre = 0, we see that the lines for both models are almost flat at Bias = 0. This implies that if there is no group mean difference (dPre = 0), that is, if random assignment is working, there is no bias in the treatment effect even when the ANCOVA regression uses an error-laden pretest as the covariate, for any reliability coefficient within the range of the study design (λ = 0.7 ~ 1.0).
Fourth, the three graphs in the upper panel of Figure 3A show positive bias when dPre > 0 and negative bias when dPre < 0, and the size of the bias grows almost linearly with the absolute value of dPre in the range −0.8 ≤ dPre ≤ 0.8. Evaluated as relative bias, defined as 100 × bias/β1 (%), the maximum relative bias was 36.1%, attained at a bias of 0.22 for reliability λ = 0.7 (top left in Model.Reg in Figure 3B), and the largest negative relative bias in absolute value was −39.8%, attained at −0.232 for λ = 0.7 (bottom left in Model.Reg in Figure 3B). Relative bias of more than 30% in both directions seems substantial.
Now we compare the theoretically derived bias figures with those created empirically from the simulation results. First, a comparison of Figure 1A with any of the upper panels of Figure 3A shows that not only are the patterns of the graphs the same, but the values nearly match. For example, for λ = 0.8 the asymptotic bias in Figure 1A is about 0.1 at dPre = 0.5, and the same value can be seen in Figure 3A. A comparison of Figure 1B with the left panel of Figure 3B also shows a close match: for dPre = 0.8 and λ = 0.7, the asymptotic bias is about 0.22 in Figure 1B, and the corresponding empirical value is slightly above 0.2 in the left panel of Figure 3B.
Discussion and Conclusion
The major purpose of the current study was to examine the direction and size of the bias that appears when there is a preexisting difference between the treatment and control groups and measurement error in the pretest, but the analysis is conducted with an ANCOVA that ignores the measurement error in a two-group pretest–posttest design. We first derived an analytical closed-form formula for the asymptotic bias and then confirmed its accuracy with a simulation study. Furthermore, the simulation compared the performance of two models: the ANCOVA and an SEM that takes into account the measurement error in the pretest score. The major findings of the analytic derivation and the simulation study are summarized below.
First, we found that when ANCOVA is used as the analytical model, there is bias whenever there is measurement error in the pretest and a preexisting difference in pretest means between the treatment and control groups. However, we also found two scenarios that produce no bias even when ANCOVA is used: when the reliability of the covariate (pretest) is perfect, and when there is no preexisting group difference in the pretest (dPre = 0). The first scenario confirms the frequently mentioned statement that multiple regression produces unbiased estimates when predictors are measured without error, one of the assumptions of multiple linear regression (Cohen et al., 2003, p. 119), along with the tenability of the other standard regression assumptions. The second scenario encourages us to recommend random assignment in evaluation studies: if random assignment is properly done and working, we need not worry much about pretest reliability for the consistency and unbiasedness of the treatment effect. Note also that when the SEM is used as the analytical model, there is no bias in the estimated treatment effect regardless of the pretest reliability and regardless of any preexisting group difference in the pretest (dPre) (recall Figure 3A and 3B). This is one of the major benefits of using the SEM.
Regarding the direction of the bias in the estimated treatment effect, one of the major goals of the present study, we found a simple but useful rule. Assuming the treatment effect (β1 in Equation 6) is positive, when the pretest group difference is positive (dPre > 0), ANCOVA exhibits positive bias: the treatment effect is overestimated when measures are unreliable. On the other hand, when the pretest group difference is negative (dPre < 0), ANCOVA exhibits negative bias: the treatment effect is underestimated when pretest measures are unreliable. Regarding the magnitude of the bias, we found that the lower the pretest reliability (λ) and the larger the standardized pretest mean difference (dPre), the larger the bias in the estimated treatment effect, consistent with findings in previous studies (e.g., Culpepper & Aguinis, 2011).
It should be emphasized, though, that the key point here is the direction of the bias, represented by its sign. When the treatment effect is positive (β1 > 0), a positive mean difference leads to positive bias and a negative mean difference leads to negative bias. In the former scenario (dPre > 0), the resulting bias can make treatments look effective even when they have no effect at all, which is the case Marsh (1998) demonstrated; in the latter scenario (dPre < 0), a positive treatment may look "harmful," as Campbell and Erlebacher (1970) and Campbell and Boruch (1975) cautioned.
As mentioned in the previous paragraph, under a well-executed RCT (i.e., baseline equivalence is established, so dPre = 0), measurement error/unreliability does not bias the ANCOVA estimate of the treatment effect. Although not reported here, the simpler ANCOVA model can produce results that are more efficient (smaller SE), more accurate (smaller MSE), and more powerful (higher probability of detecting a treatment effect) than the more complex SEM in certain situations, such as when dPre is close to 0. Results on these criteria are available in the Supplemental Materials.
Practical Implications
Based on the findings of this paper, researchers should critically evaluate past and future program impact findings in light of the bias possibly created by unreliable measurement of pretest scores when some degree of preexisting group difference may exist. This is especially relevant for quasi-experimental design (QED) studies, where baseline equivalence of the groups is difficult to establish, even though QED data are often analyzed with matching techniques (e.g., propensity score matching) and ANCOVA to reduce the bias in program impacts induced by unbalanced covariates. As an example, program effect findings from pretest–posttest QED data on the Head Start program could be critically reevaluated in light of our findings. The program is compensatory preschool education for socioeconomically disadvantaged children. Prompted by the Head Start evaluation report by Cicirelli et al. (1969), which indicated a negative program impact, Campbell and Erlebacher (1970) showed how this result could be an artifact of matching on observed covariates, such as socioeconomic status, that contain measurement error. That is, the measurement error in the covariates might have made Head Start look more harmful than it actually was (see also Campbell & Boruch, 1975). Their insights are supported by our simulation study (see Figure 3A) and by our theoretical derivation (see Equation 10): the plot based on Equation 10 clearly shows negative bias when dPre < 0 (see Figure 1A).
The opposite case is a study that finds a positive program effect even though the effect is unlikely to exist, as demonstrated by a simulation study mimicking a gifted education program (Marsh, 1998). Marsh suspected that a meta-analysis of the effects of programs for gifted and talented (G&T) children could have been overestimated. His simulation mimicked a typical G&T program evaluation that matched on observed pretest scores. The results revealed that matching on observed pretest scores containing measurement error created artificial positive treatment effects, the same phenomenon that occurred in our simulation study and that is expected from the theoretical bias formula (see Equation 10); the positive bias is clearly visible in Figure 1A for the case of dPre > 0.
The findings of this study should also help researchers better design studies, choose appropriate statistical models, and interpret program effect findings in terms of how pretest measurement reliability introduces bias. A rather lenient threshold for measurement reliability is currently used in many program evaluation areas, and the consequences have not received due attention. The What Works Clearinghouse standards, for example, state that internal consistency statistics, such as Cronbach's alpha, must not be lower than .50 (What Works Clearinghouse, 2020, p. 83). This relatively liberal threshold is problematic because it can substantially amplify the bias in the treatment effect when baseline equivalence of the groups on the pretest measure is not established. If a certain level of baseline equivalence is established, the bias is kept to a minimum; if not, substantial bias may enter the estimate of program impact, and researchers should consider remedies to minimize its size.
At the study design stage, researchers may already know the reliability of the instruments they plan to use. If high reliability is not expected, researchers should consider using a randomized controlled trial (RCT), or matching in a QED study, to minimize pretest group differences, in addition to trying to improve reliability by revising the instruments. That is, concern about pretest measure reliability should be part of the discussion when choosing an appropriate study design. As our derivation and simulation results show, an RCT can better tolerate poor reliability of pretest measures.
Analytical models may also be chosen with pretest measurement reliability in mind. Researchers may consider a model that takes measurement error into account, such as SEM, which in our simulation effectively eliminated the bias associated with imprecise pretest measurement. Unfortunately, statistical models that assume no measurement error on the predictor side of the equation, such as multilevel modeling (MLM), dominate educational evaluation, which may introduce non-ignorable bias in the estimated treatment effect when predictors contain measurement error. In such scenarios, multilevel SEM, which accounts for measurement error in predictors, should be utilized. Of course, simple models such as ANCOVA may be better choices when reliability is sufficiently high and baseline group differences are small, as our simulation results indicate (see Supplemental Materials).
Our findings also help researchers understand the relationship between the expected direction of the bias in the estimated treatment effect and the direction of preexisting group differences in pretest scores. Understanding this relationship is important and helpful when confronting unexpected or puzzling results. When the pretest measure is unreliable, the treatment group's advantage in the outcome at pretest carries over into the direction of the bias in the impact estimate; in other words, the group with the higher pretest mean tends to end up with a higher observed posttest outcome than the actual treatment effect predicts. Importantly, the opposite scenario is common in real-world interventions targeting students at risk. In that case, as we demonstrated in our derivation of the bias and its simulation results, a positive treatment effect can be attenuated to statistical nonsignificance or even appear to go in the opposite direction. If insufficient attention is paid to this fact, researchers may struggle to explain why a program impact is not as strong as expected. Our findings suggest that the apparent lack of program impact may be explained by the treatment group's pretest mean being lower than the control group's, combined with imperfect pretest reliability. Thus, when interpreting unexpected results, such as nonsignificant, negligible, or even negative program impacts, researchers should take the direction and size of the initial pretest group differences into consideration, in addition to the unreliability of the pretest.
Supplemental Material
Supplemental material, sj-docx-1-epm-10.1177_00131644211068801 for Bias for Treatment Effect by Measurement Error in Pretest in ANCOVA Analysis by Yasuo Miyazaki, Akihito Kamata, Kazuaki Uekawa and Yizhi Sun in Educational and Psychological Measurement
Footnotes
1. This point becomes clearer later from Equation 16 and Equation A-19 in Part 3 of Online Appendix A.
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iDs: Yasuo Miyazaki https://orcid.org/0000-0001-8781-387X; Akihito Kamata https://orcid.org/0000-0001-9570-1464
Supplemental Material: Supplemental material for this article is available online.
References
- Battauz, M., Bellio, R., & Gori, E. (2011). Covariate measurement error adjustment for multilevel models with application to educational data. Journal of Educational and Behavioral Statistics, 36(3), 283–306.
- Battistin, E., & Chesher, A. (2014). Treatment effect estimation with covariate measurement error. Journal of Econometrics, 178(2), 707–715.
- Campbell, D. T., & Boruch, R. F. (1975). Making the case for randomized assignment to treatments by considering the alternatives: Six ways in which quasi-experimental evaluations in compensatory education tend to underestimate effects. In C. A. Bennett & A. A. Lumsdaine (Eds.), Evaluation and experiment: Some critical issues in assessing social programs (pp. 195–296). Academic Press.
- Campbell, D. T., & Erlebacher, A. (1970). How regression artifacts in quasi-experimental evaluations can mistakenly make compensatory education look harmful. In J. Hellmuth (Ed.), The disadvantaged child: Vol. 3. Compensatory education: A national debate (pp. 185–210). Brunner/Mazel.
- Carroll, R. J., Ruppert, D., Stefanski, L. A., & Crainiceanu, C. M. (2006). Measurement error in nonlinear models: A modern perspective (2nd ed.). Chapman and Hall/CRC.
- Cicirelli, V. G., Evans, J. W., & Schiller, J. S. (1969). The impact of Head Start: An evaluation of the effects of Head Start on children's cognitive and affective development (Vols. 1–2). Ohio University and Westinghouse Learning Corporation.
- Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Lawrence Erlbaum.
- Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Lawrence Erlbaum.
- Cook, J. R., & Stefanski, L. A. (1994). Simulation-extrapolation estimation in parametric measurement error models. Journal of the American Statistical Association, 89, 1314–1328.
- Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Harcourt Brace Jovanovich.
- Culpepper, S. A., & Aguinis, H. (2011). Using analysis of covariance (ANCOVA) with fallible covariates. Psychological Methods, 16(2), 166–178.
- Darlington, R. B. (1990). Regression and linear models. McGraw-Hill.
- Devanarayan, V., & Stefanski, L. A. (2002). Empirical simulation extrapolation for measurement error models with replicate measurements. Statistics & Probability Letters, 59(3), 219–225.
- Egbewale, B. E., Lewis, M., & Sim, J. (2014). Bias, precision and statistical power of analysis of covariance in the analysis of randomized trials with baseline imbalance: A simulation study. BMC Medical Research Methodology, 14(1), 1–12.
- Fuller, W. A. (1987). Measurement error models. Wiley.
- Greene, W. H. (2008). Econometric analysis (6th ed.). Pearson.
- Hong, H., Aaby, D. A., Siddique, J., & Stuart, E. A. (2019). Propensity score–based estimators with multiple error-prone covariates. American Journal of Epidemiology, 188(1), 222–230.
- Kmenta, J. (1986). Elements of econometrics (2nd ed.). University of Michigan Press.
- Little, T. D. (2013). Longitudinal structural equation modeling. Guilford Press.
- Lockwood, J. R., & McCaffrey, D. F. (2014). Correcting for test score measurement error in ANCOVA models for estimating treatment effects. Journal of Educational and Behavioral Statistics, 39(1), 22–52.
- Marsh, H. W. (1998). Simulation study of nonequivalent group-matching and regression-discontinuity designs: Evaluations of gifted and talented programs. The Journal of Experimental Education, 66(2), 163–192.
- Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). McGraw-Hill.
- Office of Head Start. (2020). About the Office of Head Start. U.S. Department of Health & Human Services. https://www.acf.hhs.gov/ohs/about
- Rabe-Hesketh, S., Pickles, A., & Skrondal, A. (2003). Correcting for covariate measurement error in logistic regression using nonparametric maximum likelihood estimation. Statistical Modelling, 3(3), 215–232.
- Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2003). Maximum likelihood estimation of generalized linear models with covariate measurement error. The Stata Journal, 3(4), 386–411.
- Schafer, D. W., & Purdy, K. G. (1996). Likelihood analysis for errors-in-variables regression with replicate measurements. Biometrika, 83(4), 813–824.
- Sengewald, M., Steiner, P. M., & Pohl, S. (2019). When does measurement error in covariates impact causal effect estimates? Analytic derivations of different scenarios and an empirical illustration. British Journal of Mathematical and Statistical Psychology, 72(2), 244–270.
- What Works Clearinghouse. (2020). What Works Clearinghouse standards handbook, Version 4.1. Institute of Education Sciences. https://ies.ed.gov/ncee/wwc/Docs/referenceresources/WWC-Standards-Handbook-v4-1-508.pdf
- Wooldridge, J. M. (2020). Introductory econometrics: A modern approach (7th ed.). Cengage Learning.