Educational and Psychological Measurement
2017 Jun 6;78(1):70–79. doi: 10.1177/0013164417713571

Modifying Spearman’s Attenuation Equation to Yield Partial Corrections for Measurement Error—With Application to Sample Size Calculations

W. Alan Nicewander
PMCID: PMC5965627  PMID: 29795947

Abstract

Spearman’s correction for attenuation (measurement error) adjusts a correlation coefficient for measurement errors in either or both of two variables and follows from the assumptions of classical test theory. Spearman’s equation removes all measurement error from a correlation coefficient, which amounts to increasing the reliability of either or both variables to 1.0. In this inquiry, Spearman’s correction is modified to allow partial removal of measurement error from either or both of two variables being correlated. The practical utility of this partial correction is demonstrated by using it to compare two routes to increasing the power of a statistical test: increasing sample size versus increasing the reliability of the dependent variable for an experiment. Other applied uses are mentioned.

Keywords: reliability, Spearman correction, classical test theory, statistical power

Introduction

One of psychology’s earliest contributions to statistics is the Spearman (1904) correction for attenuation, which adjusts a correlation coefficient for measurement errors in either or both of the variables. The correction follows from the assumptions of classical test theory,

ρ(τx, τy) = ρ(x, y) / √[ρ(x,x′) ρ(y,y′)]   or   ρ²(τx, τy) = ρ²(x, y) / [ρ(x,x′) ρ(y,y′)],   (1)

where τx and τy represent the true values of X and Y, free of measurement errors, and εx and εy are measurement errors assumed to be uncorrelated with τx, τy, and other observed variables (see Lord & Novick, 1968). The quantities

ρ(x,x′) = σ²(τx) / [σ²(τx) + σ²(εx)] = σ²(τx) / σ²(x)

and

ρ(y,y′) = σ²(τy) / [σ²(τy) + σ²(εy)] = σ²(τy) / σ²(y)   (2)

are the reliability coefficients for X and Y.
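The attenuation relationship above is easy to verify numerically by simulating classical-test-theory data. The sketch below (Python, not from the article; the reliabilities of .60 and .70 and the true-score correlation of .50 are illustrative values chosen here) generates observed scores as true score plus uncorrelated error, then recovers the true-score correlation from the attenuated observed correlation via Spearman’s equation.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000  # large sample so sampling error is negligible

# True scores tau_x and tau_y, correlated .50 (illustrative value)
rho_true = 0.50
tau_x = rng.standard_normal(n)
tau_y = rho_true * tau_x + np.sqrt(1 - rho_true**2) * rng.standard_normal(n)

# Add uncorrelated errors sized so that reliability = var(true)/var(observed)
rel_x, rel_y = 0.60, 0.70
x = tau_x + np.sqrt((1 - rel_x) / rel_x) * rng.standard_normal(n)
y = tau_y + np.sqrt((1 - rel_y) / rel_y) * rng.standard_normal(n)

r_obs = np.corrcoef(x, y)[0, 1]               # attenuated: ~ .50 * sqrt(.60 * .70)
r_corrected = r_obs / np.sqrt(rel_x * rel_y)  # Spearman correction: ~ .50
```

With 200,000 cases, the corrected correlation lands within sampling error of the generating true-score correlation of .50, while the raw observed correlation is attenuated to roughly .50√(.60 × .70) ≈ .32.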

Derivation of a Partial Correction for Measurement Error

Equation (1) answers the question, “If X and Y had perfect reliabilities equal to 1.00, what would the correlation between X and Y be?” However, suppose one has two measures X and Y with unsatisfactory reliabilities; then a relevant question might be, “If X could be improved to have reliability equal to .90, and Y improved to have reliability equal to .80, what would the correlation between X and Y be?” Questions about the degree to which correlations can be improved by increasing the reliabilities of the X and Y variables above their current values (but below 1.00) can be answered using the modified Spearman correction derived below. This partial correction for measurement error is quite simple, and it is somewhat surprising that it has not been derived before; however, the author’s search of the literature did not turn up any previous derivation of this equation.

Let Xc and Yc be unobservable variables for which

ρ(x,x′) ≤ ρ(xc,xc′) ≤ 1,   ρ(y,y′) ≤ ρ(yc,yc′) ≤ 1,   (3)

and

ρ(τxc, τx) = ρ(τyc, τy) = 1.   (4)

In other words, Xc and Yc are variables having higher reliabilities than the original variables, X and Y, and the true scores for Xc and X correlate perfectly, as do the true scores for Yc and Y. Now, embedding ρ²(xc, yc) in Spearman’s equation in order to obtain a solution for this correlation,

ρ²(τxc, τyc) = ρ²(xc, yc) / [ρ(xc,xc′) ρ(yc,yc′)].   (5)

In (5), we seek to solve for ρ²(xc, yc); at first this appears impossible because (5) contains two unknowns, ρ²(xc, yc) and ρ²(τxc, τyc). Using the assumption in (4), however, we can substitute ρ²(τxc, τyc) = ρ²(τx, τy) in (5) and solve for ρ²(xc, yc), viz.,

ρ²(τxc, τyc) = ρ²(τx, τy) = ρ²(xc, yc) / [ρ(xc,xc′) ρ(yc,yc′)], or

ρ²(xc, yc) = ρ²(τx, τy) [ρ(xc,xc′) ρ(yc,yc′)] = {ρ²(x, y) / [ρ(x,x′) ρ(y,y′)]} [ρ(xc,xc′) ρ(yc,yc′)].   (6)

For example, assume that variables X and Y have ρ²(x, y) = .10 and reliabilities of .60 and .70, respectively: (a) if all measurement errors were removed from X and Y, the squared correlation would be ρ²(τx, τy) = .10/[(.6)(.7)] = .238; (b) if all measurement errors in X alone were eliminated, the squared correlation between τx and Y would be ρ²(τx, y) = .10/.6 = .167. In practice, these values can never be attained because X and Y cannot be modified to have perfect reliabilities; but what would the squared correlation be if the reliabilities of X and Y were increased to .80 and .90, respectively? Using (6), the modified Spearman correction is

ρ²(xc, yc) = ρ²(τx, τy) [ρ(xc,xc′) ρ(yc,yc′)] = .238 × (.8)(.9) = .171.

Thus, a squared correlation of .10 becomes .238 under a complete correction for unreliability in both variables, and .171 under the partial correction.
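The complete and partial corrections can be wrapped in a small helper function. The sketch below (Python; the function name is this writer’s invention, not the article’s) reproduces the worked example above.

```python
def spearman_partial(r2_xy, rel_x, rel_y, rel_xc=1.0, rel_yc=1.0):
    """Partial Spearman correction for attenuation, Equation (6).

    r2_xy          : observed squared correlation between X and Y
    rel_x, rel_y   : current reliabilities of X and Y
    rel_xc, rel_yc : target (improved) reliabilities; the defaults of 1.0
                     give the complete Spearman correction, Equation (1)
    """
    r2_true = r2_xy / (rel_x * rel_y)   # remove all measurement error
    return r2_true * (rel_xc * rel_yc)  # scale back down to the target reliabilities

# Worked example: r2 = .10, reliabilities .60 and .70
r2_full = spearman_partial(0.10, 0.60, 0.70)                  # complete correction, ~.238
r2_partial = spearman_partial(0.10, 0.60, 0.70, 0.80, 0.90)   # partial correction, ~.171
```

Setting one target reliability to 1.0 and leaving the other at its current value gives the single-variable correction of the example (e.g., `spearman_partial(0.10, 0.60, 0.70, 1.0, 0.70)` ≈ .167).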

An Application of the Partial Spearman Correction: Increasing the Power of Statistical Tests for Linear Statistical Models

It is now demonstrated how the modified Spearman attenuation correction may be used to determine the decrease in sample size produced by increasing the reliability of the dependent variable (DV) measure. It has long been known that measurement error in an experiment’s DV decreases the power of statistical tests—or, conversely, that increasing the reliability of a DV can increase power and reduce the sample size required to achieve a specified degree of statistical power. In one of the earliest studies of measurement error, Sutcliffe (1958) showed that measurement error in a DV does not bias estimates of effects but does decrease the power of statistical tests. Cleary and Linn (1969) and Marcoulides (1993) discussed the role of DV reliability in maximizing power subject to constraints on sample size. Maxwell (1980) demonstrated that increasing reliability increases the magnitude of the noncentrality parameter for an F-test—and thereby increases statistical power. Maxwell also noted that, in cases where the DV is an educational or psychological test, lengthening the measure will increase DV reliability and hence statistical power. The approach taken here is apparently unique in that it blends classical test theory (i.e., Spearman’s attenuation equation) with linear statistical models in order to examine reliability–power issues. Specifically, F-ratios are expressed as functions of squared multiple correlations; these multiple correlations are then subjected to total or partial correction for DV unreliability using Spearman’s equation and the modification of it proposed here.

In what follows, examples are used to illustrate how increasing DV reliability can reduce the sample size required for attaining specified degrees of statistical power of an F-test. Examples are presented in the context of one-way ANOVA models. Suppose an F-ratio is being used in an ANOVA with J treatment conditions and n observations per treatment (for a total sample size of N = Jn). The simple ANOVA model fitted to the data may be written as

Y = β0 + Σ_{j=1}^{J} βjXj + ε,   (7)

where, in this context, the Xs are (1, 0) marker variables for presence or absence of treatments, and the βjs are treatment effects. The null hypothesis for testing that the treatment effects are all zero is

H0: β1 = β2 = ⋯ = βJ = 0.   (8)

The observed F-ratio for a test of this hypothesis is

F = [R²xy / (J − 1)] / [(1 − R²xy) / (N − J)] = [R²xy / (1 − R²xy)] [(N − J) / (J − 1)],   (9)

where R²xy is the squared multiple correlation (for several X variables and a single Y) associated with the model above; the numerator and denominator degrees of freedom are df_num = J − 1 and df_den = N − J. In order to correct the F-ratio for measurement error in Y, all that needs to be done is to correct R²xy for measurement error in Y using Spearman’s correction in (1), or the partial modification of it, viz.,

F_adjusted = (R²xy/ρy) / [1 − (R²xy/ρy)] × [(N − J) / (J − 1)] = [R²xy / (ρy − R²xy)] [(N − J) / (J − 1)],   (10)

where ρy is a reliability coefficient associated with the variable, Y:

ρy = 1, if no correction for the reliability of Y is wanted;
ρy = ρ(y,y′), if a complete Spearman correction is wanted;
ρy = ρ(yc,yc′), if a partial Spearman adjustment is wanted.

(Note: Nicewander (1975)—see also Guttman (1945)—showed that ρ(y,y′) ≥ R²y·x, and hence, in theory, the F-ratio in (10) can never be negative.) The partial Spearman correction can be used in statistical power calculations wherein the effects of DV reliability are evaluated in terms of the sample size needed to achieve a certain degree of power. In order to perform power calculations for varying values of sample size, n (per treatment group), and reliability adjustments, one needs the following:

  1. An effect size—Cohen’s f² = R²xy/(1 − R²xy)—was used here (Cohen, 1988). A value of DV reliability—ρ(y,y′), ρ(yc,yc′), or 1.0—is specified, and values of J and N = Jn are chosen.

  2. An F-ratio—adjusted for some specified value of DV reliability—is then computed using the values above,

    F_adj. = [R² / (ρy − R²)] [(N − J) / (J − 1)]
  3. A critical value of F—for testing the null hypothesis at some level, α, of Type I error—is determined:

    F_critical = F_{α; df_num, df_den}
  4. An estimate of the noncentrality parameter, λ—adjusted for the specified reliability, ρy—is computed in order to specify the appropriate noncentral F-distribution for computing power,

    λ_adj. ≈ (J − 1) F_adj.

    (See Liu & Raudenbush, 2004.)

  5. Finally, a value of power is computed, viz.,

    Power = 1 − ∫₀^{F_critical} φ(u | df_num, df_den, λ_adj.) du,

    where

    φ(u | df_num, df_den, λ_adj.)

    is a noncentral pdf for F.

In the examples below, the sample size, n, producing power = .80 was calculated with a do-loop (do while (power <= .80)) in SAS® that encompassed Steps 1 through 5, using the SAS functions FINV (to compute critical values of F) and PROBF (to compute power).
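The SAS code is not reproduced in the article; the following is a rough Python equivalent of Steps 1 through 5 (scipy’s central and noncentral F distributions play the roles of FINV and PROBF; the function name and stopping rule are this writer’s assumptions).

```python
from scipy.stats import f as f_dist, ncf

def n_for_power(J, R2, rho_y, power_target=0.80, alpha=0.05):
    """Smallest per-group n giving power >= power_target for the
    reliability-adjusted F-test of Equation (10)."""
    n = 2
    while True:
        N = J * n
        df_num, df_den = J - 1, N - J
        # Step 2: reliability-adjusted F-ratio, Equation (10)
        F_adj = (R2 / (rho_y - R2)) * (df_den / df_num)
        # Step 3: critical value of the central F at level alpha
        F_crit = f_dist.ppf(1 - alpha, df_num, df_den)
        # Step 4: noncentrality parameter, lambda ~ (J - 1) * F_adj
        lam = df_num * F_adj
        # Step 5: power under the noncentral F distribution
        power = 1 - ncf.cdf(F_crit, df_num, df_den, lam)
        if power >= power_target:
            return n
        n += 1

# Two-treatment case from Table 1: R2 = .10, original reliability .60
n_uncorrected = n_for_power(J=2, R2=0.10, rho_y=1.0)   # no correction
n_complete = n_for_power(J=2, R2=0.10, rho_y=0.60)     # complete correction
```

Passing ρy = 1 reproduces the unadjusted calculation, ρy = ρ(y,y′) the complete correction, and any intermediate ρ(yc,yc′) a partial correction, mirroring the three cases defined for Equation (10).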

Examples of Using the Complete and Partial Spearman Adjustments in Power Calculations

In the following tables, there are either two or four treatment conditions. The “original” DV, Y, has reliability equal to .60, and the effect on power of partially adjusting reliability to .70, .75, .80, .85, and .90 was examined—as well as the cases of a complete (Spearman) adjustment of reliability to 1.00 and of no correction whatsoever. The original reliability was set to .60 principally because, in psychometrics, .60 is generally considered low, and one might consider revamping such a measure in order to decrease measurement error; .70 is often considered minimally acceptable. The value of Cohen’s effect size used here was .11 (corresponding to an R² of .10), which by convention is considered a small effect (Cohen, 1988). The sample size (per treatment condition) needed to achieve power equal to .80, n(.8), at an α level of .05 is noted. Also noted is the percentage reduction in n(.8) as the reliability of Y is increased from .60 to .70, .75, .80, .85, .90, and 1.00—a measure of increased power expressed as savings in sample size.

Results

Two-Treatment Experiment

For this case, the effect size was small (f² = .11; R² = .10), and the reliability of the DV was .60. The first section of Table 1 summarizes the results for the analysis of the two-treatment design. The sample size, n(.8), was equal to 38 per treatment condition with no reliability adjustment, whereas a perfectly reliable Y required far fewer observations, n(.8) = 22 per treatment—a decrease in sample size of 42%. If the reliabilities of Y were .90, .85, .80, .75, or .70, the sample size calculations yielded n(.8) values of 24, 26, 28, 30, and 32, respectively, with corresponding reductions in n(.8) of 37%, 32%, 26%, 21%, and 16% compared with no reliability correction in Y.

Table 1.

Sample Size, per Group, n(.8), Necessary to Attain Power = .80 for Various Levels of Reliability Correction for a Small Effect Size, f² = .11

Small effect (f² = .11); R² = .10; ρ(y,y′) = .60

Number of groups (J)   Reliability correction in Y   n(.8), α = .05   Reduction in n(.8)
2                      No correction                 38               —
2                      Complete correction           22               42%
2                      ρ(yc,yc′) = .90               24               37%
2                      ρ(yc,yc′) = .85               26               32%
2                      ρ(yc,yc′) = .80               28               26%
2                      ρ(yc,yc′) = .75               30               21%
2                      ρ(yc,yc′) = .70               32               16%
4                      No correction                 27               —
4                      Complete correction           16               41%
4                      ρ(yc,yc′) = .90               18               33%
4                      ρ(yc,yc′) = .85               19               30%
4                      ρ(yc,yc′) = .80               20               26%
4                      ρ(yc,yc′) = .75               22               18%
4                      ρ(yc,yc′) = .70               23               15%

Four-Treatment Experiment

In the second section of Table 1, the results for the statistical analysis of this experimental design were very similar to those from the two-treatment results. With reliability = .60, n(.8) was equal to 27/treatment with no adjustment to reliability, and equal to 16/treatment if Y had no measurement error (perfect reliability)—a percentage reduction in n(.8) of 41%. Reliabilities for Y of .9, .85, .8, .75, and .7 yielded n(.8) values of 18, 19, 20, 22, and 23, respectively, with corresponding reductions in n(.8) of 33%, 30%, 26%, 18%, and 15% compared with no reliability correction in Y.

The results on sample size shown in Table 2 are commensurate with those given in Table 1: The percentage decreases in n(.8), as effect size increases in Table 2, are virtually identical to those in Table 1 for the same degree of increase in the DV reliability. However, even though the percentage decreases in n(.8) are nearly the same as those in Table 1, the absolute decreases in n(.8) are considerably smaller, and become trivially small as effect size becomes large—as can be seen in Table 3 where the results for the largest effect size (.33) are summarized. In cases of large effect size, increasing the reliability of the DV will only be economically viable in cases where the subjects needed for an experiment are extremely rare, and increasing sample size is much more difficult and expensive than increasing the reliability of the DV.

Table 2.

Sample Size, per Group, n(.8), Necessary to Attain Power = .80 for Various Levels of Reliability Correction for a Medium Effect Size, f² = .18

Medium effect (f² = .18); R² = .15; ρ(y,y′) = .60

Number of groups (J)   Reliability correction   n(.8), α = .05   Reduction in n(.8)
2                      No correction            24               —
2                      Complete correction      14               42%
2                      ρ(yc,yc′) = .90          16               33%
2                      ρ(yc,yc′) = .85          17               29%
2                      ρ(yc,yc′) = .80          18               25%
2                      ρ(yc,yc′) = .75          19               21%
2                      ρ(yc,yc′) = .70          20               17%
4                      No correction            18               —
4                      Complete correction      10               44%
4                      ρ(yc,yc′) = .90          12               33%
4                      ρ(yc,yc′) = .85          13               28%
4                      ρ(yc,yc′) = .80          13               28%
4                      ρ(yc,yc′) = .75          14               22%
4                      ρ(yc,yc′) = .70          15               17%

Table 3.

Sample Size, per Group, n(.8), Necessary to Attain Power = .80 for Various Levels of Reliability Correction for a Large Effect Size, f² = .33

Large effect (f² = .33); R² = .25; ρ(y,y′) = .60

Number of groups (J)   Reliability correction   n(.8), α = .05   Reduction in n(.8)
2                      No correction            14               —
2                      Complete correction      8                43%
2                      ρ(yc,yc′) = .90          9                36%
2                      ρ(yc,yc′) = .85          10               29%
2                      ρ(yc,yc′) = .80          10               29%
2                      ρ(yc,yc′) = .75          11               21%
2                      ρ(yc,yc′) = .70          12               14%
4                      No correction            10               —
4                      Complete correction      6                40%
4                      ρ(yc,yc′) = .90          7                30%
4                      ρ(yc,yc′) = .85          7                30%
4                      ρ(yc,yc′) = .80          8                20%
4                      ρ(yc,yc′) = .75          8                20%
4                      ρ(yc,yc′) = .70          9                10%

Summary and Conclusions

Spearman’s correction for attenuation has proven useful over the years primarily because it allows researchers to determine what the linear relationship between two variables, X and Y, would be if all measurement errors were removed from one or both of the variables. The correction is especially useful for examining the strength of theoretical relationships between variables undistorted by measurement errors; for example, “Does the ACT college admission test measure the same underlying construct as the Wechsler Adult Intelligence Scale?” The Spearman equation, modified in the present inquiry, allows estimation of the effect on correlations of partial corrections for unreliability. These partial corrections are probably more useful in applied situations than in the examination of theoretical relationships, because they allow one to address questions such as, “If we can improve the reliability of these measures a bit, what would their correlation look like?” For example, the correction could be used to ask, “If we could increase the reliability of course grades to a level approaching that of the ACT, what would the validity coefficient (the correlation between ACT scores and grades) be for the ACT?” Another applied use of the partial correction for attenuation is, of course, the one presented here: examining the degree to which the sample size of an experiment can be reduced by increasing the reliability of the DV.

A final note on the role of DV reliability in statistical tests: Increasing the reliability coefficient of a DV in an experiment does not necessarily mean that the power of a statistical test will be increased. For given treatment effects, the power of a statistical test is determined by the magnitude of the “within treatment” variance, and this variance is the sum of true and error variances. Spearman’s correction for attenuation—and the partial correction derived here—can increase power by decreasing the magnitude of the measurement error variance (i.e., increasing the reliability coefficient). However, when experimental controls for individual differences are used—such as the analysis of difference scores in a two-treatment repeated measures experiment, blocking in ANOVA, and so on—“controlling individual differences” translates into “reducing true variance.” Therefore, there are situations in which increased statistical power is accompanied by a decrease in the reliability of the DV (Nicewander & Price, 1983).

Footnotes

Declaration of Conflicting Interests: The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author received no financial support for the research, authorship, and/or publication of this article.

References

  1. Cleary T. A., Linn R. L. (1969). Error of measurement and the power of a statistical test. British Journal of Mathematical and Statistical Psychology, 22, 49-52. [Google Scholar]
  2. Cohen J. (1988). Statistical power analysis for the behavioral sciences. Mahwah, NJ: Lawrence Erlbaum. [Google Scholar]
  3. Guttman L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10, 255-282. [DOI] [PubMed] [Google Scholar]
  4. Liu X., Raudenbush S. (2004). A note on the noncentrality parameter and effect size estimates for the F test in ANOVA. Journal of Educational and Behavioral Statistics, 29, 251-255. [Google Scholar]
  5. Lord F. M., Novick M. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley. [Google Scholar]
  6. Marcoulides G. A. (1993). Maximizing power in generalizability studies under budget constraints. Journal of Educational and Behavioral Statistics, 18, 197-206. [Google Scholar]
  7. Maxwell S. E. (1980). Dependent variable reliability and determination of sample size. Applied Psychological Measurement, 4, 253-259. [Google Scholar]
  8. Nicewander W. A. (1975). A relationship between Harris factors and Guttman’s sixth lower bound to reliability. Psychometrika, 40, 197-203. [Google Scholar]
  9. Nicewander W. A., Price J. M. (1983). Reliability of measurement and the power of statistical tests: Some new results. Psychological Bulletin, 94, 524-533. [Google Scholar]
  10. Spearman C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101. [PubMed] [Google Scholar]
  11. Sutcliffe J. P. (1958). Error of measurement and the sensitivity of a test of significance. Psychometrika, 23, 9-17. [Google Scholar]
