Journal of Research of the National Bureau of Standards
. 1982 Sep-Oct;87(5):377–385. doi: 10.6028/jres.087.022

Consensus Values and Weighting Factors

Robert C Paule 1,*, John Mandel 1,*
PMCID: PMC6768160  PMID: 34566088

Abstract

A method is presented for the statistical analysis of sets of data which are assembled from multiple experiments. The analysis recognizes the existence of both within group and between group variabilities, and calculates appropriate weighting factors based on the observed variability for each group. The weighting factors are used to calculate a “best” consensus value from the overall experiment. The technique for obtaining the consensus value is applicable to either the determination of the weighted average value, or to the parameters associated with a weighted least squares regression problem. The calculations are made by using an iterative technique with a truncated Taylor series expansion. The calculations are straightforward, and are easily programmed on a desktop computer.

An examination of the observed variabilities, both within groups and between groups, leads to considerable insight into the overall experiment and greatly aids in the design of future experiments.

Keywords: ANOVA (within-between), components of variance, consensus values, design of experiments, pooling of variance, weighted average, weighted least squares regression

1. Introduction

The purpose of this article is to discuss the problem of calculating “best” estimates from a series of experimental results. It will be convenient to refer to these estimates as consensus values. Since experimental data frequently come from many different sources, with each having its own characteristic variability, the statistician’s problem centers on the appropriate weighting of the data to obtain the consensus value(s). In order to achieve this aim, the statistical analysis should recognize the existence of both within group and between group variability. Both types of variability are considered here to be random effects and are described by their associated components of variance: the within set component of variance for group i, σwi2, and the between set component of variance, σb2.

Early attempts to solve the consensus value problem have not explicitly recognized the existence of the between set component of variance. Attempts, such as the Birge ratio method [1],1 are adversely affected by this omission.

In this article we will deal with two types of problems. The first is essentially the calculation of a weighted average value and its statistical uncertainty. The second problem arises in the fitting of a straight line or curve to a set of data. The estimates both for the average and the parameters of the fitted curve can be thought of as consensus values, i.e., they should be the consensus of the observed data. Both problems can involve several sources of error. It will turn out that, because of the similarity of the weighting problems in both types of situations, a theoretical solution common to both problems can be derived. A solution for a more restricted form of the first problem has been previously reported [2,3]. The mathematical aspects of the general problem are outlined in the current paper.

An understanding of the nature of random error processes associated with measurement systems is required to develop appropriate weighting factors. The weighting factors that are derived are not arbitrary, but are controlled by the nature of the variability of the data.

2. Illustrative Examples

For purposes of exposition, artificial examples of data sets will be used as illustrative material. Later in this paper it will be shown that the procedures developed are useful for the analysis of actual laboratory data.

To develop some feeling for “what is appropriate weighting”, let us examine two specific examples. For the first example, consider that three measurements are each made by method A and by method B, and that the method A results are more precise. Assume, both here and throughout this manuscript, that the relative accuracies of the methods are not known. Let the following be the measured values, and the corresponding coded values which are obtained by subtracting 200 from the measured values.

Method            A                      B
Measured Values   201.1  201.9  201.5    216   225   203
Coded Values        1.1    1.9    1.5     16    25     3

For ease of presentation, our evaluations will be made using the coded values. Giving the same weight to all six coded values results in a straight average of 8.1. Note that the addition or loss of a single method B measurement would likely result in a relatively large change of this average. This is not desirable. Intuitively, we know that we should give greater weight to the more precise (and stable) method A results.

For the second example, consider that the measurements by method A and B are equally precise, but that many more measurements were made by method A than by method B.

Method         A                                B
Coded Values   2.0  1.0  1.5  1.8  1.2  1.7     16.3  16.8
(Coded Values = Measured Values − 200.)

In this example, the values by the two methods differ widely. Here again, we should not take a straight average (Y¯=5.3). To do so would strongly favor method A, and we have no basis for preferring this method. For these data, it is better to take separate averages for each method, and to then average the two averages (Y¯=9.0). Note that this is a form of weighting, and that it does not let the larger number of A measurements overpower the B measurements.
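The two averaging schemes just described are simple enough to verify directly. The following sketch (our own illustration, not part of the original paper) computes both averages for the coded values of the second example:

```python
# Coded values (measured value - 200) from the second example.
method_a = [2.0, 1.0, 1.5, 1.8, 1.2, 1.7]
method_b = [16.3, 16.8]

# Straight average: every individual measurement weighted equally.
all_values = method_a + method_b
straight = sum(all_values) / len(all_values)    # ~5.3

# Average of the two method averages: each method weighted equally.
mean_a = sum(method_a) / len(method_a)
mean_b = sum(method_b) / len(method_b)
avg_of_avgs = (mean_a + mean_b) / 2.0           # ~9.0
```

The straight average is pulled toward whichever method contributes more values; the average of averages is a first, crude form of weighting that prevents this.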

3. Basic Statistics of Weighted Averages

It is well known that the weighted average of n values of Y is calculated by the formula:

$$\tilde{Y}=\frac{\sum_{i=1}^{n}\omega_i Y_i}{\sum_{i=1}^{n}\omega_i}=a_1Y_1+a_2Y_2+\cdots+a_nY_n$$

where ωi is the weight associated with the value Yi and the a’s are the corresponding coefficients. Statistical theory shows that the variance of this weighted average is minimized when the individual weights are taken as the inverse of the variance of the individual Yi, that is, ωi = 1/Var(Yi). Low weights are given to values with high variance.

Next consider the weighted average of m average values, Yi¯:

$$\tilde{Y}=\frac{\sum_{i=1}^{m}\omega_i \bar{Y}_i}{\sum_{i=1}^{m}\omega_i} \qquad (1)$$

where now ωi=1/Var(Y¯i). If both the Y¯i and the Var(Y¯i) are known, then the weighted average of (1) is easily calculated, and this is the consensus value. The estimation of the proper value for the variance of Y¯i, however, is not always a simple process. To better understand the problem let us return to the second example, given above. It is easy to obtain Y¯A=1.533 and the variance estimate

$$s^2(\bar{Y}_A)=\frac{s^2(Y_A)}{n_A}=\frac{\sum_{i=1}^{6}(Y_i-\bar{Y}_A)^2}{(6-1)\cdot 6}=0.0238=(0.154)^2$$

and Y¯B=16.55 and s2(Y¯B)=0.0625=(0.250)2, but one questions the reasonableness of the estimated variances. How can the variances be so small, and the two averages be so far apart? The answer is that the above variance calculations only describe the internal variability of the A or the B measurements, and do not recognize the variability between the sets of measurements. It is quite common, even among very good measurements, to find large differences between different sets of measurements [4]. To obtain a realistic estimate of Var(Y¯i), or correspondingly of ωi, one must evaluate a between set component of variance and include it in Var(Y¯i).
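The within-set variance estimates quoted above follow from the usual sample formulas; a short sketch (helper names are ours) reproduces them:

```python
# Within-set variances of the method averages for the second example,
# reproducing s^2(Ybar_A) ~ 0.0238 and s^2(Ybar_B) = 0.0625.
def mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

method_a = [2.0, 1.0, 1.5, 1.8, 1.2, 1.7]
method_b = [16.3, 16.8]

var_mean_a = sample_variance(method_a) / len(method_a)   # s_wA^2 / 6
var_mean_b = sample_variance(method_b) / len(method_b)   # s_wB^2 / 2
```

As the text notes, these values describe only the internal variability of each method and say nothing about the 15-unit gap between the two averages.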

The use of a between set component of variance, in effect, treats the collection of systematic errors from the various measurement sets as a source of random variability. The existence of systematic errors in a measurement process may require a word of explanation. The systematic errors that are being described are errors that remain after the extensive scientific development of a measurement process. All major sources of error should have been eliminated, and calibrations with multiple standards should have been made. What frequently happens in this process is that many within set errors are eliminated along with the larger between set errors, such that the sensitivity of the analytical method increases to the point that a lower level of systematic error can now be detected. There are practical limitations to the pursuit of this process, and frequently one must live with a certain detectable level of between set systematic error. Effects such as interferences due to minor sample components will vary in different laboratory environments and these effects are extremely difficult to eliminate.

An essential point in our analysis is the assumption that no information on systematic errors is available that would allow us to place more confidence in any one set of measurements as compared to the others. Thus, in this analysis all sets have equal standing with regard to their possible systematic errors. Our technique can, however, be extended to cover situations involving different assumptions.

The calculation of the between set component of variance is readily accomplished by an iterative procedure, described in section 4. The sample estimate of Var(Y¯i) for method i, is obtained by combining the within set component of variance, swi2, and the between set component of variance, sb2. For the second example:

$$s^2(\bar{Y}_A)=\frac{s_{wA}^2}{6}+s_b^2 \qquad\qquad s^2(\bar{Y}_B)=\frac{s_{wB}^2}{2}+s_b^2$$

The within set component of variance for method A is:

$$s_{wA}^2=\frac{\sum_{i=1}^{6}(Y_i-\bar{Y}_A)^2}{6-1}=0.1427$$

Similarly,

$$s_{wB}^2=0.1250$$

and swA2/6 and swB2/2 are equal to 0.0238 and 0.0625, the quantities that we had previously (and incorrectly) called s2(Y¯A) and s2(Y¯B). For the proper s2(Y¯i) one needs to add in sb2. With an available sb2 one calculates estimates for s2(Y¯A) and s2(Y¯B) and the corresponding weights, ωA and ωB, and then proceeds by eq (1) to obtain a valid estimate of the consensus value, Y˜.

If the swi2 are quite similar, as they are in the above example, one can make an improvement by using a more stable pooled sw2. There should, of course, be a reasonable scientific and statistical basis for pooling the within set variability. For the current example,

$$s_w^2=\frac{\sum_{i=1}^{6}(Y_{Ai}-\bar{Y}_A)^2+\sum_{i=1}^{2}(Y_{Bi}-\bar{Y}_B)^2}{(6-1)+(2-1)}=0.1398 \qquad (2)$$

so that

$$s^2(\bar{Y}_A)=\frac{0.1398}{6}+s_b^2$$

and

$$s^2(\bar{Y}_B)=\frac{0.1398}{2}+s_b^2$$

To summarize: The weights used to calculate the consensus value are obtained by taking the inverses of the variances of the various set averages, Y¯i. The proper variances are a combination of the within and the between set components of variance. Under certain circumstances, a more stable pooled within set component of variance may be used.

4. Calculation of the Between Set Component of Variance

The proper weight for Y¯i is ωi=1/Var(Y¯i) and the estimate of this quantity is:

$$\omega_i=\left[\frac{s_{wi}^2}{n_i}+s_b^2\right]^{-1} \qquad (3)$$

Depending on the nature of the data, the within set variance may or may not be pooled. For either case, however, sb2 must be evaluated. This is accomplished in the following way.

From the definition of ωi we obtain the relation:

$$\omega_i\,\mathrm{Var}(\bar{Y}_i)=1 \qquad (4)$$

or equivalently

$$\mathrm{Var}\!\left(\sqrt{\omega_i}\,\bar{Y}_i\right)=1$$

For any given set of ωi, this variance can be estimated from the sample by the formula

$$s^2\!\left(\sqrt{\omega_i}\,\bar{Y}_i\right)=\frac{\sum_{i=1}^{m}\omega_i(\bar{Y}_i-\tilde{Y})^2}{m-1}$$

Equating this estimate to its expected value (unity, see eq (4)), we obtain

$$\frac{\sum_{i=1}^{m}\omega_i(\bar{Y}_i-\tilde{Y})^2}{m-1}=1 \qquad (5)$$

where Y˜ is the estimate of the consensus value as given by eq (1). The estimate of Y˜ depends on knowing the ωi. These can be calculated from eq (3), once sb2 is known. Thus, the only problem is to estimate sb2. Equation (5) provides the means for calculating sb2 through an iterative process.

Define the function:

$$F(s_b^2)=\sum_{i=1}^{m}\omega_i(\bar{Y}_i-\tilde{Y})^2-(m-1) \qquad (6)$$

In view of eq (5), sb2 must be such that F(sb2)=0. For ease of notation let sb2=v. Start with an arbitrarily selected initial value, vo. It is desired to find an adjustment, dv, such that F(vo + dv) = 0. Using a truncated Taylor series expansion, one obtains:

$$F(v_o+dv)\approx F_o+\left(\frac{\partial F}{\partial v}\right)_{\!o} dv=0$$

and

$$dv=-\frac{F_o}{\left(\partial F/\partial v\right)_o}$$

Evaluating the partial derivative in this equation, one obtains:

$$dv=\frac{F_o}{\left[\sum_{i=1}^{m}\omega_i^2(\bar{Y}_i-\tilde{Y})^2\right]_o} \qquad (7)$$

The adjusted (new) value for v is:

$$\text{New } v_o=\text{Old } v_o+dv$$

This new value is now introduced in eq (1), (3), (6), and (7) and the procedure is iterated until dv is satisfactorily close to zero. If at any point in the iteration process a negative value is obtained for v, this value should be replaced by zero and the iteration continued. The last v is the sb2 we seek. The ωi and Y˜ are also obtained from this last iteration.

The small data set of our second example will now be used to illustrate the iterative procedure. Let the first estimate for sb2 be 100. In calculating the ωi from eq (3), it is seen that the first term of the right-hand side is a fixed quantity and that the values for the A and B sets have been previously calculated to be .0238 and .0625, respectively. Thus, ωA = 1/(.0238 + 100.) = .0099976 and ωB = .0099938. The Y¯A and Y¯B are 1.533 and 16.550, respectively.

From eq (1), Y˜ = 9.0400
From eq (6), Fo = .1270
From eq (7), dv = 11.28

The next iteration would start with a value of 111.28 for sb2, and would repeat the above set of calculations with this new value. After two additional iterations, v is equal to 112.7120 and dv is less than .0001. The final Y˜ is 9.0402. (The uncoded Y˜ is, of course, 209.0402.)

In this illustration the initial v value was reasonably close to its final estimate. More discrepant initial values will require only a few additional iterations. It can be shown that the iteration process always converges.

The pooled estimate of the sw2 (= 0.1398) could have been used for the above iterative calculations. The final results would be very similar (sb2=112.7085 and Y˜=9.0399). For either case, the iterative calculations are easily programmed on a desktop computer.
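The whole iteration of this section fits in a short routine. The sketch below is our own transcription of eqs (1), (3), (6), and (7) (function and variable names are ours, not the paper's); applied to the second example it reproduces the values quoted above:

```python
def consensus_mean(means, var_within, v0=100.0, tol=1e-6, max_iter=100):
    """Estimate the between-set variance component sb^2 (called v here) and
    the weighted consensus value by the iterative procedure of section 4.

    means      : set averages Ybar_i
    var_within : within-set variances of the averages, s_wi^2 / n_i
    """
    m = len(means)
    v = v0
    for _ in range(max_iter):
        w = [1.0 / (c + v) for c in var_within]                             # eq (3)
        y = sum(wi * yi for wi, yi in zip(w, means)) / sum(w)               # eq (1)
        f = sum(wi * (yi - y) ** 2 for wi, yi in zip(w, means)) - (m - 1)   # eq (6)
        dv = f / sum(wi ** 2 * (yi - y) ** 2 for wi, yi in zip(w, means))   # eq (7)
        v = max(v + dv, 0.0)   # a negative variance estimate is replaced by zero
        if abs(dv) < tol:
            break
    return y, v

# Second example: coded set averages and their within-set variances.
y_tilde, sb2 = consensus_mean([1.533, 16.55], [0.0238, 0.0625])
# y_tilde ~ 9.0402 and sb2 ~ 112.712, as in the text
```

Starting from the same initial value of 100 used in the text, the routine converges in a handful of iterations.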

5. Discussion

The above iterative calculations for the weights and the weighted average are recommended. The calculations are based on the recognition of both within and between group variability. The calculated consensus value is, in general, neither the grand average of all measurements, nor the average of measurement set averages. These overall averages merely describe two opposite weighting situations from our more general weighting eq (3). To illustrate this point consider the case where a pooled sw2 is used in eq (3). When the sb2 term of this equation is zero, the weights for the Y¯i are all proportional to ni. All individual measurements are therefore weighted equally. When, however, sb2 is relatively large, the sw2/ni term of eq (3) is essentially without effect, and all the measurement set averages are weighted equally. Equation (3) also gives proper weighting for all intermediate cases. In addition, it describes the situation where the within set components of variance are different for different sets of measurements, and takes account of any differences in the number of replicates (ni) in the various groups.

The ready availability of programmable desktop computers strongly encourages the use of the iterative approach. Since one can easily do the calculations, there is little reason not to use proper weighting.

The examples to this point have been chosen to be easily worked by hand. They describe situations where the intuitive answers are obvious. The use of only two measurement sets, as in these examples, is not recommended in practice, however, since it provides a very limited sampling of the measured differences between sets. Such a limited sampling results in an sb2 estimate that is quite uncertain. The use of many sets of measurements is recommended since this results in greater stability of the estimates.

6. Calculation of the Standard Error of the Weighted Average

All practical applications of the weighted average will require some estimate of its uncertainty. Accordingly, the standard error (standard deviation) of the weighted average should be calculated. The derivation of the standard error of Y˜ is straightforward if one considers the final ωi estimates as constants.

$$\tilde{Y}=\frac{\sum_i \omega_i \bar{Y}_i}{\sum_i \omega_i}$$

and

$$\mathrm{Var}(\tilde{Y})=\frac{\sum_i \omega_i^2\,\mathrm{Var}(\bar{Y}_i)}{\left(\sum_i \omega_i\right)^2}=\frac{\sum_i \omega_i^2\,(1/\omega_i)}{\left(\sum_i \omega_i\right)^2}=\frac{1}{\sum_i \omega_i}$$

The sample estimate of the standard error of Y˜ is easily obtained from the final iteration of Y˜. It is simply the inverse of the square root of the sum of the weights. Note that the latter quantity has already been calculated as the denominator of Y˜.

$$\text{Standard Error}=\frac{1}{\sqrt{\sum_i \omega_i}}$$

The standard error is reduced by the use of a larger number of sets of measurements, i.e., by more ωi.

The standard error associated with the Y˜=209.04 from our previously worked example is calculated as follows:

$$\text{Standard Error}=\frac{1}{\sqrt{(.0238+112.7120)^{-1}+(.0625+112.7120)^{-1}}}=7.51$$

This value is seen to be quite reasonable when one remembers that the uncoded group averages for methods A and B were 201.53 and 216.55. Notice in this example that the between set component of variance is the predominant factor in the standard error.
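Once the final weights are in hand, this check takes only a couple of lines (a sketch using the values from the worked example):

```python
import math

# Final weights from the worked example: w_i = 1 / (s_wi^2/n_i + sb^2).
sb2 = 112.7120
weights = [1.0 / (0.0238 + sb2), 1.0 / (0.0625 + sb2)]

# Standard error of the weighted average: 1 / sqrt(sum of weights).
std_error = 1.0 / math.sqrt(sum(weights))    # ~7.51
```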

7. Example of an Interlaboratory Experiment Using the Weighted Average

Five laboratories have made a number of determinations for the heat of vaporization of cadmium [5]. In this experiment, each laboratory had a noticeably different replication precision, and each performed a different number of determinations to obtain its average value. We now wish to determine the consensus value (weighted average) from this interlaboratory experiment. The information from the experiment is listed below, along with the sb2 calculated by the iterative procedure.

Lab i   Avg. Value   ni   swi2/ni (× 10^3)
1       27,044       6       3
2       26,022       4      76
3       26,340       2     464
4       26,787       2       3
5       26,796       4      14

sb2 = 105 × 10^3

In the process of examining the data, the following three averages were calculated.

Average of averages 26,598
Average of individual measurements 26,655
Iterative weighted average 26,713

One notes that the iterative weighted average does not fall between the other two averages. How can this happen? Basically, it is caused by the recognition of the individual within group variances in the weights for the iterative weighted average. To better understand the three averaging processes, let us order the laboratory heat values and include the three sets of “weights” that were used for the averages.


Ordered        Weights for       Weights for Avg.    Weights for Iterative      swi2/ni
Avg. Values    Avg. of Averages  of Measurements     Procedure (× 10^-6)        (× 10^3)
26,022         1                 4                   5.5                         76
26,340         1                 2                   1.8                        464
26,787         1                 2                   9.3                          3
26,796         1                 4                   8.5                         14
27,044         1                 6                   9.3                          3

sb2 = 105 × 10^3

Note that the second and third columns contain relative weights while the fourth column contains absolute weights. The relative weights cause no problem for the calculation of the weighted average since inspection of eq (1) shows that any constant multiplier for the relative weights will cancel out. An inspection of the three columns of weights, as well as the ordered laboratory heat values, shows that the weights for the iterative procedure most strongly favor the higher laboratory heat values. Column five of the table, in turn, shows why the iterative weights most strongly favor the higher laboratory heat values; the observed within group variability is smaller for the laboratories that have the higher heat values. This causes the Var(Y¯i) for these laboratories to be relatively small and the weights to be relatively large.

This example with actual laboratory data shows that one cannot automatically assume that the average of averages and the average of measurements will bracket the consensus value (weighted average). The weighted average should be calculated. It is more sensitive to the overall experiment and it responds to both the within- and the between group variability.
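The same iterative routine can be run directly on the tabulated laboratory data. The sketch below uses the lab averages and the swi2/ni values as we read them from the table above, so the inputs carry the table's rounding:

```python
# Cadmium interlaboratory example: lab averages and within-group
# variances of the averages, s_wi^2 / n_i, as tabulated.
means = [27044.0, 26022.0, 26340.0, 26787.0, 26796.0]
var_within = [3e3, 76e3, 464e3, 3e3, 14e3]

v = 100e3                          # initial guess for sb^2
for _ in range(200):
    w = [1.0 / (c + v) for c in var_within]
    y = sum(wi * yi for wi, yi in zip(w, means)) / sum(w)
    f = sum(wi * (yi - y) ** 2 for wi, yi in zip(w, means)) - (len(means) - 1)
    dv = f / sum(wi ** 2 * (yi - y) ** 2 for wi, yi in zip(w, means))
    v = max(v + dv, 0.0)
    if abs(dv) < 1e-6:
        break
# y comes out close to 26,713 and v close to 105 x 10^3, matching the table
```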

It will next be shown that the iterative treatment of weighting factors can be easily extended to the problem of fitting lines by weighted least squares (regression).

8. Fitting Lines by Weighted Least Squares

According to statistical theory, the above defined estimate of the weighted average is the value that minimizes the sum of the weighted squared deviations of the observed data (from the weighted average value). It is a least squares estimate. A similar treatment is used in weighted linear regression. Here, a pair of parameters, namely the intercept and the slope of the line, are estimated, rather than a single average. The procedure, however, is again the minimization of the weighted sum of squares of deviations. Here, too, both within set and between set components of variance should be evaluated.

Consider the situation where a laboratory calibrates an instrument using a series of standards. The laboratory may not always make the same number of replicate measurements with the different standards. Thus there are different sets of replicate instrument measurements (Y) corresponding to a series of accurately determined standard values (X). An example of a linear calibration process is given in figure 1. Let us assume that the linearity of the calibration curve has previously been established. An examination of the figure shows that the variability in the Y direction among replicates obtained at the same X value is relatively small when compared with the scatter of the clusters of points about the straight line. Thus, two sources of variability are suggested by the data. The Y replication variability associated with a given X value is analogous to the previously described within set component of variance, swi2, and the variation shown by the scatter of the clusters of points about the fitted line is analogous to the between set components of variance, sb2.

Figure 1.
The observed variance for the j-th replicate Yij measurement made at a given Xi value will consist of the sum of the within- and the between set components of variance.

$$s^2(Y_{ij})=s_{wi}^2+s_b^2$$

For convenience of calculation, it is desirable to deal with the averages of the replicate measurements. The average of ni replicate measurements is denoted as Y¯i. The observed variances for the averages are given by:

$$s^2(\bar{Y}_i)=\frac{s_{wi}^2}{n_i}+s_b^2$$

The within set variances of the above equation can be evaluated for each distinct Xi value. It is possible, if there is a consistent measurement process over the full range of values, to obtain a pooled estimate of the within set component of variance. This pooled estimate is obtained in the same manner as described by eq (2), above. In the current application, the different Xi values correspond to the previously described different measurement sets and there are now as many summations in the numerator and denominator of eq (2) as there are distinct Xi values.

Let us now assume that an appropriate between set component of variance is available. The weights, ωi=1/Var(Y¯i) can be evaluated, and a standard weighted linear regression of Y¯i on Xi can be carried out (see either the Appendix, or Ref. [6]). Thus, the regression problem using weighted least squares centers on the determination of sb2.

The sb2 value, for the regression case with an intercept and slope, can be determined by the general iterative approach given above. Equation (3) now refers to the within- and between set random errors in the Y measurements. It is now used along with the following modified iteration equations:

$$F(s_b^2)=\sum_{i=1}^{m}\omega_i(\bar{Y}_i-\hat{Y}_i)^2-(m-2) \qquad (8)$$
$$ds_b^2=\frac{F_o}{\left[\sum_{i=1}^{m}\omega_i^2(\bar{Y}_i-\hat{Y}_i)^2\right]_o} \qquad (9)$$

where $\hat{Y}_i=a+bX_i$ is the weighted least squares fitted value.

The major modification is that instead of using Y˜, we use a weighted least squares fitted value Y^i. Equation (8) uses (m-2) rather than (m-1) degrees of freedom since we are now estimating two parameters, i.e., the intercept and the slope.

The procedure for iteration is little changed. An arbitrary initial estimate for sb2 is taken and used with (3) to obtain the weights. Next, a weighted linear regression is made of Y¯i on Xi to obtain estimates a, b, and Y^i. This is followed by the use of eq (8) and (9) to calculate a correction for sb2. The whole procedure is then repeated until the correction for sb2 is negligible. The final sb2, a, b, and Y^i are then saved for further interpretation and use.

The above procedure for performing a weighted linear least squares fit can be easily extended to a weighted quadratic, or higher order, regression of Y¯i on Xi. For example, to fit the equation Yi = a + bXi + cXi^2, change, in eq (8), the (m−2) to (m−3) to account for the additional coefficient c, and use a quadratic fitted Y^i in eq (8) and (9).

9. An Example of a Weighted Least Squares Fit

Let us examine the effect of different weighting factors on the determination of the intercept and slope of a calibration line. A greatly simplified example is shown in figure 2. The true line has both unit intercept and slope. For this example, let us assume that “interferences” for the X = 1 and the X = 5 standard samples are such that the measured values will be about 0.2 units high. Similarly, the X = 2 and X = 4 standards yield results that are about 0.2 units low. Duplicate measurements are made, and for simplicity assume that these measurements have a fixed sw2 value of 0.0008 (as shown in fig. 2). With equal numbers of replicate measurements, both the unweighted and the iterative weighted regression calculations give the correct values for the intercept and the slope.

Figure 2.

Let us now, however, say that the experimenter is particularly interested in determining the intercept and that he/she therefore makes six rather than two replicate measurements using the X = 1 standard. For the sake of simplicity, assume that the six Y measurements again center at 2.2 and that sw2=.0008. Even though everything looks nominally the same, the unweighted regression calculation gives an intercept of 1.145 and a slope of 0.9636. Obviously, the six points at X = 1 have pulled the left side of the line upward. If we carried out the regression calculation using only the average Y value for each X value we would obtain the correct intercept and slope values. The average Y values are not affected by the number of measurements used in each average.

In this example, in which appreciably more measurements were made for one standard than for the others, and the replication error was relatively small, the unweighted regression leads to erroneous results. A proper weighting procedure must prevent the measurements at one standard from unduly influencing the fit. Equation (3) of our iterative weighted regression calculations will properly control the weighting. In this example, the sb2 term in eq (3) dominates the weighting. Use of the iterative weighted linear regression gives a = 1.0008 and b = 0.9998. If the data from this example were real laboratory data, then our calculated a and b would be the appropriate sample estimates.
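The weighted fit just described can be reproduced with the iteration of section 8. The sketch below is our own code; the replicate averages and counts are our reading of the figure-2 example, and the fit recovers a ≈ 1.0008 and b ≈ 0.9998:

```python
# Iteratively weighted straight-line fit (section 8) for the figure-2
# example: true line Y = 1 + X, +/-0.2 "interference" offsets, six
# replicates at X = 1 and two at each other standard, s_w^2 = 0.0008.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ybars = [2.2, 2.8, 4.0, 4.8, 6.2]      # replicate averages
ns = [6, 2, 2, 2, 2]                   # replicate counts
sw2 = 0.0008                           # pooled within-set variance

m = len(xs)
v = 0.01                               # initial guess for sb^2
for _ in range(500):
    w = [1.0 / (sw2 / n + v) for n in ns]                        # eq (3)
    total = sum(w)
    xt = sum(wi * x for wi, x in zip(w, xs)) / total
    yt = sum(wi * y for wi, y in zip(w, ybars)) / total
    sxx = sum(wi * (x - xt) ** 2 for wi, x in zip(w, xs))
    b = sum(wi * (x - xt) * (y - yt)
            for wi, x, y in zip(w, xs, ybars)) / sxx             # appendix
    a = yt - b * xt
    resid = [y - (a + b * x) for x, y in zip(xs, ybars)]
    f = sum(wi * r ** 2 for wi, r in zip(w, resid)) - (m - 2)    # eq (8)
    dv = f / sum(wi ** 2 * r ** 2 for wi, r in zip(w, resid))    # eq (9)
    v = max(v + dv, 0.0)
    if abs(dv) < 1e-12:
        break
```

Because sb2 dominates the weights here, the fit behaves almost like an unweighted regression on the five averages, which is exactly the behavior the text describes.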

10. Design of Experiments

The interferences associated with the five standards of the above illustrative example have been ideally and artificially balanced. In real life situations the order in which the interferences will occur will tend to be more random. When the replication error is small, i.e., sw2 is small relative to sb2, the positions of the (Xi,Y¯i) points will be mainly affected by these random sample interferences. In that case, the use of a larger number of standards over the range of measurement interest is recommended since this favors a more even distribution of these interferences, and a more accurate determination of the line. Furthermore when sw2 is small relative to sb2, the use of large numbers of replicate measurements is not recommended since these measurements are very inefficient in determining the position of the (Xi,Y¯i) points.

Consider next the situation shown in figure 3, where sw2 is large relative to sb2. Here all of the average points (Xi,Y¯i) are very uncertain. The interferences of the standard samples are now completely overshadowed by the variability in the replicate measurements. For this situation one should make many replicate measurements with all of the standard samples so as to minimize the replication uncertainty.

Figure 3.

11. Summary and Conclusions

Calculation of consensus values, both in the form of the weighted average or the weighted least squares regression, requires a knowledge of the within- and the between set components of variance. The individual or the pooled within set components of variance can be directly calculated from the experimental data. The between set component of variance can conveniently be calculated from the experimental data using an iterative technique which is based on a truncated Taylor series expansion. Consensus value(s) are also obtained by this iterative technique.

A simple intuitive understanding of the within- and between set components of variance allows one to more efficiently design experiments for obtaining consensus values.

The logical arguments for use of the within- and between set components of variance can be extended to other areas of statistical analysis. Work is in progress for extending the current techniques to nested analyses of variance.

Appendix

The formulas for estimating the slope and intercept by weighted least squares are straightforward. The slope is calculated from the observed m sets of (Xi,Y¯i) points.

$$b=\frac{\sum_{i=1}^{m}\omega_i(X_i-\tilde{X})(\bar{Y}_i-\tilde{Y})}{\sum_{i=1}^{m}\omega_i(X_i-\tilde{X})^2}$$

where

$$\tilde{X}=\frac{\sum_{i=1}^{m}\omega_i X_i}{\sum_{i=1}^{m}\omega_i}$$

and

$$\tilde{Y}=\frac{\sum_{i=1}^{m}\omega_i \bar{Y}_i}{\sum_{i=1}^{m}\omega_i}$$

The intercept is obtained by the following formula.

$$a=\tilde{Y}-b\tilde{X}$$

The interested reader may also wish to calculate the standard errors of the above estimates of the slope and the intercept; the formulas are:

$$s_b=\frac{1}{\sqrt{\sum_{i=1}^{m}\omega_i(X_i-\tilde{X})^2}}$$
$$s_a=\sqrt{\frac{\sum_{i=1}^{m}\omega_i X_i^2}{\left[\sum_{i=1}^{m}\omega_i\right]\left[\sum_{i=1}^{m}\omega_i(X_i-\tilde{X})^2\right]}}$$

A more detailed explanation of weighted least squares fitting processes is contained in Ref. [6].
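The appendix formulas translate directly into code. A minimal sketch (names are ours), checked on an exact line where the answers are known in closed form:

```python
import math

def weighted_line_fit(xs, ybars, ws):
    """Weighted least squares slope, intercept, and their standard errors,
    following the appendix formulas with weights w_i = 1/Var(Ybar_i)."""
    total = sum(ws)
    xt = sum(w * x for w, x in zip(ws, xs)) / total
    yt = sum(w * y for w, y in zip(ws, ybars)) / total
    sxx = sum(w * (x - xt) ** 2 for w, x in zip(ws, xs))
    b = sum(w * (x - xt) * (y - yt) for w, x, y in zip(ws, xs, ybars)) / sxx
    a = yt - b * xt
    sb = 1.0 / math.sqrt(sxx)
    sa = math.sqrt(sum(w * x ** 2 for w, x in zip(ws, xs)) / (total * sxx))
    return a, b, sa, sb

# On the exact line Y = 1 + 2X with equal weights the fit is recovered exactly.
a, b, sa, sb = weighted_line_fit([0.0, 1.0, 2.0], [1.0, 3.0, 5.0], [1.0, 1.0, 1.0])
```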

Footnotes

1. Figures in brackets indicate literature references located at the end of this paper.

12. References

  • [1] Birge, R. T., The Calculation of Errors by the Method of Least Squares, Phys. Rev., 40, 207–227 (1932).
  • [2] Mandel, J. and Paule, R. C., Interlaboratory Evaluation of a Material with Unequal Numbers of Replicates, Anal. Chem., 42, 1194–7 (1970), and Correction, Anal. Chem., 43, 1287 (1971).
  • [3] Cochran, W. G., The Combination of Estimates from Different Experiments, Biometrics, 10, 101–29 (1954).
  • [4] Youden, W. J., Enduring Values, Technometrics, 14, 1–11 (1972).
  • [5] Paule, R. C. and Mandel, J., Analysis of Interlaboratory Measurements on the Vapor Pressure of Cadmium and Silver, NBS Special Publication 260–21 (1971).
  • [6] Draper, N. R. and Smith, H., Applied Regression Analysis, 2nd ed., Section 2.11 (John Wiley and Sons, N.Y., 1981).
