Journal of Research of the National Institute of Standards and Technology
. 2010 Dec 1;115(6):453–459. doi: 10.6028/jres.115.031

Assessing Differences Between Results Determined According to the Guide to the Expression of Uncertainty in Measurement

Raghu N Kacker 1, Rüdiger Kessel 1, Klaus-Dieter Sommer 2
PMCID: PMC4548867  PMID: 27134797

Abstract

In some metrology applications, multiple results of measurement for a common measurand are obtained and it is necessary to determine whether the results agree with each other. A result of measurement based on the Guide to the Expression of Uncertainty in Measurement (GUM) consists of a measured value together with its associated standard uncertainty. In the GUM, the measured value is regarded as the expected value and the standard uncertainty is regarded as the standard deviation, both known values, of a state-of-knowledge probability distribution. A state-of-knowledge distribution represented by a result need not be completely known. How, then, can one assess the differences between the results based on the GUM? Metrologists have for many years used the Birge chi-square test as 'a rule of thumb' to assess the differences between two or more measured values for the same measurand by pretending that the standard uncertainties were the standard deviations of the presumed sampling probability distributions from random variation of the measured values. We point out that this is a misuse of the standard uncertainties; the Birge test and the concept of statistical consistency motivated by it do not apply to the results of measurement based on the GUM. In 2008, the International Vocabulary of Metrology, third edition (VIM3), introduced the concept of metrological compatibility. We propose that the concept of metrological compatibility be used to assess the differences between results based on the GUM for the same measurand. A test of the metrological compatibility of two results of measurement does not conflict with a pairwise Birge test of the statistical consistency of the corresponding measured values.

Keywords: Birge test, interlaboratory evaluations, predictive p-value, uncertainty

1. Introduction

To test the proficiency of individual laboratories in conducting specific tasks, interlaboratory comparisons (ILC) are often used. In an ILC between measurement laboratories, the task is generally the measurement of a common artifact or of fractions of the same sample of material. To develop a certified reference material, a well-characterized material is measured by two or more methods in one or more laboratories. In both cases the data consist of multiple results of measurement (measured values with associated uncertainties) of a common measurand. To assess the differences between two or more measured values for the same measurand, metrologists have for many years used a test proposed by the physicist Raymond T. Birge in 1932 [1]. Birge introduced the term consistency for lack of significant differences between measured values. The Birge test is based on treating the measured values as realizations of random draws from sampling probability density functions (pdfs). A sampling pdf models possible outcomes for measured values in contemplated replications of the measurement procedure in the same conditions. Therefore, the consistency of measured values assessed by the Birge test is statistical consistency. The Birge test applies to uncorrelated measured values only. In Sec. 2, we review a concept of statistical consistency motivated by the Birge test. The idea of statistical consistency belongs to the period when the error analysis view of measurements was prevalent. The error analysis view of measurements was a hindrance to communicating the results of measurement and to advancing the science and technology of measurement. Therefore, leading authorities in the field of metrology developed the Guide to the Expression of Uncertainty in Measurement (GUM) [2]. According to the GUM, a result of measurement consists of a measured value together with its associated standard uncertainty.
In the GUM, the measured value is regarded as the expected value and the standard uncertainty is regarded as the standard deviation, both known values, of a state-of-knowledge probability distribution. A state-of-knowledge distribution represented by a result of measurement need not be completely known. We note in Sec. 3 that the Birge test and the concept of statistical consistency motivated by it are not applicable to the results of measurement based on the GUM. How, then, can one assess the differences between results based on the GUM for the same measurand? In 2008, the International Vocabulary of Metrology, third edition (VIM3) [3], introduced the concept of metrological compatibility of two or more results of measurement determined according to the GUM. In Sec. 4, we review the VIM3 concept of metrological compatibility and propose that this concept be used to assess the differences between multiple results based on the GUM for the same measurand. In Sec. 5, we show that a test of the metrological compatibility of two results of measurement does not conflict with a pairwise Birge test of the statistical consistency of the corresponding measured values.

2. The Birge Test and Concept of Statistical Consistency

Suppose x1, …, xn are n measured values for a common measurand which is believed to be sufficiently stable. The Birge test is based on regarding the measured values x1, …, xn as realizations of random draws from their presumed sampling pdfs. A sampling pdf models possible outcomes in contemplated replications of a measurement procedure subject to random effects in the same conditions. Therefore, the consistency (lack of significant differences between measured values) assessed by the Birge test is statistical consistency. The Birge test is applicable when the sampling pdfs of the measured values x1, …, xn are uncorrelated. The Birge test requires knowledge of the variances σ1², …, σn² of the sampling pdfs of x1, …, xn, respectively. Statistical consistency of the measured values x1, …, xn means that their expected values are indistinguishable¹ in view of the corresponding variances. Specifically, the Birge test checks whether the measured values x1, …, xn may be modeled as realizations from normal (Gaussian) sampling pdfs with unknown but equal expected values and known variances σ1², …, σn². Birge proposed that to check the consistency of the measured values x1, …, xn, one can calculate the test statistic

R^2 = \sum_{i=1}^{n} w_i (x_i - x_W)^2 / (n - 1),   (1)

where wi = 1/σi², for i = 1, 2, …, n, and xW = Σi wi xi / Σi wi is the weighted mean of x1, …, xn. If the calculated value of R² is substantially larger than one, then the dispersion of x1, …, xn is greater than what can be expected from the normal pdfs with equal expected values and known variances σ1², …, σn². In that case the measured values x1, …, xn can be declared to be statistically inconsistent.
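The Birge ratio of (1) is straightforward to compute. The following is a minimal sketch; the measured values and sampling variances in the example are hypothetical.

```python
def birge_ratio(values, variances):
    """Birge test statistic R^2 of Eq. (1) for uncorrelated measured
    values x_1, ..., x_n with known sampling variances sigma_i^2."""
    weights = [1.0 / v for v in variances]                 # w_i = 1 / sigma_i^2
    x_w = sum(w * x for w, x in zip(weights, values)) / sum(weights)  # weighted mean
    n = len(values)
    return sum(w * (x - x_w) ** 2 for w, x in zip(weights, values)) / (n - 1)

# Hypothetical measured values and sampling variances for one measurand
x = [10.1, 10.4, 9.8, 10.6, 10.2]
var = [0.04, 0.09, 0.04, 0.16, 0.09]
print(birge_ratio(x, var))  # a value substantially larger than 1 suggests inconsistency
```

For identical measured values the statistic is zero; values dispersed well beyond the stated variances drive it above one.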

Statistical interpretation of the Birge test

Birge was a physicist, and he proposed his test independently of, and before, much of the statistical theory as it is known today. However, the Birge test of consistency can now be interpreted as a classical (sampling-theory) statistical test of hypothesis. The measured values x1, …, xn are presumed to have normal sampling pdfs with unknown but equal expected values and variance-covariance matrix τ² × Diag[σ1², …, σn²], where τ² is an unknown parameter and σ1², …, σn² are known. The null hypothesis H0 is that τ² ≤ 1 and the alternative hypothesis H1 is that τ² > 1. The null hypothesis H0 means that the variances of x1, …, xn are not greater than σ1², …, σn², respectively. The alternative hypothesis H1 means that the variances of x1, …, xn are greater than σ1², …, σn² [4]. The classical p-value pC is the maximum probability under the null hypothesis of realizing, in contemplated replications of the n measurements, a value of the test statistic more extreme than its realized (calculated) value. The classical p-value of a realization of (n − 1)R² is

p_C = \Pr\{\chi^2(n-1) \ge (n-1) R^2\},   (2)

where χ²(n−1) denotes a variable with the chi-square probability distribution with (n − 1) degrees of freedom [4]. If the classical p-value pC is too small, say, less than 0.05, then the null hypothesis is rejected with level of significance 0.05 or less. A rejection of the null hypothesis means that the dispersion of the measured values x1, …, xn is greater than what can be expected from normal distributions for x1, …, xn with equal expected values and stated variances σ1², …, σn², respectively. The dispersion of x1, …, xn can be greater than expected under the null hypothesis because either the variances of x1, …, xn are greater than σ1², …, σn² or their expected values are not equal. If the stated variances σ1², …, σn² are not questionable, then the assumption that the expected values of x1, …, xn are equal appears to be unreasonable. In that case, the measured values x1, …, xn can be declared to be statistically inconsistent.
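The p-value of (2) requires only the chi-square survival function. As a sketch that avoids external libraries, the closed-form series valid for even degrees of freedom can be used; the realized Birge ratio in the example is hypothetical.

```python
import math

def chi2_sf_even_df(x, k):
    """Pr{chi-square with k df >= x} for even k, via the closed form
    exp(-x/2) * sum_{j < k/2} (x/2)^j / j!  (use a statistics library
    such as scipy.stats.chi2.sf for general k)."""
    assert k > 0 and k % 2 == 0
    return math.exp(-x / 2) * sum((x / 2) ** j / math.factorial(j) for j in range(k // 2))

# Classical p-value of Eq. (2) for n = 5 measured values (df = n - 1 = 4)
n, r2 = 5, 1.23                          # hypothetical realized Birge ratio
p_c = chi2_sf_even_df((n - 1) * r2, n - 1)
print(p_c)  # well above 0.05 here, so the null hypothesis is not rejected
```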

Limitations of the Birge test

A limitation of the Birge test is that it is applicable to uncorrelated measured values x1, …, xn only. However, it can be easily generalized to correlated measured values x1, …, xn whose covariances σ12, …, σ(n−1)n are known [4]. The Birge test suggests the following notion of the statistical consistency of the measured values x1, …, xn: The measured values x = (x1, …, xn)t are said to be statistically consistent if their dispersion is not greater than what can be expected from the normal consistency model, which postulates that the joint n-variate sampling pdf of x is normal N(1μ, D) with unknown expected value 1μ and variance-covariance matrix D = [σij], where 1 = (1, …, 1)t, σij is the covariance between xi and xj, and σii = σi² for i, j = 1, 2, …, n [4].

Another limitation of the Birge test (and of its generalized version for correlated measured values) is that it is a one-sided test of hypothesis, which checks whether the dispersion of x1, …, xn is more than what can be expected from a normal consistency model. A review of the Birge test in [5] notes that if the realized value of the Birge test statistic R² is substantially less than one, then the stated variances σ1², …, σn² may well be too large. To avoid declarations of statistical consistency from overstated variances, the following definition of statistical consistency was proposed in [6].

Definition of statistical consistency

The measured values x = (x1, …, xn)t are said to be statistically consistent if they reasonably fit the normal consistency model which postulates that the joint n-variate sampling pdf of x is normal N(1μ, D) with unknown expected value 1μ and variance-covariance matrix D = [σij].

This definition requires a different approach for testing statistical consistency than the Birge test and its generalized version for correlated values. A modern method to assess the fit of a statistical model to the data is Bayesian posterior predictive checking [6]. Posterior predictive checking is a Bayesian adaptation of classical (sampling-theory) statistical hypothesis testing. A function of the data (and possibly unknown parameters), called a 'discrepancy measure', is defined to characterize a potential discrepancy between the statistical model and the data. The posterior predictive p-value pP of a discrepancy measure T(x) is the probability of realizing in contemplated replications a value of the discrepancy measure more extreme than its realized value. If the posterior predictive p-value is close to zero (or to one), then the fit of the statistical model to the data is suspect.

When the measured values x1, …, xn are uncorrelated, the statistic Tc(x) = (n − 1)R² = Σi wi (xi − xW)² is a useful discrepancy measure to check the overall fit of the normal consistency model N(1μ, D) to the measured values x1, …, xn. As discussed in [6, Sec. 2.4], the posterior predictive p-value of the realized discrepancy measure Tc(x) = (n − 1)R² is

p_P = \Pr\{\chi^2(n-1) \ge (n-1) R^2\},   (3)

We note that (3) is identical to the classical p-value pC given in (2). Thus Bayesian posterior predictive checking of the discrepancy measure Tc(x) = (n − 1)R² is equivalent to the Birge test of statistical consistency.

Bayesian posterior predictive checking can be used to investigate any number of potential discrepancies between the statistical model and the data. To assess the difference between two particular measured values xi and xj, the statistic Ti−j(x) = |xi − xj| is a useful discrepancy measure, for i, j = 1, 2, …, n and i ≠ j. The Bayesian posterior predictive p-value of the realized discrepancy measure |xi − xj| is

p_P = \Pr\left\{ Z \ge \frac{|x_i - x_j|}{\sqrt{\sigma_i^2 + \sigma_j^2 - 2\rho_{ij}\sigma_i\sigma_j}} \right\},   (4)

where ρij is the correlation coefficient between the presumed normal sampling pdfs of xi and xj; the covariance between xi and xj is σij = ρij σi σj, and Z denotes a variable with standard normal distribution N(0, 1) [6, Sec. 3.2]. A posterior predictive p-value pP close to zero suggests that the difference between xi and xj is larger than what can be expected from the normal statistical consistency model N(1μ, D). That is, the measured values xi and xj do not seem to have the same expected value and hence they are not mutually statistically consistent.
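A minimal sketch of (4), using the identity Pr{Z ≥ d} = erfc(d/√2)/2 for a standard normal variable; the measured values and standard deviations in the example are hypothetical.

```python
import math

def pairwise_pp_pvalue(xi, xj, si, sj, rho=0.0):
    """Posterior predictive p-value of Eq. (4) for the discrepancy |xi - xj|
    under the normal consistency model; Pr{Z >= d} = erfc(d / sqrt(2)) / 2."""
    sd_diff = math.sqrt(si ** 2 + sj ** 2 - 2 * rho * si * sj)  # std. dev. of xi - xj
    d = abs(xi - xj) / sd_diff
    return 0.5 * math.erfc(d / math.sqrt(2))

print(pairwise_pp_pvalue(10.1, 10.6, 0.2, 0.4))  # uncorrelated case (rho = 0)
```

For identical measured values the p-value is 0.5; a value near zero flags a pair whose difference exceeds what the model can explain.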

3. Concept of Statistical Consistency Does Not Apply to Results Based on the GUM

A result of measurement determined according to the GUM consists of a measured value together with its associated standard uncertainty. Suppose [x1, u(x1)], …, [xn, u(xn)] are n results of measurement for a common measurand, where x1, …, xn are the measured values and u(x1), …, u(xn) are the corresponding standard uncertainties. According to the GUM, a measured value xi and its associated standard uncertainty u(xi) represent a state-of-knowledge pdf attributed to the measurand, for i = 1, 2, …, n. Following the GUM, we use the symbol Xi for a quantity as well as for a variable with a state-of-knowledge pdf about the quantity Xi represented by the result [xi, u(xi)], for i = 1, 2, …, n. The measured value xi is regarded as the expected value E(Xi) and the standard uncertainty u(xi) is regarded as the standard deviation S(Xi) of the pdf of Xi, for i = 1, 2, …, n. The mainstream GUM requires knowledge of only the expected value E(Xi) and the standard deviation S(Xi) of a state-of-knowledge pdf of Xi. The GUM does not require that the state-of-knowledge pdf of Xi be completely known. When the state-of-knowledge pdfs of X1, …, Xn are correlated, the correlation coefficients are assumed to be known. Following the GUM we denote the correlation coefficient R(Xi, Xj) between the state-of-knowledge pdfs of Xi and Xj by the symbol r(xi, xj). Note that {x1, …, xn}, {u(x1), …, u(xn)}, and {r(x1, x2), …, r(x(n–1), xn)} are symbols for known values.

For many years, metrologists have used the Birge test as 'a rule of thumb' to assess the consistency of the measured values by treating the squared standard uncertainties u²(x1), …, u²(xn) as the known variances σ1², …, σn² of the presumed normal (Gaussian) sampling pdfs of the measured values x1, …, xn; see, for example, [8]. The guideline for the analysis of key comparisons developed by the BIPM Director’s Advisory Group on Uncertainties recommends the use of the Birge chi-square test to assess the consistency of measured values by treating the squared standard uncertainties as the known variances of the presumed sampling pdfs of the measured values [9]. The consistency of the measured values from CIPM key comparisons and supplementary comparisons is almost always assessed using the Birge test [10].

The squared standard uncertainties u²(x1), …, u²(xn) cannot in any logical sense be identified with the known variances σ1², …, σn² of the presumed normal (Gaussian) sampling pdfs of the measured values x1, …, xn. The standard deviation of a sampling pdf represents possible dispersion from random variation in contemplated replications of the measurement procedures. A standard uncertainty, by contrast, expresses the dispersion of a state-of-knowledge pdf which could be attributed to the measurand based on all available statistical and non-statistical information; it includes all significant components, whether arising from random effects or from corrections applied for systematic effects. In measurements done in high-echelon laboratories, the component of uncertainty arising from random effects is generally a very small part of the combined standard uncertainty. Treating the squared standard uncertainties u²(x1), …, u²(xn) determined according to the GUM as the known variances σ1², …, σn² from random variation (in contemplated replications of the measurements) is a misuse of the standard uncertainties. Also, as noted earlier, the state-of-knowledge pdfs represented by the results [x1, u(x1)], …, [xn, u(xn)] may not be completely known. Therefore the Birge test and the concept of statistical consistency motivated by it do not apply to the results of measurement determined according to the GUM.

4. VIM3 Concept of Metrological Compatibility Applies to Results Based on the GUM

A measured quantity value [3, definitions 1.19 and 2.10] is the product of a numerical value and a measurement unit. The measurement unit implies that the measured value is traceable to a reference for that measurement unit. A result of measurement (measured value together with its associated standard uncertainty) is traceable to a reference only if the result can be related to a practical realization of that reference through a documented unbroken chain of calibrations, each contributing to the measurement uncertainty [3, definition 2.41]. Two or more results of measurement are metrologically comparable only if they are traceable to the same reference [3, definition 2.46]. Metrological comparability does not imply that the measured values have similar magnitudes. Thus, for example, the distance between my apartment and my office expressed in meters is metrologically comparable to the distance between my apartment and the moon, also expressed in meters. The concept of metrological compatibility discussed below applies only to those results of measurement for a common measurand which are metrologically comparable. That is, the results must be traceable to the same reference.

The concept of statistical consistency can be applied to any set of numerical values which have similar magnitudes. They do not have to be measured values. Thus, for example, one can test statistical consistency of deviations (or relative deviations expressed as percentage) from a benchmark value. Although a metrologist is expected to assess consistency of only those measured values which have the same measurement unit, it is not a requirement of statistical consistency.

All n results [x1, u(x1)], …, [xn, u(xn)] for a common measurand must be traceable to the same reference for them to be metrologically comparable [3, definition 2.46]. The VIM3 concept of metrological compatibility is defined for two results of measurement at a time. The following definition is an elaboration of the succinct definition given in VIM3 [3, definition 2.47].

Definition of metrological compatibility

Two metrologically comparable results [x1, u(x1)] and [x2, u(x2)] for the same measurand are said to be metrologically compatible if

\zeta(x_1 - x_2) = \frac{|x_1 - x_2|}{\sqrt{u^2(x_1) + u^2(x_2) - 2 r(x_1, x_2)\, u(x_1)\, u(x_2)}} \le \kappa,   (5)

for a specified threshold κ, where r(x1, x2) is a symbol for the correlation coefficient R(X1, X2) between the variables X1 and X2. The quantity in the denominator of (5) is the standard deviation of the state-of-knowledge pdf for X1 − X2, which may be incompletely determined. When the pdfs represented by [x1, u(x1)] and [x2, u(x2)] are uncorrelated, then R(X1, X2) = 0 and (5) reduces to

\zeta(x_1 - x_2) = \frac{|x_1 - x_2|}{\sqrt{u^2(x_1) + u^2(x_2)}} \le \kappa.   (6)

A set of metrologically comparable results [x1, u(x1)], [x2, u(x2)], …, [xn, u(xn)] for the same measurand is said to be metrologically compatible if for every one of the n(n − 1)/2 pairs of results [xi, u(xi)] and [xj, u(xj)] we have

\zeta(x_i - x_j) = \frac{|x_i - x_j|}{\sqrt{u^2(x_i) + u^2(x_j) - 2 r(x_i, x_j)\, u(x_i)\, u(x_j)}} \le \kappa,   (7)

for a specified threshold κ [3, definition 2.47]. The VIM3 does not discuss how the threshold κ should be determined. A conventional value of κ is two.

The concept of metrological compatibility can be used to assess the differences between the results of measurement based on the GUM for the same measurand. The concepts of metrological comparability and compatibility do not require that the state-of-knowledge pdfs represented by the results [x1, u(x1)], [x2, u(x2)], …, [xn, u(xn)] be completely known. Thus they fit the GUM. When the set of results [x1, u(x1)], …, [xn, u(xn)] is metrologically compatible, we can say that the differences between the measured values x1, …, xn are insignificant in view of the uncertainties u(x1), …, u(xn).
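As a sketch, a set of uncorrelated GUM results can be screened for metrological compatibility by checking (7) for all pairs; the results and the conventional threshold κ = 2 in the example are illustrative.

```python
import math
from itertools import combinations

def zeta(x1, u1, x2, u2, r=0.0):
    """Normalized absolute difference of Eq. (5) for two results (x, u(x))."""
    return abs(x1 - x2) / math.sqrt(u1 ** 2 + u2 ** 2 - 2 * r * u1 * u2)

def metrologically_compatible(results, kappa=2.0):
    """Check Eq. (7) for every one of the n(n-1)/2 pairs of uncorrelated results."""
    return all(zeta(xi, ui, xj, uj) <= kappa
               for (xi, ui), (xj, uj) in combinations(results, 2))

results = [(10.1, 0.2), (10.4, 0.3), (9.8, 0.2)]   # hypothetical (x_i, u(x_i)) pairs
print(metrologically_compatible(results))          # True for these numbers
```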

To assess metrological compatibility of results based on the GUM using the criteria (5), (6), or (7), the threshold κ needs to be specified. A proper choice of κ is to a large extent a matter of agreement, because it requires accepting the economic consequences of that choice. Although a conventional value of κ is two, depending on the application the interested parties could agree on a different value for κ. Once the value of the threshold κ is set, the conclusion of a test of metrological compatibility based on the VIM3 definition is dichotomous: either a set of results is metrologically compatible or it is incompatible. The concept of metrological compatibility is being used by metrologists who are familiar with it; see, for example, [11, 12].

The VIM3 definition of metrological compatibility can be easily extended to metrological compatibility of a set of results and a reference result [xR, u(xR)], where xR is the reference value with standard uncertainty u(xR). Suppose the pdfs represented by the measurement results are uncorrelated with the pdf represented by the reference result. A set of results [x1, u(x1)], …, [xn, u(xn)] metrologically comparable with a reference result [xR, u(xR)] is compatible if

\zeta(x_i - x_R) = \frac{|x_i - x_R|}{\sqrt{u^2(x_i) + u^2(x_R)}} \le \kappa,   (8)

for i = 1, 2, …, n [13]. Similarly, a set of results [x1, u(x1)], …, [xn, u(xn)] metrologically comparable with a combined result [xC, u(xC)], where xC is the combined value (such as an arithmetic mean or a weighted mean) with standard uncertainty u(xC), is compatible if

\zeta(x_i - x_C) = \frac{|x_i - x_C|}{\sqrt{u^2(x_i) + u^2(x_C) - 2 r(x_i, x_C)\, u(x_i)\, u(x_C)}} \le \kappa,   (9)

where r(xi, xC) denotes the correlation coefficient between the pdfs represented by [xi, u(xi)] and [xC, u(xC)], for i = 1, 2, …, n [13].
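The extension (8)–(9) can be sketched in the same way. The simplification below uses a single correlation coefficient for all results (r = 0 gives (8)); the reference value and uncertainties are hypothetical.

```python
import math

def compatible_with_reference(results, x_ref, u_ref, kappa=2.0, r=0.0):
    """Eq. (8) (with r = 0), or Eq. (9) with a common correlation r:
    compatibility of each result (x_i, u(x_i)) with a reference or
    combined result (x_ref, u_ref)."""
    return [abs(x - x_ref) / math.sqrt(u ** 2 + u_ref ** 2 - 2 * r * u * u_ref) <= kappa
            for x, u in results]

print(compatible_with_reference([(10.1, 0.2), (11.2, 0.3)], 10.0, 0.1))  # [True, False]
```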

5. Concluding Remarks

For many years, metrologists have used the Birge chi-square test as 'a rule of thumb' to assess the differences between two or more measured values for the same measurand by pretending that the squared standard uncertainties were the known variances of the presumed normal sampling pdfs of the measured values. This is a misuse of the standard uncertainties based on the GUM. The Birge test and the concept of statistical consistency do not apply to the results of measurement based on the GUM. As discussed in this paper, the VIM3 concept of metrological compatibility can be used to assess the differences between the results of measurement determined according to the GUM. Thus metrologists can start using the VIM3 concept of metrological compatibility in place of the Birge test to assess the differences between multiple results of measurement of the same measurand.

The following is a pertinent question: could the conclusions (about mutual agreement of results) based on the VIM3 concept of metrological compatibility and the Birge test (based on treating squared standard uncertainties as the known variances of sampling pdfs of measured values) differ? It is difficult to directly compare the Birge test and a test of metrological compatibility, because the former is defined for an arbitrary positive integer n > 1 and the latter is defined for only two results at a time. For pairwise comparisons (n = 2), the Birge test statistic R² = Σi wi (xi − xW)²/(n − 1) reduces to

R^2 = \frac{(x_1 - x_2)^2}{\sigma_1^2 + \sigma_2^2},   (10)

which is the square of (x1 − x2)/√(σ1² + σ2²). Under the null hypothesis that the presumed normal sampling pdfs of x1 and x2 have the same expected value, the distribution of (x1 − x2)/√(σ1² + σ2²) is normal N(0, 1). Therefore, when n = 2, the normal distribution can be used to assess the absolute difference |x1 − x2|. The square of a normal N(0, 1) variable has the chi-square distribution χ²(1) with one degree of freedom. Therefore the square of the (1 − α/2) × 100-th percentile z[1−α/2] of the normal N(0, 1) distribution is equal to the (1 − α) × 100-th percentile χ²(1)[1−α] of the χ²(1) distribution. Thus the realized value of (10) being less than χ²(1)[1−α] is equivalent to the ratio |x1 − x2|/√(σ1² + σ2²) being less than z[1−α/2]. It follows that a declaration of Birge statistical consistency, when the classical p-value pC of the Birge test (2) is not less than 0.05 (for example), is equivalent to the realization that

\frac{|x_1 - x_2|}{\sqrt{\sigma_1^2 + \sigma_2^2}} \le z_{[0.975]} = 1.96 \approx 2.   (11)

We note from (6) and (11) that if the threshold κ for metrological compatibility is set as κ = 2, then the conclusion of a check of metrological compatibility between a pair of results [x1, u(x1)] and [x2, u(x2)] would be identical to the assessment of statistical consistency between x1 and x2 based on the Birge test by (wrongly) treating u²(x1) and u²(x2) as σ1² and σ2², respectively (and treating the correlation coefficient R(X1, X2) as ρ12, which is zero in the Birge test). Therefore a pairwise Birge test of statistical consistency and a test of metrological compatibility do not conflict.
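The equivalence can be verified numerically: for any pair, the pairwise Birge ratio of (10) is exactly the square of the compatibility ratio of (6), so the two decisions coincide when κ is set at z[0.975] = 1.96. The pair of results below is hypothetical, with u²(x) standing in for σ².

```python
import math

x1, x2 = 10.1, 10.6        # hypothetical measured values
u1, u2 = 0.2, 0.3          # u(x) here plays the role of sigma in the Birge test
zeta_12 = abs(x1 - x2) / math.sqrt(u1 ** 2 + u2 ** 2)   # ratio of Eq. (6)
r2 = (x1 - x2) ** 2 / (u1 ** 2 + u2 ** 2)               # Birge ratio of Eq. (10)
assert math.isclose(r2, zeta_12 ** 2)                   # (10) is the square of (6)
# With kappa = z[0.975] = 1.96, the two dichotomous conclusions agree:
print(zeta_12 <= 1.96, r2 <= 1.96 ** 2)
```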

Acknowledgments

We thank Javier Bernal, Tyler Estler, Walter Liggett, and Raju Datla for their comments on earlier drafts of this paper.

Biography

About the Authors: Raghu N. Kacker is a mathematical statistician in the Information Technology Laboratory of the National Institute of Standards and Technology, Gaithersburg, MD 20899, USA.

Rüdiger Kessel is a guest researcher in the Information Technology Laboratory of the National Institute of Standards and Technology, Gaithersburg, MD 20899, USA.

Klaus-Dieter Sommer is director of the Chemical Physics and Explosion Protection Division of the National Metrology Institute of Germany, Physikalisch-Technische Bundesanstalt, D-38116 Braunschweig, Germany.

The National Institute of Standards and Technology is an agency of the U.S. Department of Commerce.

Footnotes

1

In statistical literature the term consistency is applied to a statistical estimator. A point statistical estimator is said to be consistent if it approaches the parameter being estimated as the sample size increases.

Contributor Information

Raghu N. Kacker, Email: raghu.kacker@nist.gov.

Rüdiger Kessel, Email: ruediger.kessel@nist.gov.

Klaus-Dieter Sommer, Email: klaus-dieter.sommer@ptb.de.

6. References

  • [1] Birge RT. The calculation of errors by the method of least squares. Physical Review. 1932;40:207–227.
  • [2] GUM. Guide to the Expression of Uncertainty in Measurement. 2nd ed. Geneva: International Organization for Standardization; 1995. The 2008 version is available at http://www.bipm.org/utils/common/documents/jcgm/JCGM_100_2008_E.pdf.
  • [3] BIPM/JCGM. International Vocabulary of Metrology—Basic and general concepts and associated terms. 3rd ed. Sèvres: Bureau International des Poids et Mesures, Joint Committee for Guides in Metrology; 2008. Available at http://www.bipm.org/utils/common/documents/jcgm/JCGM_200_2008.pdf.
  • [4] Kacker RN, Forbes AB, Kessel R, Sommer K. Classical and Bayesian interpretation of the Birge test of consistency and its generalized version for correlated results from interlaboratory evaluations. Metrologia. 2008;45:257–264.
  • [5] Taylor BN, Parker WH, Langenberg DN. Determination of e/h, Using Macroscopic Quantum Phase Coherence in Superconductors: Implications for Quantum Electrodynamics and the Fundamental Physical Constants. Reviews of Modern Physics. 1969;41:375–496.
  • [6] Kacker RN, Forbes AB, Kessel R, Sommer K. Bayesian posterior predictive p-value of statistical consistency in interlaboratory evaluations. Metrologia. 2008;45:512–523.
  • [7] Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. 2nd ed. Chapman & Hall; 2004.
  • [8] Mohr PJ, Taylor BN. CODATA recommended values of the fundamental physical constants: 1998. Reviews of Modern Physics. 2000;72:351–495. The current version is available at http://physics.nist.gov/cuu/Constants/index.html.
  • [9] Cox MG. The evaluation of key comparison data. Metrologia. 2002;39:589–595. These guidelines were developed by the BIPM Director’s Advisory Group on Uncertainties.
  • [10] The BIPM key comparison database. 2010. http://kcdb.bipm.org/
  • [11] Wellum R, Verbruggen A, Kessel R. A new evaluation of the half-life of 241Pu. Journal of Analytical Atomic Spectrometry. 2009;24:801–807.
  • [12] Datla RU, Kessel R, Smith AW, Kacker RN, Pollock DB. Uncertainty analysis of remote sensing optical sensor data: guiding principles to achieve metrological consistency. International Journal of Remote Sensing. 2010;31:867–880.
  • [13] Kessel R, Kacker RN, Sommer K. Proposal for combining results from multiple evaluations of the same measurand. 2009. Submitted for publication.
