Abstract
Proficiency testing (PT) determines the performance of individual laboratories for specific tests or measurements and is used to monitor the reliability of laboratory measurements. PT plays a highly valuable role because it provides objective evidence of the competence of the participating laboratories. In this paper, we propose a multivariate calibration model to assess equivalence among laboratory measurements in PT. Our method handles multivariate data, where the item under test is measured at different levels. Although intuitive, the proposed model is nonergodic, which means that the asymptotic Fisher information matrix is random. As a consequence, a detailed asymptotic analysis was carried out to establish the strategy for comparing the results of the participating laboratories. To illustrate, we apply our method to data from the PT program of the Brazilian engine test group, in which the power of an engine was measured by eight laboratories at several levels of rotation.
Keywords: Asymptotic theory, hypothesis testing, confidence region, ultrastructural model, measurement error model
1. Introduction
Comparative calibration models are typically used to compare different ways of measuring the same unknown quantity. The problem of comparing measurements may arise in different areas and contexts as can be seen in Refs. [2,3,7,14,21,23], for example. In this paper, we propose an ultrastructural calibration model motivated by the proficiency testing (PT) studies.
Proficiency studies are conducted to evaluate the equivalence of laboratory measurements. In these studies, a reference value of some measurand (the quantity to be measured) is determined and the results of all the laboratories are compared to this reference value. According to Refs. [6,12], accredited laboratories should assure the quality of test results by participating in PT programs. Various statistical techniques have been adopted to assess equivalence among laboratory measurements. These include classical statistical techniques such as the paired t test, z-score, normalized error, repeated measures analysis of variance and the Bland–Altman plot; see Refs. [1,10,12,15,19] and references therein.
As extensively discussed by ISO GUM [11], the measurement is only an estimate of the measurand and thus it must be accompanied by its uncertainty. The approach to the quantification of measurement uncertainty is presented in Ref. [11]. As discussed by Gleser [8], the basic idea is to approximate a measurement equation Y = g(X_1, ..., X_d), where Y denotes the measurement, g is a known function and X_1, ..., X_d denote the d input quantities, by a first-order Taylor series about the expected values of X_1, ..., X_d. The combined standard uncertainty is defined as the standard deviation of the probability distribution of Y based on this linear approximation. The expected value and the variance of each input quantity may be based on measurements or on any other information, such as the resolution of the measuring instrument. There are two types of uncertainty evaluation; in both, the uncertainty is expressed as a standard deviation. Type A is determined by the statistical analysis of a series of observations and type B is determined by other means, such as instrument specifications, correction factors or even data from additional experiments (see supplemental material for more detail).
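The first-order propagation described above can be sketched in a few lines. This is a minimal illustration, assuming uncorrelated input quantities and using central differences for the sensitivity coefficients; the function names (`combined_standard_uncertainty`, `engine_power`) and the numerical values are hypothetical and are not taken from the study.

```python
import math

def combined_standard_uncertainty(g, estimates, uncertainties, rel_step=1e-6):
    """First-order (Taylor) propagation, assuming uncorrelated input quantities."""
    variance = 0.0
    for i, (x_i, u_i) in enumerate(zip(estimates, uncertainties)):
        h = rel_step * max(1.0, abs(x_i))
        upper = list(estimates)
        lower = list(estimates)
        upper[i] = x_i + h
        lower[i] = x_i - h
        # Central-difference approximation of the sensitivity dg/dx_i.
        sensitivity = (g(*upper) - g(*lower)) / (2.0 * h)
        variance += (sensitivity * u_i) ** 2
    return math.sqrt(variance)

def engine_power(torque, rotation):
    # Power as torque times rotation, as in the engine illustration.
    return torque * rotation

# Hypothetical estimates and standard uncertainties (illustrative values only).
u_power = combined_standard_uncertainty(engine_power, [200.0, 300.0], [1.0, 2.0])
print(u_power)
```

For this bilinear measurement equation the linearization gives sqrt((300 × 1)² + (200 × 2)²) = 500, which the numerical result reproduces up to rounding.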
One critical point in most techniques is that the type B source of variability [11] is not considered. To address this, Pinto et al. [17] extended Jaech's model [13] to encompass the type B source of variation and evaluated this model under elliptical distributions. Toman [22] proposed a Bayesian hierarchical model that incorporates the type B source of variation; see Ref. [16] for further developments of the Bayesian hierarchical model.
Despite the number of techniques available to assess equivalence among laboratory measurements, to the best of our knowledge, all of these approaches treat the reference as a single value. In some proficiency studies, the item under testing is measured at different levels. As an illustration, consider the measurement system used to evaluate engine power. In this case, the power (torque times rotation) is measured at different levels of rotation and, as a result, we have one reference value for each level.
The first goal of this work is to propose an ultrastructural calibration model to assess equivalence among laboratory measurements, considering that the item under testing is measured at different levels with the uncertainties described above.
Let represent the measurand with mean and variance at the jth level. In general, the parameter is determined by one expert laboratory during the stability study which is conducted to guarantee the stability of the item under testing. In our example, the GM power train developed one standard engine and, during the stability study, evaluated its natural variability .
One of the basic elements of any PT program is the evaluation of the performance of each participant. To do so, the PT provider has to determine a reference value, which can be obtained in two different ways: by employing a reference laboratory or by using a consensus value. We will use the reference laboratory strategy.
In order to compare the measurements of the participant laboratories with the measurements obtained by the reference laboratory, we assume that the reference laboratory measures the item without bias,
| (1) |
where represents the measurement error related to the reference laboratory. As and were determined by different laboratories with different methods, we assume that and are independent. As a consequence, we obtain that and .
Although the reported result of the engine power has been corrected for all recognized significant systematic effects, many problems can occur during the measurement process. For example, the participant laboratory may fail to conduct the test in accordance with the requirements of the measurement procedure, or its equipment may be in poor working condition.
In order to compare the measurement results between the participant laboratory and the reference laboratory, we assume that the kth replica of the measurement of the engine power at the jth engine rotation level obtained by the ith laboratory is given by
| (2) |
where represents the measurement error, is the additive bias and is the multiplicative bias related to the measurements of the ith participant laboratory with respect to the measurements of the reference laboratory. As and were determined by different laboratories with different methods, we assume that and are also independent. As a consequence, we obtain that and .
The variance is determined by the combined variance calculated and provided by the ith laboratory following the procedure proposed by ISO GUM [11] (see Supplemental Material for more details).
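For orientation, the structure of the model in (1) and (2) can be written out as follows. This is a sketch in assumed notation, since the symbols are not legible in the text above: $x_j$ for the true value at level $j$, $\mu_j$ and $\sigma_{x_j}^2$ for its mean and variance, $Y_{1jk}$ and $Y_{ijk}$ for the kth replicas of the reference and ith laboratory, $\alpha_i$ and $\beta_i$ for the additive and multiplicative bias, and $\sigma_{ij}^2$ for the known error variances. The normality of the errors and the use of replicas for the reference laboratory are also assumptions.

```latex
% Assumed notation; not copied from the source.
\begin{align*}
x_j &\sim N(\mu_j, \sigma_{x_j}^2), & j &= 1, \dots, m, \\
Y_{1jk} &= x_j + e_{1jk}, & e_{1jk} &\sim N(0, \sigma_{1j}^2), \\
Y_{ijk} &= \alpha_i + \beta_i x_j + e_{ijk}, & e_{ijk} &\sim N(0, \sigma_{ij}^2), \quad i = 2, \dots, p.
\end{align*}
% With x_j independent of the errors, the first two moments are
\begin{align*}
E(Y_{1jk}) &= \mu_j, & \operatorname{Var}(Y_{1jk}) &= \sigma_{x_j}^2 + \sigma_{1j}^2, \\
E(Y_{ijk}) &= \alpha_i + \beta_i \mu_j, & \operatorname{Var}(Y_{ijk}) &= \beta_i^2 \sigma_{x_j}^2 + \sigma_{ij}^2 .
\end{align*}
```

Under this reading, a laboratory is equivalent to the reference laboratory when its additive bias is 0 and its multiplicative bias is 1.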
Considering the proposed ultrastructural calibration model, the second goal of this work is to develop a test to evaluate the competence of the group of laboratories and also the competence of individual laboratories with respect to the reference laboratory. To this end, we provide statistical hypothesis tests to evaluate the additive and multiplicative biases. Finally, we propose a graphical analysis to assess the equivalence of the measurements of the ith laboratory with respect to the measurements of the reference laboratory.
Although the proposed model is simple and intuitive, it presents some challenging properties. As we have only one item under testing for the entire PT program, the true unobserved value of the item under testing (engine power) at the jth level (rotation) is also the same throughout the PT program. As a consequence, there is a natural dependency among all measurements at the same jth level. This implies that the observed information matrix converges in probability to a random matrix whose components related to the mean of the measurand are null. Thus, it is not possible to obtain a consistent estimate of the mean parameter of the measurand. Consequently, the usual asymptotic theory is not applicable to the proposed ultrastructural calibration model.
To address this problem, we will apply the smoothness of the likelihood function to derive one suitable asymptotic theory to the ultrastructural calibration model, as developed by Refs. [20,24,25]. As a consequence of the asymptotic theory developed in Section 3, we will propose a Wald type test to evaluate the bias parameters. Moreover, we will apply the Wald statistics to develop a graphical analysis to assess the competence of each participant laboratory with respect to the reference laboratory.
In Section 2, we describe the model and obtain the score function and the observed information matrix in closed form. Moreover, we develop the Expectation Maximization (EM) algorithm to obtain the maximum likelihood estimates (MLEs) of the parameters. In Section 3, we develop the asymptotic theory to assess the equivalence among laboratory measurements in PT. Next, tests for the composite hypothesis and confidence regions are obtained. Moreover, we perform a simulation study considering different numbers of replicas, nominal values and parameter values. In Section 4, we apply the developed methodology to a real data set collected in a proficiency study. Finally, we discuss the results in Section 5.
2. The model
Considering the model defined by (1) and (2), and the engine power illustration, let and represent, respectively, the measurements of the engine power of the reference laboratory and the ith laboratory at the jth engine rotation value, the measurements of all the laboratories at the jth engine rotation value and finally, the observed data with . Then, assuming that , where and with ( ) denoting a vector composed by zeros ( one's) and denoting the diagonal matrix with the diagonal elements given by , we have and , , where , , , , with denoting the identity matrix of size . Furthermore,
with
The log-likelihood function is given by
| (3) |
where and , .
Moreover, the covariance between the observations taken at the same value of the engine rotation by the reference laboratory (ith laboratory) is given by ( ) and the covariance between the observations of the reference laboratory and the th laboratory at the same value of the engine rotation is given by , while the covariance between the observations of the ith and hth laboratory at the jth engine rotation is given by , that is:
; .
After algebraic manipulations, the elements of the score function, , denoted by , , were obtained and are given by
with and
Subsequently, the elements of the observed information matrix, , denoted by , were obtained in closed form expressions and are given by
2.1. EM algorithm
In this subsection, we are going to outline the EM algorithm [5] used to obtain the estimates of the parameters. In measurement error models, if the latent data , , is introduced to augment the observed data, the MLEs of the parameters based on the augmented data (complete data) become easy to obtain. Considering the model defined by (1) and (2) and the observed data for the jth engine rotation value, , we augment by considering the unobserved data . Then, the complete data for the jth engine rotation value is given by , with and the covariance matrix given by , , with and as given above in Section 2. Furthermore, and let , then
It follows that the log-likelihood function of the complete data is given by
which is much simpler than (3). Given the estimates of in the th iteration, , the E step consists in computing the expectation of the complete data log-likelihood function, , with respect to the conditional distribution of given the observed data, and . The M step consists in maximizing the function obtained in the E step with respect to , which gives the estimates of the parameters for the next iteration, Each iteration of the EM algorithm does not decrease the log-likelihood function of the observed data , i.e. When the likelihood function of the complete data belongs to the exponential family, the implementation of the EM algorithm is usually simple. In our case, the E step consists in computing and , . In the M step, we maximize the log-likelihood function of the complete data, with the sufficient statistics replaced by the expected values obtained in the E step.
The EM algorithm for the model defined by (1) and (2) may be summarized as follows.
E step: using the properties of the multivariate normal distribution, the E step consists in computing
where
and represent, respectively, the value of and evaluated at .
M step: the M step consists in computing
Notice that closed form expressions were obtained for all the quantities in the M step, so the procedure is computationally inexpensive and simple to implement. Furthermore, if the variance terms were not known, the algorithm could easily be adapted to obtain the estimates of the parameters and ,
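The E and M steps can be illustrated with a short sketch. The closed-form updates below are derived here for an assumed reading of the model (one latent normal value per level, known error variances, the same number of replicas for every laboratory) and are not copied from the expressions of this section; all names (`em_calibration`, `sd_x`, `sd_ref`, `sd_lab`) and the simulated values are hypothetical.

```python
import numpy as np

def em_calibration(y_ref, y_lab, sd_x, sd_ref, sd_lab, n_iter=200):
    """EM sketch. y_ref: (m, n) reference readings; y_lab: (p, m, n) participant
    readings; sd_x, sd_ref: (m,) and sd_lab: (p, m) known standard deviations."""
    p, m, n = y_lab.shape
    w_x, w_ref, w_lab = 1.0 / sd_x**2, 1.0 / sd_ref**2, 1.0 / sd_lab**2
    s_ref = y_ref.sum(axis=1)                 # replica sums, shape (m,)
    s_lab = y_lab.sum(axis=2)                 # replica sums, shape (p, m)
    alpha, beta, mu = np.zeros(p), np.ones(p), y_ref.mean(axis=1)
    for _ in range(n_iter):
        # E step: the latent value at each level is conditionally normal.
        prec = w_x + n * w_ref + n * (beta[:, None] ** 2 * w_lab).sum(axis=0)
        num = (w_x * mu + w_ref * s_ref
               + (beta[:, None] * w_lab * (s_lab - n * alpha[:, None])).sum(axis=0))
        ex = num / prec                       # conditional mean of the latent value
        ex2 = ex**2 + 1.0 / prec              # conditional second moment
        # M step: closed-form maximizers of the expected complete log-likelihood.
        mu = ex
        a11 = n * w_lab.sum(axis=1)
        a12 = n * (w_lab * ex).sum(axis=1)
        a22 = n * (w_lab * ex2).sum(axis=1)
        b1 = (w_lab * s_lab).sum(axis=1)
        b2 = (w_lab * s_lab * ex).sum(axis=1)
        det = a11 * a22 - a12**2
        alpha = (a22 * b1 - a12 * b2) / det   # additive bias per laboratory
        beta = (a11 * b2 - a12 * b1) / det    # multiplicative bias per laboratory
    return alpha, beta, mu

# Simulated example with hypothetical values.
rng = np.random.default_rng(0)
m, n, p = 5, 30, 3
mu_true = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
sd_x, sd_ref, sd_lab = np.full(m, 0.25), np.full(m, 0.1), np.full((p, m), 0.1)
alpha_true = np.array([0.5, -0.3, 0.0])
beta_true = np.array([1.10, 0.95, 1.00])
x = rng.normal(mu_true, sd_x)                 # one latent value per level, shared by all labs
y_ref = x[:, None] + rng.normal(0.0, sd_ref[:, None], (m, n))
y_lab = (alpha_true[:, None, None] + beta_true[:, None, None] * x[None, :, None]
         + rng.normal(0.0, sd_lab[:, :, None], (p, m, n)))
alpha_hat, beta_hat, mu_hat = em_calibration(y_ref, y_lab, sd_x, sd_ref, sd_lab)
print(alpha_hat, beta_hat, mu_hat)
```

In this sketch the estimated level means follow the realized latent values rather than their means, which is one way to see why more replicas do not help to estimate the mean of the measurand.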
3. Asymptotic theory
In this section, we will develop the asymptotic theory necessary to prove the consistency and the asymptotic distribution of the MLE regarding the bias parameters. In the sequel, we will apply the regularity properties of the likelihood function to establish the asymptotic results, as proposed by Refs. [20,24,25]. For a given , we define
for every , where is the Borel σ-algebra. By applying Kolmogorov extension theorem, there exists a unique probability defined on such that
for every . The marginal distribution of the observed data will be denoted by and the marginal distribution of the unobserved variables will be denoted by . We say that when and where is a positive constant for every .
Let be the space of all matrices. The norm of the matrix A is . A sequence of matrices converges to a limit A if, and only if, . If the matrix A is positive definite, we write . In this case, denotes the symmetric positive square root of A. In the same way, for a given vector , we consider the norm of as follows .
We denote by the set of all vectors such that , where c is a positive constant. Moreover, we denote by the set of all random vectors with values in such that . Here, we take the random vector as function of the observed data .
Lemma 3.1
For any sequences and , there exists a positive semidefinite random matrix such that
Proof.
See supplementary material.
The random matrix was obtained in closed form expressions and its elements are given by
 has two important features. First, every component associated with is null, which means that we do not have enough information to estimate consistently. This is a consequence of the fact that, for any level , the same item (engine) is measured by all the laboratories under the same conditions. Second, the matrix is random and the model is therefore nonergodic. As a consequence, the score random process is nonergodic. Furthermore, the components associated with satisfy
| (4) |
We will denote by and the score vector and the observed information matrix without the components involving , respectively. We also denote by the random matrix without the components involving . Moreover, we denote the bias components of the vector of parameters by and the related MLE. Furthermore, the random matrices and depend only on the bias components of the vector of parameters. As a consequence, from now on, we will denote and by and , respectively.
Let be a sequence of real continuous function defined on a metric space, we say that converges uniformly in τ to if for every sequence . Let and be probabilities defined on the Borel subsets of a metric space depending on the arbitrary parameter τ, and let C be the space of real bounded uniformly continuous functions. We shall say that uniformly if
If Q is a metric space and , the family of probabilities is continuous in τ if whenever in Q.
As described in the proof of Lemma 3.1, the random matrix is a function of the unobserved variables and the parameter . Then, for each fixed , it is defined on the probability space . We denote by the distribution of the random matrix .
Lemma 3.2
Given the sequences and and a bounded continuous function, then
for every . Moreover, the distribution of is continuous in .
Proof.
See supplementary material.
Let be a vector and let be a sequence of parameters. We define a vector in such that as . As is a smooth function, we may write
| (5) |
where , and is random. As is a function of the observed data and , we conclude that . As a consequence, we obtain that as .
Theorem 3.3
Let be a sequence of parameters and let be a sequence of random vectors. Then, we have that
where with denoting the standard normal random vector on , independent of the random matrix . Furthermore, the random matrix is positive definite with probability one and it depends only on the bias components and the unobserved variables .
Proof.
See supplementary material.
In the sequel, we will calculate the asymptotic distribution of the MLE regarding the bias parameters . For every vector such that and , it follows from Equation (5) that
where , and , such that is random. As a consequence of Lemma 3.1 and Equation (4), we arrive at the following lemma.
Lemma 3.4
We have that
(6) for .
This lemma is crucial for understanding the behavior of the likelihood function with respect to the true parameter values . For n sufficiently large, the impact of the true values vanishes. Moreover, the maximum with respect to of the right side of Equation (6) satisfies
| (7) |
By applying Equation (6), for n sufficiently large, we conclude that corresponds closely to the value of that maximizes independent of the vector .
The maximum of , MLE , is given by
where corresponds to the MLE of the bias parameters . As a consequence, we conclude that
| (8) |
Summing up the results obtained from Equations (7) and (8), we arrive at the following theorem.
Theorem 3.5
The MLE of satisfies
uniformly in probability, where .
As a consequence of Theorems 3.3 and 3.5 and the continuous mapping theorem, we conclude that
| (9) |
Equation (9) and the continuous mapping theorem yield
| (10) |
From Equation (10), we know that the asymptotic distribution of the MLE regarding the bias parameters is not normal, because the matrix is random.
Corollary 3.6
The MLE related to the bias parameters satisfies
In the sequel, we will derive the usual Wald statistics to perform hypothesis testing about the bias parameters. In order to do it, it is necessary to derive Theorem 3.3 with . By applying Prohorov's theorem, we know that the sequence is uniformly tight. Then, for each , there exists a constant c>0 such that
| (11) |
As a consequence, with probability tending to one, .
Lemma 3.7
Let be a sequence of parameters. Then, we have that
Thus, we arrive at the following corollaries.
Corollary 3.8
Conditional on , the asymptotic distribution of is given by
Applying again Equation (9), Lemma 3.7 and the continuous mapping theorem, we arrive at the Wald statistic.
Corollary 3.9
We have that
In the next subsection, these results are used to develop the Wald test statistics to test the equivalence of all the laboratories with respect to the reference laboratory, as well as the composite hypothesis.
3.1. Equivalence among participant laboratories
In this subsection, we propose multiple hypothesis tests to assess the equivalence among the laboratory measurements with respect to the reference laboratory.
First, we will test for the equivalence of all laboratories with respect to the reference laboratory,
| (12) |
To test hypothesis (12), we may apply the Wald statistics as established in Corollary 3.9, i.e.
| (13) |
with for hypothesis defined in (12). Under the conditions established in Corollary 3.9, as indicated above has an asymptotic distribution.
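A Wald-type statistic of this kind is a quadratic form in the difference between the estimated and hypothesized bias parameters. The sketch below assumes that the weighting matrix is the observed information for the bias parameters evaluated at the MLE and that the reference distribution is chi-squared with as many degrees of freedom as tested parameters; the exact expression in (13) is not reproduced here, and the function names and numerical values are hypothetical. Because the tests in this paper involve pairs of parameters (one additive and one multiplicative bias per laboratory), the degrees of freedom are even, and the chi-squared tail has a simple closed form.

```python
import math
import numpy as np

def wald_statistic(theta_hat, theta_null, information):
    """Quadratic form (theta_hat - theta_null)' J (theta_hat - theta_null)."""
    diff = np.asarray(theta_hat, dtype=float) - np.asarray(theta_null, dtype=float)
    return float(diff @ np.asarray(information, dtype=float) @ diff)

def chi2_sf_even(x, df):
    """Upper tail of the chi-squared distribution for even degrees of freedom."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires even, positive df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # (x/2)^i / i!
        total += term
    return math.exp(-half) * total

# Hypothetical example: one laboratory, null values (0, 1) for (additive, multiplicative) bias.
w = wald_statistic([0.07, 0.97], [0.0, 1.0], [[50.0, 0.0], [0.0, 4000.0]])
p_value = chi2_sf_even(w, df=2)
print(w, p_value)
```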
In the sequel, we consider tests of the composite hypothesis
| (14) |
where is a vector-valued function such that the derivative matrix is continuous in and the . In order to develop these composite tests, consider the Taylor expansion
where , and . By applying the assumption on the continuity of , we arrive at the following expression:
Letting , we obtain
| (15) |
Theorem 3.10
The compound Wald statistic
(16) has an asymptotic distribution.
Proof.
See supplementary material.
If the null hypothesis (12) is rejected, the multiple test is performed,
| (17) |
Let , i.e. is representing the ijth element of the matrix , then for the hypothesis (17) can be written as
| (18) |
with asymptotic distribution.
As we are performing multiple tests, it is important to control the type I error probability. To control the familywise error rate, one may use, for example, the Simes–Hochberg procedure [9]. To provide a graphical analysis of the performance of the laboratory measurements with respect to the measurements of the reference laboratory, the result obtained in Theorem 3.10 can be used to obtain confidence regions. Next, we present a simulation study considering the tests given in (13) and (16).
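As a minimal sketch of familywise error control, the following implements the Hochberg step-up procedure together with the Holm step-down procedure for comparison. The function names and the p-values are hypothetical and are not taken from the study.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down: test the smallest p-values first, stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject

def hochberg_reject(p_values, alpha=0.05):
    """Hochberg step-up: start from the largest p-value and reject everything
    at or below the first one that meets its threshold."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank in range(m - 1, -1, -1):
        if p_values[order[rank]] <= alpha / (m - rank):
            for idx in order[: rank + 1]:
                reject[idx] = True
            break
    return reject

# Hypothetical p-values for four individual laboratory tests.
p_vals = [0.01, 0.04, 0.03, 0.005]
print(holm_reject(p_vals))      # rejects the tests with p-values 0.005 and 0.01
print(hochberg_reject(p_vals))  # rejects all four tests
```

With these values the step-up procedure rejects more hypotheses than the step-down one, which illustrates that Hochberg's procedure is never less powerful than Holm's at the same level.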
3.2. Simulation
In this subsection, we perform a simulation study to compare the behavior of the Wald test statistics developed in the previous section for different numbers of replicas, parameter values and nominal levels of the test. Considering the model defined in (1) and (2), with p = 5 (number of participant laboratories) and m = 5 (number of different engine rotation values), we generated 10,000 samples with 3, 7, 15 and 30 replicas. The parameters of the true unobserved value of the item under testing at the jth point ( , ) were assumed to be: and , for the mean values and and , for the standard deviations. We considered three sets of parameter values for the standard deviation related to the measurement error of each laboratory ( ) at the jth engine rotation value.
Moreover, we considered 1%, 5% and 10% for the nominal significance levels. The routines were implemented in Ref. [18].
Table 1 shows the mean value, standard deviation (sd) and mean square error (MSE) of the MLEs of the parameters obtained with the EM algorithm presented in Section 2.1 from the samples generated for under . Clearly, as the number of replicas increases, the sd and MSE decrease and the mean values of the estimates of and , , approach the true values. On the other hand, the same does not happen for the parameters , , because their estimator is not consistent. For and , the results were similar and are not shown.
Table 1.
MLEs of the parameters with .
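| Parameter | Mean (n = 3) | sd (n = 3) | MSE (n = 3) | Mean (n = 7) | sd (n = 7) | MSE (n = 7) |
|---|---|---|---|---|---|---|
| Additive bias, laboratory 2 | 0.0040 | 0.1264 | 0.0160 | 0.0029 | 0.0823 | 0.0068 |
| Additive bias, laboratory 3 | 0.0009 | 0.1328 | 0.0176 | 0.0025 | 0.0892 | 0.0080 |
| Additive bias, laboratory 4 | 0.0068 | 0.1242 | 0.0155 | 0.0040 | 0.0829 | 0.0069 |
| Additive bias, laboratory 5 | | 0.1299 | 0.0169 | 0.0037 | 0.0865 | 0.0075 |
| Multiplicative bias, laboratory 2 | 0.9999 | 0.0067 | 0.0000 | 0.9995 | 0.0044 | 0.0000 |
| Multiplicative bias, laboratory 3 | 0.9999 | 0.0072 | 0.0001 | 0.9999 | 0.0047 | 0.0000 |
| Multiplicative bias, laboratory 4 | 0.999 | 0.0066 | 0.0000 | 0.9999 | 0.0044 | 0.0000 |
| Multiplicative bias, laboratory 5 | 1.0003 | 0.0070 | 0.0000 | 0.9996 | 0.0047 | 0.0000 |
| Mean of the measurand, level 1 | 10.00987 | 0.2489 | 0.0620 | 9.9919 | 0.2485 | 0.0618 |
| Mean of the measurand, level 2 | 20.0023 | 0.3070 | 0.0942 | 20.0008 | 0.3169 | 0.1004 |
| Mean of the measurand, level 3 | 30.0034 | 0.4192 | 0.1757 | 30.0090 | 0.3999 | 0.1600 |
| Mean of the measurand, level 4 | 40.0121 | 0.4669 | 0.2182 | 40.0045 | 0.4525 | 0.2048 |
| Mean of the measurand, level 5 | 50.0373 | 0.5577 | 0.3124 | 50.0074 | 0.5334 | 0.2846 |

| Parameter | Mean (n = 15) | sd (n = 15) | MSE (n = 15) | Mean (n = 30) | sd (n = 30) | MSE (n = 30) |
|---|---|---|---|---|---|---|
| Additive bias, laboratory 2 | 0.0011 | 0.0585 | 0.0034 | 0.0003 | 0.0387 | 0.0014 |
| Additive bias, laboratory 3 | −0.0006 | 0.0577 | 0.0033 | 0.0009 | 0.0393 | 0.0015 |
| Additive bias, laboratory 4 | 0.0009 | 0.0586 | 0.0034 | 0.0027 | 0.0404 | 0.0016 |
| Additive bias, laboratory 5 | 0.0005 | 0.0556 | 0.0031 | 0.0004 | 0.0402 | 0.00165 |
| Multiplicative bias, laboratory 2 | 1.0000 | 0.0031 | 0.0000 | 1.0000 | 0.0021 | 0.00000 |
| Multiplicative bias, laboratory 3 | 1.0001 | 0.0032 | 0.0000 | 1.0000 | 0.0021 | 0.00000 |
| Multiplicative bias, laboratory 4 | 1.0000 | 0.0032 | 0.0000 | 1.0000 | 0.00222 | 0.0000 |
| Multiplicative bias, laboratory 5 | 1.0001 | 0.0030 | 0.0000 | 1.0000 | 0.0022 | 0.0000 |
| Mean of the measurand, level 1 | 10.0045 | 0.2506 | 0.0628 | 9.9983 | 0.2352 | 0.0553 |
| Mean of the measurand, level 2 | 19.9984 | 0.3189 | 0.1017 | 20.0020 | 0.2904 | 0.0841 |
| Mean of the measurand, level 3 | 29.9945 | 0.3933 | 0.1547 | 30.0221 | 0.3831 | 0.1472 |
| Mean of the measurand, level 4 | 40.0132 | 0.4678 | 0.2190 | 40.0187 | 0.4444 | 0.1978 |
| Mean of the measurand, level 5 | 49.9891 | 0.5234 | 0.2741 | 50.0036 | 0.5032 | 0.2753 |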
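The summary columns of Table 1 can be computed from the Monte Carlo estimates as sketched below. The function name and the sample values are hypothetical; the sketch only assumes that the MSE is the average squared distance between the estimates and the true parameter value.

```python
import math

def summarize(estimates, true_value):
    """Mean, sample standard deviation and mean square error of Monte Carlo estimates."""
    n = len(estimates)
    mean = sum(estimates) / n
    sd = math.sqrt(sum((e - mean) ** 2 for e in estimates) / (n - 1))
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return mean, sd, mse

# Hypothetical estimates of a multiplicative bias whose true value is 1.
mean, sd, mse = summarize([0.9, 1.1, 1.0, 1.2], true_value=1.0)
print(mean, sd, mse)
```

Since the MSE is approximately the squared bias plus the variance, the columns can be cross-checked: for the first row of Table 1 with n = 3, and assuming a true value of 0, 0.0040² + 0.1264² ≈ 0.0160, which matches the reported MSE.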
Next, we consider the test for the equivalence of all laboratories with respect to the reference laboratory:
We obtained the empirical significance levels considering the test obtained in (13). The results are summarized in Table 2.
Table 2.
Empirical sizes for the Wald test statistics for the test .
| n | Set 1: 1% | Set 1: 5% | Set 1: 10% | Set 2: 1% | Set 2: 5% | Set 2: 10% | Set 3: 1% | Set 3: 5% | Set 3: 10% |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 0.012 | 0.059 | 0.114 | 0.023 | 0.084 | 0.15 | 0.043 | 0.126 | 0.202 |
| 7 | 0.011 | 0.053 | 0.106 | 0.015 | 0.065 | 0.127 | 0.019 | 0.076 | 0.140 |
| 15 | 0.011 | 0.053 | 0.102 | 0.011 | 0.058 | 0.109 | 0.017 | 0.068 | 0.124 |
| 30 | 0.010 | 0.053 | 0.107 | 0.011 | 0.053 | 0.102 | 0.012 | 0.056 | 0.110 |
It can be noticed that as the number of replicas ( ) increases, the empirical sizes approach the nominal sizes. Also, for the first set of parameter values for the standard deviation of the measurement error of the laboratories ( ), the nominal and empirical values are close even for a small number of replicas; however, as these standard deviations increase ( and ), a larger number of replicas is needed.
Furthermore, to simulate the power of the test for the equivalence of all laboratories with respect to the reference laboratory, we considered a gradual departure from the null hypothesis for the second and fourth laboratories and obtained the percentage of observed values of the test statistic that were greater than the quantile of the chi-squared distribution with 8 degrees of freedom.
Figure 1 shows the power of the test for different numbers of replicas ( and 30) with the parameter of the standard deviation of the measurement error of each laboratory at the jth engine rotation value given by and , respectively. Notice that in both cases, as the number of replicas increases, the power of the test increases.
Figure 1.
Simulated power for the Wald test statistics with and for the test .
Figure 2 shows the power of the test as the standard deviation of the measurement error of the laboratories increases from to for a fixed number of replicas. In all cases, the power of the test under is greater than under . Another point to observe is that the distance between the two curves (power under and power under ) diminishes as the number of replicas increases.
Figure 2.
Simulated power for the Wald test statistics with and for the test .
Next, without loss of generality we consider the second laboratory to test for the equivalence of a laboratory with respect to the reference laboratory:
We obtained the empirical significance levels considering the test obtained in (18). The results are summarized in Table 3 and lead to the same conclusions as those given for Table 2 for the equivalence of all laboratories with respect to the reference laboratory.
Table 3.
Empirical sizes for the Wald test statistics for the test
| n | Set 1: 1% | Set 1: 5% | Set 1: 10% | Set 2: 1% | Set 2: 5% | Set 2: 10% | Set 3: 1% | Set 3: 5% | Set 3: 10% |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 0.016 | 0.065 | 0.126 | 0.024 | 0.081 | 0.147 | 0.035 | 0.114 | 0.189 |
| 7 | 0.010 | 0.051 | 0.101 | 0.017 | 0.070 | 0.129 | 0.023 | 0.088 | 0.151 |
| 15 | 0.010 | 0.051 | 0.101 | 0.013 | 0.061 | 0.113 | 0.016 | 0.068 | 0.126 |
| 30 | 0.008 | 0.048 | 0.102 | 0.012 | 0.053 | 0.102 | 0.013 | 0.063 | 0.120 |
Figure 3 shows the power of the test when the standard deviations of the measurement error of the laboratories are given by and for different numbers of replicas. As the number of replicas increases, the power of the test increases; in addition, when the standard deviation of the measurement error of the laboratories increases, the power decreases.
Figure 3.
Simulated power for the Wald test statistics with and for the test
Furthermore, we considered a simulation study where the data set has the same characteristics as the data set considered in the Application Section, i.e. the same number of laboratories, engine rotation values, number of replicas for each laboratory, and the values of and , , and . For the values of , and , the MLEs of these parameters were considered. and . See Table 6 for the estimates of and , . We generated 10,000 samples.
Table 6.
MLEs of the bias parameters.
| Laboratory i | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|
| Additive bias estimate | 0.0700 | 0.1000 | 0.0658 | 0.2183 | 0.1288 | −0.0315 | 0.0063 |
| Multiplicative bias estimate | 0.9661 | 0.9856 | 0.9957 | 0.9871 | 0.9983 | 0.9745 | 0.9913 |
For the test of equivalence of all laboratories with respect to the reference laboratory:
we obtained the empirical significance levels for , and , considering the test obtained in (13), which were 0.012, 0.058 and 0.116, respectively.
Moreover, to simulate the power of the test, we considered a gradual departure from the null hypothesis for the second and fourth laboratories and obtained the percentage of observed values of the test statistic that were greater than the quantile of the chi-squared distribution with 14 degrees of freedom. The left-hand panel of Figure 4 shows the corresponding power of the test.
Figure 4.
Simulated power for the Wald test statistics: (left panel) and (right panel).
Next, we consider the second laboratory to test for the equivalence of a laboratory with respect to the reference laboratory:
The empirical significance levels for , and , considering the test obtained in (18), were 0.015, 0.063 and 0.121, respectively. The right-hand panel of Figure 4 shows the power of the test.
In the next section, we apply the developed results for the real data set used in the stability study to show the usefulness of the proposed methodology.
4. Application
PT determines the performance of individual laboratories for specific measurements. The measurement procedure is developed under the coordination of the reference laboratory (one accredited laboratory). A set of detailed instructions is developed to enable participant laboratories to carry out a measurement without additional information. To specify the item under testing, it is necessary to develop and characterize a suitable item. The main property of the item is its stability over time.
In our illustration, the GM power train developed an engine in which all basic engine parameters were locked to reduce variability in the engine power measurements. Then, the engine was tested on an engine dynamometer over a period of time to prove the stability and characterize the measurand for each rotation value j, . By applying statistical process control techniques, GM power train estimated the stable variance of the measurand under the specified conditions described in the measurement procedure.
In the sequel, the item (engine) under testing was sent to the participating laboratories. Each laboratory measured the item (engine power) according to a given set of instructions and reported its results together with the uncertainty to the administrator.
The engine power of the standard engine was measured by eight (p) laboratories at nine (m) engine rotation values. The natural variability ( ) associated with the true unobserved values was evaluated during the stability study and can be found in Table 4.
Table 4.
Standard deviation of the true engine power measurements ( ) at the jth engine rotation value.
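| j = 1 | j = 2 | j = 3 | j = 4 | j = 5 | j = 6 | j = 7 | j = 8 | j = 9 |
|---|---|---|---|---|---|---|---|---|
| 0.0877 | 0.1600 | 0.2720 | 0.3161 | 0.3760 | 0.4480 | 0.4760 | 0.5000 | 0.5080 |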
The variance ( ) of the measurement error corresponding to the ith laboratory at the rotation value, , was determined by the combined variance calculated and provided by the ith laboratory following the procedure proposed by ISO GUM [11], as described in the supplemental material. These values can be found in Table 5. The measurements of each laboratory can be found in Online Resource.
Table 5.
Measurement error standard deviations for the ith laboratory (columns) at the jth engine rotation value (rows).
| Rotation | i = 1 | i = 2 | i = 3 | i = 4 | i = 5 | i = 6 | i = 7 | i = 8 |
|---|---|---|---|---|---|---|---|---|
| j = 1 | 0.0825 | 0.0735 | 0.0224 | 0.0900 | 0.2232 | 0.1005 | 0.1068 | 0.1578 |
| j = 2 | 0.1466 | 0.1304 | 0.0424 | 0.1622 | 0.3984 | 0.1808 | 0.1929 | 0.2881 |
| j = 3 | 0.2486 | 0.2216 | 0.0707 | 0.2739 | 0.6715 | 0.3058 | 0.3208 | 0.4869 |
| j = 4 | 0.2912 | 0.2590 | 0.0831 | 0.3211 | 0.7918 | 0.3578 | 0.3788 | 0.5745 |
| j = 5 | 0.3450 | 0.3081 | 0.0985 | 0.3803 | 0.9317 | 0.4250 | 0.4540 | 0.6740 |
| j = 6 | 0.4111 | 0.3665 | 0.1166 | 0.4511 | 1.1026 | 0.5052 | 0.5403 | 0.7949 |
| j = 7 | 0.4409 | 0.3918 | 0.1253 | 0.4830 | 1.1805 | 0.5374 | 0.5751 | 0.8511 |
| j = 8 | 0.4627 | 0.4062 | 0.1300 | 0.5021 | 1.2229 | 0.5560 | 0.5992 | 0.8838 |
| j = 9 | 0.4717 | 0.4136 | 0.1327 | 0.5114 | 1.2386 | 0.5644 | 0.6132 | 0.8978 |
First, using the EM algorithm presented in Section 2, we obtained the MLEs of the bias parameters, which are given in Table 6.
Thereafter, using the test given in (13), we tested the equivalence of all laboratories with respect to the reference laboratory. Based on the value of the test statistic, we conclude that the group of laboratories is not consistent: at least one laboratory has a significant multiplicative or additive bias.
Thus, we performed the multiple tests given in Table 7, which indicate that the fourth and fifth laboratories are consistent with the reference laboratory at the 1% significance level. Because several tests were performed, we applied the corrections developed by Hochberg (1988), Holm (1979) and Hommel (1988) to control the type I error probability for the family of tests.
After applying the corrections, we conclude that laboratories 4, 5 and 6 are compliant with the reference laboratory, with a familywise error rate smaller than 1%.
Table 7.
Wald test statistics for the hypothesis that the ith laboratory has neither additive nor multiplicative bias, with respective p-values.
| Laboratory i | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|
| Wald statistic | 517.2679 | 69.3573 | 1.9682 | 6.6394 | 10.9409 | 324.5544 | 17.5634 |
| p-Value | 0.0000 | 0.0000 | 0.3738 | 0.0362 | 0.0042 | 0.0000 | 0.0002 |
| p-Value (Holm) | 0.0000 | 0.0000 | 0.3738 | 0.0723 | 0.0126 | 0.0000 | 0.0006 |
| p-Value (Hochberg) | 0.0000 | 0.0000 | 0.3738 | 0.0723 | 0.0126 | 0.0000 | 0.0006 |
| p-Value (Hommel) | 0.0000 | 0.0000 | 0.3738 | 0.0723 | 0.0126 | 0.0000 | 0.0006 |
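The p-values in Table 7 can be reproduced from the reported Wald statistics under the assumption that each statistic is referred to a chi-square distribution with 2 degrees of freedom (one additive and one multiplicative bias parameter per laboratory). This reference distribution is inferred here from the reported values rather than quoted from the text. For 2 degrees of freedom the survival function is exp(-w/2), so no statistical library is needed. The following minimal sketch also applies the Holm and Hochberg adjustments; the Hommel adjustment is omitted for brevity, and all function and variable names are illustrative.

```python
import math

# Wald statistics from Table 7, keyed by laboratory index i.
wald = {2: 517.2679, 3: 69.3573, 4: 1.9682, 5: 6.6394,
        6: 10.9409, 7: 324.5544, 8: 17.5634}


def chi2_sf_2df(w):
    """Survival function of a chi-square variable with 2 degrees of freedom.

    Assumption: 2 degrees of freedom, one per bias parameter.
    """
    return math.exp(-w / 2.0)


def holm(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda k: pvals[k])  # ascending p-values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, k in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        running_max = max(running_max, min(1.0, (m - rank) * pvals[k]))
        adjusted[k] = running_max
    return adjusted


def hochberg(pvals):
    """Hochberg step-up adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda k: pvals[k], reverse=True)  # descending
    adjusted = [0.0] * m
    running_min = 1.0
    for rank, k in enumerate(order):
        # The largest p-value is multiplied by 1, the next by 2, ...
        running_min = min(running_min, (rank + 1) * pvals[k])
        adjusted[k] = running_min
    return adjusted


labs = sorted(wald)
raw = [chi2_sf_2df(wald[i]) for i in labs]
holm_adj = holm(raw)
hochberg_adj = hochberg(raw)

for i, p, ph, pc in zip(labs, raw, holm_adj, hochberg_adj):
    print(f"lab {i}: p = {p:.4f}, Holm = {ph:.4f}, Hochberg = {pc:.4f}")
```

Rounded to four decimals, the raw and adjusted values agree with the corresponding rows of Table 7. The Holm and Hochberg results coincide for these data because the scaled p-values are already monotone.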
Subsequently, we constructed confidence regions for the seven participant laboratories, with the confidence coefficient adjusted by the Bonferroni correction so that the familywise error rate is controlled. These regions can be found in Figure 5. A visual inspection confirms that laboratories 4, 5 and 6 are compliant with the reference laboratory. Moreover, none of the seven laboratories shows additive bias.
Figure 5.
Joint confidence regions for the participant laboratories.
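Whether the point of no bias lies inside a laboratory's joint confidence region can also be checked numerically: a Wald-type region built from the same quadratic form as the test contains the null point exactly when the Wald statistic does not exceed the critical value. The sketch below rests on several assumptions made for illustration only: a chi-square reference distribution with 2 degrees of freedom, a familywise level of 1% split equally over the seven participant laboratories by the Bonferroni correction, and regions based on the Wald statistics of Table 7. The variable names are hypothetical.

```python
import math

# Wald statistics from Table 7, keyed by laboratory index i.
wald = {2: 517.2679, 3: 69.3573, 4: 1.9682, 5: 6.6394,
        6: 10.9409, 7: 324.5544, 8: 17.5634}

familywise_alpha = 0.01   # assumed familywise error level
n_regions = len(wald)     # seven participant laboratories


def chi2_quantile_2df(prob):
    """Quantile of a chi-square variable with 2 degrees of freedom.

    Inverts the distribution function F(w) = 1 - exp(-w / 2).
    """
    return -2.0 * math.log(1.0 - prob)


# Bonferroni: each region uses confidence coefficient 1 - alpha / k.
per_region_confidence = 1.0 - familywise_alpha / n_regions
critical_value = chi2_quantile_2df(per_region_confidence)

# Laboratories whose region contains the point of no bias.
inside = [lab for lab in sorted(wald) if wald[lab] <= critical_value]

print(f"per-region confidence: {per_region_confidence:.6f}")
print(f"critical value: {critical_value:.4f}")
print(f"laboratories with the null point inside the region: {inside}")
```

Under these assumptions the critical value is about 13.10, and the laboratories retained are 4, 5 and 6, which matches the conclusion drawn from Figure 5.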
5. Discussion
In this work, we propose a strategy to evaluate PT results with a multivariate response. This is the most common case in PT because, in general, the item under test is measured at several levels.
PT determines the performance of individual laboratories for specific measurements, which means that PT compares the measurement results obtained by different laboratories. The usual comparative calibration model assumes that the measurand (x) is independent among the laboratories (participants and reference), as described in Refs. [4,7]. However, this is not the case in many PT programs. Because there is only one item under test (one engine), there is a natural dependency among all measurements at the same level (rotation). To fill this gap in the literature on measurement comparison models, we introduce a suitable ultrastructural model that accommodates this dependency.
As a consequence of this dependency, the observed information matrix does not converge to the expected information matrix, and the usual asymptotic theory is not applicable. In fact, the observed information matrix converges in probability to a random matrix. Furthermore, this random matrix has null components related to the mean of the measurand, and the corresponding component of the score function also converges in probability to zero. In this paper, we extended the asymptotic theory developed in Refs. [20,24,25] to derive a Wald-type test in this scenario, which is the basis for assessing the competence of the participant laboratories.
The asymptotic theory developed in this work is based on two results. First, the observed information matrix converges in probability to a random matrix (see Lemma 3.1). Second, using the fact that the log-likelihood function is smooth, we extend a result in Ref. [20] to obtain Theorem 3.3. We then apply standard arguments from asymptotic theory to derive a Wald-type test. Consequently, for any smooth log-likelihood function satisfying Lemma 3.1 and Theorem 3.3, the results of this paper can be applied to derive a Wald-type test.
To assess the behavior of the asymptotic results, a simulation study was performed. In general, we conclude that the performance of the asymptotic results is closely related to the sample size and to the magnitude of the variance components. Because the variance components of the reference laboratory are known before the start of the PT program, the empirical power function can be used to determine the sample size.
To illustrate the developed methodology, we analyzed the results of the proficiency test related to the engine power in the Application Section. In the real data set considered here, we have eight laboratories including the reference laboratory. At the beginning of the program, the reference laboratory evaluated the stability of the engine under test and determined the component of variance related to the true value. To ensure comparability of results, the reference laboratory measured the engine at the beginning and end of the PT program. Each participating laboratory reported its measurements and respective uncertainties. The results of the participant laboratories were compared with the results of the reference laboratory using Wald statistics, as presented in Section 4.
Although the proposed methodology was illustrated with PT results, it can be applied in any situation where measurements obtained by different methods are to be compared with a reference value.
Preprint
arXiv:2011.00640 [math.ST]
Funding Statement
The research was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. The research was carried out using the computational resources of the Center for Mathematical Sciences Applied to Industry (CeMEAI), funded by FAPESP (grant number 2013/07375-0).
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1. Altman D.G. and Bland J.M., Comparison of methods of measuring blood pressure, J. Epidemiol. Community Health 40 (1986), pp. 274–277.
- 2. Barnett V.D., Simultaneous pairwise linear structural relationships, Biometrics 25 (1969), pp. 129–142.
- 3. Cheng C.-L. and Van Ness J.W., Statistical Regression with Measurement Error, Kendall's Library of Statistics, Vol. 6, John Wiley & Sons, New York, 1997.
- 4. Cheng C.-L. and Van Ness J.W., Statistical Regression with Measurement Error, Arnold, London and Oxford University Press, New York, 1999.
- 5. Dempster A.P., Laird N.M., and Rubin D.B., Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B (Methodol.) 39 (1977), pp. 1–38.
- 6. EA-4/18, 2010. Available at http://www.european-accreditation.org/publication/ea-4-18-inf-rev00-june-2010
- 7. Giménez P. and Patat M.L., Local influence for functional comparative calibration models with replicated data, Stat. Pap. 55 (2014), pp. 431–454.
- 8. Gleser L.J., Assessing uncertainty in measurement, Stat. Sci. 13 (1998), pp. 277–290.
- 9. Hochberg Y., A sharper Bonferroni procedure for multiple tests of significance, Biometrika 75 (1988), pp. 800–802.
- 10. ISO 13528, Statistical methods for use in proficiency testing by interlaboratory comparisons, Tech. Rep., International Organization for Standardization, Geneva, 2015.
- 11. ISO GUM, Guide to the expression of uncertainty in measurement (GUM), BIPM, IEC, IFCC, IUPAC, IUPAP, OIML, 1995.
- 12. ISO, IEC 17043, Conformity assessment general requirements for proficiency testing, Tech. Rep., International Organization for Standardization/International Electrotechnical Commission, Geneva, 2010.
- 13. Jaech J.L., Statistical Analysis of Measurement Errors, Vol. 2, Wiley, New York, 1985.
- 14. Kimura D.K., Functional comparative calibration using an EM algorithm, Biometrics 48 (1992), pp. 1263–1271.
- 15. Linsinger T.P.J., Kandler W., Krska R., and Grasserbauer M., The influence of different evaluation techniques on the results of interlaboratory comparisons, Accreditation Qual. Assur. 3 (1998), pp. 322–327.
- 16. Page G.L. and Vardeman S.B., Using Bayes methods and mixture models in inter-laboratory studies with outliers, Accreditation Qual. Assur. 15 (2010), pp. 379–389.
- 17. Pinto D.L., Aoki R., and Silva G.F., Statistical analysis of proficiency testing results under elliptical distributions, Comput. Stat. Data Anal. 53 (2009), pp. 1427–1439. ISSN 0167-9473. DOI: 10.1016/j.csda.2008.12.003. Available at http://www.sciencedirect.com/science/article/pii/S0167947308005720
- 18. R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2016. Available at https://www.R-project.org/
- 19. Rosario P., Martínez J.L., and Miguel Silván J., Comparison of different statistical methods for evaluation of proficiency test data, Accreditation Qual. Assur. 13 (2008), pp. 493–499.
- 20. Sweeting T.J., Uniform asymptotic normality of the maximum likelihood estimator, Ann. Stat. 8 (1980), pp. 1375–1381. ISSN 00905364. Available at http://www.jstor.org/stable/2240949
- 21. Theobald C.M. and Mallinson J.R., Comparative calibration, linear structural relationships and congeneric measurements, Biometrics 34 (1978), pp. 39–45.
- 22. Toman B., Bayesian approaches to calculating a reference value in key comparison experiments, Technometrics 49 (2007), pp. 81–87.
- 23. Vilca-Labra F., Aoki R., and Zeller C.B., Hypotheses testing for structural calibration model, Stat. Pap. 52 (2011), pp. 553–565.
- 24. Weiss L., Asymptotic properties of maximum likelihood estimators in some nonstandard cases, J. Am. Stat. Assoc. 66 (1971), pp. 345–350.
- 25. Weiss L., Asymptotic properties of maximum likelihood estimators in some nonstandard cases, II, J. Am. Stat. Assoc. 68 (1973), pp. 428–430.





