Abstract
Positive and negative predictive values are important measures of the performance of a medical diagnostic test. We consider testing equality of two positive or two negative predictive values within a paired design in which all patients receive two diagnostic tests. The existing statistical tests for testing equality of predictive values are either Wald tests based on the multinomial distribution or the empirical Wald and generalized score tests within the generalized estimating equations (GEE) framework. As presented in the literature, these test statistics have complex formulas that offer little intuitive insight. We propose re-formulations that are mathematically equivalent but algebraically simple and intuitive. As the new re-formulation makes clear, the generalized score statistic does not always reduce to the commonly used score statistic in the independent samples case. To alleviate this, we introduce a weighted generalized score (WGS) test statistic that incorporates the empirical covariance matrix with newly proposed weights. This statistic is simple to compute, always reduces to the score statistic in the independent samples situation, and preserves type I error better than the other statistics, as demonstrated by simulations. Thus, we believe the proposed WGS statistic is the preferred statistic for testing equality of two predictive values and for the corresponding sample size computations. The new formulas for the Wald statistics may be useful for easy computation of confidence intervals for the difference of predictive values. The introduced concepts have the potential to lead to development of the weighted generalized score test statistic in a general GEE setting.
Keywords: Correlated data, Diagnostic test, Paired design, Positive and negative predictive values, Score statistic
1 Introduction
Diagnostic tests are important in medicine, especially when the gold standard for ascertainment of disease is invasive or expensive. We consider diagnostic tests with a binary result and assume a gold standard is available for comparison with the test result. As a motivating example, Table 1 presents the coronary artery disease data [1]. This is a paired design, in which all patients receive both diagnostic tests, and it is the situation we consider. The gold standard for coronary artery disease is coronary angiography and the diagnostic tests are the exercise stress test (Test 1) and clinical history of chest pain (Test 2). Two important measures used to evaluate diagnostic tests are positive and negative predictive values. Positive predictive value (ppv) is the probability of disease (as identified by the gold standard) when the diagnostic test is positive, and negative predictive value (npv) is the probability of no disease when the diagnostic test is negative. We consider testing equality of predictive values of two diagnostic tests. Although a joint comparison of positive and negative predictive values has been considered [2, 3], we focus here on comparing positive and negative predictive values separately.
Table 1.
Coronary artery disease data
Test 1 result | CAD: chest pain (Test 2) | CAD: no chest pain (Test 2) | No CAD: chest pain (Test 2) | No CAD: no chest pain (Test 2) |
---|---|---|---|---|
Positive stress test | 473 | 29 | 22 | 46 |
Negative stress test | 81 | 25 | 44 | 151 |
CAD = coronary artery disease
Since both tests are measured on the same patients, the estimated predictive values of the two tests are correlated. To account for this when testing equality of predictive values, the generalized estimating equations (GEE) approach [4] with the generalized score test [5, 6] was considered [7]. Alternatively, data in Table 1 were considered to have a multinomial distribution and application of the δ-technique provided Wald test statistics [2, 3, 8]. These test statistics, as reported in the literature, tend to have rather complex formulas that are not intuitive. Sometimes, the published formulas differ algebraically even though the test statistics are mathematically equivalent. Thus, there is a need for simple formulas like those that exist for testing equality of two proportions in the two independent samples case.
We propose such novel, simple, and intuitive algebraic re-formulations of the existing test statistics for comparing predictive values. The new formulas facilitate comparisons between the Wald statistics and with the generalized score statistic. A new formulation of the generalized score statistic clearly shows that if each patient receives only one test (unpaired design) then this statistic does not always reduce to the score test statistic commonly used in the independent samples situation. Motivated by this, we propose a new generalized score based statistic for testing equality of two predictive values. The suggested statistic always reduces to the score statistic with independent samples. We call this new statistic the weighted generalized score (wgs) test statistic because it results from consideration of weights when computing empirical covariance matrix needed for the generalized score statistic. The new statistic has superior type I error behavior when compared to the other available test statistics, as demonstrated with simulations.
The paper is organized as follows. Section 2 develops simple formulations of the multinomial based Wald statistics without and with transformation (log, logit) of predictive values. Section 3 deals with the GEE based test statistics resulting from consideration of a logistic model. Section 3.1 derives simple formulation of the empirical Wald statistic and shows that it is equivalent to the multinomial based Wald statistic with the logit transformed predictive values. Section 3.2 derives two algebraically different but mathematically equivalent formulations of the generalized score statistic. The first formulation shows that the generalized score statistic can be quite similar to the multinomial based Wald statistic and the second formulation explains why in the independent samples situation the generalized score statistic does not reduce to the commonly used score statistic. Section 4.1 utilizes these insights to motivate a need for weights when computing the empirical covariance matrix. Section 4.2 presents the proposed weights and the resulting new weighted generalized score statistics. Section 5 contains simulations, Section 6 a detailed example, and discussion follows in Section 7.
2 Proposed new formulations of the multinomial based Wald statistics
This section introduces notations and develops simple formulations of the multinomial based Wald statistics without and with transformation of predictive values. The log and logit transformations are considered.
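Table 2 displays the general notation for paired study design data. Marginal counts are denoted by a dot in the dimension over which marginalization occurs. For brevity we denote n = n•••. Using this notation, the estimate of ppv for Test 1 is $\widehat{ppv}_1 = n_{+\bullet D}/n_{+\bullet\bullet}$ and for Test 2 is $\widehat{ppv}_2 = n_{\bullet+D}/n_{\bullet+\bullet}$. Similarly, $\widehat{npv}_1 = n_{-\bullet\bar D}/n_{-\bullet\bullet}$ and $\widehat{npv}_2 = n_{\bullet-\bar D}/n_{\bullet-\bullet}$, where $\bar D$ denotes no disease.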
Table 2.
Paired design data notations
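Test 1 result | Disease ($D$): Test 2 positive (+) | Disease ($D$): Test 2 negative (−) | No disease ($\bar D$): Test 2 positive (+) | No disease ($\bar D$): Test 2 negative (−) |
---|---|---|---|---|
Positive (+) | $n_{++D}$ | $n_{+-D}$ | $n_{++\bar D}$ | $n_{+-\bar D}$ |
Negative (−) | $n_{-+D}$ | $n_{--D}$ | $n_{-+\bar D}$ | $n_{--\bar D}$ |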
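The notation can be made concrete with a short Python sketch that computes marginal counts and the four estimated predictive values from a paired table. The dictionary layout and the names `CAD`, `margin` and `predictive_values` are illustrative assumptions and do not come from the paper; the counts are those of Table 1.

```python
# Cell counts keyed by (Test 1 result, Test 2 result, disease status).
# "+"/"-" are test results; "D" is disease, "N" is no disease (Table 1 data).
# The layout and all names here are illustrative assumptions.
CAD = {
    ("+", "+", "D"): 473, ("+", "-", "D"): 29,
    ("-", "+", "D"): 81, ("-", "-", "D"): 25,
    ("+", "+", "N"): 22, ("+", "-", "N"): 46,
    ("-", "+", "N"): 44, ("-", "-", "N"): 151,
}


def margin(counts, t1=None, t2=None, d=None):
    """Marginal count; an argument left as None plays the role of a dot."""
    return sum(
        n
        for (a, b, c), n in counts.items()
        if t1 in (None, a) and t2 in (None, b) and d in (None, c)
    )


def predictive_values(counts):
    """Estimated ppv and npv of both tests as ratios of marginal counts."""
    return {
        "ppv1": margin(counts, t1="+", d="D") / margin(counts, t1="+"),
        "ppv2": margin(counts, t2="+", d="D") / margin(counts, t2="+"),
        "npv1": margin(counts, t1="-", d="N") / margin(counts, t1="-"),
        "npv2": margin(counts, t2="-", d="N") / margin(counts, t2="-"),
    }


if __name__ == "__main__":
    for name, value in predictive_values(CAD).items():
        print(f"{name} = {value:.3f}")
```

Calling `margin(CAD)` with no arguments returns the total sample size, the analogue of n = n•••.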
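Several statistics for comparing predictive values are derived by considering the cells in Table 2 to have a multinomial distribution with cell probabilities $\pi$. Let $g(\pi) = f(ppv_1) - f(ppv_2) = f(\pi_{+\bullet D}/\pi_{+\bullet\bullet}) - f(\pi_{\bullet+D}/\pi_{\bullet+\bullet})$ and $h(\pi) = f(npv_1) - f(npv_2) = f(\pi_{-\bullet\bar D}/\pi_{-\bullet\bullet}) - f(\pi_{\bullet-\bar D}/\pi_{\bullet-\bullet})$, where f(·) is a monotone differentiable function. Utilizing the δ-technique, the Wald statistics for testing H0 : ppv1 = ppv2 and H0 : npv1 = npv2 are thus
$$T^{ppv} = \frac{\{g(\hat\pi)\}^2}{\widehat{\mathrm{var}}\{g(\hat\pi)\}}, \qquad T^{npv} = \frac{\{h(\hat\pi)\}^2}{\widehat{\mathrm{var}}\{h(\hat\pi)\}},$$
where $\hat\pi$ is the vector of observed cell proportions and the variances are the δ-technique estimates. The test statistics are asymptotically distributed as $\chi^2_1$ under the corresponding null hypotheses.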
For f(x) = x, the formulas for the denominators of the above test statistics in terms of counts in Table 2 were provided in [3, 8]. The formulas presented in [3] and [8] are mathematically equivalent but differ algebraically. These formulas are rather complex with no easy intuitive insight. Transformation f(x) = log x was considered in [2, 8]. Similarly, the provided formulas while mathematically equivalent are different algebraically, and are also rather involved and not easily intuitive. Below we present novel, simple, and intuitive algebraic re-formulations of these multinomial based Wald test statistics.
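Considering f(x) = x and the estimated variance-covariance matrix of $\widehat{ppv}_1$ and $\widehat{ppv}_2$ as derived in Appendix A, we have
$$T_m^{ppv} = \frac{(\widehat{ppv}_1 - \widehat{ppv}_2)^2}{\dfrac{\widehat{ppv}_1(1-\widehat{ppv}_1)}{n_{+\bullet\bullet}} + \dfrac{\widehat{ppv}_2(1-\widehat{ppv}_2)}{n_{\bullet+\bullet}} - 2C^{ppv}}$$
with
$$C^{ppv} = \frac{n_{++D}(1-\widehat{ppv}_1)(1-\widehat{ppv}_2) + n_{++\bar D}\,\widehat{ppv}_1\widehat{ppv}_2}{n_{+\bullet\bullet}\,n_{\bullet+\bullet}}. \tag{1}$$
Here, the variance of $\widehat{ppv}_1 - \widehat{ppv}_2$ is explicitly expressed as the sum of the commonly used binomial variances of $\widehat{ppv}_1$ and $\widehat{ppv}_2$ minus two times their covariance. The covariance depends on the estimates of positive predictive values and on the frequencies of concordant positive test results for the diseased ($n_{++D}$) and disease free ($n_{++\bar D}$) groups.
Considering f(x) = log x, we have
$$T_{\log}^{ppv} = \frac{(\log\widehat{ppv}_1 - \log\widehat{ppv}_2)^2}{\dfrac{1-\widehat{ppv}_1}{n_{+\bullet\bullet}\,\widehat{ppv}_1} + \dfrac{1-\widehat{ppv}_2}{n_{\bullet+\bullet}\,\widehat{ppv}_2} - \dfrac{2C^{ppv}}{\widehat{ppv}_1\widehat{ppv}_2}}.$$
In addition, with f(x) = logit x = log{x/(1 – x)} we have
$$T_{\mathrm{logit}}^{ppv} = \frac{(\mathrm{logit}\,\widehat{ppv}_1 - \mathrm{logit}\,\widehat{ppv}_2)^2}{\dfrac{1}{n_{+\bullet\bullet}\,\widehat{ppv}_1(1-\widehat{ppv}_1)} + \dfrac{1}{n_{\bullet+\bullet}\,\widehat{ppv}_2(1-\widehat{ppv}_2)} - \dfrac{2C^{ppv}}{\widehat{ppv}_1(1-\widehat{ppv}_1)\,\widehat{ppv}_2(1-\widehat{ppv}_2)}}.$$
Similarly, utilizing the estimated variance-covariance matrix of $\widehat{npv}_1$ and $\widehat{npv}_2$ (see Appendix A), we suggest the following formulas for the Wald statistics testing equality of two negative predictive values:
$$T_m^{npv} = \frac{(\widehat{npv}_1 - \widehat{npv}_2)^2}{\dfrac{\widehat{npv}_1(1-\widehat{npv}_1)}{n_{-\bullet\bullet}} + \dfrac{\widehat{npv}_2(1-\widehat{npv}_2)}{n_{\bullet-\bullet}} - 2C^{npv}},$$
$$T_{\log}^{npv} = \frac{(\log\widehat{npv}_1 - \log\widehat{npv}_2)^2}{\dfrac{1-\widehat{npv}_1}{n_{-\bullet\bullet}\,\widehat{npv}_1} + \dfrac{1-\widehat{npv}_2}{n_{\bullet-\bullet}\,\widehat{npv}_2} - \dfrac{2C^{npv}}{\widehat{npv}_1\widehat{npv}_2}},$$
$$T_{\mathrm{logit}}^{npv} = \frac{(\mathrm{logit}\,\widehat{npv}_1 - \mathrm{logit}\,\widehat{npv}_2)^2}{\dfrac{1}{n_{-\bullet\bullet}\,\widehat{npv}_1(1-\widehat{npv}_1)} + \dfrac{1}{n_{\bullet-\bullet}\,\widehat{npv}_2(1-\widehat{npv}_2)} - \dfrac{2C^{npv}}{\widehat{npv}_1(1-\widehat{npv}_1)\,\widehat{npv}_2(1-\widehat{npv}_2)}},$$
with
$$C^{npv} = \frac{n_{--\bar D}(1-\widehat{npv}_1)(1-\widehat{npv}_2) + n_{--D}\,\widehat{npv}_1\widehat{npv}_2}{n_{-\bullet\bullet}\,n_{\bullet-\bullet}}.$$
The proposed formulas can be seen as directly expanding the commonly used statistics for two independent samples, and they reduce to such statistics when there are no concordant results (i.e. $n_{++D} = n_{++\bar D} = 0$ for positive and $n_{--D} = n_{--\bar D} = 0$ for negative predictive values, so that $C^{ppv} = C^{npv} = 0$), as is the case in the two independent samples situation. We will show below that the logit transformation based test statistic turns out to be the same as the empirical Wald test statistic from the GEE approach proposed in [7].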
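The Wald statistics of this section can be sketched in a few lines of Python. The sketch assumes the δ-technique form "squared difference of transformed estimates over binomial variances minus twice the covariance", with covariance {concordant events × (1 − p1)(1 − p2) + concordant non-events × p1 p2}/(n1 n2); the function name `wald_statistic` and its argument names are hypothetical. For positive predictive values the "event" is disease among test positives; for negative predictive values it is no disease among test negatives.

```python
import math

# Each transform maps to (f, f'), the function and its derivative.
TRANSFORMS = {
    "identity": (lambda p: p, lambda p: 1.0),
    "log": (math.log, lambda p: 1.0 / p),
    "logit": (lambda p: math.log(p / (1 - p)), lambda p: 1.0 / (p * (1 - p))),
}


def wald_statistic(x1, n1, x2, n2, conc_event, conc_nonevent, transform="identity"):
    """Multinomial-based Wald statistic for two paired predictive values.

    x1/n1 and x2/n2 are the two estimated predictive values; conc_event and
    conc_nonevent count patients with concordant test results who do and do
    not have the event. This is a sketch under the assumptions stated above.
    """
    f, df = TRANSFORMS[transform]
    p1, p2 = x1 / n1, x2 / n2
    var1 = p1 * (1 - p1) / n1
    var2 = p2 * (1 - p2) / n2
    cov = (conc_event * (1 - p1) * (1 - p2) + conc_nonevent * p1 * p2) / (n1 * n2)
    # Delta technique: scale variances and covariance by the derivatives of f.
    denominator = df(p1) ** 2 * var1 + df(p2) ** 2 * var2 - 2 * df(p1) * df(p2) * cov
    return (f(p1) - f(p2)) ** 2 / denominator


if __name__ == "__main__":
    # Coronary artery disease data (Table 1), positive predictive values.
    for name in TRANSFORMS:
        print(name, round(wald_statistic(502, 570, 554, 620, 473, 22, name), 3))
```

With the Table 1 counts this sketch gives values that agree to three decimals with the Wald statistics reported in the example of Section 6, which is some reassurance that the assumed form matches the intended one.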
3 Proposed new formulations of the GEE based statistics
This section deals with the GEE based test statistics resulting from consideration of a logistic model. Section 3.1 derives simple formulation of the empirical Wald statistic and shows that it is equivalent to the multinomial based Wald statistic with the logit transformed predictive values. Section 3.2 derives two algebraically different but mathematically equivalent formulations of the generalized score statistic. The first formulation shows that the generalized score statistic can be quite similar to the multinomial based Wald statistic and the second formulation explains why in the independent samples situation the generalized score statistic does not reduce to the commonly used score statistic.
3.1 New formulation of the empirical Wald statistic
The GEE approach with disease status as the dependent variable was considered in [7]. For implementation of this approach, data in Table 2 can be re-expressed, with i denoting patient (cluster) and j denoting test number within a cluster (j = 1, 2). Disease status is denoted by Dij (1=disease, 0=no disease), covariate Zij indicates type of diagnostic test (1=Test 1, 0=Test 2), and Tij denotes test result (1=positive, 0=negative). To test equality of positive predictive values, only data with at least one positive test result (i.e. Ti1 = 1 or Ti2 = 1) need to be used and the model
$$\mathrm{logit}\, P(D_{ij} = 1 \mid Z_{ij}, T_{ij} = 1) = \alpha_{ppv} + \beta_{ppv} Z_{ij} \tag{2}$$
allows a test of the hypothesis H0 : βppv = 0, which is equivalent to H0 : ppv1 = ppv2. To test equality of negative predictive values (H0 : npv1 = npv2) the model logit P (Dij = 1∣Zij, Tij = 0) = αnpv + βnpvZij is considered for data with at least one negative test result, and the hypothesis H0 : βnpv = 0 is tested.
The GEE empirical Wald statistic for testing H0 : βppv = 0 in model (2) is , where the denominator is the second (last) diagonal element of the empirical variance-covariance matrix Ve = VmIeVm. The matrix Vm is the model based variance-covariance matrix and matrix Ie is the empirical information matrix. Use of the identity working correlation matrix in the context of diagnostic tests is advocated in [7, 9] and thus, using notations from the above section, the information matrix can be expressed as
(3) |
where cov(Di) is a variance-covariance matrix of Di = [Di1, Di2]T. The model based variance-covariance matrix is derived in Appendix B and the empirical information matrix Ie results from (3) by substituting for cov(Di) the matrix
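The empirical variance-covariance matrix Ve = VmIeVm is derived in Appendix C and the novel algebraic formulation of the GEE empirical Wald test statistic for testing H0 : βppv = 0 in model (2) is as follows:
$$T_{gw}^{ppv} = \frac{\hat\beta_{ppv}^2}{\dfrac{1}{n_{+\bullet\bullet}\,\widehat{ppv}_1(1-\widehat{ppv}_1)} + \dfrac{1}{n_{\bullet+\bullet}\,\widehat{ppv}_2(1-\widehat{ppv}_2)} - \dfrac{2C^{ppv}}{\widehat{ppv}_1(1-\widehat{ppv}_1)\,\widehat{ppv}_2(1-\widehat{ppv}_2)}}.$$
Similarly, the GEE empirical Wald statistic to test equality of two negative predictive values (H0 : βnpv = 0) is
$$T_{gw}^{npv} = \frac{\hat\beta_{npv}^2}{\dfrac{1}{n_{-\bullet\bullet}\,\widehat{npv}_1(1-\widehat{npv}_1)} + \dfrac{1}{n_{\bullet-\bullet}\,\widehat{npv}_2(1-\widehat{npv}_2)} - \dfrac{2C^{npv}}{\widehat{npv}_1(1-\widehat{npv}_1)\,\widehat{npv}_2(1-\widehat{npv}_2)}},$$
where $C^{ppv}$ and $C^{npv}$ are the covariance terms of Section 2. The above formulations show that the GEE empirical Wald statistics are the same as the corresponding multinomial based Wald statistics utilizing the logit transformation (derived in Section 2), because $\hat\beta_{ppv} = \mathrm{logit}\,\widehat{ppv}_1 - \mathrm{logit}\,\widehat{ppv}_2$ and $\hat\beta_{npv} = -(\mathrm{logit}\,\widehat{npv}_1 - \mathrm{logit}\,\widehat{npv}_2)$.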
3.2 Two new formulations of the generalized score statistic
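Utilizing model (2), the generalized score statistic formula for testing equality of two positive predictive values is presented in [7] and it is
$$T_{gs}^{ppv} = \frac{\left\{\sum_{i=1}^{N_P}\sum_{j=1}^{m_i} D_{ij}(Z_{ij} - \bar Z)\right\}^2}{\sum_{i=1}^{N_P}\left\{\sum_{j=1}^{m_i}(D_{ij} - \bar D)(Z_{ij} - \bar Z)\right\}^2}$$
where $\bar Z = \sum_{i}\sum_{j} Z_{ij} / \sum_{i} m_i$, $\bar D = \sum_{i}\sum_{j} D_{ij} / \sum_{i} m_i$, NP is the number of patients with at least one positive test outcome, and mi is the number of positive test results for the i-th patient.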
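The patient-level sums in this statistic translate directly into code. The following Python sketch assumes the form "squared sum of D(Z − Z̄) over the sum of squared per-patient sums of (D − D̄)(Z − Z̄)", taken over positive test results only; the record layout (one list of (D, Z) pairs per patient) and the helper `cad_ppv_records` are illustrative assumptions.

```python
def generalized_score(records):
    """Generalized score statistic from patient-level records.

    records holds one list per patient, containing a (D, Z) pair for each
    positive test result of that patient (Z = 1 for Test 1, 0 for Test 2).
    """
    flat = [pair for patient in records for pair in patient]
    d_bar = sum(d for d, _ in flat) / len(flat)
    z_bar = sum(z for _, z in flat) / len(flat)
    numerator = sum(d * (z - z_bar) for d, z in flat) ** 2
    denominator = sum(
        sum((d - d_bar) * (z - z_bar) for d, z in patient) ** 2
        for patient in records
    )
    return numerator / denominator


def cad_ppv_records():
    """Expand the Table 1 counts into records with at least one positive test."""
    # (count, disease status, Test 1 positive, Test 2 positive)
    groups = [
        (473, 1, True, True), (22, 0, True, True),
        (29, 1, True, False), (46, 0, True, False),
        (81, 1, False, True), (44, 0, False, True),
    ]
    records = []
    for count, d, t1_pos, t2_pos in groups:
        patient = ([(d, 1)] if t1_pos else []) + ([(d, 0)] if t2_pos else [])
        records.extend([patient] * count)
    return records


if __name__ == "__main__":
    print(round(generalized_score(cad_ppv_records()), 3))
```

Patients with two negative results do not appear in `cad_ppv_records`, in line with the restriction of model (2) to positive test results.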
Below, we derive two interesting (mathematically equivalent) formulations of the above statistic, with the first formulation facilitating its comparison with the multinomial based Wald statistic and the second one for direct comparison with the commonly used two independent samples score statistic. In addition, the new formulations provide motivation for the new weighted generalized score statistic we introduce in Section 4.
The general form of the generalized score statistic is $T_{gs} = S^T V_m L^T (L V_e L^T)^{-1} L V_m S$, where L is the contrast matrix with r rows (number of considered contrasts) and p columns (number of parameters), S is the score vector, Vm is the model based variance-covariance matrix, and Ve = VmIeVm is an empirical variance-covariance matrix with Ie denoting the empirical information matrix. Quantities S, Vm, and Ie are computed under H0. The generalized score statistic is asymptotically distributed under H0 as $\chi^2_r$.
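For positive predictive values consider model (2). Because, as indicated earlier, the identity working correlation matrix is required for application of GEE in the context of this paper [7, 9], the score vector under H0 : βppv = 0 is (see Appendix D for details)
$$S = \left[\,0,\; n_{+\bullet\bullet}\left(\widehat{ppv}_1 - \overline{ppv}\right)\right]^T$$
where $\overline{ppv} = (n_{+\bullet D} + n_{\bullet+D})/(n_{+\bullet\bullet} + n_{\bullet+\bullet})$ is the pooled positive predictive value. Alternatively, $\overline{ppv} = w_1\widehat{ppv}_1 + w_2\widehat{ppv}_2$ with w1 = n+••/(n+••+n•+•) and w2 = 1 – w1. The model based variance-covariance matrix has the same formulation as matrix Vm derived in Appendix B but with $\overline{ppv}$ substituted for $\widehat{ppv}_1$ and $\widehat{ppv}_2$. To test H0 : βppv = 0, the contrast matrix L = [0, 1] is used, and the generalized score statistic can be expressed as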
(4) |
with the empirical information matrix obtained by inserting the matrix
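in place of cov(Di) in expression (3). After brief algebra, the first new formulation of the generalized score statistic for testing equality of two positive predictive values is
$$T_{gs}^{ppv} = \frac{(\widehat{ppv}_1 - \widehat{ppv}_2)^2}{\dfrac{\widehat{ppv}_1(1-\widehat{ppv}_1)}{n_{+\bullet\bullet}} + \dfrac{\widehat{ppv}_2(1-\widehat{ppv}_2)}{n_{\bullet+\bullet}} - 2C_p^{ppv}\left(\dfrac{1}{n_{+\bullet\bullet}} + \dfrac{1}{n_{\bullet+\bullet}}\right) + R^{ppv}} \tag{5}$$
with
$$C_p^{ppv} = \frac{n_{++D}(1-\overline{ppv})^2 + n_{++\bar D}\,\overline{ppv}^2}{n_{+\bullet\bullet} + n_{\bullet+\bullet}}, \qquad R^{ppv} = \frac{(\widehat{ppv}_1 - \overline{ppv})^2}{n_{+\bullet\bullet}} + \frac{(\widehat{ppv}_2 - \overline{ppv})^2}{n_{\bullet+\bullet}},$$
where $\overline{ppv} = (n_{+\bullet D} + n_{\bullet+D})/(n_{+\bullet\bullet} + n_{\bullet+\bullet})$ is the pooled positive predictive value.
This formulation shows that the generalized score statistic can be quite similar to the multinomial based Wald statistic if $R^{ppv}$ is small and $C_p^{ppv}(1/n_{+\bullet\bullet} + 1/n_{\bullet+\bullet})$ is close to $C^{ppv}$.
Further algebra leads to the second new formulation of the generalized score statistic
$$T_{gs}^{ppv} = \frac{(\widehat{ppv}_1 - \widehat{ppv}_2)^2}{\left\{\overline{ppv}(1-\overline{ppv}) - 2C_p^{ppv}\right\}\left(\dfrac{1}{n_{+\bullet\bullet}} + \dfrac{1}{n_{\bullet+\bullet}}\right) + W^{ppv}} \tag{6}$$
where $W^{ppv} = (1 - 2\,\overline{ppv})(\widehat{ppv}_1 - \widehat{ppv}_2)(1/n_{+\bullet\bullet} - 1/n_{\bullet+\bullet})$. Since $W^{ppv}$ is not always equal to zero, the above formulation explicitly shows that the generalized score statistic does not always reduce to the score statistic when applied to independent samples (see Section 4 for details).
Similarly, two new formulations of the generalized score statistic for testing equality of two negative predictive values are
$$T_{gs}^{npv} = \frac{(\widehat{npv}_1 - \widehat{npv}_2)^2}{\dfrac{\widehat{npv}_1(1-\widehat{npv}_1)}{n_{-\bullet\bullet}} + \dfrac{\widehat{npv}_2(1-\widehat{npv}_2)}{n_{\bullet-\bullet}} - 2C_p^{npv}\left(\dfrac{1}{n_{-\bullet\bullet}} + \dfrac{1}{n_{\bullet-\bullet}}\right) + R^{npv}} = \frac{(\widehat{npv}_1 - \widehat{npv}_2)^2}{\left\{\overline{npv}(1-\overline{npv}) - 2C_p^{npv}\right\}\left(\dfrac{1}{n_{-\bullet\bullet}} + \dfrac{1}{n_{\bullet-\bullet}}\right) + W^{npv}}$$
where $\overline{npv} = (n_{-\bullet\bar D} + n_{\bullet-\bar D})/(n_{-\bullet\bullet} + n_{\bullet-\bullet})$ is the pooled negative predictive value and
$$C_p^{npv} = \frac{n_{--\bar D}(1-\overline{npv})^2 + n_{--D}\,\overline{npv}^2}{n_{-\bullet\bullet} + n_{\bullet-\bullet}}, \quad R^{npv} = \frac{(\widehat{npv}_1 - \overline{npv})^2}{n_{-\bullet\bullet}} + \frac{(\widehat{npv}_2 - \overline{npv})^2}{n_{\bullet-\bullet}}, \quad W^{npv} = (1 - 2\,\overline{npv})(\widehat{npv}_1 - \widehat{npv}_2)\left(\frac{1}{n_{-\bullet\bullet}} - \frac{1}{n_{\bullet-\bullet}}\right).$$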
4 Development of the weighted generalized score statistic
This section motivates and presents the new statistic. Section 4.1 utilizes insights provided by the two formulations of the generalized score statistic to motivate a need for weights when computing the empirical covariance matrix. Section 4.2 presents the proposed weights and the new weighted generalized score statistics.
4.1 Motivation
The formulation (6) of the generalized score statistic shows that in the two independent samples situation (here n+•• denotes the size of the sample with Test 1 and n•+• the size of an independent sample with Test 2) the generalized score statistic does not always reduce to the commonly used score statistic
$$T_s^{ppv} = \frac{(\widehat{ppv}_1 - \widehat{ppv}_2)^2}{\overline{ppv}(1-\overline{ppv})\left(\dfrac{1}{n_{+\bullet\bullet}} + \dfrac{1}{n_{\bullet+\bullet}}\right)} \tag{7}$$
where $\overline{ppv}$ is the pooled positive predictive value. Even though with independent samples the covariance term vanishes ($n_{++D} = n_{++\bar D} = 0$) because there are no concordant positive test results, the term $W^{ppv}$ equals zero only if $\widehat{ppv}_1 = \widehat{ppv}_2$ or $\overline{ppv} = 0.5$, or in a balanced situation, i.e. when n+•• = n•+•. Hence, in general, the generalized score statistic does not reduce to the score statistic when dealing with two independent samples. In addition, formulation (5) shows that the generalized score statistic can be quite similar to the multinomial Wald statistic if $R^{ppv}$ is small and the pooled-estimate covariance term is close to $C^{ppv}$. In fact, in the two independent samples situation (here $n_{++D} = n_{++\bar D} = 0$) with unequal sample sizes (n+•• ≠ n•+•), the generalized score statistic closely tracks the multinomial based Wald statistic rather than the score statistic. An example of this relationship is presented in Figure 1, which shows how these statistics change as the proportion of Test 1 positives among Test 1 or Test 2 positives changes. In a balanced situation this proportion is w1 ≡ n+••/(n+•• + n•+•) = 0.5 and the generalized score statistic and the score statistic have the same value. This is because with w1 = 0.5 the term $W^{ppv}$ in expression (6) equals zero. However, when the two independent samples have different sizes (w1 ≠ 0.5) then, as w1 changes, the generalized score statistic surprisingly tracks the multinomial based Wald statistic rather than the score statistic (7).
Figure 1.
Behavior of the generalized score statistic (solid curve), the score statistic (dashed curve) and the multinomial based Wald statistic (dotted curve) in the two independent samples situation as a function of proportion of Test 1 positives among patients with positive result on either of the two diagnostic tests (ppv1 = 0.80, ppv2 = 0.85, total n = 900)
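The behavior shown in Figure 1 can be explored numerically with the following Python sketch. It assumes that "total n" in the caption is the combined number of positive results in the two independent samples, plugs the true predictive values in place of estimates (so sample sizes need not be integers), and uses independent-sample forms in which the score and generalized score statistics differ only in how the two average squared distances from the pooled value are weighted. The function name is hypothetical.

```python
def independent_sample_statistics(w1, ppv1=0.80, ppv2=0.85, n=900):
    """Score, generalized score and Wald statistics for two independent samples.

    w1 is the proportion of Test 1 positives among all positives. The
    settings mimic the Figure 1 caption; the formulas are a sketch under
    the assumptions stated in the accompanying text.
    """
    w2 = 1 - w1
    n1, n2 = w1 * n, w2 * n
    pooled = w1 * ppv1 + w2 * ppv2
    diff_sq = (ppv1 - ppv2) ** 2
    inv = 1 / n1 + 1 / n2
    # Average squared distance of the disease indicator from the pooled value.
    a1 = ppv1 * (1 - pooled) ** 2 + (1 - ppv1) * pooled ** 2
    a2 = ppv2 * (1 - pooled) ** 2 + (1 - ppv2) * pooled ** 2
    return {
        "score": diff_sq / ((w1 * a1 + w2 * a2) * inv),
        "generalized_score": diff_sq / ((w2 * a1 + w1 * a2) * inv),
        "wald": diff_sq / (ppv1 * (1 - ppv1) / n1 + ppv2 * (1 - ppv2) / n2),
    }


if __name__ == "__main__":
    for w1 in (0.2, 0.3, 0.5, 0.7, 0.8):
        stats = independent_sample_statistics(w1)
        print(w1, {name: round(value, 3) for name, value in stats.items()})
```

At `w1 = 0.5` the score and generalized score values coincide, while for unbalanced samples the generalized score value lies much closer to the Wald value than to the score value.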
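To alleviate this behavior of the generalized score statistic we develop a weighted generalized score statistic which always reduces to the score statistic when applied to two independent samples. Motivation for the form of the proposed statistic arises from comparison of the following two formulas, with the first one a re-expression of the score statistic (7) and the second one a re-expression of the generalized score statistic (6) in the two independent samples situation:
$$T_s^{ppv} = \frac{(\widehat{ppv}_1 - \widehat{ppv}_2)^2}{\left\{w_1 A_1 + w_2 A_2\right\}\left(\dfrac{1}{n_{+\bullet\bullet}} + \dfrac{1}{n_{\bullet+\bullet}}\right)}, \qquad T_{gs}^{ppv} = \frac{(\widehat{ppv}_1 - \widehat{ppv}_2)^2}{\left\{w_2 A_1 + w_1 A_2\right\}\left(\dfrac{1}{n_{+\bullet\bullet}} + \dfrac{1}{n_{\bullet+\bullet}}\right)},$$
where $A_1 = n_{+\bullet\bullet}^{-1}\sum_{i:\,T_{i1}=1}(D_{i1} - \overline{ppv})^2$ and $A_2 = n_{\bullet+\bullet}^{-1}\sum_{i:\,T_{i2}=1}(D_{i2} - \overline{ppv})^2$ are the average squared distances of the disease indicators from the pooled positive predictive value $\overline{ppv}$ among Test 1 and Test 2 positives, respectively.
Recall that w1 = n+••/(n+•• + n•+•) and w2 = 1 – w1. The only difference between the above two formulas is the opposite weighting of the averages in the curly brackets. This observation leads to consideration of weights (defined in Section 4.2) when computing the empirical covariance matrix utilized in derivation of the generalized score statistic. With these weights the generalized score statistic will be equal to the score statistic in the independent samples situation. An alternative intuition behind the suggestion of weights is that if w1 > 0.5 then the pooled estimate $\overline{ppv}$ is influenced more by the Test 1 data; thus the distance $|D_{i1} - \overline{ppv}|$ used in the empirical covariance matrix is too small and should be up-weighted, and the distance $|D_{i2} - \overline{ppv}|$ down-weighted. Similarly, if w1 < 0.5 then $\overline{ppv}$ is influenced more by the Test 2 data and the distance $|D_{i1} - \overline{ppv}|$ should be down-weighted, with the distance $|D_{i2} - \overline{ppv}|$ up-weighted.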
4.2 The proposed weighted generalized score statistic
We propose weights , i.e. and , and the following weighted empirical covariance matrix:
The weighted empirical information matrix is obtained by substituting the above matrix for cov(Di) in expression (3). Subsequently, expression (4) with inserted in place of leads to the proposed weighted generalized score (wgs) statistic:
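$$T_{wgs}^{ppv} = \frac{(\widehat{ppv}_1 - \widehat{ppv}_2)^2}{\left\{\overline{ppv}(1-\overline{ppv}) - 2C_p^{ppv}\right\}\left(\dfrac{1}{n_{+\bullet\bullet}} + \dfrac{1}{n_{\bullet+\bullet}}\right)} \tag{8}$$
where $\overline{ppv} = (n_{+\bullet D} + n_{\bullet+D})/(n_{+\bullet\bullet} + n_{\bullet+\bullet})$ is the pooled positive predictive value and $C_p^{ppv} = \{n_{++D}(1-\overline{ppv})^2 + n_{++\bar D}\,\overline{ppv}^2\}/(n_{+\bullet\bullet} + n_{\bullet+\bullet})$.
The above statistic is the same as the generalized score statistic in a balanced design (i.e. when w1 = 0.5), because then the term $W^{ppv}$ in formulation (6) of the generalized score statistic equals zero. However, a balanced situation is not likely in a paired design, and in the more common unbalanced situation these two statistics are different. In the independent samples case ($n_{++D} = n_{++\bar D} = 0$, i.e. $C_p^{ppv} = 0$), the proposed weighted generalized score statistic (8) reduces to the score statistic (7) even in an unbalanced design, in contrast to the generalized score statistic.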
Similarly, for testing equality of two negative predictive values the weights are and , and the following weighted generalized score statistic is suggested:
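$$T_{wgs}^{npv} = \frac{(\widehat{npv}_1 - \widehat{npv}_2)^2}{\left\{\overline{npv}(1-\overline{npv}) - 2C_p^{npv}\right\}\left(\dfrac{1}{n_{-\bullet\bullet}} + \dfrac{1}{n_{\bullet-\bullet}}\right)} \tag{9}$$
where $\overline{npv} = (n_{-\bullet\bar D} + n_{\bullet-\bar D})/(n_{-\bullet\bullet} + n_{\bullet-\bullet})$ is the pooled negative predictive value and $C_p^{npv} = \{n_{--\bar D}(1-\overline{npv})^2 + n_{--D}\,\overline{npv}^2\}/(n_{-\bullet\bullet} + n_{\bullet-\bullet})$.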
In the next Section we demonstrate with simulations that the weighted generalized score statistic has superior type I error behavior as compared to the generalized score statistic, as well as compared to the discussed multinomial based statistics.
5 Simulations
We performed simulations to evaluate size and power of the test statistics. Each true scenario considers predictive values, prevalence of disease, and a measure of association between the two test results. Rather than assume the same degree of association of diagnostic tests in each disease category, we consider a more general approach and allow differential association of the two tests for diseased and not diseased groups, as parameterized with odds ratios $OR_D$ and $OR_{\bar D}$, respectively. This may be a more realistic scenario because, for example, in the CAD data (Table 1) the odds ratio in the diseased group is 473×25/(29×81) = 5.03 and it is substantially larger than the odds ratio 22 × 151/(46 × 44) = 1.64 in the group with no disease.
Hence, seven true parameters formulate a simulation scenario ($OR_D$, $OR_{\bar D}$, ppv1, ppv2, npv1, npv2, and disease prevalence θ). These parameters define the corresponding true multinomial distribution reflecting the paired design considered in this paper, and the multinomial probabilities are shown in Table 3. To derive probabilities of this multinomial distribution, we first compute sensitivity se and specificity sp of each test based on the considered predictive values and the disease prevalence. Quantities x and y in Table 3 are margin compatible solutions of quadratic equations resulting from equating the cross-product ratio of the first two columns (diseased group) to $OR_D$ and of the last two columns (non-diseased group) to $OR_{\bar D}$, respectively. The conditional independence situation occasionally assumed in the literature, in which tests are considered independent conditional on the disease status, is equivalent to the assumption that $OR_D = OR_{\bar D} = 1$. In this situation, the quadratic equations reduce to linear equations with solutions x = se1se2 and y = sp1sp2, and Table 3 probabilities are then simply probabilities expected under no association between tests within a disease status category.
Table 3.
Multinomial distribution probabilities for specified predictive values ppv1, ppv2, npv1, and npv2, disease prevalence θ, and odds ratios $OR_D$ and $OR_{\bar D}$. Quantities x and y are compatible solutions of quadratic equations resulting from equating the cross-product ratio of the first two columns to $OR_D$ and of the last two columns to $OR_{\bar D}$, respectively.
Test 1 result | Disease ($D$): Test 2 positive | Disease ($D$): Test 2 negative | No disease ($\bar D$): Test 2 positive | No disease ($\bar D$): Test 2 negative |
---|---|---|---|---|
Positive | θ x | θ(se1 – x) | (1 – θ)(1 – sp1 – sp2 + y) | (1 – θ)(sp2 – y) |
Negative | θ(se2 – x) | θ(1 – se1 – se2 + x) | (1 – θ)(sp1 – y) | (1 – θ)y |
se1 = ppv1(npv1 – 1 + θ)/{θ(ppv1 + npv1 – 1)}, sp1 = npv1(ppv1 – θ)/{(1 – θ)(ppv1 + npv1 – 1)}
se2 = ppv2(npv2 – 1 + θ)/{θ(ppv2 + npv2 – 1)}, sp2 = npv2(ppv2 – θ)/{(1 – θ)(ppv2 + npv2 – 1)}
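The construction of the Table 3 probabilities can be sketched in Python as follows. The sensitivity and specificity formulas are those given with Table 3; the closed-form root used for x and y is the standard margin compatible solution of the odds-ratio quadratic, and choosing that particular root expression is an assumption of this sketch. All function names are hypothetical, and the value 1.64 used for the non-diseased odds ratio is simply the CAD value quoted in the text.

```python
import math


def se_sp(ppv, npv, theta):
    """Sensitivity and specificity implied by ppv, npv and prevalence theta."""
    denom = ppv + npv - 1
    se = ppv * (npv - 1 + theta) / (theta * denom)
    sp = npv * (ppv - theta) / ((1 - theta) * denom)
    return se, sp


def joint_probability(a, b, odds_ratio):
    """P(both events) for margins a, b and a given odds ratio.

    Solves x(1 - a - b + x) = OR (a - x)(b - x) for the margin compatible root.
    """
    if abs(odds_ratio - 1) < 1e-12:
        return a * b  # conditional independence: quadratic becomes linear
    s = 1 + (odds_ratio - 1) * (a + b)
    disc = s * s - 4 * odds_ratio * (odds_ratio - 1) * a * b
    return (s - math.sqrt(disc)) / (2 * (odds_ratio - 1))


def cell_probabilities(ppv1, ppv2, npv1, npv2, theta, or_d, or_nd):
    """Table 3 probabilities keyed by (Test 1, Test 2, disease status)."""
    se1, sp1 = se_sp(ppv1, npv1, theta)
    se2, sp2 = se_sp(ppv2, npv2, theta)
    x = joint_probability(se1, se2, or_d)   # P(T1+, T2+ | disease)
    y = joint_probability(sp1, sp2, or_nd)  # P(T1-, T2- | no disease)
    return {
        ("+", "+", "D"): theta * x,
        ("+", "-", "D"): theta * (se1 - x),
        ("-", "+", "D"): theta * (se2 - x),
        ("-", "-", "D"): theta * (1 - se1 - se2 + x),
        ("+", "+", "N"): (1 - theta) * (1 - sp1 - sp2 + y),
        ("+", "-", "N"): (1 - theta) * (sp2 - y),
        ("-", "+", "N"): (1 - theta) * (sp1 - y),
        ("-", "-", "N"): (1 - theta) * y,
    }


if __name__ == "__main__":
    probs = cell_probabilities(0.75, 0.75, 0.80, 0.80, 0.3, 5.0, 1.64)
    for cell, value in probs.items():
        print(cell, round(value, 4))
```

Not every combination of predictive values and prevalence yields valid probabilities, so in practice it is worth checking that all eight cells are non-negative before simulating from them.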
For each considered true scenario, the corresponding multinomial distribution described in Table 3 was derived, and data (as in Table 2) with total sample size n were generated repeatedly from this distribution. We added a very small number to each cell of a generated data set when that data set made it impossible to compute one of the considered statistics because of zero cells or predictive values equal to zero or one. We generated data four million times because results are displayed to three decimal digits and this number of simulations makes the length of the 95% confidence interval for the proportion of rejected tests less than 0.001. For each simulation, H0 was rejected if a test statistic exceeded the 0.95 quantile of the $\chi^2_1$ distribution.
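The choice of four million replications can be checked with a one-line calculation. The Python sketch below assumes the usual normal-approximation (Wald) confidence interval for a proportion; the function name `ci_length` is hypothetical.

```python
import math


def ci_length(p, replications, z=1.96):
    """Length of the normal-approximation 95% CI for a rejection proportion p."""
    return 2 * z * math.sqrt(p * (1 - p) / replications)


if __name__ == "__main__":
    # p = 0.5 is the worst case (widest interval); p = 0.05 is the nominal size.
    for p in (0.05, 0.5):
        print(p, ci_length(p, 4_000_000))
```

Under this approximation the interval length stays below 0.001 for any rejection proportion, including the worst case p = 0.5, which covers the power results as well as the size results.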
Table 4 displays empirical size for the multinomial Wald statistics, the generalized score statistic, and the proposed weighted generalized score statistics. The results for the multinomial based Wald statistic using logit transformation and the GEE empirical Wald statistic are displayed in one column, because these statistics are the same as shown in Section 3.1. We considered ppv1 = ppv2 = 0.75, npv1 = npv2 = 0.80, and varied prevalence of disease θ and total sample size n. We specified ORD = 5.0 and , which is similar to the CAD data in Table 1. As seen in Table 4 the proposed weighted generalized score (wgs) statistics preserve the nominal 0.05 type I error better than the other statistics. The empirical size of Tm is similar or slightly larger than for the generalized score statistics, as anticipated from expression (5), but the empirical sizes for the both statistics are moderately inflated, mainly with lower disease prevalence or smaller sample sizes. The empirical size is underestimated for logit (or equivalently GEE empirical Wald) and log transformation multinomial statistics, especially for smaller sample sizes.
Table 4.
Percent of 4,000,000 repeated simulations with H0 rejected at the 5 percent level. True predictive values are PPV1 = PPV2 = 0.75 and NPV1 = NPV2 = 0.80; $OR_D = 5$ and $OR_{\bar D}$ are odds ratios defining the degree of association between results of the two tests among diseased and not-diseased, n is the total number of patients with two tests, and θ is the prevalence of disease. $T_m$, $T_{\log}$ and $T_{\mathrm{logit}}$ are the multinomial Wald statistics without transformation and with the log and logit transformations ($T_{\mathrm{logit}}$ equals the GEE empirical Wald statistic $T_{gw}$), $T_{gs}$ is the generalized score statistic, and $T_{wgs}$ is the weighted generalized score statistic.
θ | n | PPV: $T_m$ | PPV: $T_{\log}$ | PPV: $T_{\mathrm{logit}}$ ($T_{gw}$) | PPV: $T_{gs}$ | PPV: $T_{wgs}$ | NPV: $T_m$ | NPV: $T_{\log}$ | NPV: $T_{\mathrm{logit}}$ ($T_{gw}$) | NPV: $T_{gs}$ | NPV: $T_{wgs}$ |
---|---|---|---|---|---|---|---|---|---|---|---|
0.3 | 50 | 9.3 | 3.6 | 0.9 | 7.0 | 4.2 | 4.8 | 3.8 | 3.7 | 4.8 | 4.2 |
0.3 | 100 | 6.6 | 4.4 | 2.4 | 6.0 | 5.0 | 5.1 | 4.7 | 4.7 | 5.1 | 4.8 |
0.3 | 200 | 5.7 | 4.6 | 4.2 | 5.5 | 5.0 | 5.1 | 4.8 | 4.8 | 5.1 | 4.9 |
0.3 | 300 | 5.5 | 4.7 | 4.5 | 5.3 | 5.0 | 5.0 | 4.9 | 4.9 | 5.0 | 4.9 |
0.3 | 400 | 5.4 | 4.8 | 4.7 | 5.3 | 5.0 | 5.0 | 4.9 | 4.9 | 5.0 | 4.9 |
0.3 | 500 | 5.3 | 4.8 | 4.7 | 5.2 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
0.4 | 50 | 6.4 | 4.4 | 2.6 | 6.1 | 4.9 | 5.4 | 4.0 | 3.7 | 5.3 | 4.5 |
0.4 | 100 | 5.7 | 4.6 | 4.2 | 5.6 | 5.0 | 5.3 | 4.6 | 4.6 | 5.2 | 4.9 |
0.4 | 200 | 5.3 | 4.8 | 4.7 | 5.3 | 5.0 | 5.1 | 4.8 | 4.8 | 5.1 | 5.0 |
0.4 | 300 | 5.2 | 4.9 | 4.8 | 5.2 | 5.0 | 5.1 | 4.9 | 4.9 | 5.1 | 4.9 |
0.4 | 400 | 5.2 | 4.9 | 4.8 | 5.1 | 5.0 | 5.1 | 4.9 | 4.9 | 5.1 | 5.0 |
0.4 | 500 | 5.1 | 4.9 | 4.9 | 5.1 | 5.0 | 5.0 | 4.9 | 4.9 | 5.0 | 5.0 |
0.5 | 50 | 5.9 | 4.5 | 3.8 | 5.8 | 4.9 | 5.6 | 3.8 | 2.9 | 5.4 | 4.5 |
0.5 | 100 | 5.4 | 4.8 | 4.6 | 5.4 | 5.0 | 5.4 | 4.5 | 4.3 | 5.3 | 4.9 |
0.5 | 200 | 5.2 | 4.9 | 4.8 | 5.2 | 5.0 | 5.2 | 4.8 | 4.7 | 5.2 | 5.0 |
0.5 | 300 | 5.1 | 4.9 | 4.9 | 5.1 | 5.0 | 5.1 | 4.8 | 4.8 | 5.1 | 5.0 |
0.5 | 400 | 5.1 | 4.9 | 4.9 | 5.1 | 5.0 | 5.1 | 4.9 | 4.9 | 5.1 | 5.0 |
0.5 | 500 | 5.1 | 5.0 | 4.9 | 5.1 | 5.0 | 5.1 | 4.9 | 4.9 | 5.1 | 5.0 |
Table 5 summarizes empirical power when the true predictive values are ppv1 = 0.75, ppv2 = 0.85, npv1 = 0.80, and npv2 = 0.90. Here we also consider ORD = 5.0 and , and vary prevalence of disease θ and total sample size n. Empirical power of the multinomial Tm and the generalized score statistics may be somewhat overestimated because of potential for the inflated type I error as demonstrated in Table 4. Power of the proposed weighted generalized score statistic is slightly higher or similar to power of the generalized score statistic. Although not shown, as expected, power for the conditional independence situation is less than power with the chosen positive association between the two tests.
Table 5.
Percent of 4,000,000 repeated simulations with H0 rejected at the 5 percent level. True predictive values are PPV1 = 0.75, PPV2 = 0.85, NPV1 = 0.80, and NPV2 = 0.90; $OR_D = 5$ and $OR_{\bar D}$ are odds ratios defining the degree of association between results of the two tests among diseased and not-diseased, n is the total number of patients with two tests, and θ is the prevalence of disease. Column labels are as in Table 4.
θ | n | PPV: $T_m$ | PPV: $T_{\log}$ | PPV: $T_{\mathrm{logit}}$ ($T_{gw}$) | PPV: $T_{gs}$ | PPV: $T_{wgs}$ | NPV: $T_m$ | NPV: $T_{\log}$ | NPV: $T_{\mathrm{logit}}$ ($T_{gw}$) | NPV: $T_{gs}$ | NPV: $T_{wgs}$ |
---|---|---|---|---|---|---|---|---|---|---|---|
0.3 | 50 | 11.7 | 4.4 | 4.0 | 8.8 | 10.6 | 51.4 | 47.3 | 40.2 | 51.4 | 47.4 |
0.3 | 100 | 15.0 | 9.5 | 12.2 | 13.4 | 16.6 | 82.3 | 81.1 | 80.0 | 82.3 | 81.2 |
0.3 | 200 | 24.8 | 20.3 | 26.0 | 23.5 | 27.5 | 98.5 | 98.4 | 98.3 | 98.5 | 98.4 |
0.3 | 300 | 35.0 | 31.2 | 37.1 | 33.8 | 37.9 | 99.9 | 99.9 | 99.9 | 99.9 | 99.9 |
0.3 | 400 | 44.7 | 41.5 | 47.0 | 43.7 | 47.6 | 100 | 100 | 100 | 100 | 100 |
0.3 | 500 | 53.5 | 50.8 | 55.8 | 52.6 | 56.2 | 100 | 100 | 100 | 100 | 100 |
0.4 | 50 | 16.2 | 11.5 | 8.8 | 15.3 | 15.4 | 34.6 | 30.2 | 23.4 | 34.3 | 32.1 |
0.4 | 100 | 26.6 | 22.8 | 23.9 | 25.9 | 26.7 | 61.5 | 58.9 | 57.8 | 61.3 | 60.4 |
0.4 | 200 | 46.7 | 44.1 | 46.1 | 46.2 | 47.3 | 89.3 | 88.7 | 88.7 | 89.3 | 89.1 |
0.4 | 300 | 63.3 | 61.6 | 63.3 | 62.9 | 63.9 | 97.6 | 97.5 | 97.5 | 97.6 | 97.5 |
0.4 | 400 | 75.7 | 74.6 | 75.9 | 75.4 | 76.2 | 99.5 | 99.5 | 99.5 | 99.5 | 99.5 |
0.4 | 500 | 84.4 | 83.7 | 84.6 | 84.2 | 84.8 | 99.9 | 99.9 | 99.9 | 99.9 | 99.9 |
0.5 | 50 | 23.6 | 19.5 | 16.7 | 23.3 | 21.9 | 22.3 | 17.5 | 12.3 | 21.5 | 21.7 |
0.5 | 100 | 40.7 | 38.0 | 38.0 | 40.5 | 40.0 | 41.7 | 37.9 | 37.1 | 41.2 | 41.6 |
0.5 | 200 | 68.2 | 66.9 | 67.3 | 68.1 | 68.0 | 70.1 | 68.4 | 69.3 | 69.8 | 70.4 |
0.5 | 300 | 84.6 | 84.0 | 84.3 | 84.5 | 84.6 | 86.3 | 85.6 | 86.1 | 86.2 | 86.6 |
0.5 | 400 | 93.1 | 92.8 | 93.0 | 93.0 | 93.1 | 94.2 | 93.9 | 94.2 | 94.2 | 94.4 |
0.5 | 500 | 97.1 | 97.0 | 97.0 | 97.1 | 97.1 | 97.7 | 97.6 | 97.7 | 97.7 | 97.7 |
In conclusion, the simulation results support use of the proposed weighted generalized score statistics by demonstrating excellent size and power behavior.
6 Example
We consider the coronary artery disease data [1] presented in Table 1 (n = 871) and provide detailed calculations which may be easily followed by practitioners with their own data.
First we compute the weighted generalized score statistic for testing equality of positive predictive values. The number of patients with positive Test 1 is n+•• = 473+29+22+46 = 570 and with positive Test 2 is n•+• = 473 + 81 + 22 + 44 = 620. This is an unbalanced situation because n+•• ≠ n•+•. Positive predictive values are estimated as $\widehat{ppv}_1 = (473+29)/570 = 0.881$ and $\widehat{ppv}_2 = (473+81)/620 = 0.894$. The pooled estimate is $\overline{ppv} = (502+554)/(570+620) = 0.887$. Subsequently, $C_p^{ppv} = \{473(1-\overline{ppv})^2 + 22\,\overline{ppv}^2\}/(570+620) = 0.0196$, and $\overline{ppv}(1-\overline{ppv}) - 2C_p^{ppv} = 0.0999 - 0.0392 = 0.0607$. Substituting computed values (carried at full precision) we obtain
$$T_{wgs}^{ppv} = \frac{(502/570 - 554/620)^2}{0.0607\,(1/570 + 1/620)} = 0.807.$$
Hence, utilizing the $\chi^2_1$ distribution, we have p = 0.37 and conclude that there is no evidence of a difference between the two positive predictive values. As an aside, observe that the term $2C_p^{ppv} = 0.0392$ reduces the denominator substantially, and thus correlation between Test 1 and Test 2 has a major impact in this data set. For comparison, the generalized score statistic is 0.802 and the multinomial based Wald statistics are as follows: 0.802 for the untransformed version, 0.800 for the log transformed version, and 0.806 for the logit transformed version (or equivalently the GEE empirical Wald statistic).
The weighted generalized score statistic for comparison of two negative predictive values is computed similarly. We have n–•• = 81+25+44+151 = 301, n•–• = 29+25+46+151 = 251, and the estimated negative predictive values are $\widehat{npv}_1 = (44+151)/301 = 0.648$ and $\widehat{npv}_2 = (46+151)/251 = 0.785$. The pooled estimate is $\overline{npv} = (195+197)/(301+251) = 0.710$. Then $C_p^{npv} = \{151(1-\overline{npv})^2 + 25\,\overline{npv}^2\}/(301+251) = 0.0458$, and $\overline{npv}(1-\overline{npv}) - 2C_p^{npv} = 0.2058 - 0.0916 = 0.1142$. Subsequently,
$$T_{wgs}^{npv} = \frac{(195/301 - 197/251)^2}{0.1142\,(1/301 + 1/251)} = 22.50.$$
Here, p < 0.001, and we conclude that the two negative predictive values differ. The generalized score statistic is 23.579 and the multinomial based Wald statistics are as follows: 23.725 for the untransformed version, 22.437 for the log transformed version, and 21.742 for the logit transformed version (or equivalently for the GEE empirical Wald statistic).
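The calculations of this example can be reproduced with the following Python sketch. It assumes the weighted generalized score statistic has the form "squared difference over {pooled × (1 − pooled) − 2 C_p}(1/n1 + 1/n2)", with C_p = {concordant events × (1 − pooled)² + concordant non-events × pooled²}/(n1 + n2); the function names are hypothetical. The p-value uses the identity P(χ²₁ > t) = erfc(√(t/2)).

```python
import math


def wgs_statistic(x1, n1, x2, n2, conc_event, conc_nonevent):
    """Weighted generalized score statistic (sketch of the assumed form).

    x1/n1 and x2/n2 are the two estimated predictive values; conc_event and
    conc_nonevent count concordant test results with and without the event
    (disease for ppv, no disease for npv).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    c_p = (conc_event * (1 - pooled) ** 2 + conc_nonevent * pooled ** 2) / (n1 + n2)
    return (p1 - p2) ** 2 / ((pooled * (1 - pooled) - 2 * c_p) * (1 / n1 + 1 / n2))


def chi2_1_pvalue(t):
    """Upper tail probability of the chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(t / 2))


if __name__ == "__main__":
    t_ppv = wgs_statistic(502, 570, 554, 620, conc_event=473, conc_nonevent=22)
    t_npv = wgs_statistic(195, 301, 197, 251, conc_event=151, conc_nonevent=25)
    print("ppv:", round(t_ppv, 3), "p =", round(chi2_1_pvalue(t_ppv), 2))
    print("npv:", round(t_npv, 2), "p =", chi2_1_pvalue(t_npv))
```

Setting both concordant counts to zero turns the function into the ordinary pooled score statistic for two independent proportions, which mirrors the reduction property described in Section 4.2.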
7 Discussion
We have proposed new weighted generalized score test statistics (8) and (9) for testing equality of positive and negative predictive values of two diagnostic tests in a paired design. Simulations indicate that these statistics preserve type I error better than the generalized score statistic or the multinomial distribution based Wald statistics. The proposed statistics are intuitive, simple to compute, and in the absence of correlated data (unpaired design) they naturally reduce to the commonly used score statistic for comparison of two proportions with independent samples. Thus, we recommend the weighted generalized score (wgs) test statistic for hypothesis testing and for the corresponding sample size computations. We have also developed novel simple formulas for the existing Wald statistics, which may be used for computation of confidence intervals for the difference of two predictive values. Although the simple univariable formulas for comparison of predictive values provide intuitive insight, a GEE based regression analysis is necessary when adjustment for covariates is of interest, especially if performance of a test varies substantially according to patient characteristics [7]. This is similar to the co-existence of logistic regression methodology and the commonly used “paper-and-pencil” formulas for comparison of two proportions. Finally, we have proposed the weighted generalized score statistic in the setting of diagnostic tests, but with further development the introduced concept may lead to an improved score test within the general GEE framework.
Acknowledgement
This project was supported by the National Center for Research Resources and the National Center for Advancing Translational Sciences of the National Institutes of Health through the Clinical and Translational Science Award Number UL1RR024128.
Appendix A. Derivation of multinomial based variance-covariance matrix of predictive values
The matrix of derivatives of f(ppv1) = f(π+•D/π+••) and f(ppv2) = f(π•+D/π•+•) with respect to π is G = BD, where D is a diagonal matrix with the derivatives f′(ppv1) and f′(ppv2) on the diagonal, and
Using the δ-technique (delta method) and the variance-covariance matrix Σπ of the multinomial distribution, we have Bᵀπ = 0 and
where .
Similarly, the variance-covariance matrix of f(npv1) and f(npv2) is
where .
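The delta-method step for the positive predictive values can be sketched as follows. This is a reconstruction under stated assumptions, not a transcription of the displayed formulas: N denotes the total sample size, Σπ is taken to be the standard multinomial form diag(π) − ππᵀ, B is the matrix of derivatives of ppv1 and ppv2 with respect to π, and π₊₊D and π₊₊D̄ (notation assumed here) are the probabilities that both tests are positive with and without disease.

```latex
\begin{align*}
\Sigma_\pi &= \operatorname{diag}(\pi) - \pi\pi^{T}, \qquad
  B^{T}\pi = 0 \;\Longrightarrow\;
  G^{T}\Sigma_\pi G = D\,B^{T}\operatorname{diag}(\pi)\,B\,D, \\[4pt]
\operatorname{Var}\bigl\{f(\widehat{ppv}_1)\bigr\}
  &\approx \frac{\{f'(ppv_1)\}^{2}}{N}\,
    \frac{\pi_{+\bullet D}\,(1-ppv_1)^{2} + \pi_{+\bullet \bar{D}}\,ppv_1^{2}}
         {\pi_{+\bullet\bullet}^{2}}
   = \frac{\{f'(ppv_1)\}^{2}\,ppv_1(1-ppv_1)}{N\,\pi_{+\bullet\bullet}}, \\[4pt]
\operatorname{Cov}\bigl\{f(\widehat{ppv}_1),\, f(\widehat{ppv}_2)\bigr\}
  &\approx \frac{f'(ppv_1)\,f'(ppv_2)}{N}\,
    \frac{\pi_{++D}\,(1-ppv_1)(1-ppv_2) + \pi_{++\bar{D}}\,ppv_1\,ppv_2}
         {\pi_{+\bullet\bullet}\,\pi_{\bullet+\bullet}}.
\end{align*}
```

The simplification in the variance line uses π₊•D = ppv1·π₊•• and π₊•D̄ = (1 − ppv1)·π₊••. The expression for f(ppv2) follows by exchanging the roles of the two tests, and the negative predictive values are handled the same way with the roles of positive and negative results, and of disease and no disease, interchanged.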
Appendix B. Derivation of model based variance-covariance matrix
Then
Appendix C. Derivation of empirical variance-covariance matrix
Since and (notation in Appendix A), then using Im and from Appendix B leads to the empirical variance-covariance matrix Ve = VmIeVm expressed as
Appendix D. Derivation of score vector
References
- [1] Weiner DA, Ryan TJ, McCabe CH, Kennedy JW, Schloss M, Tristani F, Chaitman BR, Fisher LD. Exercise stress-testing: correlations among history of angina, ST-segment response and prevalence of coronary-artery disease in the Coronary Artery Surgery Study (CASS). The New England Journal of Medicine. 1979;301:230–235. DOI: 10.1056/NEJM197908023010502.
- [2] Moskowitz CS, Pepe MS. Comparing the predictive values of diagnostic tests: sample size and analysis for paired study designs. Clinical Trials. 2006;3:272–279. DOI: 10.1191/1740774506cn147oa.
- [3] Roldán Nofuentes JA, Luna del Castillo JD, Montero Alonso MA. Global hypothesis test to simultaneously compare the predictive values of two binary diagnostic tests. Computational Statistics and Data Analysis. 2012;56:1161–1173. DOI: 10.1016/j.csda.2011.06.003.
- [4] Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22. DOI: 10.1093/biomet/73.1.13.
- [5] Boos DD. On generalized score tests. American Statistician. 1992;46:327–333. DOI: 10.2307/2685328.
- [6] Rotnitzky A, Jewell NP. Hypothesis testing of regression parameters in semiparametric generalized linear models for cluster correlated data. Biometrika. 1990;77:485–497. DOI: 10.1093/biomet/77.3.485.
- [7] Leisenring W, Alonzo T, Pepe MS. Comparisons of predictive values of binary medical diagnostic tests for paired designs. Biometrics. 2000;56:345–351. DOI: 10.1111/j.0006-341X.2000.00345.x.
- [8] Wang W, Davis CS, Soong S. Comparison of predictive values of two diagnostic tests from the same sample of subjects using weighted least squares. Statistics in Medicine. 2006;25:2215–2229. DOI: 10.1002/sim.2332.
- [9] Pepe MS, Anderson GL. A cautionary note for inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics. 1994;23:939–951. DOI: 10.1080/03610919408813210.