Author manuscript; available in PMC: 2016 Jul 10.
Published in final edited form as: Stat Med. 2015 Mar 20;34(15):2325–2333. doi: 10.1002/sim.6485

Comparison of operational characteristics for binary tests with clustered data

Minjung Kwak a, Sang-Won Um b, Sin-Ho Jung b,c,*
PMCID: PMC4632652  NIHMSID: NIHMS674235  PMID: 25801180

Summary

Although statistical methodology is well developed for comparing diagnostic tests in terms of their sensitivity and specificity, comparative inference about predictive values is not. In this paper we consider the analysis of studies comparing the operating characteristics of two diagnostic tests that are measured on all subjects and have test outcomes from multiple sites, with the number of sites varying among subjects. We develop a new approach for comparing sensitivity, specificity, positive predictive value, and negative predictive value with simple variance calculations, focusing in particular on comparing tests using the difference of positive and negative predictive values. Simulation studies are conducted to show the performance of our approach. We analyze real data on diagnostic tests for patients with lung cancer to illustrate the methodology.

Keywords: Sensitivity, Specificity, Positive predictive value, Negative predictive value, Clustered binary outcome

1 Introduction

Many new medical tests have been developed with recent advances in biotechnology, and such tests can be used for various purposes including diagnosis, prognosis, risk prediction, and disease screening. Among frequently used measures for binary tests are accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. Sensitivity is the probability of a positive test among diseased subjects (cases), and specificity is the probability of a negative test among healthy subjects (controls). On the other hand, positive predictive value (PPV) is the probability of having the disease, given that the test is positive, and negative predictive value (NPV) is the probability of not having the disease, given that the test is negative. McNemar's test can be used to test the equality of the sensitivities or specificities of two tests evaluated on the same subjects. However, relatively little attention has been paid to the comparison of paired PPVs or NPVs. Wang et al. [1] studied the size and power of the equality test of two PPVs (and, similarly, of two NPVs), using the delta method to derive the asymptotic normality of the log ratio or the difference of two PPVs. Leisenring et al. [2] applied generalized estimating equations to this problem; they proposed a score statistic and a Wald statistic derived from a marginal regression model. More recently, Kosinski [3] proposed a weighted generalized score statistic for simpler computation; his statistic reduces to the score statistic in the independent-sample situation. Moskowitz and Pepe [4] derived a sample size formula for the log ratio of two PPVs, using the multinomial-Poisson transformation, which converts the likelihood of the data into a Poisson likelihood with additional parameters, to avoid the complicated variance form obtained by the delta method.

Although well established, sensitivity and specificity have some deficiencies in clinical use. This arises mainly from the fact that they are population measures, i.e., they summarize the characteristics of the test over a population. When we consider the results of a diagnostic test from a patient's perspective, however, if the test is positive, the patient wants to know the chance that he or she actually has the disease; and if the test is negative, the patient may ask about the probability that he or she does not actually have the disease. These questions refer to the positive and negative predictive values of a diagnostic test. So PPV and NPV are much more important than sensitivity and specificity from the patient's perspective.

Our study is motivated by a real project to compare the performance of two diagnostic tests in determining the metastatic status of lymph nodes in lung cancer patients. Mediastinoscopy has been a diagnostic standard, but its sensitivity and negative predictive value are not very high. Furthermore, the technique is invasive, and the morbidity and mortality rates are known to be about 2% and 0.08%, respectively (Porte et al. [6], Silvestri et al. [7]). Endobronchial ultrasound-guided transbronchial needle aspiration (EBUS) is a less invasive intervention with a broader visual scope, so the investigators wanted to compare the new method, EBUS, to the existing diagnostic tool, mediastinoscopy. Both diagnostic tests are conducted on selected lymph nodes of each patient, so the resulting data are clustered paired binary outcomes, where pairs are the two diagnostic test results from each lymph node site and clusters are patients. The study was designed to show the non-inferiority of the new test compared to the standard test, but the new test turned out to be superior in some operating characteristics. Rao and Scott [5] proposed a simple method for the analysis of clustered binary data assuming no specific model for the intracluster correlation; they obtained an estimator of the variance inflation factor due to clustering and adjusted the variance formula derived for independent data by multiplying by the inflation factor. Jung et al. [8] proposed a sample size calculation method for clustered binary data using an optimal weighting scheme that minimizes the variance of the estimator. Hu et al. [9] derived a sample size formula for the sensitivity and specificity of diagnostic tests using the sign test for dependent multiple observations per subject under a common correlation model. These methods mainly dealt with single-sample proportion estimators, with the focus on sensitivity or specificity. In addition, Katsis [10] proposed a Bayesian approach to the sample size for two dependent binomial populations using a Dirichlet prior on the proportions.

This paper is novel in that (1) we provide simpler test statistics for the comparison of sensitivity, specificity, positive predictive value, and negative predictive value in quite a general setting, and (2) our method can be easily applied to clustered data, allowing multiple dependent observations per subject with varying cluster sizes. In Section 2, we describe the data structure and provide statistical testing methods for the sensitivity, specificity, accuracy, PPV, and NPV of two clustered test outcomes. We numerically study the performance of our method under various scenarios in Section 3. Section 4 presents an example of the proposed method. Conclusions are drawn in Section 5.

2 Data and Proposed Method

In clinical practice, clinicians are interested in the early detection of a specific disease or the determination of medical status for their patients. Usually good treatments are available for the disease, but subjects suffer harm if the disease status is erroneously diagnosed. We assume that there are two available diagnostic methods and wish to identify the better one; for example, we prefer the method with a higher PPV. For subject $i = 1, \ldots, n$, we observe paired binary outcomes for two diagnostic tests from each of $m_i$ sites, such as lymph node sites. We assume that $\max_i m_i \le c\,(<\infty)$ and that a positive diagnostic outcome provides evidence of disease. From site $j$ of subject $i$, we observe binary random variables $d_{ij}$, denoting the disease status (0 for non-disease and 1 for disease), and $x_{kij}$, denoting the outcome of diagnostic test $k\,(=1,2)$ (0 for negative and 1 for positive). Thus, the resulting data are summarized as $\{(d_{ij}, x_{1ij}, x_{2ij}),\ 1 \le j \le m_i,\ 1 \le i \le n\}$. Let $p_k$ denote the sensitivity, specificity, accuracy, PPV, or NPV for test method $k$. For each operating characteristic, we propose a consistent estimator $\hat{p}_k$ for $p_k$ and derive the asymptotic normality of $\sqrt{n}(\hat{p}_1-\hat{p}_2)$ using clustered binary data. Hence, we reject $H_0: p_1 = p_2$ in favor of $H_1: p_1 \ne p_2$ if the absolute value of $Z=\sqrt{n}(\hat{p}_1-\hat{p}_2)/\hat{\sigma}$ is larger than $z_{1-\alpha/2}$, where $\hat{\sigma}^2$ is a variance estimator of $\sqrt{n}(\hat{p}_1-\hat{p}_2)$ and $z_{1-\alpha/2}$ is the $100(1-\alpha/2)$th percentile of the standard normal distribution.

2.1 Sensitivity and Specificity

An unbiased estimator of the sensitivity $p_k$ for diagnostic test $k\,(=1,2)$ is given by

$$\hat{p}_k=\frac{\sum_{i=1}^{n}\sum_{j=1}^{m_i} x_{kij}d_{ij}}{\sum_{i=1}^{n}\sum_{j=1}^{m_i} d_{ij}}.$$

Let $D=\sum_{i=1}^{n}\sum_{j=1}^{m_i} d_{ij}$ denote the total number of diseased sites across all $n$ subjects. We have

$$\sqrt{n}(\hat{p}_k-p_k)=\frac{\sqrt{n}}{D}\sum_{i=1}^{n}\sum_{j=1}^{m_i}(x_{kij}-p_k)d_{ij}.$$

Since the $n$ subjects are independent, given the disease status of the patients, $\varepsilon_{ki}=\sum_{j=1}^{m_i}(x_{kij}-p_k)d_{ij}$ are independent zero-mean random variables. By the central limit theorem, for large $n$, $\sqrt{n}(\hat{p}_k-p_k)$ is asymptotically normal with mean 0 and variance $\sigma_k^2$ that can be consistently estimated by

$$\hat{\sigma}_k^2=\frac{n}{D^2}\sum_{i=1}^{n}\Big\{\sum_{j=1}^{m_i}(x_{kij}-\hat{p}_k)d_{ij}\Big\}^2.$$

Furthermore, for the difference in sensitivity between two diagnostic tests, we have

$$\sqrt{n}(\hat{p}_1-\hat{p}_2)=\frac{\sqrt{n}}{D}\sum_{i=1}^{n}\sum_{j=1}^{m_i}(x_{1ij}-x_{2ij})d_{ij}.$$

Under $H_0: p_1=p_2$, $\varepsilon_i=\sum_{j=1}^{m_i}(x_{1ij}-x_{2ij})d_{ij}$ are independent zero-mean random variables. Hence, under $H_0$, $\sqrt{n}(\hat{p}_1-\hat{p}_2)$ is approximately normal with mean 0 and variance $\sigma^2$ that can be consistently estimated by

$$\hat{\sigma}^2=\frac{n}{D^2}\sum_{i=1}^{n}\Big\{\sum_{j=1}^{m_i}(x_{1ij}-x_{2ij})d_{ij}\Big\}^2.$$

The inference on specificities can be derived from that on sensitivities by replacing $d_{ij}$ and $x_{kij}$ with $1-d_{ij}$ and $1-x_{kij}$, respectively.
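As a concrete illustration, the clustered comparison of sensitivities above can be sketched in a few lines of Python. The function name and data layout (lists of per-subject arrays) are our own illustrative choices, not from the paper:

```python
import numpy as np

def compare_sensitivity(d, x1, x2):
    """Two-sided test of H0: sens1 = sens2 for clustered paired binary data.

    d, x1, x2: lists of per-subject numpy arrays, one entry per site
    (d = disease indicators, xk = outcomes of test k).  Illustrative
    sketch of the Section 2.1 statistic.
    """
    n = len(d)
    D = sum(di.sum() for di in d)                 # total diseased sites
    p1 = sum((x * di).sum() for x, di in zip(x1, d)) / D
    p2 = sum((x * di).sum() for x, di in zip(x2, d)) / D
    # per-cluster sums eps_i = sum_j (x1ij - x2ij) * d_ij
    eps = np.array([((a - b) * di).sum() for a, b, di in zip(x1, x2, d)])
    var = n / D**2 * (eps**2).sum()               # estimates var of sqrt(n)(p1-p2)
    z = np.sqrt(n) * (p1 - p2) / np.sqrt(var)
    return p1, p2, z
```

Under $H_0$, the returned $z$ is compared with $z_{1-\alpha/2}$, e.g. 1.96 for $\alpha=0.05$; the specificity version follows by the substitution noted above.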

2.2 Accuracy

For site $j$ of patient $i$, the testing result by diagnostic method $k$ is accurate if $y_{kij}\equiv I(x_{kij}=d_{ij})=x_{kij}d_{ij}+(1-x_{kij})(1-d_{ij})$ equals 1, and inaccurate if $y_{kij}=0$. So, an unbiased estimator of the accuracy for method $k$, $p_k=P(x_{kij}=d_{ij})$, is given as

$$\hat{p}_k=\frac{1}{N}\sum_{i=1}^{n}\sum_{j=1}^{m_i} y_{kij},$$

where $N=\sum_{i=1}^{n} m_i$ denotes the total number of sites across all $n$ subjects. From

$$\sqrt{n}(\hat{p}_k-p_k)=\frac{\sqrt{n}}{N}\sum_{i=1}^{n}\sum_{j=1}^{m_i}(y_{kij}-p_k),$$

we see that $\varepsilon_{ki}=\sum_{j=1}^{m_i}(y_{kij}-p_k)$ are independent zero-mean random variables. Hence, $\sqrt{n}(\hat{p}_k-p_k)$ is approximately normal with mean 0 and variance $\sigma_k^2$ that can be consistently estimated by

$$\hat{\sigma}_k^2=\frac{n}{N^2}\sum_{i=1}^{n}\Big\{\sum_{j=1}^{m_i}(y_{kij}-\hat{p}_k)\Big\}^2.$$

For the difference in accuracy between two diagnostic tests, we have

$$\sqrt{n}(\hat{p}_1-\hat{p}_2)=\frac{\sqrt{n}}{N}\sum_{i=1}^{n}\varepsilon_i,$$

where $\varepsilon_i=\sum_{j=1}^{m_i}(y_{1ij}-y_{2ij})$, $i=1,\ldots,n$, are independent zero-mean random variables under $H_0: p_1=p_2$. Hence, under $H_0$, $\sqrt{n}(\hat{p}_1-\hat{p}_2)$ is approximately normal with mean 0 and variance $\sigma^2$ that can be consistently estimated by

$$\hat{\sigma}^2=\frac{n}{N^2}\sum_{i=1}^{n}\Big\{\sum_{j=1}^{m_i}(y_{1ij}-y_{2ij})\Big\}^2.$$

2.3 PPV and NPV

Noting that the PPV of diagnostic test $k$ is defined by $p_k=P(d_{ij}=1\mid x_{kij}=1)=P(x_{kij}=1, d_{ij}=1)/P(x_{kij}=1)\equiv a_k/b_k$, a consistent estimator of $p_k$ is obtained by

$$\hat{p}_k=\frac{\sum_{i=1}^{n}\sum_{j=1}^{m_i} x_{kij}d_{ij}}{\sum_{i=1}^{n}\sum_{j=1}^{m_i} x_{kij}}\equiv\frac{\hat{a}_k}{\hat{b}_k},$$

where $\hat{a}_k=N^{-1}\sum_{i=1}^{n}\sum_{j=1}^{m_i} x_{kij}d_{ij}$ and $\hat{b}_k=N^{-1}\sum_{i=1}^{n}\sum_{j=1}^{m_i} x_{kij}$.

By letting $a_{kij}=x_{kij}d_{ij}$ and $b_{kij}=x_{kij}$, we have

$$\hat{p}_k-p_k=\frac{\hat{a}_k b_k-a_k\hat{b}_k}{b_k\hat{b}_k}=\frac{b_k(\hat{a}_k-a_k)-a_k(\hat{b}_k-b_k)}{b_k^2}+o_p(n^{-1/2}),$$

where the ignorable error term $o_p(n^{-1/2})$ arises from replacing the consistent estimator $\hat{b}_k$ with $b_k$ in the denominator. Then, we have

$$\sqrt{n}(\hat{p}_k-p_k)=\frac{\sqrt{n}}{N}\sum_{i=1}^{n}\sum_{j=1}^{m_i}\Big\{\frac{1}{b_k}(a_{kij}-a_k)-\frac{a_k}{b_k^2}(b_{kij}-b_k)\Big\}+o_p(1)=\frac{\sqrt{n}}{N}\sum_{i=1}^{n}\varepsilon_{ki}+o_p(1),$$

where, for $i=1,\ldots,n$, $\varepsilon_{ki}=b_k^{-1}\sum_{j=1}^{m_i}\{a_{kij}-a_k-p_k(b_{kij}-b_k)\}$ are independent random variables with mean 0. Hence, $\sqrt{n}(\hat{p}_k-p_k)$ is approximately normal with mean 0 and variance $\sigma_k^2$ that can be consistently estimated by

$$\hat{\sigma}_k^2=\frac{n}{N^2}\sum_{i=1}^{n}\hat{\varepsilon}_{ki}^2,$$

where $\hat{\varepsilon}_{ki}$ is obtained from $\varepsilon_{ki}$ by replacing $a_k$, $b_k$, and $p_k$ with their consistent estimators $\hat{a}_k$, $\hat{b}_k$, and $\hat{p}_k$, respectively, i.e. $\hat{\varepsilon}_{ki}=\hat{b}_k^{-1}\sum_{j=1}^{m_i}\{a_{kij}-\hat{a}_k-\hat{p}_k(b_{kij}-\hat{b}_k)\}$.

Similarly, under $H_0: p_1=p_2$, we have

$$\hat{p}_1-\hat{p}_2=(\hat{p}_1-p_1)-(\hat{p}_2-p_2)=N^{-1}\sum_{i=1}^{n}(\varepsilon_{1i}-\varepsilon_{2i})+o_p(n^{-1/2}).$$

For $i=1,\ldots,n$, $\varepsilon_i=\varepsilon_{1i}-\varepsilon_{2i}$ are independent random variables with mean 0. Hence, under $H_0$, $\sqrt{n}(\hat{p}_1-\hat{p}_2)$ is approximately normal with mean 0 and variance $\sigma^2$ that can be consistently estimated by

$$\hat{\sigma}^2=\frac{n}{N^2}\sum_{i=1}^{n}(\hat{\varepsilon}_{1i}-\hat{\varepsilon}_{2i})^2.$$

The inference on NPV can be derived from that on PPV by replacing $d_{ij}$ and $x_{kij}$ with $1-d_{ij}$ and $1-x_{kij}$, respectively.
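The PPV comparison of this subsection can be sketched similarly. As before, the function name and the list-of-arrays data layout are illustrative choices of ours; the NPV version follows by the substitution noted above:

```python
import numpy as np

def compare_ppv(d, x1, x2):
    """Two-sided test of H0: PPV1 = PPV2 for clustered paired binary data.

    d, x1, x2: lists of per-subject numpy arrays, one entry per site.
    Illustrative sketch of the Section 2.3 statistic.
    """
    n = len(d)
    N = sum(len(di) for di in d)                  # total number of sites
    a = [sum((x * di).sum() for x, di in zip(xk, d)) / N for xk in (x1, x2)]
    b = [sum(x.sum() for x in xk) / N for xk in (x1, x2)]
    p = [ak / bk for ak, bk in zip(a, b)]         # PPV estimates a_k / b_k
    eps = []
    for k, xk in enumerate((x1, x2)):
        # eps_ki = b_k^{-1} sum_j {a_kij - a_k - p_k (b_kij - b_k)}
        eps.append(np.array([
            ((x * di - a[k]) - p[k] * (x - b[k])).sum() / b[k]
            for x, di in zip(xk, d)]))
    diff = eps[0] - eps[1]
    var = n / N**2 * (diff**2).sum()              # estimates var of sqrt(n)(p1-p2)
    z = np.sqrt(n) * (p[0] - p[1]) / np.sqrt(var)
    return p[0], p[1], z
```

Again, $|z|$ is compared with $z_{1-\alpha/2}$ under $H_0$.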

3 Simulations

We carried out a simulation study to evaluate the properties of the proposed method under a variety of conditions. First, for sensitivity and specificity with a single binary test outcome per subject, we confirmed that our simple variance formulas yield the same numerical results as the lengthier formulas of Wang et al. [1], which were derived using the multivariate central limit theorem together with the delta method. We therefore focused on testing two PPVs and NPVs with clustered binary test outcomes. We specified the disease prevalence, the number of patients tested $n$, and the positive and negative predictive values for both tests $k\,(=1,2)$, PPVk and NPVk. In addition, we specified the variance $\sigma_1^2$ of a subject-specific random effect, used to induce correlation among the multiple sites of the same individual, and the variance $\sigma_2^2$ of a site-specific random effect, used to induce correlation between the two test outcomes performed at each site.

Under each setting, we simulated 5000 data sets and evaluated the size or power of the proposed method. Each simulated data set was generated as follows. First, we generated the number of sites $m_i$ for the $i$th patient by drawing uniformly from {2, 4, 6, 8, 10}. Given $m_i$, we randomly generated disease indicators $d_{ij}$, $j=1,\ldots,m_i$, for each individual by first drawing a multivariate normal random vector and then dichotomizing each component at the percentile corresponding to the specified disease prevalence. For each individual, independent random effects $r_i$ and $u_{ij}$ were generated from zero-mean normal distributions with variances $\sigma_1^2$ and $\sigma_2^2$, respectively, and binary test outcomes were generated from the probit model $P(x_{kij}=1\mid d_{ij},r_i,u_{ij})=\Phi\{\alpha_k(1-d_{ij})+\beta_k d_{ij}+r_i+u_{ij}\}$ for test $k\,(=1,2)$, where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. The parameters $(\alpha_1,\beta_1,\alpha_2,\beta_2)$ were chosen so that the true- and false-positive rates yield the desired predictive values. By taking the expectation with respect to the random effects, we obtain the marginal rate for test $k$ as a function of the coefficients $(\alpha_k,\beta_k)$ and the random-effect variances $\sigma_1^2$ and $\sigma_2^2$:

$$P(x_k=1\mid d)=\Phi\left\{\frac{\alpha_k(1-d)+\beta_k d}{\sqrt{1+\sigma_1^2+\sigma_2^2}}\right\}.$$
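A minimal sketch of this data-generating mechanism is given below. For simplicity the disease indicators are generated independently here, whereas the paper correlates them through a multivariate normal vector; the function and argument names are our own:

```python
import numpy as np
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def simulate_subject(rng, alpha, beta, s1, s2, prev=0.5):
    """One subject's (d, x1, x2) under the random-effects probit model.

    alpha[k], beta[k]: probit coefficients for non-diseased / diseased sites;
    s1, s2: standard deviations of subject- and site-level random effects.
    """
    m = rng.choice([2, 4, 6, 8, 10])              # cluster size
    d = (rng.random(m) < prev).astype(int)        # disease indicators
    r = rng.normal(0.0, s1)                       # subject-level random effect
    u = rng.normal(0.0, s2, size=m)               # site-level random effects
    x = []
    for k in range(2):
        lin = alpha[k] * (1 - d) + beta[k] * d + r + u
        prob = np.array([phi(v) for v in lin])
        x.append((rng.random(m) < prob).astype(int))
    return d, x[0], x[1]
```

For instance, with $\alpha_k=0$, $\beta_k=1$, and no random effects, the marginal detection rate among diseased sites is $\Phi(1)\approx 0.84$, consistent with the marginal-rate formula above.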

We examined the empirical type I error rate and power for testing the equality of two PPVs or NPVs. We assume a disease prevalence of 0.25, 0.5, or 0.75, and a sample size $n$ of 50, 100, 200, or 500. The number of sites $m_i$ for each subject takes the value 2, 4, 6, 8, or 10 with equal probability 20%. Table 1 presents the Kappa statistics, estimated from the simulated data, as a measure of correlation between the two test outcomes for each patient under the simulation settings. Table 2 reports the empirical size and power for testing the equality of two PPVs with α = 0.05. In Table 2 we consider the null hypothesis that the two diagnostic tests have equal PPVs of 0.7 or 0.85. Under the alternative hypothesis, we set PPV1 = 0.7 or 0.85 and PPV2 = PPV1 + 0.05. In these simulations, NPVs are fixed at 0.9. We observe that the empirical sizes stay close to the nominal level of 0.05 under all settings considered, and that the empirical power increases with sample size and prevalence. We also obtain larger power with PPV1 = 0.85 than with 0.7, since the former yields a larger odds ratio between the two diagnostic tests and higher Kappa statistics, as shown in Table 1.

Table 1.

Estimated Kappa statistic under various PPVs and prevalences

Prevalence PPV1 PPV2 Kappa
0.25 0.70 0.70 0.163
0.70 0.75 0.169
0.85 0.85 0.467
0.85 0.90 0.515
0.50 0.70 0.70 0.284
0.70 0.75 0.310
0.85 0.85 0.506
0.85 0.90 0.604
0.75 0.70 0.70 0.160
0.70 0.75 0.188
0.85 0.85 0.328
0.85 0.90 0.426

Table 2.

Empirical size under H0: PPV1 = PPV2 and power under H1: PPV2 = PPV1+0.05 with PPV1 = 0.7 or 0.85. n denotes the number of subjects.

Prevalence n PPV1 = 0.7 PPV1 = 0.85
Size Power Size Power
0.25 50 0.059 0.095 0.065 0.152
100 0.056 0.107 0.056 0.240
200 0.053 0.156 0.052 0.412
500 0.049 0.303 0.047 0.807
0.50 50 0.055 0.326 0.051 0.333
100 0.056 0.531 0.055 0.554
200 0.051 0.826 0.047 0.875
500 0.054 0.996 0.050 0.997
0.75 50 0.057 0.652 0.061 0.674
100 0.053 0.865 0.054 0.895
200 0.052 0.937 0.055 0.995
500 0.046 1.000 0.052 1.000

We compared the empirical power of our proposed method with that of two previously proposed methods for the single-site-per-subject setting, based on 5000 simulation samples. Table 3 presents the empirical powers obtained by the weighted least squares method of Wang et al. [1], the generalized estimating equation method of Leisenring et al. [2], and our proposed method. In Table 3 we consider the null hypothesis that the two diagnostic tests have equal PPVs of 0.75. Under the alternative hypothesis, we set PPV1 = 0.85 and PPV2 = 0.75. In these simulations, NPVs are fixed at 0.85. We consider various scenarios with a disease prevalence of 0.25, 0.5, or 0.7 and sample sizes of 100, 200, or 500. The proposed method shows similar or slightly larger empirical power than the previously proposed methods. The empirical size is close to the nominal level α = 0.05 for all three methods, so we omit its presentation.

Table 3.

Comparison of empirical powers for equality testing of two PPVs. PPV1 = 0.85, PPV2 = 0.75 and NPV1 = NPV2 = 0.85. n denotes the number of subjects.

Prevalence n WLS1 GEE2 New3
0.25 100 0.040 0.063 0.073
200 0.164 0.191 0.208
500 0.386 0.399 0.404
0.50 100 0.346 0.367 0.390
200 0.614 0.625 0.652
500 0.954 0.954 0.966
0.70 100 0.781 0.791 0.821
200 0.978 0.979 0.982
500 1.000 1.000 1.000
1 Weighted least squares method by Wang et al. [1]

2 Generalized estimating equation method by Leisenring et al. [2]

3 We set the cluster size as $m_i = 1$ for our proposed method.

We also conducted similar simulations on NPVs and observed almost identical results, which we therefore omit.

4 Example

For illustrative purposes, the methodology developed in this paper is applied to real data from a lung cancer diagnostic study. Mediastinoscopy is a technique often used for staging the lymph nodes of lung cancer and involves making an incision above the breastbone. Investigators are interested in comparing its performance with that of a new diagnostic test called EBUS, which is less invasive and has a broader visual scope than mediastinoscopy [11]. The patients with non-small cell lung cancer came from a single-center prospective trial conducted to compare the diagnostic measures (sensitivity, specificity, PPV, and NPV) of EBUS with those of mediastinoscopy for detecting lymph node metastasis. Some lymph nodes are located deeper than others, and it is important to confirm the presence of malignancy in differently located lymph nodes. Each examination was performed to evaluate multiple lymph nodes in various stations.

For each subject, both diagnostic techniques examine several lymph nodes and determine the positivity of cancer malignancy. A gold standard involving lymph node dissection is also available for each lymph node tested. Each test result is a binary outcome. The number of lymph nodes tested per subject varies widely, from three to eighteen, and the investigators agreed to focus on five lymph nodes commonly used for cancer staging. Note that not all five lymph nodes have a gold standard test result, and that even for lymph nodes with a gold standard result, some data were coded as missing in either diagnostic test outcome. In the latter case, the investigators agreed to replace the missing diagnostic test outcome with a negative result, since not being able to see a lymph node is deemed to imply that the unrecognizable lymph node is fine. This results in the number of lymph nodes varying between two and five per patient. A total of 127 patients, on whom both diagnostic tests were performed, were analyzed, with a total of 441 lymph nodes.

Measuring the predictive accuracy of EBUS, we estimate PPV = 1 and NPV = 0.916. For mediastinoscopy, we estimate PPV = 1 and NPV = 0.901. Neither the test of equality between the two PPVs nor that between the two NPVs is statistically significant. The 95% confidence interval for the difference is not available for the PPVs and is (−0.048, 0.017) for the NPVs. Therefore, EBUS and mediastinoscopy do not appear to differ in PPV or NPV for diagnosing lung cancer. The equality test of each operating characteristic (sensitivity, specificity, accuracy, PPV, and NPV) for EBUS and mediastinoscopy is summarized in Table 4.

Table 4.

Analysis of the lung cancer data for comparing the diagnostic performance between Mediastinoscopy (M) and EBUS (E)

Parameter M E p-value* 95% CI**
Sens. 0.715 0.764 0.343 (−0.150, 0.052)
Spec. 1.000 1.000 1.000 NA
Accu. 0.921 0.934 0.343 (−0.042, 0.015)
PPV 1.000 1.000 1.000 NA
NPV 0.901 0.916 0.342 (−0.048, 0.017)
* P-value from two-sided tests for testing the equality of the two diagnostic tests

** 95% confidence interval for $p_M - p_E$

In this project, the investigators are actually interested in showing the noninferiority of EBUS to mediastinoscopy in PPV and NPV. We formulate the null hypothesis that the PPV (or NPV) for EBUS is at least 10% worse than that for mediastinoscopy, against the alternative hypothesis that the PPV (or NPV) for EBUS is not inferior. Following Silva et al. [12], we obtain p-values of 0 and 1.27×10−7 for the noninferiority of PPVs and NPVs, respectively. The one-sided lower confidence bounds for the difference of PPVs and NPVs are 0 and −0.043, respectively. Since both lower confidence bounds are larger than −0.1, the negative of the noninferiority margin, we conclude that EBUS is not inferior to mediastinoscopy in PPV and NPV.
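A generic lower-confidence-bound noninferiority check of this kind can be sketched as follows. This is not the exact procedure of Silva et al. [12]; the function and its arguments are illustrative, with the standard error of the difference assumed supplied, e.g. by the clustered variance estimators of Section 2:

```python
from statistics import NormalDist

def noninferiority_bound(p_new, p_std, se_diff, margin=0.10, alpha=0.05):
    """One-sided noninferiority check via a lower confidence bound.

    p_new, p_std: estimated predictive values of the new and standard tests;
    se_diff: standard error of p_new - p_std (assumed supplied);
    margin: noninferiority margin delta.
    """
    z = NormalDist().inv_cdf(1 - alpha)          # one-sided critical value
    lower = (p_new - p_std) - z * se_diff        # 100(1-alpha)% lower bound
    return lower, lower > -margin                # noninferior if bound > -margin
```

With the NPV estimates 0.916 and 0.901 and an illustrative standard error of 0.0353, the bound is about −0.043 and noninferiority holds at margin 0.1, roughly reproducing the bound reported above.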

5 Discussion

Recent advances in biotechnology have produced many new medical tests for various purposes, including diagnosis, prognosis, and disease screening. Various measures can be used to quantify test performance. Among those frequently used for binary tests are sensitivity, specificity, positive predictive value, and negative predictive value. They determine the extent to which a test accurately reflects the presence or absence of disease and are often used at various stages of cancer treatment. Several authors have pointed out that the predictive values are more directly applicable in patient care and thus have greater clinical relevance than sensitivity or specificity.

In this article we have developed new methods for comparing operating characteristics (sensitivity, specificity, accuracy, positive predictive value, and negative predictive value) of two diagnostic tests that have clustered binary test outcomes with varying cluster sizes. We have provided consistent variance estimators that are simple to calculate for testing the equality of diagnostic performance between two tests. A major advantage of our method is that it does not require the lengthy delta method or the variable transformation proposed by Wang et al. [1] and Moskowitz and Pepe [4]. Using simulation studies, we have shown that our test statistics for comparing two predictive values perform well in small and moderate samples under a variety of conditions. In particular, we investigated the diagnostic performance of a newly developed test, EBUS, relative to mediastinoscopy by testing the equality of their positive and negative predictive values. Our test statistic is familiar and intuitive, resembling the one used for testing the equality of two population proportions. Our approach is useful in that the calculation is quite simple and requires no modeling.

In a case-control study, by Bayes' rule the PPV is a function of sensitivity, specificity, and disease prevalence, given by

$$\mathrm{PPV}=\frac{\text{sensitivity}\times\text{prevalence}}{\text{sensitivity}\times\text{prevalence}+(1-\text{specificity})\times(1-\text{prevalence})}.$$

So PPV and NPV depend on the population chosen and on the disease prevalence, which cannot be estimated in case-control designs. However, the patients in our motivating example came from a single-center prospective study conducted to compare the performance of two diagnostic tests, EBUS and mediastinoscopy, in evaluating lymph node metastasis. Our setting is similar to the paired cross-sectional study considered in Wang et al. [1] and Moskowitz and Pepe [4]. In a cross-sectional study on the relevant test population, both PPV and NPV can be estimated directly from the study results.
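For a quick numerical illustration of the Bayes' rule formula above (the numbers are chosen for illustration and are not from the study):

```python
def ppv(sens, spec, prev):
    """Positive predictive value via Bayes' rule."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

def npv(sens, spec, prev):
    """Negative predictive value: P(no disease | negative test)."""
    return spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
```

For example, at 10% prevalence a test with sensitivity 0.8 and specificity 0.9 gives PPV ≈ 0.47 but NPV ≈ 0.98, showing how strongly the predictive values depend on prevalence.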

Predictive values vary with disease prevalence. It has been pointed out that, when the prevalence of disease is very low, the negative predictive value is high even for poorly accurate tests [13]. However, when testing the equality of two predictive values, the prevalence of disease does not seem to affect the performance of the test much in terms of test size and power.

References

  1. Wang W, Davis CS, Soong SJ. Comparison of predictive values of two diagnostic tests from the same sample of subjects using weighted least squares. Statistics in Medicine. 2006;25:2215–2229. doi: 10.1002/sim.2332.
  2. Leisenring W, Alonzo TA, Pepe MS. Comparisons of predictive values of binary medical diagnostic tests for paired designs. Biometrics. 2000;56:345–351. doi: 10.1111/j.0006-341x.2000.00345.x.
  3. Kosinski AS. A weighted generalized score statistic for comparison of predictive values of diagnostic tests. Statistics in Medicine. 2012. doi: 10.1002/sim.5587.
  4. Moskowitz CS, Pepe MS. Comparing the predictive values of diagnostic tests: sample size and analysis for paired study designs. Clinical Trials. 2006;3:272–279. doi: 10.1191/1740774506cn147oa.
  5. Rao JNK, Scott AJ. A simple method for the analysis of clustered binary data. Biometrics. 1992;48:577–585.
  6. Porte H, Roumilhac D, Eraldi L, Cordonnier C, Puech P, Wurtz A. The role of mediastinoscopy in the diagnosis of mediastinal lymphadenopathy. European Journal of Cardio-Thoracic Surgery. 1998;13:196–199. doi: 10.1016/s1010-7940(97)00324-2.
  7. Silvestri GA, Gonzalez AV, Jantz MA, et al. Methods for staging non-small cell lung cancer: Diagnosis and management of lung cancer, 3rd ed: American College of Chest Physicians evidence-based clinical practice guidelines. Chest. 2013;143:e211S–e250S. doi: 10.1378/chest.12-2355.
  8. Jung SH, Kang SH, Ahn C. Sample size calculations for clustered binary data. Statistics in Medicine. 2001;20:1971–1982. doi: 10.1002/sim.846.
  9. Hu F, Schucany WR, Ahn C. Nonparametric sample size estimation for sensitivity and specificity with multiple observations per subject. Drug Information Journal. 2010;44(5):609–616. doi: 10.1177/009286151004400508.
  10. Katsis A. Sample size for testing homogeneity of two a priori dependent binomial populations using the Bayesian approach. Journal of Applied Mathematics and Decision Sciences. 2004;8(1):33–42.
  11. Um SW, Kim HK, Jung SH, Han J, Lee KJ, Park HY, Choi YS, Shim YM, Ahn MJ, Park K, Ahn YC, Choi JY, Lee KS, Suh JY, Chung MP, Kwon OJ, Kim J, Kim H. Endobronchial Ultrasound versus Mediastinoscopy for Mediastinal Nodal Staging of Non-small Cell Lung Cancer. Journal of Thoracic Oncology. To appear. doi: 10.1097/JTO.0000000000000388.
  12. Silva GT, Logan BR, Klein JP. Methods for equivalence and noninferiority testing. Biology of Blood and Marrow Transplantation. 2008;15(1):120–127. doi: 10.1016/j.bbmt.2008.10.004.
  13. Altman DG, Bland JM. Diagnostic tests 2: predictive values. British Medical Journal. 1994;309:102. doi: 10.1136/bmj.309.6947.102.
