A sequential conditional probability ratio test procedure for comparing diagnostic tests

Liansheng Tang; Ming Tan; Xiao-Hua Zhou

doi:10.1080/02664763.2010.515678

Author manuscript; available in PMC: 2012 Apr 20.

Published in final edited form as: J Appl Stat. 2010 Oct 1;38(8):1623–1632. doi: 10.1080/02664763.2010.515678


PMCID: PMC3331726 NIHMSID: NIHMS368006 PMID: 22523441

Abstract

In this paper, we derive sequential conditional probability ratio tests to compare diagnostic tests without distributional assumptions on test results. The test statistics in our method are nonparametric weighted areas under the receiver-operating characteristic curves. With the new method, a decision to stop a diagnostic trial early is unlikely to be reversed should the trial continue to the planned end. The conservative stopping boundaries that this approach applies during the course of the trial are especially appealing for diagnostic trials since the end point is not death. In addition, the maximum sample size of our method is no greater than that of a fixed sample test with a similar power function. Simulation studies are performed to evaluate the properties of the proposed sequential procedure. We illustrate the method using data from a thoracic aorta imaging study.

Keywords: diagnostic accuracy, ROC, AUC, weighted AUC, SCPRT

1. Introduction

Magnetic resonance imaging (MRI) is routinely used in disease diagnosis because of its high resolution and relative safety in practice. Unlike traditional computed tomography, which uses ionizing radiation, MRI machines apply a strong magnetic field to align hydrogen atoms in the body and generate 3D images from these aligned atoms. Due to its noninvasive nature, MRI can be a good candidate for diagnosing patients with severe conditions. Diagnostic trials have been conducted to compare MRI with traditional computed tomography or to compare different MRI modalities. For instance, thoracic aortic dissection is a life-threatening condition. Its mortality rate within 48 h can reach 68% without early diagnosis and prompt treatment [13]. MRI provides a detailed view of the dissecting process and shows higher sensitivity than many other imaging modalities [12]. It has thus emerged as an excellent diagnostic tool [1,21]. Common MRI techniques include spin-echo MRI (SE-MRI) and cinematic presentation of MRI (CINE-MRI). The former is a conventional technique, and the latter monitors the flow of cerebrospinal fluid in addition to SE-MRI. To compare the accuracy of these two MRI modalities in detecting thoracic aortic dissections, a diagnostic imaging trial was conducted by Van Dyke et al. [17].

To summarize test results in these trials, the receiver-operating characteristic (ROC) curve is a commonly used statistical tool [22]. The ROC curve plots the true positive rate (TPR) (the proportion of diseased subjects correctly detected) versus the false positive rate (FPR) (the proportion of nondiseased subjects incorrectly identified as diseased) over the entire range of threshold values. Summary measures of ROC curves are obtained to evaluate the accuracy of diagnostic tests. The area under the ROC curve (AUC) is one of these summary measures [7]. However, when the ROC curves from two tests intersect, the resulting AUCs may be equal. This may lead to the false conclusion that the two tests are equally accurate. Concerned with this limitation, Wieand et al. [18] proposed a Δ-statistic to compare the discriminant accuracy of tests over a prespecified range of specificities. A properly defined Δ-statistic also gives the nonparametric sensitivity estimator at a given specificity.

Group sequential designs in these diagnostic imaging trials are logistically feasible since a patient’s disease status is usually available by some gold standard when the patient is recruited, and test results in these trials are immediately generated from scans [10]. Parametric and nonparametric group sequential methods have been proposed for diagnostic trials [11,16,23]. Zhou et al. [23] proposed a nonparametric sequential AUC difference estimator. They used the fact that an empirical AUC estimator is essentially a Wilcoxon statistic. They then derived its asymptotic property in group sequential designs. A more general sequential ROC summary statistic was introduced by Tang et al. [16]. They proposed a sequential Δ-statistic which is asymptotically a Brownian motion process as the information time increases. This desirable property allows the Δ-statistic to be used with standard group sequential monitoring methods such as Pocock, O’Brien-Fleming (OBF), and the error spending function method. Pocock’s test applies the same nominal significance level at each of the sequential stages, while the OBF test makes early rejection more difficult and later rejection easier. These two tests require a fixed number of groups to compute the stopping boundaries. The more flexible error spending function test does not specify the number of stages. It partitions the overall type I error to derive stopping boundaries for the sequentially computed test statistics at every group. The test statistics are compared with the boundaries to determine the acceptance or rejection of the null hypothesis until the type I error is totally spent (see [8] for a more detailed discussion). Compared with a fixed sample test, these sequential tests have smaller expected sample sizes as they tend to reject the null hypothesis in earlier stages. However, the maximum sample sizes (MSS) of these tests exceed those of the fixed sample test. This may be considered a distinct disadvantage by medical researchers, and may become a practical obstacle to their use of group sequential tests in diagnostic trials. In addition, these methods do not consider likely outcomes if the trial is continued to its planned end.

To address the aforementioned concerns with common sequential tests, a class of stochastic curtailment procedures including the predictive power test (PPT) [2,3,5,14] and the sequential conditional probability ratio test (SCPRT) [19,20] has been proposed. Using stochastic curtailment tests, we can specify in the trial protocol, for a given interim analysis plan, a pre-trial discordance probability: the probability that the conclusion based on the interim data would be reversed should the trial continue to the end. Confidence in the early stopping decision is thus enhanced if this discordance probability is negligible (say, 1%). The discordance probability reflects the chance we are willing to take in making a decision based on interim data. If we want to be more conservative, we set a small discordance probability (say, 1%); if we can be less conservative in an individual trial, we can set a discordance probability of 3–5% to derive the monitoring boundary. The PPT requires a prior distribution for the parameter of interest, while the SCPRT is virtually parameter-free and can obtain almost identical stopping boundaries to the predictive power approach. Furthermore, the SCPRT has MSS no larger than a fixed sample test with the same type I and II error rates. Since the test is conditional on the total information at the planned end, the decision reached by the test during interim analysis is unlikely to be reversed should the trial continue to the end. Due to the conditioning, the SCPRT tends to have larger expected sample sizes than common group sequential design methods, as shown in [15].

The purpose of this article is to derive a SCPRT test in comparing diagnostic imaging tests, to evaluate their properties and to compare their performance with the classic Pocock and OBF tests in diagnostic trials. Utilizing the SCPRT framework for a class of stochastic processes based on information time [20], we can derive the boundaries for the test statistic of interest if the sequentially computed test statistic can be approximated by a Brownian motion. Therefore, based on the Brownian motion property of Δ-statistics, we can derive a new nonparametric SCPRT test to sequentially compare accuracy of diagnostic tests. Our method will not only enjoy appealing properties of SCPRTs, but also include a general family of ROC summary measures. The rest of the paper is organized as follows. In Section 2, we briefly introduce the SCPRT procedure based on information time and the Δ-statistics. We then combine a sequential version of Δ-statistics with SCPRT tests to form a new class of sequential procedures. In Section 3, we evaluate their properties in simulation studies. In Section 4, we illustrate our method with the thoracic aorta imaging trial. Finally, discussion is given in Section 5.

2. Methods

We first briefly introduce SCPRT tests and the Δ-statistic. We will then combine the sequential Δ-statistic and SCPRT tests, and propose a new sequential procedure to have common MSS as the fixed sample test under the same specification of power, type I error, and alternatives.

2.1 SCPRT tests on information time

Without loss of generality, we define the null H₀: θ = θ₀ versus the alternative H_a: θ > θ₀. In this section, we describe SCPRT tests under H₀ versus H_a at a set of discrete information time points, (t₁, t₂, …, t_K), in a group sequential design with a maximum of K analyses, where t_k is the information time, k = 1, …, K, and t_{k₁} ≤ t_{k₂} for k₁ < k₂. Let w_k be a sufficient test statistic for the parameter θ at the kth stage. In a group sequential test, if w_k is outside of the stopping boundaries defined by (a_k, b_k), then the trial is stopped. H₀ is rejected in favor of H_a if w_k ≥ b_k, and H₀ is accepted if w_k ≤ a_k. Otherwise, the trial is continued. At the final stage, we let a_K = b_K to ensure termination of the trial: w_K ≥ b_K allows us to reject H₀, and w_K < a_K allows the acceptance of H₀. Let L(w_k|w_K) be the likelihood function of w_k given w_K. The conditional maximum-likelihood ratio is given by

λ(t_{k}, w_{k} \mid t_{K}, z_{α}) = \frac{\max_{w > z_{α}} L(w_{k} \mid w_{K} = w)}{\max_{w \leq z_{α}} L(w_{k} \mid w_{K} = w)},

(1)

where z_α is the (1 − α)th percentile of a standard normal distribution, which is also the stopping boundary at the last time point t_K. Assuming that the sequential statistic w_k is approximately a Brownian motion process, the lower and upper stopping boundaries based on Equation (1) were derived by Xiong [19] and Xiong et al. [20] as follows:

a_{k} = z_{α} t_{k} - \{2 a t_{k} (1 - t_{k})\}^{1/2}, \quad \text{and} \quad b_{k} = z_{α} t_{k} + \{2 b t_{k} (1 - t_{k})\}^{1/2},

(2)

where a and b are determined by the probability of discordance between the conclusion by the sequential test and the conclusion should the trial be carried out to the planned end.
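As a minimal sketch of how such boundaries might be computed, the Python function below evaluates boundaries on the Brownian-motion scale, assuming they are centered at z_α t_k with half-widths {2a t_k(1 − t_k)}^{1/2} and {2b t_k(1 − t_k)}^{1/2}. The function name and the constants z_α = 1.96 and a = b = 4.75 are illustrative assumptions, chosen only because they reproduce the two-look boundaries quoted in Section 4; in practice the design constants should be taken from Table 1 of [20].

```python
import math


def scprt_boundaries(z_alpha, a, b, info_times):
    """Return (lower, upper) boundary lists on the Brownian-motion scale.

    Assumes boundaries of the form z_alpha * t -/+ sqrt(2 * c * t * (1 - t)),
    with c = a for the lower and c = b for the upper boundary.
    """
    lower, upper = [], []
    for t in info_times:
        center = z_alpha * t  # conditional mean of B(t) given B(1) = z_alpha
        lower.append(center - math.sqrt(2.0 * a * t * (1.0 - t)))
        upper.append(center + math.sqrt(2.0 * b * t * (1.0 - t)))
    return lower, upper


# Illustrative two-look design (assumed constants, see the text above).
lower, upper = scprt_boundaries(z_alpha=1.96, a=4.75, b=4.75, info_times=[0.5, 1.0])
print([round(v, 4) for v in lower])  # [-0.5611, 1.96]
print([round(v, 4) for v in upper])  # [2.5211, 1.96]
```

At the final look (t = 1) the two boundaries coincide at z_α, which is what guarantees termination of the trial.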

2.2 Δ-Statistic

Suppose two diagnostic tests are conducted on m diseased subjects and n nondiseased subjects. We denote the measurement from test ℓ (ℓ = 1, 2) on the ith diseased subject as X_{ℓi}, where i = 1, …, m, and the measurement on the jth nondiseased subject as Y_{ℓj}, where j = 1, …, n. Let (X_{1i}, X_{2i}) ~ F(x₁, x₂) denote the joint survival function for the diseased population, with marginal survival functions X_{ℓi} ~ F_ℓ(x). Similarly, let (Y_{1j}, Y_{2j}) ~ G(y₁, y₂) for the nondiseased population, with marginal survival functions Y_{ℓj} ~ G_ℓ(y).

An ROC curve for the ℓth test can be expressed as a plot of the TPR versus the FPR, as the threshold varies over the real numbers. Equivalently, we can define the ROC curve for test ℓ as ${ROC}_{ℓ}(u) = F_{ℓ}\{G_{ℓ}^{-1}(u)\}$, where 0 ≤ u ≤ 1, and $G_{ℓ}^{-1}(u) = \inf\{y : G_{ℓ}(y) < u\}$. Here, u corresponds to the FPR. Wieand et al. [18] proposed comparing two ROC curves on the basis of the weighted area under the ROC curve (wAUC) $Ω_{ℓ} = \int_{0}^{1} [F_{ℓ}\{G_{ℓ}^{-1}(u)\}] \, dW(u)$, with a probability measure W(u) defined on the FPR, u, for u ∈ (0, 1). Included in this class of accuracy measures are the AUC (when W(u) = u for 0 < u < 1), the partial area under the curve (pAUC) between FPRs u₁ and u₂ (when W(u) = (u − u₁)/(u₂ − u₁) for 0 < u₁ ≤ u ≤ u₂ ≤ 1), and the sensitivity at a given FPR u₀, that is, at specificity 1 − u₀ (when W(u) is a point mass at u₀). A natural nonparametric estimator for Ω_ℓ is given by

\hat{Ω}_{ℓ} = \int_{0}^{1} [\hat{F}_{ℓ}\{\hat{G}_{ℓ}^{-1}(u)\}] \, dW(u),

(3)

based on the empirical survival functions F̂_ℓ(x) and Ĝ_ℓ(y), where $\hat{G}_{ℓ}^{-1}(u) = \inf\{y : \hat{G}_{ℓ}(y) < u\}$. Wieand et al. [18] used the difference between two wAUCs, Δ = Ω₁ − Ω₂, as a statistical measure to compare diagnostic tests. Its nonparametric estimator, Δ̂, is given by

\hat{Δ} = {\hat{Ω}}_{1} - {\hat{Ω}}_{2}

(4)
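For the special case W(u) = u, the estimator in Equation (3) reduces to the familiar Mann-Whitney form of the empirical AUC, so Δ̂ can be illustrated in a few lines of Python. This is only a sketch: the function names are hypothetical, the ratings are made-up illustrative data, and counting ties as 1/2 is a common convention assumed here rather than something the text specifies.

```python
def empirical_auc(diseased, nondiseased):
    """Mann-Whitney estimate of P(X > Y); ties count 1/2 (assumed convention)."""
    total = 0.0
    for x in diseased:
        for y in nondiseased:
            if x > y:
                total += 1.0
            elif x == y:
                total += 0.5
    return total / (len(diseased) * len(nondiseased))


def delta_hat(x1, y1, x2, y2):
    """Difference of the empirical AUCs of test 1 and test 2."""
    return empirical_auc(x1, y1) - empirical_auc(x2, y2)


# Hypothetical five-point ratings for four diseased and four nondiseased subjects.
x1, y1 = [3, 4, 5, 5], [1, 2, 3, 4]  # test 1
x2, y2 = [2, 3, 4, 5], [1, 2, 3, 4]  # test 2
print(empirical_auc(x1, y1))      # 0.875
print(empirical_auc(x2, y2))      # 0.71875
print(delta_hat(x1, y1, x2, y2))  # 0.15625
```

Other choices of W(u), such as the pAUC weight, would need the empirical survival functions and quantiles explicitly and are not covered by this shortcut.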

From results in [16,18], the asymptotic variance of Δ̂ takes the form

σ_{Δ}^{2} = \frac{v_{X}}{m} + \frac{v_{Y}}{n},

where v_X and v_Y are given by

\begin{aligned}
v_{X} &= \sum_{ℓ = 1}^{2} \left( \int_{0}^{1} \int_{0}^{1} F_{ℓ}\{G_{ℓ}^{-1}(u_{1} \land u_{2})\} \, dW(u_{1}) \, dW(u_{2}) - \left[ \int_{0}^{1} F_{ℓ}\{G_{ℓ}^{-1}(u_{1})\} \, dW(u_{1}) \right]^{2} \right) \\
&\quad - 2 \int_{0}^{1} \int_{0}^{1} \left[ F\{G_{1}^{-1}(u_{1}), G_{2}^{-1}(u_{2})\} - F_{1}\{G_{1}^{-1}(u_{1})\} F_{2}\{G_{2}^{-1}(u_{2})\} \right] dW(u_{1}) \, dW(u_{2}), \\
v_{Y} &= \sum_{ℓ = 1}^{2} \left[ \int_{0}^{1} \int_{0}^{1} r_{ℓ}(u_{1}) r_{ℓ}(u_{2}) (u_{1} \land u_{2}) \, dW(u_{1}) \, dW(u_{2}) - \left\{ \int_{0}^{1} r_{ℓ}(u) \, u \, dW(u) \right\}^{2} \right] \\
&\quad - 2 \int_{0}^{1} \int_{0}^{1} r_{1}(u_{1}) r_{2}(u_{2}) \left[ G\{G_{1}^{-1}(u_{1}), G_{2}^{-1}(u_{2})\} - u_{1} u_{2} \right] dW(u_{1}) \, dW(u_{2}),
\end{aligned}

(5)

with

r_{ℓ}(u) = \frac{F_{ℓ}'\{G_{ℓ}^{-1}(u)\}}{G_{ℓ}'\{G_{ℓ}^{-1}(u)\}}.

2.3 Our method

We define the following symbols for the kth stage of a SCPRT test with a maximum K analyses, k = 1, …, K:

m_k, n_k are the numbers of available observations for the diseased and nondiseased groups, respectively
F̂_{ℓk}, Ĝ_{ℓk} are the respective empirical survival functions
Δ̂_k = Ω̂_{k1} − Ω̂_{k2}, where Ω̂_{kℓ} is the ℓth empirical wAUC
σ²_{Δk} is the variance of Δ̂_k at the kth look; its estimate is σ̂²_{Δk}
Z_k = Δ̂_k/σ̂_{Δk}
I_k = 1/σ²_{Δk} is the statistical information; consequently, I_k ≤ I_{k+1}, k = 1, …, K − 1
τ_k = I_k/I_K.

Define a modified sequential Δ-statistic $B_{k} = \sqrt{τ_{k} I_{k}} {\hat{Δ}}_{k}$ , which is an asymptotically unbiased estimator for $\sqrt{τ_{k} I_{k}} Δ_{k}$ , with asymptotic variance var(B_k) = τ_k. B_k behaves asymptotically like a Brownian motion process with the drift parameter $Δ \sqrt{I_{K}}$ [16]. We propose to use the modified sequential Δ-statistic, B_k, in SCPRT tests in the following paragraph.

At the planning stage of a trial, MSS of a SCPRT test are the same as those of a fixed sample size test with level α and power 1 − β under the same H_a. For instance, suppose we are interested in a one-sided test of H₀: Δ = 0 versus H_a: Δ > 0. The rejection of H₀ will show significant evidence of the superiority of test 1 over test 2 in diagnostic accuracy to discriminate the diseased from the non-diseased. Let λ = m_K/n_K. Given an alternative value δ_a, the MSS m_K and n_K are calculated by

m_{K} = λ n_{K} = \frac{(z_{α} + z_{β})^{2} V^{2}}{δ_{a}^{2}},

where V² is an initial guess for v_X + λv_Y provided by investigators or calculated from preliminary results. As the trial is carried out to the kth stage (1 ≤ k < K), suppose m_k diseased and n_k nondiseased subjects have been recruited and the two diagnostic tests of interest have been conducted on these subjects. From the available test outcomes, we estimate the sequential Δ-statistic Δ̂_k, its standard error σ̂_{Δk}, and the modified interim statistic B_k. It can be shown from Equation (5) that the statistical information time at the kth stage is the ratio of the current sample size to the MSS, i.e. τ_k = m_k/m_K, or τ_k = n_k/n_K, as we keep the sample size ratio λ = m_k/n_k of the diseased to the nondiseased constant. The estimated modified Δ-statistic then becomes

B_{k} = \frac{m_{k} {\hat{Δ}}_{k}}{\sqrt{m_{K} ({\hat{v}}_{X k} + λ {\hat{v}}_{Y k})}},

where v̂_{Xk} and v̂_{Yk} are the respective estimates of v_X and v_Y at the kth stage. In our simulation studies and the examples, the density functions in r_ℓ(u) are estimated using the Epanechnikov kernel function E(x) = (3/4)(1 − x²) I(|x| ≤ 1) with a bandwidth of 4/max{min(m_k, n_k)^{4/5}, 50}. The stopping boundaries a_k and b_k are obtained via Equation (2) with a given probability of discordance. Table 1 of [20] provides the exact values of the design parameters a and b under various discordance probabilities. According to Tan et al. [15], if B_k is outside of (a_k, b_k), then the trial is stopped without accruing more subjects. The null is rejected if B_k ≥ b_k, and the null is accepted if B_k ≤ a_k. Otherwise, more subjects are recruited to continue to the (k + 1)th interim analysis. If at the final analysis B̂_K ≥ b_K, we conclude that test 1 has diagnostic accuracy superior to that of test 2. Note that our method is not limited to the aforementioned one-sided test. We can apply it to a two-sided test of equal diagnostic accuracy or to other hypothesis tests. The test procedure is essentially the same.
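The planning and interim computations described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, z_α and z_β are taken as upper-tail normal quantiles, V² is interpreted as a planning guess for v_X + λv_Y, sample sizes are rounded up, and the numerical inputs in the usage lines are made up.

```python
import math
from statistics import NormalDist


def max_sample_sizes(alpha, beta, v_sq, delta_a, lam=1.0):
    """Fixed-sample (and hence SCPRT) maximum sample sizes (m_K, n_K).

    v_sq is the planning guess V^2 for v_X + lam * v_Y (assumption).
    """
    z = NormalDist().inv_cdf
    m_K = math.ceil((z(1 - alpha) + z(1 - beta)) ** 2 * v_sq / delta_a ** 2)
    n_K = math.ceil(m_K / lam)  # lam = m_K / n_K
    return m_K, n_K


def modified_delta_statistic(delta_hat_k, m_k, m_K, v_x_hat, v_y_hat, lam=1.0):
    """B_k = m_k * delta_hat_k / sqrt(m_K * (v_x_hat + lam * v_y_hat))."""
    return m_k * delta_hat_k / math.sqrt(m_K * (v_x_hat + lam * v_y_hat))


# Hypothetical planning values: alpha = 0.05, power = 0.8, V^2 = 0.3, delta_a = 0.1.
m_K, n_K = max_sample_sizes(alpha=0.05, beta=0.2, v_sq=0.3, delta_a=0.1)
print(m_K, n_K)  # 186 186

# Hypothetical interim look at half of the maximum information.
B_k = modified_delta_statistic(0.05, m_k=93, m_K=m_K, v_x_hat=0.15, v_y_hat=0.15)
print(round(B_k, 4))
```

With λ held constant, the same value of B_k is obtained from the definition √(τ_k I_k) Δ̂_k with τ_k = m_k/m_K and I_k = m_k/(v̂_{Xk} + λv̂_{Yk}), which is a convenient consistency check.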

Table 1.

Simulated significance level (m = n).

		Three-group SCPRT test			Four-group SCPRT test
p_d	m(n)	Norm	Lognorm	Exp	Norm	Lognorm	Exp
0.001	50	0.022	0.032	0.032	0.027	0.032	0.028
	100	0.030	0.029	0.022	0.025	0.031	0.027
	200	0.023	0.020	0.024	0.023	0.023	0.025
0.010	50	0.022	0.033	0.032	0.028	0.032	0.029
	100	0.032	0.029	0.024	0.029	0.033	0.029
	200	0.024	0.022	0.025	0.025	0.025	0.025
0.040	50	0.034	0.041	0.036	0.040	0.032	0.030
	100	0.037	0.035	0.030	0.037	0.035	0.035
	200	0.026	0.025	0.025	0.027	0.025	0.031


Note: The 95% prediction interval is (0.025 ± 0.010).

Our method is summarized in the following steps:

At the kth stage, determine the boundaries a_k and b_k, and calculate B_k
Reject H₀ if B_k ≥ b_k, and accept H₀ if B_k ≤ a_k
Otherwise, continue to the (k + 1)th stage
At the final stage, conclude that test 1 has diagnostic accuracy superior to that of test 2 if B̂_K ≥ b_K
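The steps above can be sketched as a single decision function. The function name and the string labels are hypothetical; the numbers in the usage lines are the first-look statistic and boundaries reported in Section 4.

```python
def scprt_decision(b_stat, lower, upper, is_final=False):
    """Apply the stopping rule at one look.

    Rejects H0 if b_stat >= upper, accepts H0 if b_stat <= lower,
    and otherwise continues. At the final look the boundaries coincide,
    so any statistic below the upper boundary leads to acceptance.
    """
    if b_stat >= upper:
        return "reject H0"
    if is_final or b_stat <= lower:
        return "accept H0"
    return "continue"


print(scprt_decision(-0.8852, -0.5611, 2.5211))          # accept H0
print(scprt_decision(1.0, -0.5611, 2.5211))              # continue
print(scprt_decision(1.0, 1.96, 1.96, is_final=True))    # accept H0
```

Keeping the rule separate from the boundary calculation makes it easy to reuse with boundaries taken from a published table.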

3. Simulation studies

3.1 Type I error rate

We applied our SCPRT method to simulated data sets to evaluate its finite-sample performance. To estimate the type I error rate, the hypotheses were set to be H₀: Δ = 0 versus H_a: Δ > 0. Under the true H₀, we simulated test results using the following three parametric models:

Bivariate normal: (X₁, X₂) ~ N((11, 1), Σ₁) and (Y₁, Y₂) ~ N((10, 0), Σ₂), where
$Σ_{1} = \begin{pmatrix} 1 & \sqrt{2} ρ \\ \sqrt{2} ρ & 2 \end{pmatrix}$ and $Σ_{2} = \begin{pmatrix} 2 & \sqrt{2} ρ \\ \sqrt{2} ρ & 1 \end{pmatrix}$, with ρ = 0.5.

We chose these mean vectors so that true AUCs are the same for two tests.
Bivariate lognormal: exp(X₁, X₂) and exp(Y₁, Y₂), where (X₁, X₂), (Y₁, Y₂) are from the above bivariate normal.
Bivariate exponential [6]: (X₁, X₂) ~ H_x (x₁, x₂), and (Y₁, Y₂) ~ H_y (y₁, y₂). Here, H_x (x₁, x₂) = exp(−β₁₁x₁) exp(−β₂₁x₂)[1 + 4ρ{1 − exp(−β₁₁x₁)}{1 − exp(−β₂₁x₂)}], and H_y (y₁, y₂) = exp(−β₁₂y₁) exp(−β₂₂y₂)[1 + 4ρ{1 − exp(−β₁₂y₁)}{1 − exp(−β₂₂y₂)}]. In the simulation, ρ = 0.25 and β = (β₁₁, β₁₂, β₂₁, β₂₂) = (1, 2, 2, 4). Again, the β vector was chosen so that true AUCs for two tests are the same.
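That these parameter choices give equal AUCs under H₀ can be checked with standard closed forms, sketched below in Python. The formulas are textbook results assumed here rather than taken from the text: for independent normal X and Y, AUC = Φ{(μ_X − μ_Y)/(σ²_X + σ²_Y)^{1/2}}, and for independent exponential X and Y with rates β_X and β_Y, AUC = β_Y/(β_X + β_Y). The function names are hypothetical. Because the AUC is invariant to monotone transformations, the lognormal model shares the normal model's AUCs.

```python
import math
from statistics import NormalDist


def binormal_auc(mu_x, mu_y, var_x, var_y):
    """P(X > Y) for independent normal X (diseased) and Y (nondiseased)."""
    return NormalDist().cdf((mu_x - mu_y) / math.sqrt(var_x + var_y))


def exponential_auc(rate_x, rate_y):
    """P(X > Y) for independent exponential X and Y with the given rates."""
    return rate_y / (rate_x + rate_y)


# Bivariate normal model: marginal means and variances of each test.
auc_norm_1 = binormal_auc(11, 10, 1, 2)  # test 1
auc_norm_2 = binormal_auc(1, 0, 2, 1)    # test 2

# Bivariate exponential model: (beta_11, beta_12, beta_21, beta_22) = (1, 2, 2, 4).
auc_exp_1 = exponential_auc(1, 2)  # test 1
auc_exp_2 = exponential_auc(2, 4)  # test 2

print(auc_norm_1 == auc_norm_2)                 # True
print(abs(auc_exp_1 - auc_exp_2) < 1e-12)       # True
```

Only the marginal distributions enter these calculations; the correlation ρ affects the variance of Δ̂ but not the AUCs themselves.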

In the simulation, we considered three-group and four-group equally spaced SCPRT tests. Different discordance probabilities, p_d, ranging from 0.001 to 0.04 were specified. The nominal significance level was set to be α = 0.025. The simulated significance levels were calculated from 1000 replicates under each setting and are shown in Table 1 (m = n) and in Table 2 (m ≠ n). Most of the simulated levels are within the 95% prediction interval, which illustrates the good finite-sample performance of our procedure.
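The width of the prediction interval quoted in the note to Table 1 is consistent with the usual binomial Monte Carlo error. The short Python check below assumes, which the text does not state explicitly, that the interval was obtained from the normal approximation α ± 1.96{α(1 − α)/N}^{1/2} with N = 1000 replicates; the function name is hypothetical.

```python
import math


def monte_carlo_half_width(level, n_rep, z=1.96):
    """Half-width of the normal-approximation interval for a simulated rate."""
    return z * math.sqrt(level * (1.0 - level) / n_rep)


half = monte_carlo_half_width(0.025, 1000)
print(round(half, 4))  # 0.0097
```

Rounded to three decimals this gives 0.010, matching the interval 0.025 ± 0.010.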

Table 2.

Simulated significance level (m ≠ n).

			Three-group SCPRT test			Four-group SCPRT test
p_d	m	n	Norm	Lognorm	Exp	Norm	Lognorm	Exp
0.001	50	60	0.021	0.028	0.030	0.024	0.016	0.028
0.001	50	80	0.018	0.017	0.027	0.015	0.021	0.020
0.010	50	60	0.022	0.028	0.030	0.025	0.016	0.028
0.010	50	80	0.019	0.017	0.028	0.016	0.021	0.021
0.040	50	60	0.027	0.033	0.036	0.025	0.019	0.030
0.040	50	80	0.021	0.021	0.034	0.022	0.023	0.025


3.2 Maximum and average sample sizes

We compared the MSS for the fixed-sample design (FSD), SCPRT, Pocock and OBF designs with the Δ-statistic for binormal data. We used power 0.8 and type I error rate α = 0.05 for the sample size calculation. Suppose the binormal distribution of the test outcomes is given by (X₁, X₂) ~ N{(μ₁, μ₂), Σ}, (Y₁, Y₂) ~ N{(0, 0), Σ}, where the covariance matrix Σ had common variances 1 and covariances 0.5. We let λ = 1. Since the distributions were known, we were able to obtain the exact variance of Δ̂ from the results in Equation (5). We obtained the MSS for three-group SCPRT, Pocock and OBF designs. Let Δ be the difference between AUCs, or partial AUCs (pAUCs). The MSS for comparing AUCs and pAUCs are presented in Table 3. The table indicates that our SCPRT design has the same MSS as the FSD and smaller MSS than the Pocock and OBF designs.

Table 3.

Maximum possible number of subjects in both arms for comparing AUCs or pAUCs.

	Comparing AUCs
	Ω₂ = 0.7				Ω₂ = 0.75				Ω₂ = 0.8
	Δ= 0.05	0.1	0.15	0.2	0.05	0.1	0.15	0.2	0.05	0.1	0.15	0.2
FSD	832	195	81	44	710	163	68	38	568	128	55	43
SCPRT	832	195	81	44	710	163	68	38	568	128	55	43
Pocock	970	227	94	51	828	190	79	44	662	149	64	50
OBF	847	199	83	45	723	166	70	39	578	131	56	45
	Comparing pAUCs
	Ω₂ = 0.3				Ω₂ = 0.35				Ω₂ = 0.4
	Δ = 0.05	0.1	0.15	0.2	0.05	0.1	0.15	0.2	0.05	0.1	0.15	0.2

FSD	655	156	65	34	604	140	57	31	513	116	48	36
SCPRT	655	156	65	34	604	140	57	31	513	116	48	36
Pocock	764	182	76	40	704	163	67	36	598	135	56	42
OBF	667	159	66	35	614	143	59	32	522	118	50	37


We also conducted a simulation study to compare the average sample number (ASN) and simulated power of the proposed method with those of the Pocock and OBF methods. The same model configuration used for calculating the MSS in the preceding paragraph was used to simulate data. For the comparison of AUCs, we selected μ₁ = 0.7500 and μ₂ = 0.9655 to have Ω₁ = 0.75, Ω₂ = 0.7. For the comparison of pAUCs, we selected μ₁ = 0.6142 and μ₂ = 0.8653 to have Ω₁ = 0.35, Ω₂ = 0.3. We simulated 1000 data sets for each setting, and counted the number of rejections. Dividing this number by 1000 gives the simulated power. The simulated powers and ASNs are presented in Table 4. The results indicate that the proposed SCPRT method has the smallest MSS and maintains the nominal power well. The SCPRT method tends to have larger ASNs than the Pocock and OBF methods when the discordance probability p_d takes smaller values such as 0.001 or 0.01. When p_d = 0.04, its ASN is the lowest of the three methods for the AUC comparison (670.54) and close to the Pocock ASN for the pAUC comparison (524.70 versus 517.89).

Table 4.

Simulated powers and ASNs.

	Comparing AUCs			Comparing pAUCs
	Ω₁ = 0.75, Ω₂ =0.70			Ω₁ = 0.35, Ω₂ = 0.30
	MSS	Power	ASN	MSS	Power	ASN
Pocock	970	79.2%	680.08	764	80.3%	517.89
OBF	847	81.1%	710.28	667	81.2%	557.09
SCPRT (p_d = 0.001)	832	80.6%	793.13	655	80.5%	630.30
SCPRT (p_d = 0.010)	832	80.7%	750.04	655	80.5%	592.90
SCPRT (p_d = 0.040)	832	81.1%	670.54	655	81.6%	524.70


Note: MSS, maximum sample size.

4. A real example

The aforementioned aortic dissection diagnostic trial recruited 45 patients with and 69 without a dissection identified by surgery [17]. Their MRI images were presented to radiologists to be rated on a five-point scale according to the radiologists’ confidence in the presence of a dissection: 1 indicated “definitely no aortic dissection”, 2 indicated “probably no aortic dissection”, 3 indicated “unsure about aortic dissection”, 4 indicated “probably aortic dissection”, and 5 indicated “definitely aortic dissection”.

Since both the Δ measure and its variance are estimated nonparametrically using rank statistics, the proposed method can be used for the categorical scores in the example. We applied our method to the results from radiologist 1 to sequentially compare the diagnostic accuracy of SE-MRI and CINE-MRI, and to decide whether the diagnostic trial could have been stopped earlier. Let Ω₁ and Ω₂ be the AUCs for SE-MRI and CINE-MRI, respectively. Because of the relatively small sample sizes, we applied our method with two looks. Let Δ be the AUC difference. The hypothesis in the study is H₀: Δ = 0 versus H_a: Δ > 0. We assume the original sample sizes were determined by the authors to maintain the type I error rate, α = 0.05, and to achieve their desired power, 1 − β. Our sequential SCPRT test needs the same MSS as the original fixed sample test. At the first stage of our analysis, we randomly chose roughly half of the observations, from 25 patients with and 35 without dissection, to estimate the modified Δ-statistic, B₁. The information time t₁ is approximately 0.5. Using results from Table 1 in [20], the lower and upper stopping boundaries a₁ = −0.5611 and b₁ = 2.5211 are calculated from the maximum discordance probability 0.001. We then calculated B₁ and compared it with the boundaries. From the data at the first stage, we obtained the AUC estimates Ω̂₁₁ = 0.6051 for SE-MRI and Ω̂₁₂ = 0.7125 for CINE-MRI. The estimated variance of Δ̂₁ = Ω̂₁₁ − Ω̂₁₂ is v̂_Δ = 0.00735. Thus, it follows that B₁ = −0.8852. Since B₁ < a₁, the trial could have been stopped early to accept H₀, that is, to conclude that SE-MRI is not more accurate than CINE-MRI in detecting thoracic aortic dissections. Scanning the remaining 25 patients with and 34 patients without the condition could have been unnecessary.
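The arithmetic behind the first-look decision can be retraced with a short Python sketch. It assumes the definition B₁ = √t₁ · Δ̂₁/√v̂_Δ with t₁ taken as exactly 0.5; the variable names are hypothetical, and all numerical inputs are the values reported above.

```python
import math

t1 = 0.5                        # information time at the first look (approximate)
auc_test1, auc_test2 = 0.6051, 0.7125   # reported first-stage AUC estimates
var_delta = 0.00735             # reported variance estimate of the AUC difference
a1, b1_upper = -0.5611, 2.5211  # reported lower and upper boundaries

delta_1 = auc_test1 - auc_test2
B1 = math.sqrt(t1) * delta_1 / math.sqrt(var_delta)

print(round(B1, 3))   # -0.886
print(B1 <= a1)       # True: stop and accept H0
```

The value obtained this way, about −0.886, is close to the reported B₁ = −0.8852; the small gap is presumably due to rounding of the inputs and to t₁ being only approximately 0.5.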

5. Discussion

In this paper, we have proposed a sequential SCPRT test based on a Δ-statistic for comparing the diagnostic accuracy of two tests. Our method is applicable to most diagnostic trials in which patients’ disease status and test results are immediately available at interim analysis. We applied the newly proposed method to a real MRI example for diagnosing aortic dissection and showed that it allows early stopping of the trial while keeping the discordance probability negligible (0.001). Our sequential test addresses both ethical and cost concerns, which frequently arise in diagnostic imaging trials. In the example we analyzed, the MRI machine generates loud noise during its operation. It is ethical to stop a trial earlier should significant evidence against H₀ be found. MRI typically costs thousands of dollars per subject. Stopping a trial earlier frees resources for better use, especially when the stopping rule is such that the significant result would be sustained (with probability 0.999) should the trial continue to the planned end.

When we applied the proposed SCPRT design to the aortic dissection example, we only considered ratings from one radiologist, although several radiologists provided image ratings in the example. The effect of radiologists complicates the design of diagnostic trials. A topic for further research is the development of new test statistics for designing sequential diagnostic trials that allow for more than one radiologist and have approximately independent incremental variance structures. In addition, the proposed method may be applied to the comparison of more than two tests when the statistic of interest is a linear contrast of several estimated wAUCs. Such a contrast of AUCs has been considered in [4]. Furthermore, an adaptive procedure similar to the Jennison and Turnbull [9] t-test boundaries is available for the SCPRT [20].

Acknowledgments

The authors thank three anonymous referees for their constructive comments and useful suggestions.

References

1. Bitar R, Moody AR, Leung G, Kiss A, Gladstone D, Sahlas DJ, Maggisano R. In vivo identification of complicated upper thoracic aorta and arch vessel plaque by MR direct thrombus imaging in patients investigated for cerebrovascular disease. Am J Roentgenol. 2006;187:228–234. doi: 10.2214/AJR.05.1556.
2. Choi SC, Pepple PA. Monitoring clinical trials based on predictive probability of significance. Biometrics. 1989;45:317–323.
3. Choi SC, Smith PJ, Becker DP. Early decision in clinical-trials when the treatment differences are small – experience of a controlled trial in head trauma. Control Clin Trials. 1985;6:280–288. doi: 10.1016/0197-2456(85)90104-7.
4. DeLong ER, DeLong D, Clarke-Pearson D. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics. 1988;44:837–845.
5. Freidlin B, Korn EL, George SL. Data monitoring committees and interim monitoring guidelines. Control Clin Trials. 1999;20:395–407. doi: 10.1016/s0197-2456(99)00017-3.
6. Gumbel EJ. Bivariate exponential distributions. J Am Stat Assoc. 1960;55:698–707.
7. Hanley J, McNeil B. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36. doi: 10.1148/radiology.143.1.7063747.
8. Jennison C, Turnbull B. Group Sequential Methods. Chapman and Hall; New York: 2000.
9. Jennison C, Turnbull B. On group sequential tests for data in unequally sized groups and with unknown variance. J Statist Plan Inference. 2001;96:263–288.
10. Kidwell CS, Chalela JA, Saver JL. Comparison of MRI and CT for detection of acute intracerebral hemorrhage. J Am Med Assoc. 2004;292:1823–1830. doi: 10.1001/jama.292.15.1823.
11. Liu A, Wu C, Schisterman EF. Nonparametric sequential evaluation of diagnostic biomarkers. Stat Med. 2008;27:1667–1678. doi: 10.1002/sim.3203.
12. Nienaber C, Kodolitsch Y, Kodolitschvon Y, Siglow V, Piepho A, Jaup T, Nicolas V, Weber P, Triebel H, Bleifeld W. The diagnosis of thoracic aortic dissection by noninvasive imaging procedures. N Engl J Med. 1993;328:1–9. doi: 10.1056/NEJM199301073280101.
13. Shiga T, Wajima Z, Apfel CC, Inoue T, Ohe Y. Diagnostic accuracy of transesophageal echocardiography, helical computed tomography, and magnetic resonance imaging for suspected thoracic aortic dissection: Systematic review and meta-analysis. Arch Intern Med. 2006;166:1350–1356. doi: 10.1001/archinte.166.13.1350.
14. Spiegelhalter DJ, Freedman LS, Blackburn PR. Monitoring clinical-trials – conditional or predictive power. Control Clin Trials. 1986;7:8–17. doi: 10.1016/0197-2456(86)90003-6.
15. Tan M, Xiong X, Kutner MH. Clinical trial designs based on sequential conditional probability ratio tests and reverse stochastic curtailing. Biometrics. 1998;54:682–695.
16. Tang L, Emerson SS, Zhou XH. Nonparametric and semiparametric group sequential methods for comparing accuracy of diagnostic tests. Biometrics. 2008;64:1137–1145. doi: 10.1111/j.1541-0420.2008.01000.x.
17. Van Dyke C, White R, Obuchowski N, Geisinger MA, Lorig RJ, Meziane MA. Cine MRI in the Diagnosis of Thoracic Aortic Dissection. 79th RSNA Meetings; Chicago, IL. 1993.
18. Wieand S, Gail MH, James BR, James KL. A family of nonparametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika. 1989;76:585–592.
19. Xiong X. A class of sequential conditional probability ratio tests. J Am Stat Assoc. 1995;90:1463–1473.
20. Xiong X, Tan M, Boyett J. Sequential conditional probability ratio tests for normalized test statistic on information time. Biometrics. 2003;59:624–631. doi: 10.1111/1541-0420.00072.
21. Yoshida S, Akiba H, Tamakawa M, Yama N, Hareyama M, Morishita K, Abe T. Thoracic involvement of type A aortic dissection and intramural hematoma: Diagnostic accuracy–comparison of emergency helical CT and surgical findings. Radiology. 2003;228:430–435. doi: 10.1148/radiol.2282012162.
22. Zhou XH, McClish DK, Obuchowski N. Statistical Methods in Diagnostic Medicine. Wiley; New York: 2002.
23. Zhou XH, Li SM, Gatsonis CA. Wilcoxon-based group sequential designs for comparison of areas under two correlated ROC curves. Stat Med. 2008;27:213–223. doi: 10.1002/sim.2856.

