Abstract
We propose efficient nonparametric statistics to compare medical imaging modalities in multi-reader multi-test data and to compare markers in longitudinal ROC data. The proposed methods are based on the weighted area under the ROC curve, which includes the area under the curve and the partial area under the curve as special cases. The methods maximize the local power for detecting the difference between imaging modalities. The asymptotic results of the proposed methods are developed under a complex correlation structure. Our simulation studies show that the proposed statistics achieve substantially higher power than existing statistics. We applied the proposed statistics to an endometriosis diagnosis study.
Keywords: ROC curve, Optimal weights, Wilcoxon statistics, Correlated data
1. Introduction
In medical imaging studies, one is concerned about whether a newly developed imaging modality is more accurate than traditional modalities to correctly discriminate a subject with abnormal lesions from a subject without such lesions. Imaging modalities are considered as an example of diagnostic markers, which are used to distinguish a subject with a particular condition (“the diseased”) from a subject without the condition (“the non-diseased”). For diagnostic markers that generate binary test results, their accuracy can be summarized in terms of sensitivity (probability of identifying a diseased subject when the disease truly exists) and specificity (probability of correctly ruling out a non-diseased subject when the disease is truly absent). For diagnostic markers that generate discrete or continuous test results, the receiver operating characteristic (ROC) curve is a standard statistical tool to describe and compare the accuracy of markers [1]. The ROC curve combines all possible pairs of sensitivities and 1–specificities from different decision thresholds and thus describes the accuracy of markers apart from decision thresholds.
For correlated results from two diagnostic markers, parametric and nonparametric methods have been proposed to compare ROC summary measures. Parametric methods for the area under the curve (AUC) assume distributions (e.g. negative exponential, normal, lognormal, gamma) on marker measurements [2, 3]. These methods may not perform well if the parametric assumptions are invalid. A semiparametric ROC estimator based on logistic regression is proposed by [4]. As an alternative, nonparametric methods do not require distributional assumptions and are robust to model misspecification. Nonparametric methods to estimate and compare two AUCs have been proposed by [5], [6], and others. These methods are based on results for U-statistics because an empirical AUC statistic is essentially a Wilcoxon rank sum statistic [7]. However, if two ROC curves intersect, their AUCs may be equal and then do not provide valid information for the comparison. Moreover, summarizing the entire ROC curve may include irrelevant information about the marker’s accuracy when one is only interested in some range of specificities. For example, acceptable specificities are high for early cancer detection tests. The partial area under the curve (pAUC), which summarizes part of the ROC curve in the range of desired specificities, may be a better alternative. Nonparametric methods to compare pAUCs are proposed by [8]. Utilizing the pAUCs is particularly important in comparing markers which are developed to screen a large population for certain diseases, for example, breast cancer [9]. A lower specificity for a large population leads to many more falsely classified non-diseased subjects who may have to undergo a more invasive test subsequently. It is thus desirable to compare screening markers at a higher range of specificities.
In this paper we propose efficient nonparametric ROC statistics to analyze multi-reader multi-test ROC data and to nonparametrically summarize correlated longitudinal ROC data. The proposed method not only includes many nonparametric ROC summary measures as special cases, but also maximizes the local power for detecting the difference between markers. The rest of the article is organized as follows. In Section 2 we introduce the new statistics for multi-reader multi-test ROC data and longitudinal ROC data, and discuss the equivalence between our statistics and the generalized Wilcoxon statistics under specific assumptions. Section 3 gives the variance expressions for the proposed statistics. Section 4 reports simulation results to illustrate the small sample performance of the proposed ROC statistics and their theoretical variances. Section 5 applies the proposed method to a real example on the diagnosis of endometriosis. Section 6 gives some discussion.
2. Methods
2.1. Definition of nonparametric ROC summary statistics
We first define some notation. Suppose test result Xℓip of marker ℓ is from the pth abnormal location in the diseased subject i, where ℓ = 1, …, L, p = 0, 1, …, mℓi, and i = 1, …, M. Test result Yℓjq of marker ℓ is from the qth normal location in the non-diseased subject j, where ℓ = 1, …, L, q = 0, 1, …, nℓj, and j = 1, …, J. Here the total number of subjects is N = M + J. The joint pairwise survival function of (Xℓ1ip1, Xℓ2ip2) is taken to be SD,ℓ1,ℓ2(x1, x2), p1, p2 = 1, …, mℓi, with marginal survival functions Xℓip ~ SD,ℓ(x). Similarly we define (Yℓ1jq1, Yℓ2jq2) ~ SD̄,ℓ1,ℓ2(y1, y2), q1, q2 = 1, …, nℓj, with marginal survival functions Yℓjq ~ SD̄,ℓ(y). The ROC curve for the ℓth marker is then given by $\mathrm{ROC}_\ell(u) = S_{D,\ell}\{S_{\bar D,\ell}^{-1}(u)\}$, where the false positive rate (FPR) u is in [0, 1]. The resulting ℓth weighted area under the curve (wAUC) is
$$\Omega_\ell = \int_0^1 \mathrm{ROC}_\ell(u)\, dW(u) \tag{1}$$
with a probability measure W(u) defined on u, for u ∈ [0, 1]. Included in this class of accuracy measures are AUC, pAUC between FPRs u1 and u2, and the sensitivity at a given level of FPR u0. W(u) can also be defined as certain distribution functions, such as the beta cdf, to assign varying weight to the specificity. The detailed discussion is in [10].
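As a worked illustration, assume the wAUC has the form $\Omega_\ell = \int_0^1 \mathrm{ROC}_\ell(u)\,dW(u)$ with W a probability measure on [0, 1]. Under that assumption, the three special cases named above would correspond to the following choices of W. The symbol $\delta_{u_0}$ for a point mass is introduced here for illustration only.

```latex
\begin{align*}
W(u) = u,\ 0 \le u \le 1
  &\;\Longrightarrow\; \Omega_\ell = \int_0^1 \mathrm{ROC}_\ell(u)\,du
  && \text{(AUC)},\\
W(u) = \frac{u - u_1}{u_2 - u_1},\ u_1 \le u \le u_2
  &\;\Longrightarrow\; \Omega_\ell = \frac{1}{u_2 - u_1}\int_{u_1}^{u_2} \mathrm{ROC}_\ell(u)\,du
  && \text{(normalized pAUC)},\\
W = \delta_{u_0}
  &\;\Longrightarrow\; \Omega_\ell = \mathrm{ROC}_\ell(u_0)
  && \text{(sensitivity at FPR } u_0\text{)}.
\end{align*}
```

Because W is a probability measure, the pAUC obtained this way is divided by the width u2 − u1 of the FPR range, so it lies in [0, 1] like the AUC.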
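By substituting the functions SD,ℓ and SD̄,ℓ with their respective empirical functions $\hat S_{D,\ell}$ and $\hat S_{\bar D,\ell}$, the nonparametric wAUC estimator is given by $\hat\Omega_\ell = \int_0^1 \hat S_{D,\ell}\{\hat S_{\bar D,\ell}^{-1}(u)\}\, dW(u)$. The empirical survival functions $\hat S_{D,\ell}$ and $\hat S_{\bar D,\ell}$ are defined as
$$\hat S_{D,\ell}(x) = \frac{1}{m_\ell}\sum_{i=1}^{M}\sum_{p} I(X_{\ell ip} > x), \qquad \hat S_{\bar D,\ell}(y) = \frac{1}{n_\ell}\sum_{j=1}^{J}\sum_{q} I(Y_{\ell jq} > y), \tag{2}$$
where the inner sums run over all locations within a subject, and mℓ and nℓ denote the total numbers of abnormal and normal locations for marker ℓ.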
Denote Ω = (Ω1, Ω2, …, ΩL). By substituting the empirical survival functions in Equation (1), the nonparametric estimator of Ω is given by $\hat\Omega = (\hat\Omega_1, \hat\Omega_2, \ldots, \hat\Omega_L)$.
We define W(u) = u for 0 < u < 1 to obtain the nonparametric AUC estimator for the ℓth marker as follows
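$$\hat\Omega_\ell = \frac{1}{m_\ell n_\ell}\sum_{i=1}^{M}\sum_{p}\sum_{j=1}^{J}\sum_{q} I(X_{\ell ip} > Y_{\ell jq}), \tag{3}$$
with mℓnℓ the total number of (abnormal, normal) location pairs for marker ℓ.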
The AUC statistic in (3) takes the form of the Wilcoxon rank-sum statistic. It essentially compares the measurements of abnormal locations with those of normal locations. To calculate this statistic, we obtain every possible pair of measurements from an abnormal location and a normal location. We assign 1 if the abnormal location’s measurement is larger than the normal location’s measurement in the pair, and 0 otherwise. The statistic is then calculated by averaging the 1’s and 0’s over all possible pairs. Since the subject, not the location within each subject, is the unit of sampling, inference based on the regular Wilcoxon rank-sum statistic is not valid here.
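The pairwise calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pooled_auc` and the toy data are hypothetical, each subject is represented as a list of location-level measurements, and ties are scored as 0 as in the description.

```python
def pooled_auc(diseased, non_diseased):
    """Empirical AUC pooling every (abnormal, normal) location pair.

    diseased / non_diseased: lists of per-subject lists of measurements.
    A pair scores 1 when the abnormal measurement is strictly larger.
    """
    xs = [x for subject in diseased for x in subject]
    ys = [y for subject in non_diseased for y in subject]
    wins = sum(1 for x in xs for y in ys if x > y)
    return wins / (len(xs) * len(ys))


# Hypothetical toy data: 3 diseased and 2 non-diseased subjects,
# each with a different number of locations.
diseased = [[2.1, 3.4], [1.8], [4.0, 2.9, 3.3]]
non_diseased = [[1.0, 2.5], [0.7, 1.9, 3.0]]
auc = pooled_auc(diseased, non_diseased)
print(auc)
```

The point estimate pools locations across subjects, but the subject grouping is still needed for the variance: locations within a subject are correlated, which is why the variance expressions of Section 3 are used instead of the standard Wilcoxon variance.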
When W (u) = (u - u1)/(u2 - u1) for 0 < u1 ≤ u ≤ u2 < 1, the wAUC estimator empirically estimates the partial AUC (pAUC), and its explicit form is given by
(4) |
The pAUC statistic in (4) uses all measurements from the abnormal locations. Since the pAUC is specified to be in the range of (u1, u2), only measurements from the normal locations which fall in () are used in (4). That is, we sort all measurements from the normal locations from the smallest to the largest, and obtain the order statistics , and , where [x] denotes the smallest integer greater than or equal to x. We then calculate the Wilcoxon rank-sum like statistic by comparing all X’s with Y’s which are between and . The pAUC statistic is useful in disease screening when a high FPR would lead to a large number of falsely diagnosed subjects. It is desirable to evaluate and compare the marker accuracy at the low FPRs rather than the entire range of FPRs. When we are interested in the sensitivity of the ℓth marker at a particular threshold, say c, we can specify the probability measure to be a point mass at . The estimator then becomes
(5) |
The estimator in (5) is obtained by comparing all X’s with .
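The pAUC and sensitivity estimators described above can be sketched as follows. This is a hedged illustration, not the paper's explicit forms (4) and (5): it assumes the empirical ROC curve is the step function obtained from the convention Ŝ⁻¹(u) = inf{y : Ŝ(y) ≤ u}, which may differ from the paper's order-statistic convention at interval boundaries. The function names and toy data are hypothetical, and measurements are passed as pooled lists.

```python
def empirical_pauc(xs, ys, u1=0.0, u2=1.0):
    """Normalized empirical pAUC over the FPR range [u1, u2].

    xs: pooled measurements from abnormal locations.
    ys: pooled measurements from normal locations.
    With ys sorted in decreasing order, the j-th value (j = 0, 1, ...)
    is the threshold for FPRs in [j/n, (j+1)/n).
    """
    m, n = len(xs), len(ys)
    total = 0.0
    for j, y in enumerate(sorted(ys, reverse=True)):
        lo = max(u1, j / n)
        hi = min(u2, (j + 1) / n)
        if hi > lo:
            sensitivity = sum(1 for x in xs if x > y) / m
            total += sensitivity * (hi - lo)
    return total / (u2 - u1)


def empirical_sensitivity(xs, c):
    """Empirical sensitivity at threshold c: fraction of xs above c."""
    return sum(1 for x in xs if x > c) / len(xs)


# Hypothetical toy data.
xs = [3, 4, 5]
ys = [1, 2, 6, 0]
full_auc = empirical_pauc(xs, ys)            # whole FPR range
low_fpr = empirical_pauc(xs, ys, 0.0, 0.5)   # FPR restricted to [0, 0.5]
sens = empirical_sensitivity(xs, 3.5)
print(full_auc, low_fpr, sens)
```

Only the normal-location measurements whose FPR interval overlaps [u1, u2] contribute, which mirrors the statement that the pAUC statistic uses all X's but only a subset of the Y's.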
In the following sections, we propose efficient nonparametric methods based on the nonparametric estimator of Ω to evaluate and compare multiple markers in multi-reader multi-test ROC data and longitudinal ROC data.
2.2. Multi-reader multi-test ROC data
One type of complex marker data arises frequently in medical imaging studies when radiological images of a patient are evaluated by several radiologists. [11] consider a mixed-effect ANOVA model while allowing for correlation among AUC estimators. Their model requires a specific covariance structure among the AUCs. [12] propose a pseudo-generalized estimating equation method and derive large sample theory for the estimators. Their method remains valid under the working-independence assumption.
In a multi-reader multi-test ROC study, suppose radiologist r, r = 1, …, R, rates images for M diseased subjects and J non-diseased subjects from L imaging devices. A radiologist can give one or more ratings to suspicious locations in each subject, that is, mℓi, nℓj ≥ 1. We consider L = 2. Denote by Ω1, …, ΩR the wAUCs of the R readers for modality 1, and by ΩR+1, …, Ω2R the wAUCs of the R readers for modality 2. Common nonparametric approaches for comparing imaging modalities take the difference Ωr − ΩR+r between the two devices for reader r, and then average these differences over all readers [13]. Such methods are a special case of a linear combination of the weighted AUC statistics for reader-modality combinations. Rather than the simple average of all Ωr − ΩR+r, we propose the following weighted linear combination, which may achieve higher power for comparing markers:
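$$\Omega_m = \sum_{r=1}^{R} w_r\,(\Omega_r - \Omega_{R+r}) \tag{6}$$
with positive and bounded weights W̃ = (w1, …, wR). The parameter Ωm can be empirically estimated by
$$\hat\Omega_m = \sum_{r=1}^{R} w_r\,(\hat\Omega_r - \hat\Omega_{R+r}),$$
which compares two modalities with multiple readers.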
Various choices of weights exist in the ROC literature. W̃ need not depend on the data. For instance, if all readers are assumed to be homogeneous with regard to their accuracy of rating images, an equal weight wr = 1/R can be assigned to reader r, r = 1, …, R. Then with mℓi = nℓj = 1 and W (u) = 1 at 0 < u < 1, the proposed statistic becomes the AUC statistic in [13]. When W̃ has to be estimated from the data, the estimated weights Ŵ must be consistent in probability for the derivation to hold. For instance, a set of optimal weights is introduced by [14] and further developed by [15], who argue that when readers’ experience varies greatly, using equal weights may yield a biased AUC estimate. Let ΣA be the R × R covariance matrix of the estimated AUC differences and Σ̂A its consistent estimator. They then choose weights proportional to $\hat\Sigma_A^{-1}\mathbf{1}$ to obtain a consistent estimator for the AUC difference, where 1 is an R-dimensional vector of ones. [14] and [15] show that this set of weights is optimal in the sense that it maximizes the local power to detect the AUC difference between imaging modalities. Combining these weights with mℓi = nℓj = 1 and W (u) = 1 at 0 < u < 1, the proposed statistic becomes the statistic of [15]. To properly calculate the weights for the proposed statistic, we need the covariance matrix Σ of the estimated wAUCs. Since in practice Ω is unknown, a consistent estimator of Σ can be obtained using the explicit expression (A.1) derived in the Appendix. Since Σ and ΣA are related via $\Sigma_A = A^{\top}\Sigma A$,
where the rth column of the 2R × R matrix A has 1 at the rth row, −1 at the (R + r)th row, and 0 at other rows, so that the rth entry of A′Ω is the difference Ωr − ΩR+r, the estimated weights are given by
(7) |
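The weight calculation can be sketched in pure Python as below. This is an illustration under stated assumptions rather than the paper's Equation (7): it assumes the covariance of the reader-wise differences Ωr − ΩR+r is formed from the 2R × 2R covariance Σ of the wAUC estimates, and that the weights are proportional to the inverse of that covariance applied to a vector of ones. Whether the paper's weights are additionally normalized to sum to one is not stated, so both versions are returned. All function names and the toy covariance are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def difference_covariance(sigma, R):
    """Covariance of the R differences (Omega_r - Omega_{R+r})."""
    return [[sigma[r][s] - sigma[r][R + s] - sigma[R + r][s] + sigma[R + r][R + s]
             for s in range(R)] for r in range(R)]


def optimal_weights(sigma, R):
    """Return (unnormalized, normalized) weights from Sigma_A^{-1} 1."""
    sigma_a = difference_covariance(sigma, R)
    raw = solve(sigma_a, [1.0] * R)
    total = sum(raw)
    return raw, [w / total for w in raw]


# Hypothetical 4 x 4 covariance for R = 2 readers and 2 modalities.
sigma = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 2.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 2.0]]
raw, normalized = optimal_weights(sigma, 2)
print(raw, normalized)
```

In this toy case the second reader's difference has twice the variance of the first, so it receives half the weight. Rescaling the weights by a constant leaves the standardized test statistic unchanged, so the normalization affects only how the estimate is reported.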
2.3. Longitudinal biomarker data
Another example of complex marker data comes from longitudinal studies in which marker measurements are taken at several times during the study. Most methods for longitudinal ROC data rely on assumptions about the distributions of marker measurements [16]. In longitudinal ROC data, suppose L markers are measured on M diseased patients and J non-diseased patients at times t1, t2, …, tK.
Suppose each subject is repeatedly measured for every marker at each time. Let Xℓipk denote the test result of marker ℓ in the pth repetition on the diseased subject i at time tk, where ℓ = 1, …, L, p = 1, …, mℓik, i = 1, …, M, and k = 1, …, K. Let Yℓjqk denote the test result of marker ℓ in the qth repetition on the non-diseased subject j at time tk, where ℓ = 1, …, L, q = 1, …, nℓjk, j = 1, …, J, and k = 1, …, K. The nonparametric wAUC estimator for the ℓth marker is then obtained by substituting the empirical survival functions of the diseased and non-diseased measurements into the wAUC, where the empirical survival functions are defined by
(8) |
By defining W (u) accordingly in the wAUC estimator, we obtain the nonparametric AUC estimator for the ℓth marker:
the partial AUC estimator:
and the sensitivity estimator at the FPR of u0,
We define h to be a real-valued function of . Here the function h is defined on , and has continuous partial derivatives of order 2. Let the ROC summary measure be Δh = h(Ω). Its empirical estimator is given by
$$\hat\Delta_h = h(\hat\Omega) \tag{9}$$
The statistic above can be used to compare two longitudinal markers when h is a linear contrast. It also includes a broad range of ROC statistics. It is the weighted AUC statistic in [17] and later in [10] for evaluating and comparing markers. When W (u) = 1 at 0 < u < 1 and h is a linear function, it is the generalized AUC statistic in [13]. When W (u) = 1 at 0 < u < 1, it is the AUC statistic in [18], which allows for multiple observations per patient from each marker and assumes no correlation between X and Y. When W (u) = (u - a)/(b - a) for 0 < a < u < b < 1 and h(Ω1, Ω2) = Ω1 - Ω2, it is the pAUC statistic in [8] for comparing two markers.
When there are two longitudinal markers in the study, the optimal combination for comparing the two markers can be obtained using steps similar to those for the multi-reader multi-test studies above. Suppose L = 2. Let Ωl,k be the wAUC of marker l, l = 1, 2, at time tk, and let its nonparametric estimator be obtained by substituting into the wAUC the empirical survival functions at time tk, which are defined by
(10) |
Note that the estimation of Ωl,k is based on each individual time point. One can take the difference of the wAUCs of the two markers and simply average these differences over all time points. We may also use the following weighted linear combination, which may achieve higher power for comparing markers:
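$$\Omega_\ell = \sum_{k=1}^{K} w_k\,(\Omega_{1,k} - \Omega_{2,k}) \tag{11}$$
with positive and bounded weights W̃ = (w1, …, wK). The parameter Ωℓ can be empirically estimated by
$$\hat\Omega_\ell = \sum_{k=1}^{K} w_k\,(\hat\Omega_{1,k} - \hat\Omega_{2,k}).$$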
As in the previous section, the 2K × 2K covariance matrix Σ of the estimated wAUCs can be estimated using the explicit expression in (A.1). Thus the estimated weights are given by the same expression as (7).
3. Asymptotic variance expressions of the proposed statistics
In this section we derive the asymptotic variances for the proposed statistics in the multi-reader multi-test data and the longitudinal data. We first show the explicit variance expression for the multi-reader multi-test statistic of Section 2.2, and then show the variance expression for the more general statistic in (9) for the longitudinal data.
The number of abnormal locations may differ across diseased subjects, as may the number of normal locations across non-diseased subjects. Denote by mℓ and nℓ the total numbers of abnormal and normal locations for marker ℓ, respectively. Assume that SD,ℓ and SD̄,ℓ have continuous and positive derivatives. In the Appendix we show that the proposed statistic for the multi-reader multi-test ROC data is asymptotically normal when sample sizes are large. Its variance has the following expression when sample sizes get large:
(12) |
with
and
where I(ℓ1, ℓ2) = 1, if |ℓ2 - ℓ1| < R, and 0, otherwise, and
The marginal and joint survivor functions can also be empirically estimated.
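Denote by mℓ and nℓ the total numbers of measurements of marker ℓ from the diseased and the non-diseased subjects, respectively. We show in the Appendix that the proposed statistic in (9) for the longitudinal data is also asymptotically normal, and that its variance takes the following form when sample sizes are large,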
(13) |
where
and
where
Empirical or other types of smoothed estimators for the marginal and joint survivor functions SD,ℓ, SD̄,ℓ, SD,ℓ1,ℓ2(x1, x2), and SD̄,ℓ1,ℓ2(y1, y2) can be used to estimate vX and vY. In the simulations and the example, we used the empirical estimators. That is, we estimate SD,ℓ and SD̄,ℓ using the expressions in (8), and we estimate SD,ℓ1,ℓ2(x1, x2) and SD̄,ℓ1,ℓ2(y1, y2) as follows:
Thus, when Δ’s are AUCs, vX is given by
and vY is given by
4. Simulation studies
We report simulation studies to evaluate the finite sample property of the proposed statistics. We simulated both multi-reader multi-test ROC data and longitudinal data. In multi-reader multi-test data, we considered the finite sample performance of the variance expression. More importantly, we compared the simulated powers of the equal weight and the optimal weight introduced in Section 2.2. We expect that the optimal weight results in better power than the equal weight. In longitudinal data we considered the general setting where each subject is diagnosed repeatedly at each time point and the number of repeated measures varies from subject to subject.
4.1. Multi-reader multi-test data
In the first simulation study we investigated the finite sample accuracy of the variance expression for multi-reader multi-test data. We let mℓi = nℓj = 1, R = 3, and L = 2. We simulated 1000 datasets under multivariate normal and lognormal distributions:
X ~ N(μX, ΣX) and Y ~ N(μY, ΣY), where μX = (1, …, 1), μY = (0, …, 0) and ΣX = ΣY is the variance-covariance matrix with diagonal elements (1, 1.5, 2, 1, 1.5, 2) and correlation coefficient, ρ;
X ~ LogNormal(μX, ΣX) and Y ~ LogNormal(μY, ΣY).
From the simulated data we used the statistic proposed in Section 2.2 to estimate the AUC, by defining the weight function W (u) = 1 for 0 < u < 1, and the pAUC, by defining W (u) = 1 for 0 < u < 0.6 and 0 otherwise. A 95% confidence interval for the difference between modalities was obtained using the variance expression derived in (13). Table 1 shows biases, root mean squared errors (RMSE), and simulated coverage of the confidence intervals. Biases for comparing AUCs or pAUCs are close to zero. Coverage ranges from about 89% to 96% and is closest to the nominal 95% level when ρ = 0.5. This indicates reasonable performance of our estimator and the associated asymptotic results.
Table 1.
Distribution | ρ | M (J) | AUC: Bias (in %) | AUC: RMSE | AUC: Coverage | pAUC: Bias (in %) | pAUC: RMSE | pAUC: Coverage
---|---|---|---|---|---|---|---|---
Norm | −0.1 | 50 | 8.01E-02 | 0.0359 | 91.94% | 3.17E-02 | 0.0304 | 92.52%
Norm | −0.1 | 100 | 3.43E-02 | 0.0483 | 89.47% | 7.99E-02 | 0.0404 | 91.99%
Norm | −0.1 | 200 | −1.93E-01 | 0.0481 | 92.18% | −1.00E-01 | 0.0396 | 94.40%
Norm | 0.2 | 50 | −8.21E-02 | 0.0258 | 91.66% | −1.01E-01 | 0.0217 | 93.70%
Norm | 0.2 | 100 | 1.31E-01 | 0.0348 | 89.87% | 1.03E-01 | 0.0296 | 91.20%
Norm | 0.2 | 200 | −1.32E-01 | 0.0343 | 92.50% | −1.21E-01 | 0.0297 | 92.60%
Norm | 0.5 | 50 | −6.38E-02 | 0.0175 | 94.12% | −2.01E-02 | 0.0151 | 95.70%
Norm | 0.5 | 100 | −2.78E-02 | 0.0240 | 92.10% | −5.44E-02 | 0.0200 | 93.00%
Norm | 0.5 | 200 | 6.24E-02 | 0.0239 | 94.30% | −7.06E-03 | 0.0209 | 94.10%
LN | −0.1 | 50 | −5.01E-02 | 0.0346 | 91.99% | 1.69E-02 | 0.0354 | 92.29%
LN | −0.1 | 100 | 7.77E-02 | 0.0478 | 89.21% | 5.27E-02 | 0.0488 | 89.38%
LN | −0.1 | 200 | −1.38E-01 | 0.0493 | 91.98% | −8.07E-04 | 0.0464 | 92.59%
LN | 0.2 | 50 | −5.86E-02 | 0.0261 | 91.82% | −4.46E-02 | 0.0250 | 91.42%
LN | 0.2 | 100 | 7.04E-02 | 0.0339 | 90.16% | 7.59E-02 | 0.0352 | 89.39%
LN | 0.2 | 200 | 3.88E-02 | 0.0340 | 92.40% | 4.38E-02 | 0.0345 | 92.70%
LN | 0.5 | 50 | −5.39E-02 | 0.0169 | 94.43% | −3.60E-02 | 0.0172 | 93.93%
LN | 0.5 | 100 | −1.02E-01 | 0.0241 | 93.00% | −8.00E-02 | 0.0234 | 93.20%
LN | 0.5 | 200 | −4.62E-02 | 0.0239 | 94.40% | −5.02E-02 | 0.0243 | 93.80%
Norm denotes the normal distribution; LN denotes the lognormal distribution.
In the second simulation study we compared the performance of the proposed method with the parametric method by [3] and the semiparametric logistic regression method by [4] with regard to estimating the AUC. We used the same setting as the first simulation study except changing μX to (1, 1, 1, 1.5, 2, 2.5). The biases and RMSEs from the three methods are shown in Table 2. The results indicate that the proposed method and the semiparametric method perform much better than the parametric method when the distribution assumptions are violated. They also indicate that the semiparametric method performs as well as the proposed method. This is not surprising as can be seen from the description of the semiparametric method in Section 2 of [4]. The logistic regression fits the regression parameters based on the following equation:
$$\mathrm{logit}\, P(D = 1 \mid Z) = \beta_0 + \beta_1 Z,$$
where D is the disease status (1 for the diseased and 0 for the non-diseased), β0 and β1 are regression parameters, and Z is the test result. After the regression parameter estimators β̂0 and β̂1 are obtained, the empirical ROC curve is estimated based on the new score β̂0 + β̂1Z. Since the ROC curve is invariant to strictly increasing transformations, the empirical ROC curve based on the new score is the same as the empirical ROC curve from the original test results whenever β̂1 > 0.
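The invariance argument can be checked numerically with a small Python sketch. The coefficients `b0` and `b1` below are hypothetical stand-ins for fitted logistic regression estimates (no model is actually fitted here), and the data are toy values. The sketch compares the empirical AUC of the raw test results with the empirical AUC of the fitted-probability score, which is a strictly increasing function of the raw result when `b1` is positive.

```python
import math


def empirical_auc(xs, ys):
    """Proportion of (diseased, non-diseased) pairs with x > y."""
    return sum(1 for x in xs for y in ys if x > y) / (len(xs) * len(ys))


def logistic_score(z, b0, b1):
    """Fitted probability from a logistic model with given coefficients."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * z)))


# Hypothetical toy test results and coefficients (b1 > 0).
xs = [1.2, 0.4, 2.5, 1.9]    # diseased
ys = [0.1, 0.9, -0.3, 1.5]   # non-diseased
b0, b1 = -0.7, 1.3

auc_raw = empirical_auc(xs, ys)
auc_score = empirical_auc([logistic_score(x, b0, b1) for x in xs],
                          [logistic_score(y, b0, b1) for y in ys])
print(auc_raw, auc_score)
```

Because the transformation preserves the ordering of every pair, the two AUCs coincide. With a negative slope the ordering would be reversed and the AUC would become one minus the original value (in the absence of ties).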
Table 2.
Distribution | ρ | M (J) | Proposed: Bias | Proposed: RMSE | Semiparametric: Bias | Semiparametric: RMSE | Parametric: Bias | Parametric: RMSE
---|---|---|---|---|---|---|---|---
Norm | −0.1 | 50 | −0.0140 | 0.0329 | −0.0123 | 0.0318 | −0.0131 | 0.0326
Norm | −0.1 | 100 | −0.0126 | 0.0251 | −0.0144 | 0.0249 | −0.0138 | 0.0246
Norm | −0.1 | 200 | −0.0136 | 0.0202 | −0.0132 | 0.0203 | −0.0135 | 0.0198
Norm | 0.2 | 50 | −0.0149 | 0.0247 | −0.0155 | 0.0440 | −0.0117 | 0.0423
Norm | 0.2 | 100 | −0.0150 | 0.0331 | −0.0139 | 0.0327 | −0.0125 | 0.0317
Norm | 0.2 | 200 | −0.0140 | 0.0451 | −0.0147 | 0.0262 | −0.0136 | 0.0241
Norm | 0.5 | 50 | −0.0133 | 0.0455 | −0.0153 | 0.0456 | −0.0168 | 0.0446
Norm | 0.5 | 100 | −0.0132 | 0.0252 | −0.0130 | 0.0327 | −0.0151 | 0.0330
Norm | 0.5 | 200 | −0.0132 | 0.0333 | −0.0139 | 0.0258 | −0.0121 | 0.0239
LN | −0.1 | 50 | −0.0152 | −0.0158 | −0.0122 | 0.0360 | 0.0689 | 0.0779
LN | −0.1 | 100 | −0.0131 | −0.0129 | −0.0120 | 0.0265 | 0.0758 | 0.0814
LN | −0.1 | 200 | −0.0131 | −0.0145 | −0.0127 | 0.0203 | 0.0799 | 0.0833
LN | 0.2 | 50 | −0.0158 | 0.0446 | −0.0139 | 0.0499 | 0.0706 | 0.0817
LN | 0.2 | 100 | −0.0120 | 0.0232 | −0.0141 | 0.0351 | 0.0754 | 0.0810
LN | 0.2 | 200 | −0.0136 | 0.0327 | −0.0129 | 0.0249 | 0.0807 | 0.0846
LN | 0.5 | 50 | −0.0158 | 0.0460 | −0.0156 | 0.0498 | 0.0705 | 0.0838
LN | 0.5 | 100 | −0.0129 | 0.0255 | −0.0120 | 0.0344 | 0.0791 | 0.0877
LN | 0.5 | 200 | −0.0145 | 0.0343 | −0.0134 | 0.0256 | 0.0826 | 0.0884
Norm denotes the normal distribution; LN denotes the lognormal distribution.
In the third simulation study we compared the simulated powers using the optimal weights versus the equal weights. We again let mℓi = nℓj = 1, R = 3, and L = 2. We simulated 1000 datasets under multivariate normal distributions: X ~ N(μX, ΣX) and Y ~ N(μY, ΣY), where μX = (2, 1, …, 1), μY = (0, …, 0) and ΣX = ΣY is the variance-covariance matrix with diagonal elements (1, 1.5, 2, 2, 3, 2) and correlation coefficient ρ. We selected M = J in (50, 100) and ρ in (−0.1, 0.2, 0.5). For each simulated dataset, we estimated the weighted difference defined in Section 2.2 with both equal weights (wr = 1/3) and the optimal weights given in (7). The AUC was estimated by defining the weight function W (u) = 1 for 0 < u < 1, and the pAUC was estimated by defining W (u) = 1 for 0 < u < 0.6 and 0 otherwise. The simulated power was then calculated as the proportion of rejections among the 1000 simulated datasets. Table 3 shows the simulated powers for the comparison of AUCs and pAUCs. In every setting the optimal weights yield higher power than the equal weights, for example 0.659 versus 0.335 for the AUC with ρ = 0.2 and M = J = 50.
Table 3.
Measure | ρ | Equal weight, M = J = 50 | Equal weight, M = J = 100 | Optimal weight, M = J = 50 | Optimal weight, M = J = 100
---|---|---|---|---|---
AUC | −0.1 | 0.507 | 0.741 | 0.723 | 0.932
AUC | 0.2 | 0.335 | 0.541 | 0.659 | 0.909
AUC | 0.5 | 0.327 | 0.538 | 0.703 | 0.936
pAUC | −0.1 | 0.156 | 0.290 | 0.316 | 0.599
pAUC | 0.2 | 0.141 | 0.212 | 0.280 | 0.584
pAUC | 0.5 | 0.133 | 0.187 | 0.266 | 0.643
4.2. Longitudinal biomarker data
In this simulation study we generated multivariate log-normal correlated biomarker data. We generated data by taking the exponential of multivariate normal data Xi ~ N(μX,i, ΣX,i) and Yj ~ N(0, ΣY,j), where μX,i = (2, …, 2, 1, …, 1), and ΣX,i and ΣY,j are variance-covariance matrices. We let L = 2, K = 3, M = J = (50, 200). To allow various cluster sizes, we let mℓik = 2 for the first half of diseased subjects, and mℓik = 4 for the other half. For non-diseased subjects, we let nℓjk = 5 for the first half, and nℓjk = 3 for the other half. We chose ΣX,i = (1 - ρ)Mi + ρ 1i1i′, where Mi is the LKmℓik × LKmℓik identity matrix and 1i is the LKmℓik × 1 vector with all elements 1. A similar setting was applied to define ΣY,j. Here ρ gives the within-subject correlation. We let ρ = 0.4 for the diseased and ρ = 0.3 for the non-diseased. We simulated 1000 datasets for each sample size, and obtained the estimate of the AUC difference between the two biomarkers and its variance. Table 4 shows biases, root mean squared errors (RMSE), and simulated coverage of confidence intervals. This again shows good performance of our estimator for correlated biomarker data.
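The exchangeable covariance structure used here, (1 − ρ)I + ρ11′, can be generated without a general multivariate normal sampler. The sketch below is an illustration under stated assumptions, not the simulation code used for Table 4: it uses the standard shared-component construction (one subject-level normal draw plus independent noise), assumes 0 ≤ ρ < 1, and assumes that the mean vector lists marker 1's entries before marker 2's. Function names and the seed are hypothetical.

```python
import math
import random


def exchangeable_cov(d, rho):
    """d x d matrix (1 - rho) * I + rho * 1 1': ones on the diagonal, rho elsewhere."""
    return [[1.0 if a == b else rho for b in range(d)] for a in range(d)]


def sample_lognormal_cluster(mu, rho, rng):
    """One subject's measurements: exp of N(mu, (1 - rho) I + rho 1 1').

    A shared N(0, 1) draw scaled by sqrt(rho) induces correlation rho
    between every pair of entries; each entry has unit variance.
    """
    shared = rng.gauss(0.0, 1.0)
    return [math.exp(m + math.sqrt(rho) * shared
                     + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0))
            for m in mu]


# One diseased subject with L = 2 markers, K = 3 times, 2 repetitions:
# 12 entries, the first 6 (assumed to be marker 1) with mean 2 on the log scale.
rng = random.Random(0)
mu = [2.0] * 6 + [1.0] * 6
subject = sample_lognormal_cluster(mu, 0.4, rng)
cov = exchangeable_cov(3, 0.4)
print(cov)
print(len(subject))
```

Repeating the draw for each subject, with the cluster size and ρ varying by group as described above, would produce one simulated dataset.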
Table 4.
Distribution | M (J) | AUC: Bias (in %) | AUC: RMSE | AUC: Coverage | pAUC: Bias (in %) | pAUC: RMSE | pAUC: Coverage
---|---|---|---|---|---|---|---
Norm | 50 | −0.1182 | 1.0266 | 97.40% | 0.0627 | 0.0184 | 97.40%
Norm | 100 | 0.0302 | 2.1682 | 96.60% | 0.0931 | 0.0128 | 96.60%
Norm | 200 | 0.0038 | 1.5226 | 95.80% | 0.0116 | 0.0090 | 96.00%
LN | 50 | −0.0768 | 0.0143 | 97.10% | 0.0097 | 0.0125 | 97.10%
LN | 100 | −0.1126 | 0.0218 | 96.20% | 0.0521 | 0.0093 | 96.80%
LN | 200 | −0.0445 | 0.0109 | 94.90% | 0.0317 | 0.0188 | 95.00%
Norm denotes the normal distribution; LN denotes the lognormal distribution.
5. An example in the diagnosis of endometriosis
The proposed nonparametric ROC summary statistics are applied in this section to data from a study on endometriosis diagnosis. Endometriosis is a gynecological medical condition in which endometrial-like cells appear and flourish in areas outside the uterine cavity, and it is typically seen in women of reproductive age. It has been estimated that endometriosis occurs in roughly 5%–10% of women. Despite its relatively high prevalence, substantive and methodological challenges exist, including diagnostic proficiency. The Physician Reliability Study (PRS), an add-on to the Endometriosis: Natural History, Diagnosis and Outcome (ENDO) Study [19], addressed this issue by investigating whether sequentially added clinical information about a subject can aid in diagnosing endometriosis more accurately. Detailed study designs of ENDO and PRS can be found in the aforementioned references. For demonstration purposes in this paper, we used the review results of 4 physicians (reviewers) in PRS on 150 participants. All 150 participants had recorded operative digital images of their pelvic organs and descriptive drawings and notes, both from the surgeons who conducted the laparoscopies on these women in the ENDO study. The reviewers conducted their reviews and diagnoses under two modalities. Modality one corresponds to the setting where the reviewers are presented with participants’ digital video/images, while modality two corresponds to the setting where both digital video/images and surgeon’s reports (drawings and notes) are presented. For each participant under each modality, the reviewer answered a series of questions on what they observed from the clinical information. These answers were later used to derive the rASRM scores [20], which we used as the diagnostic outcomes in this paper. The visualized diagnoses of these participants from the original ENDO study were used as the gold standard.
For the first modality, the estimated AUCs are (0.71, 0.75, 0.63, 0.76) for the four reviewers; the corresponding values for the second modality are (0.83, 0.85, 0.75, 0.87). With equal weights wr = 1/4, r = 1, …, 4, the variance estimate of the Δ-statistic is 0.0007475. We used (7) to obtain the optimal weights (w1, w2, w3, w4) = (298.08, 401.16, 176.88, 560.48). Using these weights, the variance estimate of the Δ-statistic is 0.0006961, so the Δ-statistic is estimated more precisely with the optimal weights. The two-sided p-value using the optimal weights is 2.36 × 10−5, which is slightly smaller than the p-value of 2.82 × 10−5 using equal weights. Both p-values are close to zero, which indicates that these physicians give more accurate diagnoses of endometriosis when reviewing both digital images and surgeons’ descriptive reports.
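The reported numbers can be related to one another with a short calculation. This is a rough check on rounded inputs, not a reproduction of the analysis: the AUCs are the two-decimal values quoted above, so the resulting differences will not match the paper's Δ-statistics exactly. The interpretation that the reported weights are unnormalized (so that the variance of the normalized combination equals one over their sum) is an inference from the numbers, not something the text states. Differences are taken as modality two minus modality one, which only changes the sign.

```python
# Rounded AUCs reported for the four reviewers.
auc_mod1 = [0.71, 0.75, 0.63, 0.76]
auc_mod2 = [0.83, 0.85, 0.75, 0.87]
weights = [298.08, 401.16, 176.88, 560.48]  # reported optimal weights

# Reader-wise differences (modality 2 minus modality 1).
diffs = [b - a for a, b in zip(auc_mod1, auc_mod2)]

# Equal-weight combination: the simple average of the differences.
equal_weight_delta = sum(diffs) / len(diffs)

# Optimal-weight combination after rescaling the weights to sum to one.
normalized = [w / sum(weights) for w in weights]
optimal_delta = sum(w * d for w, d in zip(normalized, diffs))

# If the weights equal Sigma_A^{-1} 1 (assumption), the variance of the
# normalized combination is 1 / (1' Sigma_A^{-1} 1) = 1 / sum(weights).
implied_variance = 1.0 / sum(weights)

print(equal_weight_delta, optimal_delta, implied_variance)
```

The implied variance agrees with the reported 0.0006961 to the digits shown, which is consistent with the unnormalized-weight reading. The fourth reviewer receives the largest weight and the third the smallest.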
6. Discussion
The proposed methods in the paper are nonparametric and can be applied to evaluate and compare diagnostic markers in the multireader multitest data and the longitudinal data. As illustrated in the simulation studies and the example, the proposed weighted method in the multireader multitest data tends to have a larger power than the existing methods. We also conducted simulation studies to investigate the finite sample performance of the proposed method in the longitudinal data setting. More complex correlated data in which both normal and abnormal locations may occur in the same subject have been considered in [21] and [22]. How to extend the proposed statistics to such a data setting is a future research topic.
As pointed out by a reviewer, the proposed method is based on the empirical distribution estimators, and may not accommodate more complicated dependencies among observations in longitudinal data. For example, under autoregressive dependence, empirical estimators may fail to converge to the target probabilities, especially when autoregression coefficients are greater than one. More research is merited to extend the proposed method in this direction.
Acknowledgement
The authors would like to thank an associate editor and two referees for their constructive comments and suggestions. The project described here was supported in part by Award Number R15CA150698 from the National Cancer Institute under the American Recovery and Reinvestment Act of 2009 and by Award Number H98230-11-1-0196 from the National Security Agency. The work was also supported in part with funding from the American Chemistry Council and the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health.
Appendix: Derivation of variance expression of Δh
Assume that SD,ℓ and SD̄,ℓ have continuous and positive derivatives, , and . Suppose that M/mℓ → αℓ, M/nℓ → βℓ, M/J → λ, , and , as M, J → ∞. Assume that αℓ, βℓ, and are finite numbers. In addition, assume that the function h has continuous partial derivatives of order 2 at each point of an open set (Ω − ε, Ω + ε), for ε > 0.
where and are the first derivatives of SD,ℓ and SD̄,ℓ, respectively.
The asymptotic normality of is derived using results from [18], which gives that for markers 1, … L,
where and are limiting Gaussian processes. Therefore, after some calculation, it follows that
(A.1) |
where the {ℓ1, ℓ2} element in Σ1 is given by
(A.2) |
and the {ℓ1, ℓ2} element in Σ2 is
(A.3) |
The Taylor expansion of at Ω gives
(A.4) |
where ∇h(Ω) is the gradient of h evaluated at Ω. The asymptotic variance of the right-hand side in (A.4) is given by
It follows that
(A.5) |
Using the covariance structures in (A.2) and (A.3) in (A.5), we can then obtain the asymptotic normality of by combining (A.1) with the Cramer-Wold device [23].
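The Taylor-expansion step amounts to the delta method: the variance of h evaluated at the estimated wAUCs is approximated by the quadratic form of the gradient of h with the covariance of the estimates. The sketch below is a generic illustration with hypothetical names and toy numbers. It assumes the supplied covariance matrix is already on the scale of the estimator (that is, already divided by the relevant sample size), and it uses a numerical gradient in place of an analytic one.

```python
def numerical_gradient(h, omega, eps=1e-6):
    """Central-difference gradient of h at the point omega."""
    grad = []
    for i in range(len(omega)):
        up = list(omega)
        down = list(omega)
        up[i] += eps
        down[i] -= eps
        grad.append((h(up) - h(down)) / (2 * eps))
    return grad


def delta_method_variance(h, omega_hat, sigma_hat):
    """Approximate Var{h(omega_hat)} by grad' * sigma_hat * grad."""
    g = numerical_gradient(h, omega_hat)
    d = len(g)
    return sum(g[a] * sigma_hat[a][b] * g[b]
               for a in range(d) for b in range(d))


# Hypothetical example: difference of two wAUCs, h(w) = w1 - w2.
omega_hat = [0.80, 0.70]
sigma_hat = [[0.0004, 0.0001],
             [0.0001, 0.0009]]
variance = delta_method_variance(lambda w: w[0] - w[1], omega_hat, sigma_hat)
print(variance)
```

For the linear contrast used here the gradient is (1, −1), so the result reduces to the familiar variance of a difference: the two variances minus twice the covariance.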
References
- 1. Zhou XH, McClish DK, Obuchowski N. Statistical Methods in Diagnostic Medicine. Wiley; New York: 2002.
- 2. Zou K. Comparison of correlated receiver operating characteristic curves derived from repeated diagnostic test data. Academic Radiology. 2001;8(3):225–233. doi: 10.1016/S1076-6332(03)80531-7.
- 3. Molodianovitch K, Faraggi D, Reiser B. Comparing the areas under two correlated ROC curves: parametric and non-parametric approaches. Biometrical Journal. 2006;48:745–757. doi: 10.1002/bimj.200610223.
- 4. Copas JB, Corbett P. Overestimation of the receiver operating characteristic curve for logistic regression. Biometrika. 2002;89(2):315–331.
- 5. DeLong ER, DeLong D, Clarke-Pearson D. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics. 1988;44:837–845.
- 6. Obuchowski NA. Nonparametric analysis of clustered ROC curve data. Biometrics. 1997;53:567–578.
- 7. Bamber D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology. 1975;12:387–415.
- 8. Zhang D, Zhou X, Freeman D, Freeman J. A non-parametric method for the comparison of partial areas under ROC curves and its application to large health care data sets. Statistics in Medicine. 2002;21(5):701–715. doi: 10.1002/sim.1011.
- 9. Baker S, Pinsky P. A proposed design and analysis for comparing digital and analog mammography: special receiver operating characteristic methods for cancer screening. Journal of the American Statistical Association. 2001;96:421–428.
- 10. Li J, Fine JP. Weighted area under the receiver operating characteristic curve and its application to gene selection. Journal of the Royal Statistical Society: Series C (Applied Statistics). 2010;59(4):673–692. doi: 10.1111/j.1467-9876.2010.00713.x.
- 11. Obuchowski N, Rockette H. Hypothesis testing of diagnostic accuracy for multiple readers and multiple tests: an ANOVA approach with dependent observations. Communications in Statistics-Theory and Methods. 1995;24(2):285–308.
- 12. Song X, Zhou XH. A marginal model approach for analysis of multi-reader multi-test receiver operating characteristic (ROC) data. Biostatistics. 2005;6(2):303–312. doi: 10.1093/biostatistics/kxi011.
- 13.Lee MLT, Rosner BA. The average area under correlated receiver operating characteristic curves: A nonparametric approach based on generalized two-sample wilcoxon statistics. Applied Statistics. 2001;50(3):337–344. [Google Scholar]
- 14.Wei LJ, Johnson WE. Combining dependent tests with incomplete repeated measurements. Biometrika. 1985;72(2):359–364. [Google Scholar]
- 15.Yang Y, Jin Z. Combining dependent tests to compare the diagnostic accuracies: non-parametric approach. Statistics in Medicine. 2006;25(7):1239–1250. doi: 10.1002/sim.2338. [DOI] [PubMed] [Google Scholar]
- 16.Etzioni R, Pepe M, Longton G, Hu C, Goodman G. Incorporating the time dimension in receiver operating characteristic curves: A case study of prostate cancer. Medical Decision Making. 1999;19:242–251. doi: 10.1177/0272989X9901900303. [DOI] [PubMed] [Google Scholar]
- 17.Wieand S, Gail MH, James BR, James KL. A family of nonparametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika. 1989;76:585–592. [Google Scholar]
- 18.Li G, Zhou K. A unified approach to nonparametric comparison of receiver operating characteristic curves for longitudinal and clustered data. Journal of the American Statistical Association. 2008;103:705–713. doi: 10.1198/016214508000000364. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Buck Louis GM, Hediger ML, Peterson CM, Croughan M, Sundaram R, Stanford J, Chen Z, Fujimoto VY, Varner MW, Trumble A, et al. Incidence of endometriosis by study population and diagnostic method: the endo study. Fertility and sterility. 2011;96:360–365. doi: 10.1016/j.fertnstert.2011.05.087. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.American Society For Reproductive Medicine Revised american society for reproductive medicine classification of endometriosis: 1996. Fertility and Sterility. 1997;67:817–821. doi: 10.1016/s0015-0282(97)81391-x. [DOI] [PubMed] [Google Scholar]
- 21.Werner C, Brunner E. Rank methods for the analysis of clustered data in diagnostic trials. Computational Statistics & Data Analysis. 2007;51(10):5041–5054. [Google Scholar]
- 22.Konietschke F, Brunner E. Nonparametric analysis of clustered data in diagnostic trials: Estimation problems in small sample sizes. Computational Statistics & Data Analysis. 2009;53(3):730–741. [Google Scholar]
- 23.Serfling RJ. Approximation theorems of mathematical statistics. Wiley; New York: 1980. [Google Scholar]