Summary
The receiver operating characteristic (ROC) curve is a popular tool for describing and comparing the diagnostic accuracy of biomarkers when a binary-scale gold standard is available. However, there are many diagnostic tests whose gold standards are continuous, and several extensions of the ROC curve have been proposed to evaluate the diagnostic potential of biomarkers when the gold standard is continuous-scale. Moreover, in evaluating these biomarkers, it is often necessary to consider the effects of covariates on the diagnostic accuracy of the biomarker of interest. Covariates may include subject characteristics, the expertise of the test operator, test procedures, or aspects of specimen handling. Adjusting for covariates when the gold standard is continuous is challenging and has not been addressed in the literature. To fill this gap, we propose two general testing frameworks that account for the effect of covariates on diagnostic accuracy. Simulation studies are conducted to compare the proposed tests. Data from a study that assessed three types of imaging modalities for detecting neoplastic colon polyps and cancers are used to illustrate the proposed methods.
Keywords: biomarkers, covariate adjustment, diagnostic accuracy, regression analysis
1 |. INTRODUCTION
In medical diagnostics, the receiver operating characteristic (ROC) curve is a useful tool for assessing a biomarker's accuracy and comparing the accuracies of competing biomarkers. In traditional ROC analysis, a subject is diagnosed as diseased or nondiseased based on the subject's level of the biomarker. The diagnostic accuracy of the biomarker depends on the existence of a binary-scale gold standard, usually obtained from clinical follow-up, surgical verification, autopsy, and so on. Sensitivity and specificity at various cutpoints of the biomarker's test results summarize the diagnostic accuracy of the biomarker.
In practice, however, there are many examples of gold standards whose outcomes are not on a binary scale, but rather on a continuous scale. For example, glycosylated hemoglobin is usually used as a primary index of diabetic control, and it is originally measured as a continuous-valued variable. Setting an inappropriate threshold to dichotomize a continuous gold standard results in an imprecise assessment of the biomarker's diagnostic accuracy and conceals the underlying relationship between the biomarker and the gold standard. Obuchowski1,2 extended the analysis of diagnostic accuracy from the case of a binary gold standard to that of a continuous gold standard, proposing an accuracy index and a nonparametric estimator of the index for a continuous biomarker. Chang3 discussed linearly combining multiple biomarkers to maximize this accuracy index.
It is well known that the performance of a diagnostic test or biomarker may depend on certain characteristics of the patients, environmental factors of the diagnostic test, the setting and type of the test, and so on. When the gold standard is binary-scale, various methods have been developed for regression analysis of ROC curves, and of the summary measures of accuracy derived from them, to account for the effect of covariates. Methods in the literature for modeling the effect of covariates on ROC curves can be loosely divided into two categories, the indirect and the direct approach; see, among many others, References 4 and 5. The indirect approach models the test results as a function of the binary gold standard and covariates and then derives the ROC curve or its summary measures. This method was first proposed by Tosteson and Begg,6 who fitted an ordinal regression model with location and scale parameters to diagnostic test results measured in ordered categories. O'Malley et al7 proposed a Bayesian regression method for continuous test results and then evaluated the ROC curve. Faraggi8 described the biomarker values as functions of covariates, with the assumption that there is no covariate dependence in the diseased population, and then obtained estimates of the area under the ROC curve and the Youden index. The direct approach models the ROC curves or their summary measures on covariates and the binary gold standard. Regression models for various ROC summary measures have been proposed, such as models for the area under the ROC curve based on a conventional analysis of variance (ANOVA).9,10 Pepe11 proposed a regression model for ROC curves directly and provided parameter estimation using GLM procedures. Kim and Zeng5 generalized an accelerated failure time model from survival analysis to ROC analysis, assuming a relationship between covariates and the baseline ROC curve. Nonparametric and semiparametric estimation of the effect of covariates has also been discussed, in which regression models for ROC summary indices or ROC curves are available; see References 12–15, among others. Regression analysis has also been developed when there is ignorable verification bias or a missing or partially missing gold standard.14,16–18
To the best of our knowledge, however, no specific methodology has been developed in the literature to account for the effect of covariates in assessing diagnostic accuracy when the gold standard is continuous. To fill this gap, we develop methodology to directly evaluate the effect of covariates on the accuracy index relating a biomarker and a gold standard that are both measured on a continuous scale. A motivating example is the assessment of the accuracy of two imaging modalities, barium enema (BE) and computed tomographic colonography (CTC), in detecting colon cancer, as compared with colonoscopy (CO), which is considered the gold standard. Due to potential biological differences related to race and family history, our interest is to assess how the diagnostic accuracy varies with the patient's family history and race.
The paper is arranged as follows. In Section 2, we propose two test methods to assess the effect of a common covariate on the diagnostic accuracy of a biomarker against a continuous gold standard. In Section 3, we first present simulation results on type I error and power of the tests for various sample sizes, and then demonstrate the methods using imaging modalities data. Some discussion is provided in Section 4. The proofs of the asymptotic results are relegated to the appendix.
2 |. METHODOLOGY
Let $Y$ denote the gold standard, $X$ the biomarker, and $Z$ the covariate for the $i$th subject ($i = 1, \ldots, n$). We assume that $Y$ and $X$ are continuous and, without loss of generality, that $Z$ is discrete, taking possible values $z_1, \ldots, z_K$. We want to build a model of the diagnostic accuracy on the covariate. Obuchowski1,2 defined the accuracy as the probability that a randomly selected patient with a higher gold standard outcome has a higher biomarker test result than a randomly selected patient with a lower gold standard outcome. Extending this definition, for each level $z_k$ ($k = 1, \ldots, K$), we define the diagnostic accuracy index $\theta_k$ as
$$\theta_k = P(X_i > X_j \mid Y_i > Y_j,\; Z_i = Z_j = z_k). \qquad (1)$$
We estimate $\theta_k$ with Obuchowski's1,2 nonparametric estimator $\hat\theta_k$, which is calculated by averaging a kernel over all possible pairs of subjects at the same covariate level. Thus, at the $k$th level ($k = 1, \ldots, K$), we have an unbiased estimator for $\theta_k$:
$$\hat\theta_k = \frac{2}{n_k(n_k - 1)} \sum_{i < j:\; Z_i = Z_j = z_k} \psi\{(X_i, Y_i), (X_j, Y_j)\}, \qquad (2)$$
where $n_k$ is the number of observations at covariate level $z_k$, the summation is taken over all pairs of observations with covariate level $z_k$, and the kernel function $\psi$ satisfies
$$\psi\{(X_i, Y_i), (X_j, Y_j)\} = \begin{cases} 1, & (X_i - X_j)(Y_i - Y_j) > 0,\\ 0.5, & X_i = X_j,\\ 0, & \text{otherwise}. \end{cases} \qquad (3)$$
For simplicity, we denote $\psi_{ij} = \psi\{(X_i, Y_i), (X_j, Y_j)\}$ and write $\theta = (\theta_1, \ldots, \theta_K)'$ and $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_K)'$.
Our main aim is to investigate whether the accuracy $\theta_k$ depends on the covariate $Z$.
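To make the estimator concrete, here is a minimal sketch in Python (our own illustration, not code from the article; the function name `accuracy_index` and all variable names are ours) that computes $\hat\theta_k$ for the observations at a single covariate level, using a pairwise kernel of the form described above with a 0.5 credit for tied biomarker values.

```python
import numpy as np

def accuracy_index(x, y):
    """Nonparametric accuracy index at one covariate level.

    x : biomarker values, y : continuous gold-standard values.
    Averages the pairwise kernel: 1 if the subject with the larger
    gold-standard value also has the larger biomarker value,
    0.5 if the two biomarker values are tied, 0 otherwise.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if x[i] == x[j]:
                total += 0.5
            elif (x[i] - x[j]) * (y[i] - y[j]) > 0:
                total += 1.0
    return 2.0 * total / (n * (n - 1))
```

Applied separately to the observations at each level $z_k$, this returns the level-specific estimates $\hat\theta_1, \ldots, \hat\theta_K$ used in the remainder of the section.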
2.1 |. Regression analysis
Assume that, under a link function $g$, the accuracy is a linear function of the covariate at each level $z_k$, that is,
$$\theta_k = g(\beta_0 + \beta_1 z_k), \quad k = 1, \ldots, K, \qquad (4)$$
where $g$ is a strictly monotone function with domain $(-\infty, \infty)$ and range $(0, 1)$. In practice, $g$ is usually taken to be the probit link $g(x) = \Phi(x)$ or the logit link $g(x) = e^{x}/(1 + e^{x})$, where $\Phi$ is the standard normal distribution function. The regression coefficients $\beta_0$ and $\beta_1$ can be estimated by applying a weighted linear regression to the transformed data $\{(g^{-1}(\hat\theta_k), z_k),\; k = 1, \ldots, K\}$.
The exact variance of $\hat\theta_k$ given below is derived using the theory developed for general $U$-statistics. For a $U$-statistic with a kernel of degree 2, it can be written in terms of $\xi_{1,k}$ and $\xi_{2,k}$ as

$$\mathrm{Var}(\hat\theta_k) = \frac{2}{n_k(n_k - 1)}\left\{2(n_k - 2)\,\xi_{1,k} + \xi_{2,k}\right\}, \qquad (5)$$

where $\xi_{1,k} = \mathrm{Cov}(\psi_{12}, \psi_{13})$ and $\xi_{2,k} = \mathrm{Var}(\psi_{12})$ are computed at covariate level $z_k$. We use nonparametric unbiased estimators $\hat\xi_{1,k}$ and $\hat\xi_{2,k}$ to estimate these quantities. Substituting $\hat\xi_{1,k}$ and $\hat\xi_{2,k}$ for $\xi_{1,k}$ and $\xi_{2,k}$, we obtain an estimator of the variance of $\hat\theta_k$:

$$\widehat{\mathrm{Var}}(\hat\theta_k) = \frac{2}{n_k(n_k - 1)}\left\{2(n_k - 2)\,\hat\xi_{1,k} + \hat\xi_{2,k}\right\}. \qquad (6)$$
Alternatively, the variance of $\hat\theta_k$ can be estimated via bootstrapping with replacement from the observations at level $z_k$, denoted as $\widehat{\mathrm{Var}}_B(\hat\theta_k)$. Later we present simulation results comparing these two variance estimation approaches in terms of error rate control.
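As an illustration of the bootstrap alternative (again a sketch with our own naming, building on the `accuracy_index` function above and not taken from the article), the variance of $\hat\theta_k$ at one covariate level can be approximated by resampling subjects with replacement and recomputing the index; keeping the 0.5 credit for ties matters here because bootstrap samples repeat subjects (see also Remark 3 below).

```python
import numpy as np

def bootstrap_variance(x, y, n_boot=1000, seed=None):
    """Bootstrap variance of the accuracy index at one covariate level."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample subjects with replacement
        estimates[b] = accuracy_index(x[idx], y[idx])
    return estimates.var(ddof=1)
```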
Using the delta method, the variance of $g^{-1}(\hat\theta_k)$ is found to be

$$\mathrm{Var}\{g^{-1}(\hat\theta_k)\} \approx \frac{\mathrm{Var}(\hat\theta_k)}{\big[g'\{g^{-1}(\theta_k)\}\big]^2}, \qquad (7)$$

and its estimator is

$$\hat v_k = \frac{\widehat{\mathrm{Var}}(\hat\theta_k)}{\big[g'\{g^{-1}(\hat\theta_k)\}\big]^2}, \qquad (8)$$

where $g'$ is the derivative of $g$. In particular, $g'(x) = \phi(x)$ when $g = \Phi$, and $g'(x) = e^{x}/(1 + e^{x})^2$ when $g$ is the logit link, where $\phi$ is the standard normal density.
To investigate whether the accuracy depends on the covariate $Z$, we test the null hypothesis $H_0: \beta_1 = 0$. To this end, we rewrite model (4) as

$$U = W\gamma + \varepsilon, \qquad (9)$$

where $U = \big(g^{-1}(\hat\theta_1), \ldots, g^{-1}(\hat\theta_K)\big)'$, $\gamma = (\beta_0, \beta_1)'$, $W$ is the $K \times 2$ design matrix whose $k$th row is $(1, z_k)$, and $\varepsilon$ is the error vector.
Assume that samples at different covariate levels are independent. Then, asymptotically,

$$U \sim N(W\gamma, V). \qquad (10)$$

Furthermore, the variance-covariance matrix $V$ of the response vector is a diagonal matrix with elements $v_k = \mathrm{Var}\{g^{-1}(\hat\theta_k)\}$ given in (7). Thus the regression coefficients can be estimated by

$$\hat\gamma = (W'V^{-1}W)^{-1}W'V^{-1}U, \qquad (11)$$

with $\mathrm{Var}(\hat\gamma) = (W'V^{-1}W)^{-1}$. Replacing $V$ with its estimate $\hat V = \mathrm{diag}(\hat v_1, \ldots, \hat v_K)$ from (8) gives the feasible estimator of $\gamma$, still denoted by $\hat\gamma = (\hat\beta_0, \hat\beta_1)'$ in the following. Note that $\mathrm{Var}(\hat\beta_1)$ is a diagonal element of $(W'V^{-1}W)^{-1}$. Therefore, the variance of $\hat\beta_1$ can be approximately estimated by the corresponding diagonal element of $(W'\hat V^{-1}W)^{-1}$.
A test statistic can then be defined as

$$T = \frac{\hat\beta_1}{\sqrt{\widehat{\mathrm{Var}}(\hat\beta_1)}}, \qquad (12)$$

which asymptotically follows a standard normal distribution under $H_0$ when the sample sizes $n_k$ are large. The null hypothesis is rejected at the $\alpha$ significance level if $|T| > z_{\alpha/2}$, where $z_{\alpha/2}$ is the upper $\alpha/2$ percentile of the standard normal distribution.
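Putting the pieces together, a possible implementation of the regression-based test might look like the following sketch (our own illustration, not the authors' code; it assumes a probit link and takes the level-specific variance estimates, approximate or bootstrap, as inputs). It probit-transforms the level-specific estimates, applies the delta-method variance (8), fits the weighted linear model by weighted least squares, and returns the statistic (12) with a two-sided p-value.

```python
import numpy as np
from scipy.stats import norm

def regression_test(theta_hat, var_theta_hat, z_levels):
    """Wald test of H0: beta1 = 0 in the model theta_k = Phi(beta0 + beta1 * z_k).

    theta_hat     : estimated accuracy index at each covariate level
    var_theta_hat : estimated variance of each theta_hat (approximate or bootstrap)
    z_levels      : numeric covariate value at each level
    """
    theta_hat = np.asarray(theta_hat, dtype=float)
    var_theta_hat = np.asarray(var_theta_hat, dtype=float)
    z_levels = np.asarray(z_levels, dtype=float)

    u = norm.ppf(theta_hat)                      # probit-transformed responses
    v = var_theta_hat / norm.pdf(u) ** 2         # delta-method variances of the responses
    W = np.column_stack([np.ones_like(z_levels), z_levels])
    V_inv = np.diag(1.0 / v)
    cov = np.linalg.inv(W.T @ V_inv @ W)         # (W' V^{-1} W)^{-1}
    gamma = cov @ W.T @ V_inv @ u                # weighted least squares estimate
    t_stat = gamma[1] / np.sqrt(cov[1, 1])       # slope over its standard error
    p_value = 2 * norm.sf(abs(t_stat))
    return t_stat, p_value
```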
2.2 |. The Hotelling T2 approach
The regression approach in general assumes that the accuracy index and the covariate satisfy a certain known functional relationship. In the regression model developed in the previous section, we assumed that the transformed accuracy under a known monotone link is a linear function of the covariate values, and obtained estimates of the regression coefficients via the weighted least squares method; in the simulations we adopt the probit link. When such a functional relationship is not clear, we may test the more general null hypothesis $H_0: \theta_1 = \cdots = \theta_K$ against the alternative hypothesis $H_1$: not all $\theta_k$s are equal. Equivalently, the test can be written in the form of a union-intersection test,
$$H_0: \bigcap_{1 \le k < l \le K}\{\theta_k = \theta_l\} \quad \text{versus} \quad H_1: \bigcup_{1 \le k < l \le K}\{\theta_k \ne \theta_l\}. \qquad (13)$$
We propose a test statistic analogous to Hotelling's $T^2$ to test the above null hypothesis. Since all the $\hat\theta_k$s are $U$-statistics, they have asymptotically normal distributions and are mutually independent. Therefore, each $\hat\theta_k$ is asymptotically $N(\theta_k, \sigma_k^2)$ with $\sigma_k^2 = \mathrm{Var}(\hat\theta_k)$, or equivalently, $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_K)'$ is asymptotically $N_K(\theta, \Sigma)$, where $\theta = (\theta_1, \ldots, \theta_K)'$ and $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_K^2)$.
Then our Hotelling's $T^2$ statistic is

$$T^2 = (C\hat\theta)'\,(C\hat\Sigma C')^{-1}\,(C\hat\theta), \qquad (14)$$

where $C$ is a $(K - 1) \times K$ matrix of contrasts comparing the $K$ levels (for example, with $k$th row having 1 in position $k$, $-1$ in position $k + 1$, and 0 elsewhere), and the estimate $\hat\Sigma$ of $\Sigma$ is obtained by replacing each $\sigma_k^2$ with its approximate estimator (6) (or with the bootstrap estimator). Under the null hypothesis, $T^2$ asymptotically has a $\chi^2$ distribution with $K - 1$ degrees of freedom. If the observed $T^2$ is greater than $\chi^2_{K-1,\alpha}$, we reject the null hypothesis at the $\alpha$ significance level, where $\chi^2_{K-1,\alpha}$ is the upper $\alpha$ percentile of the $\chi^2$ distribution with $K - 1$ degrees of freedom. Technical details on the asymptotic distribution of $T^2$ are provided in the appendix.
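For completeness, here is a corresponding sketch of the Hotelling-type test (our own illustration under the assumptions above, not the authors' code). Because the level-specific estimates are independent and their covariance matrix is diagonal, the contrast-based quadratic form reduces to a precision-weighted sum of squared deviations from the weighted mean, which is what the sketch computes before referring it to a chi-square distribution with $K - 1$ degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2

def hotelling_test(theta_hat, var_theta_hat):
    """Chi-square test of H0: theta_1 = ... = theta_K for independent estimates."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    w = 1.0 / np.asarray(var_theta_hat, dtype=float)   # precision weights
    theta_bar = np.sum(w * theta_hat) / np.sum(w)      # precision-weighted mean
    t2 = np.sum(w * (theta_hat - theta_bar) ** 2)      # quadratic form
    df = len(theta_hat) - 1
    return t2, chi2.sf(t2, df)
```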
Remark 1. When $K = 2$, the test statistics based on the regression method and on Hotelling's $T^2$ are, respectively,
$$T = \frac{g^{-1}(\hat\theta_2) - g^{-1}(\hat\theta_1)}{\sqrt{\hat v_1 + \hat v_2}} \quad \text{and} \quad T^2 = \frac{(\hat\theta_2 - \hat\theta_1)^2}{\hat\sigma_1^2 + \hat\sigma_2^2}.$$
Due to the monotonicity of $g$, it follows from the delta method that the two methods perform similarly well, especially when $n_1$ and $n_2$ are large; see the simulation results in the next section.
Remark 2. The test based on the regression analysis usually has higher power if the accuracy index increases (or decreases) monotonically as the value of $z_k$ increases, even if the link function is misspecified; otherwise, Hotelling's $T^2$ test performs better.
Remark 3. For the original sample of continuous random variables, the kernel $\psi$ in (2) can be replaced by the indicator function $I\{(X_i - X_j)(Y_i - Y_j) > 0\}$, but the latter will lead to an underestimation of $\theta_k$ under bootstrap sampling, because the repetition of some subjects in each bootstrap sample produces ties that receive no credit from the plain indicator.
3 |. SIMULATION STUDY AND A REAL EXAMPLE
We conducted simulation studies to examine the finite sample performance of the two methods proposed in the previous section, the $T$ test based on the regression analysis and the Hotelling $T^2$ test. For each test, we also compared the two estimates of the variance of $\hat\theta_k$: one is the approximate variance (6) given in Section 2.1, and the other is the bootstrap variance with 1000 repetitions. We compared the type I error rates and powers of the two tests under various alternatives. This section summarizes the simulation results and demonstrates the proposed methods with a real example.
3.1 |. Comparison of regression analysis and Hotelling’s test
We assume that, given the covariate $Z = z_k$, the pair $(X, Y)$ follows a bivariate normal distribution. Then it follows from Reference 19 that

$$\theta_k = P(X_i > X_j \mid Y_i > Y_j) = \frac{1}{2} + \frac{1}{\pi}\arcsin(\rho_k), \qquad (15)$$

where $(X_i, Y_i)$ and $(X_j, Y_j)$ are random samples from the distribution of $(X, Y)$ given $Z = z_k$.
Without loss of generality, we assume that, given each level of $Z$, $X$ and $Y$ have mean zero, and that $\rho_k$, the correlation coefficient between the biomarker and the gold standard when the covariate is at the $k$th level, is a function of $z_k$. In our simulations, $g$ is taken as the probit link function $\Phi$, the covariate was set to have $K = 2, 3, 4$, or 5 levels, and the variances of $X$ and $Y$ were either equal or unequal. For each level, we assumed that we could obtain $n = 30$, 60, 100, or 150 observations. We generated 1000 data sets to obtain the empirical type I error rates and power under various alternative hypotheses.
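To illustrate the data-generating mechanism, the sketch below (our own, with placeholder parameter values rather than the exact ones used in the simulations) draws one level's worth of data from a bivariate normal distribution whose correlation is solved from the target accuracy through the arcsine relation in (15).

```python
import numpy as np

def simulate_level(theta_k, n, sds=(1.0, 1.0), seed=None):
    """Draw n (biomarker, gold standard) pairs at one covariate level.

    The correlation is obtained from the target accuracy theta_k via the
    bivariate normal relation theta = 1/2 + arcsin(rho)/pi, so that the
    simulated pairs have accuracy index theta_k.
    """
    rng = np.random.default_rng(seed)
    rho = np.sin(np.pi * (theta_k - 0.5))
    cov = [[sds[0] ** 2, rho * sds[0] * sds[1]],
           [rho * sds[0] * sds[1], sds[1] ** 2]]
    xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    return xy[:, 0], xy[:, 1]       # biomarker values, gold-standard values
```

For example, `x, y = simulate_level(0.8, 150)` followed by `accuracy_index(x, y)` should return a value close to 0.8.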
3.1.1 |. The type I error
Under the null hypothesis $H_0: \theta_1 = \cdots = \theta_K$, which is equivalent to $\rho_1 = \cdots = \rho_K$, we fix the accuracy index at a common value for each level of $Z$. Table 1 summarizes the type I error rates of the regression-based test and the Hotelling $T^2$ test under various configurations of variance estimation, number of levels of the covariate, and sample sizes. As expected, for relatively large sample sizes, the two test approaches control the type I error quite satisfactorily at the nominal level of 0.05 using either the approximate or the bootstrap variance, that is, $\widehat{\mathrm{Var}}(\hat\theta_k)$ or $\widehat{\mathrm{Var}}_B(\hat\theta_k)$. For a relatively small sample size (such as $n = 30$), however, using the approximate variance yields inflated type I error rates, in which case the bootstrap variance is preferred, especially when $K$ is relatively large. Partly because the two proposed test statistics, $T$ and $T^2$, involve the estimation of unknown variance parameters, the smaller the $n$ and the larger the $K$, the more the distributions of $T$ and $T^2$ deviate from their asymptotic distributions under the null hypothesis. Furthermore, the type I error rates of the two tests seem to be less affected by the common value of the accuracy index.
TABLE 1.
Comparison of type I error rates of the regression-based test and the Hotelling T2 test (runs:1000).
| n | Test | Variance | K = 2 | | K = 3 | | K = 4 | | K = 5 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 30 | Regression | Approximate | 0.072 | 0.089 | 0.081 | 0.084 | 0.084 | 0.105 | 0.092 | 0.124 |
| Bootstrapping | 0.042 | 0.054 | 0.046 | 0.046 | 0.049 | 0.058 | 0.055 | 0.066 | ||
| Hotelling | Approximate | 0.070 | 0.083 | 0.091 | 0.094 | 0.111 | 0.120 | 0.149 | 0.137 | |
| Bootstrapping | 0.040 | 0.053 | 0.042 | 0.045 | 0.041 | 0.059 | 0.065 | 0.064 | ||
| 60 | Regression | Approximate | 0.070 | 0.062 | 0.081 | 0.078 | 0.078 | 0.051 | 0.059 | 0.061 |
| Bootstrapping | 0.057 | 0.040 | 0.063 | 0.057 | 0.057 | 0.039 | 0.042 | 0.048 | ||
| Hotelling | Approximate | 0.072 | 0.061 | 0.067 | 0.075 | 0.092 | 0.062 | 0.078 | 0.077 | |
| Bootstrapping | 0.054 | 0.040 | 0.039 | 0.058 | 0.065 | 0.042 | 0.051 | 0.047 | ||
| 100 | Regression | Approximate | 0.056 | 0.067 | 0.051 | 0.042 | 0.063 | 0.067 | 0.075 | 0.064 |
| Bootstrapping | 0.048 | 0.056 | 0.046 | 0.036 | 0.050 | 0.054 | 0.060 | 0.054 | ||
| Hotelling | Approximate | 0.057 | 0.066 | 0.066 | 0.048 | 0.054 | 0.052 | 0.077 | 0.065 | |
| Bootstrapping | 0.048 | 0.053 | 0.052 | 0.032 | 0.036 | 0.039 | 0.057 | 0.050 | ||
| 150 | Regression | Approximate | 0.049 | 0.057 | 0.062 | 0.057 | 0.056 | 0.052 | 0.052 | 0.059 |
| Bootstrapping | 0.045 | 0.050 | 0.058 | 0.049 | 0.048 | 0.045 | 0.047 | 0.056 | ||
| Hotelling | Approximate | 0.049 | 0.055 | 0.058 | 0.046 | 0.054 | 0.052 | 0.066 | 0.072 | |
| Bootstrapping | 0.044 | 0.048 | 0.050 | 0.040 | 0.047 | 0.041 | 0.055 | 0.055 | ||
| 30 | Regression | Approximate | 0.079 | 0.091 | 0.083 | 0.086 | 0.106 | 0.111 | 0.109 | 0.127 |
| Bootstrapping | 0.039 | 0.050 | 0.044 | 0.038 | 0.047 | 0.053 | 0.045 | 0.062 | ||
| Hotelling | Approximate | 0.071 | 0.077 | 0.098 | 0.087 | 0.105 | 0.124 | 0.138 | 0.138 | |
| Bootstrapping | 0.033 | 0.039 | 0.030 | 0.035 | 0.030 | 0.042 | 0.037 | 0.042 | ||
| 60 | Regression | Approximate | 0.071 | 0.061 | 0.081 | 0.089 | 0.071 | 0.060 | 0.070 | 0.067 |
| Bootstrapping | 0.048 | 0.046 | 0.059 | 0.062 | 0.049 | 0.036 | 0.046 | 0.042 | ||
| Hotelling | Approximate | 0.068 | 0.058 | 0.060 | 0.072 | 0.082 | 0.067 | 0.088 | 0.070 | |
| Bootstrapping | 0.042 | 0.038 | 0.033 | 0.040 | 0.044 | 0.029 | 0.042 | 0.038 | ||
| 100 | Regression | Approximate | 0.048 | 0.049 | 0.053 | 0.064 | 0.062 | 0.053 | 0.062 | 0.082 |
| Bootstrapping | 0.043 | 0.045 | 0.042 | 0.049 | 0.039 | 0.045 | 0.048 | 0.042 | ||
| Hotelling | Approximate | 0.045 | 0.045 | 0.049 | 0.064 | 0.056 | 0.052 | 0.064 | 0.069 | |
| Bootstrapping | 0.035 | 0.037 | 0.040 | 0.044 | 0.046 | 0.046 | 0.034 | 0.030 | ||
| 150 | Regression | Approximate | 0.049 | 0.049 | 0.050 | 0.059 | 0.070 | 0.054 | 0.056 | 0.058 |
| Bootstrapping | 0.045 | 0.056 | 0.061 | 0.060 | 0.055 | 0.051 | 0.037 | 0.047 | ||
| Hotelling | Approximate | 0.047 | 0.048 | 0.056 | 0.065 | 0.070 | 0.061 | 0.067 | 0.074 | |
| Bootstrapping | 0.040 | 0.052 | 0.047 | 0.049 | 0.043 | 0.044 | 0.037 | 0.043 | ||
Note: n is the sample size in each level of Z.
3.1.2 |. Power analysis
Simulations were conducted to estimate the empirical power of the tests under various configurations of the alternative hypothesis $H_1$: not all $\theta_k$s are equal. Under the bivariate normal assumption, we considered three settings for generating the covariate-dependent correlation coefficients $\rho_k$. The first setting exactly meets the assumptions of the regression analysis: the probit-transformed accuracy increases by the same amount as the level of $Z$ is increased by one unit. To generate data for this setting, we fixed the regression coefficients and solved for the $\rho_k$s from Equation (15) and model (4). The second setting departs slightly from the first in that the logit link function was used instead of the probit link function to generate the data; the $\rho_k$s were again derived from Equation (15) and the logit-linear model. The third setting is more general in that the $\theta_k$s are different but do not follow any functional trend as the covariate varies; the corresponding $\rho_k$s were computed from Equation (15). For all three settings, the number of levels of the covariate and the sample sizes were the same as in the type I error simulations. Both equal and unequal variances of $X$ and $Y$ were considered. The equal variance case is more likely to occur when $X$ and $Y$ measure the same quantity, for example, when $Y$ is the gold standard and $X$ is an imperfect instrument; the unequal variance case is more likely otherwise.
The simulated powers for the three settings are presented in Tables 2, 3, and 4, respectively. As expected, power increases with sample size. The power obtained with the approximate variance appears to be larger than the power obtained with the bootstrap variance. Combining these results with Table 1, we suggest that the approximate variance is reliable for relatively large sample sizes. For small sample sizes, however, the bootstrap variance should be used because of the inflated type I error rates associated with the approximate variance. The regression method is more powerful than the Hotelling $T^2$ test in the monotone settings 1 and 2, even when the regression model assumptions are violated (as in setting 2). In setting 3, however, the Hotelling $T^2$ test performs significantly better than the regression method except in the case $K = 2$, as shown in Table 4. Tables 1, 2, 3, and 4 all show that the Hotelling $T^2$ test performs nearly as well as the regression method when $K = 2$ and $n$ is relatively large, which confirms Remark 1 in Section 2. Thus the regression method is preferred if the $\theta_k$s follow some monotone functional trend as the covariate varies; otherwise, the Hotelling $T^2$ test is preferred. For a given sample size $n$, we can also see from Tables 2 and 3 that the powers of the two tests increase with $K$ in the monotone settings 1 and 2. The reason is that a larger $K$ results in a more pronounced difference among the $\theta_k$s. For a relatively small $n$, a larger $K$ also results in an inflated type I error rate, which in turn inflates the powers of the two tests based on the approximate variance.
TABLE 2.
Comparison of powers of the regression-based test and the Hotelling T2 test under the first setting (runs:1000).
| n | Test | Variance | K = 2 | | K = 3 | | K = 4 | | K = 5 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 30 | Regression | Approximate | 0.223 | 0.223 | 0.566 | 0.590 | 0.908 | 0.919 | 0.998 | 0.995 |
| Bootstrapping | 0.139 | 0.131 | 0.432 | 0.454 | 0.854 | 0.863 | 0.996 | 0.988 | ||
| Hotelling | Approximate | 0.207 | 0.200 | 0.460 | 0.479 | 0.823 | 0.837 | 0.985 | 0.982 | |
| Bootstrapping | 0.118 | 0.105 | 0.252 | 0.272 | 0.601 | 0.625 | 0.897 | 0.884 | ||
| 60 | Regression | Approximate | 0.348 | 0.342 | 0.864 | 0.890 | 0.999 | 1.000 | 1.000 | 1.000 |
| Bootstrapping | 0.290 | 0.297 | 0.819 | 0.845 | 0.998 | 1.000 | 1.000 | 1.000 | ||
| Hotelling | Approximate | 0.339 | 0.335 | 0.789 | 0.795 | 0.991 | 0.998 | 1.000 | 1.000 | |
| Bootstrapping | 0.266 | 0.267 | 0.672 | 0.688 | 0.970 | 0.987 | 1.000 | 1.000 | ||
| 100 | Regression | Approximate | 0.527 | 0.521 | 0.974 | 0.981 | 1.000 | 1.000 | 1.000 | 1.000 |
| Bootstrapping | 0.489 | 0.468 | 0.972 | 0.975 | 1.000 | 1.000 | 1.000 | 1.000 | ||
| Hotelling | Approximate | 0.524 | 0.514 | 0.956 | 0.960 | 1.000 | 1.000 | 1.000 | 1.000 | |
| Bootstrapping | 0.470 | 0.448 | 0.929 | 0.937 | 1.000 | 1.000 | 1.000 | 1.000 | ||
| 150 | Regression | Approximate | 0.716 | 0.720 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Bootstrapping | 0.678 | 0.690 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | ||
| Hotelling | Approximate | 0.715 | 0.709 | 0.990 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |
| Bootstrapping | 0.664 | 0.658 | 0.985 | 0.988 | 1.000 | 1.000 | 1.000 | 1.000 | ||
Note: The data were generated under the assumptions of the regression analysis with a probit link function, in which the probit-transformed accuracy increases by the same amount as the level of Z is increased by one unit.
TABLE 3.
Comparison of powers of the regression-based test and the Hotelling T2 test under the second setting (runs:1000).
| n | Test | Variance | K = 2 | | K = 3 | | K = 4 | | K = 5 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 30 | Regression | Approximate | 0.114 | 0.114 | 0.288 | 0.256 | 0.510 | 0.511 | 0.765 | 0.773 |
| Bootstrapping | 0.074 | 0.075 | 0.199 | 0.188 | 0.402 | 0.415 | 0.679 | 0.687 | ||
| Hotelling | Approximate | 0.112 | 0.109 | 0.231 | 0.240 | 0.390 | 0.415 | 0.608 | 0.636 | |
| Bootstrapping | 0.069 | 0.072 | 0.146 | 0.144 | 0.228 | 0.240 | 0.401 | 0.428 | ||
| 60 | Regression | Approximate | 0.151 | 0.144 | 0.383 | 0.415 | 0.761 | 0.796 | 0.959 | 0.956 |
| Bootstrapping | 0.119 | 0.120 | 0.330 | 0.363 | 0.721 | 0.743 | 0.948 | 0.938 | ||
| Hotelling | Approximate | 0.149 | 0.141 | 0.325 | 0.336 | 0.620 | 0.639 | 0.886 | 0.859 | |
| Bootstrapping | 0.120 | 0.117 | 0.257 | 0.274 | 0.538 | 0.559 | 0.821 | 0.784 | ||
| 100 | Regression | Approximate | 0.210 | 0.224 | 0.598 | 0.614 | 0.955 | 0.950 | 1.000 | 1.000 |
| Bootstrapping | 0.183 | 0.191 | 0.565 | 0.569 | 0.935 | 0.940 | 1.000 | 1.000 | ||
| Hotelling | Approximate | 0.208 | 0.221 | 0.534 | 0.518 | 0.850 | 0.880 | 1.000 | 1.000 | |
| Bootstrapping | 0.176 | 0.186 | 0.463 | 0.452 | 0.790 | 0.810 | 0.985 | 1.000 | ||
| 150 | Regression | Approximate | 0.225 | 0.235 | 0.770 | 0.782 | 0.980 | 0.982 | 1.000 | 1.000 |
| Bootstrapping | 0.190 | 0.200 | 0.760 | 0.768 | 0.980 | 0.983 | 1.000 | 1.000 | ||
| Hotelling | Approximate | 0.225 | 0.230 | 0.705 | 0.712 | 0.965 | 0.965 | 1.000 | 1.000 | |
| Bootstrapping | 0.190 | 0.200 | 0.650 | 0.635 | 0.950 | 0.960 | 1.000 | 1.000 | ||
Note: The data were generated under the assumptions of the regression analysis with a logit link function, in which the logit-transformed accuracy increases by the same amount as the level of Z is increased by one unit.
TABLE 4.
Comparison of powers of the regression-based test and the Hotelling T2 test under the third setting (runs:1000).
| n | Test | Variance | K = 2 | | K = 3 | | K = 4 | | K = 5 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 30 | Regression | Approximate | 0.300 | 0.272 | 0.121 | 0.103 | 0.230 | 0.197 | 0.326 | 0.342 |
| Bootstrapping | 0.186 | 0.188 | 0.059 | 0.054 | 0.140 | 0.126 | 0.205 | 0.223 | ||
| Hotelling | Approximate | 0.266 | 0.257 | 0.234 | 0.248 | 0.497 | 0.484 | 0.534 | 0.539 | |
| Bootstrapping | 0.156 | 0.155 | 0.110 | 0.104 | 0.279 | 0.279 | 0.273 | 0.302 | ||
| 60 | Regression | Approximate | 0.457 | 0.460 | 0.126 | 0.113 | 0.305 | 0.303 | 0.516 | 0.510 |
| Bootstrapping | 0.381 | 0.387 | 0.091 | 0.094 | 0.238 | 0.234 | 0.452 | 0.436 | ||
| Hotelling | Approximate | 0.443 | 0.449 | 0.386 | 0.382 | 0.741 | 0.729 | 0.780 | 0.785 | |
| Bootstrapping | 0.347 | 0.360 | 0.274 | 0.296 | 0.630 | 0.621 | 0.668 | 0.687 | ||
| 100 | Regression | Approximate | 0.688 | 0.644 | 0.162 | 0.172 | 0.458 | 0.451 | 0.737 | 0.705 |
| Bootstrapping | 0.642 | 0.601 | 0.139 | 0.148 | 0.417 | 0.410 | 0.687 | 0.673 | ||
| Hotelling | Approximate | 0.683 | 0.634 | 0.587 | 0.578 | 0.934 | 0.943 | 0.959 | 0.961 | |
| Bootstrapping | 0.626 | 0.586 | 0.516 | 0.514 | 0.902 | 0.909 | 0.934 | 0.939 | ||
| 150 | Regression | Approximate | 0.855 | 0.839 | 0.213 | 0.198 | 0.600 | 0.611 | 0.870 | 0.888 |
| Bootstrapping | 0.831 | 0.822 | 0.204 | 0.174 | 0.585 | 0.595 | 0.848 | 0.866 | ||
| Hotelling | Approximate | 0.852 | 0.836 | 0.759 | 0.745 | 0.986 | 0.993 | 0.998 | 0.996 | |
| Bootstrapping | 0.824 | 0.816 | 0.715 | 0.706 | 0.986 | 0.990 | 0.998 | 0.992 | ||
Note: The data were generated more generally, with accuracy indices that differ across levels but do not follow any functional trend as the covariate Z varies.
3.2 |. A real example
We used data from a study that assessed three types of imaging modalities, barium enema (BE), computed tomographic colonography (CTC), and colonoscopy (CO), for detecting neoplastic colon polyps and cancers. Colonoscopy (CO) is considered the gold standard for detecting neoplastic polyps or colon cancer. In this study, 614 patients completed all three modalities. For the purpose of our example, we used data from the computed tomographic colonography and colonoscopy modalities. There were 273 patients who had neoplastic lesions detected by either computed tomographic colonography or colonoscopy. The median size of the largest lesion detected by colonoscopy was 8 mm (range = 6–70), whereas the median size of the lesion detected by computed tomographic colonography in the same area of the colon was 7 mm (range = 3–70). Due to potential biological differences related to race and family history, our interest was to assess the diagnostic accuracy of the lesion size detected by computed tomographic colonography, with colonoscopy serving as the continuous gold standard, according to the patient's family history and race. The null hypothesis to be tested was that the diagnostic accuracy was the same across the four combinations of family history and race, that is, $H_0: \theta_1 = \theta_2 = \theta_3 = \theta_4$. However, because there were only five patients with a family history of colon cancer who were from other racial groups, this group was deleted from the analysis due to its small size.
Let the class variable $Z$ indicate whether the patient is 'White/no family history' ($z = 1$), 'Other race/no family history' ($z = 2$), or 'White/family history' ($z = 3$). In the regression analysis, we adopt the probit-link model $\theta_k = \Phi(\beta_0 + \beta_1 z_k)$, $k = 1, 2, 3$.
Then, the null hypothesis was tested with both the regression statistic $T$ given by (12) and the Hotelling statistic $T^2$ given by (14), using the approximate (bootstrap) variance, where the approximate variance of $\hat\theta_k$ was defined by (6) and its bootstrap variance was estimated by resampling with replacement from the observations in the corresponding group. The estimated diagnostic accuracy of the lesion size detected by computed tomographic colonography against colonoscopy, by family history and race, is presented in Table 5. Table 5 shows that there were no statistically significant differences in accuracy across the groups defined by family history and race using either the regression analysis approach or the Hotelling $T^2$ test. We conclude that there were no differences in the diagnostic accuracy of computed tomographic colonography, relative to colonoscopy, among the three groups of family history and race.
TABLE 5.
Comparison of accuracies among races by family history.
| Race by family history | Accuracy | SE | Regression analysis | Hotelling T2 test |
|---|---|---|---|---|
| White/No family history | 0.877 | 0.017 (0.017) | P-value = 0.525 (0.509) | χ2 = 1.517 (1.382), P-value = 0.468 (0.501) |
| Other race/No family history | 0.887 | 0.013 (0.015) | | |
| White/Family history | 0.853 | 0.0245 (0.0247) | | |
Note: Numbers outside the parentheses were calculated with the approximate variance; numbers in parentheses were calculated with the bootstrap variance.
4 |. DISCUSSION
In this article, we developed two methods, a regression analysis and a Hotelling's $T^2$ test, to investigate the effect of covariates on the diagnostic accuracy of a continuous biomarker when the gold standard is also measured on a continuous scale. Approximate variance estimates were derived for inference on the accuracy index and were compared with bootstrap variance estimates. Simulation studies indicate that the two methods, with either variance estimate, control type I error rates when sample sizes are relatively large. For small sample sizes, the bootstrap variance estimates are preferred. The regression analysis, even if the model is slightly misspecified, is more powerful in monotone settings; otherwise, the Hotelling's $T^2$ test is preferred.
The regression analysis approach focuses on a discrete covariate, but the method can be readily extended to deal with a continuous covariate. Alternatively, one can discretize the continuous covariate by binning, just as Metz et al20 and Zhou and Lin21 categorized continuous data for ROC curves, and use the regression analysis directly or the Hotelling's $T^2$ test developed in this article to test the null hypothesis of no difference in accuracy. Caution is needed if the test fails to reject the null hypothesis, since the p-value of the test may depend on how the continuous covariate is discretized. However, how to find good category boundaries for the continuous covariate in Equation (1) remains a challenging issue and needs further study.
ACKNOWLEDGEMENTS
Research of Aiyi Liu is supported by the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD), the National Institutes of Health (NIH). Research of Mixia Wu is supported by National Natural Science Foundation of China (12271034). The authors are grateful to the editors and referees for their detailed suggestions which considerably improved the quality of the paper.
APPENDIX A. ASYMPTOTICS
Since $Z$ is a categorical covariate and samples at different covariate levels are assumed to be independent, the derivation of the asymptotic distribution of $\hat\theta_k$ is the same as in the case with no covariates. We therefore suppress the level index in the following derivation.
We will derive the asymptotic distribution of the empirical estimator $\hat\theta$ of the accuracy index, which is a $U$-statistic for a continuous bivariate probability distribution with a symmetric kernel,

$$\hat\theta = \frac{2}{n(n - 1)}\sum_{i < j}\psi\{(X_i, Y_i), (X_j, Y_j)\}, \qquad (A1)$$

where $\psi\{(X_i, Y_i), (X_j, Y_j)\} = 1$ if $(X_i - X_j)(Y_i - Y_j) > 0$, $0.5$ if $X_i = X_j$, and 0 otherwise. It is easy to prove that

$$E(\hat\theta) = \theta. \qquad (A2)$$
We first introduce a theoretical result on $U$-statistics from Hoeffding.22
Lemma 1.
Let $r$ be an even integer. If $E|\psi|^r < \infty$ (implied by the boundedness of $\psi$ here), then

$$E(U_n - \hat U_n)^r = O(n^{-r}),$$

where $U_n$ is a $U$-statistic with kernel $\psi$, $\hat U_n$ is the corresponding projection of $U_n$, and $U_n - \hat U_n$ is itself a $U$-statistic whose symmetric kernel is the difference between $\psi$ and the kernel of the projection.
The observations $(X_i, Y_i)$, $i = 1, \ldots, n$, are i.i.d. with a common distribution. Define

$$\psi_1(x, y) = E\big[\psi\{(x, y), (X_2, Y_2)\}\big],$$

and denote $\sigma_1^2 = \mathrm{Var}\{\psi_1(X_1, Y_1)\}$. Note that

$$E\{\psi_1(X_1, Y_1)\} = E(\hat\theta) = \theta, \qquad (A3)$$

$$\sigma_1^2 = \mathrm{Cov}\big[\psi\{(X_1, Y_1), (X_2, Y_2)\},\, \psi\{(X_1, Y_1), (X_3, Y_3)\}\big] = \xi_1. \qquad (A4)$$
We define the projection $\hat U_n$ of $\hat\theta$ as

$$\hat U_n = \theta + \frac{2}{n}\sum_{i=1}^{n}\{\psi_1(X_i, Y_i) - \theta\}. \qquad (A5)$$

Since the terms $\psi_1(X_i, Y_i) - \theta$ are i.i.d. with mean 0 and variance $\sigma_1^2$, it follows from the central limit theorem that $\sqrt{n}(\hat U_n - \theta) \xrightarrow{d} N(0, 4\sigma_1^2)$, where $\xrightarrow{d}$ means 'converges in distribution'. To prove that the asymptotic distribution of $\hat\theta$ is the same as that of $\hat U_n$, we need to show that $\hat\theta$ and $\hat U_n$ are asymptotically equivalent; it is sufficient to show that $\sqrt{n}(\hat\theta - \hat U_n)$ converges to 0 in probability.
Note that $\hat\theta - \hat U_n$ has mean zero. The difference may itself be expressed as a $U$-statistic,

$$\hat\theta - \hat U_n = \frac{2}{n(n - 1)}\sum_{i < j}\psi^*\{(X_i, Y_i), (X_j, Y_j)\},$$

based on the symmetric kernel

$$\psi^*\{(X_i, Y_i), (X_j, Y_j)\} = \psi\{(X_i, Y_i), (X_j, Y_j)\} - \psi_1(X_i, Y_i) - \psi_1(X_j, Y_j) + \theta.$$

Note that $E[\psi^*\{(X_1, Y_1), (X_2, Y_2)\}] = 0$, the kernel $\psi^*$ is bounded, and the observations $(X_i, Y_i)$ are independent and identically distributed. According to Lemma 1, $E(\hat\theta - \hat U_n)^2 = O(n^{-2})$. Therefore, $\sqrt{n}(\hat\theta - \hat U_n)$ converges to 0 in probability, and $\sqrt{n}(\hat\theta - \theta)$ has the same limiting distribution $N(0, 4\sigma_1^2)$ as $\sqrt{n}(\hat U_n - \theta)$.
Note that samples at different covariate levels are assumed to be independent. Thus, $\hat\theta_1, \ldots, \hat\theta_K$ are mutually independent and each is asymptotically normally distributed, that is,

$$\sqrt{n_k}(\hat\theta_k - \theta_k) \xrightarrow{d} N(0, 4\sigma_{1,k}^2), \qquad (A6)$$

where $\sigma_{1,k}^2$ denotes $\sigma_1^2$ computed at covariate level $z_k$. According to the delta method, the $g^{-1}(\hat\theta_k)$ are also mutually independent, and the approximate distribution of each is

$$\sqrt{n_k}\big\{g^{-1}(\hat\theta_k) - g^{-1}(\theta_k)\big\} \xrightarrow{d} N\!\left(0,\ \frac{4\sigma_{1,k}^2}{\big[g'\{g^{-1}(\theta_k)\}\big]^2}\right). \qquad (A7)$$

Note that the exact variance of $\hat\theta_k$ is given in (5), of which $4\sigma_{1,k}^2/n_k$ is only the leading term. Therefore, when $n_k$ is not large enough, replacing the asymptotic variance with the exact variance (5) in (A6) and (A7) yields better approximations to the distributions of $\hat\theta_k$ and $g^{-1}(\hat\theta_k)$.
Footnotes
CONFLICT OF INTEREST STATEMENT
The authors declare no potential conflict of interests.
REFERENCES
1. Obuchowski NA. Estimating and comparing diagnostic tests' accuracy when the gold standard is not binary. Acad Radiol. 2005;12:1198–1204.
2. Obuchowski NA. An ROC-type measure of diagnostic accuracy when the gold standard is continuous-scale. Stat Med. 2006;25:481–493.
3. Chang YCI. Maximizing an ROC-type measure via linear combination of markers when the gold reference is continuous. Stat Med. 2013;32:1893–1903.
4. Pepe MS. Three approaches to regression analysis of receiver operating characteristic curves for continuous test results. Biometrics. 1998;54:124–135.
5. Kim E, Zeng D. Semiparametric ROC analysis using accelerated regression models. Stat Sin. 2013;23:829–851.
6. Tosteson ANA, Begg CB. A general regression methodology for ROC curve estimation. Med Decis Making. 1988;8:204–215.
7. O'Malley AJ, Zou KH, Fielding JR, Tempany C. Bayesian regression methodology for estimating a receiver operating characteristic curve with two radiologic applications: prostate biopsy and spiral CT of ureteral stones. Acad Radiol. 2001;8:713–725.
8. Faraggi D. Adjusting receiver operating characteristic curves and related indices for covariates. J R Stat Soc Ser D Stat. 2003;52:179–192.
9. Thompson ML, Zucchini W. On the statistical analysis of ROC curves. Stat Med. 1989;8:1277–1290.
10. Obuchowski NA, Beiden SV, Berbaum KS, et al. Multireader, multicase receiver operating characteristic analysis: an empirical comparison of five methods. Acad Radiol. 2004;11:980–995.
11. Pepe MS. An interpretation for the ROC curve and inference using GLM procedures. Biometrics. 2000;56:352–359.
12. Cai T. Semi-parametric ROC regression analysis with placement values. Biostatistics. 2004;5:45–60.
13. Zhang Z, Huang Y. A linear regression framework for the receiver operating characteristic (ROC) curve analysis. J Biom Biostat. 2005;3(2):5726.
14. Zhou XH, Castelluccio P, Zhou C. Nonparametric estimation of ROC curves in the absence of a gold standard. Biometrics. 2005;61:600–609.
15. Lin H, Zhou XH, Li G. A direct semiparametric receiver operating characteristic curve regression with unknown link and baseline functions. Stat Sin. 2012;22:14–27.
16. Rodenberg C, Zhou XH. ROC curve estimation when covariates affect the verification process. Biometrics. 2000;56:1256–1262.
17. Liu D, Zhou XH. Semiparametric estimation of the covariate-specific ROC curve in presence of ignorable verification bias. Biometrics. 2011;67:906–916.
18. Liu D, Zhou XH. Covariate adjustment in estimating the area under ROC curve with partially missing gold standard. Biometrics. 2013;69:91–100.
19. Fang HB, Fang KT, Kotz S. The meta-elliptical distributions with given marginals. J Multivar Anal. 2002;82:1–16.
20. Metz CE, Herman BA, Shen JH. Maximum likelihood estimation of receiver operating characteristic (ROC) curves from continuously-distributed data. Stat Med. 1998;17(9):1033–1053.
21. Zhou XH, Lin HZ. Semi-parametric maximum likelihood estimates for ROC curves of continuous-scale tests. Stat Med. 2008;27:5271–5290.
22. Hoeffding W. A class of statistics with asymptotically normal distribution. Ann Math Stat. 1948;19:293–325.