Author manuscript; available in PMC: 2025 Jun 30.
Published in final edited form as: Stat Med. 2023 Jul 17;42(22):4015–4027. doi: 10.1002/sim.9845

Significance tests for covariates in the diagnostic accuracy index of a biomarker against a continuous gold standard

Mixia Wu 1, Xian Sun 2, Aiyi Liu 3, Chenchen Peng 1, Zhaohai Li 4
PMCID: PMC12207746  NIHMSID: NIHMS2090594  PMID: 37455675

Summary

The receiver operating characteristic (ROC) curve is a popular tool for describing and comparing the diagnostic accuracy of biomarkers when a binary-scale gold standard is available. However, there are many diagnostic tests whose gold standards are continuous, and several extensions of the ROC curve have been proposed to evaluate the diagnostic potential of biomarkers when the gold standard is continuous-scale. Moreover, in evaluating these biomarkers it is often necessary to consider the effects of covariates on the diagnostic accuracy of the biomarker of interest. Covariates may include subject characteristics, the expertise of the test operator, test procedures, or aspects of specimen handling. Adjusting for covariates when the gold standard is continuous is challenging and has not been addressed in the literature. To fill the gap, we propose two general testing frameworks to account for the effect of covariates on diagnostic accuracy. Simulation studies are conducted to compare the proposed tests. Data from a study that assessed three types of imaging modalities for detecting neoplastic colon polyps and cancers are used to illustrate the proposed methods.

Keywords: biomarkers, covariate adjustment, diagnostic accuracy, regression analysis

1 |. INTRODUCTION

In medical diagnostics, the receiver operating characteristic (ROC) curve is a useful tool to assess a biomarker’s accuracy and to compare the accuracies of competing biomarkers. In traditional ROC analysis, a subject is diagnosed as diseased or nondiseased based on the subject’s level of the biomarker. The diagnostic accuracy of the biomarker depends on the existence of a binary-scale gold standard, usually obtained from clinical follow-up, surgical verification, autopsy, and so on. Sensitivity and specificity at various cutpoints of the biomarker’s test results summarize the diagnostic accuracy of the biomarker.

In practice, however, there are many examples of gold standards whose outcomes are not on a binary scale but rather on a continuous scale. For example, glycosylated hemoglobin is usually used as a primary index of diabetic control and is originally measured as a continuous-valued variable. Setting an inappropriate threshold to dichotomize a continuous gold standard results in an imprecise assessment of the biomarker’s diagnostic accuracy and may conceal an important relationship between the biomarker and the gold standard. Obuchowski1,2 extended the analysis of the diagnostic accuracy of a binary gold standard to the case of a continuous gold standard, proposed an index measure, and presented a nonparametric estimator of the index for a continuous biomarker. Chang3 discussed linearly combining multiple biomarkers to maximize this accuracy index.

It is well known that the performance of a diagnostic test or biomarker may depend on certain characteristics of patients, environmental factors of the diagnostic test, the setting and type of test, and so on. When the gold standard is binary-scale, various methods have been developed for regression analysis of ROC curves and of the summary measures of accuracy derived from them to account for the effect of covariates. Methods in the literature for modeling the effect of covariates on ROC curves can be loosely divided into two categories, the indirect and the direct approach; see, among many others, References 4 and 5. The indirect approach models the test results as a function of the binary gold standard and covariates and then derives the ROC curve or its summary measures. This method was first proposed by Tosteson and Begg,6 who fitted an ordinal regression model with location and scale parameters to diagnostic test results recorded as ordered categories. O’Malley et al7 proposed a Bayesian regression method for continuous test results and then evaluated the ROC curve. Faraggi8 described the biomarker values as functions of covariates, with the assumption that there is no covariate dependence in the diseased population, and then obtained estimates of the area under the ROC curve and the Youden index. The direct approach models the ROC curves or summary measures on covariates and the binary gold standard. Regression models for various ROC summary measures have been proposed, such as the area under the ROC curve based on a conventional analysis of variance (ANOVA).9,10 Pepe11 proposed a regression model for ROC curves directly and provided parameter estimation using GLM procedures. Kim and Zeng5 generalized an accelerated failure time model from survival analysis to ROC analysis, assuming a relationship between covariates and the baseline ROC curve. Nonparametric and semi-parametric estimation of the effect of covariates has also been discussed, in which regression models for ROC summary indices or ROC curves were available; see References 12–15, among others. Regression analysis has also been developed when there is ignorable verification bias or a missing or partially missing gold standard.14,16–18

To the best of our knowledge, however, there is no specific methodology in the literature to account for the effect of covariates in assessing diagnostic accuracy when the gold standard is continuous. To fill this gap, we develop methodology to directly evaluate the effect of covariates on the accuracy index relating a biomarker and a gold standard that are both measured on a continuous scale. A motivating example is to assess the accuracy of two imaging modalities, barium enema (BE) and computed tomographic colonography (CTC), in detecting colon cancer, as compared with colonoscopy (CO), which is considered the gold standard. Because of potential biological differences related to race and family history, our interest is to assess how the diagnostic accuracy varies with the family history and race of the patient.

The paper is arranged as follows. In Section 2, we propose two test methods to assess the effect of a common covariate on the diagnostic accuracy of a biomarker against a continuous gold standard. In Section 3, we first present simulation results on type I error and power of the tests for various sample sizes, and then demonstrate the methods using imaging modalities data. Some discussion is provided in Section 4. The proofs of the asymptotic results are relegated to the appendix.

2 |. METHODOLOGY

Let $Y_i$ be the gold standard, $X_i$ the biomarker, and $Z_i$ the covariate for the $i$th subject ($i=1,2,\ldots,N$). We assume that $X$ and $Y$ are continuous and, without loss of generality, that $Z$ is discrete, taking $K$ possible values $z_1,\ldots,z_K$. We want to model the diagnostic accuracy as a function of the covariate. Obuchowski1,2 defined the accuracy as the probability that a randomly selected patient with a higher gold standard outcome has a higher biomarker test result than a randomly selected patient with a lower gold standard outcome. Extending this definition, for each level $k$ ($k=1,2,\ldots,K$), we define the diagnostic accuracy index as

$$\theta_k = P\{(Y_i - Y_j)(X_i - X_j) > 0 \mid Z_i = Z_j = z_k\}. \tag{1}$$

We estimate $\theta_k$ with Obuchowski’s1,2 nonparametric estimator $\hat\theta_k$, which is calculated by summing the kernel values over all possible pairs of subjects at the same covariate level. Thus, at the $k$th level ($k=1,2,\ldots,K$), we have an unbiased estimator of $\theta_k$:

$$\hat\theta_k = \binom{n_k}{2}^{-1} \sum_{1\le i<j\le n_k} \Psi\{(X_i - X_j)(Y_i - Y_j) \mid Z_i = Z_j = z_k\}, \tag{2}$$

where $n_k$ is the number of observations at covariate level $k$, the summation is taken over all pairs of observations with the same covariate level, and the function $\Psi(t \mid z)$, given $Z = z$, satisfies

$$\Psi(t \mid z) = \begin{cases} 1, & \text{if } t > 0,\\ 0.5, & \text{if } t = 0,\\ 0, & \text{otherwise}. \end{cases} \tag{3}$$

For simplicity, we denote

$$\Psi_{i,j}(z_k) = \Psi\{(X_i - X_j)(Y_i - Y_j) \mid Z_i = Z_j = z_k\}.$$
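For concreteness, the following is a minimal Python sketch (ours, not the authors’ code) of the nonparametric estimator (2) at a single covariate level; the names `psi` and `accuracy_index` are illustrative.

```python
import numpy as np

def psi(t):
    """Kernel Psi from Equation (3): 1 if t > 0, 0.5 if t == 0, 0 otherwise."""
    return np.where(t > 0, 1.0, np.where(t == 0, 0.5, 0.0))

def accuracy_index(x, y):
    """Nonparametric estimator theta_hat_k of Equation (2), computed from the
    biomarker values x and gold-standard values y of the n_k subjects that
    share one covariate level."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    i, j = np.triu_indices(n, k=1)                   # all pairs with i < j
    return float(psi((x[i] - x[j]) * (y[i] - y[j])).mean())
```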

Our main aim is to investigate whether the accuracy θ depends on the covariate Z.

2.1 |. Regression analysis

Assume that the linked accuracy $G(\theta_k)$ is a linear function of the covariate value at each level $k$, that is,

$$G(\theta_k) = b_0 + b_1 z_k, \tag{4}$$

where $G(\cdot)$ is a strictly monotone function mapping $(0,1)$ onto $(-\infty, +\infty)$. In practice, $G(\theta_k)$ is usually taken to be the probit link $\Phi^{-1}(\theta_k)$ or the logit link $\log\{\theta_k/(1-\theta_k)\}$, where $\Phi(\cdot)$ is the standard normal distribution function. The regression coefficients $b = (b_0, b_1)^T$ can be estimated by weighted linear regression of $G(\hat\theta_k)$ on $z_k$.

The exact variance of $\hat\theta_k$, given below, is derived using the theory developed for general U-statistics:

$$\operatorname{Var}(\hat\theta_k) = E(\hat\theta_k - \theta_k)^2 = \binom{n_k}{2}^{-2} E\Big\{\sum_{1\le i<j\le n_k}\big(\Psi_{i,j}(z_k) - \theta_k\big)\Big\}^2 = \binom{n_k}{2}^{-2}\Big\{\binom{n_k}{2}\operatorname{Var}\big(\Psi_{i,j}(z_k)\big) + 6\binom{n_k}{3}\operatorname{Cov}\big(\Psi_{i,j}(z_k), \Psi_{i,l}(z_k)\big)\Big\},$$

where $\operatorname{Var}\{\Psi_{i,j}(z_k)\} = E\{\Psi_{i,j}(z_k) - \theta_k\}^2 = \theta_k - \theta_k^2$, and

$$\operatorname{Cov}\big(\Psi_{i,j}(z_k), \Psi_{i,l}(z_k)\big) = \operatorname{Cov}\big(\Psi_{i,j}(z_k), \Psi_{m,j}(z_k)\big) = \operatorname{Cov}\big(\Psi_{i,j}(z_k), \Psi_{j,l}(z_k)\big) = \operatorname{Cov}\big(\Psi_{i,j}(z_k), \Psi_{m,i}(z_k)\big) = E\big\{\Psi_{i,j}(z_k) - \theta_k\big\}\big\{\Psi_{i,l}(z_k) - \theta_k\big\} = E\big\{\Psi_{i,j}(z_k)\Psi_{i,l}(z_k)\big\} - \theta_k^2.$$

Denote $p_k = E\{\Psi_{i,j}(z_k)\Psi_{i,l}(z_k)\}$. Thus the variance of $\hat\theta_k$ can also be written in terms of $p_k$ and $\theta_k$ as

$$\sigma_k^2 = \operatorname{Var}(\hat\theta_k) = \frac{2\theta_k}{n_k(n_k-1)} - \frac{2(2n_k-3)\theta_k^2}{n_k(n_k-1)} + \frac{4(n_k-2)p_k}{n_k(n_k-1)}. \tag{5}$$

We use a nonparametric unbiased estimator pˆk to estimate pk, where

$$\hat p_k = \frac{1}{3}\binom{n_k}{3}^{-1} \sum_{1\le i<j<l\le n_k} \big\{\Psi_{i,j}(z_k)\Psi_{i,l}(z_k) + \Psi_{i,j}(z_k)\Psi_{j,l}(z_k) + \Psi_{i,l}(z_k)\Psi_{j,l}(z_k)\big\}.$$

Substituting $\hat\theta_k$ and $\hat p_k$ for $\theta_k$ and $p_k$, we obtain an estimator of the variance of $\hat\theta_k$:

$$\hat\sigma_k^2 = \widehat{\operatorname{Var}}(\hat\theta_k) = \frac{2\hat\theta_k}{n_k(n_k-1)} - \frac{2(2n_k-3)\hat\theta_k^2}{n_k(n_k-1)} + \frac{4(n_k-2)\hat p_k}{n_k(n_k-1)}. \tag{6}$$

Alternatively, the variance of $\hat\theta_k$ can be estimated by bootstrapping with replacement from $\{(X_i, Y_i): Z_i = z_k,\ i = 1, 2, \ldots, N\}$; we denote this estimate by $\hat\sigma_{\mathrm{bootstrap},k}^2$. Later we present simulation results comparing these two variance estimation approaches in terms of controlling error rates.
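Continuing the sketch above (it reuses `psi` and `accuracy_index`), the approximate variance (6), the triple-sum estimator of $p_k$, and a bootstrap alternative might be computed as follows; this is an illustrative implementation under our own naming, not code from the paper.

```python
import numpy as np
from math import comb
from itertools import combinations

def p_hat(x, y):
    """Unbiased estimator of p_k: the triple sum given after Equation (5)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    total = 0.0
    for i, j, l in combinations(range(n), 3):
        pij = psi((x[i] - x[j]) * (y[i] - y[j]))
        pil = psi((x[i] - x[l]) * (y[i] - y[l]))
        pjl = psi((x[j] - x[l]) * (y[j] - y[l]))
        total += (pij * pil + pij * pjl + pil * pjl) / 3.0
    return total / comb(n, 3)

def var_approx(x, y):
    """Approximate variance (6) of theta_hat_k."""
    n = len(x)
    th, p = accuracy_index(x, y), p_hat(x, y)
    return (2 * th - 2 * (2 * n - 3) * th ** 2 + 4 * (n - 2) * p) / (n * (n - 1))

def var_bootstrap(x, y, n_boot=1000, seed=0):
    """Bootstrap variance of theta_hat_k (resample subjects with replacement)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    stats = [accuracy_index(x[idx], y[idx])
             for idx in rng.integers(0, n, size=(n_boot, n))]
    return float(np.var(stats, ddof=1))
```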

Using the delta-method, the variance of G(θˆk) is found to be

$$w_k = \operatorname{Var}\{G(\hat\theta_k)\} = \{g(\theta_k)\}^2 \sigma_k^2, \tag{7}$$

and its estimator

$$\hat w_k = \widehat{\operatorname{Var}}\{G(\hat\theta_k)\} = \{g(\hat\theta_k)\}^2 \hat\sigma_k^2 \quad\text{or}\quad \{g(\hat\theta_k)\}^2 \hat\sigma_{\mathrm{bootstrap},k}^2, \tag{8}$$

where $g(\cdot)$ is the derivative of $G(\cdot)$. In particular, $g(\theta_k) = 1/\phi\{\Phi^{-1}(\theta_k)\}$ when $G(\theta_k) = \Phi^{-1}(\theta_k)$, and $g(\theta_k) = 1/\{\theta_k(1-\theta_k)\}$ when $G(\theta_k) = \log\{\theta_k/(1-\theta_k)\}$, where $\phi$ is the standard normal density.

To investigate whether the accuracy θ depends on the covariate Z, we test the null hypothesis H0:b1=0. To this end, we rewrite our model G(θˆk)=b0+b1zk as

$$R = \mathbf{Z}b + e, \tag{9}$$

where $R = \big(G(\hat\theta_1), G(\hat\theta_2), \ldots, G(\hat\theta_K)\big)^T$, $\mathbf{Z} = \big((1,1,\ldots,1)^T, (z_1,z_2,\ldots,z_K)^T\big)$ is the $K \times 2$ design matrix, $b = (b_0, b_1)^T$, and $e \sim N_K(0, W)$.

Assume that samples at different covariate levels are independent. Then

$$\operatorname{Cov}(\hat\theta_k, \hat\theta_m) = 0, \qquad \operatorname{Cov}\{G(\hat\theta_k), G(\hat\theta_m)\} = 0, \qquad 1 \le k \ne m \le K. \tag{10}$$

Furthermore, the variance-covariance matrix $W$ of the response vector $R$ is diagonal with elements $w_1, w_2, \ldots, w_K$. Thus the regression coefficients $b$ can be estimated by

$$\hat b = (\mathbf{Z}^T W^{-1} \mathbf{Z})^{-1} \mathbf{Z}^T W^{-1} R, \tag{11}$$

with $\operatorname{Cov}(\hat b) = (\mathbf{Z}^T W^{-1} \mathbf{Z})^{-1}$. Replacing $W$ with its estimate $\widehat W = \operatorname{Diag}(\hat w_1, \hat w_2, \ldots, \hat w_K)$ yields the feasible estimator of $b$, still denoted by $\hat b$ in the following. Note that $\hat b_1 = (0,1)\hat b$. Therefore, the variance of $\hat b_1$ can be approximately estimated by $\widehat{\operatorname{Var}}(\hat b_1) = (0,1)(\mathbf{Z}^T \widehat W^{-1} \mathbf{Z})^{-1}(0,1)^T$.

A test statistic can then be defined as

$$T_0 = \frac{\hat b_1}{\sqrt{\widehat{\operatorname{Var}}(\hat b_1)}}, \tag{12}$$

which asymptotically follows a standard normal distribution when $b_1 = 0$. The null hypothesis is rejected if $|T_0| > z_{\alpha/2}$, where $z_{\alpha/2}$ is the upper $\alpha/2$ percentile of the standard normal distribution.
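The regression-based test can be assembled from the per-level estimates and their variances; a minimal Python sketch under the probit link is given below, where `regression_test` and its arguments are our own names, not the authors’ code.

```python
import numpy as np
from scipy.stats import norm

def regression_test(theta_hat, var_theta_hat, z_levels, alpha=0.05):
    """Weighted least-squares test of H0: b1 = 0 for the probit model
    Phi^{-1}(theta_k) = b0 + b1 * z_k, using the delta-method variances (7)-(8)."""
    th = np.asarray(theta_hat, float)
    r = norm.ppf(th)                                   # G(theta_hat_k), probit link
    g = 1.0 / norm.pdf(norm.ppf(th))                   # g(theta) = 1 / phi(Phi^{-1}(theta))
    w = g ** 2 * np.asarray(var_theta_hat, float)      # estimated Var{G(theta_hat_k)}, (8)
    Z = np.column_stack([np.ones_like(r), z_levels])   # K x 2 design matrix
    W_inv = np.diag(1.0 / w)
    cov_b = np.linalg.inv(Z.T @ W_inv @ Z)             # Cov(b_hat)
    b_hat = cov_b @ Z.T @ W_inv @ r                    # weighted LS estimate (11)
    t0 = b_hat[1] / np.sqrt(cov_b[1, 1])               # test statistic (12)
    p_value = 2 * (1 - norm.cdf(abs(t0)))
    return t0, p_value, abs(t0) > norm.ppf(1 - alpha / 2)
```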

2.2 |. The Hotelling’s T2 approach

The regression approach assumes that the accuracy index and the covariate satisfy a known functional relationship. In the regression model developed in the previous section, we assume that the transformed accuracy under a known monotone link is a linear function of the covariate values and obtain estimates of the regression coefficients via weighted least squares; in the simulations we adopt the probit link. When such a functional relationship is unclear, we may test the more general hypothesis $H_0: \theta_1 = \theta_2 = \cdots = \theta_K$ against the alternative $H_1$: not all $\theta_k$'s are equal. Equivalently, the test can be written in the form of a union-intersection test,

$$H_0: \bigcap_{m<K} \{\theta_m - \theta_K = 0\} \quad\text{versus}\quad H_1: \bigcup_{m<K} \{\theta_m - \theta_K \ne 0\}. \tag{13}$$

We propose a test statistic analogous to the Hotelling’s T2 to test the above null hypothesis. Since all the θˆks are U-statistics, they have asymptotically normal distributions and are mutually independent. Therefore,

$$E(\hat\theta_m - \hat\theta_K) = \theta_m - \theta_K, \qquad \operatorname{Var}(\hat\theta_m - \hat\theta_K) = \operatorname{Var}(\hat\theta_m) + \operatorname{Var}(\hat\theta_K) = \sigma_m^2 + \sigma_K^2, \qquad m = 1, 2, \ldots, K-1,$$
$$\operatorname{Cov}(\hat\theta_m - \hat\theta_K, \hat\theta_l - \hat\theta_K) = \operatorname{Var}(\hat\theta_K) = \sigma_K^2, \qquad m \ne l.$$

Or equivalently, $\hat\theta$ is asymptotically $N(\mu, \Sigma)$, where

$$\hat\theta = (\hat\theta_1 - \hat\theta_K, \hat\theta_2 - \hat\theta_K, \ldots, \hat\theta_{K-1} - \hat\theta_K)^T, \qquad \mu = (\theta_1 - \theta_K, \theta_2 - \theta_K, \ldots, \theta_{K-1} - \theta_K)^T,$$
$$\Sigma = \begin{pmatrix} \sigma_1^2 + \sigma_K^2 & \sigma_K^2 & \cdots & \sigma_K^2 \\ \sigma_K^2 & \sigma_2^2 + \sigma_K^2 & \cdots & \sigma_K^2 \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_K^2 & \sigma_K^2 & \cdots & \sigma_{K-1}^2 + \sigma_K^2 \end{pmatrix}.$$

Then our Hotelling’s T2 statistic is

$$T^2 = \hat\theta^T \hat\Sigma^{-1} \hat\theta, \tag{14}$$

where the estimate $\hat\Sigma$ of $\Sigma$ is obtained by replacing $\sigma_k^2$ with $\hat\sigma_k^2$ (or $\hat\sigma_{\mathrm{bootstrap},k}^2$), $k = 1, \ldots, K$. Under the null hypothesis, $T^2$ asymptotically has a $\chi^2$ distribution with $K-1$ degrees of freedom. If the observed $T^2$ is greater than $\chi^2_{K-1}(\alpha)$, we reject the null hypothesis $H_0$ at the $\alpha$ significance level, where $\chi^2_{K-1}(\alpha)$ is the upper $\alpha$ percentile of the $\chi^2$ distribution with $K-1$ degrees of freedom. Technical details on the asymptotic distribution of $\hat\theta$ are provided in the appendix.
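Analogously, a minimal sketch (ours, under assumed argument names) of the Hotelling-type statistic (14):

```python
import numpy as np
from scipy.stats import chi2

def hotelling_test(theta_hat, var_theta_hat):
    """Hotelling-type statistic (14) for H0: theta_1 = ... = theta_K, built from
    the K per-level accuracy estimates and their estimated variances."""
    th = np.asarray(theta_hat, float)
    v = np.asarray(var_theta_hat, float)
    d = th[:-1] - th[-1]                    # contrasts against the K-th level
    Sigma = np.diag(v[:-1]) + v[-1]         # sigma_m^2 + sigma_K^2 on the diagonal,
                                            # sigma_K^2 off the diagonal
    t2 = float(d @ np.linalg.inv(Sigma) @ d)
    p_value = 1 - chi2.cdf(t2, df=len(th) - 1)
    return t2, p_value
```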

Remark 1. When $K = 2$, the test statistic $T_0$ based on the regression method and the Hotelling’s $T^2$ statistic are, respectively,

$$T_0 = \frac{G(\hat\theta_1) - G(\hat\theta_2)}{\sqrt{\hat w_1 + \hat w_2}}, \qquad T^2 = \frac{(\hat\theta_1 - \hat\theta_2)^2}{\hat\sigma_1^2 + \hat\sigma_2^2}.$$

Due to the monotonicity of $G(\cdot)$, it follows from the delta method that the two tests perform similarly, especially when $n_1$ and $n_2$ are large; see the simulation results in the next section.

Remark 2. The test based on regression analysis usually has higher power when the accuracy index $\theta$ increases (or decreases) monotonically as the value of $Z$ increases, even if the link function is misspecified; otherwise the Hotelling’s $T^2$ test performs better.

Remark 3. For the original sample of the continuous random variables $(X, Y)$, $\Psi\{(Y_i - Y_j)(X_i - X_j) \mid Z_i = Z_j = z_k\}$ in (2) can be replaced by the indicator function $I\{(Y_i - Y_j)(X_i - X_j) > 0 \mid Z_i = Z_j = z_k\}$, but the latter leads to underestimation of $\theta_k$ under bootstrap sampling, owing to the repetition of some subjects in each bootstrap sample.

3 |. SIMULATION STUDY AND A REAL EXAMPLE

We conducted simulation studies to examine the finite-sample performance of the two methods proposed in the previous section, the Z-test based on the regression analysis and the Hotelling $T^2$ test. For each test, we also compared the two estimates of the variance of $\hat\theta_k$: the approximate variance (6) given in Section 2.1 and the bootstrap variance with 1000 replications. We compared the type I error rates of the two tests and their powers under various alternatives. This section summarizes the simulation results and demonstrates the proposed methods with a real example.

3.1 |. Comparison of regression analysis and Hotelling’s T2 test

We assume that, given the covariate $Z$, $(X, Y)$ follows a bivariate normal distribution. It then follows from Reference 19 that

$$\theta_k = P\{(X_i - X_j)(Y_i - Y_j) > 0 \mid Z_i = Z_j = z_k\} = \frac{1}{2} + \frac{1}{\pi}\arcsin\rho_k, \tag{15}$$

where $(X_i, Y_i)$ and $(X_j, Y_j)$ are random samples from the distribution of $(X, Y)$ given $Z = z_k$, and

$$\rho_k = \operatorname{Corr}(X, Y \mid Z = z_k) = \frac{\operatorname{Cov}(X, Y \mid Z = z_k)}{\sqrt{\operatorname{Var}(X \mid Z = z_k)\operatorname{Var}(Y \mid Z = z_k)}}.$$

Without loss of generality, we assume

$$(X, Y) \mid Z = z_k \sim N_2\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}\sigma_X^2 & \sigma_X\sigma_Y\rho_k \\ \sigma_X\sigma_Y\rho_k & \sigma_Y^2\end{pmatrix}\right),$$

and $\theta_k$ is a function of $\rho_k$, the correlation coefficient between the biomarker $X$ and the gold standard $Y$ when the covariate is at the $k$th level. In our simulations, $G(\theta_k)$ is taken as the probit link function $\Phi^{-1}(\theta_k)$, the covariate $Z$ was set to have $K = 2, 3, 4$, or 5 levels, and the variances of $X$ and $Y$ were either equal ($\sigma_Y = \sigma_X = 1$) or unequal ($\sigma_Y = 1$, $\sigma_X = 1.2$). For each level, we assumed $n_1 = \cdots = n_K = n = 30, 60, 100$, or 150 observations. We generated 1000 data sets to obtain the empirical type I error rates and the power under various alternative hypotheses.
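For reference, a minimal sketch (our code, with illustrative parameter names) of the data-generating mechanism for one covariate level under this bivariate normal model:

```python
import numpy as np

def simulate_level(rho, n, sigma_x=1.0, sigma_y=1.0, rng=None):
    """Draw n pairs (X, Y) from the bivariate normal model for one covariate
    level: mean zero, standard deviations sigma_x and sigma_y, correlation rho."""
    rng = np.random.default_rng() if rng is None else rng
    cov = [[sigma_x ** 2, rho * sigma_x * sigma_y],
           [rho * sigma_x * sigma_y, sigma_y ** 2]]
    xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    return xy[:, 0], xy[:, 1]

# Accuracy implied by rho via Equation (15): theta = 1/2 + arcsin(rho) / pi
theta_from_rho = lambda rho: 0.5 + np.arcsin(rho) / np.pi
```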

3.1.1 |. The type I error

Under the null hypothesis $H_0: \theta_1 = \cdots = \theta_K$, which is equivalent to $H_0: \rho_1 = \cdots = \rho_K = \rho$, we fix $\rho = 0.6$ or $0.8$ for each level of $Z$. Table 1 summarizes the type I error rates of the regression-based test and the Hotelling $T^2$ test under various configurations of variance estimation, number of covariate levels, and sample sizes. As expected, for relatively large sample sizes the two tests control the type I error quite satisfactorily at the nominal level of 0.05 using either the approximate or the bootstrap variance, that is, $\hat\sigma_k^2$ or $\hat\sigma_{\mathrm{bootstrap},k}^2$. For a relatively small sample size (such as $n = 30$), however, the approximate variance yields inflated type I error rates, in which case the bootstrap variance is preferred, especially when $K$ is relatively large. This is partly because the two proposed test statistics, $T_0$ and $T^2$, involve the estimation of $2K$ unknown parameters (the $\theta_k$'s and $\sigma_k^2$'s); the smaller $n$ and the larger $K$, the more the distributions of $T_0$ and $T^2$ deviate from their asymptotic distributions under the null hypothesis $H_0: \theta_1 = \cdots = \theta_K$. Furthermore, the type I error rates of the two tests seem to be little affected by the value of $\rho$.

TABLE 1.

Comparison of type I error rates of the regression-based test and the Hotelling T2 test (runs:1000).

n Test Variance | The number of levels of Z: K = 2 | K = 3 | K = 4 | K = 5
(two columns under each K: σ²_X = σ²_Y and σ²_X ≠ σ²_Y)
ρ_1 = ⋯ = ρ_K = ρ = 0.6
30 Regression Approximate 0.072 0.089 0.081 0.084 0.084 0.105 0.092 0.124
Bootstrapping 0.042 0.054 0.046 0.046 0.049 0.058 0.055 0.066
Hotelling Approximate 0.070 0.083 0.091 0.094 0.111 0.120 0.149 0.137
Bootstrapping 0.040 0.053 0.042 0.045 0.041 0.059 0.065 0.064
60 Regression Approximate 0.070 0.062 0.081 0.078 0.078 0.051 0.059 0.061
Bootstrapping 0.057 0.040 0.063 0.057 0.057 0.039 0.042 0.048
Hotelling Approximate 0.072 0.061 0.067 0.075 0.092 0.062 0.078 0.077
Bootstrapping 0.054 0.040 0.039 0.058 0.065 0.042 0.051 0.047
100 Regression Approximate 0.056 0.067 0.051 0.042 0.063 0.067 0.075 0.064
Bootstrapping 0.048 0.056 0.046 0.036 0.050 0.054 0.060 0.054
Hotelling Approximate 0.057 0.066 0.066 0.048 0.054 0.052 0.077 0.065
Bootstrapping 0.048 0.053 0.052 0.032 0.036 0.039 0.057 0.050
150 Regression Approximate 0.049 0.057 0.062 0.057 0.056 0.052 0.052 0.059
Bootstrapping 0.045 0.050 0.058 0.049 0.048 0.045 0.047 0.056
Hotelling Approximate 0.049 0.055 0.058 0.046 0.054 0.052 0.066 0.072
Bootstrapping 0.044 0.048 0.050 0.040 0.047 0.041 0.055 0.055
ρ_1 = ⋯ = ρ_K = ρ = 0.8
30 Regression Approximate 0.079 0.091 0.083 0.086 0.106 0.111 0.109 0.127
Bootstrapping 0.039 0.050 0.044 0.038 0.047 0.053 0.045 0.062
Hotelling Approximate 0.071 0.077 0.098 0.087 0.105 0.124 0.138 0.138
Bootstrapping 0.033 0.039 0.030 0.035 0.030 0.042 0.037 0.042
60 Regression Approximate 0.071 0.061 0.081 0.089 0.071 0.060 0.070 0.067
Bootstrapping 0.048 0.046 0.059 0.062 0.049 0.036 0.046 0.042
Hotelling Approximate 0.068 0.058 0.060 0.072 0.082 0.067 0.088 0.070
Bootstrapping 0.042 0.038 0.033 0.040 0.044 0.029 0.042 0.038
100 Regression Approximate 0.048 0.049 0.053 0.064 0.062 0.053 0.062 0.082
Bootstrapping 0.043 0.045 0.042 0.049 0.039 0.045 0.048 0.042
Hotelling Approximate 0.045 0.045 0.049 0.064 0.056 0.052 0.064 0.069
Bootstrapping 0.035 0.037 0.040 0.044 0.046 0.046 0.034 0.030
150 Regression Approximate 0.049 0.049 0.050 0.059 0.070 0.054 0.056 0.058
Bootstrapping 0.045 0.056 0.061 0.060 0.055 0.051 0.037 0.047
Hotelling Approximate 0.047 0.048 0.056 0.065 0.070 0.061 0.067 0.074
Bootstrapping 0.040 0.052 0.047 0.049 0.043 0.044 0.037 0.043

Note: n is the sample size in each level of Z.

3.1.2 |. Power analysis

Simulations were conducted to estimate the empirical power of the tests under various configurations of the alternative hypothesis $H_1$: not all $\theta_k$'s are equal. Under the bivariate normal assumption, we considered three settings for generating the covariate-dependent correlation coefficient $\rho_k$. The first setting exactly meets the assumptions of the regression analysis, in which $\Phi^{-1}(\theta_k)$ increases by the same amount as the level of $Z$ is increased by one unit. To generate data for this setting, we fixed $b = (b_0, b_1)^T$ and solved for the $\rho_k$'s from Equation (15) and $\Phi^{-1}(\theta_k) = b_0 + b_1 z_k$, that is,

$$\rho_k = \sin\big\{\pi\big(\Phi(b_0 + b_1 z_k) - 0.5\big)\big\},$$

where we take $b = (0.6, 0.2)^T$ and $z_k = k$ for $k = 1, \ldots, K$, so that $(\rho_1, \ldots, \rho_5) = (0.786, 0.878, 0.935, 0.968, 0.985)^T$. The second setting departs slightly from the first in that the logit link function $\log\{\theta/(1-\theta)\}$ was used instead of the probit link $\Phi^{-1}(\theta_k)$ to generate the data. From Equation (15) and $\log\{\theta_k/(1-\theta_k)\} = b_0 + b_1 z_k$, we can derive $\rho_k = \sin\{\pi(\theta_k - 0.5)\}$ with $\theta_k = \exp(b_0 + b_1 z_k)/\{1 + \exp(b_0 + b_1 z_k)\}$, so we obtain $(\rho_1, \ldots, \rho_5) = (0.562, 0.664, 0.747, 0.813, 0.864)^T$ when $b = (0.6, 0.2)^T$ and $z_k = k$ for $k = 1, \ldots, K$. The third setting is more general in that the $\theta_k$'s are different but do not follow any functional trend as the covariate varies; here $\theta_k$ is computed from Equation (15) with $(\rho_1, \ldots, \rho_5) = (0.80, 0.90, 0.85, 0.70, 0.75)^T$. For all three settings, the number of covariate levels and the sample sizes were the same as in the type I error simulations. Both equal and unequal variances of $X$ and $Y$ were considered. The equal-variances case is more likely to occur when $X$ and $Y$ measure the same quantity, for example, when $Y$ is the gold standard and $X$ is an imperfect instrument; the unequal-variances case is more likely otherwise.
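As a quick check (our code, not from the paper), the $\rho_k$ values for the first two settings follow from Equation (15) and the link equations with $b = (0.6, 0.2)^T$ and $z_k = k$:

```python
import numpy as np
from scipy.stats import norm

b0, b1 = 0.6, 0.2
k = np.arange(1, 6)
theta_probit = norm.cdf(b0 + b1 * k)                            # setting 1
theta_logit = np.exp(b0 + b1 * k) / (1 + np.exp(b0 + b1 * k))   # setting 2
rho_from_theta = lambda th: np.sin(np.pi * (th - 0.5))          # inverse of (15)
print(np.round(rho_from_theta(theta_probit), 3))  # compare with the setting-1 values quoted above
print(np.round(rho_from_theta(theta_logit), 3))   # compare with the setting-2 values quoted above
```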

The simulated powers for the three settings are presented in Tables 2, 3, and 4, respectively. As expected, power increases with sample size. The power obtained with the approximate variance appears to be larger than that obtained with the bootstrap variance. Combining these results with Table 1, we suggest that the approximate variance is reliable for relatively large sample sizes; for small sample sizes, however, the bootstrap variance should be used because of the inflated type I error rates associated with the approximate variance. The regression method is more powerful than the Hotelling’s test in the monotone settings 1 and 2, even when the regression model assumptions are violated (as in setting 2). In setting 3, however, the Hotelling’s $T^2$ test performs substantially better than the regression method except when $K = 2$, as shown in Table 4. Tables 1, 2, 3, and 4 all show that the Hotelling’s $T^2$ test performs nearly as well as the regression method when $K = 2$ and $n$ is relatively large, which confirms Remark 1 in Section 2. Thus the regression method is preferred if the $\theta_k$'s follow a monotone functional trend as the covariate varies; otherwise, the Hotelling’s $T^2$ test is preferred. For a given sample size $n$, Tables 2 and 3 also show that the powers of the two tests increase with $K$ in the monotone settings 1 and 2, because a larger $K$ yields larger differences among the $\theta_k$'s. For relatively small $n$, a larger $K$ also inflates the type I error rate, which in turn increases the powers of the two tests based on the approximate variances.

TABLE 2.

Comparison of powers of the regression-based test and the Hotelling T2 test under the first setting (runs:1000).

n Test Variance | The number of levels of Z: K = 2 | K = 3 | K = 4 | K = 5
(two columns under each K: σ²_X = σ²_Y and σ²_X ≠ σ²_Y)
30 Regression Approximate 0.223 0.223 0.566 0.590 0.908 0.919 0.998 0.995
Bootstrapping 0.139 0.131 0.432 0.454 0.854 0.863 0.996 0.988
Hotelling Approximate 0.207 0.200 0.460 0.479 0.823 0.837 0.985 0.982
Bootstrapping 0.118 0.105 0.252 0.272 0.601 0.625 0.897 0.884
60 Regression Approximate 0.348 0.342 0.864 0.890 0.999 1.000 1.000 1.000
Bootstrapping 0.290 0.297 0.819 0.845 0.998 1.000 1.000 1.000
Hotelling Approximate 0.339 0.335 0.789 0.795 0.991 0.998 1.000 1.000
Bootstrapping 0.266 0.267 0.672 0.688 0.970 0.987 1.000 1.000
100 Regression Approximate 0.527 0.521 0.974 0.981 1.000 1.000 1.000 1.000
Bootstrapping 0.489 0.468 0.972 0.975 1.000 1.000 1.000 1.000
Hotelling Approximate 0.524 0.514 0.956 0.960 1.000 1.000 1.000 1.000
Bootstrapping 0.470 0.448 0.929 0.937 1.000 1.000 1.000 1.000
150 Regression Approximate 0.716 0.720 1.000 1.000 1.000 1.000 1.000 1.000
Bootstrapping 0.678 0.690 1.000 1.000 1.000 1.000 1.000 1.000
Hotelling Approximate 0.715 0.709 0.990 1.000 1.000 1.000 1.000 1.000
Bootstrapping 0.664 0.658 0.985 0.988 1.000 1.000 1.000 1.000

Note: The data were generated under the assumptions of regression analysis with a probit link function, in which $\Phi^{-1}(\theta_k)$ increases by the same amount as the level of Z is increased by one unit.

TABLE 3.

Comparison of powers of the regression-based test and the Hotelling T2 test under the second setting (runs:1000).

n Test Variance | The number of levels of Z: K = 2 | K = 3 | K = 4 | K = 5
(two columns under each K: σ²_X = σ²_Y and σ²_X ≠ σ²_Y)
30 Regression Approximate 0.114 0.114 0.288 0.256 0.510 0.511 0.765 0.773
Bootstrapping 0.074 0.075 0.199 0.188 0.402 0.415 0.679 0.687
Hotelling Approximate 0.112 0.109 0.231 0.240 0.390 0.415 0.608 0.636
Bootstrapping 0.069 0.072 0.146 0.144 0.228 0.240 0.401 0.428
60 Regression Approximate 0.151 0.144 0.383 0.415 0.761 0.796 0.959 0.956
Bootstrapping 0.119 0.120 0.330 0.363 0.721 0.743 0.948 0.938
Hotelling Approximate 0.149 0.141 0.325 0.336 0.620 0.639 0.886 0.859
Bootstrapping 0.120 0.117 0.257 0.274 0.538 0.559 0.821 0.784
100 Regression Approximate 0.210 0.224 0.598 0.614 0.955 0.950 1.000 1.000
Bootstrapping 0.183 0.191 0.565 0.569 0.935 0.940 1.000 1.000
Hotelling Approximate 0.208 0.221 0.534 0.518 0.850 0.880 1.000 1.000
Bootstrapping 0.176 0.186 0.463 0.452 0.790 0.810 0.985 1.000
150 Regression Approximate 0.225 0.235 0.770 0.782 0.980 0.982 1.000 1.000
Bootstrapping 0.190 0.200 0.760 0.768 0.980 0.983 1.000 1.000
Hotelling Approximate 0.225 0.230 0.705 0.712 0.965 0.965 1.000 1.000
Bootstrapping 0.190 0.200 0.650 0.635 0.950 0.960 1.000 1.000

Note: The data were generated under the assumptions of regression analysis with a logit link function, in which $\log\{\theta_k/(1-\theta_k)\}$ increases by the same amount as the level of Z is increased by one unit.

TABLE 4.

Comparison of powers of the regression-based test and the Hotelling T2 test under the third setting (runs:1000).

n Test Variance | The number of levels of Z: K = 2 | K = 3 | K = 4 | K = 5
(two columns under each K: σ²_X = σ²_Y and σ²_X ≠ σ²_Y)
30 Regression Approximate 0.300 0.272 0.121 0.103 0.230 0.197 0.326 0.342
Bootstrapping 0.186 0.188 0.059 0.054 0.140 0.126 0.205 0.223
Hotelling Approximate 0.266 0.257 0.234 0.248 0.497 0.484 0.534 0.539
Bootstrapping 0.156 0.155 0.110 0.104 0.279 0.279 0.273 0.302
60 Regression Approximate 0.457 0.460 0.126 0.113 0.305 0.303 0.516 0.510
Bootstrapping 0.381 0.387 0.091 0.094 0.238 0.234 0.452 0.436
Hotelling Approximate 0.443 0.449 0.386 0.382 0.741 0.729 0.780 0.785
Bootstrapping 0.347 0.360 0.274 0.296 0.630 0.621 0.668 0.687
100 Regression Approximate 0.688 0.644 0.162 0.172 0.458 0.451 0.737 0.705
Bootstrapping 0.642 0.601 0.139 0.148 0.417 0.410 0.687 0.673
Hotelling Approximate 0.683 0.634 0.587 0.578 0.934 0.943 0.959 0.961
Bootstrapping 0.626 0.586 0.516 0.514 0.902 0.909 0.934 0.939
150 Regression Approximate 0.855 0.839 0.213 0.198 0.600 0.611 0.870 0.888
Bootstrapping 0.831 0.822 0.204 0.174 0.585 0.595 0.848 0.866
Hotelling Approximate 0.852 0.836 0.759 0.745 0.986 0.993 0.998 0.996
Bootstrapping 0.824 0.816 0.715 0.706 0.986 0.990 0.998 0.992

Note: The data were generated more generally, in which θks are different but do not follow any functional trend as the covariate Z varies.

3.2 |. A real example

We used data from a study that assessed three types of imaging modalities, barium enema (BE), computed tomographic colonography (CTC), and colonoscopy (CO), for detecting neoplastic colon polyps and cancers. Colonoscopy (Y) is considered the gold standard for detecting neoplastic polyps or colon cancer. In this study, 614 patients completed all three modalities. For the purpose of our example, we used data from the computed tomographic colonography (X) and colonoscopy modalities. There were 273 patients who had neoplastic lesions detected by either computed tomographic colonography or colonoscopy. The median size of the largest lesion detected by colonoscopy was 8 mm (range = 6–70), whereas the size of the lesion detected by computed tomographic colonography in the same area of the colon was 7 mm (range = 3–70). Because of potential biological differences in race and family history, our interest was to assess the diagnostic accuracy of the lesion sizes detected by computed tomographic colonography against colonoscopy according to the patient’s family history and race. The null hypothesis to be tested was that the diagnostic accuracy was the same across the four combinations of family history and race, that is, $\theta_1 = \theta_2 = \theta_3 = \theta_4$. However, because there were only five patients who had a family history of colon cancer and were from other racial groups, this group was excluded from the analysis owing to the small numbers.

Let the class variable $Z = 1, 3, 4$ indicate that the patient is ‘White/no family history’, ‘Other race/no family history’, or ‘White/family history’, respectively. In the regression analysis, we adopt the probit-link model

$$\Phi^{-1}(\theta_k) = b_0 + b_1 z_k, \qquad z_1 = 1,\ z_2 = 3,\ z_3 = 4.$$

Then the null hypothesis was tested by both the regression statistic $T_0$ given by (12) and the Hotelling $T^2$ statistic given by (14), each with the approximate (bootstrap) variance, where the approximate variance of $\hat\theta_k$ is defined by (6) and the bootstrap variance was estimated by resampling with replacement from $\{(X_i, Y_i): Z_i = z_k\}$. The diagnostic accuracy of the lesion size detected by computed tomographic colonography (X) against colonoscopy (Y), by family history and race (Z), is presented in Table 5. Table 5 shows no statistically significant differences in accuracy across the groups of family history and race using either the regression analysis approach or the Hotelling $T^2$ test. We conclude that there were no differences in the diagnostic accuracies of computed tomographic colonography against colonoscopy among the three groups of family history and race.

TABLE 5.

Comparison of accuracies among races by family history.

Race by family history | θ̂ | Var̂(θ̂) | Regression analysis | Hotelling T² test
White/No family history | 0.877 | 0.017 (0.017) | b̂_1 = 0.042 (0.043) | χ² = 1.517 (1.382)
Other race/No family history | 0.887 | 0.013 (0.015) | Var̂(b̂_1) = 0.066 (0.066) | P-value = 0.468 (0.501)
White/Family history | 0.853 | 0.0245 (0.0247) | P-value = 0.525 (0.509) |

Note: Values outside and inside parentheses were calculated using the approximate variance and the bootstrap variance, respectively.

4 |. DISCUSSION

In this article, we developed two methods, regression analysis and Hotelling’s $T^2$, to investigate the effect of covariates on the diagnostic accuracy of a continuous biomarker when the gold standard is also measured on a continuous scale. Approximate variance estimates were derived for inference on the accuracy index and were compared with bootstrap variance estimates. Simulation studies indicate that both methods, with either variance estimate, control type I error rates when sample sizes are relatively large; for small sample sizes, the bootstrap variance estimates are preferred. The regression analysis, even when the model is slightly misspecified, is more powerful in monotone settings; otherwise, the Hotelling’s $T^2$ test is preferred.

The regression analysis approach focuses on a discrete covariate, but the method can be readily extended to deal with a continuous covariate. Alternatively, one can discretize the continuous covariate by binning, just as Metz et al20 and Zhou and Lin21 categorized continuous data for ROC curves, and then apply the regression analysis or the Hotelling’s $T^2$ test developed in this article to test the null hypothesis of no difference in accuracy; a simple sketch of such binning is given below. Caution is needed if the test fails to reject the null hypothesis, since the p-value from the test may depend on how the continuous covariate is discretized. How to choose better category boundaries for the continuous covariate in Equation (1) remains a challenging issue and needs further study.
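As one simple possibility (not prescribed by the paper), a continuous covariate could be binned at sample quantiles before applying the level-wise tests of Section 2; the sketch below is purely illustrative.

```python
import numpy as np

def bin_covariate(z, n_bins=3):
    """Discretize a continuous covariate into n_bins quantile-based levels so
    that the level-wise tests of Section 2 can be applied."""
    z = np.asarray(z, float)
    edges = np.quantile(z, np.linspace(0, 1, n_bins + 1)[1:-1])  # interior cut points
    return np.digitize(z, edges)                                 # levels 0, ..., n_bins - 1
```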

ACKNOWLEDGEMENTS

Research of Aiyi Liu is supported by the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD), the National Institutes of Health (NIH). Research of Mixia Wu is supported by National Natural Science Foundation of China (12271034). The authors are grateful to the editors and referees for their detailed suggestions which considerably improved the quality of the paper.

APPENDIX A. ASYMPTOTICS

Since $Z$ is a categorical covariate and samples at different covariate levels are assumed to be independent, the derivation of the asymptotic distribution of $\hat\theta_k$ is the same as in the case with no covariates.

We derive the asymptotic distribution of the empirical estimator of the accuracy index, which is a U-statistic for a continuous bivariate probability distribution with the symmetric kernel

$$h(X_1, Y_1, X_2, Y_2) = \Psi\{(X_1 - X_2)(Y_1 - Y_2)\}, \tag{A1}$$

where $\Psi(t) = 1$ if $t = (X_1 - X_2)(Y_1 - Y_2) > 0$, $0.5$ if $t = 0$, and $0$ otherwise. It is easy to prove that

$$E\,h(X_1, Y_1, X_2, Y_2) = \theta, \qquad \operatorname{Var}\,h(X_1, Y_1, X_2, Y_2) = \theta - \theta^2. \tag{A2}$$

We first introduce a theoretical result on U-statistics from Hoeffding.22

Lemma 1.

Let $v$ be an even integer. If $E|H|^v < \infty$ (implied by $E|h|^v < \infty$), then

$$E(U_n - \hat U_n)^v = O(n^{-v}), \qquad n \to \infty,$$

where $U_n$ is a U-statistic with kernel $h$, $\hat U_n$ is the corresponding projection of $U_n$, and $H$ is the symmetric kernel of their difference.

The observations $(X_i, Y_i)$'s are i.i.d. with a common distribution $F(x, y)$. Define $\tilde h_1(x_1, y_1) = h_1(x_1, y_1) - \theta$ with

$$h_1(x_1, y_1) = E\,h(x_1, y_1, X_2, Y_2) = P\{(x_1 - X_2)(y_1 - Y_2) > 0\} = P(x_1 < X_2, y_1 < Y_2) + P(x_1 > X_2, y_1 > Y_2) = 1 - F_X(x_1) - F_Y(y_1) + 2F(x_1, y_1),$$

and denote $p = P\{(Y_1 - Y_2)(X_1 - X_2) > 0, (Y_1 - Y_3)(X_1 - X_3) > 0\}$.

Note that

$$E\,\tilde h_1(X_1, Y_1) = E\,h(X_1, Y_1, X_2, Y_2) - \theta = 0, \tag{A3}$$
$$\operatorname{Var}\,\tilde h_1(X_1, Y_1) = \operatorname{Var}\,h_1(X_1, Y_1) = \operatorname{Cov}\big(\Psi\{(Y_1 - Y_2)(X_1 - X_2)\}, \Psi\{(Y_1 - Y_3)(X_1 - X_3)\}\big) = P\{(Y_1 - Y_2)(X_1 - X_2) > 0, (Y_1 - Y_3)(X_1 - X_3) > 0\} - \theta^2 = p - \theta^2 =: \zeta. \tag{A4}$$

We define the projection $\hat\theta^*$ of $\hat\theta$,

$$\hat\theta^* = \sum_{i=1}^n E(\hat\theta \mid X_i, Y_i) - (n - 1)\theta. \tag{A5}$$

In terms of the function h˜1, we have

$$\hat\theta^* - \theta = \frac{2}{n}\sum_{i=1}^n \tilde h_1(X_i, Y_i).$$

Since the $2\tilde h_1(X_i, Y_i)$'s are i.i.d. with mean 0 and variance $4\zeta$, it follows from the central limit theorem that $\sqrt{n}(\hat\theta^* - \theta) \rightarrow_d N(0, 4\zeta)$, where $\rightarrow_d$ means ‘converges in distribution’. To prove that the asymptotic distribution of $\hat\theta$ is the same as that of $\hat\theta^*$, we need to show that $\sqrt{n}(\hat\theta - \theta)$ and $\sqrt{n}(\hat\theta^* - \theta)$ are asymptotically equivalent. It suffices to show that $nE(\hat\theta - \hat\theta^*)^2 \to 0$.

Note that

$$nE(\hat\theta - \hat\theta^*)^2 = n\operatorname{Var}(\hat\theta - \hat\theta^*).$$

The difference θˆθˆ* may itself be expressed as a U-statistic,

$$\hat\theta - \hat\theta^* = \frac{2}{n(n-1)}\sum_{1\le i<j\le n} H(X_i, Y_i, X_j, Y_j)$$

based on the symmetric kernel

$$H(x_i, y_i, x_j, y_j) = h(x_i, y_i, x_j, y_j) - \tilde h_1(x_i, y_i) - \tilde h_1(x_j, y_j) - \theta.$$

Note that $E\,h(X_i, Y_i, X_j, Y_j) = \theta$, $E\,\tilde h_1(X_i, Y_i) = E\,\tilde h_1(X_j, Y_j) = 0$,

$$E\big[\{h(X_i, Y_i, X_j, Y_j) - \theta\}\tilde h_1(X_i, Y_i)\big] = E\Big(E\big[\{h(X_i, Y_i, X_j, Y_j) - \theta\}\tilde h_1(X_i, Y_i) \mid X_i, Y_i\big]\Big) = E\{\tilde h_1^2(X_i, Y_i)\} = \operatorname{Var}\,\tilde h_1(X_i, Y_i) = \zeta,$$

and $\tilde h_1(X_i, Y_i)$ and $\tilde h_1(X_j, Y_j)$ are independent and identically distributed. We have $E(H) = 0$ and

$$E\,H^2 = \operatorname{Var}\,h(X_i, Y_i, X_j, Y_j) + E\{\tilde h_1(X_i, Y_i) + \tilde h_1(X_j, Y_j)\}^2 - 2E\big[\{h(X_i, Y_i, X_j, Y_j) - \theta\}\{\tilde h_1(X_i, Y_i) + \tilde h_1(X_j, Y_j)\}\big] = \operatorname{Var}\,h(X_i, Y_i, X_j, Y_j) - \operatorname{Var}\,\tilde h_1(X_i, Y_i) - \operatorname{Var}\,\tilde h_1(X_j, Y_j) = \theta + \theta^2 - 2p < \infty.$$

According to Lemma 1,

$$nE(\hat\theta^* - \hat\theta)^2 = n\,O(n^{-2}) = O(n^{-1}), \qquad n \to \infty.$$

Therefore,

$$\sqrt{n}(\hat\theta - \theta) \rightarrow_d N(0, 4\zeta).$$

Note that samples at different covariate levels are assumed to be independent. Thus, $\hat\theta_1, \ldots, \hat\theta_K$ are mutually independent and each is asymptotically normally distributed, that is,

$$\sqrt{n_k}(\hat\theta_k - \theta_k) \rightarrow_d N(0, 4\zeta_k), \tag{A6}$$

where $\zeta_k = p_k - \theta_k^2$. By the delta method, the $G(\hat\theta_k)$'s are also mutually independent, and the approximate distribution of each is

$$\sqrt{n_k}\{G(\hat\theta_k) - G(\theta_k)\} \rightarrow_d N\big(0, 4g^2(\theta_k)\zeta_k\big). \tag{A7}$$

Note that the true variance of $\hat\theta_k$ is $\sigma_k^2$ given in (5) and $\lim_{n_k \to \infty} n_k\sigma_k^2 = 4\zeta_k$. Therefore, when $n_k$ is not large enough, replacing $4\zeta_k$ with $n_k\sigma_k^2$ in (A6) and (A7) yields better approximations to the distributions of $\hat\theta_k$ and $G(\hat\theta_k)$.

Footnotes

CONFLICT OF INTEREST STATEMENT

The authors declare no potential conflict of interests.

REFERENCES

1. Obuchowski NA. Estimating and comparing diagnostic tests’ accuracy when the gold standard is not binary. Acad Radiol. 2005;12:1198–1204.
2. Obuchowski NA. An ROC-type measure of diagnostic accuracy when the gold standard is continuous-scale. Stat Med. 2006;25:481–493.
3. Chang YCI. Maximizing an ROC-type measure via linear combination of markers when the gold reference is continuous. Stat Med. 2013;32:1893–1903.
4. Pepe MS. Three approaches to regression analysis of receiver operating characteristic curves for continuous test results. Biometrics. 1998;54:124–135.
5. Kim E, Zeng D. Semiparametric ROC analysis using accelerated regression models. Stat Sin. 2013;23:829–851.
6. Tosteson ANA, Begg CB. A general regression methodology for ROC curve estimation. Med Decis Making. 1988;8:204–215.
7. O’Malley AJ, Zou KH, Fielding JR, Tempany C. Bayesian regression methodology for estimating a receiver operating characteristic curve with two radiologic applications: prostate biopsy and spiral CT of ureteral stones. Acad Radiol. 2001;8:713–725.
8. Faraggi D. Adjusting receiver operating characteristic curves and related indices for covariates. J R Stat Soc Ser D Stat. 2003;52:179–192.
9. Thompson ML, Zucchini W. On the statistical analysis of ROC curves. Stat Med. 1989;8:1277–1290.
10. Obuchowski NA, Beiden SV, Berbaum KS, et al. Multireader, multicase receiver operating characteristic analysis: an empirical comparison of five methods. Acad Radiol. 2004;11:980–995.
11. Pepe MS. An interpretation for the ROC curve and inference using GLM procedures. Biometrics. 2000;56:352–359.
12. Cai T. Semi-parametric ROC regression analysis with placement values. Biostatistics. 2004;5:45–60.
13. Zhang Z, Huang Y. A linear regression framework for the receiver operating characteristic (ROC) curve analysis. J Biom Biostat. 2005;3(2):5726.
14. Zhou XH, Castelluccio P, Zhou C. Nonparametric estimation of ROC curves in the absence of a gold standard. Biometrics. 2005;61:600–609.
15. Lin H, Zhou XH, Li G. A direct semiparametric receiver operating characteristic curve regression with unknown link and baseline functions. Stat Sin. 2012;22:14–27.
16. Rodenberg C, Zhou XH. ROC curve estimation when covariates affect the verification process. Biometrics. 2000;56:1256–1262.
17. Liu D, Zhou XH. Semiparametric estimation of the covariate-specific ROC curve in presence of ignorable verification bias. Biometrics. 2011;67:906–916.
18. Liu D, Zhou XH. Covariate adjustment in estimating the area under ROC curve with partially missing gold standard. Biometrics. 2013;69:91–100.
19. Fang HB, Fang KT, Kotz S. The meta-elliptical distributions with given marginals. J Multivar Anal. 2002;82:1–16.
20. Metz CE, Herman BA, Shen JH. Maximum likelihood estimation of receiver operating characteristic (ROC) curves from continuously-distributed data. Stat Med. 1998;17(9):1033–1053.
21. Zhou XH, Lin HZ. Semi-parametric maximum likelihood estimates for ROC curves of continuous-scale tests. Stat Med. 2008;27:5271–5290.
22. Hoeffding W. A class of statistics with asymptotically normal distribution. Ann Math Stat. 1948;19:293–325.
