A Procedure for Determining Whether a Simple Combination of Diagnostic Tests May Be Noninferior to the Theoretical Optimum Combination

Hua Jin; Ying Lu

doi:10.1177/0272989X08318462

Author manuscript; available in PMC: 2026 Jan 2.

Published in final edited form as: Med Decis Making. 2008 Jun 12;28(6):909–916. doi: 10.1177/0272989X08318462


PMCID: PMC12756413 NIHMSID: NIHMS2044815 PMID: 18556633

Abstract

Diagnostic accuracy can be improved by combining several complementary diagnostic tests. The likelihood ratio (LR) function is the optimal combination because it achieves the largest area under the receiver operating characteristic (ROC) curve. Derivation and interpretation of the LR function, however, can be complicated. Linear discriminant/logistic functions are simple but may not be optimal. In this article, the authors propose a statistical framework to determine when such linear combinations are noninferior alternatives to the LR function. They propose a nonparametric procedure to calculate LR functions and estimate the optimal area under the curve (AUC). The authors then define noninferiority of a simpler combination to the LR function with regard to the AUC of ROC curves. A bootstrap procedure is proposed to test the noninferiority. Monte Carlo simulation experiments are used to evaluate the performance of the proposed methods. Data from the Study of Osteoporotic Fractures are used to demonstrate the procedure.

Keywords: multiple diagnostic tests, receiver operating characteristic (ROC) curve, area under the curve (AUC), likelihood ratio function (LR), linear discriminant function, noninferiority, bootstrap

The results of several diagnostic tests are commonly available to clinicians. These tests are often complementary, and combining the results may make diagnosis easier. Because the sensitivity and specificity of a diagnostic test depend on the threshold used to define abnormality, the utility of the test can be assessed by determining the receiver operating characteristic (ROC) curve that plots the true positive rate (TPR), or sensitivity, against the false positive rate (FPR), or 1-specificity, over a range of thresholds.^1-3 The area under the ROC curve (AUC) is a common index of diagnostic utility, and the larger the AUC the better the test is at discriminating between 2 populations.

Su and Liu⁴ proved that the linear discriminant (LD) function is the best way to combine multivariate normally distributed diagnostic variables to produce an AUC under the ROC curve that is maximized among all linear functions. Pepe and Thompson⁵ relaxed the multivariate normal assumption and proposed a distribution-free method to find the optimum linear combinations of markers for AUC of ROC curves. Thompson⁶ considered a decision rule for sequential diagnostic tests without restriction of linear combinations of the individual test results.

Baker noted the optimality of the likelihood ratio function in statistical literature for combining multiple tests,⁷ although it has long been known in signal-detection theory.^8,9 He directly approximated the likelihood ratio function using nonparametric estimation of the FPR and the TPR based on discrete diagnostic testing results.¹⁰ McIntosh and Pepe¹¹ proposed an alternative approach using standard binary regression methodology.

According to the Neyman-Pearson theorem,¹² the likelihood ratio (LR) test is the uniformly most powerful test. If we consider a diagnostic question as a hypothesis testing problem, the type I error rate corresponds to the FPR and the statistical power is the TPR. Thus, for a given FPR and several diagnostic tests, we can find a cutoff value of the LR function to construct an LR-based decision rule, which has the maximum TPR over all alternative combination rules. The corresponding ROC area is the largest because the TPR is uniformly highest for all given FPRs. Practical application of LR functions based on the observed sample is not straightforward when distributional assumptions of test variables are difficult to specify. Simple combinations of test results, such as linear combinations, are preferable, as long as they preserve acceptable efficiency relative to the optimum LR combination.

In this article, we propose a procedure for estimating the optimum AUC of the LR function and for determining whether a simpler combination, such as a linear or quadratic form, is noninferior to the LR function. In the next section, we present our nonparametric approach to estimate the LR function and the optimal AUC. We then define the noninferiority of a simple combination to the LR function on the basis of AUCs. A bootstrap test procedure is proposed to test for noninferiority. In the third section, Monte Carlo simulation experiments are performed to evaluate the performance of the proposed test. The fourth section demonstrates an application of the proposed method to data from the Study of Osteoporotic Fractures (SOF). Discussion and conclusions are presented in the last section.

STATISTICAL METHODS

Model

Suppose we have $K$ continuous test variables (and/or risk factors). Let $D$ be the indicator of disease status ( $D = 1$ for diseased subjects and $D = 0$ for the nondiseased). Denote the random variable of the $k$ th test on the diseased subjects $(D = 1)$ as $X_{k}$ , $k = 1, \dots, K$ , the random vector of test results for the diseased as $X = (X_{1}, \dots, X_{K})$ , and the random vector of test results for the nondiseased $(D = 0)$ as $Y = (Y_{1}, \dots, Y_{K})$ . Furthermore, let $f_{1} (x)$ and $f_{0} (x)$ be their density functions, respectively, where $x = (x_{1}, \dots, x_{K})$ . The LR function is

f (x) = \frac{f_{1} (x)}{f_{0} (x)} .

The LR rule is to classify a subject with testing results $x$ as a case if

f (x) > c (s_{0}),

(1)

where $c (s_{0})$ satisfies $P \{f (Y) > c (s_{0})\} = s_{0}$ . According to the Neyman-Pearson theorem, the combination rule in equation (1) leads to the best combination of $x$ , which has the highest TPR for any given FPR $s_{0}$ among all possible rules based on the observed $x$ , and thus has the largest area under the ROC curve: $A U C_{f} = P \{f (X) > f (Y)\}$ .¹ McIntosh and Pepe¹¹ called this LR rule the uniformly most sensitive (UMS) test to combine multiple tests.

Estimation of LR Function $f (x)$ and the Corresponding Optimum AUC

The LR function yields the combination with the largest AUC among all ROC curves. For example, when both $X = (X_{1}, \dots, X_{K})$ and $Y = (Y_{1}, \dots, Y_{K})$ are normally distributed multivariate random variables with equal covariance matrices, the best combination is the linear discriminant function.^4,13 When the covariance matrices are different for cases and controls, the optimal combination is in a quadratic form.¹⁴ However, beyond normal distributions, the LR function may not be a linear or polynomial combination of $x$ . To determine whether a simpler (linear or quadratic) combination of $x$ can replace the best but complicated LR function, we first have to know the best AUC, that is, the AUC achieved by the LR function.
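The two normal-theory cases mentioned above can be checked directly by taking the logarithm of the ratio of two multivariate normal densities. The symbols $μ_{1}$ , $μ_{0}$ (mean vectors of cases and controls) and $Σ$ , $Σ_{1}$ , $Σ_{0}$ (covariance matrices) are notation introduced here for illustration only. With a common covariance matrix $Σ$ , the log LR is linear in $x$ :

\log f (x) = {(μ_{1} - μ_{0})}^{T} Σ^{- 1} x - \frac{1}{2} (μ_{1}^{T} Σ^{- 1} μ_{1} - μ_{0}^{T} Σ^{- 1} μ_{0}) .

With different covariance matrices $Σ_{1}$ and $Σ_{0}$ , a quadratic term remains:

\log f (x) = - \frac{1}{2} x^{T} (Σ_{1}^{- 1} - Σ_{0}^{- 1}) x + {(Σ_{1}^{- 1} μ_{1} - Σ_{0}^{- 1} μ_{0})}^{T} x + \text{const} .

Because the logarithm is monotone, ranking subjects by $\log f (x)$ gives the same ROC curve as ranking them by $f (x)$ .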

Estimation of $f (x)$ can be reduced to estimating the multivariate density functions for cases and controls. Many reports address density estimation using 1 of 3 major approaches: parametric, nonparametric, and semiparametric.^15-21 In this article, we use the fixed-width kernel density estimator for estimation of these density functions^22,23 and we use their ratio to estimate the LR function. Specifically, we use superscripts to denote the observed samples, such that $X^{i} = (X_{1}^{i}, \dots, X_{K}^{i})$ is the observed diagnostic results for the $i$ th diseased subject $(i = 1, \dots, m)$ and $Y^{j} = (Y_{1}^{j}, \dots, Y_{K}^{j})$ is the observed diagnostic results for the $j$ th nondiseased subject $(j = 1, \dots, n)$ . We use

{\hat{f}}_{1} (x) = \frac{1}{m h_{1}^{K}} \sum_{i = 1}^{m} W_{1} (\frac{x - X^{i}}{h_{1}})

to estimate $f_{1} (x)$ , where $W_{1}$ is the Gaussian kernel

W_{1} (x) = {(2 π)}^{- K / 2} {(\det H_{1})}^{- 1 / 2} \exp (- x^{T} H_{1}^{- 1} x / 2),

(2)

$H_{1} = cov (X, X)$ , and the optimal window width¹⁵ $h_{1}^{*}$ is

h_{1}^{*} = {(\frac{4}{(2 K + 1) m})}^{1 / (K + 4)} .

Similarly, $f_{0} (x)$ can be estimated by

{\hat{f}}_{0} (x) = \frac{1}{n {(h_{0}^{*})}^{K}} \sum_{j = 1}^{n} W_{0} (\frac{x - Y^{j}}{h_{0}^{*}}),

(3)

where $h_{0}^{*} = {(\frac{4}{(2 K + 1) n})}^{1 / (K + 4)}$ , $W_{0} (x) = {(2 π)}^{- K / 2} {(\det H_{0})}^{- 1 / 2} \exp (- x^{T} H_{0}^{- 1} x / 2)$ , and $H_{0} = cov (Y, Y)$ . The LR function is estimated by

\hat{f} (x) = \frac{{\hat{f}}_{1} (x)}{{\hat{f}}_{0} (x)} .

(4)
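Equations (2) to (4) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' S-plus program: the function names `kde_gaussian` and `lr_hat`, the `floor` guard against division by zero, and the simulated samples are all hypothetical additions.

```python
import numpy as np

def kde_gaussian(points, sample):
    """Fixed-width Gaussian kernel density estimate, as in equations (2)-(3).

    points: (p, K) evaluation points; sample: (n, K) observations (2-D arrays).
    """
    points = np.asarray(points, dtype=float)
    sample = np.asarray(sample, dtype=float)
    n, K = sample.shape
    h = (4.0 / ((2 * K + 1) * n)) ** (1.0 / (K + 4))   # window width h*
    H = np.atleast_2d(np.cov(sample, rowvar=False))     # kernel covariance matrix
    H_inv = np.linalg.inv(H)
    norm = (2 * np.pi) ** (-K / 2) / np.sqrt(np.linalg.det(H))
    u = (points[:, None, :] - sample[None, :, :]) / h   # shape (p, n, K)
    quad = np.einsum("pnk,kl,pnl->pn", u, H_inv, u)     # u^T H^{-1} u for each pair
    return norm * np.exp(-0.5 * quad).sum(axis=1) / (n * h ** K)

def lr_hat(points, cases, controls, floor=1e-300):
    """Estimated likelihood ratio f1_hat / f0_hat, as in equation (4).

    `floor` avoids division by zero; it is an addition of this sketch.
    """
    f1 = kde_gaussian(points, cases)
    f0 = kde_gaussian(points, controls)
    return f1 / np.maximum(f0, floor)

rng = np.random.default_rng(0)
cases = rng.normal(loc=1.5, size=(200, 2))      # hypothetical diseased sample
controls = rng.normal(loc=0.0, size=(200, 2))   # hypothetical nondiseased sample
ratios = lr_hat(np.array([[1.5, 1.5], [0.0, 0.0]]), cases, controls)
print(ratios)  # estimated LR near the case mean and near the control mean
```

The pairwise array `u` has $p × n × K$ entries, so for large samples the evaluation points would need to be processed in chunks.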

Because the optimum AUC based on LR function is $A U C_{f} = P \{f (X) > f (Y)\}$ , we use the nonparametric estimator of AUC of ROC based on U-statistics²:

A \hat{U} C_{f} = \frac{1}{m n} \sum_{i = 1}^{m} \sum_{j = 1}^{n} I (\hat{f} (X^{i}) > \hat{f} (Y^{j})),

(5)

where $I (A) = 1$ if $A$ is true and 0 otherwise, and $\hat{f} (x)$ is given in (4).
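The estimator in (5) is the proportion of case-control pairs in which the case has the larger score. A minimal sketch follows; the function name `auc_u` is hypothetical, and ties are counted as 0, which follows the strict inequality in (5) literally.

```python
import numpy as np

def auc_u(case_scores, control_scores):
    """U-statistic AUC: fraction of (case, control) pairs with case score > control score."""
    x = np.asarray(case_scores, dtype=float)[:, None]     # shape (m, 1)
    y = np.asarray(control_scores, dtype=float)[None, :]  # shape (1, n)
    return float(np.mean(x > y))                           # average over all m * n pairs

print(auc_u([3.0, 4.0], [1.0, 2.0]))  # every case outranks every control
print(auc_u([1.0, 3.0], [2.0, 2.0]))  # half of the pairs are correctly ordered
```

The same function applies to any scoring rule: pass $\hat{f} (X^{i})$ and $\hat{f} (Y^{j})$ for the LR function, or the values of a simpler combination evaluated on cases and controls.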

The estimates $\hat{f} (X^{i})$ and $\hat{f} (Y^{j})$ use observations of all diseased and nondiseased subjects. Therefore, the $\hat{f} (X^{i})$ ’s and $\hat{f} (Y^{j})$ ’s are no longer independent samples, and conventional estimation of the standard error of the AUC for independent samples fails. We use the bootstrap method²⁴ to estimate the standard error of the estimator (5). Each bootstrap replicate samples $m$ patients and $n$ controls with replacement from the patients and controls, respectively. Based on the resampled data, we estimate the LR function using equations (2)–(4) and then calculate the AUC based on (5). The bootstrap experiment is repeated 50 times, and the standard error is estimated by the sample standard deviation of the bootstrap results.
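The resampling scheme can be sketched generically. In this illustration the helper `bootstrap_se`, its `statistic` callback, and the simulated data are assumptions; in the procedure described here the callback would refit the kernel densities on each resampled data set and return the AUC in (5), whereas the stand-in `auc_first_marker` below only scores subjects by their first test value to keep the example short.

```python
import numpy as np

def bootstrap_se(cases, controls, statistic, n_boot=50, seed=0):
    """Stratified bootstrap: resample m cases and n controls with replacement,
    recompute `statistic` on each replicate, and return the sample SD."""
    rng = np.random.default_rng(seed)
    cases, controls = np.asarray(cases), np.asarray(controls)
    m, n = len(cases), len(controls)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        xb = cases[rng.integers(0, m, size=m)]       # resampled cases
        yb = controls[rng.integers(0, n, size=n)]    # resampled controls
        reps[b] = statistic(xb, yb)
    return float(np.std(reps, ddof=1))

def auc_first_marker(x, y):
    """Stand-in statistic: U-statistic AUC of the first test variable only."""
    return float(np.mean(x[:, None, 0] > y[None, :, 0]))

rng = np.random.default_rng(1)
cases = rng.normal(loc=1.0, size=(50, 2))     # hypothetical diseased sample
controls = rng.normal(loc=0.0, size=(50, 2))  # hypothetical nondiseased sample
se = bootstrap_se(cases, controls, auc_first_marker)
print(se)
```

Resampling cases and controls separately keeps the group sizes $m$ and $n$ fixed in every replicate, as the text specifies.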

Noninferiority and Testing Statistics

As explained in the first section, we are interested in whether simpler combinations, such as an LD function or a logistic regression (LG) function, are noninferior to the optimum LR function. In general terms, we focus on a special class of functions $G = \{g : g (x; θ) ∣ θ \in Θ\}$ , where $g (x; θ)$ is a known, prespecified simple form function with parameter $θ$ belonging to a specific parameter space $Θ$ . We assume that an increase of the function $g$ represents an increase of disease risk. We can select a $θ_{0} \in Θ$ , such that $A U C_{g} = \Pr \{g (X; θ_{0}) > g (Y; θ_{0})\}$ . One example is the LD function regardless of distributions of testing results.

The $A U C_{g}$ can be estimated similarly by the U-statistics in (5) as

A \hat{U} C_{g} = \frac{1}{m n} \sum_{i = 1}^{m} \sum_{j = 1}^{n} I (g (X^{i}; θ_{0}) > g (Y^{j}; θ_{0})) .

Let $δ (A U C_{f}, A U C_{g})$ be a function of differences between the AUCs of 2 classification rules of $f$ and $g$ . Possible choices are $δ = A U C_{f} - A U C_{g}$ ;²⁵ $δ = 1 - A U C_{g} / A U C_{f}$ ;²⁶ or $δ = \log \frac{1 + A U C_{f}}{1 - A U C_{f}} - \log \frac{1 + A U C_{g}}{1 - A U C_{g}}$ . Noninferiority of $g$ against the optimum LR function $f$ is defined as $δ$ being less than a prespecified level $δ_{0}$ $(δ_{0} > 0)$ . Because we want to prove the noninferiority, noninferiority must be the alternative hypothesis.²⁷ Thus, the hypotheses to be tested are

H_{0} : δ \geq δ_{0} \quad \text{vs.} \quad H_{1} : δ < δ_{0} .

Let $\hat{δ} = δ (A \hat{U} C_{f}, A \hat{U} C_{g})$ , and the bootstrap test statistic is

s t = \frac{(\hat{δ} - δ_{0})}{se (\hat{δ})},

(6)

where $s e (\hat{δ})$ is the sample standard error of bootstrap samples of $\hat{δ}$ . We will reject the null hypothesis $H_{0}$ and conclude that $g$ is noninferior to $f$ , if $s t \leq z_{α}$ , where $z_{α}$ is the 100αth percentile of the standard normal distribution. Again, we propose to take 50 bootstrap samples according to Efron and Tibshirani.²⁴
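The decision rule based on (6) can be written out as follows. This is a sketch only: the function name `noninferiority_test`, the default arguments, and the numerical inputs are hypothetical, and the reported one-sided $P$ value is simply the standard normal probability below the statistic, which is an assumption consistent with rejecting when $s t \leq z_{α}$ .

```python
from statistics import NormalDist

def noninferiority_test(delta_hat, se, delta0=0.25, alpha=0.05):
    """Return (st, p_value, reject) for H0: delta >= delta0 vs H1: delta < delta0."""
    st = (delta_hat - delta0) / se              # test statistic (6)
    z_alpha = NormalDist().inv_cdf(alpha)       # 100*alpha-th percentile of N(0, 1)
    p_value = NormalDist().cdf(st)              # one-sided (lower-tail) P value
    return st, p_value, st <= z_alpha

# Hypothetical inputs: estimated difference 0.02 with bootstrap SE 0.10.
st, p_value, reject = noninferiority_test(delta_hat=0.02, se=0.10)
print(st, p_value, reject)
```

With $α = 0.05$ the critical value is about −1.645, so the estimated difference must fall well below the margin $δ_{0}$ , relative to its standard error, before noninferiority is concluded.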

After examining the distribution of bootstrap samples and the corresponding type I error rates, the difference function $δ = \log \frac{1 + A U C_{f}}{1 - A U C_{f}} - \log \frac{1 + A U C_{g}}{1 - A U C_{g}}$ resulted in the best normal approximation for the test statistics (6), and so we selected it to define the noninferiority for simple functions against the optimum LR function. For a given $A U C_{f}$ , $δ$ is a monotone decreasing function of $A U C_{g}$ . The relationship between $δ (A U C_{f}, A U C_{g})$ and $A U C_{f} - A U C_{g}$ is given in Table 1.

Table 1.

The Corresponding Values Between the Difference of the artanh Functions (i.e., $δ = \log \frac{1 + A U C_{f}}{1 - A U C_{f}} - \log \frac{1 + A U C_{g}}{1 - A U C_{g}}$ ) and the Arithmetic Difference (i.e., $A U C_{f} - A U C_{g}$ )

$δ$	$A U C_{f} = 0.70$	$A U C_{f} = 0.75$	$A U C_{f} = 0.80$	$A U C_{f} = 0.85$	$A U C_{f} = 0.90$
0.20	0.055	0.047	0.039	0.030	0.021
0.25	0.069	0.060	0.050	0.039	0.027
0.30	0.085	0.073	0.061	0.047	0.033
0.40	0.117	0.101	0.084	0.061	0.046
0.50	0.151	0.131	0.110	0.086	0.060
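The correspondence in Table 1 follows from inverting the definition of $δ$ : if $t = \log \frac{1 + A U C_{f}}{1 - A U C_{f}} - δ$ , then $A U C_{g} = (e^{t} - 1) / (e^{t} + 1) = \tanh (t / 2)$ . The sketch below applies this inversion; the function names are hypothetical.

```python
import math

def delta_log(auc_f, auc_g):
    """Difference function delta = log((1+AUC_f)/(1-AUC_f)) - log((1+AUC_g)/(1-AUC_g))."""
    return math.log((1 + auc_f) / (1 - auc_f)) - math.log((1 + auc_g) / (1 - auc_g))

def auc_g_from_delta(auc_f, delta):
    """AUC_g implied by a given AUC_f and delta (inverse of delta_log in AUC_g)."""
    t = math.log((1 + auc_f) / (1 - auc_f)) - delta
    return math.tanh(t / 2)

def arithmetic_difference(auc_f, delta):
    """AUC_f - AUC_g corresponding to a given delta."""
    return auc_f - auc_g_from_delta(auc_f, delta)

for delta in (0.20, 0.25, 0.30, 0.40, 0.50):
    row = [round(arithmetic_difference(a, delta), 3) for a in (0.70, 0.75, 0.80, 0.85, 0.90)]
    print(delta, row)
```

The same function can be used in reverse when planning a study: fix the largest acceptable loss in AUC at the expected $A U C_{f}$ and compute the matching $δ_{0}$ with `delta_log`.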


MONTE CARLO SIMULATION EXPERIMENTS

We used Monte Carlo simulation experiments to investigate the performance of the proposed estimator of the optimum AUC (5) and of the bootstrap test statistic (6), including the accuracy of (5) and the actual type I error rate and power of (6), under various hypothetical conditions and sample sizes. Because of computational limitations, our Monte Carlo simulations were limited to 2 diagnostic tests. We repeated each simulation condition 500 times. The 95% confidence interval for a 5% type I error rate under the null in 500 repetitions is (3.09%, 6.91%).
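The quoted interval is the usual normal-approximation interval for a binomial proportion, $p ± 1.96 \sqrt{p (1 - p) / N}$ with $p = 0.05$ and $N = 500$ . A short check (the function name is hypothetical):

```python
import math

def rejection_rate_interval(p=0.05, reps=500, z=1.96):
    """Normal-approximation interval for an observed rejection rate when the true rate is p."""
    half_width = z * math.sqrt(p * (1 - p) / reps)
    return p - half_width, p + half_width

low, high = rejection_rate_interval()
print(f"({100 * low:.2f}%, {100 * high:.2f}%)")
```

An observed type I error rate outside this range in 500 repetitions would suggest that the test does not hold its nominal 5% level.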

First, we designed experiments in which the test results followed bivariate normal distributions with equal or different covariance matrices. We supposed that $X = (X_{1}, X_{2}) \sim N (μ_{11}, μ_{12}, σ_{11}, σ_{12}, r_{1})$ and $Y = (Y_{1}, Y_{2}) \sim N (μ_{01}, μ_{02}, σ_{01}, σ_{02}, r_{0})$ , where $μ_{11}$ , $μ_{12}$ , $σ_{11}$ , $σ_{12}$ , and $r_{1}$ were the means, standard deviations, and correlation of $X_{1}$ and $X_{2}$ , and $μ_{01}$ , $μ_{02}$ , $σ_{01}$ , $σ_{02}$ , and $r_{0}$ were the means, standard deviations, and correlation of $Y_{1}$ and $Y_{2}$ . We considered several cases with different sets of parameters and sample sizes for the diseased and nondiseased groups. In these simulations, we first calculated $A U C_{f}$ by mathematical integration and compared the mean and standard deviation of the sample $A \hat{U} C_{f}$ with the theoretical $A U C_{f}$ . We also compared the linear discriminant function (LD) or quadratic discriminant function (QD) against the LR function. The null hypothesis was $H_{0} : δ = \log \frac{1 + A U C_{f}}{1 - A U C_{f}} - \log \frac{1 + A U C_{g}}{1 - A U C_{g}} \geq δ_{0}$ against the alternative $H_{1} : δ < δ_{0}$ , where $A U C_{g}$ was the AUC of the ROC curve generated by function $g$ , either LD or QD, and $δ_{0}$ was the prespecified constant.
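For binormal settings like these, the AUC of the best linear combination has a closed form due to Su and Liu⁴: the coefficients are proportional to $(Σ_{1} + Σ_{0})^{- 1} (μ_{1} - μ_{0})$ and the AUC is $Φ (\sqrt{(μ_{1} - μ_{0})^{T} (Σ_{1} + Σ_{0})^{- 1} (μ_{1} - μ_{0})})$ . The sketch below evaluates it; the function names and the symbols $μ_{1}$ , $μ_{0}$ , $Σ_{1}$ , $Σ_{0}$ are introduced here, and whether the simulations used exactly this linear function is an assumption.

```python
import math
import numpy as np

def cov2(s1, s2, r):
    """2x2 covariance matrix from two standard deviations and a correlation."""
    return np.array([[s1 * s1, r * s1 * s2], [r * s1 * s2, s2 * s2]])

def best_linear_auc(mu1, cov1, mu0, cov0):
    """AUC and coefficients of the AUC-maximizing linear combination for normal markers."""
    d = np.asarray(mu1, dtype=float) - np.asarray(mu0, dtype=float)
    a = np.linalg.solve(cov1 + cov0, d)                 # coefficients, up to scale
    z = math.sqrt(float(d @ a))                         # sqrt(d^T (S1 + S0)^{-1} d)
    auc = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))    # standard normal CDF at z
    return auc, a

# Equal covariance matrices (a setting of the kind listed in Table 3).
auc_equal, _ = best_linear_auc([1.28, 1.28], cov2(1, 1, 0.0), [0, 0], cov2(1, 1, 0.0))
# Different correlations for cases and controls (a setting of the kind listed in Table 2).
auc_unequal, coef = best_linear_auc([0.39, 0.39], cov2(1, 1, 0.6), [0, 0], cov2(1, 1, 0.1))
print(round(auc_equal, 3), round(auc_unequal, 3))
```

In the equal-covariance case the linear combination is itself the LR function, so its AUC should agree with the tabulated $A U C_{f}$ ; in the unequal case it falls below $A U C_{f}$ by roughly the arithmetic difference that Table 1 associates with $δ = 0.25$ .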

Tables 2 and 3 compare the estimator (5) with the true AUC of the LR function. They show that the nonparametric density estimation and U-statistics can produce accurate estimates of the AUC.

Table 2.

Assessing Type I Error Rates of the Proposed Testing Algorithm for the Null Hypotheses $H_{0} : δ = \log \frac{1 + A U C_{f}}{1 - A U C_{f}} - \log \frac{1 + A U C_{L D}}{1 - A U C_{L D}} = δ_{0} = 0.25$

$μ_{11}$ , $μ_{12}$ , $σ_{11}$ , $σ_{12}$ , $r_{1}$	$μ_{01}$ , $μ_{02}$ , $σ_{01}$ , $σ_{02}$ , $r_{0}$	$A U C_{f}$	$(m, n)$	$\bar{x}$	Bias	RMSE^a	Type I Error

(0.39, 0.39, 1, 1, 0.6)	(0, 0, 1, 1, 0.1)	0.70	(50, 50)	0.716	0.016	0.049	0.046
			(100, 100)	0.707	0.007	0.035	0.068
(0.80, 0.80, 1, 1, 0.7)	(0, 0, 1, 1, 0.1)	0.80	(50, 50)	0.814	0.014	0.042	0.042
			(100, 100)	0.806	0.006	0.032	0.042
(1.37, 1.37, 1, 1, 0.75)	(0, 0, 1, 1, 0.15)	0.90	(50, 50)	0.906	0.006	0.028	0.054
			(100, 100)	0.900	0.000	0.021	0.058

^a RMSE denotes root mean square error.

Table 3.

Assessing Powers of the Proposed Testing Algorithm for the Hypotheses $H_{0} : δ = \log \frac{1 + A U C_{f}}{1 - A U C_{f}} - \log \frac{1 + A U C_{L D}}{1 - A U C_{L D}} = δ_{0} = 0.25 \text{ vs. } H_{1} : δ = 0$

$μ_{11}$ , $μ_{12}$ , $σ_{11}$ , $σ_{12}$ , $r_{1}$	$μ_{01}$ , $μ_{02}$ , $σ_{01}$ , $σ_{02}$ , $r_{0}$	$A U C_{f}$	$(m, n)$	$\bar{x}$	Bias	RMSE^a	Power

(0.34, 0.34, 1, 1, 0.8)	(0, 0, 1, 1, 0.8)	0.60	(50, 50)	0.648	0.048	0.046	0.32
			(75, 75)	0.632	0.032	0.040	0.61
			(100, 100)	0.624	0.024	0.037	0.81
(0.64, 0.64, 1, 1, 0.5)	(0, 0, 1, 1, 0.5)	0.70	(50, 50)	0.725	0.025	0.047	0.61
			(75, 75)	0.712	0.012	0.042	0.90
			(100, 100)	0.709	0.009	0.034	0.97
(0.92, 0.92, 1, 1, 0.2)	(0, 0, 1, 1, 0.2)	0.80	(50, 50)	0.808	0.008	0.042	0.74
			(75, 75)	0.806	0.006	0.035	0.97
			(100, 100)	0.805	0.005	0.029	0.99
(1.28, 1.28, 1, 1, 0)	(0, 0, 1, 1, 0)	0.90	(50, 50)	0.904	0.004	0.030	0.80
			(75, 75)	0.903	0.003	0.025	0.97
			(100, 100)	0.902	0.002	0.022	0.99

^a RMSE denotes root mean square error.

Table 2 also shows the rejection rate of the proposed test statistic when the null hypothesis was true (ie, $H_{0} : δ = \log \frac{1 + A U C_{f}}{1 - A U C_{f}} - \log \frac{1 + A U C_{L D}}{1 - A U C_{L D}} = δ_{0}$ ) for the prespecified noninferiority threshold $δ_{0} = 0.25$ . Because the QD combination is the LR function when the correlation coefficients differ between cases and controls, the alternative hypothesis is true for QD (in fact, $δ = \log \frac{1 + A U C_{f}}{1 - A U C_{f}} - \log \frac{1 + A U C_{Q D}}{1 - A U C_{Q D}} = 0$ ). The proposed algorithm had rejection rates within the 95% confidence interval of the 5% error rate for both the moderate sample size $m = n = 100$ and the relatively small sample size $m = n = 50$ . Furthermore, it follows from Table 3 that the powers of the test statistic are reasonable for these sample sizes because most of the powers are greater than 0.80.

Our second design of simulation studies was based on (asymmetric) gamma distributions. In this example, we wanted to show that the LD is inferior to the LR whereas the quadratic combination (QD) is noninferior to the LR. We simulated $Y_{1}$ and $Y_{2}$ from a standard bivariate normal distribution with correlation coefficient 0.5, and $X_{1}$ and $X_{2}$ from a mixture of 2 bivariate gamma distributions that we generated in the following steps. First, we generated a random variable $ξ$ following a 0–1 (Bernoulli) distribution with $P (ξ = 0) = 0.57$ . If $ξ = 0$ , we generated 2 independent variables $X_{1} \sim Gamma (0.2, 1)$ and $X_{2} \sim Gamma (0.1, 1)$ ; otherwise, we generated 3 independent variables $X_{0} \sim Gamma (0.3, 1)$ , $X_{10} \sim Gamma(0.5, 1)$ , $X_{20} \sim Gamma (1, 1)$ , and defined $X_{1} = X_{10} + X_{0}$ and $X_{2} = X_{20} + X_{0}$ , respectively.
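The generating steps for this design can be sketched as follows. The function names are hypothetical, and reading $Gamma (a, 1)$ as shape $a$ with unit scale is an assumption of this sketch (with a second parameter of 1, scale and rate parameterizations coincide).

```python
import numpy as np

def simulate_cases(m, rng):
    """Draw m case vectors (X1, X2) from the two-component gamma mixture."""
    xi = rng.random(m) >= 0.57                  # xi = 0 with probability 0.57
    x = np.empty((m, 2))
    n0, n1 = int((~xi).sum()), int(xi.sum())
    # Component xi = 0: two independent gamma variables.
    x[~xi, 0] = rng.gamma(0.2, 1.0, size=n0)
    x[~xi, 1] = rng.gamma(0.1, 1.0, size=n0)
    # Component xi = 1: a shared term X0 induces correlation between X1 and X2.
    x0 = rng.gamma(0.3, 1.0, size=n1)
    x[xi, 0] = rng.gamma(0.5, 1.0, size=n1) + x0
    x[xi, 1] = rng.gamma(1.0, 1.0, size=n1) + x0
    return x

def simulate_controls(n, rng):
    """Draw n control vectors (Y1, Y2): standard normal margins, correlation 0.5."""
    return rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n)

rng = np.random.default_rng(2024)
cases = simulate_cases(100, rng)
controls = simulate_controls(100, rng)
print(cases.mean(axis=0), controls.mean(axis=0))
```

Under these assumptions the mean of $X_{1}$ is $0.57 × 0.2 + 0.43 × 0.8 = 0.458$ , which gives a quick sanity check on a large simulated sample.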

In this case, the true values of the AUC for LR, LD, and QD were, respectively, 0.736, 0.674, and 0.711, and the corresponding difference $δ = \log \frac{1 + A U C_{f}}{1 - A U C_{f}} - \log \frac{1 + A U C_{g}}{1 - A U C_{g}}$ between LR and LD was 0.25. Here we are interested in the rejection rate, the proportion of the 500 simulated repetitions in which the null hypothesis of inferiority is rejected. The noninferiority limit $δ_{0}$ for LD and QD to LR in our simulation was 0.25. The nonparametric estimator of AUC for LR was 0.752 (with $s = 0.070$ ) for the small sample size $m = n = 50$ and 0.739 $(s = 0.053)$ for the moderate sample size $m = n = 100$ . For the test of noninferiority of LD to LR, the rejection rate was 0.040 and 0.066 for the small and moderate sample sizes, respectively, which showed good control of type I errors for both sample sizes. On the other hand, for the test of noninferiority of QD to LR, the rejection region was calculated according to thresholds for the null hypothesis. In 500 repetitions, the rejection rates (statistical power) were 0.170 and 0.222 for the 2 different sample sizes. The powers were not high because of the small difference in AUC between QD and LR and the relatively small sample sizes $m = n \leq 100$ .

In summary, our proposed test performed well under the studied cases, even with small sample sizes such as $m = n = 50$ .

AN APPLICATION EXAMPLE

In this section, we apply our method to data obtained from the SOF for illustration. From 1986 to 1988, SOF recruited 9704 white women aged 65 years or older from 4 areas of the United States.²⁸ At baseline, bone mineral density (BMD) was measured at the calcaneus, distal radius, and proximal radius using single photon absorptiometry. At the second visit (1988–89), surviving participants had BMD measurements of the posterior-anterior spine (L1-L4) and proximal femur (neck, trochanter, total hip regions of interest) using dual x-ray absorptiometry. Fractures of the hip were recorded for each subject at each visit. Details of the study design and the data have been published previously.^28,29

We included 7112 women from the study. All of these women had forearm, calcaneal, hip, and spine BMD measurements. Furthermore, they all had known 5-year hip fracture status: either they were followed for 5 years after visit 2 without hip fracture or a hip fracture occurred within 5 years after visit 2. Women who were lost to follow-up within 5 years without known hip fracture were excluded. A total of 229 women had hip fractures during the 5-year follow-up. Many clinical variables at baseline have been previously identified as significant predictors of hip fractures within 5 years.^29–33 Our previous study suggested that femoral neck BMD, age, and height loss after age 25 could be used to build a noninferior classification rule to the optimum recursive partitioning rule.³⁴ In this example, we will use these 3 variables and compare the AUC under ROC curves of LR function and the LG function.

Let $D = 1$ stand for hip fracture, $X_{1}$ , $X_{2}$ , and $X_{3}$ denote age, femoral neck BMD, and loss of height, respectively. First, the LR function described in section 2 leads to the optimum discriminant function with the estimated area under the ROC curve of 0.79. However, the contributions of these parameters were not apparent. Logistic regression offers a simple linear combination of these 3 variables that can be used to classify subjects according to the predicted probability of fracture:

P (D = 1 ∣ (X_{1}, X_{2}, X_{3})) = 1 - \frac{1}{1 + \exp (- 3.89 + 0.075 X_{1} - 8.90 X_{2} + 0.10 X_{3})}

where $g (x) = 0.075 x_{1} - 8.90 x_{2} + 0.10 x_{3}$ is the corresponding linear combination. The estimated AUC of this linear combination was 0.80, essentially the same as the AUC of the LR function (0.79).
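The fitted logistic rule can be evaluated directly from the reported coefficients. In this sketch the function names and the input values are hypothetical, and the measurement units of the three predictors are not restated here, so the example subject is purely illustrative.

```python
import math

def linear_score(age, fn_bmd, height_loss):
    """Linear combination g(x) = 0.075*x1 - 8.90*x2 + 0.10*x3 from the fitted model."""
    return 0.075 * age - 8.90 * fn_bmd + 0.10 * height_loss

def fracture_probability(age, fn_bmd, height_loss):
    """Predicted 5-year hip fracture probability: 1 - 1 / (1 + exp(-3.89 + g(x)))."""
    eta = -3.89 + linear_score(age, fn_bmd, height_loss)
    return 1.0 - 1.0 / (1.0 + math.exp(eta))

# Hypothetical subject; the numbers are illustrative and not taken from the SOF data.
p = fracture_probability(age=75, fn_bmd=0.65, height_loss=3.0)
print(round(p, 3))
```

Because the predicted probability is a strictly increasing function of $g (x)$ , ranking subjects by $g (x)$ or by the probability gives the same ROC curve and the same AUC.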

Because we are specifically interested in how this function will perform with fairly small sample sizes, and to avoid bootstrapping the data from the nearly 6900 women without hip fracture and estimating their joint density, we constructed a subset of the data. Our small dataset consists of all 229 women who had a hip fracture and another 229 randomly chosen women without fracture. Based on these new data, we examine whether the AUC of the logistic linear function is noninferior to that of the optimal LR function with a prespecified threshold of $δ_{0} = 0.25$ . The bootstrap test used 50 samples. We repeated our bootstrap testing procedure 500 times, and the 95% empirical confidence interval of the $P$ value was (0, 0.002). In fact, all the $P$ values were smaller than 0.05, which suggests that the logistic function is noninferior to the LR function and, therefore, should be one of the optimum classifications.

In summary, based on SOF data, considering only the relation between the 3 predictors (age, femoral neck BMD, and loss of height) and 5-year hip fracture, the logistic regression equation could offer the best prediction of hip fracture.

DISCUSSION AND CONCLUSION

It is important to use all diagnostic predictors simultaneously to establish a new predictor with better diagnostic efficiency. Although the likelihood ratio function leads to the uniformly most powerful composite predictor, relatively simple alternatives, such as the most commonly used linear combinations, are better for practical applications. Simple functions are often used to interpret the relationship between predictors and outcomes in medical literature. We proposed a bootstrap test for whether these simpler classification functions are noninferior to the optimum LR function. Monte Carlo simulations showed appropriate type I error rates and reasonable power for the proposed test.

For practical applications, it is important to determine a suitable margin of noninferiority $δ_{0}$ as well as the function that defines the differences between 2 AUCs. We selected the artanh function (i.e., $\log \frac{1 + A U C_{f}}{1 - A U C_{f}} - \log \frac{1 + A U C_{g}}{1 - A U C_{g}}$ ) as the measure of difference to improve the normality of the bootstrap test. There is no one-to-one mapping from the arithmetical difference (i.e., $A U C_{f} - A U C_{g}$ ) to the artanh difference because the correspondence depends on $A U C_{f}$ (Table 1). When we select $δ_{0}$ as 0.25, the corresponding arithmetical differences lie in the interval [0.027, 0.069] for $0.70 \leq A U C_{f} \leq 0.90$ . In general, one can select the threshold $δ_{0}$ on the basis of the expected $A U C_{f}$ and the targeted noninferiority margin on the arithmetical scale.

The estimated $A U C_{f}$ depends on good estimation of the multivariate densities. We employed the global bandwidth kernel estimator, which led to satisfactory results both in the simulation studies and with the SOF data. The effect of more efficient estimates of the multivariate densities on our proposed test will require further investigation. One difficulty in estimating a multivariate density is the increase in dimensionality. For very high dimensions, one may need to perform a factor analysis first and then model the nonparametric densities on the major factors.

Combining the results of diagnostic tests can provide information beyond the sum of the parts. However, an inefficient combination may contain redundant information, and if all the covariates are included, the model may be overwhelmed and ineffective. As with stepwise regression analysis, we can try different combinations of test results. The combination with the lowest number of variables that remains noninferior to the LR function can be used to formulate simple rules. Furthermore, the simple function g is not limited to linear or polynomial functions. Any function that can be interpreted and has a closed explicit form can be tested for noninferiority to the optimum LR function.

Sample size also affects our method. According to Gurler and Prewitt,³⁵ a moderate sample size for a good estimation of density is 300. One hundred is considered a small sample size. So, in theory, a study with 100 diseased subjects $(m)$ and 100 nondiseased subjects $(n)$ may still be too small to give a reliable estimation of density functions that construct the LR function. However, our limited simulation studies suggest that a sample of 100 will still produce stable results in estimating the AUC of the ROC curve. This is significant because most clinical imaging studies are relatively small. The required sample size will increase with the number of tests to be combined.

Our study has some limitations. First, our bootstrap test procedure was evaluated based on limited simulation experiments and it is not clear if all these properties will hold for more complicated applications. Second, the computational burden will increase with the number of diagnostic variables as well as the number of study subjects. The programs used in this article ran on a PC using S-plus software. More efficient programming might improve computational efficiency.

In conclusion, this article proposes a bootstrap test procedure to determine whether a simpler combination of diagnostic tests is noninferior to the theoretical optimum combination based on the likelihood ratio function.

ACKNOWLEDGMENTS

The authors would like to thank the editor and reviewers for their constructive comments. This study was supported by grants from the National Institutes of Health R03 AR47104 and R01EB004079.

Contributor Information

Hua Jin, South China Normal University, School of Mathematical Sciences, Guangzhou, China.

Ying Lu, University of California, San Francisco, Department of Radiology, Department of Epidemiology and Biostatistics, and UCSF Helen Diller Family Comprehensive Cancer Center.

REFERENCES

1. Hanley J. Receiver operating characteristic (ROC) methodology: the state of the art. Crit Rev Diagnost Imag. 1989;29:307–35.
2. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a non-parametric approach. Biometrics. 1988;44:837–45.
3. Metz CE. ROC methodology in radiology imaging. Invest Radiol. 1986;21:720–33.
4. Su J, Liu J. Linear combinations of multiple diagnostic markers. J Am Stat Assoc. 1993;88(424):1350–5.
5. Pepe MS, Thompson ML. Combining diagnostic test results to increase accuracy. Biostatistics. 2000;1:123–40.
6. Thompson ML. Assessing the diagnostic accuracy of a sequence of tests. Biostatistics. 2003;4:341–51.
7. Baker S. Evaluating multiple diagnostic tests with partial verification. Biometrics. 1995;51:330–7.
8. Green DM, Swets JA. Signal Detection Theory and Psychophysics. New York: John Wiley; 1966.
9. Egan JP. Signal Detection Theory and ROC Analysis. New York: Academic Press; 1975.
10. Baker SG. Identifying combinations of cancer markers for further study as triggers of early intervention. Biometrics. 2000;56:1082–7.
11. McIntosh M, Pepe M. Combining several screening tests: optimality of the risk score. Biometrics. 2002;58:657–64.
12. Neyman J, Pearson ES. On the problem of the most efficient tests of statistical hypothesis. Philos Trans R Soc Lond A. 1933;231:289–337.
13. Fisher RA. The use of multiple measurements in taxonomic problems. Ann Eugen. 1936;7:179–88.
14. Randles RH, Broffitt JD, Ramberg JS, Hogg RV. Generalized linear and quadratic discriminant functions using robust estimates. J Am Statist Assoc. 1978;73:564–8.
15. Silverman BW. Density Estimation for Statistics and Data Analysis. London: Chapman and Hall; 1989.
16. Izenman AJ. Recent developments in nonparametric density estimation. J Am Stat Assoc. 1991;86:205–24.
17. Scott DW. Multivariate Density Estimation. Theory, Practice, and Visualization. New York: John Wiley; 1992.
18. Hjort NL, Glad IK. Nonparametric density estimation with a parametric start. Ann Stat. 1995;23:882–904.
19. Hjort NL, Jones MC. Locally parametric nonparametric density estimation. Ann Stat. 1996;24:1619–47.
20. Hjort NL. Bayesian approaches to non- and semiparametric density estimation. In: Bernando JM, Berger JO, Dawid AP, Smith AFM, eds. Bayesian Statistics. New York: Oxford University Press; 1996. p 223–53.
21. Hall P. Biometrika centenary: nonparametrics. Biometrika. 2001;88:143–65.
22. Wand MP. Error analysis for general multivariate kernel estimators. J Nonparametr Stat. 1992;2:1–15.
23. Wand MP, Jones MC. Comparison of smoothing parameterizations in bivariate kernel density estimation. J Am Stat Assoc. 1993;88:520–8.
24. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. San Francisco: Chapman & Hall; 1993.
25. Zhou XH, Obuchowski NA, McClish DK. Statistical Methods in Diagnostic Medicine. New York: John Wiley; 2002.
26. Lachenbruch P, Lynch C. Assessing screening tests: extensions of McNemar’s test. Stat Med. 1998;17:2207–17.
27. Lu Y, Jin H, Genant HK. On the non-inferiority of a diagnostic test based on paired observations. Stat Med. 2003;22:3029–44.
28. Cummings S, Black DM, Nevitt MC, et al. Bone density at various sites for prediction of hip fractures: the study of osteoporotic fractures. Lancet. 1993;341:72–5.
29. Cummings S, Nevitt MC, Browner WS, et al. Risk factors for hip fracture in white women. Study of osteoporosis research group [see Comments]. N Engl J Med. 1995;332:767–73.
30. Cummings S, Browner WS, Bauer D, et al. Endogenous hormones and the risk of hip and vertebral fractures among older women. Study of Osteoporotic Fractures Research Group. N Engl J Med. 1998;339:767–8.
31. Bauer DC, Sklarin PM, Stone KL, et al. Biochemical markers of bone turnover and prediction of hip bone loss in older women: the study of osteoporotic fractures. J Bone Mineral Res. 1999;14:1404–10.
32. Arden NK, Nevitt MC, Lane NE, et al. Osteoarthritis and risk of falls, rates of bone loss, and osteoporotic fractures. Study of Osteoporotic Fractures Research Group. Arthritis Rheum. 1999;42:1378–85.
33. Sellmeyer D, Stone K, Sebastian A, et al. A high ratio of dietary animal to vegetable protein increases the rate of bone loss and the risk of fracture in postmenopausal women. Study of Osteoporotic Fractures Research Group. Am J Clin Nutr. 2001;73:118–22.
34. Jin H, Lu Y, Stone KL, et al. Classification algorithms for hip fracture prediction based on recursive partitioning methods. Med Decis Making. 2004;24(4):386–97.
35. Gurler Ü, Prewitt K. Bivariate density estimation with randomly truncated data. J Multivar Anal. 2000;74:88–115.

