Author manuscript; available in PMC: 2016 Jan 1.
Published in final edited form as: Pattern Recognit. 2014 Aug 6;48(1):276–287. doi: 10.1016/j.patcog.2014.07.025

Optimizing area under the ROC curve using semi-supervised learning

Shijun Wang a, Diana Li a, Nicholas Petrick b, Berkman Sahiner b, Marius George Linguraru c,d, Ronald M Summers a,*
PMCID: PMC4226543  NIHMSID: NIHMS620086  PMID: 25395692

Abstract

Receiver operating characteristic (ROC) analysis is a standard methodology to evaluate the performance of a binary classification system. The area under the ROC curve (AUC) is a performance metric that summarizes how well a classifier separates two classes. Traditional AUC optimization techniques are supervised learning methods that utilize only labeled data (i.e., the true class is known for all data) to train the classifiers. In this work, inspired by semi-supervised and transductive learning, we propose two new AUC optimization algorithms hereby referred to as semi-supervised learning receiver operating characteristic (SSLROC) algorithms, which utilize unlabeled test samples in classifier training to maximize AUC. Unlabeled samples are incorporated into the AUC optimization process, and their ranking relationships to labeled positive and negative training samples are considered as optimization constraints. The introduced test samples will cause the learned decision boundary in a multidimensional feature space to adapt not only to the distribution of labeled training data, but also to the distribution of unlabeled test data. We formulate the semi-supervised AUC optimization problem as a semi-definite programming problem based on the margin maximization theory. The proposed methods SSLROC1 (1-norm) and SSLROC2 (2-norm) were evaluated using 34 (determined by power analysis) randomly selected datasets from the University of California, Irvine machine learning repository. Wilcoxon signed rank tests showed that the proposed methods achieved significant improvement compared with state-of-the-art methods. The proposed methods were also applied to a CT colonography dataset for colonic polyp classification and showed promising results.1

Keywords: Receiver operating characteristic, AUC, Semi-supervised learning, Transfer learning, Semidefinite programming, RankBoost, SVMROC, SSLROC

1. Introduction

Receiver operating characteristic (ROC) analysis is a standard methodology to evaluate the performance of a classification system [1–12]. It is applied extensively within clinical medicine [13–15]. The ROC curve is a two-dimensional plot which illustrates the relationship between the true positive rate (sensitivity) and the false positive rate (1 − specificity) of a binary classifier. In essence, a classifier seeks the optimal mapping of samples from a multi-dimensional feature space to a one-dimensional decision space during the training process. After training, the classifier can be applied to test samples whose labels are unknown to make a prediction for each test sample. The prediction must be numerical (not a binary category) for ROC analysis to be possible. Based on the predictions of the test set from a trained classifier, the user of the classifier can select a specific diagnostic threshold to differentiate positive from negative samples for his or her specific application by finding the point along the ROC curve that maximizes sensitivity at the highest acceptable false positive rate (or cost).

The area under the ROC curve (AUC) is a univariate description of the ROC curve [1]. It ranges from 0.5 to 1, with larger values representing higher system performance. The AUC is equal to the probability that the decision value assigned to a randomly-drawn positive sample is greater than the value assigned to a randomly-drawn negative sample. Flach et al. proved that AUC is coherent and linearly related to expected loss [12]. The AUC statistic is commonly used to compare different classification systems. Previous studies have shown that AUC is statistically consistent and a more discriminative measure than classification accuracy [3,4].
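To make this pairwise interpretation concrete, the following short Python sketch (our illustration, not code from the paper) estimates the AUC by counting correctly ranked positive–negative pairs, counting ties as one half:

```python
import numpy as np

def auc_pairwise(scores_pos, scores_neg):
    """AUC as the fraction of positive-negative pairs whose positive member
    receives the higher decision value (ties counted as one half)."""
    s_p = np.asarray(scores_pos, dtype=float)[:, None]   # shape (n+, 1)
    s_n = np.asarray(scores_neg, dtype=float)[None, :]   # shape (1, n-)
    wins = (s_p > s_n).sum() + 0.5 * (s_p == s_n).sum()
    return wins / (s_p.size * s_n.size)

# One of the 3 x 3 = 9 pairs (0.35 vs 0.7) is misranked, so AUC = 8/9.
print(auc_pairwise([0.9, 0.8, 0.35], [0.7, 0.3, 0.1]))  # 0.888...
```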

Although some researchers have recommended the use of AUC for the evaluation of machine learning algorithms when a single performance metric needs to be used for the evaluation [1], others have pointed out some shortcomings of the use of the AUC. Lobo et al. cited a number of limitations of the use of AUC in evaluating the performance of species distribution (presence–absence) models [16] in ecology. Among the more general limitations are that the AUC summarizes performance over regions of the ROC space in which one would rarely operate, and that the goodness-of-fit of a model is ignored by the AUC. Hanczar et al. studied the problem of comparing estimates of AUC, true positive rate (TPR) and false positive rate (FPR) with true metrics when classifier training and performance estimation are performed on small-sample datasets [17]. They found that generally there is weak regression of the true metric on the estimated metric for all three figures of merit (AUC, TPR and FPR) studied. Clearly, AUC needs to be carefully considered as an endpoint in both classifier evaluation and classifier design. However, when a single figure of merit needs to be used for classifier design, and the operating point of the classifier (a specific desired FPR or TPR) is not defined a priori, AUC remains a strong alternative to other figures of merit. AUC continues to be a very widely used endpoint in classifier evaluation and design, and many approaches to classifier design only indirectly maximize the AUC by optimizing some other cost function, such as classification accuracy [18]. Our study does not try to define the scenarios for which AUC is an appropriate metric, but instead to discuss and compare approaches for optimizing AUC when it is deemed appropriate. Direct optimization of the AUC for a binary classifier is an interesting problem that may lead to improved performance for such applications.

In previous work, Rakotomamonjy first showed that support vector machines (SVMs) can maximize both AUC and accuracy [5]. He proposed a quadratic programming-based algorithm for AUC maximization that considers the margins between positive and negative training samples. Hereafter, we will refer to this method as "SVMROC". Subsequently, Brefeld and Scheffer presented a rigorous derivation of an AUC-maximizing SVM by imposing a convex bound and a margin term on the optimization problem [6]. They not only gave a strict analytical solution to the AUC-maximization problem, but also showed an approximate solution based on clustering the constraints for large datasets.

Learning by an ensemble of classifiers is a very effective learning mechanism and a mainstream scheme used in machine learning [19,20]. Ensemble learning refers to a collection of methods that learn a target function by training a number of individual learners and combining their predictions. Bagging [21] and boosting [22] are two of the best-known ensemble learning methods. Inspired by the "collaborative filtering" problem of ranking movies for a user based on movie ratings from other users, Freund et al. proposed an efficient algorithm, termed RankBoost, for combining preferences based on the boosting approach [23]. RankBoost was originally designed for ranking problems. AUC optimization raises the ranks of positive training samples and lowers the ranks of negative training samples, and is therefore essentially a ranking problem. RankBoost can thus be applied to AUC optimization, and has been widely used as a baseline method for this problem.

To maximize AUC for large scale and high dimensional data, Gao et al. proposed a one-pass AUC optimization technique called OPAUC [24]. The most prominent feature of this technique is that it scans the data only once as a single sequence and, therefore, does not require storage of the whole training set. OPAUC employs a square loss to measure the ranking error between two instances from different classes. A regression-based algorithm was developed to calculate the first- and second-order statistics of the training data and store them in memory. In this way, the storage requirement of OPAUC is determined only by the dimension of the data, not by the number of instances.
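As a rough sketch of the bookkeeping that makes one-pass operation possible (our simplification, not the actual OPAUC update rule of [24]), one can maintain running per-class means and second moments whose storage depends only on the feature dimension k:

```python
import numpy as np

class RunningClassStats:
    """One-pass accumulation of per-class first- and second-order statistics;
    memory is O(k + k^2) in the feature dimension k and independent of the
    number of instances, which is the property OPAUC exploits."""
    def __init__(self, k):
        self.n = 0
        self.mean = np.zeros(k)         # running first-order statistic
        self.second = np.zeros((k, k))  # running raw second moment E[x x^T]

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.second += (np.outer(x, x) - self.second) / self.n
```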

In recent years, semi-supervised learning (SSL) has emerged as an alternative to supervised learning in machine learning, with advantages in many real life applications. Semi-supervised learning falls between supervised and unsupervised learning [25,26]. It utilizes both labeled data (usually a small amount), for which the true class is known, and unlabeled data (usually a large amount), for which the class is unknown, during the training process. Semi-supervised learning algorithms were developed primarily because the labeling of data is typically expensive, and even impossible in some applications. SSL is especially useful for medical problems because the acquisition of labels is very expensive and time-consuming in many clinical trials. Previous studies of semi-supervised learning focused on classification and clustering problems [25,26]. For classification problems, classification accuracy is a widely-used evaluation indicator to test semi-supervised learning methods.

Traditional AUC optimization techniques are supervised learning methods, which only utilize labeled data in classifier training. Previous studies on SSL have shown that by utilizing distribution or manifold information of test samples, SSL algorithms can achieve higher classification performance compared with supervised learning algorithms. Thus, one natural idea is to apply the mechanism of SSL to the problem of AUC optimization. In addition, SSL also has a close connection to transductive learning. Traditional supervised learning algorithms attempt the difficult task of learning general rules from training data, but transductive learning reasons from observed training data to test cases directly [27,25]. This is quite different from traditional inductive learning, which only considers functions learned from a training set and ignores statistical connections between training and test sets. In transductive learning, an unlabeled test dataset is used during classifier training in order to predict class membership for the given test dataset based on the labels of training samples. Transductive learning focuses on how to transfer the knowledge gained from the training samples to the unlabeled test samples in an efficient and accurate way. The motivation behind transductive learning is also applicable to the AUC optimization problem.

As an example of transductive learning, Sindhwani and Keerthi proposed semi-supervised linear support vector classifiers (named "SVMlin") to handle partially-labeled large scale datasets with possibly very large and sparse features [28,29]. They applied modified finite Newton techniques to linear transductive SVMs, which are significantly more efficient and scalable than traditional dual optimization techniques for solving quadratic programming problems.

In the literature, there is little work explicitly applying SSL or transductive learning to AUC optimization. Amini et al. proposed a boosting algorithm ("SSRankBoost") for learning bipartite ranking functions with partially labeled data [30]. The bipartite ranking problem refers to a ranking problem that assigns higher scores to relevant examples than to irrelevant ones for a given dataset; it has wide applications in document analysis. Along the same line, Ralaivola proposed a semi-supervised bipartite ranking algorithm based on the normalized Rayleigh coefficient [31]. Later, Usunier et al. proposed a multiview semi-supervised learning algorithm for ranking multilingual documents [32]. Since AUC optimization is closely related to the ranking problem, work on learning bipartite ranking functions can also be applied to AUC optimization problems.

In this work, inspired by semi-supervised and transductive learning, we propose two new AUC optimization algorithms hereby referred to as semi-supervised learning receiver operating characteristic algorithms (SSLROC1 and SSLROC2), which utilize unlabeled test samples for classifier training. Unlabeled test samples are incorporated into the AUC optimization process, and their ranking relationships to positive and negative training samples are considered as optimization constraints. The introduced test samples cause the learned decision boundary in a multi-dimensional feature space to adapt not only to the distribution of labeled training data, but also to the distribution of unlabeled test data. We formulate the semi-supervised AUC optimization problem as a semi-definite programming (SDP) problem [33] based on the margin maximization theory.

The paper is organized as follows: we first introduce the AUC optimization problem in Section 2. The AUC optimization problem is then formulated as a semi-supervised learning problem based on the margin maximization theory and solved using semi-definite programming in Section 3. In Section 4, we list 34 datasets (from University of California, Irvine machine learning repository) which are used to evaluate the proposed method, and show comparisons with state-of-the-art classification or AUC optimization methods. We also show results from the proposed method for a colonic polyp classification problem based on a biomedical imaging dataset. In Section 5 we conclude our findings and discuss computational complexity issues and future research directions.

2. Maximizing AUC with large margin learning

For a two-class classification problem, given training samples {(x1, y1), …, (xn, yn)}, yi ∈ {−1, +1}, the optimization problem for maximizing the area under the ROC curve is defined as follows:

Optimization problem 1

$$\max \mathrm{AUC} = \max_{w} \frac{\sum_{i=1}^{n^+}\sum_{j=1}^{n^-} I(\xi_{ij} > 0)}{n^+ \times n^-} \tag{1}$$

with ξij = ⟨w, ϕ(xi+)⟩ − ⟨w, ϕ(xj−)⟩, i = 1, 2, …, n+, j = 1, 2, …, n−, where n+ and n− are the numbers of positive (+1) and negative (−1) training samples, respectively; ϕ: X → F denotes a mapping function which maps the input space X into a new feature space F; w is the weight vector of the linear classifier; I is the indicator function (1 when the condition holds, 0 otherwise). The key idea of the above formulation of AUC maximization is to assign higher prediction values to positive training samples than to negative training samples and to make the learned classifier work for as many positive–negative sample pairs as possible.
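For a linear scorer, the objective in Eq. (1) can be evaluated directly; the sketch below takes ϕ as the identity map purely for brevity (an assumption of this illustration):

```python
import numpy as np

def empirical_auc(w, X_pos, X_neg):
    """Objective of problem (1): xi[i, j] = <w, x_i^+> - <w, x_j^->,
    with phi taken as the identity map to keep the sketch short."""
    xi = (X_pos @ w)[:, None] - (X_neg @ w)[None, :]   # shape (n+, n-)
    return (xi > 0).mean()   # I(xi_ij > 0) averaged over n+ * n- pairs
```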

Since optimization problem 1 is not differentiable, Rakotomamonjy proposed the following approximately equivalent problem (1-norm or 2-norm) based on a large margin learning theory [5]:

Optimization problem 2

$$\min_{w} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n^+}\sum_{j=1}^{n^-} \xi_{ij}^r \tag{2}$$

with

$$\langle w, \phi(x_i^+)\rangle - \langle w, \phi(x_j^-)\rangle \ge 1 - \xi_{ij}, \quad \xi_{ij} \ge 0, \quad i = 1, 2, \dots, n^+, \ j = 1, 2, \dots, n^-, \ r \in \{1, 2\}.$$

The above constrained quadratic programming optimization problem 2 can be solved using the Lagrange multiplier optimization method [5]. Optimization problem 2 attempts to identify a linear classifier in the reproducing kernel Hilbert space which makes correct predictions for every positive–negative pair in the training set, with relaxation ξij ≥ 0, i = 1, 2, …, n+, j = 1, 2, …, n−.
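Rakotomamonjy solves problem 2 through its dual QP [5]. As an illustrative alternative for the linear, 1-norm case, the following sketch minimizes the equivalent primal pairwise hinge objective by stochastic subgradient descent (our simplification, not the reference implementation):

```python
import numpy as np

def svmroc_primal_sgd(X_pos, X_neg, C=1.0, lr=0.01, epochs=200, seed=0):
    """Stochastic subgradient descent on
    0.5 * ||w||^2 + C * sum_ij max(0, 1 - w . (x_i^+ - x_j^-)),
    taking one random positive-negative pair per step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X_pos.shape[1])
    for _ in range(epochs):
        i = rng.integers(len(X_pos))
        j = rng.integers(len(X_neg))
        diff = X_pos[i] - X_neg[j]
        grad = w.copy()                 # gradient of the regularizer
        if 1.0 - w @ diff > 0:          # hinge is active on this pair
            grad -= C * diff
        w -= lr * grad
    return w
```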

3. A semi-supervised learning method for AUC optimization

In optimization problem 2 we consider only training samples during the AUC optimization process. Therefore, this is a supervised learning algorithm in essence. It has been shown in the semi-supervised learning literature that adding information from unlabeled test samples can be helpful in identifying a more accurate decision boundary in classification problems [25,26]. One natural question is how to best utilize the information contained in the unlabeled test set to help maximize the AUC during optimization in large margin learning classifiers (e.g., SVMs).

To extend large margin learning to the semi-supervised learning domain, Bennett and Demiriz proposed a semi-supervised support vector machine (S3VM) [34]. S3VM minimizes both the classification error and the function capacity based on available data in both training and test sets. The key idea in the formulation of S3VM is the incorporation of unlabeled test sample constraints within the large margin learning framework. Because the labels for the test samples are unknown, two constraints are imposed in the optimization problem for each test sample. This corresponds to the situation in which the unknown test sample is first assumed to be a positive sample, and then a negative sample. Later, Sindhwani and Keerthi proposed semi-supervised linear SVMs to handle large scale data [28,29].

Inspired by the above-mentioned work on semi-supervised SVMs, in this paper we propose two new semi-supervised algorithms to solve the AUC optimization problem 2. The basic idea is to incorporate unlabeled data into the AUC optimization framework shown in problem 2 and to estimate the labels of the unlabeled data during the optimization process. For each test sample, we first assume it is positive and compare it with all negative training samples; we then assume it is negative and compare it with all positive training samples. In this way, we hope to rank potential positive samples in the test set higher than potential negative samples, with the guidance of labeled training samples. In other words, we propose to utilize unlabeled test data, which is the essence of semi-supervised learning, and to rank as many positive test samples as possible higher than negative samples, which is the essence of AUC optimization. More specifically, for a two-class classification problem, given positive training samples {(x1+, y1+), …, (xp+, yp+)}, yi+ = +1, i = 1, 2, …, p, negative training samples {(x1−, y1−), …, (xq−, yq−)}, yj− = −1, j = 1, 2, …, q, and test samples {x1, …, xr} without labels, the optimization problem for maximizing the AUC under the semi-supervised learning setting is defined as

Optimization problem 3 (1-norm)

$$
\begin{aligned}
\min_{w,\xi,\eta,\mu,d}\quad & \frac{1}{2}\|w\|^2 + \frac{C_1}{2}\sum_{i=1}^{p}\sum_{j=1}^{q}\xi_{ij} + \frac{C_2}{2}\sum_{m=1}^{r}\left(\sum_{j=1}^{q}\eta_{mj} + \sum_{i=1}^{p}\mu_{mi}\right)\\
\text{s.t.}\quad & \langle w,\phi(x_i^+)\rangle - \langle w,\phi(x_j^-)\rangle \ge 1 - \xi_{ij},\quad \xi_{ij}\ge 0,\quad i=1,\dots,p,\ j=1,\dots,q,\\
& \langle w,\phi(x_m)\rangle - \langle w,\phi(x_j^-)\rangle + M(1-d_m) \ge 1 - \eta_{mj},\quad \eta_{mj}\ge 0,\quad m=1,\dots,r,\ j=1,\dots,q,\\
& -\left(\langle w,\phi(x_m)\rangle - \langle w,\phi(x_i^+)\rangle\right) + M d_m \ge 1 - \mu_{mi},\quad \mu_{mi}\ge 0,\quad m=1,\dots,r,\ i=1,\dots,p,
\end{aligned}
\tag{3}
$$

where w is the linear classifier to be identified; the margin size parameter M is a sufficiently large constant introduced to handle the margins between test samples and positive/negative training samples; C1 and C2 are trade-off parameters that balance classifier complexity, training error on the training samples, and the impact of unlabeled test samples; ξij ≥ 0, i = 1, 2, …, p, j = 1, 2, …, q, are slack variables introduced to accommodate non-linearly separable positive–negative pairs in the training set; ηmj ≥ 0, m = 1, 2, …, r, j = 1, 2, …, q, are slack variables introduced for test–negative sample pairs; μmi ≥ 0, m = 1, 2, …, r, i = 1, 2, …, p, are slack variables introduced for test–positive sample pairs; and dm ∈ {0, 1}, m = 1, 2, …, r, are the estimated labels of the unlabeled test samples (0 means negative sample). The objective function in Eq. (3) contains three parts: the first is a penalty on the complexity of the classifier; the second, weighted by C1, contains the training errors on positive–negative pairs; the last, weighted by C2, handles the empirical errors from the unlabeled test data. The constraints likewise contain three parts: the first gives the pair-wise empirical errors from the training set; the second and third give the empirical errors when comparing test samples with negative and positive training samples, respectively (the labels of the test samples are estimated during the optimization process). Because of dm and M, although there are two constraints in Eq. (3) for each test sample, only one of them actually takes effect. In our experiments, we kept C1 and C2 equal to keep the algorithm simple. A key advance in this approach is the inclusion of manifold information from the test samples as part of the AUC-maximizing (or ranking) constraints with respect to positive/negative training samples.
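The following Python sketch (ours, not from the paper) illustrates the big-M mechanism: for a fixed guess of the test labels d, only one of the two constraint families is effectively binding for each test sample, and the total slack it incurs can be computed directly:

```python
import numpy as np

def binding_slack(w, U, P, Q, d):
    """Total slack of the test-sample constraints of problem (3) for a fixed
    label guess d. With M sufficiently large, M(1 - d_m) (resp. M d_m) makes
    one of the two constraint families for test sample m trivially satisfied,
    so only the family matching d_m contributes slack."""
    total = 0.0
    for m, u in enumerate(U):
        if d[m] == 1:   # treated as positive: must outrank every negative
            total += sum(max(0.0, 1 - (w @ u - w @ q)) for q in Q)
        else:           # treated as negative: every positive must outrank it
            total += sum(max(0.0, 1 - (w @ p - w @ u)) for p in P)
    return total
```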

Theorem 1

The optimal solution for quadratic optimization problem 3 can be found by solving the following semidefinite programming (SDP) problem:

$$
\min_{d_3,\psi,\zeta}\ t \quad \text{s.t.}\quad
\begin{bmatrix}
K & \dfrac{d_3-\psi+\zeta}{2}\\[4pt]
\dfrac{(d_3-\psi+\zeta)^T}{2} & t-\psi^T e_3
\end{bmatrix} \succeq 0,\quad \psi \ge 0,\ \zeta \ge 0,
\tag{4}
$$

where

$$
d_3 = \begin{pmatrix} e \\ -d^N \\ -d^P \end{pmatrix},\qquad
d^N_{(m-1)q+j} = M(1-d_m)-1,\ m=1,\dots,r,\ j=1,\dots,q,\qquad
d^P_{(m-1)p+i} = M d_m - 1,\ m=1,\dots,r,\ i=1,\dots,p,
$$
$$
\begin{aligned}
K^{PNPN}_{i_1j_1,\,i_2j_2} &= k_{i_1i_2} - k_{i_1j_2} - k_{j_1i_2} + k_{j_1j_2}, &
K^{PNUN}_{ij_1,\,mj_2} &= k_{im} - k_{ij_2} - k_{j_1m} + k_{j_1j_2},\\
K^{PNUP}_{i_1j,\,mi_2} &= k_{i_1m} - k_{i_1i_2} - k_{jm} + k_{ji_2}, &
K^{UNUN}_{m_1j_1,\,m_2j_2} &= k_{m_1m_2} - k_{m_1j_2} - k_{j_1m_2} + k_{j_1j_2},\\
K^{UNUP}_{m_1j,\,m_2i} &= k_{m_1m_2} - k_{m_1i} - k_{jm_2} + k_{ji}, &
K^{UPUP}_{m_1i_1,\,m_2i_2} &= k_{m_1m_2} - k_{m_1i_2} - k_{i_1m_2} + k_{i_1i_2},
\end{aligned}
$$

kij = ⟨ϕ(xi), ϕ(xj)⟩, where ϕ is the chosen feature mapping (induced by a kernel function), xi and xj are samples from the sets denoted by the corresponding superscripts, KPNUP = (KUPPN)T, KPNUN = (KUNPN)T, KUNUP = (KUPUN)T, and

$$
K = \begin{bmatrix}
+K^{PNPN} & +K^{PNUN} & -K^{PNUP}\\
+K^{UNPN} & +K^{UNUN} & -K^{UNUP}\\
-K^{UPPN} & -K^{UPUN} & +K^{UPUP}
\end{bmatrix}.
$$

The proof of this theorem is shown in Appendix A. In the above definitions, P means positive training samples; N means negative training samples; U means unknown test samples. Each block in the block matrix K contains kernel function values from four datasets denoted by its superscript.
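All six blocks above share one pattern: each entry is the kernel between two difference vectors ϕ(x_a) − ϕ(x_b) in feature space. The following Python sketch assembles K from a precomputed kernel matrix over all samples (the indexing convention is ours; the theorem does not fix one):

```python
import numpy as np

def diff_block(k, pairs_row, pairs_col):
    """Generic block of Theorem 1: entry ((a,b),(c,d)) = k_ac - k_ad - k_bc + k_bd,
    i.e. the kernel between difference vectors phi(x_a) - phi(x_b) and
    phi(x_c) - phi(x_d)."""
    B = np.empty((len(pairs_row), len(pairs_col)))
    for r, (a, b) in enumerate(pairs_row):
        for c, (cc, dd) in enumerate(pairs_col):
            B[r, c] = k[a, cc] - k[a, dd] - k[b, cc] + k[b, dd]
    return B

def assemble_K(k, pos, neg, tst):
    """Build the 3x3 block matrix K of Theorem 1 from a full kernel matrix k
    over all samples; pos, neg, tst are index lists into k."""
    PN = [(i, j) for i in pos for j in neg]   # positive-negative training pairs
    UN = [(m, j) for m in tst for j in neg]   # test-negative pairs
    UP = [(m, i) for m in tst for i in pos]   # test-positive pairs
    blocks = ((PN, +1), (UN, +1), (UP, -1))   # signs from the definition of K
    return np.vstack([np.hstack([sr * sc * diff_block(k, R, C)
                                 for C, sc in blocks])
                      for R, sr in blocks])
```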

The AUC optimization problem using semi-supervised learning can also be formulated using 2-norm soft margin:

Optimization problem 4 (2-norm)

$$
\begin{aligned}
\min_{w,\xi,\eta,\mu,d}\quad & \frac{1}{2}\|w\|^2 + \frac{C_1}{2}\sum_{i=1}^{p}\sum_{j=1}^{q}\xi_{ij}^2 + \frac{C_2}{2}\sum_{m=1}^{r}\left(\sum_{j=1}^{q}\eta_{mj}^2 + \sum_{i=1}^{p}\mu_{mi}^2\right)\\
\text{s.t.}\quad & \langle w,\phi(x_i^+)\rangle - \langle w,\phi(x_j^-)\rangle \ge 1 - \xi_{ij},\quad \xi_{ij}\ge 0,\quad i=1,\dots,p,\ j=1,\dots,q,\\
& \langle w,\phi(x_m)\rangle - \langle w,\phi(x_j^-)\rangle + M(1-d_m) \ge 1 - \eta_{mj},\quad \eta_{mj}\ge 0,\quad m=1,\dots,r,\ j=1,\dots,q,\\
& -\left(\langle w,\phi(x_m)\rangle - \langle w,\phi(x_i^+)\rangle\right) + M d_m \ge 1 - \mu_{mi},\quad \mu_{mi}\ge 0,\quad m=1,\dots,r,\ i=1,\dots,p.
\end{aligned}
\tag{5}
$$

Theorem 2

The optimal solution for quadratic optimization problem 4 can be found by solving the following SDP problem:

$$
\min_{d_3,\zeta}\ t \quad \text{s.t.}\quad
\begin{bmatrix}
K & \dfrac{d_3+\zeta}{2}\\[4pt]
\dfrac{(d_3+\zeta)^T}{2} & t
\end{bmatrix} \succeq 0,\quad \zeta \ge 0,
$$

where

$$d_3 = \begin{pmatrix} e \\ -d^N \\ -d^P \end{pmatrix},$$

and

$$
K = \begin{bmatrix}
+K^{PNPN} & +K^{PNUN} & -K^{PNUP}\\
+K^{UNPN} & +K^{UNUN} & -K^{UNUP}\\
-K^{UPPN} & -K^{UPUN} & +K^{UPUP}
\end{bmatrix}.
$$

The proof of Theorem 2 is similar to that of Theorem 1.
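Both theorems reduce to a small SDP once the label guesses dm (and hence d3) are fixed. The paper solves these programs with SDPT3/SeDuMi via YALMIP in Matlab (see Section 4.1); the sketch below expresses the Theorem 1 feasibility structure in Python with cvxpy, treating d3 as given (an assumption of this illustration, since the dm are optimization variables in the full problem):

```python
import numpy as np
import cvxpy as cp

def solve_sslroc1_sdp(K, d3, e3):
    """Solve the SDP of Theorem 1 for a fixed label guess folded into d3,
    with e3 = (C1/2 e; C2/2 e; C2/2 e) as defined in Appendix A."""
    n = K.shape[0]
    t = cp.Variable()
    psi = cp.Variable(n, nonneg=True)
    zeta = cp.Variable(n, nonneg=True)
    v = cp.reshape(d3 - psi + zeta, (n, 1))
    M = cp.bmat([[cp.Constant(K), v / 2],
                 [v.T / 2, cp.reshape(t - e3 @ psi, (1, 1))]])
    prob = cp.Problem(cp.Minimize(t), [M >> 0])  # linear matrix inequality
    prob.solve()
    return t.value, psi.value, zeta.value
```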

4. Experimental validation

4.1. Experimental settings

To evaluate the proposed SSLROC1 (1-norm) and SSLROC2 (2-norm) AUC optimization methods, we compared them with SVMs [35] and three state-of-the-art supervised AUC optimization methods: SVMROC [5], RankBoost [23], and OPAUC [24]. We also compared the proposed methods with two semi-supervised classifiers, SSRankBoost [30] and SVMlin [28,29], to show the advantages unlabeled data bring to the AUC optimization problem. For each tested method and dataset we used 5 × 2-fold cross validation, which consists of 5 repetitions of 2-fold cross validation (CV). The validation method was inspired by Dietterich's 5 × 2 CV paired t-test study [36], which has a low probability of incorrectly detecting a difference when no difference exists (type-I error) and a reasonable probability of detecting a difference when it exists (power). We calculated the AUC for each test fold of the 5 × 2-fold CV using the prediction values from each method, and used the average AUC over the ten test folds to evaluate the performance of each method on each dataset. To determine whether two compared methods differ significantly across multiple datasets, we used a Wilcoxon paired signed rank test.
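A minimal sketch of this evaluation protocol (ours), reusing auc_pairwise from the sketch in Section 1:

```python
import numpy as np
from scipy.stats import wilcoxon

def five_by_two_cv_auc(fit_score, X, y, seed=0):
    """5 repetitions of 2-fold CV; fit_score(X_tr, y_tr, X_te) returns decision
    values for X_te. Returns the mean and the list of the ten test-fold AUCs."""
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(5):
        idx = rng.permutation(len(y))
        half = len(y) // 2
        for tr, te in ((idx[:half], idx[half:]), (idx[half:], idx[:half])):
            s = fit_score(X[tr], y[tr], X[te])
            aucs.append(auc_pairwise(s[y[te] == 1], s[y[te] == -1]))
    return np.mean(aucs), aucs

# Across datasets, two methods are then compared with a paired test, e.g.
# stat, p = wilcoxon(mean_aucs_method_a, mean_aucs_method_b)
```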

For all datasets, we used the z-score to normalize all available features such that each feature was centered to have a mean of zero and scaled to have a standard deviation of one. For SVMs, SVMROC, SSLROC1 and SSLROC2 (the four kernel based learning methods), we used a Gaussian radial basis function (RBF) as the kernel function for the similarity calculation, and the width factor σ was set as the 90th percentile of pairwise distances (in ascending order) between all instances for each dataset. For SVMs and SVMROC, the classifier complexity and training error trade-off parameter C was varied from 1 × 10^−4 to 1 × 10^2, linearly in log10 scale. For RankBoost, we tuned the number of weak learners from 30 to 90 to identify the optimal parameter. We explored the same parameter space for OPAUC as the authors proposed in ref. [24] to identify the optimal parameter combinations: the learning rate η from 2^−12 to 2^10 and the regularization parameter λ from 2^−10 to 2^2 (varied linearly in log2 scale). For SVMlin [28,29], we tested the following parameter combinations: regularization parameters λ ∈ [10^−4, 10^4] and λu ∈ [10^−2, 10^2] (varied linearly in log10 scale). The parameters used for SSRankBoost [30] were the discount factor (0 to 1 in steps of 0.2) and the number of unlabeled examples K (1 to 10). For the proposed methods SSLROC1 and SSLROC2, the trade-off parameter C was varied from 10^−3 to 10^1 (linearly in log10 scale) and the margin size parameter M was set to one of three values: 0.1, 1 and 10.
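As an illustration of this kernel setup (our sketch; the exact form of the RBF exponent, here −d²/(2σ²), is an assumption):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def rbf_kernel_90pct(X):
    """z-score each feature, then build an RBF kernel whose width sigma is
    the 90th percentile of all pairwise distances, as in Section 4.1."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)   # assumes no constant feature
    dists = pdist(Xz)                           # condensed pairwise distances
    sigma = np.percentile(dists, 90)
    sq = squareform(dists) ** 2
    return np.exp(-sq / (2.0 * sigma ** 2)), sigma
```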

Matlab was used as the programming environment in this study. We employed the public open source Matlab toolboxes SDPT3 [37], SeDuMi [38] and YALMIP [39] as the SDP solver. For SVMROC and RankBoost we employed the SVM-KM kernel learning toolbox [40] (http://asi.insa-rouen.fr/~arakotom/toolbox/index). SVMlin was downloaded from http://vikas.sindhwani.org/svmlin.html. OPAUC was obtained from Prof. Zhihua Zhou's lab (http://lamda.nju.edu.cn). SSRankBoost was downloaded from http://ama.liglab.fr/~amini/SSRankBoost/.

4.2. Experimental results on UCI datasets

4.2.1. UCI datasets

To test the proposed ROC optimization algorithm and compare it with SVMs and traditional ROC optimization methods, we employed the University of California, Irvine (UCI) machine learning repository [41]. The UCI machine learning repository contains more than 200 datasets contributed from various application domains and is widely used in the machine learning community to evaluate various algorithms such as clustering, feature extraction, classification, and regression.

To determine the number of datasets needed for the experiments, we performed power analysis [42] using a Wilcoxon paired signed rank test. The power analysis showed that 34 datasets were needed to secure a 10% probability of a type I error and a 20% probability of a type II error (alpha = 0.1, power = 0.8) for the comparison between the proposed method and SVMs. Thus, we randomly selected 34 classification datasets from the UCI repository. Note that we did not account for multiple hypotheses in our sample size calculation. All datasets had an attribute that could be used as a class label. Some were multi-class classification problems converted to binary classification problems based on previously published work using these datasets. In Table 1 we list all datasets used in this study along with the number of instances, attributes, and class labels for each dataset. Due to computational considerations, we randomly selected 100 instances from each dataset if it contained more than 100 instances.

Table 1.

Characteristics of the 34 UCI datasets employed in this study. Under the class labels, “rest” designates that it was a multi-class problem and that the rest of the classes were combined into one class.

# Dataset # Inst # Attr Class label (+1) Class label (−1)
1 Abalone 4177 8 Female Male, infant
2 Arcene_train 100 10,000 Positive Negative
3 Blood 748 5 Donated blood Did not donate
4 Breast 106 9 Car, fad, mas gla, con, adi
5 Bupaliver 345 6 >5 drinks <5 drinks
6 Cancer_wbc 669 9 Malignant Benign
7 Cardio 74 10 Alive after 1 year Died before 1 year
8 cmc 1473 9 Long/use of contraceptives No contraceptive use
9 cnae_9 1080 856 Category: range 1–5 Category: range 6–9
10 Credit_g 1000 20 Good credit Bad credit
11 Derm 366 34 4,5,6 1,2,3
12 E.coli 336 6 pp rest
13 Glass 214 9 7th type Rest
14 Heart 270 14 Absence Presence
15 Hepatitis 155 19 Die Live
16 House 435 16 Democrat Republican
17 Ionosphere 351 34 Bad Good
18 Iris 150 4 setosa Versicolor, virginica
19 Kidney_inflam 120 6 Bladder inflammation No inflammation
20 Kr vs. kp 3196 36 White wins White loses
21 Mushroom 8124 21 Edible Poisonous
22 Parkinsons 197 23 Parkinson’s Healthy
23 pima 768 8 Positive for diabetes No diabetes
24 Post_op 90 8 Patient discharged (s) Rest
25 sonar 208 60 Rock Mine
26 Spectf 267 45 1 0
27 Statlog 690 14 Credit approved Not approved
28 Survival 306 3 Survived 5+ years Died within 5 years
29 Teach 151 5 Low Medium, high
30 Tictactoe 958 9 x wins x loses
31 Vehicle 846 18 van, bus saab, opel
32 Weight 625 4 Right-leaning Balanced/left-leaning
33 Wine 178 12 Cultivar 3 Cultivar 1 and 2
34 Zoo 101 17 Aquatic animals Not aquatic

4.2.2. Results

In Table 2 we show the average AUC of the eight compared methods tested on each of the 34 UCI datasets using the 5 × 2-fold CV. For each dataset and each method tested, the AUC shown was from the optimal parameters which achieved the highest AUC performance. We also show the corresponding standard deviation for each method on each dataset. Standard deviation was calculated based on the ten AUC values from 5 × 2-fold CV. In Table 3 we list the numbers of win-tie-loss between the eight methods (pairwise) on the 34 UCI datasets. We observed that compared with state-of-the-art classification methods, SSLROC1 and SSLROC2 showed superior performance on more datasets. In Table 4 we show p values of the Wilcoxon signed rank tests between the eight methods (pairwise) on the 34 UCI datasets. Since the highest p-value is less than α=0.05, Hochberg’s method for multiple tests of statistical significance [43] indicates that SSLROC1 and SSLROC2 have significantly improved performance compared with other methods. Also from the table we find that the difference between the proposed methods SSLROC1 (1-norm) and SSLROC2 (2-norm) does not reach statistical significance.

Table 2.

Average AUCs of eight methods on the 34 UCI machine learning datasets.

Dataset  SVM (Avg, Std)  SVMROC (Avg, Std)  RankBoost (Avg, Std)  OPAUC (Avg, Std)  SVMlin (Avg, Std)  SSRankBoost (Avg, Std)  SSLROC1 (Avg, Std)  SSLROC2 (Avg, Std)
1 0.700 0.078 0.699 0.079 0.620 0.083 0.701 0.075 0.704 0.078 0.654 0.080 0.703 0.077 0.702 0.078
2 0.806 0.092 0.806 0.092 0.659 0.086 0.701 0.126 0.811 0.094 0.707 0.083 0.815 0.072 0.818 0.069
3 0.735 0.039 0.735 0.036 0.689 0.072 0.716 0.056 0.725 0.045 0.691 0.077 0.737 0.042 0.736 0.048
4 0.906 0.026 0.927 0.038 0.946 0.015 0.877 0.030 0.926 0.031 0.915 0.026 0.910 0.038 0.930 0.026
5 0.631 0.086 0.627 0.088 0.631 0.059 0.641 0.082 0.648 0.095 0.652 0.088 0.638 0.081 0.632 0.084
6 0.993 0.006 0.993 0.006 0.974 0.030 0.993 0.006 0.994 0.006 0.984 0.013 0.994 0.005 0.993 0.005
7 0.984 0.013 0.979 0.013 0.963 0.045 0.984 0.015 0.982 0.013 0.978 0.016 0.983 0.013 0.979 0.012
8 1.00 0.001 1.00 0.001 1.00 0.000 1.00 0.001 1.00 0.001 1.00 0.000 0.999 0.001 1.00 0.001
9 0.932 0.046 0.938 0.049 0.904 0.056 0.958 0.021 0.959 0.025 0.927 0.043 0.943 0.043 0.943 0.043
10 0.734 0.092 0.734 0.092 0.700 0.096 0.725 0.088 0.719 0.099 0.699 0.087 0.733 0.092 0.734 0.091
11 0.993 0.008 0.994 0.007 0.960 0.032 0.994 0.007 0.991 0.011 0.981 0.012 0.996 0.005 0.996 0.005
12 0.948 0.037 0.952 0.044 0.860 0.112 0.931 0.052 0.917 0.070 0.922 0.069 0.947 0.050 0.951 0.047
13 0.963 0.044 0.964 0.045 0.905 0.110 0.954 0.061 0.964 0.034 0.934 0.063 0.970 0.030 0.970 0.036
14 0.923 0.039 0.925 0.036 0.880 0.051 0.928 0.038 0.923 0.038 0.892 0.050 0.925 0.037 0.925 0.038
15 0.856 0.061 0.853 0.061 0.753 0.080 0.842 0.056 0.854 0.052 0.803 0.065 0.854 0.057 0.851 0.060
16 0.988 0.012 0.988 0.011 0.983 0.016 0.989 0.010 0.989 0.010 0.984 0.016 0.991 0.006 0.990 0.009
17 0.951 0.037 0.946 0.044 0.914 0.049 0.889 0.049 0.933 0.023 0.901 0.059 0.964 0.026 0.962 0.028
18 1.00 0.000 1.00 0.000 0.998 0.006 1.00 0.000 1.00 0.000 0.999 0.002 1.00 0.000 1.00 0.000
19 1.00 0.000 1.00 0.000 1.00 0.000 1.00 0.000 1.00 0.000 1.00 0.000 1.00 0.000 1.00 0.000
20 0.884 0.035 0.884 0.046 0.934 0.042 0.873 0.045 0.891 0.049 0.934 0.042 0.892 0.033 0.888 0.041
21 0.954 0.037 0.957 0.036 0.955 0.040 0.954 0.039 0.949 0.041 0.956 0.035 0.961 0.023 0.961 0.023
22 0.878 0.040 0.878 0.041 0.885 0.047 0.879 0.043 0.877 0.037 0.885 0.039 0.878 0.040 0.879 0.041
23 0.766 0.066 0.767 0.066 0.729 0.060 0.766 0.076 0.778 0.045 0.736 0.043 0.784 0.054 0.777 0.049
24 0.604 0.070 0.606 0.072 0.593 0.059 0.607 0.069 0.595 0.082 0.606 0.062 0.607 0.081 0.606 0.065
25 0.830 0.059 0.850 0.045 0.842 0.048 0.839 0.045 0.833 0.043 0.837 0.055 0.845 0.039 0.849 0.047
26 0.852 0.064 0.852 0.064 0.736 0.101 0.852 0.068 0.834 0.081 0.783 0.058 0.849 0.063 0.852 0.065
27 0.928 0.032 0.925 0.027 0.887 0.036 0.924 0.030 0.926 0.034 0.885 0.046 0.928 0.028 0.927 0.031
28 0.646 0.102 0.644 0.101 0.616 0.067 0.646 0.086 0.642 0.106 0.616 0.077 0.647 0.069 0.658 0.084
29 0.708 0.074 0.710 0.086 0.684 0.062 0.670 0.079 0.675 0.093 0.700 0.068 0.731 0.053 0.726 0.049
30 0.750 0.105 0.744 0.110 0.762 0.091 0.679 0.070 0.692 0.083 0.765 0.089 0.774 0.074 0.782 0.064
31 0.949 0.033 0.951 0.036 0.859 0.050 0.850 0.074 0.937 0.041 0.877 0.034 0.956 0.029 0.966 0.024
32 0.988 0.009 0.989 0.009 0.958 0.028 0.974 0.013 0.988 0.010 0.913 0.034 0.988 0.013 0.988 0.013
33 1.00 0.001 1.00 0.001 0.976 0.053 1.00 0.001 1.00 0.001 0.994 0.008 1.00 0.000 1.00 0.000
34 0.989 0.010 0.990 0.009 0.992 0.007 0.990 0.010 0.991 0.011 0.991 0.009 0.991 0.008 0.991 0.008
Avg 0.875 0.043 0.877 0.044 0.845 0.053 0.863 0.045 0.872 0.044 0.856 0.045 0.880 0.038 0.881 0.038
Table 3.

Numbers of wins–ties–losses (superior–equal–inferior AUC) between the eight methods (pairwise) on the 34 UCI datasets. For the three numbers shown in each entry of the table, the first is the number of wins of the method in the left column compared with the method at the top; the middle is the number of ties between them; and the third is the number of losses. (M1: SVM; M2: SVMROC; M3: RankBoost; M4: OPAUC; M5: SVMlin; M6: SSRankBoost; M7: SSLROC1; M8: SSLROC2).

Method M1 M2 M3 M4 M5 M6 M7 M8
M1 0–34–0 13–6–15 25–1–8 17–2–15 19–2–13 23–1–10 7–2–25 5–2–27
M2 15–6–13 0–34–0 26–1–7 19–2–13 18–2–14 27–1–6 7–2–25 6–2–26
M3 8–1–25 7–1–26 0–34–0 11–1–22 9–1–24 9–3–22 5–1–28 5–1–28
M4 15–2–17 13–2–19 22–1–11 0–34–0 13–3–18 21–1–12 7–2–25 9–2–23
M5 13–2–19 14–2–18 24–1–9 18–3–13 0–34–0 23–1–10 7–2–25 9–2–23
M6 10–1–23 6–1–27 22–3–9 12–1–21 10–1–23 0–34–0 5–1–28 5–1–28
M7 25–2–7 25–2–7 28–1–5 25–2–7 25–2–7 28–1–5 0–34–0 15–6–13
M8 27–2–5 26–2–6 28–1–5 23–2–9 23–2–9 28–1–5 13–6–15 0–34–0
Table 4.

p-values of the Wilcoxon signed rank tests between the eight methods (pairwise). (M1: SVM; M2: SVMROC; M3: RankBoost; M4: OPAUC; M5: SVMlin; M6: SSRankBoost; M7: SSLROC1; M8: SSLROC2).

Method M1 M2 M3 M4 M5 M6 M7 M8
M1 1.000 0.539 0.000 0.082 0.246 0.002 0.000 0.000
M2 0.539 1.000 0.000 0.026 0.145 0.000 0.003 0.000
M3 0.000 0.000 1.000 0.012 0.001 0.003 0.000 0.000
M4 0.082 0.026 0.012 1.000 0.117 0.098 0.001 0.001
M5 0.246 0.145 0.001 0.117 1.000 0.006 0.001 0.004
M6 0.002 0.000 0.003 0.098 0.006 1.000 0.000 0.000
M7 0.000 0.003 0.000 0.001 0.001 0.000 1.000 0.820
M8 0.000 0.000 0.000 0.001 0.004 0.000 0.820 1.000

For the proposed SSLROC1 and SSLROC2 methods, there are two critical parameters which control their generalization ability: the training error trade-off parameter C and the margin size parameter M. To identify the influence of C and M on the performance of the proposed methods, in Fig. 1 we show the average AUC of SSLROC1 and SSLROC2 on three example UCI datasets for the different C and M values used in the experiment. From these example cases, we can see a trend in the parameter combinations that leads to better performance. To reduce computational load, we explored only a small parameter space spanned by M and C, with 15 combinations in total, which is few. For comparison, in the work on OPAUC, the authors tested a parameter space spanned by the learning rate η (2^−12 to 2^10) and the regularization parameter λ (2^−10 to 2^2), 299 parameter combinations in total. From the trends shown on the example UCI datasets, there is a high probability that exploring a larger parameter space would lead to better AUC performance.

Fig. 1.

Average AUCs of 5 × 2-fold CV on three example UCI datasets. The first column corresponds to SSLROC1 and the second column corresponds to SSLROC2. Each row corresponds to one dataset. For both SSLROC1 and SSLROC2, we show results in the same parameter space spanned by log 10(C) and log 10(M).

4.3. Experimental results on CTC dataset

Colorectal cancer is the second-leading cause of cancer death in Americans [44]. Computed tomographic colonography (CTC), also known as virtual colonoscopy, provides a less invasive alternative to optical colonoscopy for screening patients for colonic polyps [45]. In Fig. 2, we show a 3D volume rendering of a segmented colon and a typical colonic polyp on a fold. Previous studies showed that computer-aided detection (CAD) systems can assist radiologists in CTC reading and improve their detection performance [46–49]. To show the effectiveness of our proposed methods and their potential application in CTC CAD systems, we tested all eight methods on a CTC dataset and analyzed the results using ROC analysis.

Fig. 2.

3D volume rendering of a segmented colon (left figure) with spine and ribs; a typical colonic polyp on the fold (right figure).

4.3.1. CTC datasets

Our dataset consisted of CTC examinations of 50 patients collected from three medical centers. Each patient had one or more polyps ≥6 mm confirmed by histopathological evaluation following optical colonoscopy (OC). Each patient was scanned in the supine and prone positions, and each scan was performed during a single breath hold using a 4- or 8-channel CT scanner. CT scanning parameters included 1.25- to 2.5-mm section collimation, 15 mm/s table speed, 1-mm reconstruction interval, 100 mAs, and 120 kVp. For each CT scan in the dataset, we first segmented the colon from the original 3D image [50]. Then we searched the inner surface of the colon to identify initial colonic polyp candidates. Our initial detection scheme, based on surface curvature analysis, reported 60 colonic polyps 5–30 mm in size and 5234 false positives. The labels of the initial detections were determined by OC examination, which is the gold standard in CTC. Each initial detection, referred to as a CAD detection, represents a candidate polyp. After initial detection we extracted 157 3D geometric features from each colonic polyp candidate [47]. To make the problem computationally feasible, we filtered the initial dataset to 100 CAD detections (49 true detections and 51 false positives) by removing true and false positives with low SVM vote values predicted by an SVM committee classifier [51]. 5 × 2-fold CV was performed on the filtered dataset, and the test set in each CV fold was treated as unlabeled samples under our SSL framework.

4.3.2. Results

In Fig. 3, we show the AUCs of the eight methods on the CTC dataset. RankBoost showed the highest performance, with an AUC of 0.914. The proposed SSLROC2 method ranked second, with an AUC of 0.909. Note that both SSLROC1 and SSLROC2 outperformed all other semi-supervised learning methods for AUC maximization. In Fig. 4, we show comparisons of SSLROC1 and SSLROC2 with different parameters C and M. Both achieved their highest performance when log 10(C) = −1 and log 10(M) = 0.

Fig. 3.

Comparison of AUCs of eight methods on the CTC dataset.

Fig. 4.

Average AUC of SSLROC1 (a) and SSLROC2 (b) on the CTC dataset when different C values (classifier complexity and training error trade-off parameter) and M values (margin size parameter) were used in the experiment.

5. Discussion and conclusion

We proposed two new AUC optimization methods called SSLROC1 and SSLROC2, which introduce test samples into the optimization of margins in a binary classification problem for the purpose of AUC maximization. We tested the proposed methods on 34 randomly selected UCI machine learning datasets. The SSLROC algorithms were found to have superior AUCs in a significantly larger fraction of UCI datasets compared with SVMs, SVMROC, RankBoost, OPAUC, SVMlin, and SSRankBoost, which are state-of-the-art classification and AUC optimization methods. The proposed methods also showed advantages over all compared methods except RankBoost in a colonic polyp classification problem on a dataset of CT colonography cases.

SVMs have a complexity of O(kn^2) for RBF kernels and O(kn) for linear kernels, where n and k are the number of training samples and features, respectively. For our proposed method the computational complexity increases to O(25kn^4/16) and O(25kn^2/16) for RBF and linear kernels, respectively, due to the introduction of test samples during AUC optimization. Here we assume that the training and testing sets have the same number of instances, which is the case for 5 × 2-fold cross validation; we also assume that the numbers of positive and negative samples are equal. For SVMROC, the computational complexity is O(kn^4/4) and O(kn^2/4) for RBF and linear kernels, respectively, under the same assumption of equal numbers of positive and negative samples. This analysis shows that the computational complexity is two orders of magnitude higher for both SVMROC and SSLROC than for SVMs. For this reason the proposed method was applied only to small datasets in our study. However, the increased complexity of our method is balanced by its significantly higher performance over the other techniques. In future work we will investigate how to develop a more computationally efficient algorithm, likely using more efficient algorithms to approximate the solution of the AUC maximization problem in large datasets [8].
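The 25/16 factor can be reproduced by counting the pair-wise variables of problem (3) under the stated assumptions (p = q = n/2 labeled samples, r = n unlabeled test samples), treating each pair as one training item of the O(k·m²) RBF-kernel SVM:

```latex
m = \underbrace{pq}_{\xi_{ij}} + \underbrace{rq}_{\eta_{mj}} + \underbrace{rp}_{\mu_{mi}}
  = \frac{n}{2}\cdot\frac{n}{2} + n\cdot\frac{n}{2} + n\cdot\frac{n}{2}
  = \frac{5n^2}{4},
\qquad
O\!\left(k\,m^2\right) = O\!\left(\frac{25\,k\,n^4}{16}\right).
```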

Another potential disadvantage of the SSLROC method (and of all transductive learning algorithms) is that when a new test dataset is acquired, the algorithm needs to be re-trained using the new test set as unlabeled data. This is in contrast to inductive learning algorithms (including all supervised algorithms), where the trained classifier can be directly applied to a new test dataset. In the field of computer-aided detection and diagnosis for radiological images, it is preferable to have a well-trained CAD system that can be deployed to hospitals or clinics without further training. Thus, a future research topic of interest will be to combine online and transductive learning to address the retraining issue in transductive AUC learning.

As shown in the previous section, the difference between SSLROC1 and SSLROC2 did not reach statistical significance on the 34 UCI datasets. In the literature, Ng showed that when L1 regularization is employed, the sample complexity (the minimum number of training examples required to train a good classifier) grows only logarithmically as the number of irrelevant features in the dataset increases [52]; L2 regularization has a worst-case sample complexity that grows at least linearly. In their work on 1-norm SVMs, Zhu et al. also argue that the 1-norm SVM has some advantages over the 2-norm SVM when the data contain redundant noise features [53]. For the proposed methods, the major difference is that we use different norms for the regularization. Based on the studies above, SSLROC1 should therefore beat SSLROC2 when the data contain irrelevant noisy features. However, in the experimental results shown in Table 2, we did not observe such a trend when comparing the average AUCs of SSLROC1 and SSLROC2. We suspect that this may be related to the small dataset sizes employed in this study. In the future, it will be interesting to investigate how the data size affects the generalization performance of the two proposed methods.

In conclusion, we developed new methods of AUC optimization based on semi-supervised learning and transductive learning that yield improved classifier performance on multiple public datasets. The proposed methods may lead to improved classification performance in diverse realms of data analysis including medical imaging and computer vision.

Acknowledgments

This work was supported by the Intramural Research Programs of the NIH Clinical Center and the Food and Drug Administration. We thank the NIH Biowulf computer cluster and Ms. Sylvia Wilkerson for their support on parallel computations. No official endorsement by the National Institutes of Health or the Food and Drug Administration of any equipment or product of any company mentioned in the publication should be inferred.

Biography

Dr. Shijun Wang received his PhD degree in Control Science and Engineering from Tsinghua University, China, where his research focused on machine learning and complex systems. He earned a BS in Electronic Engineering at Beihang University and an MS in Communication at Second Aerospace Science Academy, China. Dr. Shijun Wang’s current research interests in the Imaging Biomarkers and Computer-Aided Diagnosis Laboratory include machine learning, statistical image analysis and their applications in computer-aided diagnosis. He is an associate editor of Medical Physics and reviewer for IEEE TMI, IEEE TBME, Journal of Artificial Intelligence Research, Journal of Magnetic Resonance Imaging, Medical Physics, Pattern Analysis & Applications, and Journal of Theoretical Biology.

Appendix A. Proof of Theorem 1

Proof

By using the Lagrange multiplier optimization method [54], we transform the constrained optimization problem 3 into the following unconstrained primal Lagrange function:

$$
\begin{aligned}
L_{p1}(w,\xi,\eta,\mu,\alpha,\beta,\chi,\gamma,\kappa,\lambda) ={}& \frac{1}{2}\|w\|^2 + \frac{C_1}{2}\sum_{i=1}^{p}\sum_{j=1}^{q}\xi_{ij} + \frac{C_2}{2}\sum_{m=1}^{r}\left(\sum_{j=1}^{q}\eta_{mj} + \sum_{i=1}^{p}\mu_{mi}\right)\\
&- \sum_{i=1}^{p}\sum_{j=1}^{q}\alpha_{ij}\left(\langle w,\phi(x_i^+)\rangle - \langle w,\phi(x_j^-)\rangle + \xi_{ij} - 1\right) - \sum_{i=1}^{p}\sum_{j=1}^{q}\beta_{ij}\xi_{ij}\\
&- \sum_{m=1}^{r}\sum_{j=1}^{q}\chi_{mj}\left(\langle w,\phi(x_m)\rangle - \langle w,\phi(x_j^-)\rangle + M(1-d_m) + \eta_{mj} - 1\right) - \sum_{m=1}^{r}\sum_{j=1}^{q}\gamma_{mj}\eta_{mj}\\
&- \sum_{m=1}^{r}\sum_{i=1}^{p}\kappa_{mi}\left(-\left(\langle w,\phi(x_m)\rangle - \langle w,\phi(x_i^+)\rangle\right) + M d_m + \mu_{mi} - 1\right) - \sum_{m=1}^{r}\sum_{i=1}^{p}\lambda_{mi}\mu_{mi}.
\end{aligned}
$$

The Karush–Kuhn–Tucker (KKT) conditions [54] for optimal primal variables w, ξ, η, and μ are

Stationarity:

$$
\begin{aligned}
\frac{\partial L_{p1}}{\partial w} &= w - \sum_{i=1}^{p}\sum_{j=1}^{q}\alpha_{ij}\left(\phi(x_i^+)-\phi(x_j^-)\right) - \sum_{m=1}^{r}\sum_{j=1}^{q}\chi_{mj}\left(\phi(x_m)-\phi(x_j^-)\right) + \sum_{m=1}^{r}\sum_{i=1}^{p}\kappa_{mi}\left(\phi(x_m)-\phi(x_i^+)\right) = 0,\\
\frac{\partial L_{p1}}{\partial \xi} &= \frac{C_1}{2}e - \alpha - \beta = 0,\qquad
\frac{\partial L_{p1}}{\partial \eta} = \frac{C_2}{2}e - \chi - \gamma = 0,\qquad
\frac{\partial L_{p1}}{\partial \mu} = \frac{C_2}{2}e - \kappa - \lambda = 0;
\end{aligned}
$$

Primal feasibility:

$$
\begin{aligned}
&\langle w,\phi(x_i^+)\rangle - \langle w,\phi(x_j^-)\rangle \ge 1-\xi_{ij},\quad \xi_{ij}\ge 0,\quad i=1,\dots,p,\ j=1,\dots,q,\\
&\langle w,\phi(x_m)\rangle - \langle w,\phi(x_j^-)\rangle + M(1-d_m) \ge 1-\eta_{mj},\quad \eta_{mj}\ge 0,\quad m=1,\dots,r,\ j=1,\dots,q,\\
&-\left(\langle w,\phi(x_m)\rangle - \langle w,\phi(x_i^+)\rangle\right) + M d_m \ge 1-\mu_{mi},\quad \mu_{mi}\ge 0,\quad m=1,\dots,r,\ i=1,\dots,p;
\end{aligned}
$$

Dual feasibility:

$$
\alpha_{ij}\ge 0,\ \beta_{ij}\ge 0,\quad i=1,\dots,p,\ j=1,\dots,q;\qquad
\chi_{mj}\ge 0,\ \gamma_{mj}\ge 0,\quad m=1,\dots,r,\ j=1,\dots,q;\qquad
\kappa_{mi}\ge 0,\ \lambda_{mi}\ge 0,\quad m=1,\dots,r,\ i=1,\dots,p;
$$

Complementary slackness:

$$
\begin{aligned}
&\alpha_{ij}\left(\langle w,\phi(x_i^+)\rangle - \langle w,\phi(x_j^-)\rangle + \xi_{ij} - 1\right) = 0,\quad \beta_{ij}\xi_{ij}=0,\quad i=1,\dots,p,\ j=1,\dots,q,\\
&\chi_{mj}\left(\langle w,\phi(x_m)\rangle - \langle w,\phi(x_j^-)\rangle + M(1-d_m) + \eta_{mj} - 1\right) = 0,\quad \gamma_{mj}\eta_{mj}=0,\quad m=1,\dots,r,\ j=1,\dots,q,\\
&\kappa_{mi}\left(-\left(\langle w,\phi(x_m)\rangle - \langle w,\phi(x_i^+)\rangle\right) + M d_m + \mu_{mi} - 1\right) = 0,\quad \lambda_{mi}\mu_{mi}=0,\quad m=1,\dots,r,\ i=1,\dots,p,
\end{aligned}
$$

where

$$
\begin{aligned}
\alpha &= \{\alpha_{11},\alpha_{12},\dots,\alpha_{1q},\alpha_{21},\alpha_{22},\dots,\alpha_{2q},\dots,\alpha_{p1},\dots,\alpha_{pq}\},\\
\beta &= \{\beta_{11},\beta_{12},\dots,\beta_{1q},\beta_{21},\beta_{22},\dots,\beta_{2q},\dots,\beta_{p1},\dots,\beta_{pq}\},\\
\chi &= \{\chi_{11},\chi_{12},\dots,\chi_{1q},\chi_{21},\chi_{22},\dots,\chi_{2q},\dots,\chi_{r1},\dots,\chi_{rq}\},\\
\gamma &= \{\gamma_{11},\gamma_{12},\dots,\gamma_{1q},\gamma_{21},\gamma_{22},\dots,\gamma_{2q},\dots,\gamma_{r1},\dots,\gamma_{rq}\},\\
\kappa &= \{\kappa_{11},\kappa_{12},\dots,\kappa_{1p},\kappa_{21},\kappa_{22},\dots,\kappa_{2p},\dots,\kappa_{r1},\dots,\kappa_{rp}\},\\
\lambda &= \{\lambda_{11},\lambda_{12},\dots,\lambda_{1p},\lambda_{21},\lambda_{22},\dots,\lambda_{2p},\dots,\lambda_{r1},\dots,\lambda_{rp}\},
\end{aligned}
$$

and e is a vector of all ones.

The optimal w can be achieved at:

$$
w = \sum_{i=1}^{p}\sum_{j=1}^{q}\alpha_{ij}\left(\phi(x_i^+)-\phi(x_j^-)\right) + \sum_{m=1}^{r}\sum_{j=1}^{q}\chi_{mj}\left(\phi(x_m)-\phi(x_j^-)\right) - \sum_{m=1}^{r}\sum_{i=1}^{p}\kappa_{mi}\left(\phi(x_m)-\phi(x_i^+)\right).
$$

Let us define
$$
\begin{aligned}
K^{PNPN}_{i_1j_1,\,i_2j_2} &= k_{i_1i_2} - k_{i_1j_2} - k_{j_1i_2} + k_{j_1j_2}, &
K^{PNUN}_{ij_1,\,mj_2} &= k_{im} - k_{ij_2} - k_{j_1m} + k_{j_1j_2},\\
K^{PNUP}_{i_1j,\,mi_2} &= k_{i_1m} - k_{i_1i_2} - k_{jm} + k_{ji_2}, &
K^{UNUN}_{m_1j_1,\,m_2j_2} &= k_{m_1m_2} - k_{m_1j_2} - k_{j_1m_2} + k_{j_1j_2},\\
K^{UNUP}_{m_1j,\,m_2i} &= k_{m_1m_2} - k_{m_1i} - k_{jm_2} + k_{ji}, &
K^{UPUP}_{m_1i_1,\,m_2i_2} &= k_{m_1m_2} - k_{m_1i_2} - k_{i_1m_2} + k_{i_1i_2},
\end{aligned}
$$
where kij = ⟨ϕ(xi), ϕ(xj)⟩ and xi and xj are samples from the sets denoted by the corresponding superscripts.

In the above definitions, P means positive training samples; N means negative training samples; U means unknown test samples. KPNPN defines the kernel matrix between positive–negative training sample pairs; KPNUN defines the kernel matrix between positive–negative training sample pairs and test–negative sample pairs; KPNUP defines the kernel matrix between positive–negative training sample pairs and test–positive sample pairs; KUNUN defines the kernel matrix between test–negative sample pairs; KUNUP defines the kernel matrix between test–negative sample pairs and test–positive sample pairs; KUPUP defines the kernel matrix between test–positive sample pairs. Here, negative and positive samples come only from the training set, and test samples come only from the test set.

Therefore:

$$
\begin{aligned}
L_{p1}(w,\xi,\eta,\mu,\alpha,\beta,\chi,\gamma,\kappa,\lambda) ={}& -\frac{1}{2}\left(\alpha^T K^{PNPN}\alpha + 2\alpha^T K^{PNUN}\chi - 2\alpha^T K^{PNUP}\kappa + \chi^T K^{UNUN}\chi - 2\chi^T K^{UNUP}\kappa + \kappa^T K^{UPUP}\kappa\right)\\
&+ \sum_{i=1}^{p}\sum_{j=1}^{q}\alpha_{ij} - \sum_{m=1}^{r}\sum_{j=1}^{q}\chi_{mj}\left(M(1-d_m)-1\right) - \sum_{m=1}^{r}\sum_{i=1}^{p}\kappa_{mi}\left(M d_m - 1\right).
\end{aligned}
$$

After eliminating the primal variables, we obtain the dual representation of the optimization problem as follows:

$$
\begin{aligned}
\max\ L_{p1}(\alpha,\chi,\kappa) ={}& -\frac{1}{2}\left(\alpha^T K^{PNPN}\alpha + 2\alpha^T K^{PNUN}\chi - 2\alpha^T K^{PNUP}\kappa + \chi^T K^{UNUN}\chi - 2\chi^T K^{UNUP}\kappa + \kappa^T K^{UPUP}\kappa\right)\\
&+ \alpha^T e - \sum_{m=1}^{r}\sum_{j=1}^{q}\chi_{mj}\left(M(1-d_m)-1\right) - \sum_{m=1}^{r}\sum_{i=1}^{p}\kappa_{mi}\left(M d_m - 1\right),
\end{aligned}
$$

with constraints 0 ≤ α ≤ C1/2, 0 ≤ χ ≤ C2/2, 0 ≤ κ ≤ C2/2. Thus, the Lagrangian of the maximization problem can be defined as

$$
\begin{aligned}
L_{p1}(\alpha,\chi,\kappa,\nu,o,\theta,\rho,\sigma,\tau) ={}& -\frac{1}{2}\left(\alpha^T K^{PNPN}\alpha + 2\alpha^T K^{PNUN}\chi - 2\alpha^T K^{PNUP}\kappa + \chi^T K^{UNUN}\chi - 2\chi^T K^{UNUP}\kappa + \kappa^T K^{UPUP}\kappa\right)\\
&+ \alpha^T e - \chi^T d^N - \kappa^T d^P + \nu^T\left(\frac{C_1}{2}e-\alpha\right) + o^T\alpha + \theta^T\left(\frac{C_2}{2}e-\chi\right) + \rho^T\chi + \sigma^T\left(\frac{C_2}{2}e-\kappa\right) + \tau^T\kappa,
\end{aligned}
$$

where $d^N_{(m-1)q+j} = M(1-d_m)-1$, m = 1, 2, …, r, j = 1, 2, …, q, and $d^P_{(m-1)p+i} = M d_m - 1$, m = 1, 2, …, r, i = 1, 2, …, p, subject to ν ≥ 0, o ≥ 0, θ ≥ 0, ρ ≥ 0, σ ≥ 0, τ ≥ 0.

Let us define

$$
\omega = \begin{pmatrix}\alpha\\ \chi\\ \kappa\end{pmatrix},\quad
\psi = \begin{pmatrix}\nu\\ \theta\\ \sigma\end{pmatrix},\quad
\zeta = \begin{pmatrix}o\\ \rho\\ \tau\end{pmatrix}
\quad\text{and}\quad
K = \begin{bmatrix}
+K^{PNPN} & +K^{PNUN} & -K^{PNUP}\\
+K^{UNPN} & +K^{UNUN} & -K^{UNUP}\\
-K^{UPPN} & -K^{UPUN} & +K^{UPUP}
\end{bmatrix},
$$

where KPNUP = (KUPPN)T, KPNUN = (KUNPN)T, KUNUP = (KUPUN)T. Then

$$
L_{p1}(\alpha,\chi,\kappa,\nu,o,\theta,\rho,\sigma,\tau) = L_{p1}(\omega,\psi,\zeta) = -\frac{1}{2}\omega^T K\omega + \omega^T\left(\begin{pmatrix}e\\-d^N\\-d^P\end{pmatrix} - \psi + \zeta\right) + \psi^T\begin{pmatrix}\frac{C_1}{2}e\\[2pt]\frac{C_2}{2}e\\[2pt]\frac{C_2}{2}e\end{pmatrix}.
$$

Let us define

$$
d_3 = \begin{pmatrix}e\\-d^N\\-d^P\end{pmatrix}
\quad\text{and}\quad
e_3 = \begin{pmatrix}\frac{C_1}{2}e\\[2pt]\frac{C_2}{2}e\\[2pt]\frac{C_2}{2}e\end{pmatrix}.
$$

Therefore

$$
L_{p1}(\omega,\psi,\zeta) = -\frac{1}{2}\omega^T K\omega + \omega^T(d_3-\psi+\zeta) + \psi^T e_3,\quad \text{s.t.}\ \omega\ge 0,\ \psi\ge 0,\ \zeta\ge 0.
$$

Based on duality, we have the following equivalent problems:

$$
\max_{\omega\ge 0}\ \min_{\psi\ge 0,\,\zeta\ge 0} L_{p1}(\omega,\psi,\zeta) = \min_{\psi\ge 0,\,\zeta\ge 0}\ \max_{\omega\ge 0} L_{p1}(\omega,\psi,\zeta).
$$

The inner maximization is attained at

$$
\frac{\partial L_{p1}(\omega,\psi,\zeta)}{\partial \omega} = -K\omega + (d_3-\psi+\zeta) = 0 \quad\Rightarrow\quad \omega = K^{-1}(d_3-\psi+\zeta).
$$

So

$$
\begin{aligned}
\min_{\psi\ge 0,\,\zeta\ge 0}\ \max_{\omega\ge 0} L_{p1}(\omega,\psi,\zeta)
&= \min_{\psi\ge 0,\,\zeta\ge 0}\left.\left(-\frac{1}{2}\omega^T K\omega + \omega^T(d_3-\psi+\zeta) + \psi^T e_3\right)\right|_{\omega = K^{-1}(d_3-\psi+\zeta)}\\
&= \min_{\psi\ge 0,\,\zeta\ge 0}\ \frac{1}{2}(d_3-\psi+\zeta)^T K^{-1}(d_3-\psi+\zeta) + \psi^T e_3.
\end{aligned}
$$

Let t ≥ 0 be an upper bound for the minimization problem:

$$
t \ge \min_{\psi\ge 0,\,\zeta\ge 0}\ \frac{1}{2}(d_3-\psi+\zeta)^T K^{-1}(d_3-\psi+\zeta) + \psi^T e_3.
$$

Using the Schur complement [54], we obtain

$$
\begin{bmatrix}
K & \dfrac{d_3-\psi+\zeta}{2}\\[4pt]
\dfrac{(d_3-\psi+\zeta)^T}{2} & t-\psi^T e_3
\end{bmatrix} \succeq 0,\quad \psi\ge 0,\ \zeta\ge 0.
$$

So we have the following SDP problem:

$$
\min_{d_3,\psi,\zeta}\ t \quad\text{s.t.}\quad
\begin{bmatrix}
K & \dfrac{d_3-\psi+\zeta}{2}\\[4pt]
\dfrac{(d_3-\psi+\zeta)^T}{2} & t-\psi^T e_3
\end{bmatrix} \succeq 0,\quad \psi\ge 0,\ \zeta\ge 0.
$$

In practice, we found that adding a regularizer diag(I1/C1, I2/C2, I3/C2) to K increases the positive definiteness of K and leads to better performance, where diag denotes a block diagonal matrix and I1, I2, and I3 are identity matrices of the same sizes as KPNPN, KUNUN, and KUPUP, respectively.
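A one-line illustration of this regularization step (our sketch, with block sizes |PN| = pq, |UN| = rq, |UP| = rp):

```python
import numpy as np

def regularize_K(K, n_pn, n_un, n_up, C1, C2):
    """Add diag(I1/C1, I2/C2, I3/C2) to the block kernel matrix K, matching
    the sizes of the K^PNPN, K^UNUN, and K^UPUP blocks."""
    d = np.concatenate([np.full(n_pn, 1.0 / C1),
                        np.full(n_un, 1.0 / C2),
                        np.full(n_up, 1.0 / C2)])
    return K + np.diag(d)
```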

Footnotes

1

Matlab code of the proposed methods will be released at http://clinicalcenter.nih.gov/drd/summers.html once the paper is published.

Conflict of interest

Dr. Ronald Summers receives patent royalties and research support from iCAD.

References

1. Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997;30:1145–1159.
2. Hand DJ, Till RJ. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn. 2001;45:171–186.
3. Ling CX, Huang J, Zhang H. AUC: a statistically consistent and more discriminating measure than accuracy. In: 18th International Joint Conference on Artificial Intelligence (IJCAI '03); 2003.
4. Cortes C, Mohri M. AUC optimization vs. error rate minimization. In: Advances in Neural Information Processing Systems; 2004.
5. Rakotomamonjy A. Optimizing area under ROC curves with SVMs. In: ECAI 04 ROC and Artificial Intelligence Workshop; 2004.
6. Brefeld U, Scheffer T. AUC maximizing support vector learning. In: Workshop on ROC Analysis in Machine Learning; 2005.
7. Huang J, Ling CX. Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Eng. 2005;17:299–310.
8. Calders T, Jaroszewicz S. Efficient AUC optimization for classification. In: Knowledge Discovery in Databases: PKDD 2007, vol. 4702; 2007.
9. Lee WH, Gader PD, Wilson JN. Optimizing the area under a receiver operating characteristic curve with application to land-mine detection. IEEE Trans Geosci Remote Sens. 2007;45:389–397.
10. Vanderlooy S, Hüllermeier E. A critical analysis of variants of the AUC. Mach Learn. 2008;72:247–262.
11. Toh KA, Kim J, Lee S. Maximizing area under ROC curve for biometric scores fusion. Pattern Recognit. 2008;41:3373–3392.
12. Flach P, Hernández-Orallo J, Ferri C. A coherent interpretation of AUC as a measure of aggregated classification performance. In: International Conference on Machine Learning; 2011.
13. Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots—a fundamental evaluation tool in clinical medicine. Clin Chem. 1993;39(4):561–577.
14. Wang S, McKenna M, Petrick N, Sahiner B, Linguraru MG, Wei Z, Yao J, Summers RM. ROC-like optimization by sample ranking: application to CT colonography. In: 2012 9th IEEE International Symposium on Biomedical Imaging (ISBI); 2012. pp. 478–481.
15. Berrar D, Flach P. Caveats and pitfalls of ROC analysis in clinical microarray research (and how to avoid them). Brief Bioinform. 2012;13(1):83–97. doi:10.1093/bib/bbr008.
16. Lobo JM, Jiménez-Valverde A, Real R. AUC: a misleading measure of the performance of predictive distribution models. Glob Ecol Biogeogr. 2007;17(2):145–151.
17. Hanczar B, Hua J, Sima C, Weinstein J, Bittner M, Dougherty ER. Small-sample precision of ROC-related estimates. Bioinformatics. 2010;26(6):822–830. doi:10.1093/bioinformatics/btq037.
18. Marrocco C, Duin RPW, Tortorella F. Maximizing the area under the ROC curve by pairwise feature combination. Pattern Recognit. 2008;41(6):1961–1974.
19. Liu Y, Yao X. Ensemble learning via negative correlation. Neural Netw. 1999;12(10):1399–1404. doi:10.1016/s0893-6080(99)00073-8.
20. Dietterich TG. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Mach Learn. 2000;40(2):139–157.
21. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–140.
22. Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. In: Computational Learning Theory, vol. 904. Berlin/Heidelberg: Springer; 1995. pp. 23–37.
23. Freund Y, Iyer R, Schapire R, Singer Y. An efficient boosting algorithm for combining preferences. J Mach Learn Res. 2003;4:933–969.
24. Gao W, Jin R, Zhu S, Zhou Z-H. One-pass AUC optimization. In: 30th International Conference on Machine Learning; 2013.
25. Chapelle O, Schölkopf B, Zien A. Semi-supervised Learning. Cambridge, MA, USA: MIT Press; 2006.
26. Zhu X. Semi-supervised learning literature survey. Technical Report. Madison: University of Wisconsin; 2007.
27. Gammerman A, Vovk V, Vapnik V. Learning by transduction. In: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI); 1998. pp. 148–155.
28. Sindhwani V, Keerthi SS. Large scale semi-supervised linear SVMs. In: The 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 2006. pp. 477–484.
29. Sindhwani V, Keerthi SS. Newton methods for fast solution of semi-supervised linear SVMs. In: Large Scale Kernel Machines. Cambridge, MA, USA: MIT Press; 2007.
30. Amini M-R, Truong T-V, Goutte C. A boosting algorithm for learning bipartite ranking functions with partially labeled data. In: ACM Special Interest Group on Information Retrieval (ACM SIGIR); 2008. pp. 99–106.
31. Ralaivola L. Semi-supervised bipartite ranking with the normalized Rayleigh coefficient. In: European Symposium on Artificial Neural Networks—Advances in Computational Intelligence and Learning; 2009. pp. 47–52.
32. Usunier N, Amini M-R, Goutte C. Multiview semi-supervised learning for ranking multilingual documents. In: Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol. 6913. New York, US: Springer; 2011. pp. 443–458.
33. Vandenberghe L, Boyd S. Semidefinite programming. SIAM Rev. 1996;38(1):49–95.
34. Bennett K, Demiriz A. Semi-supervised support vector machines. In: Kearns MJ, Solla SA, Cohn DA, editors. Advances in Neural Information Processing Systems, vol. 11; 1999. pp. 368–374.
35. Chang CC, Lin CJ. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol. 2011;2(3):27:1–27:27.
36. Dietterich TG. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 1998;10(7):1895–1923. doi:10.1162/089976698300017197.
37. Tütüncü RH, Toh KC, Todd MJ. Solving semidefinite-quadratic-linear programs using SDPT3. Math Progr. 2003;95(2):189–217.
38. Labit Y, Peaucelle D, Henrion D. SEDUMI INTERFACE 1.02: a tool for solving LMI problems with SEDUMI. In: Proceedings of the IEEE International Symposium on Computer Aided Control System Design; 2002. pp. 272–277.
39. Löfberg J. YALMIP: a toolbox for modeling and optimization in MATLAB. In: 2004 IEEE International Symposium on Computer Aided Control Systems Design; 2004.
40. Canu S, Grandvalet Y, Guigue V, Rakotomamonjy A. SVM and Kernel Methods Matlab Toolbox. Rouen, France: Perception Systèmes et Information, INSA de Rouen; 2005.
41. Frank A, Asuncion A. UCI machine learning repository. Irvine, CA: University of California, School of Information and Computer Science. <http://archive.ics.uci.edu/ml>.
42. Lachin JM. Introduction to sample-size determination and power analysis for clinical trials. Control Clin Trials. 1981;2(2):93–113. doi:10.1016/0197-2456(81)90001-5.
43. Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika. 1988;75(4):800–802.
44. Siegel R, Naishadham D, Jemal A. Cancer statistics, 2012. CA Cancer J Clin. 2012;62(1):10–29. doi:10.3322/caac.20138.
45. Pickhardt PJ, Choi JR, Hwang I, Butler JA, Puckett ML, Hildebrandt HA, Wong RK, Nugent PA, Mysliwiec PA, Schindler WR. Computed tomographic virtual colonoscopy to screen for colorectal neoplasia in asymptomatic adults. N Engl J Med. 2003;349(23):2191–2200. doi:10.1056/NEJMoa031618.
46. Summers RM. Improving the accuracy of CT colonography interpretation: computer-aided diagnosis. Gastrointest Endosc Clin N Am. 2010;20:245–257. doi:10.1016/j.giec.2010.02.004.
47. Wang S, Yao J, Petrick N, Summers RM. Combining statistical and geometric features for colonic polyp detection in CTC based on multiple kernel learning. Int J Comput Intell Appl. 2010;9(1):1–15. doi:10.1142/S1469026810002744.
48. Suzuki K, Zhang J, Xu JW. Massive-training artificial neural network coupled with Laplacian-eigenfunction-based dimensionality reduction for computer-aided detection of polyps in CT colonography. IEEE Trans Med Imag. 2010;29(11):1907–1917. doi:10.1109/TMI.2010.2053213.
49. Wang S, McKenna MT, Nguyen TB, Burns JE, Petrick N, Sahiner B, Summers RM. Seeing is believing: video classification for computed tomographic colonography using multiple-instance learning. IEEE Trans Med Imag. 2012;31(5):1141–1153. doi:10.1109/TMI.2012.2187304.
50. Franaszek M, Summers RM, Pickhardt PJ, Choi JR. Hybrid segmentation of colon filled with air and opacified fluid for CT colonography. IEEE Trans Med Imag. 2006;25(3):358–368. doi:10.1109/TMI.2005.863836.
51. Jerebko AK, Malley JD, Franaszek M, Summers RM. Support vector machines committee classification method for computer-aided polyp detection in CT colonography. Acad Radiol. 2005;12(4):479–486. doi:10.1016/j.acra.2004.04.024.
52. Ng AY. Feature selection, L1 vs. L2 regularization, and rotational invariance. In: The 21st International Conference on Machine Learning; 2004.
53. Zhu J, Rosset S, Hastie T, Tibshirani R. 1-norm support vector machines. In: Advances in Neural Information Processing Systems, vol. 16; 2004.
54. Boyd S, Vandenberghe L. Convex Optimization. Cambridge, England: Cambridge University Press; 2004.
