Author manuscript; available in PMC: 2014 May 15.
Published in final edited form as: J Mach Learn Res. 2011 Jan 1;15:688–697.

Assisting Main Task Learning by Heterogeneous Auxiliary Tasks with Applications to Skin Cancer Screening

Ning Situ, Xiaojing Yuan, George Zouridakis
PMCID: PMC4022611  NIHMSID: NIHMS296973  PMID: 24839405

Abstract

In typical classification problems, high-level concept features provided by a domain expert are usually available during classifier training but not during deployment. We address this problem from a multitask learning (MTL) perspective by treating these features as auxiliary learning tasks. Previous efforts in MTL have mostly assumed that all tasks share the same input space. However, auxiliary tasks can have different input spaces, since their learning targets are different. Thus, to handle such heterogeneous inputs, in this paper we present a new model that uses heterogeneous auxiliary tasks to help main task learning. First, we formulate a convex optimization problem for the proposed model, and then we analyze its hypothesis class and derive true risk bounds. Finally, we compare the proposed model with other relevant methods on the problem of skin cancer screening and on public datasets. Our results show that the performance of the proposed method is highly competitive with that of other relevant methods.

1 Introduction

Nonoperational features are those that are available during classifier development but not after classifier deployment, because individual object annotation is expensive or impossible due to time constraints or lack of expertise (Caruana, 1997). This situation arises in many biomedical applications where automated computer-aided image analysis is needed. Our application involves skin cancer screening, where the main task is to decide whether a given skin lesion is cancerous or not based on dermoscopy images. Dermatology experts rely on a set of high-level concepts to characterize a lesion as malignant (Johr, 2002). For instance, the image in Fig. 1 exhibits several anatomical features highly suggestive of skin cancer; for this reason, this lesion, which is indeed a melanoma, is characterized in dermoscopic terms as having a multicomponent global feature. The local dermoscopic features observed in the image, namely the irregular dot/globular structure and the blue-whitish veil, are both hallmarks of melanoma. However, for a new test image, a dermatologist may not be available to identify these high-level anatomical features, and this poses a major limitation in developing accurate automated lesion-screening tools (Federman et al., 2002).

Figure 1. An example of the main task and auxiliary tasks in skin cancer detection based on an epiluminescence microscopy (ELM) image.

In order to use nonoperational features, Caruana (1997) proposed viewing nonoperational features as auxiliary tasks during the training phase and demonstrated that simultaneous learning of the main task and auxiliary tasks can lead to a more accurate model. Many previous approaches to multi-task learning (Argyriou et al., 2008; Zhang et al., 2008) assumed that these tasks are homogeneous, i.e., they all share the same input space. This is not true in the skin cancer screening paradigm, as recognition of different anatomical features typically needs different low level features, such as geometry (Zouridakis et al., 2004), texture (Yuan et al., 2006), and color (Stanley et al., 2007). One obvious solution here is to combine all the heterogeneous features into one large feature vector. However, this method does not use the knowledge of the split feature set, and it may result in two potential problems. First, it increases the dimension of the feature vector for all tasks and probably combines some unrelated features for some tasks. Second, because multi-task learning methods typically assume that classifiers from different tasks are similar, it is likely that for some tasks certain unrelated features are forced to impact the final output, especially when individual task classifiers are assumed to be similar to their average classifier, as is the case in the models of Evgeniou et al. (2006), Daumé (2007), and Finkel and Manning (2009). Another solution, which relies on Lemma 2 of Evgeniou et al. (2006), is to map each of the different input spaces into one common space. This possible solution was only described briefly by Evgeniou et al. (2006). In general, it is difficult and not straightforward to choose the common space or the mappings between spaces with different dimensions.

To handle the heterogeneous input condition, the proposed method considers task relatedness in the following way: we directly model the main task classifier as a weighted average of the outputs provided by the auxiliary task classifiers. It is important to note that we use the weighted average of the outputs of the classifiers, not the average of the classifiers' parameters, such as the weight of each feature in the linear case (Evgeniou and Pontil, 2004; Evgeniou et al., 2006; Daumé, 2007; Finkel and Manning, 2009). This simple idea allows us to perform domain adaptation easily regardless of the different forms of the auxiliary classifiers. The proposed method for modeling task relatedness is closely related to the ensemble learning scenario. Three representative methods using a similar ensemble idea are the adaptive support vector machine (A-SVM) (Yang et al., 2007), linear programming boosting (LPB) (Demiriz et al., 2002; Gehler and Nowozin, 2009), and the gating network approach (Bonilla et al., 2007). These ensemble learning methods typically treat the training of each sub-model and their combination as two separate stages. Our proposed method, however, inspired by multiple kernel learning (MKL) (Lanckriet et al., 2004) and the multi-task kernel approaches (Evgeniou et al., 2006), can learn the models for the auxiliary tasks and the combined model for the main task together. Our method also increases the dimension of the feature space for the main task. However, we use a weighted average so that the impact of unrelated tasks can be decreased at the task level. This has an effect similar to MKL, the group lasso (Yuan and Lin, 2006), and their p-norm variants (i.e., a non-sparse solution is obtained when p > 1, in contrast to the original MKL and group lasso) (Kloft et al., 2009).

To allow the auxiliary classifiers to adapt to the main task, we add a small term to each auxiliary classifier in the weighted sum, as in A-SVM; and to build a general model, we allow the main task to have its own features that are different from those used to characterize the auxiliary tasks.

Using the notion of task relatedness described above, and following the generic empirical risk minimization approach, we formulate a mathematical programming problem with a regularization term similar to that of the multiple kernel learning considered by Zien and Ong (2007) and its p-norm extensions (Kloft et al., 2009).

The contributions of this paper are: (a) development of a convex optimization problem for main task learning with heterogeneous auxiliary tasks (section 2.3); (b) derivation of error bounds for the proposed model (section 3), the first of which is data-dependent and hence informative about the sample at hand; (c) a practical application of our model to skin cancer screening (section 4); and (d) a comparison with three relevant methods, namely MKL, the multi-task kernel (Evgeniou et al., 2006), and LPB (section 4).

2 The Proposed Learning Model

2.1 Symbols and Notations

The most important symbols and notations are listed in Table 1, while other symbols are introduced in the text. As is customary, bold-faced letters denote vectors, e.g., $d = (d_1, \ldots, d_m)$, where the dimension $m$ of the vector is determined by the specific context. We use $\mathbf{e}$ to represent a column vector of ones and $I$ to denote the identity matrix. Generally, for $x \in \mathcal{X}$, where $\mathcal{X}$ is a reproducing kernel Hilbert space (RKHS), we only consider a classifier $f(x)$ of the form $f(x) = w^T x$, with $w \in \mathcal{X}$. We account for the bias parameter by mapping $x$ into $(x, 1)$. When a classifier has the form $(d_l w_l)_{l=1}^{m}$, with $d_l \in \mathbb{R}$, $w_l \in \mathcal{X}_l$, and each $\mathcal{X}_l$ an RKHS, its decision on an instance $(x^l)_{l=1}^{m}$ can be explicitly written as $\sum_{l=1}^{m} d_l w_l^T x^l$. We use $\|\cdot\|_p$ to denote the p-norm in an RKHS, and $\|\cdot\|$ refers to the 2-norm by default.

Table 1.

Symbols and Notations

Symbols and their meanings:

$\mathcal{X}_l$ : Feature space of task $l$, $1 \le l \le m$; $\mathcal{X}_{m+1}$ denotes the feature space used by the main task only. Each $\mathcal{X}_l$ is an RKHS.

$\mathcal{X}$ : Product space $\mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_{m+1}$.

$x_i^l$ : Feature of the $i$-th instance of task $l$, with $1 \le l \le m$; when $l = m + 1$, it denotes the feature of instance $i$ used by the main task only. For all $1 \le l \le m + 1$, $x_i^l \in \mathcal{X}_l$.

$K^l$ : $n \times n$ symmetric positive definite (s.p.d.) kernel matrix whose $(i, j)$-th element is $\langle x_i^l, x_j^l \rangle$.

$y_i^l$ : $y_i^l \in \{-1, 1\}$, label of the $i$-th instance from task $l$.

$(x_i^l)_{l=1, i=1}^{m, n}$ : The vector $(x_1^1, x_2^1, \ldots, x_n^1, \ldots, x_1^m, \ldots, x_n^m)$.

2.2 Problem Statement and Task Relatedness

Assuming that there are $m$ auxiliary tasks and only one main task (the $(m+1)$-th task), our goal is to choose a decision function $h$ from a given hypothesis class $\mathcal{H} \subseteq \{-1, 1\}^{\mathcal{X}}$, such that $h$ achieves the best true error rate on the main task. We assume that all tasks are binary classification problems and the training data are $n$ independent and identically distributed (i.i.d.) data instances $X_i = \big((x_i^l)_{l=1}^{m+1}, (y_i^l)_{l=1}^{m+1}\big)$, $1 \le i \le n$, drawn from an unknown distribution $\mathcal{P}$ defined on the domain $\mathcal{X} \times \{-1, 1\}^{m+1}$. It is important to notice that the independence assumption here is across different $i$'s. Inside the $i$-th training instance $X_i$, the $x_i^l$'s and $y_i^l$'s, $1 \le l \le m + 1$, may not be independent of each other. This is slightly different from the setting studied by Evgeniou and Pontil (2004), Evgeniou et al. (2006), Daumé (2007), and Finkel and Manning (2009), where each of the $(x_i^l, y_i^l)$'s is an i.i.d. training instance, for any $i$ and any $l$. A testing instance has the form $\big((x^l)_{l=1}^{m+1}\big)$, and our goal is to predict $y^{m+1}$. In the testing instance, $(y^l)_{l=1}^{m}$ are unknown.

To explicitly represent the task relatedness between the main and auxiliary tasks, let the decision function for auxiliary task $l$ be $w_l \in \mathcal{X}_l$. Assuming that the classifier for the main task can be written as $\big(d_l(w_l + v_l)\big)_{l=1}^{m+1}$, with $w_l \in \mathcal{X}_l$, $v_l \in \mathcal{X}_l$, $\|v_{m+1}\| = 0$, $d_l \ge 0$ for $1 \le l \le m+1$, and $\|d\|_p^p \le 1$ for some $p \ge 1$, where $d$ denotes the vector $(d_1, d_2, \ldots, d_{m+1})$, the relatedness between the main task and a specific auxiliary task $l$ is captured by $\|v_l\|$, $1 \le l \le m$. Obviously, the smaller the $\|v_l\|$, the more related the main and the auxiliary tasks are. In the special case when $\|v_l\| = 0$, $1 \le l \le m + 1$, and $\|w_{m+1}\| = 0$, the prediction value of the main task classifier for an instance $(x^l)_{l=1}^{m+1}$ becomes $\sum_{l=1}^{m} d_l w_l^T x^l$, which is just a weighted sum of the outputs provided by the auxiliary task classifiers.
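For concreteness, the following minimal sketch (illustrative only; the dimensions, weights, and toy data are hypothetical and not taken from the paper) evaluates a main-task decision of the form just described, i.e., a weighted sum of adapted auxiliary outputs plus the main task's own term:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setting: m = 3 auxiliary tasks plus one main-task-only feature block.
dims = [48, 75, 69, 200]                 # illustrative feature dimensions of each X_l
m = len(dims) - 1

# Auxiliary classifiers w_l, adaptation terms v_l (with v_{m+1} = 0), and task weights d.
w = [rng.standard_normal(dd) for dd in dims]
v = [0.1 * rng.standard_normal(dd) for dd in dims[:-1]] + [np.zeros(dims[-1])]
d = np.array([0.5, 0.3, 0.1, 0.6])
d = d / np.linalg.norm(d)                # enforce ||d||_2 <= 1 (the p = 2 case)

# One test instance: a tuple of heterogeneous feature vectors (x^1, ..., x^{m+1}).
x = [rng.standard_normal(dd) for dd in dims]

# Main-task score: weighted sum of the (adapted) auxiliary outputs plus the
# main task's own term d_{m+1} w_{m+1}^T x^{m+1}.
score = sum(d[l] * (w[l] + v[l]) @ x[l] for l in range(m + 1))
print("main-task prediction:", np.sign(score))
```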

2.3 Mathematical Programming

We formulate a mathematical program for the problem of main task learning using heterogeneous auxiliary tasks as follows:

$$
\begin{aligned}
\mathrm{P1:}\quad \min_{w, v, \xi, d}\;\; & \frac{1}{2}\sum_{l=1}^{m} d_l\|v_l\|^2 + \frac{1}{2}\sum_{l=1}^{m} C_l d_l\|w_l\|^2 + \frac{1}{2} d_{m+1}\|w_{m+1}\|^2 + \frac{1}{2}\sum_{l=1}^{m}\sum_{i=1}^{n} d_l(\xi_i^l)^2 + C\sum_{i=1}^{n}\xi_i^{m+1} \\
\text{s.t.}\;\; & y_i^l\big(w_l^T\varphi_l(x_i^l)\big) \ge 1 - \xi_i^l, \quad 1 \le l \le m,\ 1 \le i \le n; \\
& d \ge \mathbf{0}, \quad \|d\|_p^p \le 1, \quad \xi \ge \mathbf{0}, \\
& y_i^{m+1}\Big(\sum_{l=1}^{m+1} d_l(w_l + v_l)^T\varphi_l(x_i^l)\Big) \ge 1 - \xi_i^{m+1}, \quad 1 \le i \le n
\end{aligned}
\qquad (1)
$$

where $\mathbf{0}$ is a column vector of zeros and $\|v_{m+1}\| = 0$. Inequalities between two vectors are taken element-wise, and $C_l$, $1 \le l \le m$, and $C$ are positive user-defined parameters.

To convert P1 into a convex optimization problem, we can simply replace $w_l$, $v_l$ ($1 \le l \le m + 1$), and $\xi_i^l$ ($1 \le i \le n$ and $1 \le l \le m$) with $\hat{w}_l/d_l$, $\hat{v}_l/d_l$, and $\hat{\xi}_i^l/d_l$, respectively. If $d_l = 0$, we define $a/d_l = \infty$ when $a \ne 0$, and $a/d_l = 0$ when $a = 0$. Then, we can use the cutting plane algorithm to solve the convex optimization problem. We omit the detailed steps here because this is a standard procedure employed in many MKL optimization algorithms1 (Sonnenburg et al., 2006; Kloft et al., 2009).

One possible concern is that, during the optimization, if $d_l = 0$ the error of the corresponding auxiliary task is no longer penalized and can drift away from 0. To avoid this situation, we can simply replace the constraint $d \ge \mathbf{0}$ with $d \ge \varepsilon_d$ (element-wise), where $\varepsilon_d$ is a small positive parameter. When $\varepsilon_d > 0$, the optimization method is still the same as in the case $\varepsilon_d = 0$. Our experiments, as detailed in section 4, indicate that setting $\varepsilon_d = 10^{-4}$ works well when $p = 2$.

In P1 we use the hinge loss for the main task error (i.e., $\xi_i^{m+1}$), as in the classic SVM formulation. For the error terms of the auxiliary tasks (i.e., $\xi_i^l$, $1 \le l \le m$), our analysis in section 3 shows that a risk bound on the main task can be obtained when the quadratic loss is used for the auxiliary tasks. The truncated quadratic loss is used in P1 because we empirically observe that it performs better than the quadratic loss. The optimization procedure is similar for all three types of loss functions (hinge loss, quadratic loss, and truncated quadratic loss).
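For intuition about how such solvers proceed, the sketch below shows one closed-form weight update of the kind used in p-norm MKL-style block-coordinate schemes (minimizing terms of the form $\|\hat{w}_l\|^2/d_l$ subject to $\|d\|_p \le 1$). This is only an assumed, simplified illustration of the general technique, not the authors' cutting plane solver, and the toy norms are hypothetical:

```python
import numpy as np

def update_task_weights(w_norms_sq, p=2.0, eps_d=1e-4):
    """One closed-form d-update of the kind used in p-norm MKL-style block-coordinate
    schemes: minimizing sum_l ||w_hat_l||^2 / d_l over d subject to ||d||_p <= 1 gives
    d_l proportional to ||w_hat_l||^{2/(p+1)}.  A small floor eps_d keeps every task
    weakly active, as suggested in the text.  This is only the weight step; the full
    algorithm alternates it with re-solving the SVM-like subproblem for fixed d."""
    w_norms = np.sqrt(np.asarray(w_norms_sq, dtype=float))
    d = w_norms ** (2.0 / (p + 1.0))
    d = d / max(np.sum(d ** p) ** (1.0 / p), 1e-12)    # project onto ||d||_p <= 1
    d = np.maximum(d, eps_d)                           # epsilon_d floor from section 2.3
    return d / max(np.sum(d ** p) ** (1.0 / p), 1.0)   # re-normalize if the floor broke feasibility

# Toy usage: four tasks whose convexified classifier norms ||w_hat_l||^2 differ widely.
print(update_task_weights([4.0, 1.0, 0.25, 0.0], p=2.0))
```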

3 Formal Analysis

We provide a risk bound analysis of the proposed learning model. For the purpose of this error bound analysis, we use the quadratic loss in place of the truncated quadratic loss on the auxiliary tasks in P1 (i.e., the constraint $\xi_i^l \ge 0$ is removed for $1 \le l \le m$). We first analyze the hypothesis class of our model for main task learning. Given a training set of $n$ i.i.d. data, $X = \big\{\big((x_i^l)_{l=1}^{m+1}, (y_i^l)_{l=1}^{m+1}\big),\ 1 \le i \le n\big\}$, a $B > 0$, and user-defined positive parameters $C_l$, with $1 \le l \le m$, the hypothesis class $\mathcal{H}_1$ for the proposed learning model is defined as

$$
\begin{aligned}
\mathcal{H}_1(B, X, C_l) := \Big\{ (x^l)_{l=1}^{m+1} \mapsto \sum_{l=1}^{m+1}\big(d_l(w_l + v_l)\big)^T x^l \;\Big|\;\; & w_l, v_l \in \mathcal{X}_l,\ 1 \le l \le m+1;\ v_{m+1} = \mathbf{0};\\
& \|d\|_p^p \le 1,\ d \ge \mathbf{0}; \quad\text{(2.a)}\\
& \text{and}\ \ \tfrac{1}{2}\textstyle\sum_{l=1}^{m} d_l\|v_l\|^2 \quad\text{(2.b)}\\
& +\ \tfrac{1}{2}\textstyle\sum_{l=1}^{m} C_l d_l\|w_l\|^2 + \tfrac{1}{2} d_{m+1}\|w_{m+1}\|^2 \quad\text{(2.c)}\\
& +\ \tfrac{1}{2}\textstyle\sum_{l=1}^{m}\sum_{i=1}^{n} d_l\big(w_l^T x_i^l - y_i^l\big)^2 \quad\text{(2.d)}\\
& \le B \Big\} \qquad (2)
\end{aligned}
$$

where the main task classifier is $\big(d_l(w_l + v_l)\big)_{l=1}^{m+1}$, and $w_l$ is the $l$-th auxiliary task classifier with $1 \le l \le m$. Here $\mathbf{0}$ denotes the zero element in the corresponding RKHS. It should be noted that the loss of the auxiliary tasks (i.e., (2.d)) is viewed as one of the regularization terms. In the above definition, the term (2.a) restricts the p-norm of the weight vector to be smaller than 1, similar to the non-sparse MKL (Kloft et al., 2009). Furthermore, the term (2.b) regularizes the task relatedness (as described in section 2.2), while the term (2.c) regularizes the norms of the classifiers of the different tasks. We use a weighted sum formulation similar to the regularization term of the intuitive multiple kernel learning (Zien and Ong, 2007), which was shown to be the same as the regularization term considered by Sonnenburg et al. (2006) in the case of MKL. The relation between using this kind of regularization term (i.e., the terms (2.b) and (2.c)) and using the group lasso regularization (Yuan and Lin, 2006) for MKL is discussed by Bach (2008). The term (2.d) constrains the performance of the auxiliary task classifiers through a total quadratic error. The quadratic error here is in a weighted form: if an auxiliary task has a higher impact on the main task (i.e., $d_l$ is high), its error will be penalized more. If $d_l = 0$, there is no restriction on the error of the $l$-th auxiliary task, because $w_l$ (the $l$-th auxiliary task classifier) has no effect on the main task classifier $\big(d_l(w_l + v_l)\big)_{l=1}^{m+1}$ as long as $d_l = 0$ (recall that our goal is to perform classification on the main task). Another motivation to use the weighted error term (2.d) is that it easily leads to a convex optimization problem, as described in section 2.3. The parameters $C_l$ provide a function similar to the regularization parameter in an SVM. When collecting all the terms related to $w_l$ in (2.c) and (2.d) for a fixed $l$, we obtain $d_l\big(\tfrac{1}{2}C_l\|w_l\|^2 + \tfrac{1}{2}\sum_{i=1}^{n}(w_l^T x_i^l - y_i^l)^2\big)$, which is exactly an SVM with the quadratic loss for the $l$-th auxiliary task.
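As a small illustration of that last observation, in the linear case the per-task subproblem with quadratic loss has a ridge-regression-style closed-form minimizer; the sketch below (a toy example under assumed linear features, not the authors' kernelized implementation) solves it directly:

```python
import numpy as np

def quadratic_loss_svm(X, y, C_l):
    """Closed-form minimizer of 0.5*C_l*||w||^2 + 0.5*sum_i (w^T x_i - y_i)^2.
    The positive factor d_l multiplying this objective does not change the
    minimizer, so it is dropped here."""
    n_features = X.shape[1]
    return np.linalg.solve(C_l * np.eye(n_features) + X.T @ X, X.T @ y)

# Toy usage for one auxiliary task with +/-1 labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(50))
w = quadratic_loss_svm(X, y, C_l=1.0)
print(np.mean(np.sign(X @ w) == y))   # training accuracy of this auxiliary classifier
```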

Assuming that $\|x_i^l\|^2 = 1$, with $1 \le l \le m + 1$ and $1 \le i \le n$, then for a fixed $l$, $1 \le l \le m + 1$, the $n \times n$ kernel matrix $K^l$, whose $(i, j)$-th element ($1 \le i, j \le n$) equals $\langle x_i^l, x_j^l \rangle$, has a trace of $n$. We also need to assume that each $\mathcal{X}_l$ is a Euclidean space2. The dimension of $\mathcal{X}_l$ can be different for different $l$.
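In practice, the unit-norm assumption can be enforced by rescaling each kernel to unit diagonal, which makes its trace equal to $n$; a minimal sketch (illustrative, not tied to the paper's data):

```python
import numpy as np

def normalize_kernel(K):
    """Rescale a kernel matrix so that every diagonal entry equals 1
    (i.e., ||phi(x_i)||^2 = 1), which makes trace(K) = n as assumed above."""
    diag = np.sqrt(np.clip(np.diag(K), 1e-12, None))
    return K / np.outer(diag, diag)

# Toy usage with a linear kernel on random features.
X = np.random.default_rng(1).standard_normal((5, 3))
K = normalize_kernel(X @ X.T)
print(np.allclose(np.diag(K), 1.0), np.isclose(np.trace(K), len(K)))
```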

We provide true risk bounds on the main task for functions from $\mathcal{H}_1$. We need some notation first. Let $\lambda_l$, $1 \le l \le m + 1$, be the maximum eigenvalue of $K^l$ and $C(K^+) := \min\big(m + 1, \|(\lambda_l)_{l=1}^{m+1}\|_q\big)$, where $1/p + 1/q = 1$ when $p > 1$; when $p = 1$ we define $q = \infty$. We let $C_{\min} := \min\{C_1, C_2, \ldots, C_m\}$, $E_1 := B/(2C_{\min}) + m\big(B + m^{1/(2q)}\, n\sqrt{2B/C_{\min}}\big)/(2n)$, and, for any $\bar{E}, c > 0$, $E(\bar{E}, c) := \bar{E} + \big(4\sqrt{4m\bar{E}/n} + 6\sqrt{\ln(c/\delta)/(2n)}\big)\big(\sqrt{\bar{E}/2} + \sqrt{\bar{E}m}/2\big)$. Before stating the bound, we need the concept of a landmark set introduced by Shivaswamy and Jebara (2010). A landmark set $U = \big\{\big((u_i^l)_{l=1}^{m+1}, (\bar{y}_i^l)_{l=1}^{m+1}\big),\ 1 \le i \le n\big\}$ of size $n$ is a set of $n$ data drawn i.i.d. from the same distribution as that of the training set $X$.

Theorem 1

Fix $\gamma > 0$ and $C_l > 0$, $1 \le l \le m$. Let $X$ be a training set of $n$ i.i.d. data drawn from a distribution $\mathcal{P}$, and $U$ be a landmark set of size $n$. For any $h \in \mathcal{H}_1(B, X, C_l)$ with $B > 0$:

1. With probability at least $1 - \delta$ over a random draw of $X$, we have
$$\Pr_{\mathcal{P}}\big[y^{m+1} \ne \mathrm{sign}\big(h((x^l)_{l=1}^{m+1})\big)\big] \le \frac{1}{n\gamma}\sum_{i=1}^{n}\xi_i^{m+1} + 3\sqrt{\frac{\ln(8/\delta)}{2n}} + \frac{8m^{3/2}\sqrt{\ln(4/\delta)}\,E(E_1, 8)}{\gamma n} + \frac{2\sqrt{2BC(K^+)}}{\gamma n} + \frac{4\sqrt{2E(E_1, 8)}}{\gamma n}\,\mathbb{E}_U[T(U, X)]$$
where
$$T(U, X) := \Bigg[\sum_{i=1}^{n}\big((x_i^l)_{l=1}^{m}\big)^T\Big(\tfrac{1}{2}I + \tfrac{1}{2n}\sum_{j=1}^{n}\big((u_j^l)_{l=1}^{m}\big)\big((u_j^l)_{l=1}^{m}\big)^T\Big)^{-1}\big((x_i^l)_{l=1}^{m}\big)\Bigg]^{\frac{1}{2}}$$

2. With probability at least $1 - \delta$ over a random draw of $X$, we have
$$\Pr_{\mathcal{P}}\big[y^{m+1} \ne \mathrm{sign}\big(h((x^l)_{l=1}^{m+1})\big)\big] \le \frac{1}{n\gamma}\sum_{i=1}^{n}\xi_i^{m+1} + 3\sqrt{\frac{\ln(8/\delta)}{2n}} + \frac{8m^{3/2}\sqrt{\ln(4/\delta)}\,E(E_1, 8)}{\gamma n} + \frac{8B}{\gamma n}(m + 1) + \frac{4mE(E_1, 8)}{\gamma n}$$

where $\xi_i^{m+1} = \max\big(0,\ \gamma - y_i^{m+1} h\big((x_i^l)_{l=1}^{m+1}\big)\big)$ are the so-called slack variables.

A proof with complete details is provided in our supplementary materials. We follow the standard route of deriving error bounds via the empirical Rademacher complexity. One difficulty is that $\mathcal{H}_1$ depends on the training data, and hence it is hard to bound the true risk of $\mathcal{H}_1$ with its empirical Rademacher complexity directly (Shivaswamy and Jebara, 2010). Our proof follows the approach developed by Shivaswamy and Jebara (2010), which uses the landmark set to overcome the problem of a data-dependent hypothesis class. The bound obtained in part (i) of Theorem 1 depends on the training data, while that in part (ii) is independent of the training data but can be looser than the former. The asymptotic behavior of the bound in (ii) with respect to the number of training data $n$ is: the main task's empirical margin error (i.e., the $\xi_i^{m+1}$ term) plus $O(1/\sqrt{n})$.

4 Empirical Results

4.1 Skin Cancer Screening

We first demonstrate an application of the proposed algorithm to skin cancer screening based on epiluminescence microscopy (ELM) images. Our dataset is collected from the Interactive CD of Dermoscopy (Argenziano et al., 2000). We have 360 skin lesion images, of which 270 are benign and 90 are melanoma. The typical resolution of the images is 500 × 740. We use manual segmentation (e.g., the red boundary in Fig. 1) to exclude healthy skin, which ensures that the comparison of classification performance between algorithms is not affected by incorrect automated detection of lesion boundaries. Automated lesion segmentation (Zouridakis et al., 2004) can be viewed as an orthogonal research area to the study here.

4.1.1 Main Task

The main task in skin lesion screening is to detect melanoma. We use the well-known bag-of-features scheme, which is widely used in computer vision, to build a feature vector for each lesion. We first randomly sample 10,000 16 × 16 patches from each lesion, then compute Haar wavelet coefficients and color moments on each patch, and build histograms for the wavelet and color moment features, respectively. We use a codebook size of 100 for both the wavelet and color moment features. Hence, the length of the main task feature vector (i.e., the input space before mapping into $\mathcal{X}_{m+1}$ in the analysis above) is 200.
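A simplified sketch of such a bag-of-features pipeline is given below; the per-patch descriptor is a crude stand-in for the Haar wavelet and color moment features used in the paper, and the patch counts, codebook size, and images are toy values:

```python
import numpy as np
from sklearn.cluster import KMeans

def sample_patches(image, n_patches=1000, size=16, rng=None):
    """Randomly sample square patches from an (H, W, 3) image array."""
    rng = rng if rng is not None else np.random.default_rng(0)
    H, W = image.shape[:2]
    ys = rng.integers(0, H - size, n_patches)
    xs = rng.integers(0, W - size, n_patches)
    return np.stack([image[y:y + size, x:x + size].ravel() for y, x in zip(ys, xs)])

def patch_descriptor(patches):
    """Simplified per-patch descriptor: channel-wise mean and standard deviation
    (stand-ins for the Haar wavelet and color moment features of the paper)."""
    p = patches.reshape(len(patches), -1, 3)
    return np.hstack([p.mean(axis=1), p.std(axis=1)])

def bag_of_features(descriptors_per_image, codebook_size=100):
    """Learn a codebook with k-means and encode each image as a normalized
    histogram of codeword assignments."""
    km = KMeans(n_clusters=codebook_size, n_init=4, random_state=0)
    km.fit(np.vstack(descriptors_per_image))
    hists = []
    for desc in descriptors_per_image:
        counts = np.bincount(km.predict(desc), minlength=codebook_size)
        hists.append(counts / counts.sum())
    return np.stack(hists), km

# Toy usage on random "lesions" (in place of the 500 x 740 dermoscopy images).
rng = np.random.default_rng(0)
images = [rng.random((120, 160, 3)) for _ in range(4)]
descs = [patch_descriptor(sample_patches(img, n_patches=200, rng=rng)) for img in images]
features, codebook = bag_of_features(descs, codebook_size=20)
print(features.shape)   # (4, 20): one histogram per image
```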

4.1.2 Auxiliary tasks

Auxiliary tasks include the global dermoscopic feature and three local dermoscopic features. The global dermoscopic feature has 7 classes, as listed in Fig. 1, but we only classify each lesion as multicomponent or not multicomponent, since being multicomponent is a sign of melanoma. As low level features for the global dermoscopic feature we use the local binary patterns $\mathrm{LBP}_{16,2}^{riu}$ and $\mathrm{LBP}_{24,3}^{riu}$ proposed by Ojala et al. (2002), together with the standard deviation and entropy of the histograms built from the wavelet and color moments, respectively. The feature vector size here is 48. The three local dermoscopic features are irregular dot/globular, irregular network, and blue-whitish veil. Because of the weak labeling of the database (Argenziano et al., 2000), we know whether a certain local pattern exists somewhere in the lesion, but we do not know where exactly it is present. To solve this problem, we first perform a segmentation inside the lesion by graph cut (5 segments for each lesion). If any of the segmented regions contains a certain local feature, that local feature is considered present in the lesion. This is a typical multi-instance learning (MIL) problem, which we convert into a single instance learning problem by the method proposed by Li and Yeung (2009); the result can then be integrated into our framework, as sketched below. This MIL method identifies instance prototypes (IPs) (Maron and Lozano-Pérez, 1998) for each task based on low level features of each segmented region (instance), which are the same as those of the main task. Different tasks have distinct IPs that lead to distinct feature spaces. The final sizes of the feature vectors for the three local dermoscopic patterns are 75, 69, and 130, respectively.
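To illustrate the flavor of this MIL-to-single-instance conversion, the sketch below embeds each lesion (a bag of segment feature vectors) by its similarity to a set of instance prototypes. The random prototype selection is only a placeholder for the discriminative IP selection of Li and Yeung (2009), and all data are synthetic:

```python
import numpy as np

def bag_to_vector(bag, prototypes, gamma=1.0):
    """Embed a bag (set of segment feature vectors) as its maximal RBF similarity
    to each instance prototype -- one coordinate per prototype."""
    sims = np.exp(-gamma * ((bag[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1))
    return sims.max(axis=0)

def select_prototypes(positive_bags, n_prototypes):
    """Crude prototype selection: pool all instances from positive bags and pick a
    random subset.  (Li and Yeung (2009) select instance prototypes with a
    discriminative criterion; this random choice is only a placeholder.)"""
    pool = np.vstack(positive_bags)
    idx = np.random.default_rng(0).choice(len(pool), size=n_prototypes, replace=False)
    return pool[idx]

# Toy usage: 3 lesions, each segmented into 5 regions with 6-dimensional region features.
rng = np.random.default_rng(1)
bags = [rng.standard_normal((5, 6)) for _ in range(3)]
labels = np.array([1, 1, -1])                       # lesion-level ("weak") labels
prototypes = select_prototypes([b for b, y in zip(bags, labels) if y == 1], n_prototypes=4)
X = np.stack([bag_to_vector(b, prototypes) for b in bags])
print(X.shape)   # (3, 4): one single-instance feature vector per lesion
```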

4.1.3 Experimental Settings

We use five-fold cross-validation (CV) for all methods applied to the dataset. We are only concerned with performance on the main task, which is the goal of skin cancer screening. Our dataset is imbalanced, so accuracy is not a good measure of classifier performance. Hence, we use the area under the receiver operating characteristic curve (AUC) of the main task as the performance measure. In this study we compare the following methods.

Simple: Concatenation of all features from all tasks and training with the main task label using an SVM.

MTK: Multi-task kernel. The general definition of MTK was proposed by Evgeniou et al. (2006), and our implementation uses a specific type of MTK defined in Eq. (22) of their paper (an RBF kernel is used to replace the dot product for non-linear mapping). Their experiments demonstrated the effectiveness of this kind of MTK. All MTK models (Evgeniou and Pontil, 2004; Evgeniou et al., 2006; Finkel and Manning, 2009; Daumé, 2007) focused on homogeneous feature spaces. To adapt MTK to our problem, we concatenate all heterogeneous features, as in Simple.

Single (Baseline): A single kernel built from the features of the main task (i.e., $K^{m+1}$) and training with the main task label using an SVM.

MKL: Multiple kernel learning, which was used to combine heterogeneous data sources (Lanckriet et al., 2004). We use the non-sparse multiple kernel learning (Kloft et al., 2009) and the 2-norm in the constraint for the weight variables (i.e., $\|d\|_2^2 \le 1$). MKL does not use the labels of the auxiliary tasks but does use their features.

LPB: Linear programming boosting (Demiriz et al., 2002) was not originally designed for a multi-task setting. However, it is straightforward to generalize LPB to our problem. We use the ν-LPB formulation3 from Gehler and Nowozin (2009), which was shown to be more effective than MKL in combining heterogeneous features (Gehler and Nowozin, 2009). In our experiments, ν-LPB first trains m + 1 SVMs for the m auxiliary tasks and the main task, each with its own features. Then, ν-LPB builds a linear weighted combination of the m + 1 classifiers for the main task. This method is the closest to the proposed learning framework. The major difference is that LPB performs learning in two steps, namely, it learns each task first and then, based on the outputs of the first step, learns the main task. In our model, learning is performed jointly: all auxiliary tasks and the main task are learned simultaneously.

CMHA: Our model, which concurrently learns the main and heterogeneous auxiliary tasks (CMHA), with p = 2.

In order to see the effect of the auxiliary task labels (i.e., nonoperational features), we consider replacing all the auxiliary task labels in CMHA and LPB with the main task label of the lesion, while still using the features of the auxiliary tasks. For LPB, this is just its original formulation (Gehler and Nowozin, 2009) for combining heterogeneous features. We call these two methods without auxiliary task labels CMHA-WOA and LPB-WOA, respectively.

We also test MKL and CMHA with their weight variables (i.e., d) set to be uniform, denoted MKL-ave and CMHA-ave, respectively.

We use the RBF kernel with a parameter4 of 1 (for Simple and MTK we also try values of 1/5 and 1/522, since they concatenate five sets of features and the total feature length is 522; the best results are reported, which could benefit them in the comparison). We select the regularization parameter for Simple, Single (Baseline), MKL, and MKL-ave from the set {1000, 100, 50, 25, 10, 0.1} based on their best prediction performance, which could also benefit these methods in the comparison. The '(λ, γ)' pair of MTK (Evgeniou et al., 2006) is searched over {0.1, 0.2, …, 0.9} × {1000, 100, 50, 25, 10, 0.1}, and the result for its best setting is likewise reported. For all regularization parameters of the methods derived from our model (P1) and LPB, including 'C', '$C_l$', and the 'ν' in ν-LPB (Gehler and Nowozin, 2009), we choose them based on a validation set (20% of the training data), and the final model is retrained with the whole training set. We could not afford to test all possible combinations. The '$1/C_l$' in our model (P1) and the 'C' of the SVM for the $l$-th auxiliary task in LPB have similar functions in regularization5, as discussed at the end of section 3. So, we can use a heuristic method (Gehler and Nowozin, 2009): selecting those values individually based on the validation set performance obtained by training an SVM for the $l$-th task. For the global pattern, we select '$1/C_l$' from $10^4 \times \{2, 1.5, 1\}$. For the local patterns, we use the set $10^2 \times \{1, \sqrt{2}/2, 1/2\}$. After selecting parameters for the auxiliary tasks, we choose the 'C' in our model (P1) from $100 \times \{20, 15, 10, 5, 1\}$ and the 'ν' in ν-LPB from {0.01, 0.1, 1, 10, 100} based on the validation set result.
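The per-task heuristic just described can be pictured with the following sketch (using scikit-learn; the classifier, candidate grid, and data are illustrative placeholders, not the exact configuration used in the paper):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def pick_per_task_parameter(X_task, y_task, candidates, gamma=1.0):
    """Choose one auxiliary task's regularization value independently, by validation
    AUC of a single-task SVM trained on that task's own features (mirroring the
    heuristic in the text; the grid and gamma here are illustrative)."""
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_task, y_task, test_size=0.2, stratify=y_task, random_state=0)
    best_c, best_auc = None, -np.inf
    for c in candidates:
        clf = SVC(C=c, kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
        auc = roc_auc_score(y_va, clf.decision_function(X_va))
        if auc > best_auc:
            best_c, best_auc = c, auc
    return best_c

# Toy usage for one auxiliary task with 48-dimensional features.
rng = np.random.default_rng(2)
X_aux = rng.standard_normal((200, 48))
y_aux = np.sign(X_aux[:, 0] + 0.5 * rng.standard_normal(200))
print(pick_per_task_parameter(X_aux, y_aux, candidates=[0.1, 1.0, 10.0, 100.0]))
```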

4.1.4 Results and Discussions

The baseline (Single) method’s AUC is 74.01% (std: 7.12%). We report the AUC’s of different methods minus that of the baseline model (Single) and their standard deviations in Table 2. These results can be summarized as follows:

Table 2.

AUC’s (%) of Various Methods Minus That of Baseline (Diff. AUC)

Methods Simple MTK MKL
Diff. AUC(std) −0.39* (3.34) 2.67* (6.36) 1.73* (3.50)

Methods MKL-ave LPB LPB-WOA
Diff. AUC(std) 3.51* (4.13) 6.17* (6.27) 7.28* (6.70)

Methods CMHA CMHA-ave CMHA-WOA
Diff. AUC(std) 9.42 (5.32) 9.39 (5.70) 6.91* (5.77)

A ‘*’ sign indicates that the result of the corresponding method is significantly different from that of CMHA by a paired t-test at the 95% confidence level.

  1. Among the methods that do not use auxiliary labels, Simple performs worse than the two MKL models, LPB-WOA, and CMHA-WOA. When auxiliary labels are used, MTK performs significantly worse than LPB and CMHA (at the 95% confidence level by a paired t-test). This shows that ignoring the “natural splitting” of the feature set motivated by the different learning targets (as Simple and MTK do) is not a very competitive scheme for this particular dataset.

  2. Both CMHA-WOA and LPB-WOA outperform MKL, and the reason is similar to that discussed by Gehler and Nowozin (2009): the Lagrange multipliers are restricted to be the same for all kernels in MKL but not in CMHA-WOA and LPB-WOA, resulting in more flexible models.

  3. CMHA provides a statistically significant improvement over LPB-WOA, which achieves the best result among the methods that do not use the additional auxiliary task labels (i.e., nonoperational features). This shows the effectiveness of our learning framework in exploiting nonoperational features.

  4. Our method’s performance is comparable to LPB’s when learning with main label only (a paired t-test shows that the difference between CMHA-WOA and LPB-WOA is not significant with a p-value of 0.6208). However, when including nonoperational features, CMHA provides an improvement over CMHA-WOA, while LPB performs worse than LPB-WOA. One plausible explanation is given by the main difference between these two schemes: CMHA learns the auxiliary tasks and main task together, allowing the auxiliary task classifier to adapt to the main task, while LPB learns them separately.

  5. The difference between the weighted scheme CMHA and the average scheme CMHA-ave is not significant, which is similar to what has been observed for MKL (Gehler and Nowozin, 2009). To see the advantage of CMHA over CMHA-ave, we add some randomly generated auxiliary tasks and apply the same experimental setting as above to test CMHA and CMHA-ave. Fig. 2(a) shows that CMHA is more robust than CMHA-ave in the presence of unrelated tasks. Fig. 2(b) shows the weights (i.e., d) learned by CMHA (averaged over the five-fold CV) when there are 5 unrelated tasks; clearly, CMHA can successfully exclude the unrelated tasks.

Figure 2. Comparing CMHA with CMHA-ave when unrelated auxiliary tasks are present. (a) AUC vs. number of unrelated tasks; (b) the weights learned by CMHA (averaged over the five-fold CV). 1–5: related tasks; 6–10: randomly generated unrelated tasks.

4.2 Experiments and Results on Public Datasets

In this section, we demonstrate that our model is general and can be readily applied to other domains. We use the CAL500 dataset6 (Barrington et al., 2008) from the UCSD multiple kernel learning repository. CAL500 consists of 502 songs, and each song is annotated with its genre, emotion, instrument, etc. We consider two main tasks: predicting whether a song is annotated with electronica and with alternative, respectively. We use the annotation electric guitar (including both clean and distorted) as the auxiliary label. We use the MFCC kernel as the feature for the main task and the last.fm kernel as the feature for the auxiliary task. Following Barrington et al. (2008), we test our method CMHA with ten-fold CV and use AUC as the performance measure. Barrington et al. (2008) reported results using MKL to combine four kernels, including the MFCC kernel and the last.fm kernel. For a fair comparison, we also apply MKL and LPB (section 4.1.3) using only these two kernels, as in CMHA. Regularization parameters are chosen as described above. As the results in Table 3 show, our method still slightly outperforms MKL and LPB when generalized to a new application.

Table 3.

AUC’s (%) of Various Methods on Two Music Genre Recognition Tasks

Task CMHA MKL LPB Barrington et al. (2008)

electronica 90.53 89.03 88.27 86
alternative 81.99 81.83 81.21 81

5 Related Work

Several attempts have been made to integrate heterogeneous data sources with MKL (Lanckriet et al., 2004), its p-norm extension (Kloft et al., 2009), and LPB (Demiriz et al., 2002; Gehler and Nowozin, 2009) for single task learning. In this study, we combine and extend these ideas to multi-task learning. In our model, the Lagrange multipliers for different kernels can be different, unlike those in MKL, which are required to be the same for all kernels. LPB (Demiriz et al., 2002) also allows different multipliers for different kernels (Gehler and Nowozin, 2009) and has been shown to achieve superior performance for image classification by combining heterogeneous feature spaces (Gehler and Nowozin, 2009). Our work, however, differs from Demiriz et al. (2002) mainly in two aspects: (a) the bound obtained by Demiriz et al. (2002) is based on covering numbers, while ours is derived from Rademacher complexity; and (b) LPB optimizes each auxiliary classifier first and combines their decisions later, thus resulting in an ensemble of auxiliary classifiers, whereas our model (P1) optimizes the auxiliary classifiers and builds a weighted combination for the main task simultaneously. This also differentiates our model from A-SVM (Yang et al., 2007) and the gating network approach (Bonilla et al., 2007). Furthermore, A-SVM focuses on a homogeneous input space and computes the weights of the auxiliary classifiers using unlabeled data, and the gating network approach is designed to utilize task-specific features that are the same for all data from one task and come from a homogeneous space across all tasks.

Modeling task relatedness has attracted a lot of interest because it is a critical factor allowing multitask learning to outperform single task learning. In the multi-task kernel (Evgeniou and Pontil, 2004; Evgeniou et al., 2006) and similar models (Daumé, 2007; Finkel and Manning, 2009), task relatedness is modeled in the following way: letting the main task classifier be $h$, the task relatedness is captured by $\|h - \frac{1}{m+1}\sum_{i=1}^{m+1} w_i\|$. However, in our problem, the $\mathcal{X}_l$'s are different and therefore the average term $\frac{1}{m+1}\sum_{i=1}^{m+1} w_i$ is ill-defined (recall that $w_l \in \mathcal{X}_l$). Simply combining all features from all tasks may not always be a very competitive solution, as shown in our experiments, in contrast to the case of task-specific features considered by Bonilla et al. (2007). Evgeniou et al. (2006) briefly mentioned another solution that maps all $\mathcal{X}_l$'s into one common space, but a practical application of this approach is not shown by Evgeniou et al. (2006). Our model, however, does not suffer from this heterogeneous input space problem because it only uses the predictions of the auxiliary classifiers. Evgeniou et al. (2006) did not consider this simple method because of a slight difference between their problem setting and ours: in our motivating application, for any two tasks $l$ and $r$, both $(x_i^l, y_i^l)$ and $(x_i^r, y_i^r)$ come from the $i$-th training instance (lesion), while in previous methods (Evgeniou and Pontil, 2004; Evgeniou et al., 2006; Finkel and Manning, 2009; Daumé, 2007), $(x_i^l, y_i^l)$ and $(x_i^r, y_i^r)$ are two i.i.d. training instances. Thus, previous models (Evgeniou and Pontil, 2004; Evgeniou et al., 2006; Finkel and Manning, 2009; Daumé, 2007) did not use $x_i^l$ ($1 \le l \le m$) to predict the $(m + 1)$-th task, while our model does use them to predict the main task (see Eq. (1)).

The techniques developed by Srebro and Ben-david (2006), Ying and Campbell (2009), and Cortes et al. (2010) for error bound analysis of MKL could be applied to handle the terms (2.b) and (2.c) of $\mathcal{H}_1$ and to improve the fourth term in the bounds of Theorem 1(i),(ii). However, the major difficulty in deriving the error bounds of our model lies in handling the data-dependent term (2.d) of $\mathcal{H}_1$. We relax $\mathcal{H}_1$ to a form similar to the function classes considered by Shivaswamy and Jebara (2010) and then use their landmark set method to deal with the data-dependent regularization terms.

6 Conclusions and Future Work

Viewing nonoperational features as auxiliary tasks was initially proposed by Caruana (1997). Starting from the simple idea of expressing the main task classifier as a weighted sum of the prediction values of the auxiliary classifiers, we devise a learning framework (with proven error bounds) that allows the use of nonoperational features. When the method is applied to skin cancer screening we obtain very encouraging results. One immediate extension of our model is to incorporate unlabeled data, as in co-training and multi-view learning. Obtaining sharper error bounds and hyperparameter selection for our model and LPB are also interesting future topics.

Supplementary Material

SupplementaryProofs

Acknowledgments

This work was supported in part by NIH grant no. 1R21AR057921, and by grants from UH-GEAR and the Texas Learning and Computation Center at the University of Houston.

Footnotes

1

We provide the convex formulation and the semi-infinite programming in the appendix. For completeness, we provide the full details in our supplementary materials.

2

For the RBF (radial basis function) kernel, its feature space has an infinite dimension. Using the Taylor expansion, we can approximate it with a polynomial by truncating the higher order terms.

3

We also try to replace the linear programming with a quadratic programming and observe similar performance.

4

The RBF kernel has the form exp(−γ||xy||2), where x and y are two vectors, and the parameter refers to γ.

5

Notice that in P1, '$C_l$' is multiplied by $\|w_l\|^2$, while the 'C' of a typical SVM is multiplied by the error terms.

References

  1. Argenziano G, Soyer HP, De Giorgi V, Piccolo D, Carli P, Delfino M, Ferrari A, Hofmann-Wellenhof R, Massi D, Mazzocchetti G, et al. Dermoscopy: a tutorial. 2000;12.
  2. Argyriou A, Evgeniou T, Pontil M. Convex multi-task feature learning. Machine Learning. 2008;73(3):243–272.
  3. Bach FR. Consistency of the Group Lasso and Multiple Kernel Learning. Journal of Machine Learning Research. 2008;9:1179–1225.
  4. Barrington L, Yazdani M, Turnbull D, Lanckriet G. Combining Feature Kernels for Semantic Music Retrieval. Proceedings of the 9th International Society for Music Information Retrieval Conference; 2008.
  5. Bonilla EV, Agakov FV, Williams CKI. Kernel Multi-task Learning using Task-specific Features. Proceedings of the 11th International Conference on Artificial Intelligence and Statistics; 2007.
  6. Caruana R. Multitask learning. Machine Learning. 1997;28(1):41–75.
  7. Cortes C, Mohri M, Rostamizadeh A. Generalization Bounds for Learning Kernels. Proceedings of the 27th International Conference on Machine Learning; ACM; 2010.
  8. Daumé H. Frustratingly easy domain adaptation. Annual meeting-association for computational linguistics. 2007;45:256.
  9. Demiriz A, Bennett KP, Shawe-Taylor J. Linear programming boosting via column generation. Machine Learning. 2002;46(1):225–254.
  10. Evgeniou T, Pontil M. Regularized multi-task learning. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining; ACM; 2004. pp. 109–117.
  11. Evgeniou T, Micchelli CA, Pontil M. Learning multiple tasks with kernel methods. Journal of Machine Learning Research. 2006;6(1):615.
  12. Federman DG, Kravetz JD, Kirsner RS. Skin cancer screening by dermatologists: prevalence and barriers. Journal of the American Academy of Dermatology. 2002;46(5):710. doi: 10.1067/mjd.2002.120531.
  13. Finkel JR, Manning CD. Hierarchical bayesian domain adaptation. Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics; Association for Computational Linguistics; 2009. pp. 602–610.
  14. Gehler P, Nowozin S. On Feature Combination for Multiclass Object Classification. IEEE International Conference on Computer Vision; 2009.
  15. Johr RH. Dermoscopy: alternative melanocytic algorithms the ABCD rule of dermatoscopy, Menzies scoring method, and 7-point checklist. Clinics in dermatology. 2002;20(3):240–247. doi: 10.1016/s0738-081x(02)00236-5.
  16. Kloft M, Brefeld U, Sonnenburg S, Laskov P, Müller KR, Zien A. Efficient and accurate lp-norm multiple kernel learning. Advances in Neural Information Processing Systems. 2009;22:997–1005.
  17. Lanckriet G, Cristianini N, Bartlett P, El Ghaoui L, Jordan MI. Learning the Kernel Matrix with Semidefinite Programming. Journal of Machine Learning Research. 2004;5:27–72.
  18. Li WJ, Yeung DY. Localized Content-Based Image Retrieval Through Evidence Region Identification. IEEE Computer Society Conference on Computer Vision and Pattern Recognition; 2009.
  19. Maron O, Lozano-Pérez T. A framework for multiple-instance learning. Advances in Neural Information Processing Systems. 1998:570–576.
  20. Micchelli CA, Pontil M. Feature space perspectives for learning the kernel. Machine Learning. 2007;66(2):297–319.
  21. Ojala T, Pietikäinen M, Mäenpää T. Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002;24(7):971.
  22. Rakotomamonjy A, Bach F, Grandvalet Y, Canu S. Simple MKL. Journal of Machine Learning Research. 2008;9:2491–2521.
  23. Shivaswamy PK, Jebara T. Maximum Relative Margin and Data-Dependent Regularization. Journal of Machine Learning Research. 2010;11:747–788.
  24. Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B. Large scale multiple kernel learning. Journal of Machine Learning Research. 2006;7:1565.
  25. Srebro N, Ben-david S. Learning bounds for support vector machines with learned kernels. Annual Conference on Learning Theory (COLT); 2006:169–183.
  26. Stanley RJ, Stoecker WV, Moss RH. A relative color approach to color discrimination for malignant melanoma detection in dermoscopy images. Skin Research and Technology. 2007;13(1):62–72. doi: 10.1111/j.1600-0846.2007.00192.x.
  27. Yang J, Yan R, Hauptmann AG. Cross-domain video concept detection using adaptive svms. Proceedings of the 15th international conference on Multimedia; ACM; 2007.
  28. Ying Y, Campbell C. Generalization bounds for learning the kernel. Proceedings of the 22nd Annual Conference on Learning Theory; 2009.
  29. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B (Statistical Methodology). 2006;68(1):49.
  30. Yuan X, Yang Z, Zouridakis G, Mullani N. SVM-based texture classification and application to early melanoma detection. Annual International Conference of the IEEE Engineering in Medicine and Biology Society; 2006. p. 4775.
  31. Zhang J, Ghahramani Z, Yang Y. Flexible latent variable models for multi-task learning. Machine Learning. 2008;73(3):221–242.
  32. Zien A, Ong CS. Multiclass multiple kernel learning. Proceedings of the 24th International Conference on Machine Learning; ACM; 2007. p. 1198.
  33. Zouridakis G, Doshi M, Mullani N. Early diagnosis of skin cancer based on segmentation and measurement of vascularization and pigmentation in nevo-scope images. 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society; 2004.
