Abstract
Heterogeneity in psychiatric and neurological disorders has undermined our ability to understand the pathophysiology underlying their clinical manifestations. In an effort to better distinguish clinical subtypes, many disorders, such as Bipolar Disorder, have been divided into subgroups, albeit with criteria that are not clearly defined, reproducible, or objective. Imaging, along with pattern analysis and classification methods, offers promise for developing objective and quantitative approaches to disease subtype categorization. Herein, we develop such a method using multi-task learning, assuming that each task corresponds to a disease subtype and that subtypes share some common imaging characteristics while also having distinct features. In particular, we extend the original SVM method by incorporating sparsity and group sparsity techniques to allow simultaneous joint learning for all diagnostic tasks. Experiments on multi-task Bipolar Disorder classification demonstrate the advantages of our proposed method compared to other state-of-the-art pattern analysis approaches.
1 Introduction
Most neurodegenerative and neuropsychiatric disorders are very heterogeneous, both from an imaging and from a clinical perspective, likely reflecting underlying complex genetic and environmental factors. Heterogeneity is further complicated by the fact that different pathologies often co-exist in the same individual, thereby confounding the structural and the clinical phenotypes. In the past decade, we have witnessed a great deal of progress in the use of advanced pattern analysis and machine learning methods for the classification of individuals, which is important for diagnostic and predictive purposes, and ultimately for individualized medicine. To date, however, most multivariate pattern analysis (MVPA) methods, such as Support Vector Machines (SVM), and especially their linear formulations, have focused primarily on finding a single direction separating two groups, rather than on capturing multiple directions in heterogeneous populations.
For example, Bipolar Disorder (BD) mainly consists of BD type I and type II [1]. Computerized MRI-based diagnosis therefore calls for multiple classification tasks rather than a single binary categorization of BD vs NC: all patients (BD) vs normal controls (NC), each BD subtype vs NC (i.e., BD I vs NC and BD II vs NC), and one subtype vs the other (BD I vs BD II).
Multi-task learning [2] is a relatively recent development in machine learning and might be better suited for classification under phenotypic heterogeneity, as it solves multiple classification tasks simultaneously. Herein, we develop a novel multi-task SVM method, named Multi-Task l2,1 + l1-norm SVM (mtSVML21L1), which handles multiple classification tasks by minimizing the multi-task hinge loss with sparsity [3] and group sparsity [4] regularization. The learned weight coefficients W, which define the hyperplanes, are endowed with a group sparsity property across tasks while still allowing different patterns between tasks. This lets us select, from the original input variables, a subset of features that is meaningful for all the tasks. Our method differs from the multi-task feature learning methods [5][6][7][8][9][10][11] that are based on the least square (LS) loss. Hinge loss based SVM (adopted in our method) has been reported [12][13] to perform better than LS based methods for feature selection and classification. To the best of our knowledge, this is the first multi-task pattern classification method developed to identify individual-level biomarkers for the diagnosis of heterogeneous neuropsychiatric data.
2 Multi-Task l2,1 + l1-norm Support Vector Machine
2.1 Formulation
Assume that we have $t$ supervised learning tasks. Let $X_i = [x_1, x_2, \cdots, x_n] \in \mathbb{R}^{d \times n}$ be the training data matrix of the $i$th task, $i = 1, \cdots, t$, where $d$ is the feature dimension and $n$ is the number of input data samples, and let $Y_i = [y_1, y_2, \cdots, y_n] \in \mathbb{R}^{n}$ be the corresponding labels of these training samples for task $i$, where $y_j \in \{+1, -1\}$ is the binary label. Let $W = [w_1, w_2, \cdots, w_t] \in \mathbb{R}^{d \times t}$ be the weight coefficient matrix for all $t$ tasks, whose column $w_i \in \mathbb{R}^{d}$ parameterizes the linear discriminant function of task $i$ and whose row $w^k \in \mathbb{R}^{t}$ is the vector of coefficients associated with the $k$th feature across different tasks. Then the hinge loss based multi-task model, i.e., Multi-Task l2,1 + l1-norm SVM (mtSVML21L1), can be defined by the following minimization problem:
$$\min_{W} \; \sum_{i=1}^{t} f(w_i, X_i, Y_i) + \alpha \|W\|_{2,1} + \beta \|W\|_{1} \tag{1}$$
where f is the hinge loss function as used in standard SVM [14] and defined as:
$$f(w_i, X_i, Y_i) = \sum_{j=1}^{n} \left(1 - y_j \left(w_i^{T} x_j + b\right)\right)_{+} \tag{2}$$
where $(a)_+ = \max(0, a)$ and $b$ is the bias term. The second term of (1), $\|W\|_{2,1} = \sum_{k=1}^{d} \|w^k\|_2$, is the structural sparsity term, i.e., the l2,1-norm regularization [4]. It drives many rows of the weight coefficient matrix toward zero while giving larger weights to the coefficients that are significant for all the tasks. This makes sense if all classification tasks share some common features, which may hold in our BD problem because some brain regions might be abnormal in all subgroups (BD I, BD II). On the other hand, each task may have specific features that are important for that task but unimportant for others. The l1-norm regularization term $\|W\|_1$, the sum of the absolute values of all entries of $W$, is therefore included in (1) to induce sparsity among tasks at the level of individual coefficients. This idea is illustrated in Fig. 1: Fig. 1A is the standard sparsity pattern, where the models for different tasks are built independently; Fig. 1B is the pattern learned by the model with only the l2,1-norm, which forces the models of all tasks to select a common set of features; Fig. 1C shows the pattern learned with the l2,1 + l1-norm, which yields sparse weight coefficients that are similar, but not identical, across tasks.
Fig. 1.
Illustrations of sparsity effects. Different colors indicate different weight coefficients. A) Standard sparsity; B) Model with only l2,1-norm; C) Model with l2,1 + l1-norm.
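To make the roles of the loss and the two regularizers concrete, the following is a minimal NumPy sketch of the objective in Section 2.1. It assumes the hinge loss is summed over the samples of each task and that each task has its own scalar bias; the function names (`hinge_loss`, `l21_norm`, `l1_norm`, `objective`) and the toy matrix are illustrative and not taken from the original implementation.

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Sum of hinge losses for one task; X is d x n, y holds labels in {+1, -1}."""
    margins = y * (X.T @ w + b)
    return np.maximum(0.0, 1.0 - margins).sum()

def l21_norm(W):
    """Sum over features (rows of W) of the l2 norm taken across tasks."""
    return np.linalg.norm(W, axis=1).sum()

def l1_norm(W):
    """Sum of the absolute values of all entries of W."""
    return np.abs(W).sum()

def objective(W, b, Xs, Ys, alpha, beta):
    """Multi-task hinge loss plus alpha * l2,1 and beta * l1 penalties."""
    loss = sum(hinge_loss(W[:, i], b[i], Xs[i], Ys[i]) for i in range(W.shape[1]))
    return loss + alpha * l21_norm(W) + beta * l1_norm(W)

# Toy weight matrix with d = 3 features (rows) and t = 2 tasks (columns).
W = np.array([[3.0, 4.0],   # shared feature: non-zero in both tasks
              [0.0, 0.0],   # irrelevant feature: an all-zero row
              [0.0, 2.0]])  # task-specific feature: non-zero in one task only
print(l21_norm(W))  # 7.0  (5 + 0 + 2)
print(l1_norm(W))   # 9.0  (3 + 4 + 2)
```

The all-zero row contributes nothing to either penalty, while the task-specific row is penalized by both. This is how the combination can favor the pattern of Fig. 1C over that of Fig. 1B.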
2.2 Solution
We use the optimal Stochastic Alternating Direction Method of Multipliers (SADMM) [15] to solve our l2,1 + l1-norm SVM problem. We first convert (1) to the following equivalent problem:
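$$\min_{W, Z} \; \sum_{i=1}^{t} f(w_i, X_i, Y_i) + \alpha \|Z\|_{2,1} + \beta \|Z\|_{1} \quad \text{s.t.} \quad Z - W = 0 \tag{3}$$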
This is a non-smooth but strongly convex problem. Let $f(w, \xi) = \max(0, 1 - y w^{T} x)$, where $\xi = \{x, y\}$ is a feature-label pair, and $h(Z) = \alpha \|Z\|_{2,1} + \beta \|Z\|_{1}$. The augmented Lagrangian is then:
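$$\hat{L}_{\mu}^{k}(W, Z, \lambda) = f(W_k) + \langle g_k, W \rangle + h(Z) - \langle \lambda, Z - W \rangle + \frac{\mu}{2} \|Z - W\|_F^2 + \frac{\|W - W_k\|_F^2}{2 \eta_k} \tag{4}$$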
where $g_k = f'(W_k, \xi_{k+1})$ is a stochastic sub-gradient of $f$ at the current search point $W_k$ of the $k$th iteration, $\lambda$ is the Lagrange multiplier, $\mu > 0$ is a penalty parameter, $\langle A, B \rangle = \mathrm{trace}(A^{T} B)$, and $\eta_k$ is the step size, set to $\eta_k = 2/(\gamma(k + 2))$ as in [15]. Applying SADMM to problem (4) produces closed-form updating rules as follows:
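$$W_{k+1} = \arg\min_{W} \hat{L}_{\mu}^{k}(W, Z_k, \lambda_k), \qquad Z_{k+1} = \arg\min_{Z} \hat{L}_{\mu}^{k}(W_{k+1}, Z, \lambda_k), \qquad \lambda_{k+1} = \lambda_k - \mu (Z_{k+1} - W_{k+1}) \tag{5}$$
Let $\partial \hat{L}_{\mu}^{k} / \partial W = 0$; we have $g_k + \lambda_k - \mu (Z_k - W) + (W - W_k)/\eta_k = 0$, and then we get the updating rule:
$$W_{k+1} = \frac{W_k + \eta_k (\mu Z_k - \lambda_k - g_k)}{1 + \eta_k \mu} \tag{6}$$
where $f'(w, \xi) = -y x$ if $y w^{T} x < 1$, and $0$ otherwise.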
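The W-step can be checked numerically. The sketch below assumes the W-dependent part of the augmented Lagrangian is ⟨g, W⟩ − ⟨λ, Z − W⟩ + (μ/2)‖Z − W‖² + ‖W − W_k‖²/(2η_k), a sign convention chosen to agree with the multiplier update in Step 5 of Algorithm 1. Under that assumption the minimizer is W = (W_k + η_k(μZ_k − λ_k − g_k)) / (1 + η_k μ). The helper names are hypothetical.

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """Sub-gradient of max(0, 1 - y * w^T x) with respect to w."""
    return -y * x if y * (w @ x) < 1.0 else np.zeros_like(w)

def step_size(k, gamma=2.0):
    """Step size eta_k = 2 / (gamma * (k + 2))."""
    return 2.0 / (gamma * (k + 2))

def update_W(W_k, Z_k, lam_k, G_k, mu, eta_k):
    """Closed-form minimizer of the linearized augmented Lagrangian in W."""
    return (W_k + eta_k * (mu * Z_k - lam_k - G_k)) / (1.0 + eta_k * mu)

rng = np.random.default_rng(0)
shape = (5, 4)                      # d = 5 features, t = 4 tasks
W_k = rng.normal(size=shape)
Z_k = rng.normal(size=shape)
lam_k = rng.normal(size=shape)
G_k = rng.normal(size=shape)        # stands in for the stacked sub-gradients
mu, eta_k = 0.5, step_size(k=0)

W_next = update_W(W_k, Z_k, lam_k, G_k, mu, eta_k)

# Stationarity check: the gradient with respect to W vanishes at W_next.
grad = G_k + lam_k - mu * (Z_k - W_next) + (W_next - W_k) / eta_k
print(np.abs(grad).max())  # zero up to floating-point error
```

Because the hinge loss is only linearized through the sub-gradient, the W-step reduces to an element-wise formula and needs no inner solver.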
The implementation of the method is summarized in Algorithm 1. Note that Step 4 is solved by utilizing the decomposition property [7]. For further details, see the Supplementary Material, in which we also provide a proof of the convergence property of the algorithm.
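The decomposition property mentioned above can be sketched as follows. The Z-step is assumed to be the proximal problem min_Z ½‖Z − V‖² + (α/μ)‖Z‖₂,₁ + (β/μ)‖Z‖₁ with V = W + λ/μ, which is what the sign convention of Step 5 implies. For this combined penalty, the proximal map is known to split into an element-wise soft-thresholding followed by a row-wise group shrinkage. Function names are illustrative.

```python
import numpy as np

def soft_threshold(V, tau):
    """Element-wise proximal map of tau * ||.||_1."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

def row_shrink(V, tau):
    """Row-wise proximal map of tau * ||.||_{2,1}: scale each row toward zero."""
    norms = np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    return np.maximum(1.0 - tau / norms, 0.0) * V

def update_Z(W_next, lam, mu, alpha, beta):
    """Z-step: soft-threshold the entries first, then shrink whole rows."""
    V = W_next + lam / mu
    return row_shrink(soft_threshold(V, beta / mu), alpha / mu)

V = np.array([[3.0, -4.0],   # strong in both tasks: survives, but shrunk
              [0.5, 0.2],    # weak everywhere: removed by the l1 step
              [0.0, 2.0]])   # moderate in one task: removed by the group step
Z = update_Z(V, np.zeros_like(V), mu=1.0, alpha=1.0, beta=1.0)
print(Z)
```

In this toy example only the first row stays non-zero. The second row is zeroed by the element-wise threshold alone, and the third is zeroed because its remaining l2 norm does not exceed the group threshold.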
Algorithm 1.
Multi-Task l2,1 + l1-norm SVM (mtSVML21L1)
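| Input: data matrix X, labels Y, and parameters α, β |
| Initialize: $W_0 = Z_0 = \lambda_0 = 0$, $\mu = 10^{-6}$, $\mu_{\max} = 10^{10}$, $\rho_0 = 1.1$, $\varepsilon = 10^{-8}$, $\gamma = 2$, maxIter $= 10^{3}$, $k = 0$ |
| Output: W |
| while not converged and $k <$ maxIter do |
| 1 $\eta_k = 2/(\gamma(k + 2))$ |
| 2 Obtain the stochastic sub-gradient $g_k$; build $\hat{L}_{\mu}^{k}$ via (4) |
| 3 Fix the others and update W by (6) |
| 4 Fix the others and update Z by: $Z_{k+1} = \arg\min_{Z} \frac{1}{2} \lVert Z - (W_{k+1} + \lambda_k/\mu) \rVert_F^2 + \frac{\alpha}{\mu} \lVert Z \rVert_{2,1} + \frac{\beta}{\mu} \lVert Z \rVert_{1}$ |
| 5 Update the multiplier λ by: $\lambda_{k+1} = \lambda_k - \mu (Z_{k+1} - W_{k+1})$ |
| 6 Update the parameter μ by: $\mu = \min(\rho_0 \mu, \mu_{\max})$ |
| 7 Check the convergence condition: $\lVert Z_{k+1} - W_{k+1} \rVert_{\infty} < \varepsilon$ |
| 8 $k = k + 1$ |
| end while |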
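Putting the steps together, Algorithm 1 might be sketched in NumPy as below. This is not the authors' implementation. It assumes the bias term is omitted, one random sample is drawn per task in each iteration to form the sub-gradient, and the W- and Z-steps take the closed forms implied by the sign convention of Step 5. The names `prox_l21_l1` and `mt_svm_l21_l1` are hypothetical.

```python
import numpy as np

def prox_l21_l1(V, tau_group, tau_l1):
    """Soft-threshold the entries, then shrink each row (feature) as a group."""
    S = np.sign(V) * np.maximum(np.abs(V) - tau_l1, 0.0)
    norms = np.maximum(np.linalg.norm(S, axis=1, keepdims=True), 1e-12)
    return np.maximum(1.0 - tau_group / norms, 0.0) * S

def mt_svm_l21_l1(Xs, Ys, alpha, beta, mu=1e-6, mu_max=1e10, rho=1.1,
                  eps=1e-8, gamma=2.0, max_iter=1000, seed=0):
    """Xs[i] is d x n_i, Ys[i] holds labels in {+1, -1}; returns W (d x t)."""
    rng = np.random.default_rng(seed)
    d, t = Xs[0].shape[0], len(Xs)
    W, Z, lam = np.zeros((d, t)), np.zeros((d, t)), np.zeros((d, t))
    for k in range(max_iter):
        eta = 2.0 / (gamma * (k + 2))                          # step 1
        G = np.zeros((d, t))                                   # step 2
        for i in range(t):
            j = rng.integers(Xs[i].shape[1])                   # one sample per task
            x, y = Xs[i][:, j], Ys[i][j]
            if y * (W[:, i] @ x) < 1.0:                        # margin violated
                G[:, i] = -y * x
        W = (W + eta * (mu * Z - lam - G)) / (1.0 + eta * mu)  # step 3
        Z = prox_l21_l1(W + lam / mu, alpha / mu, beta / mu)   # step 4
        lam = lam - mu * (Z - W)                               # step 5
        mu = min(rho * mu, mu_max)                             # step 6
        if np.abs(Z - W).max() < eps:                          # step 7
            break
    return W

# Toy usage: two identical tasks, three features, only feature 0 is informative.
rng = np.random.default_rng(1)
y = np.where(rng.random(40) < 0.5, -1.0, 1.0)
X = rng.normal(size=(3, 40))
X[0] += 2.0 * y
W = mt_svm_l21_l1([X, X], [y, y], alpha=0.01, beta=0.01)
print(W.shape)  # (3, 2)
```

The function returns W, as Algorithm 1 specifies. The auxiliary variable Z is the copy in which the proximal step produces exact zeros, so an implementation could equally expose Z for feature ranking.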
3 Results
3.1 Multi-Task Feature Learning on the Simulated Data
The Data
We generated three groups of images: 1) disease type I (D1) data, 2) disease type II (D2) data, and 3) normal control (NC) data. Each group had 30 samples, resulting in a total of 90 samples. All images are of size 100 × 100 and were generated as follows. For each normal image, the mean intensity lies in [0.8, 0.95], with added Gaussian noise. In the D1 and D2 images, there is an area of size 30 × 30 in which the values are decreased to [0.1, 0.6], again with added Gaussian noise. The locations of these patches in D1 and D2 are not identical, but they overlap in an area of size 20 × 20. The generation is illustrated in Fig. 2A.
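A data set of this kind could be generated along the following lines. The text does not give the patch positions, the noise level, or how intensities are drawn. The sketch therefore assumes patch corners at (30, 30) and (40, 40), which yields the stated 20 × 20 overlap, a noise standard deviation of 0.02, and one uniform draw per image for the background and patch levels. All of these are hypothetical choices.

```python
import numpy as np

SIZE, PATCH, N_PER_GROUP = 100, 30, 30
D1_CORNER, D2_CORNER = (30, 30), (40, 40)   # hypothetical positions, 20 x 20 overlap
NOISE_SD = 0.02                             # hypothetical noise level

def make_image(rng, corner=None):
    """One 100 x 100 image; `corner` is the top-left corner of the abnormal patch."""
    img = np.full((SIZE, SIZE), rng.uniform(0.8, 0.95))
    if corner is not None:
        r, c = corner
        img[r:r + PATCH, c:c + PATCH] = rng.uniform(0.1, 0.6)
    return img + rng.normal(0.0, NOISE_SD, size=img.shape)

def make_group(rng, corner=None, n=N_PER_GROUP):
    """Stack n images into an array of shape (n, SIZE, SIZE)."""
    return np.stack([make_image(rng, corner) for _ in range(n)])

def patch_mask(corner):
    """Boolean mask marking the 30 x 30 patch at the given corner."""
    mask = np.zeros((SIZE, SIZE), dtype=bool)
    r, c = corner
    mask[r:r + PATCH, c:c + PATCH] = True
    return mask

rng = np.random.default_rng(0)
nc = make_group(rng)
d1 = make_group(rng, D1_CORNER)
d2 = make_group(rng, D2_CORNER)

overlap = patch_mask(D1_CORNER) & patch_mask(D2_CORNER)
print(nc.shape, d1.shape, d2.shape)  # (30, 100, 100) for each group
print(overlap.sum())                 # 400, i.e. 20 x 20

# Flatten to the d x n layout of Section 2.1 for one task, here D1 vs NC.
X_task = np.concatenate([d1, nc]).reshape(2 * N_PER_GROUP, -1).T
y_task = np.concatenate([np.ones(N_PER_GROUP), -np.ones(N_PER_GROUP)])
```

Each pixel becomes one feature, so the learned weight column of a task can be reshaped back to 100 × 100 and displayed as in Fig. 2B.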
Fig. 2.
Simulated data and results. A) Data generation; B) Learned weight coefficients.
The Results
The results obtained by mtSVML21L1 are shown in Fig. 2B. Both disease patterns are identified in the comparison of D1+D2 vs NC (Task 1), with their overlapping area highlighted most strongly. In Task 2, the abnormal patch in D1 is identified, while in Task 3, the abnormal patch in D2 is well marked. In Task 4, only the differences between D1 and D2 are highlighted. Taken together, the simulation results show that the proposed method recovers both the shared and the task-specific patterns in multi-task feature learning.
3.2 Multi-Task Classification on the Bipolar Disorder Data
The Data
We evaluated the proposed method using structural brain MRIs of Bipolar Disorder (BD), a typical heterogeneous neuropsychiatric illness. Of the 71 subjects in total, 44 were treatment-naive patients with BD and 27 were age- and gender-matched normal controls (NC). According to the DSM-IV criteria, each patient was assigned to the BD I (22 subjects) or BD II (22 subjects) subgroup. Details on demographic characteristics, image acquisition, and preprocessing can be found in [16]. T1-weighted images were preprocessed in several steps [16], including 1) AC-PC plane alignment; 2) skull removal; 3) tissue segmentation into gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF); and 4) high-dimensional image warping to a standard MNI space, resulting in mass-preserved tissue density maps.
Experimental Design
Based on the voxel-wise GM tissue density values, we performed four classification tasks: 1) Task 1: BD vs NC, 2) Task 2: BD I vs NC, 3) Task 3: BD II vs NC, and 4) Task 4: BD I vs BD II. For each task, we ranked the features by the absolute values of the weight coefficients W and selected the K top-ranked features [6][12], which were then fed to a linear SVM for the final binary classification. Besides the proposed mtSVML21L1 method, we evaluated the following comparative methods: 1) stLSL1: Single Task l1-norm Least Square (LS) loss function feature selection; 2) stSVML1: Single Task l1-norm SVM feature selection; 3) mtLSL21: Multi-Task l2,1-norm LS loss function feature selection [6][7]; 4) mtSVML21: Multi-Task l2,1-norm SVM, i.e., the case in which only the l2,1-norm term is included in Equation (1); 5) mtLSL21L1: Multi-Task l2,1 + l1-norm LS loss function [17] feature selection. mtLSL21L1 is built upon mtLSL21 by adding the l1-norm and is solved using the Accelerated Proximal Gradient (APG) [7] method. To compare all methods, we used 5-fold cross-validation: in each fold, four random subsets were used for training and the remaining subset for testing.
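The feature ranking and the fold construction described above can be sketched as follows. The final linear SVM is left out, and the helper names (`top_k_features`, `k_fold_indices`) as well as the toy weight matrix are illustrative assumptions. The text does not say how the random subsets were drawn, so a plain random permutation is used here.

```python
import numpy as np

def top_k_features(W, task, k):
    """Indices of the k features with the largest absolute weight for one task."""
    order = np.argsort(-np.abs(W[:, task]), kind="stable")
    return order[:k]

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, test) index arrays for a random n_folds split."""
    perm = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(perm, n_folds)
    for i in range(n_folds):
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train, folds[i]

# Toy weight matrix: 4 features (rows), 2 tasks (columns).
W = np.array([[0.1, -2.0],
              [-3.0, 0.0],
              [0.5, 1.0],
              [0.0, 0.2]])
print(top_k_features(W, task=0, k=2))  # [1 2]
print(top_k_features(W, task=1, k=2))  # [0 2]

# Fold sizes for the 71 subjects of Task 1 (BD vs NC).
for train, test in k_fold_indices(71):
    print(len(train), len(test))
```

In a full pipeline, the rows of the data matrix selected by `top_k_features` would be passed to the linear SVM. Learning W on the training indices only keeps the test fold out of the feature ranking.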
Parameters Tuning
The above methods fall into three groups according to their regularization terms: 1) l1-norm: stLSL1 and stSVML1; 2) l2,1-norm: mtLSL21 and mtSVML21; 3) l2,1 + l1-norm: mtLSL21L1 and mtSVML21L1. They depend on the parameters α and/or β, which regulate the effects of the l2,1 and/or l1 terms, respectively. We searched these over α, β ∈ {$10^{-5}$, ···, $10^{-1}$, 0.5, 1, $10^{1}$, ···, $10^{5}$}. Another important parameter is K, i.e., the number of features selected for each task. In our experiments, K was varied in the range [5, 5500].
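For concreteness, the search grid for α and β can be enumerated as below. This is only a sketch of the stated range. The step between powers of ten is assumed to be one decade, and the grid of K values is not reproduced because the text gives only its end points.

```python
import itertools

low_exponents = range(-5, 0)    # 10^-5, ..., 10^-1
high_exponents = range(1, 6)    # 10^1, ..., 10^5
grid = ([10.0 ** e for e in low_exponents] + [0.5, 1.0]
        + [10.0 ** e for e in high_exponents])

# Methods with a single regularizer search one parameter (12 settings);
# the l2,1 + l1 methods search all (alpha, beta) pairs.
pairs = list(itertools.product(grid, grid))
print(len(grid), len(pairs))  # 12 144
```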
Classification Results
The optimal classification accuracy (ACC) and area under the curve (AUC) of all methods are listed in Table 1. As shown, the multi-task methods performed better than the single-task ones. Among all the methods, mtSVML21L1 achieved the highest ACC and AUC in every task. The fact that mtSVML21L1 outperformed mtSVML21 reveals the benefit of characterizing specific patterns related to different tasks. The heterogeneity within the BD group may explain the lower performance on BD vs NC than on BD I vs NC and BD II vs NC. In addition, we find that BD I vs BD II is the most difficult task and likely requires a much larger training set.
Table 1.
The ACCs (%) and the AUCs of the competing methods, calculated from four different tasks, respectively. The right columns list the average values.
| Methods | Task 1 ACC | Task 1 AUC | Task 2 ACC | Task 2 AUC | Task 3 ACC | Task 3 AUC | Task 4 ACC | Task 4 AUC | ACC (avg) | AUC (avg) |
|---|---|---|---|---|---|---|---|---|---|---|
| stLSL1 | 59.23 | 0.49 | 51.78 | 0.54 | 48.79 | 0.44 | 46.80 | 0.42 | 51.65 | 0.47 |
| stSVML1 | 56.48 | 0.47 | 51.79 | 0.54 | 53.52 | 0.43 | 50.81 | 0.45 | 53.15 | 0.48 |
| mtLSL21 | 67.52 | 0.61 | 66.75 | 0.62 | 58.52 | 0.56 | 68.52 | 0.61 | 65.33 | 0.60 |
| mtSVML21 | 70.38 | 0.66 | 74.78 | 0.73 | 72.43 | 0.73 | 71.42 | 0.61 | 72.25 | 0.68 |
| mtLSL21L1 | 76.00 | 0.64 | 74.83 | 0.64 | 74.28 | 0.61 | 72.34 | 0.63 | 74.36 | 0.63 |
| mtSVML21L1 | 78.95 | 0.67 | 84.23 | 0.76 | 84.18 | 0.77 | 78.35 | 0.71 | 81.42 | 0.72 |
Feature Interpretations
We overlaid the weight coefficients obtained by mtSVML21L1 onto the standard template for visual inspection. Representative sections are displayed in Fig. 3. BD I and BD II share similar patterns of GM abnormalities around the Frontal Pole (Fig. 3A) and the Precuneus (Fig. 3B), which are present in the results of Tasks 1, 2, and 3 (namely, BD vs NC, BD I vs NC, and BD II vs NC) but not in Task 4 (BD I vs BD II). Relative to BD I vs NC (Task 2), BD II vs NC (Task 3) showed more widespread patterns, including not only the Frontal Pole and the Precuneus but also additional signal around the Cerebellum (Fig. 3C) and the Middle Frontal Gyrus (Fig. 3D); these differences were further confirmed by the direct comparison between BD I and BD II, that is, Task 4.
Fig. 3.
Representative slices of regions, including A) Frontal Pole, B) Precuneus, C) Cerebellum, and D) Middle Frontal Gyrus, obtained from all four tasks. The scale indicates the absolute values of weights.
4 Conclusions
In this paper, we propose a novel method named Multi-Task l2,1 + l1-norm Support Vector Machine (mtSVML21L1) for classifying Bipolar Disorder (BD) in the presence of phenotypic heterogeneity. We adopt the framework of multi-task hinge loss with sparsity regularization terms to jointly learn features that are shared among all the tasks as well as features that characterize specific patterns in each task. Experimental results show that, compared with other state-of-the-art methods, the proposed method achieves the best performance across the multiple tasks, and also yields better results than previous work on MRI-based classification in BD [16]. Furthermore, the features learned by the proposed method reveal heterogeneous patterns of structural abnormalities across the different tasks. Taken together, the proposed method offers insight into the neurobiological basis of the disorder's clinical heterogeneity and is a step toward individual-level patient stratification.
Acknowledgments
This work was supported in part by NIH R01AG14971, CNPq-Brazil & NARSAD (for clinical data), and FAPESP 13/03905-4 (to M.V.Z.).
References
- 1. Dunner DL, Gershon ES, Goodwin FK. Heritable factors in the severity of affective illness. Biological Psychiatry. 1976;11(1):31–42.
- 2. Caruana R. Multitask learning. Machine Learning. 1997;28(1):41–75.
- 3. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B. 1996;58(1):267–288.
- 4. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B. 2006;68(1):49–67.
- 5. Evgeniou A, Pontil M. Multi-task feature learning. NIPS. 2007:41–48.
- 6. Wang H, Nie F, Huang H, Risacher S, et al. Sparse multi-task regression and feature selection to identify brain imaging predictors for memory performance. ICCV. 2011:557–562. doi: 10.1109/ICCV.2011.6126288.
- 7. Zhou J, Liu J, Narayan VA, Ye J. Modeling disease progression via multi-task learning. NeuroImage. 2013;78:233–248. doi: 10.1016/j.neuroimage.2013.03.073.
- 8. Rao N, Cox C, Nowak R, Rogers TT. Sparse overlapping sets lasso for multitask learning and its application to fMRI analysis. NIPS. 2013:2202–2210.
- 9. Jie B, Zhang D, Cheng B, Shen D. Manifold regularized multitask feature learning for multimodality disease classification. Human Brain Mapping. 2015;36(2):489–507. doi: 10.1002/hbm.22642.
- 10. Metsis V, Makedon F, Shen D, Huang H. DNA copy number selection using robust structured sparsity-inducing norms. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2014;11(1):168–181. doi: 10.1109/TCBB.2013.141.
- 11. Nie F, Huang H, Cai X, Ding CH. Efficient and robust feature selection via joint l2,1-norms minimization. NIPS. 2010:1813–1821.
- 12. Cai X, Nie F, Huang H, Ding C. Multi-class l2,1-norm support vector machine. ICDM. 2011:91–100.
- 13. Chang CC, Lin CJ. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011;2(3):28–55.
- 14. Scholkopf B, Smola AJ. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press; 2002.
- 15. Azadi S, Sra S. Towards an optimal stochastic alternating direction method of multipliers. ICML. 2014:620–628.
- 16. Serpa MH, Ou Y, Schaufelberger MS, Doshi J, et al. Neuroanatomical classification in a population-based sample of psychotic major depression and bipolar I disorder with 1 year of diagnostic stability. BioMed Research International. 2014. doi: 10.1155/2014/706157.
- 17. Simon N, Friedman J, Hastie T, Tibshirani R. A sparse-group lasso. Journal of Computational and Graphical Statistics. 2013;22(2):231–245.