Abstract
Feature selection (FS) methods play two important roles in the context of neuroimaging-based classification: they can potentially increase classification accuracy by eliminating irrelevant features from the model, and they facilitate interpretation by identifying the sets of meaningful features that best discriminate the classes. Although the development of FS techniques specifically tuned for neuroimaging data is an active area of research, to date most studies have focused on finding a subset of features that maximizes accuracy. However, maximizing accuracy does not guarantee reliable interpretation, as similar accuracies can be obtained from distinct sets of features. In the current paper we propose a new approach for selecting features, SCoRS (Survival Count on Random Subsamples), based on the recently proposed stability selection theory. SCoRS relies on the idea of choosing relevant features that are stable under data perturbation. Data are perturbed by iteratively sub-sampling both features (subspaces) and examples. We demonstrate the potential of the proposed method in a clinical application to classify depressed patients versus healthy individuals based on functional magnetic resonance imaging data acquired during visualization of happy faces.
Keywords: Classification, classification accuracy, depression, faces visualization, feature selection, functional magnetic resonance imaging (fMRI), machine learning, multivariate mapping, regression, support vector machines
I. Introduction
In the last few years there has been increasing interest from the neuroimaging community in pattern recognition as an exploratory approach with potential for clinical applications. In the clinical context, classification methods can be useful for aiding diagnosis and prognosis. Although recent studies have shown very promising results in this area [1], there are still challenges ahead. Increasingly, neuroimaging-based classification and regression models aim not only to predict well, but also to provide insight into the anatomical or functional features that drive the predictions [2]. In neuroimaging-based classification, applications involving a small number of training examples compared to a much larger number of features are very common, so the resulting classifier is likely to capture irrelevant patterns and exhibit limited generalization performance. According to [3], one of the major challenges of multivariate pattern analysis based on functional magnetic resonance imaging (fMRI) lies in the fact that the data usually contain a large number of uninformative, noisy voxels that do not carry useful information about the category label.
Feature selection (FS) comprises techniques for choosing a subset of relevant features with the aim of building robust learning models. In general these techniques are applied as a step preceding the classification algorithm in an attempt to improve prediction accuracy. Moreover, they can also identify sets of meaningful features that best discriminate the classes. They therefore bring two potential benefits: mitigating the curse of dimensionality in order to improve prediction performance, and facilitating interpretation.
A number of FS methods have been developed in the bioinformatic domain, particularly using multivariate approaches to account for interactions among genes, such as CFS (Correlation-based FS) [4], MRMR (Minimum Redundancy-Maximum Relevance) [5] and USC (Uncorrelated Shrunken Centroid) [6]. Interesting reviews addressing aspects of FS and its applications can be found in [7] and [8]. They present examples that illustrate the usefulness of selecting subsets of variables that jointly have good predictive power, as opposed to ranking variables according to their individual performance. In addition, they categorize existing techniques as wrappers, filters, and embedded methods. Briefly, wrappers use a learning machine as a black box to score subsets of variables according to their predictive power. Filters select subsets of variables as a pre-processing step, ranking features independently of the chosen classifier. Embedded methods perform variable selection during the training process and are usually specific to a particular learning machine.
In neuroimaging-based classification, a variety of these FS techniques have been proposed. However, it is important to emphasize that in the context of neuroimaging data the problem of having many more features than examples is usually more severe than in other domains, especially in exploratory applications using the whole brain. These applications often involve hundreds of thousands of voxels for typically tens or a few hundred scans, which makes them very challenging problems.
Some approaches classically used for FS in neuroimaging include univariate methods (e.g. t-test, ANOVA, Wilcoxon) as filters to select features for classification ([9], [10]), as well as multivariate approaches, e.g. recursive feature elimination ([7], [11]), hybrid FS with non-linear SVM classification [12], reverse feature elimination methods [13], sparse logistic regression [14] and the perturbation method [15]. Additionally, alternative feature extraction approaches based on neuroanatomical landmarks have also been applied to neuroimaging (e.g. using summarizations from regions of interest, as in [16]). The latter approach can produce interesting results when there is prior knowledge about anatomical regions or brain tissues (i.e. gray or white matter) involved in the specific disorder studied. On the other hand, it might not be suitable for more exploratory approaches.
Some authors have also referred to the searchlight technique as a FS method ([11], [17], [18]). Searchlight [19] is a technique proposed for multivariate mapping based on local neighborhoods. In this approach the analysis is performed for each voxel (as in univariate analysis), but the voxels within a neighborhood are included in the feature set for joint multivariate analysis. The result of the local multivariate analysis (e.g. accuracy) is then stored for each voxel. By visiting all voxels and analyzing their respective (partially overlapping) neighbourhoods, one obtains a whole-brain map of accuracies. Since the multivariate analysis operates within each voxel's neighborhood, this approach is also called "local pattern effects" mapping. While the searchlight mapping approach is very attractive, it only explores local relationships and does not account for long-distance, spatially distributed patterns.
A more recent method that has some similarities with searchlight is the Optimally-Discriminative Voxel-Based Analysis (ODVBA) [20]. ODVBA is a framework proposed to determine the optimal spatially adaptive smoothing. In a voxel-based group analysis, the authors showed that the approach was able to describe the shape and localization of structural abnormalities using both simulated and real data. The approach can also be considered a FS method, as the regional clusters associated to the highest group differences can be used as input features to a classifier.
In the present work we introduce a new multivariate method to select relevant features in neuroimaging. The proposed method (SCoRS - Survival Count on Random Subsamples) is based on iterative random sub-sampling of both features (subspaces) and examples. Repetitive application of a L1-norm regression to these sub-samples enables the selection of features that survive after many iterations (expected to be stable under perturbation). It is a novel application of a theory described as stability selection [21] with adaptations designed for the particular characteristics of neuroimaging data. Its rationale is based on the “survival” frequency after many iterations instead of relying on the coefficient values resulting from the L1-norm regression.
SCoRS is a global approach, as no spatial constraints are applied. It thus differs from other recently proposed approaches that include spatial adaptation, as in [22] and [2]. The latter rely on priors to express that not all image locations may be equally relevant for making predictions about a specific experimental or clinical condition, and that biologically connected areas may be more similar in prediction relevance than unrelated ones.
We applied the proposed method to a classification problem with the aim of discriminating depressed patients versus healthy individuals based on fMRI data acquired during visualization of happy faces. In addition, we compared SCoRS with three other FS approaches previously applied to neuroimaging data: recursive feature elimination (RFE-SVM), Gini contrast, and t-test. The results are compared in terms of classification accuracy, overlap of selected features across cross-validation folds, false selection estimation and spatial location of the selected features.
II. Material and Methods
We start this section by defining the basic terminology that will be used throughout the text. Then we review some FS approaches previously applied to neuroimaging data (Section II-A). In Section II-B we describe the proposed method in detail, as well as its underlying theory. Next we explain the cross-validation (CV) framework used in the current work (Section II-C), the measure implemented for evaluating the overlap of selected features across cross-validation folds (Section II-D) and the procedure used for estimating the rate of false selection (Section II-E). Finally, we describe the dataset used for illustrating the results of all methods discussed (Section II-F).
Notation
Feature
In the present work, each feature corresponds to a single voxel within the brain containing BOLD signal from fMRI;
Example
A vector x = (x1, ⋯ , xp), where each element xj corresponds to a particular feature and p represents the number of features;
DataMatrix
An n × p matrix X, where n represents the number of examples and xij corresponds to the value of the jth feature in the ith example.
LabelsVector
A vector y = (y1, ⋯ , yn), where each element yi corresponds to the label associated with a particular example. Labels can be categorical (for classification applications) or continuous (for regression applications). In the present work we illustrate the proposed FS method on a binary classification problem (depressed patients vs healthy controls), using labels 1 and −1, respectively.
A. Related Work
In this section we review three previously proposed approaches for FS in neuroimaging, whose properties and results will be compared to SCoRS: Recursive Feature Elimination (RFE-SVM), Gini Contrast and t-test.
1) Recursive Feature Elimination
This method was first proposed by [23] for selecting genes from micro-array data. In [11], the authors described it in the context of an fMRI application to discriminate cognitive tasks. RFE-SVM consists of recursively applying a classification method and, at each iteration, discarding the feature that least contributes to the classification according to the model's coefficients (e.g., SVM weights). This process is repeated until the accuracy drops or until all features are discarded, and the optimal number of features corresponds to the one that resulted in the highest classification accuracy. However, in high-dimensional problems it is usually impractical to discard only a single feature in each iteration, so a step-size (the number of features eliminated in each iteration) is commonly used.
In this work we implemented RFE-SVM with a fixed step-size of 10000 voxels, but when the number of features drops to less than 10000, the step size is decreased to 20% of the number of remaining voxels, which ultimately results in more sparse voxel sets. As we used a nested cross-validation procedure (described in section II-C), the average optimal number of remaining features across the internal loop is used in the external loop for the generalization test.
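The elimination loop above can be sketched in a few lines. The following is a minimal illustration, not the implementation used in the paper: ordinary least-squares coefficients stand in for the SVM weights (an assumption made for compactness), and the step-size schedule mirrors the one described (a fixed cap for large feature sets, then 20% of the remaining features).

```python
import numpy as np

def rfe_select(X, y, fit_weights, step_frac=0.2, step_cap=10000, n_keep=2):
    """Recursive feature elimination sketch: repeatedly fit a linear model
    and drop the features with the smallest absolute weights."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        w = fit_weights(X[:, active], y)
        # step size: fixed cap for large sets, else 20% of remaining features
        step = step_cap if len(active) > step_cap else max(1, int(step_frac * len(active)))
        step = min(step, len(active) - n_keep)
        order = np.argsort(np.abs(w))   # least important features first
        drop = set(order[:step])
        active = [f for i, f in enumerate(active) if i not in drop]
    return active

def lstsq_weights(X, y):
    # stand-in for SVM weights: ordinary least-squares coefficients
    return np.linalg.lstsq(X, y, rcond=None)[0]
```

In the paper's nested CV setting, the stopping point would instead be chosen by inner-loop accuracy rather than a fixed `n_keep`.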
RFE-SVM has become a benchmark multivariate FS approach for classification in neuroimaging. It has been used in several applications, most of them embedding SVM ([24], [25], [26]). However, there has been some criticism of this method. According to [3], it is not clear whether the ranking provided by the initially trained classifier is a reliable measure for the elimination of voxels.
2) Gini Contrast
The application of this approach to detect stable distributed patterns of brain activation was proposed by [3] for discriminating complex visual stimuli based on fMRI data. In this study the authors discussed the benefits of using Random Forest classifiers and the associated Gini importance as a framework for classification. They also demonstrated that the spatial patterns detected with Gini contrast provided higher classification accuracy and higher reproducibility across runs when compared with patterns obtained using RFE-SVM in conjunction with SVM. An important and distinctive characteristic of classification using Random Forests based on Gini contrast is its inherent potential for multiclass discrimination.
In a Random Forest, each decision tree is trained with a random subset of examples from the training set. In order to build each node of a tree, the algorithm searches over a random subset of features to maximize the separation among the different classes. The features are tested for their ability to separate classes, conditioned on the decisions at the higher levels of the tree.
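The node-splitting step can be illustrated with a small sketch of the Gini impurity and an exhaustive threshold search over a random subset of features (a simplified, hypothetical stand-in for the actual Random Forest implementation used in the paper; the per-split impurity decrease is what accumulates into the Gini importance).

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, feature_ids):
    """Search a subset of features for the threshold that most reduces
    Gini impurity, as done at each node of a random-forest tree."""
    parent = gini(y)
    best = (None, None, 0.0)  # (feature, threshold, impurity decrease)
    for j in feature_ids:
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            gain = parent - child  # this decrease feeds the Gini importance
            if gain > best[2]:
                best = (j, t, gain)
    return best
```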
In the present study we applied Gini contrast (GC) using the same implementation described in [3], available in Matlab (http://code.google.com/p/randomforest-matlab/). There are two basic parameters to be set: the number of trees (ntree) and the subspace size (mtry). According to [3], the results appear quite robust to changes in the parameter values. We set the subspace size to its default value (the square root of the total number of features). In [3], ntree is equal to the total number of features. However, considering both the high dimensionality of our problem (we are using all voxels within the brain) and our framework (nested cross-validation for optimizing the number of features), we set ntree to 1/5 of the number of features, as the computational cost would otherwise be unfeasible. Additionally, the parameter nodesize (the number of features in the terminal nodes of the trees) was set to 100 voxels, as only a few levels are necessary in order to capture multivariate relationships.
For choosing the optimal number of features in the nested cross-validation framework, we considered a range of feature-set sizes obtained by iteratively dividing the number of features by 2 (as performed in [3]).
The selection of features proposed by [3] is closely related to the approach we are proposing, in the sense that both work on random sub-samplings of features and examples, although the rankings are obtained through very different procedures. Particular differences among all the methods considered in the present work are discussed at the end of this section.
3) t-test
For completeness we also included a univariate approach in our comparison of FS methods. In this approach, a paired t-test for finding statistical differences between the classes was performed for each feature (voxel). Therefore, each feature is tested against the null hypothesis that there is no difference between the means of the two classes.
The degrees of freedom of the paired t-test are determined by the number of training examples and are used to decide whether the t-statistic reaches the threshold of significance. As for the other methods, we implemented a nested cross-validation in order to choose the optimal significance level. For the t-test we used a range of significance levels varying from 0.01 to 0.1.
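As an illustration, a voxelwise paired-t filter might look as follows. Here `t_crit` is a hypothetical critical value derived from the chosen significance level and the degrees of freedom; the sketch omits the p-value computation and assumes the rows of the two class matrices are already matched pairs.

```python
import numpy as np

def paired_t(a, b):
    """Paired t-statistic for matched samples a and b (one value per pair)."""
    d = np.asarray(a, float) - np.asarray(b, float)
    n = len(d)
    return d.mean() / (d.std(ddof=1) / np.sqrt(n))  # df = n - 1

def ttest_filter(X_class1, X_class2, t_crit):
    """Keep features whose |t| exceeds a critical value t_crit (which, in the
    paper, follows from a significance level tuned by nested CV)."""
    t = np.array([paired_t(X_class1[:, j], X_class2[:, j])
                  for j in range(X_class1.shape[1])])
    return np.where(np.abs(t) > t_crit)[0], t
```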
B. Proposed Approach: Detecting Distributed Patterns With SCoRS
1) Sparse Models
Sparse methods estimate solutions in which only a few features are considered relevant, thereby producing more easily interpretable models. One example of a sparse model is the Least Absolute Shrinkage and Selection Operator (LASSO) [27], a regression approach similar to ordinary least squares (OLS) regression in the sense that it aims to minimize the residual squared error. However, the LASSO formulation includes an additional L1-norm penalty on the absolute sum of all coefficients, forcing some of them to be shrunken and others to be set to zero, thus producing sparse models according to equation (1), where β̂ is the LASSO estimate, p is the number of features and λ ∈ ℝ+ is a regularisation parameter that determines the model sparseness:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j| \tag{1}$$
Although the parameter λ controls the amount of shrinkage applied to the estimates, the total number of non-zero coefficients is bounded by the number of examples. This property produces extremely sparse results for highly ill-posed problems (such as in neuroimaging, where the number of features largely exceeds the number of examples). Additionally, in datasets containing many correlated relevant variables, LASSO will tend to include only one representative variable from each cluster of correlated variables [28].
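For concreteness, the shrink-to-zero behaviour of the L1 penalty can be reproduced with a minimal cyclic coordinate-descent solver. This is a sketch of the penalised form, not the reference implementation of [30]; the soft-thresholding update is what sets coefficients exactly to zero.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimal LASSO via cyclic coordinate descent with soft-thresholding."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual
            z = X[:, j] @ r_j
            # soft-threshold: coefficients with |z| <= lam are set exactly to 0
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]
    return beta
```

Run on a p > n problem, most coefficients come out exactly zero, which is the property SCoRS exploits at every iteration.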
2) Stability Selection and the Randomized LASSO
The stability selection theory, recently proposed by [21], is a general approach to problems of variable selection or discrete structure estimation (such as graphs or clusters). The properties of this approach are particularly beneficial for applications involving high-dimensional data, especially in cases where the number of variables or covariates p largely exceeds the number of examples n (i.e. the p >> n case).
In the stability selection framework, the data are perturbed several times (for example by iteratively sub-sampling the examples). For each perturbation, a method that produces sparse coefficients is applied to a sub-sample of the data. After a large number of iterations, all features that were selected in a large fraction of the perturbations are chosen. Finally, a cutoff threshold (0 < thr < 1) is applied in order to select the most stable features.
According to the stability selection theory, for every set K ⊆ {1, ⋯ , p}, the probability of K being contained in the set Ŝ(I) selected from a random subsample is defined as

$$\hat{\Pi}_{K} = P^{*}\{K \subseteq \hat{S}(I)\} \tag{2}$$

where I is a random subsample of {1, ⋯ , n} of size ⌊n/2⌋, drawn without replacement. According to [21], the probability P* in definition (2) accounts for both the random sub-sampling and any other sources of randomness.
It is important to emphasize that according to stability selection theory, any regression method which produces sparse results can be used to select the features, as one is interested in the frequency of selections and not in the sparsity inherent to specific methods.
In [21], the authors used the LASSO to demonstrate the properties of the stability selection framework in an application to select relevant features in a vitamin gene expression data set. The data set consisted of 115 examples and 4088 features. The authors permuted 4082 features and applied stability selection to find the remaining six relevant features.
The original formulation of the stability selection theory is based on sub-sampling of examples (as in bootstrapping procedures). However, the authors also proposed a modified version of the original framework, which they called the Randomised LASSO. In this approach, instead of penalizing the absolute value |βk| of every component with a penalty term proportional to λ (as in equation (1)), the Randomised LASSO changes the penalty to a randomly chosen value in a predefined range, weighting each component by a random Wk:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\sum_{k=1}^{p}\frac{|\beta_k|}{W_k} \tag{3}$$
The re-weighting is not based on any previous estimate, but is simply chosen randomly. According to [21], applying this random re-weighting several times and looking for variables that are chosen often turns out to be a very powerful procedure. They showed the superiority of the Randomized LASSO over the original stability selection formulation on the vitamin data set: using the Randomized LASSO, the six non-permuted features were selected and far fewer permuted features were included in the selected set (i.e. the number of false positive selections was lower than in the original formulation).
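The random re-weighting admits a compact implementation: penalising |βk| by λ/Wk is equivalent to running a plain LASSO on a design matrix whose k-th column has been scaled by Wk, so only the weight drawing and column rescaling need to be added. The sketch below assumes weights drawn uniformly from [α, 1], following [21].

```python
import numpy as np

def randomized_penalty_weights(p, alpha=0.5, rng=None):
    """Draw per-feature weights W_k in [alpha, 1] for the Randomised LASSO."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.uniform(alpha, 1.0, size=p)

def reweight_columns(X, W):
    """Scaling column k by W_k makes a plain LASSO on the result equivalent
    to penalising |beta_k| by lam / W_k (the support is unchanged)."""
    return X * W  # broadcast: column k multiplied by W[k]
```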
In our dataset the number of examples is 240 and the number of features is around 220000. Applying the original framework to our problem produced overly sparse selections that did not fully account for the problem of correlated predictors. We have therefore adapted the framework as described in the following section.
3) SCoRS Algorithm
The proposed method consists of successive applications of a sparse regression method to sub-samplings of both examples and features obtained randomly from the data. We use the LASSO to select a subset of relevant features in each sub-sampling (i.e. at each iteration only a few features will have regression coefficients different from zero). After many iterations we can select the features that presented non-zero coefficients more often, i.e. the features that survive after several iterations.
In Algorithm 1, p is the total number of features, ntrain is the number of training examples, βi is the coefficient of feature i and R is the total number of repetitions. Vectors c and s are counters for, respectively, the number of times each feature was randomly chosen (c) and the number of times each feature was selected by LASSO (s).
| Algorithm 1 SCoRS |
| X ← DataMatrix(ntrain, p); |
| Y ← LabelsVector(ntrain); |
| r ← 0; |
| si = 0 and ci = 0, ∀i, i = 1 : p; |
| repeat |
| Randomly select a subset of features rp out of p; |
| ci(rp) ← ci(rp) + 1; |
| Randomly select a subset of examples rn out of ntrain; |
| RX ← X(rn, rp); |
| Apply regression to RX; |
| si ← si + 1, ∀i such that βi ≠ 0; |
| r ← r + 1; |
| until r = R |
| Select feature i if (si./ci) > th, where 0 < th < 1 is a threshold value; |
The size of the subspaces is set with respect to the total number of features and examples in the data set. In a previous work [29] we performed experiments with different combinations of numbers of repetitions (R) and subspace sizes (S) through a grid search using three real datasets with diverse characteristics. Our experiments suggested that smaller subspaces result in better classification accuracies. Based on this investigation of the parameters we set R to 10% and S to 0.5% of the number of features, respectively.
Regarding the sub-sampling of examples, since there were only 30 subjects in our data set, at each iteration we left one third of the subjects out, instead of half as described in the original stability selection framework.
In the present work we used LASSO implementation as described in [30].
Figure 1 shows a flowchart representing SCoRS in a classification framework. After the thresholding step, the surviving features are used to define the selected set for classification or regression (for categorical or continuous labels, respectively). In the present work we used a Support Vector Machine (SVM) classifier ([31], [32]) with a linear kernel, implemented using LIBSVM [33] with the parameter C fixed to 1. The set of selected voxels, along with their selection frequencies, can be visualized as a map displaying the regions that together are relevant for the classification (we call it a relevance map, as represented in figure 1).
Fig. 1.
Representation of the framework containing the proposed FS method inside a cross-validation loop in order to use it for classification.
In order to optimize the frequency threshold value, we set a range of 9 values from 0.1 to 0.9 and used a nested cross-validation, as described in section II-C, for the optimisation. For example, if for a given fold the threshold that produced the best accuracy was 0.6, only features that were selected in at least 60% of the iterations survive.
An important aspect to be emphasised is that at each iteration the subset of features is randomly selected from the complete set of features (i.e. the whole brain) without spatial constraints.
SCoRS is expected to perform well in practice due to the simultaneous sub-sampling of examples and features. Data perturbation through random sub-sampling of examples, followed by thresholding of the survival frequency, enables the selection of stable features [21]. However, in neuroimaging the features (voxels) are expected to be highly correlated; the additional random sub-sampling of features is therefore advantageous as it decreases the amount of correlation among them. This procedure approximates the Randomised LASSO approach ([34], [21]). The recombination of features in different random subsets favours the survival of the most relevant ones.
Table I summarises the main differences among the methods compared in the present work.
TABLE I.
Technical details of FS methods discussed
| Method | Properties | Selection criterion |
|---|---|---|
| SCoRS 1 | Multivariate Linear | Survival selection after successive applications of sparse regression to random subsets of features and examples. |
| RFE-SVM 2 | Multivariate Linear | Recursive elimination of features based on the SVM weights. |
| GC 3 | Multivariate Non-linear | Reduction in Gini impurity integrated over all trees in a random forest. |
| t-test | Univariate Linear | Voxelwise statistical test for mean difference between the classes. |
1 Survival Count on Random Subsamples
2 Recursive Feature Elimination
3 Gini Contrast in Random Forests
C. Nested Cross-Validation
We used a nested cross-validation loop for parameter optimisation (i.e. in order to avoid using test data in any parameter tuning). In this framework, a pair of subjects is left out in the outer loop for test while the inner loop is used to find the parameter value that results in the highest classification accuracy.
Figure 2 illustrates the nested cross-validation procedure. The different FS approaches, described in sections II-A1, II-A2, II-A3 and II-B3) are placed in the gray shaded rectangles represented in the figure. For each method we optimised a specific parameter, varying it within an appropriate range of values, as follows:
Fig. 2.
Representation of the nested cross-validation approach to parameters tuning. The parameters are: threshold level in SCoRS, number of features in RFE-SVM, top rank percentage in Gini contrast and statistical significance in t-test.
SCoRS: Threshold levels (from 0.1 to 0.9 in steps of 0.1);
RFE-SVM: A range of iterations, where the number of features is determined by recursively eliminating stepsize features;
GC: A range of feature-set sizes (iteratively halving the number of features) obtained from features ranked according to their associated Gini contrast;
t-test: Significance levels (from 0.01 to 0.1 in steps of 0.01).
In some folds, more than one parameter value yielded the maximum classification accuracy. As a tiebreak criterion we took the median of the parameter values that produced the highest accuracy.
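The tiebreak rule can be expressed compactly; `pick_param` below is a hypothetical helper name for the rule described, applied to the inner-loop accuracies of one outer fold.

```python
import numpy as np

def pick_param(inner_accuracies, param_grid):
    """Among all parameter values reaching the maximum inner-loop accuracy,
    return their median (the tiebreak used in the nested CV)."""
    acc = np.asarray(inner_accuracies, float)
    best = np.asarray(param_grid, float)[acc == acc.max()]
    return float(np.median(best))
```

Note that with an even number of tied values the median may fall between grid points; the same is true of the rule as stated in the text.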
D. Overlap of Selected Features Across Cross-Validation Folds
In order to evaluate the variability across CV folds we computed a measure of overlap across the sets of features selected in each CV fold and applied this measure to each FS method. Our implementation was based on an overlap measure proposed by [35]. As we have a leave-one-out cross-validation with F folds, we averaged the pairwise overlaps among the folds according to equation 4, where Si (Sj) is the subset of features selected in fold i (j), F is the number of folds, |Si| is the number of non-zero features in the subset Si, and E is a factor that corrects for the fact that, for a given model, the expected overlap of non-zero features increases as sparsity decreases. The heuristic given by equation 5 (also based on [35]) was used to calculate this correction, where P is the total number of features.
$$O = \frac{2}{F(F-1)} \sum_{i=1}^{F-1}\sum_{j=i+1}^{F} \frac{|S_i \cap S_j| - E_{ij}}{\sqrt{|S_i|\,|S_j|} - E_{ij}} \tag{4}$$

$$E_{ij} = \frac{|S_i|\,|S_j|}{P} \tag{5}$$
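For illustration, a corrected average pairwise overlap can be computed as follows. This is a sketch under the assumption that the chance-expected overlap is E = |Si||Sj|/P (one plausible reading of the heuristic described in the text); with this choice, identical subsets score 1.

```python
import numpy as np
from itertools import combinations

def corrected_overlap(subsets, P):
    """Average pairwise overlap of selected-feature sets across CV folds,
    corrected by the overlap E expected by chance (assumed E = |Si||Sj|/P)."""
    vals = []
    for Si, Sj in combinations(subsets, 2):
        Si, Sj = set(Si), set(Sj)
        E = len(Si) * len(Sj) / P
        denom = np.sqrt(len(Si) * len(Sj)) - E
        if denom <= 0:          # degenerate case: subsets near full size
            continue
        vals.append((len(Si & Sj) - E) / denom)
    return float(np.mean(vals))
```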
E. False Selection Estimation
An important issue related to the interpretation of the selected features is how to control the number of features falsely selected. In the current work we proposed an empirical test to estimate the rate of false positive selection according to the following procedure:
I) Obtain the set of features S composed of the union of the features selected in at least half of the CV folds;
II) Obtain F, the complement of S;
III) Permute the examples for all features f ∈ F (as illustrated in figure 3);
IV) Using the data matrix updated with features permuted across the examples, run SCoRS again (using the same nested-CV framework);
V) Compute how many features in F are selected (this number corresponds to the estimation of how many features were falsely selected).
Fig. 3.
Representation of false positive evaluation. (a) Original data matrix with examples labeled as classes C1 and C2. Enhanced columns represent the features selected. (b) Examples are permuted for all features that were not selected.
The reasoning behind this evaluation is to assess what proportion of features are still selected by chance after their correlation with the labels has been destroyed by permuting the examples. It is important to emphasize that all examples belonging to the same subject (four examples in this dataset, as explained in section II-F) are kept together, i.e. not permuted among themselves, as represented in figure 3.
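The block-preserving permutation of the unselected features can be sketched as follows (assuming, as in this dataset, a fixed number of consecutive examples per subject; `permute_unselected` is a hypothetical helper name).

```python
import numpy as np

def permute_unselected(X, selected, n_per_subject, rng=None):
    """For the false-selection test: shuffle examples across subjects for all
    *unselected* features, keeping each subject's examples (blocks of
    n_per_subject consecutive rows) together; selected features are untouched."""
    if rng is None:
        rng = np.random.default_rng()
    n_subj = X.shape[0] // n_per_subject
    order = rng.permutation(n_subj)
    # row order that moves whole subject blocks
    rows = np.concatenate([np.arange(s * n_per_subject, (s + 1) * n_per_subject)
                           for s in order])
    Xp = X.copy()
    unselected = np.setdiff1d(np.arange(X.shape[1]), selected)
    Xp[:, unselected] = X[np.ix_(rows, unselected)]
    return Xp
```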
F. Data Description
We used a dataset from a previous study, which we briefly describe below. A more detailed description of the data and the context of the study can be found in [36].
1) Participants
A total of 30 psychiatric in-patients from the University Hospital of Psychiatry, Psychosomatics and Psychotherapy (Wuerzburg, Germany) participated in this study; they were diagnosed with recurrent depressive disorder, depressive episodes, or bipolar affective disorder based on the consensus of two trained psychiatrists according to ICD-10 criteria (DSM-IV codes 296.xx). Self-report scores on the German version of the Beck Depression Inventory (Second Edition) ranged from 2 to 42 (mean [SD] score, 19.0 [9.4]).
Exclusion criteria were age below 18 or above 60 years, comorbidity with other currently present Axis I disorders, mental retardation or mood disorder secondary to substance abuse, medical conditions as well as severe somatic or neurological diseases. Patients suffering from bipolar affective disorder were in a depressed phase or recovering from a recent one with none showing manic symptoms. All patients were taking standard antidepressant medications, consisting of selective serotonin reuptake inhibitors (n=14), tricyclic antidepressants (n=14), tetracyclic antidepressants (n=8), or serotonin and noradrenalin selective reuptake inhibitors (n=8).
Thirty comparison subjects from a pool of 94 participants previously recruited by advertisement from the local community were selected to match the patient group in regard to gender, age, smoking, and handedness, using the optimal matching algorithm implemented in the MatchIt package for R (http://www.r-project.org/). In order to exclude potential Axis I disorders, the German version of the Structured Clinical Interview for DSM-IV (SCID) Screening Questionnaire was conducted. Additionally, none of the control subjects showed pathological Beck Depression Inventory (BDI-II) scores (mean = 4.3, SD = 4.6).
From all 60 participants, written informed consent was obtained after complete description of the study to the subjects. The study was approved by the Ethics Committee of the University of Wuerzburg, and all procedures involved were in accordance with the latest version (fifth revision) of the Declaration of Helsinki.
2) Tasks and Procedures
The paradigm consisted of passively viewing emotional faces. A blocked design was used, with each block containing faces from eight individuals (four female, four male) that were taken from the Karolinska Directed Emotional Faces database. Every block was repeated four times in a random mode. Each face was shown against a black background for two seconds and was directly followed by the next face. Thus, each block had a duration of 16 seconds. Face blocks were alternated with blocks of the same length showing a white fixation cross on which the participant had to focus. Subjects were instructed to attend to the faces and empathize with the emotional expression.
3) fMRI Acquisition
Imaging was performed using a 1.5 T Siemens Magnetom Avanto TIM-system MRI scanner (Siemens, Erlangen, Germany) equipped with a standard 12-channel head coil. In a single session, twenty-four 4-mm-thick, interleaved axial slices (in-plane resolution: 3.28 × 3.28 mm) oriented at the AC-PC transverse plane were acquired with a 1 mm inter-slice gap, using a T2*-sensitive single-shot EPI sequence with the following parameters: repetition time (TR; 2000 ms), echo time (TE; 40 ms), flip angle (90°), matrix (64×64), and field of view (FOV; 210×210 mm²). The first six volumes were discarded to account for magnetization saturation effects. Stimuli were presented via MRI-compatible goggles (VisuaStim; Magnetic Resonance Technologies, Northridge, CA, USA).
4) Preprocessing
Data were pre-processed using the Statistical Parametric Mapping software (SPM5, Wellcome Department of Cognitive Neurology, UK). Slice-timing correction was applied, and images were realigned, spatially normalized and smoothed using an 8 mm FWHM Gaussian isotropic kernel. Before running the FS methods, additional pre-processing was performed using custom-built Matlab routines: a mask was applied to each volume (scan) in order to select the voxels containing brain tissue in all subjects; then, for each subject, all voxels inside the mask were linearly detrended. Before selecting the examples (i.e. the BOLD signal images corresponding to the times at which the stimuli were presented), the scans were shifted to accommodate the hemodynamic response delay according to the expression hrf_delay = 3/(TR/1000), where TR is the time between consecutive excitation pulses (in milliseconds). The Matlab operator floor was used to round this value down to the nearest integer. Within each block, individual scans were averaged to increase the signal-to-noise ratio, i.e. a temporal compression as proposed by [37] was applied. The resulting data matrix was therefore composed of 219727 features (voxels) and 240 examples (2 groups, 30 subjects per group, 4 blocks per subject).
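As a concrete illustration of this shift-and-compress step, a minimal Python sketch follows (the function name, argument layout and use of NumPy are ours, not the authors' Matlab routines; block onsets and lengths are assumed to be given in scans):

```python
import numpy as np

def hrf_shift_and_compress(ts, onsets, block_len, tr_ms, hrf_sec=3.0):
    """Shift block onsets by the HRF delay (in scans) and average
    the scans within each block (temporal compression).

    ts        : (n_scans, n_voxels) detrended time series
    onsets    : block onset indices, in scans
    block_len : block duration, in scans
    tr_ms     : repetition time in milliseconds
    hrf_sec   : assumed hemodynamic delay in seconds (3 in the text)
    """
    # delay in scans = 3/(TR/1000), rounded down (Matlab's `floor`)
    delay = int(np.floor(hrf_sec / (tr_ms / 1000.0)))
    examples = []
    for onset in onsets:
        start = onset + delay
        block = ts[start:start + block_len]
        examples.append(block.mean(axis=0))  # one averaged example per block
    return np.vstack(examples)
```

With TR = 2000 ms this gives a delay of floor(1.5) = 1 scan, and each block of scans collapses into a single averaged example, matching the 240 examples (60 subjects × 4 blocks) described above.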
In our regression model the predictors (independent variables) are the fMRI values from the subset of features randomly selected in each example (note that the examples are also randomly selected from the training set of examples). The response (dependent variable) is the categorical label associated with the example (i.e., 1 for patients and −1 for healthy controls).
Importantly, this study explicitly recruited patients who were on a variety of medications and who, at the time of measurement, presented degrees of depressive symptoms ranging from severe to almost symptom-free. We used data from this well-diagnosed but heterogeneous group of patients in order to obtain a more realistic estimate of the algorithm's potential utility in real clinical applications.
III. Results
In this section we compare the results of the proposed approach (SCoRS) with previously proposed FS approaches for neuroimaging applications (RFE-SVM, Gini contrast and t-test). The methods were applied to the same data set (described in section II-F) and evaluated with respect to classification accuracy, overlap of the selected features across CV folds, estimated false selection rate, and spatial mapping.
A. Impact of the Threshold Level on the Accuracy
In Table II we present results from SVM classification based on features selected by SCoRS without threshold optimization (i.e. with only the outer loop of the cross-validation). These values are presented in order to demonstrate the impact of the threshold level on the accuracy.
TABLE II.
SCoRS threshold evaluation. The first column presents the range of thresholds considered (from 0.1 to 0.9). T = 0.9, for example, means that only features selected by the LASSO in more than 90% of the subsamples in which they were randomly drawn were used in the classification.
| | N feat | Specificity | Sensitivity | Accuracy |
|---|---|---|---|---|
| No FS | 219727 | 0.63 | 0.70 | 0.67 |
| No threshold | 210922 | 0.63 | 0.70 | 0.67 |
| T = 0.1 | 98738 | 0.67 | 0.73 | 0.70 |
| T = 0.2 | 51094 | 0.63 | 0.73 | 0.68 |
| T = 0.3 | 29958 | 0.63 | 0.73 | 0.68 |
| T = 0.4 | 18046 | 0.63 | 0.80 | 0.72 |
| T = 0.5 | 10704 | 0.67 | 0.80 | 0.74 |
| T = 0.6 | 6170 | 0.67 | 0.77 | 0.72 |
| T = 0.7 | 3265 | 0.67 | 0.77 | 0.72 |
| T = 0.8 | 1473 | 0.67 | 0.73 | 0.70 |
| T = 0.9 | 461 | 0.67 | 0.70 | 0.68 |
The first row of Table II presents the results of classification without FS (using all voxels within the brain). The accuracy is the average of the sensitivity (proportion of patients correctly classified) and the specificity (proportion of individuals in the healthy control group correctly classified). The second row shows results obtained using SCoRS without applying a threshold (i.e. considering all voxels selected at least once by the LASSO). The remaining rows show results obtained after applying different threshold levels. Note, however, that a nested cross-validation is necessary in order to optimize the threshold, as described in section II-C.
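To make the two quantities in Table II concrete, thresholding of the survival fractions and the accuracy measure used here (the average of sensitivity and specificity) can be sketched as follows (a minimal illustration; variable and function names are ours):

```python
import numpy as np

def select_by_threshold(survival_frac, T):
    # Keep features selected by the LASSO in more than a fraction T of
    # the subsamples in which they were randomly drawn (rows of Table II).
    return np.where(survival_frac > T)[0]

def balanced_accuracy(y_true, y_pred):
    # Accuracy as reported here: average of sensitivity (patients,
    # labelled +1, correctly classified) and specificity (controls, -1).
    sens = np.mean(y_pred[y_true == 1] == 1)
    spec = np.mean(y_pred[y_true == -1] == -1)
    return 0.5 * (sens + spec)
```

For example, with sensitivity 0.80 and specificity 0.67 (the T = 0.5 row), the reported accuracy is their mean, approximately 0.74.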
B. Parameter Optimization
Figure 4 shows SVM accuracies obtained for different parameter values using a nested cross-validation framework with features selected by SCoRS (a), RFE-SVM (b), Gini contrast (c) and t-test (d). The colour scales represent the mean accuracy across the inner loops. The columns correspond to different cross-validation folds and each row represents one of the parameter values in the ranges considered for each method (described in sections II-B3, II-A1, II-A2 and II-A3, respectively).
Fig. 4.
Classification accuracy tables. The colors represent classification accuracies obtained in the cross-validation inner loops. In each table, rows represent values in the specific ranges used in each method for optimising the number of features. Columns represent the folds.
Figure 4 shows that in some folds the highest accuracy was obtained for more than one parameter value. In these cases we used the median of these values as the optimal parameter (as described in section II-C). The optimal values chosen (i.e. those used in the outer loop) are marked in the figures with crosses.
Figure 5 presents bar graphs showing the number of features selected in each fold for SCoRS (a), RFE-SVM (b), Gini contrast (c) and t-test (d).
Fig. 5.
Number of features selected in each fold.
Table III summarizes the performance of the different FS methods. All values (specificity, sensitivity and accuracy) are averages across folds.
TABLE III.
Comparison of different feature selection methods
| Method | Specificity | Sensitivity | Accuracy |
|---|---|---|---|
| Whole brain | 0.63 | 0.70 | 0.67 |
| SCoRS | 0.67 | 0.77 | 0.72 |
| RFE | 0.60 | 0.73 | 0.67 |
| GC | 0.60 | 0.67 | 0.63 |
| t-test | 0.67 | 0.57 | 0.62 |
C. Overlap of Selected Features Across CV Folds
Results presented in figures 4 and 5 show that there was high variability in the number of features selected across cross-validation folds, even though the training sets of the different folds were highly similar (in each fold, only a different pair of subjects was left out for testing).
We applied the procedure described in section II-D to compute an overlap score for each method. For SCoRS, RFE-SVM, Gini contrast and t-test, results were respectively 0.64, 0.70, 0.37 and 0.71.
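Since the procedure of section II-D is not reproduced in this excerpt, the following is only a simple, uncorrected proxy for such an overlap score (the function name and the intersection-over-smaller-set choice are ours):

```python
import numpy as np
from itertools import combinations

def mean_pairwise_overlap(fold_sets):
    # Mean pairwise overlap of the selected-feature sets across CV
    # folds, measured as intersection size over the smaller set.
    # The score used in the paper (section II-D) additionally corrects
    # for the overlap expected by chance at a given sparsity level.
    scores = [len(a & b) / min(len(a), len(b))
              for a, b in combinations(fold_sets, 2)]
    return float(np.mean(scores))
```

Each `fold_sets` entry would hold the voxel indices selected in one fold; a score near 1 indicates that the folds repeatedly select the same features.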
D. False Selection Estimation
We applied the procedure described in section II-E to estimate the rate of false selections. For SCoRS, RFE-SVM and t-test, results were respectively 0.06, 0.14 and 0.03.
We did not compute this measure for features selected using Gini contrast due to the high computational cost associated with running the Random Forests approach in this framework.
Table IV summarizes all quantitative results (averaged across folds). NF is the number of features selected, Acc is the classification accuracy, O is the overlap of features selected across folds and FSR is the false selection rate.
TABLE IV.
Summarizing final results for all methods.
| Method | NF | Acc | O | FSR |
|---|---|---|---|---|
| No FS | 219727 | 0.67 | – | – |
| SCoRS | 8670 | 0.72 | 0.64 | 0.06 |
| RFE | 15569 | 0.67 | 0.70 | 0.14 |
| GC | 22848 | 0.63 | 0.37 | – |
| t-test | 19044 | 0.62 | 0.71 | 0.03 |
E. Spatial Mapping
Figure 6 displays the sets of features selected by SCoRS (a), RFE-SVM (b), Gini contrast (c) and t-test (d). For each approach the selected features were overlaid on an anatomical template. In order to make the maps corresponding to different methods comparable, the colour attributed to each voxel represents how frequently that voxel was selected across the cross-validation folds. The colour scale varies from dark red (features selected in a single fold) to light yellow (features selected in all 30 folds).
Fig. 6.
Features selected by each method: (a) SCoRS, (b) RFE-SVM, (c) Gini contrast, (d) t-test. The colour attributed to each feature represents how frequently it was selected across the cross-validation folds.
In order to quantify the similarities among the features selected by the different FS methods in terms of spatial localization, we display an overlap map of the features selected by SCoRS, RFE-SVM and t-test (Figure 7). Gini contrast was not included because its selected features were noisy and did not form well-defined clusters from which to generate a table (Figure 6c); in fact, most of the selected voxels (99%) were included in a single cluster.
Fig. 7.
Overlap of features selected across the methods. The colours represent how many methods selected a particular feature.
In Figure 7 the colour scale varies from 1 to 3 (3 means that the feature was selected by all 3 FS methods considered, 2 that it was selected by 2 of them, and 1 that it was selected by 1 FS method only). Figure 7 shows a high overlap among the features selected by the different approaches. Interestingly, the complete overlap (voxels selected by all methods, displayed in white) consists of large clusters concentrated in specific regions.
In Tables V–VII we present the 25 most important brain regions discriminating the groups, found using the SCoRS, RFE-SVM and t-test approaches, respectively. The regions were ranked and listed according to cluster extent. Each table presents the names of the anatomical regions in which the clusters' peaks are located, the Talairach coordinates of the peaks (x, y and z), and the corresponding Brodmann area (BA).
TABLE V.
Brain regions important to discriminate the groups found by SCoRS
| Brain region | x | y | z | BA |
|---|---|---|---|---|
| Middle Temporal Gyrus | −34 | 2 | −40 | 38 |
| Middle Frontal Gyrus | −42 | 44 | 14 | 10 |
| Cerebellum - Declive | −14 | −82 | −20 | * |
| Basal Ganglia (Striatum) | 16 | 6 | −8 | * |
| Cerebellar Tonsil | 8 | −32 | −44 | * |
| Inferior Frontal Gyrus | 48 | 10 | 30 | 9 |
| Middle Frontal Gyrus | −26 | 14 | 34 | 8 |
| Anterior Cingulate | 24 | 36 | 14 | 32 |
| Orbitofrontal cortex | −38 | 46 | −14 | 11 |
| Visual Cortex (Cuneus) | −18 | −96 | 0 | 17 |
| Basal Ganglia (Caudate) | −2 | 2 | 14 | * |
| Fusiform Gyrus | −50 | −40 | −4 | 37 |
| Visual Cortex (Cuneus) | 8 | −84 | 4 | 17 |
| Orbitofrontal cortex | 2 | 42 | −18 | 11 |
| Inferior Frontal Gyrus | −56 | 22 | −10 | 47 |
| Superior Frontal Gyrus | −16 | 48 | 32 | 9 |
| Inferior Temporal Gyrus | 66 | −20 | −24 | 20 |
| Precentral Gyrus | −46 | 2 | 34 | 6 |
| Middle Temporal Gyrus | 46 | −46 | 12 | 22 |
| Superior Frontal Gyrus | 4 | 6 | 64 | 6 |
| Cerebellum (Culmen) | 6 | −38 | 0 | * |
| Inferior Frontal Gyrus | 36 | 30 | −14 | 47 |
| Posterior Cingulate | −2 | −52 | 14 | 29 |
| Cingulate Gyrus | 18 | −6 | 40 | 24 |
| Visual Cortex Cuneus | −18 | −104 | 0 | 18 |
TABLE VI.
Regions important to discriminate the groups found by RFE
| Brain region | X | Y | Z | BA |
|---|---|---|---|---|
| Fusiform Gyrus | 30 | −92 | −20 | 18 |
| Superior Temporal Gyrus | 30 | 16 | −40 | 38 |
| Cerebellum (Culmen) | −4 | −46 | −2 | * |
| Parahippocampal gyrus (Uncus) | −30 | 0 | −42 | 20 |
| Orbitofrontal cortex | −38 | 48 | −20 | 11 |
| Superior Temporal Gyrus | −54 | 18 | −10 | 38 |
| Middle Frontal Gyrus | 50 | 20 | 28 | 9 |
| Cerebellum (Tuber) | 46 | −58 | −28 | * |
| Superior Frontal Gyrus | −2 | 30 | 56 | 6 |
| Cerebellar Tonsil | −42 | −48 | −36 | * |
| Superior Frontal Gyrus | −16 | 48 | 32 | 9 |
| Superior Frontal Gyrus | −20 | 64 | 6 | 10 |
| Postcentral Gyrus | −48 | −28 | 60 | 1 |
| Superior Temporal Gyrus | 66 | −26 | 8 | 42 |
| Supramarginal Gyrus | −52 | −42 | 34 | 40 |
| Orbitofrontal cortex | 30 | 34 | −22 | 11 |
| Cuneus | 20 | −74 | 32 | 7 |
| Superior Frontal Gyrus | −22 | 0 | 70 | 6 |
| Anterior Cingulate | 4 | 36 | −2 | 24 |
| Precuneus | 4 | −78 | 48 | 7 |
| Supramarginal Gyrus | 36 | −50 | 30 | 40 |
| Middle Temporal Gyrus | 50 | 6 | −36 | 21 |
| Superior Frontal Gyrus | −42 | 36 | 30 | 9 |
| Inferior Frontal Gyrus | −56 | 24 | 12 | 45 |
| Orbital Gyrus | 2 | 38 | −30 | 11 |
TABLE VII.
Brain regions important to discriminate the groups found by t-test
| Brain region | X | Y | Z | BA |
|---|---|---|---|---|
| Inferior Temporal Gyrus | 36 | −2 | −40 | 20 |
| Precuneus | −42 | −74 | 34 | 19 |
| Orbitofrontal Cortex | −12 | 56 | −16 | 11 |
| Superior Frontal Gyrus | −18 | 10 | 70 | 6 |
| Visual Cortex (Cuneus) | 16 | −104 | −2 | 18 |
| Postcentral Gyrus | −58 | −24 | 48 | 2 |
| Cerebellum (Tonsil) | 28 | −32 | −32 | * |
| Orbitofrontal Cortex | −38 | 44 | −18 | 11 |
| Cerebellum (Uvula) | −20 | −88 | −26 | * |
| Cerebellum (Dentate) | −14 | −56 | −20 | * |
| Inferior Temporal Gyrus | 50 | −54 | −6 | 37 |
| Visual Cortex (Cuneus) | −2 | −86 | 10 | 18 |
| Inferior Frontal Gyrus | −54 | 24 | 10 | 45 |
| Cerebellum Pyramis | 8 | −86 | −34 | * |
| Insula | 28 | −38 | 22 | 13 |
| Visual Cortex (Cuneus) | −18 | −98 | 0 | 18 |
| Cerebellum (Declive) | −16 | −80 | −16 | * |
| Precentral Gyrus | −46 | 2 | 34 | 6 |
| Cerebellum Inf Semi-Lunar Lobule | 32 | −80 | −40 | * |
| Cerebellum (Tuber) | −48 | −66 | −26 | * |
| Fusiform Gyrus | −38 | −68 | −14 | 19 |
| Cerebellum (Tuber) | −52 | −48 | −24 | * |
| Inferior Frontal Gyrus | −48 | 36 | 2 | 45 |
| Inferior Frontal Gyrus | 48 | 8 | 32 | 9 |
| Middle Frontal Gyrus | −58 | 14 | 36 | 9 |
Table VIII displays the clusters extracted from the overlap map (Fig. 7), including the common features selected by t-test, RFE-SVM and SCoRS. For the reasons previously explained, we did not include a table describing the most important regions according to Gini contrast. However, in order to evaluate the overlap of peaks across the four methods, we extracted the voxels selected by Gini contrast in all cross-validation folds [i.e., the peaks in Fig. 6(c)]. A careful inspection of these voxels and the clusters' peaks described in Tables V–VII reveals a coincidence of peaks across the four methods in important regions. In particular, the peaks in the inferior/middle temporal gyrus (BA 20), cerebellum, and orbitofrontal cortex (BA 11) were common to all four methods (SCoRS, RFE-SVM, t-test and Gini contrast).
TABLE VIII.
Overlap of brain regions important to discriminate the groups found by SCoRS, RFE-SVM and t-test.
| Brain region | X | Y | Z | BA |
|---|---|---|---|---|
| Cerebellum (Tonsil) | 0 | −48 | −38 | * |
| Precentral Gyrus | −58 | 10 | 4 | 44 |
| Inferior Frontal Gyrus | −20 | 24 | −4 | 47 |
| Middle Temporal Gyrus | −48 | −64 | 22 | 39 |
| Precentral Gyrus | −30 | −14 | 68 | 6 |
| Supramarginal Gyrus | −54 | −44 | 34 | 40 |
| Precentral Gyrus | −46 | 0 | 30 | 6 |
| Orbitofrontal cortex | 30 | 34 | −22 | 11 |
| Inferior Frontal Gyrus | −54 | 22 | 10 | 45 |
| Precentral Gyrus | −46 | −10 | 40 | 4 |
| Cerebellum (Tonsil) | 40 | −44 | −46 | * |
| Superior Parietal Lobule | 28 | −66 | 60 | 7 |
| Middle Frontal Gyrus | 36 | 48 | −8 | 10 |
| Inferior Parietal Lobule | 40 | −54 | 58 | 40 |
| Middle Temporal Gyrus | −58 | 6 | −26 | 21 |
| Inferior Temporal Gyrus | 60 | −8 | −34 | 20 |
| Posterior Cingulate | 18 | −34 | 28 | 31 |
| Postcentral Gyrus | 32 | −32 | 44 | 2 |
| Visual Cortex | 30 | −78 | 12 | 19 |
| Orbitofrontal cortex | 8 | 46 | −26 | 11 |
| Inferior Parietal Lobule | 36 | −42 | 42 | 40 |
| Orbitofrontal Cortex | −6 | 52 | −16 | 11 |
IV. Discussion
In the present paper we proposed SCoRS as a new FS method and demonstrated its potential through a challenging application: classifying depressed patients and healthy controls based on fMRI data. Classification based on the set of features selected by SCoRS presented higher accuracy than both whole-brain classification and the other FS methods compared. The improvement in accuracy was obtained with a significant reduction in the number of features, producing maps that are more easily interpretable.
Feature selection and mapping in neuroimaging-based multivariate analysis is a challenging problem, especially in exploratory applications involving the whole brain, without any prior hypothesis regarding the brain regions potentially involved.
Classification based on fMRI can be applied to two different problems: discriminating tasks and discriminating groups. In task classification, scans from different cognitive states are extracted from the time-series and the objective is to predict which task the subject was performing (also known as mind-reading); although it is possible to use data from a single subject or from a group of subjects, classification is performed with respect to the tasks. For group discrimination in fMRI, a single task is usually considered and the examples correspond to different subjects. The objective is to classify subjects into groups according to their activity patterns (e.g., patients vs. healthy controls, responders vs. non-responders to a specific treatment, or sub-groups of patients). Classifying groups is usually much more challenging than classifying tasks, as it is based on the assumption that a particular stimulus or task will evoke a different activity pattern in each of the groups; in other words, it usually relies on a more subtle distinction than discriminating different tasks. These difficulties add to the complexity of feature selection as well.
In addition, when working with multiple subjects it is necessary to normalise the images into a common space, which can increase the number of features due to oversampling in the pre-processing steps, thus contributing to the curse of dimensionality. Moreover, another issue that makes the classification task more challenging in the current application is the heterogeneity of the data: the study covers patients with recurrent depressive disorder, depressive episodes, and bipolar affective disorder. A heterogeneous group of patients had been explicitly recruited in a previous study [36] in order to evaluate the ability of classification methods to deal with such diversity. This might contribute to the high variability across cross-validation folds observed for all methods compared.
Although various feature selection methods have been proposed and applied to neuroimaging data, the stability of the selected features has usually not been properly addressed. One of the pioneering studies addressing the importance of reproducibility and of preserving local correlation in feature selection was [3]. In that study the authors proposed a method based on Gini importance and Random Forest classifiers to distinguish complex visual tasks. They presented interesting results in terms of accuracy improvement and reproducibility relative to RFE-SVM, and raised a very interesting discussion of these issues. One limitation of that study was that the analysis was restricted to a specific brain area.
Another study that addressed the issue of reproducibility was [35]. In this study the authors focused on the relative influence of the model regularization parameter choices on both the model generalization and the reliability of the patterns identified by the models.
The method we proposed is based on Stability Selection theory [21]. The framework is general and can be applied to different modalities and experimental designs. Although SCoRS selects subsets of voxels in each iteration, the high number of repetitions with different randomizations makes it globally multivariate. The random sub-sampling gives each feature the opportunity to be selected in different configurations. Consequently, SCoRS tends to select the features with the highest predictive power, given that they survive after taking part in different combinations of random subspaces and sub-samples of examples. It is interesting to observe that even though there is no spatial constraint in SCoRS, the selected voxels are grouped in clusters. This may be due to physiological properties of the data (i.e. the brain is organized in regions) and to pre-processing steps (i.e. spatial smoothing). In any case, the fact that SCoRS finds clusters of voxels is good evidence that the method is finding features that are truly relevant for the prediction: since neighbouring voxels in the brain are correlated, we expect them to share predictive information.
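The procedure described above (repeated random sub-sampling of features and examples, sparse model fitting, and survival counting) can be sketched in a minimal form. This is an illustrative Python implementation with placeholder subsample sizes and penalty; the small coordinate-descent LASSO stands in for whatever sparse solver is used in practice:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Minimal LASSO via cyclic coordinate descent (illustrative only)."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            r = y - X @ w + X[:, j] * w[j]          # partial residual
            rho = X[:, j] @ r
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0) / col_sq[j]
    return w

def scors(X, y, n_subsamples=200, feat_frac=0.1, ex_frac=0.8,
          lam=0.1, rng=None):
    """Survival Count on Random Subsamples (sketch).

    Repeatedly subsample features (subspaces) and examples, fit a
    sparse (LASSO) model on the subsample, and count how often each
    feature survives (receives a non-zero weight) relative to how
    often it was drawn.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    survived = np.zeros(p)
    drawn = np.zeros(p)
    for _ in range(n_subsamples):
        feats = rng.choice(p, size=max(1, int(feat_frac * p)), replace=False)
        rows = rng.choice(n, size=max(1, int(ex_frac * n)), replace=False)
        drawn[feats] += 1
        w = lasso_cd(X[np.ix_(rows, feats)], y[rows], lam)
        survived[feats[w != 0]] += 1
    # survival fraction per feature; threshold it (e.g. frac > 0.5,
    # optimized by nested CV as in section II-C) to get the final set
    return np.divide(survived, drawn, out=np.zeros(p), where=drawn > 0)
```

Truly predictive features tend to survive regardless of which other features and examples they are paired with, so their survival fraction approaches 1, while noise features survive only sporadically.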
In the present study, feature selection using SCoRS resulted in an improvement in accuracy of 5 percentage points relative to whole-brain classification (from 67% to 72%), while using around 4% of the total number of features on average across cross-validation folds. The improvement was consistent for both specificity (rate of controls correctly classified) and sensitivity (rate of patients correctly classified), which increased from 63% to 67% and from 70% to 77%, respectively. SCoRS also presented higher accuracy than the other FS methods compared, despite their differences in sparsity.
Although FS is commonly applied with the aim of increasing accuracy, interpretability has become an increasingly important matter for both classification and regression models in neuroimaging, as introduced in section I. In clinical research, as well as in neuroscience, it is very important to be able to anatomically localize the most relevant features. Therefore, the development of methods that provide solutions that are both easy to interpret in terms of anatomical locations and stable is of major importance.
In the present paper, besides the comparison in terms of accuracy, we also evaluated the results of the FS methods using additional measures characterising the overlap across folds and the estimated rate of false selections. With respect to the overlap of selected features across cross-validation folds, SCoRS presented less variability than Gini contrast, but more variability than RFE-SVM and t-test. It should be noted, however, that less sparse methods tend to have higher overlap across folds. Even though we used an approach to compensate for the fact that the expected overlap of non-zero features increases as sparsity decreases, this correction still relies on heuristic assumptions.
In applications using highly heterogeneous data (as in the current paper), some of the features considered relevant in a specific fold might not be relevant in another fold, so a low overlap of selected features across folds can occur. In these cases, an estimate of the false selection rate (FSR) can provide additional information about the level of confidence in the selected features. The smaller overlap of selected features across folds obtained with SCoRS compared to RFE-SVM and t-test suggests that this approach might be more sensitive to the heterogeneity of the data. Interestingly, when comparing the estimated FSR across the feature selection approaches, SCoRS presented a lower estimate than RFE-SVM, which suggests that some of the features selected by RFE-SVM may overlap across folds by chance. Comparing SCoRS and the t-test, the latter presented a lower FSR estimate, which would be expected since the t-test is a statistical test developed to find relevant features at a specific significance level or p-value. However, the t-test is a univariate approach, i.e., it does not take into account spatial correlations among voxels, and might therefore miss features that are relevant only when operating together. The lower accuracy of the classification based on t-test selection (relative to all FS approaches considered) supports this hypothesis.
Based on visual inspection of the maps produced by all FS methods, SCoRS produced small and well-defined clusters, RFE-SVM produced results slightly less sparse than SCoRS, and the t-test produced larger clusters. Although Gini contrast produced noisier maps, groups of clusters similar to those of the other FS approaches can be observed. It is important to note that Gini contrast could potentially produce better results if a higher number of trees were used. However, given the dimensionality of the problem addressed in the present work, increasing the number of trees was not computationally feasible: with the parameters used in the present study, the computational cost of Gini contrast was already around 30 times higher than that of SCoRS. Hence, our results suggest that Gini contrast might be more appropriate for feature selection and mapping in combination with prior heuristics that limit the number of input features (as implemented in [3]).
The anatomical locations of the selected features are described in Tables V, VI and VII for the SCoRS, RFE-SVM and t-test approaches, respectively. For each approach, the tables list the twenty-five most important brain regions according to cluster extent. We did not generate a list of important regions for Gini contrast because most of the selected voxels were included in a single cluster.
Additionally, in Table VIII we present the overlap of twenty-two brain regions selected by the SCoRS, RFE-SVM and t-test approaches. These results are interesting because several brain regions described in Table VIII are consistently implicated in major depression and bipolar disorders. Specifically, we found the orbitofrontal cortex (BA 11), middle frontal gyrus (BA 10), visual cortex (BA 19), posterior cingulate and cerebellum, brain regions associated with these psychiatric diseases (see [38]; [39]; [40]; [41]; [42]; [43]; [44]; [45]). It is also interesting to note that more overlapping brain regions were found between SCoRS and RFE-SVM. In fact, discriminative voxels in the anterior cingulate cortex (BA 24 and 32) were found only using SCoRS or RFE-SVM, but not using the t-test. This region is considered a key node of a brain network linked to depressive states (e.g. [43]; [44]). Furthermore, only with the SCoRS approach did we find the basal ganglia listed among the 25 most important regions discriminating the groups. The basal ganglia, especially the striatum, have been considered part of several neuroanatomic circuits involved in mood regulation in depression and bipolar disorders ([39]; [40]).
In summary, inspection of the frequency maps indicated that all the methods were able to identify the main regions involved in major depression and bipolar disorders. Furthermore, SCoRS appears to be more efficient at identifying core brain regions associated with these psychiatric diseases, such as the anterior cingulate and basal ganglia.
As future work, we intend to explore SCoRS more thoroughly as a mapping approach enabling inferences from the selected features. In addition, we intend to apply the proposed framework to decode continuous variables (regression analysis) as well as to different imaging modalities and/or data sources.
Acknowledgments
JMM was supported by a Wellcome Trust Career Development Fellowship under grant no. WT086565/Z/08/Z. JMR was supported by the Wellcome Trust (grant no. WT086565/Z/08/Z) and by Capes (Coordination for the Improvement of Higher Level Personnel), Brazil (grant no. 3883/11-6). LO was supported by Capes (Coordination for the Improvement of Higher Level Personnel), Brazil. AM gratefully acknowledges support from the Kings College Annual Fund and the Kings College London Centre of Excellence in Medical Engineering, funded by the Wellcome Trust and EPSRC under grant no. WT088641/Z/09/Z. JMR and JST acknowledge support from Pascal2 Thematic Programme on Cognitive Inference and Neuroimaging.
The authors would like to thank Dr. G. Langs, who kindly provided helpful advice on Gini Contrast settings. The authors would also like to thank Dr. F. Breuer and Dr. M. Blaimer from the Research Center Magnetic-Resonance-Bavaria for their technical support.
References
- [1].Kloppel S, Abdulkadir A, C. J, Jr, Koutsouleris N, Mouro-Miranda J, Vemuri P. Diagnostic neuroimaging across diseases. Neuroimage. 2012;61(no. 2):457–63. doi: 10.1016/j.neuroimage.2011.11.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [2].Sabuncu M, Leemput K. The relevance voxel machine (rvoxm): A self-tuning bayesian model for informative image-based prediction. IEEE Transactions on Medical Imaging. 2012;31:2290–2306. doi: 10.1109/TMI.2012.2216543. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [3].Langs G, Menze B, Lashkari D, Golland P. Detecting stable distributed patterns of brain activation using gini contrast. Neuroimage. 2011;56:497–507. doi: 10.1016/j.neuroimage.2010.07.074. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [4].Wang Y. Gene selection from microarray data for cancer classificationa machine learning approach. Comput. Biol. Chem. 2005;29:37–46. doi: 10.1016/j.compbiolchem.2004.11.001. [DOI] [PubMed] [Google Scholar]
- [5].Ding C, Peng H. Minimum redundancy feature selection from microarray gene expression data. Proceedings of the IEEE Conference on Computational Systems Bioinformatics; 2003. pp. 523–528. [DOI] [PubMed] [Google Scholar]
- [6].Yeung K, Bumgarner R. Multiclass classification of microarray data with repeated measurements: application to cancer. Genome Biol. 2003;4 doi: 10.1186/gb-2003-4-12-r83. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [7].Guyon I, Elisseefi A. An introduction to variable and feature selection. Journal of Machine Learning Research. 2003;3:1157–1182. [Google Scholar]
- [8].Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics Advance Access. 2007;23(no. 19):2507–17. doi: 10.1093/bioinformatics/btm344. [DOI] [PubMed] [Google Scholar]
- [9].Salas-Gonzalez D, Grriz M, Ramrez J, Lpez M, Alvarez I, Segovia F, Chaves R, Puntonet C. Computer-aided diagnosis of alzheimers disease using support vector machines and classification trees. Phys. Med. Biol. 2010;55:28072817. doi: 10.1088/0031-9155/55/10/002. [DOI] [PubMed] [Google Scholar]
- [10].Balci S, Sabuncu J, Yoo S, Ghosh S, Whitfield-Gabrieli J, Gabrieli P, Golland P. Prediction of successful memory encoding from fmri data. Med Image Comput Comput Assist Interv. 2008;11:97–104. doi: 10.1901/jaba.2008.2008-97. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].Martino FD, Valente G, Staeren N, Ashburner J, Goebel R, Formisano E. Combining multivariate voxel selection and support vector machines for mapping and classification of fmri spatial patterns. Neuroimage. 2008;43:44–58. doi: 10.1016/j.neuroimage.2008.06.037. [DOI] [PubMed] [Google Scholar]
- [12] Fan Y, Rao H, Giannetta J, Hurt H, Wang J, Davatzikos C, Shen D. Diagnosis of brain abnormality using both structural and functional MR images. Conf Proc IEEE Eng Med Biol Soc. 2006;8:1044–1047. doi: 10.1109/IEMBS.2006.259260.
- [13] Fan Y, Shen D, Gur R, Gur R, Davatzikos C. COMPARE: classification of morphological patterns using adaptive regional elements. IEEE Trans Med Imaging. 2007;26(no. 1):93–105. doi: 10.1109/TMI.2006.886812.
- [14] Yamashita O, Sato M, Yoshioka T, Tong F, Kamitani Y. Sparse estimation automatically selects voxels relevant for the decoding of fMRI activity patterns. Neuroimage. 2008;42:1414–1429. doi: 10.1016/j.neuroimage.2008.05.050.
- [15] Hanson S, Halchenko Y. Brain reading using full brain support vector machines for object recognition: there is no face identification area. Neural Comput. 2008;20:486–503. doi: 10.1162/neco.2007.09-06-340.
- [16] Pelaez-Coca M, Bossa M, Olmos S. Discrimination of AD and normal subjects from MRI: anatomical versus statistical regions. Neurosci Lett. 2011;487:113–117. doi: 10.1016/j.neulet.2010.10.007.
- [17] Pereira F, Mitchell T, Botvinick M. Machine learning classifiers and fMRI: a tutorial overview. Neuroimage. 2009;45:S199–S209. doi: 10.1016/j.neuroimage.2008.11.007.
- [18] Brodersen K, Haiss F, Ong C, Jung F, Tittgemeyer M, Buhmann J, Weber B, Stephan K. Model-based feature construction for multivariate decoding. Neuroimage. 2011;56:601–615. doi: 10.1016/j.neuroimage.2010.04.036.
- [19] Kriegeskorte N, Goebel R, Bandettini P. Information-based functional brain mapping. Proc Natl Acad Sci USA. 2006;103:3863–3868. doi: 10.1073/pnas.0600244103.
- [20] Zhang T, Davatzikos C. ODVBA: optimally-discriminative voxel-based analysis. IEEE Transactions on Medical Imaging. 2011;30:1441–1454. doi: 10.1109/TMI.2011.2114362.
- [21] Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society, Series B. 2010;72:417–473.
- [22] Baldassarre L, Mourão-Miranda J, Pontil M. Structured sparsity models for brain decoding from fMRI data. Workshop on Pattern Recognition in NeuroImaging (PRNI); 2012.
- [23] Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Machine Learning. 2002;46:389–422.
- [24] Fan Y, Shen D, Davatzikos C. Classification of structural images via high-dimensional image warping, robust feature extraction, and SVM. Med Image Comput Comput Assist Interv. 2005;8:1–8. doi: 10.1007/11566465_1.
- [25] Funamizu A, Kanzaki R, Takahashi H. Distributed representation of tone frequency in highly decodable spatio-temporal activity in the auditory cortex. Neural Networks. 2011;24:321–322. doi: 10.1016/j.neunet.2010.12.010.
- [26] Calderoni S, Retico A, Biagi L, Tancredi R, Muratori F, Tosetti M. Female children with autism spectrum disorder: an insight from mass-univariate and pattern classification analyses. Neuroimage. 2012;59(no. 2):1013–1022. doi: 10.1016/j.neuroimage.2011.08.070.
- [27] Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- [28] Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B. 2005;67(no. 2):301–320.
- [29] Rondina J, Shawe-Taylor J, Mourão-Miranda J. A new feature selection method based on stability theory - exploring parameters space to evaluate classification accuracy in neuroimaging data. Machine Learning and Interpretation in Neuroimaging, Lecture Notes in Artificial Intelligence. 2012;7263:58–66.
- [30] Sjöstrand K. Matlab implementation of LASSO, LARS, the elastic net and SPCA. Informatics and Mathematical Modelling, Technical University of Denmark; 2005.
- [31] Boser B, Guyon I, Vapnik V. A training algorithm for optimal margin classifiers. In: Haussler D, editor. Proceedings of the Fifth Annual Workshop on Computational Learning Theory; 1992. pp. 144–152.
- [32] Vapnik V. Statistical Learning Theory. Wiley; 1998.
- [33] Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011;2(no. 3):1–27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
- [34] Lamparter D. Stability selection for error control in high-dimensional regression. Ph.D. dissertation; 2011.
- [35] Rasmussen P, Hansen L, Madsen K, Churchill N, Strother S. Model sparsity and brain pattern interpretation of classification models in neuroimaging. Pattern Recognition. 2012;45:2085–2100.
- [36] Hahn T, Marquand A, Ehlis A, Dresler T, Kittel-Schneider S, Jarczok T, Lesch K, Jakob P, Mourão-Miranda J, Brammer M, Fallgatter A. Integrating neurobiological markers of depression. Arch Gen Psychiatry. 2011;68(no. 4):361–368. doi: 10.1001/archgenpsychiatry.2010.178.
- [37] Mourão-Miranda J, Reynaud E, McGlone F, Calvert G, Brammer M. The impact of temporal compression and space selection on SVM analysis of single-subject and multi-subject fMRI data. Neuroimage. 2006;33(no. 4):1055–1065. doi: 10.1016/j.neuroimage.2006.08.016.
- [38] Fitzgerald P, Laird A, Maller J, Daskalakis Z. A meta-analytic study of changes in brain activation in depression. Hum Brain Mapp. 2008;29(no. 6):683–695. doi: 10.1002/hbm.20426.
- [39] Koolschijn P, van Haren N, Lensvelt-Mulders G, Pol HH, Kahn R. Brain volume abnormalities in major depressive disorder: a meta-analysis of magnetic resonance imaging studies. Hum Brain Mapp. 2009;30(no. 11):3719–3735. doi: 10.1002/hbm.20801.
- [40] Chen C-H, Suckling J, Lennox B, Ooi C, Bullmore E. A quantitative meta-analysis of fMRI studies in bipolar disorder. Bipolar Disorders. 2011;13(no. 1):1–15. doi: 10.1111/j.1399-5618.2011.00893.x.
- [41] Pizzagalli D. Frontocingulate dysfunction in depression: toward biomarkers of treatment response. Neuropsychopharmacology. 2011;36(no. 1):183–206. doi: 10.1038/npp.2010.166.
- [42] Du M, Wu Q, Yue Q, Li J, Liao Y, Kuang W, Huang X, Chan R, Mechelli A, Gong Q. Voxelwise meta-analysis of gray matter reduction in major depressive disorder. Prog Neuropsychopharmacol Biol Psychiatry. 2012;36(no. 1):11–16. doi: 10.1016/j.pnpbp.2011.09.014.
- [43] Bora E, Fornito A, Pantelis C, Yücel M. Gray matter abnormalities in major depressive disorder: a meta-analysis of voxel based morphometry studies. J Affect Disord. 2012;138:9–18. doi: 10.1016/j.jad.2011.03.049.
- [44] Diener C, Kuehner C, Brusniak W, Ubl B, Wessa M, Flor H. A meta-analysis of neurofunctional imaging studies of emotion and cognition in major depression. Neuroimage. 2012;61(no. 3):677–685. doi: 10.1016/j.neuroimage.2012.04.005.
- [45] Groenewold N, Opmeer E, de Jonge P, Aleman A, Costafreda S. Emotional valence modulates brain functional abnormalities in depression: evidence from a meta-analysis of fMRI studies. Neurosci Biobehav Rev. 2013;37(no. 2):152–163. doi: 10.1016/j.neubiorev.2012.11.015.