Abstract
There is growing interest in using support vector machines (SVMs) to classify and analyze fMRI signals, leading to a wide variety of applications ranging from brain-state decoding to functional mapping of spatially and temporally distributed brain activations. Studies so far have generated functional maps using the vector of weight values produced by the SVM classification process, or alternatively by mapping the correlation coefficient between the fMRI signal at each voxel and the brain state determined by the SVM. However, these approaches are limited because they do not incorporate both sources of information involved in the SVM prediction of a brain state, namely, the BOLD activation at each voxel and the degree of involvement of different voxels as indicated by their weight values. An important implication of this point is that two different datasets of BOLD signals, presumably obtained from two different experiments, can produce two identical hyperplanes irrespective of differences in their data distributions; yet the two sets of signal inputs could correspond to different functional maps. With this consideration, we propose a new method, called Effect Mapping, in which the map is generated as the product of the weight vector and a newly computed vector of mutual information between the BOLD activation at each voxel and the SVM output. By applying this method to neuroimaging data of overt motor execution in nine healthy volunteers, we demonstrate higher decoding accuracy, indicating the greater efficacy of this method. Hum Brain Mapp, 2010. © 2010 Wiley-Liss, Inc.
Keywords: fMRI, multivariate analysis, multivariate pattern analysis, support vector machine
INTRODUCTION
Pattern-based methods use sophisticated machine learning techniques, such as multilayer neural networks and support vector machines, to discriminate spatial, temporal, and spectral patterns in a system. Such methods have been successfully used in character recognition, speech recognition, and image recognition applications [Jain et al., 2000]. Rapid progress in the application of data mining and statistical techniques, together with the growth of computing power, has enabled the efficient manipulation of large amounts of neuroimaging data, acquired from high-resolution brain scans at many time points, for multivariate pattern analysis [Haynes and Rees, 2006; Norman et al., 2006]. Multivariate pattern classification and analysis methods have been used with great success in several neuroimaging studies, including unconscious antecedents of free decisions [Soon et al., 2008], lie detection [Davatzikos et al., 2005], visual processing [Haxby et al., 2001; Kamitani and Tong, 2005], and emotion [Mourao-Miranda et al., 2007].
In contradistinction to univariate analysis, which evaluates each brain location separately even though brain activity is measured from many thousands of locations simultaneously, multivariate analysis is based on the insight that multiple, spatially distributed regions act in concert during a task. Pattern analysis methods also provide an objective criterion for determining the importance of different brain regions in a given task, by simply comparing the accuracy of decoding a task from signals extracted from individual brain regions or groups of brain regions [Haynes et al., 2007; Soon et al., 2008]. Pattern classification can be used not only to separate different task conditions or brain states, but also to test the consistency of brain activation across tasks or sessions, and to track temporal transitions of brain states [Polyn et al., 2005].
Among different pattern classification methods, SVMs are one of the most widely used for fMRI signals [LaConte et al., 2005; Mourao-Miranda et al., 2005, 2006, 2007]. Support vector machines (SVMs) are a set of supervised learning methods used for classification and regression. By considering input data as two sets of vectors in an M-dimensional space, a linear SVM constructs a separating hyperplane in that space. A good separation is achieved by maximizing the margin, that is, the distance from the separating hyperplane to the input vectors closest to it (the support vectors) [Schölkopf and Smola, 2002; Schölkopf et al., 1999; Vapnik, 1995]. The criterion can be formulated as a quadratic optimization problem, and the optimal solution can therefore be found by applying optimization theory. The advantage of SVMs in real-world applications is their superior classification accuracy with small sample sizes and high-dimensional inputs.
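As a concrete sketch of such a linear SVM (using scikit-learn and invented toy "voxel" data purely for illustration; the study itself used SVMlight), the separating hyperplane and its decision values can be computed as:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy stand-in for fMRI input vectors (invented for illustration):
# M = 50 "voxels", 40 samples per class, classes differing only in mean.
X_pos = rng.normal(loc=0.5, scale=1.0, size=(40, 50))
X_neg = rng.normal(loc=-0.5, scale=1.0, size=(40, 50))
X = np.vstack([X_pos, X_neg])
L = np.hstack([np.ones(40), -np.ones(40)])

# Linear SVM with a large C (hard-margin-like, as in the study).
clf = SVC(kernel="linear", C=1e5)
clf.fit(X, L)

# Decision value y = w^T x + b; the predicted class label is sgn(y).
w, b = clf.coef_.ravel(), clf.intercept_[0]
y_val = X @ w + b
acc = np.mean(np.sign(y_val) == L)
```

With well-separated classes in a high-dimensional space, the margin-maximizing hyperplane classifies the training samples essentially perfectly, illustrating the small-sample, high-dimension regime described above.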
Some applications of SVM to fMRI signals have generated functional maps by displaying the weight value at each voxel [Mourao-Miranda et al., 2005, 2006, 2007]. These studies maintained that the weight vector can identify the most discriminating voxels in a multivariate analysis, since the weight vector is the direction along which the input vectors from the two conditions differ most. However, they considered only one (the weight vector) of the two factors that determine an SVM output, namely, the weight vector and the input vector. It needs to be highlighted that SVMs are trained on the input samples to minimize classification-error rates by computing the weight vector solely from the support vectors, which reside near the border of the hyperplane. It follows that the weight vector so computed does not fully reflect the statistical distribution of the input vectors. As such, using the weight vector alone to generate functional maps is equivalent to presenting only part of the information.
LaConte et al. used the pair‐wise correlation between the BOLD signal at each voxel and distance from the margin [LaConte et al., 2005]. This approach derives from the consideration that the distance from the separating hyperplane is related to ease of discrimination, and based on the intuition that the sample closest to the hyperplane is most difficult to classify. Although the distance is related to discrimination, it does not follow that the farthest sample is the most important for discrimination as it may contain very little information about data distributions of the tasks (see Theory for more detail). Additionally, this approach does not exploit the advantage of multivariate analysis of the SVM due to its univariate measure.
In this study, we propose a new functional-mapping method to identify the voxels more closely related to the actual importance in classification. The method incorporates information from both the weight vector and the input vectors that together determine the SVM output. Toward this end, we first derive the formula for a new quantity called the effect value (EV). The effect value for a single voxel is defined as the statistical relation (mutual information; see Theory for detail) between the voxel and the SVM output, multiplied by the corresponding weight value of the SVM. Subsequently, we compare the proposed method with the functional maps generated by previous methods. The comparison is done quantitatively by evaluating classification performance from the voxels identified as informative by competing functional-mapping approaches, generated from data acquired during overt motor execution. We chose fMRI data from the overt motor execution task for this investigation because this task is easy to execute consistently across runs and among all healthy volunteers, so that differences in classification accuracy are more readily attributable to differences in the performance of the classifiers per se than to artifactual effects of unreliable data.
THEORY
First, we summarize the basic concept of SVM and the procedure for applying SVM to fMRI data. Next, based on considerations of the theoretical basis of SVM and the limitations of the conventional interpretation of SVM, we propose the new method.
Support Vector Machine (SVM)
In a typical SVM analysis of fMRI signals, the BOLD values from all brain voxels at each repetition time (TR) are contained in the M-dimensional input vector x_i (M: number of brain voxels; the subscript i indexes the sample; bold font indicates a column vector; x_{k,i} denotes the value at voxel k of sample i). The SVM determines a scalar class label L_i from x_i as follows:
L_i = sgn(w^T x_i + b)   (1)
where the weight vector w and the constant b, estimated by the SVM training algorithm from the training dataset, define a linear decision boundary; the superscript T denotes the transpose of a vector; and sgn(·) is the sign function, sgn(x) = +1, 0, −1 for x > 0, x = 0, x < 0, respectively.
When the input vectors x i and the design labels L (if the input vector comes from a condition of interest, then L = 1; on the other hand if the input vector comes from a rest condition or a control condition, then L = −1) are taken from the training dataset, the linear SVM algorithm attempts to find a separating hyperplane y = w T x + b = 0 in the feature space. The weight vector w of a linear SVM is obtained by minimizing objective function of Eq. (2) with constraints Eqs. (3) and (4),
min_{w,ξ} (1/2) ‖w‖² + C Σ_i ξ_i   (2)

L_i (w^T x_i + b) ≥ 1 − ξ_i, for all i   (3)

ξ_i ≥ 0, for all i   (4)
where the slack variable ξ_i is introduced to handle the nonseparable case (i.e., data that cannot be separated without classification error), and C denotes the weighting on the slack variables (i.e., the extent to which misclassification is penalized). The minimization of Eq. (2) originates from the maximization of the margin (margin width = 2/‖w‖₂), whose boundaries, defined as y = w^T x + b = ±1, are built from the support vectors (i.e., SV = {x_i | y_i = w^T x_i + b = ±1}) of each class.
The main objective function, Eq. (2), and the constraints, Eqs. (3) and (4), can be combined into one unconstrained form by introducing Lagrange multipliers. From this formulation, the weight vector w can be obtained as:
w = Σ_i α_i L_i x_i   (5)
Here, α_i is the Lagrange multiplier, and its value determines whether the input vector x_i is a support vector: when α_i is nonzero, the corresponding input vector is a support vector.
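A hedged sketch of Eq. (5): in scikit-learn's linear `SVC` (a stand-in for SVMlight used here for illustration), the products α_i·L_i for the support vectors are stored in `dual_coef_`, so the weight vector can be rebuilt from the support vectors alone. The toy data below are invented:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Invented toy data: the class label is driven mostly by the first feature.
X = rng.normal(size=(60, 10))
L = np.where(X[:, 0] + 0.3 * rng.normal(size=60) > 0, 1.0, -1.0)

clf = SVC(kernel="linear", C=1e5)
clf.fit(X, L)

# Eq. (5): w = sum_i alpha_i * L_i * x_i, over the support vectors only.
# scikit-learn stores alpha_i * L_i in dual_coef_ and the support
# vectors themselves in support_vectors_.
w_from_sv = clf.dual_coef_ @ clf.support_vectors_   # shape (1, 10)
```

That `w_from_sv` coincides with the fitted weight vector illustrates the point made below: nonsupport vectors contribute nothing to w.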
Interpretation of SVM Results
In fMRI studies using SVM, an intuitive way to analyze the results of SVM training might be to overlay the weight vector onto brain images [Mourao-Miranda et al., 2005]. Although this method shows which weight values produce the larger effects, it is limited by the fact that the output y is not determined by the weight vector alone, but also by the input vector containing the BOLD values at each voxel. In addition, this method has the disadvantage that different datasets pertaining to different tasks may generate identical hyperplanes (see Fig. 1A).
Figure 1.
Illustrations of characteristics of an SVM. (A) Two different datasets (red and blue) having the same separating hyperplane (y = w^T x + b = 0). Even though the two datasets have different distributions (red line: distribution of dataset 1; blue broken line: distribution of dataset 2), an SVM trained on each dataset separately can determine the same separating hyperplane (i.e., the same weight vector). (B) Distribution of the SVM outputs. The distance of an input vector from the separating hyperplane is proportional to the SVM output, with greater distance indicating greater separation between the conditions. In condition 1, SVM output y_1 is closer to the mean, or center of the distribution (red broken line), than SVM output y_2; therefore, it is likely that there are more samples close to x_1 than to x_2. However, in terms of data distribution, x_1 is more important than x_2 even though x_1 is closer to the separating hyperplane than x_2. (C) Illustration of the determination of the SVM output by the weight vector and the input vectors. For the trained SVM model y = 10x_1 + 2x_2 − 7x_3 − x_4, there are three input vectors, all nonsupport vectors, classified into the same class (+1). When the importance of each element of the input vectors is judged by the magnitude of the corresponding weight value, the 1st (10) and 3rd (−7) elements are the two most important [Mourao-Miranda et al., 2005]. If the brain state of one class (+1 or −1) can be represented as one spatial pattern, the importance of each element of the input vectors could instead be judged by the consistency of that element. The values (2.0, 2.1, and 1.9) of the 1st element (x_1) are consistent across the input vectors, but the values (3.0, 2.0, and 1.0) of the 3rd element (x_3) show no consistency, even though the SVM outputs are consistent and the corresponding weight value is large.
This shows that the importance of an element of the input vectors is not simply proportional to the magnitude of the corresponding weight value alone. Rather, the effective importance of an element depends on both the weight value and the sample data distribution considered together. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
LaConte et al. used feature space weighting (FSW) to generate the functional maps from the SVM results [LaConte et al., 2005]. This approach used a distance measure from the separating hyperplane to the estimated output of SVM, leading to a weighted average contrast function (i.e., the contrast value at each time weighted by distance measure from the margin). Although they reported a relationship between the distance and discriminability, theoretically, the distance of the input vectors from the separating hyperplane is not completely representative of the given task (see Fig. 1B). In addition, generation of the functional maps by using pixel‐wise correlation between single voxel activation and the contrast function might not accurately represent the distinction between the classified brain states because it does not take into account the multivariate contribution of the weight vector to the SVM output.
Typically, SVM maximizes the distance between two hyperplanes composed from support vectors of each class without considering the data distribution of all the input vectors. As seen in Eq. (5), the weight vector is a weighted sum of only the support vectors and does not incorporate any information from the nonsupport vectors. In addition, when the input vectors of one condition (+1 or −1) are considered, one single activation pattern would represent the class. That is, many repetitions of one task lead to similar activation patterns in the brain regions associated with the task (i.e., small variance in elements of a class or higher signal‐to‐noise ratio (SNR)). In this view, important or essential elements of the input vectors from a class would be expected to show higher consistency or lower variance of activations across several samples. However, Figure 1C illustrates that the elements of the input vectors corresponding to the higher weight values do not necessarily repeat with higher consistency. In Figure 1C, three different input vectors result in similar SVM output with the same weight vector. This shows that consideration of only a single component, i.e., the weight vector, is not enough to determine the effect of individual elements of the input vector on the SVM output. This argument calls for the combined application of the weight vector and the input vector in obtaining legitimate functional activations.
Effect Mapping (EM)
Effect mapping considers both the effect of each voxel's activation on the output of the classifier and the weight vector of the estimated SVM model. EM measures the effect of each voxel on the classifier output by computing the mutual information (MI) between the voxel and the output. MI is defined as the amount of information that one random variable contains about another random variable [Cover and Thomas, 1991]. That is, when two random variables X and Y occur with a joint probability mass function p(x,y) and marginal probability functions p(x) and p(y), the entropies of the two random variables and their joint entropy are given, respectively, by:
H(X) = −Σ_x p(x) log p(x),  H(Y) = −Σ_y p(y) log p(y),  H(X,Y) = −Σ_{x,y} p(x,y) log p(x,y)
MI, I(X;Y), is the relative entropy between the joint distribution and the product distribution, i.e.
I(X;Y) = Σ_{x,y} p(x,y) log [p(x,y) / (p(x) p(y))] = H(X) + H(Y) − H(X,Y)   (6)
Because the value of mutual information depends on the entropies H(X) and H(Y), a normalized mutual information is defined as [Maes et al., 1997]:

Ĩ(X;Y) = 2 I(X;Y) / (H(X) + H(Y))   (7)
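A minimal sketch of a histogram-based normalized-MI estimator of this kind, assuming the symmetric normalization 2·I(X;Y)/(H(X)+H(Y)) (the authors' exact normalization and binning are implementation choices not fully specified here):

```python
import numpy as np

def normalized_mi(x, y, bins):
    """Histogram estimate of normalized mutual information,
    assuming the symmetric form 2*I(X;Y) / (H(X) + H(Y))."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()            # joint probability p(x, y)
    px = pxy.sum(axis=1)             # marginal p(x)
    py = pxy.sum(axis=0)             # marginal p(y)

    def entropy(p):
        p = p[p > 0]                 # convention: 0 log 0 = 0
        return -np.sum(p * np.log(p))

    h_x, h_y, h_xy = entropy(px), entropy(py), entropy(pxy.ravel())
    i_xy = h_x + h_y - h_xy          # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return 2.0 * i_xy / (h_x + h_y)
```

Identical variables yield a value of 1, while independent variables yield a value near 0, which is what makes the quantity usable as a consistency measure across voxels.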
As shown in Figure 2, Ĩ(x_i; y) takes into consideration the consistency of x_i based on the data distribution of y, on the assumption that the probability density of the SVM outputs decreases as y increases; the distribution of the SVM outputs y in our analysis also reflects this assumption well. In addition, Ĩ(x_i; y) reflects nonlinear dependencies between the distributions of x_i and y better than a simple correlation method.
Figure 2.
Illustration of mutual information in consistent (i.e., most x_i's are mapped to y = 1; red broken circle in Figure A) and inconsistent (i.e., the mapping of the x_i's is inconsistent, with no dominant probability at any one point; blue broken ellipse in Figure B) mappings. The probability of each mapping point is denoted P_{a,b} = p(x_i = a, y = b). In both figures, the data distribution is shown for one condition (y > 0) under the assumptions that all input vectors are correctly classified and that p(y) decreases as y increases (i.e., p(y = 1) > p(y = 2)), because an SVM model for data as high dimensional as fMRI data has many support vectors, so many SVM outputs reside close to the hyperplanes y = w^T x + b = ±1 composed of the support vectors. Figures A and B show the normalized mutual information between x_i and y. With regard to the mapping of the x_i's onto the SVM outputs y, A shows higher probability at one point, reflecting higher consistency, and hence higher mutual information, than B. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
Hence, the effect value (EV) E k of a voxel k, designed to take into consideration the above points, is defined as:
E_k = w_k Ĩ(x_k; y)   (8)
where y is the SVM output excluding the sign function (i.e., y = w^T x + b), and w_k and x_k are the weight value and the BOLD activation at voxel k, respectively.
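Putting the pieces together, a hypothetical sketch of Eq. (8): E_k as the weight value times the normalized MI between voxel k and the SVM output y, computed here with an assumed histogram estimator (exact binning and normalization are illustrative choices):

```python
import numpy as np

def normalized_mi(x, y, bins):
    # Histogram estimate of Eq. (7)-style normalized MI (symmetric form).
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    h_x, h_y = entropy(px), entropy(py)
    return 2.0 * (h_x + h_y - entropy(pxy.ravel())) / (h_x + h_y)

def effect_map(X, w, b, bins):
    """E_k = w_k * NMI(x_k; y), with y = X @ w + b the SVM output
    before the sign function. X: (n_samples, n_voxels)."""
    y = X @ w + b
    return np.array([w[k] * normalized_mi(X[:, k], y, bins)
                     for k in range(X.shape[1])])
```

A voxel that both carries a large weight and varies consistently with the SVM output receives the largest effect value, which is the intended behavior of the map.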
MATERIALS AND METHODS
Participants and Experimental Protocol
We analyzed fMRI data from nine right‐handed healthy college students (age: 26.4 ± 5.2). None of the participants had any history of neurological or psychiatric disorders. The study was approved by the ethics committee of the Faculty of Medicine of the University of Tuebingen. Stimuli were presented in a block design. There were two active conditions (left‐hand (LH) and right‐hand (RH) movements) and a resting condition. During active conditions, each participant was instructed to move his/her palm and fingers freely. Participants were asked to restrict movement above the wrist, for example, in the elbows and shoulders. Each active and inactive condition (rest state) lasted 30 s (15 scans). In our analysis, 12 active condition blocks (6 LHs +6 RHs) from 2 runs of fMRI measurement were used to train SVM and make comparisons.
Data Acquisition
Functional images were acquired on a 3.0 T whole body scanner, with a standard 12‐channel head coil (Siemens Magnetom Trio Tim, Siemens, Erlangen, Germany). A standard echo‐planar imaging sequence was used (EPI; TR = 2 s, TE = 30 ms, flip angle α = 78°, bandwidth = 2.232 kHz/pixel). Thirty‐two slices (voxel size = 3 × 3 × 3.75 mm3, slice gap = 1 mm), AC/PC aligned in axial orientation were acquired.
Preprocessing and Classification
Preprocessing was performed with SPM5 (Wellcome Department of Imaging Neuroscience, London) and classification was performed using MATLAB (The Mathworks, Natick, MA) scripts.
We performed realignment of the functional images, coregistration between the functional images and the structural image, and normalization of the functional and structural images to Montreal Neurological Institute (MNI) space. After excluding nonbrain areas with a brain mask (i.e., the brain mask file "mask.img" generated with SPM5), z-normalization (z-value: (x − mean(x))/standard deviation(x), x: samples) was applied across the entire time series of each voxel, separately for each run of each participant's data, to correct for the variance of BOLD signals across runs and participants. Whole-brain images from each TR were used as input vectors to the SVM classifier, and individual scans were classified.
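The per-voxel, per-run z-normalization described above can be sketched as follows (a minimal version; `run` is an assumed scans-by-voxels array, not the authors' SPM/MATLAB code):

```python
import numpy as np

def z_normalize_run(run):
    """Z-normalize each voxel's time series within one run:
    z = (x - mean(x)) / std(x). run: (n_scans, n_voxels)."""
    mean = run.mean(axis=0, keepdims=True)
    std = run.std(axis=0, keepdims=True)
    std[std == 0] = 1.0      # guard: constant voxels map to zero
    return (run - mean) / std
```

Applying this run by run and participant by participant puts all voxels on a common scale before the runs are pooled for SVM training.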
The freely available SVM software SVMlight [Joachims, 1999] was used to implement the classifier. Linear SVMs were trained with a fixed regularization parameter C = 10^5 (i.e., a hard-margin SVM) to remove variability of classification performance dependent on the regularization parameter C. In the classification procedure, LH and RH were given 1 and −1 as design labels for the SVMs. Nine-fold cross validation (CV) [Hastie et al., 2001] was applied to the group data of the nine participants. In each fold, the data of eight participants were used to train an SVM classifier, and the data of the one remaining participant were used to test it.
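The nine-fold, leave-one-participant-out scheme can be sketched as follows (scikit-learn stands in for SVMlight here; the data structures and synthetic test data are invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC

def loso_accuracy(data_by_subject, labels_by_subject, C=1e5):
    """Leave-one-subject-out CV: in each fold, all but one subject
    train a linear SVM with fixed C; the held-out subject is tested."""
    n = len(data_by_subject)
    accs = []
    for test_idx in range(n):
        X_train = np.vstack([data_by_subject[i]
                             for i in range(n) if i != test_idx])
        y_train = np.hstack([labels_by_subject[i]
                             for i in range(n) if i != test_idx])
        clf = SVC(kernel="linear", C=C)
        clf.fit(X_train, y_train)
        accs.append(clf.score(data_by_subject[test_idx],
                              labels_by_subject[test_idx]))
    return float(np.mean(accs))
```

Holding out an entire participant per fold, rather than individual scans, is what prevents within-subject information from leaking between training and test sets.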
Computation of Probability
In the computation of mutual information, the joint and marginal distributions were estimated with a two-dimensional joint histogram and one-dimensional histograms (http://www.cs.rug.nl/~rudy/matlab/). In our analysis, the number of histogram bins was defined as N^{1/3} (N: number of samples; (8 × 180)^{1/3} ≈ 11 bins).
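A small sketch of this probability estimation (the data are invented stand-ins for voxel values and SVM outputs; `np.histogram2d` plays the role of the cited MATLAB joint-histogram code):

```python
import numpy as np

# Number of histogram bins chosen as N^(1/3), as in the text
# (N = 8 participants x 180 scans = 1440 samples).
n_samples = 8 * 180
n_bins = int(round(n_samples ** (1.0 / 3.0)))   # 1440^(1/3) ~ 11.3 -> 11

# Joint and marginal probabilities from a 2-D histogram
# (x and y are invented stand-ins for a voxel and the SVM output).
rng = np.random.default_rng(0)
x = rng.normal(size=n_samples)
y = 0.5 * x + rng.normal(size=n_samples)
joint, _, _ = np.histogram2d(x, y, bins=n_bins)
p_xy = joint / joint.sum()                       # joint probability p(x, y)
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)    # marginals
```

The marginals obtained by summing the normalized joint histogram are exactly the one-dimensional histogram estimates, so a single 2-D histogram suffices for Eq. (6).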
Comparison of Functional Maps
To prevent the double-dipping problem [Kriegeskorte et al., 2009], comparisons were made with nine-fold CV on the group data of nine participants. In each fold, the data of eight participants were used to train an SVM and to compute the functional maps (i.e., F(SW)-, I-, W(eight)-, and E(ffect)-maps) (see Fig. 3). To compare the capability of the functional maps for detecting important spatial patterns of brain activity, the top x% of brain voxels (e.g., 5%) having the highest functional values in magnitude were selected from each functional map. Then, the selected voxels in the data of the eight participants were used to train a linear SVM with C = 10^5, and the same voxels in the data of the one remaining participant were used for classification.
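The top-x% voxel selection can be sketched as follows (a minimal version; tie handling and rounding are implementation choices):

```python
import numpy as np

def select_top_fraction(map_values, fraction):
    """Indices of the top `fraction` of voxels by absolute map value
    (e.g. fraction=0.05 for the top 5% of a functional map)."""
    k = max(1, int(round(fraction * map_values.size)))
    order = np.argsort(np.abs(map_values))[::-1]   # descending |value|
    return np.sort(order[:k])
```

Because selection and retraining both happen inside the training fold, the held-out participant never influences which voxels are chosen.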
Figure 3.
Functional maps from a group analysis of non-smoothed data. In the functional maps computed from the SVM outputs (F(SW)-, W(eight)-, and E(ffect)-maps), clusters with positive values (red) correspond to positive SVM weight values, while clusters with negative values (blue) correspond to negative SVM weight values. In B, the E-map is drawn after rescaling the EVs of Eq. (8) for display purposes (to shrink extreme values) with the following formula: lE_k = sgn(E_k) log(1 + |E_k|/std(|E|)), where std(|E|) is the standard deviation of the |E_k| values. These maps show six horizontal slices, every 12 mm, from the whole brain in MNI space. The functional maps are drawn by selecting the 5% of voxels (for display purposes) having the highest values in magnitude. (A) F-map. (B) I-, W-, and E-maps. The I-map shows a distributed pattern of mutual information between the input vector and the SVM output (without the sign function; y = w^T x + b). In principle, mutual information is zero or positive, but for comparison with the other methods, the values are multiplied by the sign of the SVM weight values. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
Additionally, when comparing between two functional maps generated by two competing methods, it is possible that the decoding accuracies of the two maps are similar due to common voxels selected by these methods, yet the two methods might indicate different sets of voxels as most informative. This consideration is critical if differences of activation patterns from two functional maps are obtained in different functional areas or on the border of two functional areas. As an example, let us assume that two patterns of voxels in the visual cortex, encompassing brain areas such as V1, V2, V3, and V4, are generated by two different functional mapping methods. It is conceivable that the two methods provide similar decoding accuracies, but one method (e.g., E‐map) identifies voxels from the V1 and V3 areas as most informative, while the other (e.g., W‐map) identifies brain areas from the V1 and V4 areas as most informative. Given this, the open question is: which functional area between the V3 and V4 areas is more involved in the given task, acknowledging that V1 is the common area of activation. To be able to answer this question in our study, overlapping and nonoverlapping activations from two competing maps were considered separately in the voxel selection process. One SVM classifier for overlapping areas of the two competing functional maps, and two separate SVM classifiers for two nonoverlapping areas of the two competing functional maps were trained from the BOLD activations from the voxels selected. Then, the trained SVMs were tested to classify the data of one remaining participant. For instance, in comparing between nonoverlapping areas of E‐ and W‐maps, voxels only generated by E‐map but not by W‐map were used to train an SVM and classify the test dataset after selecting the top x% of brain voxels from both of E‐ and W‐maps, and vice versa. 
This method of comparison gives insight into how well one functional map can identify discriminating voxels which the other functional map does not. In addition, if voxels selected from the two different maps show similar decoding accuracy, and decoding accuracies from commonly selected voxels are also comparable to the decoding accuracy from all the voxels selected from each method, it is likely that the classification performance is more dependent on the overlapping voxels than on the nonoverlapping voxels. To evaluate the performance of functional maps more clearly, therefore, it is useful to test the decoding accuracy from nonoverlapping (i.e., exclusively selected voxels) areas. In addition, information such as the amount of overlap and how much decoding accuracy from the overlapping voxels is close to decoding accuracy from all the selected voxels provides greater insight for the purpose of comparison.
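The overlap bookkeeping described above can be sketched as follows (voxel indices stand in for the selected map voxels; names are illustrative):

```python
import numpy as np

def split_overlap(idx_a, idx_b):
    """Partition two voxel-index sets into common and exclusive parts,
    as in the overlap comparison between two functional maps."""
    a = set(np.asarray(idx_a).tolist())
    b = set(np.asarray(idx_b).tolist())
    common = np.array(sorted(a & b), dtype=int)   # selected by both maps
    only_a = np.array(sorted(a - b), dtype=int)   # selected by map A only
    only_b = np.array(sorted(b - a), dtype=int)   # selected by map B only
    return common, only_a, only_b
```

Training one classifier on `common` and separate classifiers on `only_a` and `only_b` then yields the three accuracy comparisons used in the text.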
Clustering Functional Maps
To compare the functional maps in terms of the degree of spread, or conversely the degree of focus, of local patterns, the areas occupied by different clusters were calculated in each map. The top x% of voxels (e.g., 5%), hereafter called the first-level threshold, having the highest functional values (i.e., FSW value in the F-map, I value in the I-map, weight value in the W-map, and EV in the E-map) in magnitude were selected. The selected voxels were clustered by the surface connection method [Thurfjell et al., 1992]. Then, the ratio of the area of clusters consisting of at least x voxels (where x = 10, 30, 50, and 70), hereafter called the second-level threshold, to the total area of the selected voxels was computed. Increasing the minimum cluster size means that the surviving clusters are larger, while the area occupied by all surviving clusters remains the same or becomes smaller.
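A sketch of this cluster-area ratio, using `scipy.ndimage.label` with its default face connectivity as a stand-in for the surface connection method of Thurfjell et al. (the mask construction in the test is invented):

```python
import numpy as np
from scipy import ndimage

def cluster_area_ratio(mask, min_size):
    """Ratio of the area (voxel count) of clusters with at least
    `min_size` voxels to the total selected area. `mask` is a boolean
    3-D volume of the voxels surviving the first-level threshold."""
    total = mask.sum()
    if total == 0:
        return 0.0
    labeled, _ = ndimage.label(mask)          # default: face (6-)connectivity
    sizes = np.bincount(labeled.ravel())[1:]  # cluster sizes; label 0 = background
    return float(sizes[sizes >= min_size].sum() / total)
```

Sweeping `min_size` over {10, 30, 50, 70} reproduces the second-level thresholding: maps dominated by large clusters keep a high ratio, while maps of scattered small clusters lose area quickly.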
RESULTS
Classification
We first evaluated whether SVM could consistently classify the fMRI group data between left‐ and right‐hand movement conditions. Our analysis indicated that SVM could discriminate between LH and RH conditions with an average accuracy of over 90% (95.8% ± 1.2%).
Comparison of Functional Maps
Characteristics of the activation clusters of each functional map are presented as ratios of the area occupied by clusters surviving the second-level threshold to the total area of voxels selected with the first-level threshold (see Fig. 4). The voxel selection in Figure 4 was performed with the same method as in Figure 3, but with additional thresholds, i.e., 10, 5, and 2.5%. The graphs in Figure 4B show the characteristics of the clusters in Figure 3. The F-map illustrates that larger clusters occupy most of the selected area (see Fig. 4). In contrast, the W-map shows more widespread clusters (see Figs. 3 and 4), with higher ratios of area taken by small clusters. The E-map displays an intermediate degree of spread between the I-map and the W-map. However, the characteristics of cluster area in the E-map obtained after decreasing the first-level threshold (i.e., from 10 to 5% and 2.5%) resemble those of the F-/I-maps.
Figure 4.
Ratio of area of clusters remaining after applying the second‐level threshold to total area of voxels remaining after applying the first‐level threshold. The first‐level thresholds 10, 5, and 2.5%, are used in A–C, respectively. The second‐level thresholds {10, 30, 50, 70} are used in all the figures. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
Figure 5 shows a comparison of decoding performance on the voxels selected by the four functional maps (F-/W-/I-/E-maps). The decoding accuracy of the E-map is similar to, or slightly higher than, that of the others at all numbers of selected voxels (10, 5, 2.5, and 1%).
Figure 5.
Comparison of cross‐validation results (i.e., classification accuracies (mean classification accuracy rate (%) ± standard error of the mean (%))) from voxels selected by the functional maps. All figures are drawn with nine‐fold CV on non‐smoothed data. After selecting top 10, 5, 2.5, and 1% brain voxels in magnitude in each functional map, the selected voxels were used to evaluate classification accuracy, respectively. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
In Figure 6, classification accuracies for overlapping and nonoverlapping (i.e., exclusively selected) voxels are shown for the top x% (10, 5, 2.5, and 1%) of brain voxels in each type of functional map. These comparisons were performed on nonsmoothed data, on two types of functional maps at a time. Commonly selected voxels from the two functional maps provide higher classification accuracy at all thresholds than exclusively selected voxels. Also, the decoding accuracies from the overlapping area are almost the same as those from all the selected voxels of each functional map, even when the ratio of the overlapping area is not very high (i.e., below 70%) (see Figs. 5 and 6). Exclusively selected voxels from the E-map provide higher classification accuracy at all thresholds than those from the F-/I-maps (see Fig. 6A,C). In the comparison between the E- and W-maps, the E-map provides higher prediction accuracies at the 10, 5, and 2.5% thresholds, while the W-map shows better performance at 1% of brain voxels (see Fig. 6E).
Figure 6.
Comparison of cross-validation results (i.e., classification accuracies; mean classification accuracy rate (%) ± standard error of the mean (%)) for overlapping and nonoverlapping areas of two different types of functional map. All figures are drawn with nine-fold CV on nonsmoothed data. In A, C, and E, after selecting the top 10, 5, 2.5, and 1% of brain voxels in magnitude from two functional maps, the overlapping and nonoverlapping areas (based on the concept of the "relative complement") were used to evaluate classification accuracy, respectively. In B, D, and F, after selecting the top 10, 5, 2.5, and 1% of brain voxels in magnitude from two functional maps, the ratios of the area taken by the voxels commonly selected by the two functional maps to the total area taken by all the selected voxels are shown. In each figure, legends naming one functional map (i.e., E, F, I, and W) and legends naming two functional maps (i.e., E and F, E and I, and E and W) indicate the results from the exclusively selected voxels of one functional map and from the commonly selected voxels of two functional maps, respectively. (A) Performance comparison between E- and F-maps. (B) Ratio of overlapping area of E- and F-maps. (C) Performance comparison between E- and I-maps. (D) Ratio of overlapping area of E- and I-maps. (E) Performance comparison between E- and W-maps. (F) Ratio of overlapping area of E- and W-maps. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
DISCUSSION
The present study demonstrates that effect mapping could be an alternative method for the multivariate analysis of fMRI data, as it considers both discriminability and data distribution. As shown in Figures 3 and 4, the E-map shows an intermediate level of sparseness, or focus, compared with the mutual-information method and the weight-vector method. This can be explained by the derivation of the EV: effect mapping merges the effect of each voxel on the SVM output in a multivariate way by multiplying the SVM weight vector with the mutual information. In addition, E-mapping combines the univariate (I-map) and previous multivariate approaches [W-map; Mourao-Miranda et al., 2005, 2006, 2007] as a hybrid of the two methods. Scaling down each voxel's weight value by the normalized MI value can be regarded as a kind of constraint on the multivariate analysis, so the multivariate characteristics remain in the Effect-map. We showed that E-mapping can identify the voxels more closely related to classification than the previously used methods, as it includes both contributing factors that determine the output of the SVM: the differing importance of spatially distributed brain voxels, as represented by the weight vector, and the statistical distribution of the brain activations in the voxels, as represented by the mutual information between each voxel activation and the SVM output. Figure 1A indicated that the SVM weight vector is a linear combination of the support vectors and does not represent the statistical distribution of the input patterns; thus, SVMs trained on different datasets may have the same weight vector. Although the separating hyperplane is optimal in terms of generalization performance, it is a discriminative function in the dimension of the given input vectors.
Figure 1B shows that a greater distance of a sample from the separating hyperplane does not necessarily indicate greater importance in the data distribution. The difference between the I‐map and W‐map depicted in Figure 3B reiterates the theoretical illustration of Figure 1C.
From this perspective, we evaluated the efficacy of different methods of multivariate functional mapping: weight‐vector mapping [Mourao‐Miranda et al., 2005, 2006, 2007], feature‐space weighting [LaConte et al., 2005], mutual information mapping, and the proposed effect mapping, by comparing the classification accuracy of brain states based on each method at increasing thresholds of voxel selection. The comparison was performed in two ways: (1) by comparing the performance from all the voxels selected by each method (see Fig. 5), and (2) by comparing the performance from overlapping and nonoverlapping voxels between the methods (see Fig. 6). The first approach showed a small but consistent improvement in prediction by the effect‐mapping method over the other two methods. The second approach, however, gives more insight into the workings of the three methods. It shows that the overlapping, or common, voxels selected by two different methods provide most of the information necessary for prediction of the brain states, as indicated by the highest classification accuracy (around 96%). However, since the maps identify the nonoverlapping areas as well as the overlapping areas as informative, it is still necessary to evaluate the areas exclusively identified as informative by each of the two methods. Voxels exclusively identified as informative by the FSW or the weight‐vector method did not perform as well in pattern classification as those exclusively identified by the EM method. Additionally, according to Op de Beeck et al. [2008], “The word “map” is generally used to refer to a gradient of selectivities along the cortical sheet.
By contrast, “module”—in the context of brain function—refers to the clustering of selectivities in discrete regions, with clear selectivity discontinuities at the boundaries of these regions.” At the current state of the art, since it is not clear which brain functions are maps and which are modules, it is premature to select a fixed number of voxels as an optimal threshold yielding the highest decoding accuracy. When one observes the classification accuracies at different threshold levels (see Figs. 5 and 6), it is apparent that EM produces consistent prediction accuracies at most thresholds compared with the other mapping methods. This is a significant advantage for both brain‐state decoding and functional‐mapping applications, as one need not conduct a comprehensive search for the best threshold.
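The second comparison above (selecting the top fraction of voxels from each map, then splitting the selections into common voxels and each map's relative complement) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the study's code: `top_fraction` and `overlap_split` are hypothetical names, and selection by absolute map value at a given fraction mirrors the thresholding described in the text.

```python
import numpy as np

def top_fraction(map_values, frac):
    """Indices of the top `frac` fraction of voxels, ranked by
    absolute map value (e.g., frac=0.05 for the top 5%)."""
    k = max(1, int(round(frac * map_values.size)))
    order = np.argsort(np.abs(map_values))[::-1]   # descending by magnitude
    return set(order[:k].tolist())

def overlap_split(map_a, map_b, frac):
    """Split two maps' selections at a threshold into the common voxels
    and each map's relative complement (exclusively selected voxels)."""
    a = top_fraction(map_a, frac)
    b = top_fraction(map_b, frac)
    common = a & b          # voxels selected by both maps
    only_a = a - b          # relative complement: a without b
    only_b = b - a          # relative complement: b without a
    ratio = len(common) / len(a | b)   # overlap ratio, as in Fig. 6B/D/F
    return common, only_a, only_b, ratio
```

Each of the three voxel sets would then be fed back into a classifier to obtain the accuracies compared in Figure 6.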
In conclusion, our theoretical explications and empirical analysis indicate that the new technique of effect mapping can enhance the identification of brain activation patterns in various perceptual and cognitive tasks. Although the proposed method partially overcomes the limitations of previous methods, computing the mutual information in a univariate way remains a disadvantage. In addition, the influence of, and the relationship between, overlapping and nonoverlapping areas could be examined further to better understand the SVM output and obtain more accurate functional maps in neuroimaging.
Acknowledgements
Author S. L. is grateful to DAAD (German Academic Exchange Service) for supporting this research.
REFERENCES
- Cover TM, Thomas JA (1991): Elements of Information Theory. New York: Wiley. 542 p.
- Davatzikos C, Ruparel K, Fan Y, Shen DG, Acharyya M, Loughead JW, Gur RC, Langleben DD (2005): Classifying spatial patterns of brain activity with machine learning methods: Application to lie detection. Neuroimage 28:663–668.
- Hastie T, Tibshirani R, Friedman JH (2001): The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer. 533 p.
- Haxby JV, Gobbini MI, Furey ML, Ishai A, Schouten JL, Pietrini P (2001): Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science 293:2425–2430.
- Haynes JD, Rees G (2006): Decoding mental states from brain activity in humans. Nat Rev Neurosci 7:523–534.
- Haynes JD, Sakai K, Rees G, Gilbert S, Frith C, Passingham RE (2007): Reading hidden intentions in the human brain. Curr Biol 17:323–328.
- Jain AK, Duin RPW, Mao J (2000): Statistical pattern recognition: A review. IEEE Trans Pattern Anal Mach Intell 22:4–37.
- Joachims T (1999): Making large‐scale SVM learning practical. In: Schölkopf B, Burges C, Smola A, editors. Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press.
- Kamitani Y, Tong F (2005): Decoding the visual and subjective contents of the human brain. Nat Neurosci 8:679–685.
- Kriegeskorte N, Simmons WK, Bellgowan PS, Baker CI (2009): Circular analysis in systems neuroscience: The dangers of double dipping. Nat Neurosci 12:535–540.
- LaConte S, Strother S, Cherkassky V, Anderson J, Hu X (2005): Support vector machines for temporal classification of block design fMRI data. Neuroimage 26:317–329.
- Maes F, Collignon A, Vandermeulen D, Marchal G, Suetens P (1997): Multimodality image registration by maximization of mutual information. IEEE Trans Med Imaging 16:187–198.
- Mourao‐Miranda J, Bokde AL, Born C, Hampel H, Stetter M (2005): Classifying brain states and determining the discriminating activation patterns: Support vector machine on functional MRI data. Neuroimage 28:980–995.
- Mourao‐Miranda J, Reynaud E, McGlone F, Calvert G, Brammer M (2006): The impact of temporal compression and space selection on SVM analysis of single‐subject and multi‐subject fMRI data. Neuroimage 33:1055–1065.
- Mourao‐Miranda J, Friston KJ, Brammer M (2007): Dynamic discrimination analysis: A spatial‐temporal SVM. Neuroimage 36:88–99.
- Norman KA, Polyn SM, Detre GJ, Haxby JV (2006): Beyond mind‐reading: Multi‐voxel pattern analysis of fMRI data. Trends Cogn Sci 10:424–430.
- Op de Beeck HP, Haushofer J, Kanwisher NG (2008): Interpreting fMRI data: Maps, modules and dimensions. Nat Rev Neurosci 9:123–135.
- Polyn SM, Natu VS, Cohen JD, Norman KA (2005): Category‐specific cortical activity precedes retrieval during memory search. Science 310:1963–1966.
- Schölkopf B, Smola AJ (2002): Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press. 626 p.
- Schölkopf B, Burges CJC, Smola AJ (1999): Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press. 376 p.
- Soon CS, Brass M, Heinze HJ, Haynes JD (2008): Unconscious determinants of free decisions in the human brain. Nat Neurosci 11:543–545.
- Thurfjell L, Bengtsson E, Nordin B (1992): A new three‐dimensional connected components labeling algorithm with simultaneous object feature extraction capability. CVGIP: Graph Models Image Process 54:357–364.
- Vapnik VN (1995): The Nature of Statistical Learning Theory. New York: Springer. 188 p.