International Journal of Biomedical Imaging. 2011 Jun 23; 2011:350838. doi: 10.1155/2011/350838

Multiclass Sparse Bayesian Regression for fMRI-Based Prediction

Vincent Michel 1,2,3, Evelyn Eger 3,4, Christine Keribin 2,5, Bertrand Thirion 1,3,*
PMCID: PMC3132985  PMID: 21754916

Abstract

Inverse inference has recently become a popular approach for analyzing neuroimaging data, by quantifying the amount of information that brain images contain about perceptual, cognitive, and behavioral parameters. As it outlines the brain regions that convey information for an accurate prediction of the parameter of interest, it makes it possible to understand how the corresponding information is encoded in the brain. However, it relies on a prediction function that is plagued by the curse of dimensionality, as there are far more features (voxels) than samples (images), and dimension reduction is thus a mandatory step. We introduce in this paper a new model, called Multiclass Sparse Bayesian Regression (MCBR), that, unlike classical alternatives, automatically adapts the amount of regularization to the available data. MCBR consists in grouping features into several classes and then regularizing each class differently, in order to apply an adaptive and efficient regularization. We detail this framework and validate our algorithm on simulated and real neuroimaging data sets, showing that it performs better than reference methods while yielding interpretable clusters of features.

1. Introduction

In the context of neuroimaging, machine learning approaches have been used so far to address diagnostic problems, where patients were classified into different groups based on anatomical or functional data. By contrast, in cognitive studies, the standard framework for functional or anatomical brain mapping was based on mass univariate inference procedures [1]. Recently, a new way of analyzing functional neuroimaging data has emerged [2, 3], and it consists in assessing how well behavioral information or cognitive states can be predicted from brain activation images such as those obtained with functional magnetic resonance imaging (fMRI). This approach opens new ways for understanding the mental representation of various perceptual and cognitive parameters, which can be regarded as the study of the corresponding neural code, albeit at a relatively low spatial resolution. The accuracy of the prediction of the behavioral or cognitive target variable, as well as the spatial layout of predictive regions, can provide valuable information about functional brain organization; in short, it helps to decode the brain system [4].

Many different pattern recognition and machine learning methods have been used to extract information from brain images and compare it to the corresponding target. Among them, Linear Discriminant Analysis (LDA) [3, 5], Support Vector Machines (SVM) [6–9], and regularized prediction [10, 11] have been widely used. The major bottleneck in this kind of analytical framework is that there are far more features than samples, so that the problem is plagued by the curse of dimensionality, leading to overfitting. Dimension reduction can be used to extract relevant information from the data, the standard approach in functional neuroimaging being feature selection (e.g., Anova) [3, 6, 11, 12]. However, by performing feature selection and parameter estimation separately, such an approach is not optimal. Thus, a popular combined selection/estimation scheme, Recursive Feature Elimination [13], may be used. However, this approach relies on a specific heuristic, which does not guarantee the optimality of the solution and is particularly costly. By contrast, there is great interest in sparsity-inducing regularizations, which perform selection and estimation simultaneously.

In this paper, we assume that the code under investigation is about some scalar parameter that characterizes the stimuli, such as a scale or shape parameter, but possibly also position, speed (assuming a 1-D space), or cardinality. Thus, we focus on regression problems and defer the generalization to classification to future work. Let us introduce the following predictive linear model:

y=Xw+b, (1)

where y represents the behavioral variable and (w, b) are the parameters to be estimated on a training set. A vector w ∈ ℝp can be seen as an image; p is the number of features (or voxels), and b ∈ ℝ is called the intercept. The matrix X ∈ ℝn×p is the design matrix. Each row is a p-dimensional sample, that is, an activation map related to the observation. With n ≪ p, the estimation of w is ill posed.

To cope with the high dimensionality of the data, one can penalize the estimation of w, for example, based on the ℓ2 norm of the weights. Classical regularization schemes have been used in functional neuroimaging, such as Ridge regression [14], Lasso [15], or Elastic Net regression [16]. However, these approaches require the amount of penalization to be fixed beforehand and possibly optimized by cross-validation. To deal with the choice of the amount of penalization, one can use Bayesian regression techniques, which include the estimation of the regularization parameters in the whole estimation procedure. Standard Bayesian regularization schemes are based on the fact that a penalization by a weighted ℓ2 norm is equivalent to setting Gaussian priors on the weights w:

w ~ 𝒩(0, A⁻¹),  A = diag(α_1, …, α_p),  ∀i ∈ [1, …, p], α_i ∈ ℝ⁺, (2)

where 𝒩 is the Gaussian distribution and α_i the precision of the ith feature. The model in (2) defines two classical Bayesian regression schemes. The first one is Bayesian Ridge Regression (BRR) [17], which corresponds to the particular case α_1 = ⋯ = α_p. By regularizing all the features identically, BRR is not well suited when only a few features are relevant. The second classical scheme is Automatic Relevance Determination (ARD) [18], which corresponds to the case α_i ≠ α_j if i ≠ j. The regularization performed by ARD is very adaptive, as all the weights are regularized differently. However, by regularizing each feature separately, ARD is prone to underfitting when the model contains too many regressors [19] and also suffers from convergence issues [20].
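Both reference schemes have standard implementations; as an illustration (not part of the original study), the scikit-learn estimators BayesianRidge and ARDRegression can be contrasted on a toy n ≪ p problem:

```python
# Illustrative sketch only: contrasting BRR and ARD on a toy n << p regression problem.
import numpy as np
from sklearn.linear_model import BayesianRidge, ARDRegression

rng = np.random.RandomState(0)
n, p = 50, 200
X = rng.randn(n, p)
w_true = np.zeros(p)
w_true[:4] = 2.0                      # only a few relevant features
y = X @ w_true + rng.randn(n)

brr = BayesianRidge().fit(X, y)       # one shared precision for all weights
ard = ARDRegression().fit(X, y)       # one precision per weight (sparse solution)

print("BRR weights above 0.1:", np.sum(np.abs(brr.coef_) > 0.1))
print("ARD weights above 0.1:", np.sum(np.abs(ard.coef_) > 0.1))
```

BRR typically spreads small weights over all features, whereas ARD drives most weights to zero; MCBR, introduced below, sits between these two extremes.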

These classical Bayesian regularization schemes have been used in fMRI inverse inference studies [10, 14, 21]. However, these studies used only sparsity as a built-in feature selection and did not consider neuroscientific assumptions for improving the regularization (i.e., within the design of the matrix A). Indeed, due to the intrinsic smoothness of functional neuroimaging data [22], predictive information is rather encoded in different groups of features sharing similar information. A potentially more adapted approach is the Bayesian regression scheme presented in [23], which regularizes patterns of voxels differently. The weights of the model are defined by w = Uη, where U is a matrix defined as a set of spatial patterns (one pattern per column) and η are the parameters of the decomposition of w in the basis defined by U. The regularization is controlled through the covariance of η, which is assumed to be diagonal with only m possible different values: cov(η) = exp(λ_1)I⁽¹⁾ + ⋯ + exp(λ_m)I⁽ᵐ⁾.

The matrices I⁽ⁱ⁾ are diagonal and define subsets of columns of U sharing a similar variance exp(λ_i). Due to its class-based model, this approach is similar to the one proposed in this paper, but the construction of I relies on ad hoc voxel selection steps, so that there is no guarantee of the optimality of the solution. A contrario, the proposed approach jointly optimizes, within the same framework, the construction of the patterns of voxels and the regularization parameter of each pattern.

In this paper, we detail a model for Bayesian regression in which features are grouped into K different classes that are subject to different regularization penalties. The estimation of the penalty is performed in each class separately, leading to a stable and adaptive regularization. The construction of the groups of features and the estimation of the predictive function are performed jointly. This approach, called Multiclass Sparse Bayesian Regression (MCBR), is thus an intermediate solution between BRR and ARD. It requires fewer parameters to estimate than ARD and is far more adaptive than BRR. Another asset of the proposed approach in fMRI inverse inference is that it creates a clustering of the features and thus yields useful maps for brain mapping. After introducing our model and giving some details on the parameter estimation algorithms (variational Bayes and Gibbs sampling procedures), we show that the proposed algorithm yields better accuracy than reference methods, while providing more interpretable models.

2. Multiclass Sparse Bayesian Regression

We first detail the notations of the problem and describe the priors and parameters of the model. Then, we detail the two different algorithms used for model inference.

2.1. Model and Priors

We recall the linear model for regression:

y=f(X,w,b)=Xw+b. (3)

We denote by y ∈ ℝn the targets to be predicted and X ∈ ℝn×p the set of activation images related to the presentation of different stimuli. The integer p is the number of voxels and n the number of samples (images). Typically, p ~ 10³ to 10⁵ (for a whole volume), while n ~ 10 to 10².

Priors on the Noise —

We use classical priors for regression, and we model the noise on y as an i.i.d. Gaussian variable:

ϵ ~ 𝒩(0, α⁻¹ I_n),  α ~ Γ(α; α_1, α_2), (4)

where α is the precision parameter and Γ stands for the gamma density with two hyperparameters α 1, α 2:

Γ(x; α_1, α_2) = α_2^{α_1} x^{α_1−1} exp(−α_2 x) / Γ(α_1). (5)

Priors on the Class Assignment —

In order to combine the sparsity of ARD with the stability of BRR, we introduce an intermediate representation, in which each feature j belongs to one class among K, indexed by a discrete variable z_j (z = {z_1, …, z_p}). All the features within a class k ∈ {1, …, K} share the same precision parameter λ_k, and we use the following prior on z:

z ~ ∏_{j=1}^{p} ∑_{k=1}^{K} π_k δ_{jk}, (6)

where δ is Kronecker's δ, defined as

δ_{jk} = 0 if z_j ≠ k,  1 if z_j = k. (7)

We finally introduce an additional Dirichlet prior [24] on π:

π~Dir(η) (8)

with a hyperparameter η. By updating at each step the probability π k of each class, it is possible to prune classes. This model has no spatial constraints and thus is not spatially regularized.

Priors on the Weights —

As in ARD, we make use of an independent Gaussian prior for the weights:

w ~ 𝒩(0, A⁻¹)  with  diag(A) = {λ_{z_1}, …, λ_{z_p}}, (9)

where λ zj is the precision parameter of the jth feature, with z j ∈ {1,…, K}. We introduce the following prior on λ k:

λk~Γ(λk;λ1,k,λ2,k) (10)

with hyperparameters λ 1,k, λ 2,k. The complete generative model is summarized in Figure 1.

Figure 1.

Graphical model of Multiclass Sparse Bayesian Regression (MCBR). We denote by y ∈ ℝn the targets to be predicted and by X ∈ ℝn×p the set of activation images. The weights w of the model depend on a discrete variable z that assigns each feature to a class k among K. Both the noise ϵ and the weights w have a Gamma prior on their precisions. The variable z follows a multinomial distribution whose weights π are given a Dirichlet prior.
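To make the generative process of Figure 1 concrete, the following sketch (ours, not the authors' code) draws one synthetic data set by ancestral sampling from the priors (4), (6), (8), (9), and (10); the hyperparameter values are illustrative.

```python
# Illustrative ancestral sampling from the MCBR generative model (Gamma is shape/rate).
import numpy as np

rng = np.random.RandomState(0)
n, p, K = 50, 200, 9

eta = np.ones(K)                            # Dirichlet hyperparameter for pi
lam1 = 10.0 ** (np.arange(1, K + 1) - 4.0)  # class-specific Gamma shapes (cf. Section 2.2.3)
lam2 = 1e-2 * np.ones(K)                    # class-specific Gamma rates
a1, a2 = 1.0, 1.0                           # Gamma prior on the noise precision

pi = rng.dirichlet(eta)                     # class proportions, pi ~ Dir(eta)
z = rng.choice(K, size=p, p=pi)             # class assignment of each feature
lam = rng.gamma(lam1, 1.0 / lam2)           # per-class precisions lambda_k
w = rng.randn(p) / np.sqrt(lam[z])          # w_j ~ N(0, 1 / lambda_{z_j})
alpha = rng.gamma(a1, 1.0 / a2)             # noise precision
X = rng.randn(n, p)                         # activation images (here just white noise)
y = X @ w + rng.randn(n) / np.sqrt(alpha)   # y = Xw + noise
```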

2.1.1. Link with Other Bayesian Regularization Schemes

The link between the proposed MCBR model and the other regularization methods, Bayesian Ridge Regression and Automatic Relevance Determination, is straightforward.

  1. With K = 1, that is, λ_{z_1} = ⋯ = λ_{z_p}, we retrieve the BRR model;

  2. With K = p, that is, λ_{z_i} ≠ λ_{z_j} if i ≠ j, and assigning each feature to a singleton class (i.e., z_j = j), we retrieve the ARD model.

Moreover, the proposed approach is related to the one developed in [25]. In that work, the authors proposed, for the distribution of the feature weights, a binary mixture of Gaussians with small and large precisions. This model is used for variable selection and estimated by Gibbs sampling. Our work can be viewed as a generalization of this model to a number of classes K ≥ 2.

2.2. Model Inference

For models with latent variables, such as MCBR, singularities can exist: for instance, in a mixture of components, a singularity is a component containing one single sample and thus having zero variance. In such cases, directly maximizing the log-likelihood yields flawed solutions, and one works instead with the posterior distribution of the latent variables p(z | X, y). However, this posterior distribution does not have a closed-form expression, and specific estimation methods, such as variational Bayes or Gibbs sampling, have to be used.

We propose two different algorithms for inferring the parameters of the MCBR model. The first estimates the model by variational Bayes and is called VB-MCBR. The second, called Gibbs-MCBR, is based on a Gibbs sampling procedure.

2.2.1. Estimation by Variational Bayes: VB-MCBR

The variational Bayes (or VB) approach provides an approximation q(Θ) of p(Θ | y), where q(Θ) is taken in a given family of distributions and Θ = [w, λ, α, z, π]. Additionally, the variational Bayes approach often uses the following mean field approximation, which allows the factorization between the approximate distribution of the latent variables and the approximate distributions of the parameters:

q(Θ)=q(w)q(λ)q(α)q(z)q(π). (11)

We introduce the Kullback-Leibler divergence 𝒟(q(Θ)), which measures the discrepancy between the true posterior p(Θ | y) and the variational approximation q(Θ). One can decompose the marginal log-likelihood log p(y) as

log p(y) = ℱ(q(Θ)) + 𝒟(q(Θ)) (12)

with

ℱ(q(Θ)) = ∫ dΘ q(Θ) log [p(y, Θ) / q(Θ)],  𝒟(q(Θ)) = ∫ dΘ q(Θ) log [q(Θ) / p(Θ | y)], (13)

where ℱ(q(Θ)) is called the free energy and can be seen as a measure of the quality of the model. As 𝒟(q(Θ)) ≥ 0, the free energy is a lower bound on log p(y), with equality if and only if q(Θ) = p(Θ | y). So, inferring the density q(Θ) of the parameters corresponds to maximizing ℱ with respect to the free distribution q(Θ). In practice, the VB approach consists in maximizing the free energy iteratively with respect to the approximate distribution q(z) of the latent variables and with respect to the approximate distributions of the parameters of the model q(w), q(λ), q(α), and q(π).

The variational distributions and the pseudocode of the VB-MCBR algorithm are provided in Appendix A. This algorithm maximizes the free energy ℱ. In practice, iterations are performed until convergence to a local maximum of ℱ. With an ARD prior (i.e., K = p and fixing z_j = j), we retrieve the same formulas as the ones found for variational ARD [18].

2.2.2. Estimation by Gibbs Sampling: Gibbs-MCBR

We develop here an estimation of the MCBR model using Gibbs sampling [26]. The resulting algorithm is called Gibbs-MCBR; the pseudocode of the algorithm and the candidate distributions are provided in Appendix B. The Gibbs sampling algorithm generates a sequence of samples from the joint distribution in order to approximate the marginal distributions. The main idea is to use conditional distributions that are known and easy to sample from, instead of directly computing the marginals from the joint law by integration (the joint law may not be known or may be hard to sample from). The sampling is done iteratively over the different parameters, and the final estimate of the parameters is obtained by averaging their values across iterations (the first iterations are usually discarded; this is called the burn-in).
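For concreteness, here is a minimal numpy sketch of one possible implementation of the Gibbs-MCBR updates, following the candidate distributions of Appendix B (ours, not the authors' C implementation; the noise update uses the current weight sample rather than the posterior mean, and no intercept is included):

```python
# Minimal sketch of the Gibbs-MCBR updates, following Appendix B (simplified, no intercept).
import numpy as np

def gibbs_mcbr(X, y, K=9, n_iter=2000, burn_in=1000, seed=0):
    rng = np.random.RandomState(seed)
    n, p = X.shape
    # Class-specific hyperparameters as in Section 2.2.3: lambda_{1,k} = 10^{k-4}, lambda_{2,k} = 10^{-2}.
    lam1 = 10.0 ** (np.arange(1, K + 1) - 4.0)
    lam2 = 1e-2 * np.ones(K)
    a1_0, a2_0 = 1.0, 1.0
    eta = np.ones(K)

    z = rng.randint(K, size=p)          # random initialization of the assignments
    lam = lam1 / lam2                   # start precisions at their prior means
    alpha = 1.0
    pi = np.ones(K) / K
    w_sum = np.zeros(p)
    XtX, Xty = X.T @ X, X.T @ y

    for it in range(n_iter):
        # (B.1)-(B.2): w | rest ~ N(mu, Sigma)
        Sigma = np.linalg.inv(alpha * XtX + np.diag(lam[z]))
        mu = alpha * Sigma @ Xty
        w = rng.multivariate_normal(mu, Sigma)
        # (B.3)-(B.4): lambda_k | rest ~ Gamma(shape, rate)
        for k in range(K):
            in_k = (z == k)
            shape = lam1[k] + 0.5 * in_k.sum()
            rate = lam2[k] + 0.5 * np.sum(w[in_k] ** 2)
            lam[k] = rng.gamma(shape, 1.0 / rate)
        # (B.5)-(B.6): alpha | rest ~ Gamma (residual taken at the current weight sample)
        resid = y - X @ w
        alpha = rng.gamma(a1_0 + 0.5 * n, 1.0 / (a2_0 + 0.5 * resid @ resid))
        # (B.7): z_j | rest ~ Mult(softmax(rho_j))
        rho = -0.5 * np.outer(w ** 2, lam) + np.log(pi) + 0.5 * np.log(lam)
        prob = np.exp(rho - rho.max(axis=1, keepdims=True))
        prob /= prob.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=prob[j]) for j in range(p)])
        # (B.8): pi | rest ~ Dirichlet(eta + class counts)
        pi = rng.dirichlet(eta + np.bincount(z, minlength=K))
        if it >= burn_in:
            w_sum += w
    return w_sum / (n_iter - burn_in)
```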

2.2.3. Initialization and Priors on the Model Parameters

Our model needs few hyperparameters; we choose here to use slightly informative and class-specific hyperparameters in order to reflect a wide range of possible behaviors for the weight distribution. This choice of priors is equivalent to setting heavy-tailed centered Student's t-distributions, with variances at different scales, as priors on the weight parameters. We set K = 9, with weakly informative priors λ_{1,k} = 10^{k−4}, k ∈ [1, …, K] and λ_{2,k} = 10⁻², k ∈ [1, …, K]. Moreover, we set α_1 = α_2 = 1. Starting with a given number of classes and letting the model automatically prune the classes can be seen as a means of avoiding costly model selection procedures. The choice of class-specific priors is also useful to avoid label-switching issues and thus speeds up convergence. Crucially, the priors used here can be used in any regression problem, provided that the target data is approximately scaled to the range of values used in our experiments; in that sense, the present choice of priors can be considered as universal. We also randomly initialize q(z) for VB-MCBR (or z for Gibbs-MCBR).
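In numpy terms, this prior specification amounts to the following sketch; the prior mean precision of class k is λ_{1,k}/λ_{2,k} = 10^{k−2}, so the K classes span nearly unregularized to very strongly shrunk weights.

```python
import numpy as np

K = 9
lam1 = 10.0 ** (np.arange(1, K + 1) - 4.0)   # lambda_{1,k} = 10^{k-4}, k = 1..K
lam2 = 1e-2 * np.ones(K)                     # lambda_{2,k} = 10^{-2}
alpha1 = alpha2 = 1.0                        # noise precision hyperparameters
print(lam1 / lam2)                           # prior mean precisions: 1e-1 ... 1e7
```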

2.3. Validation and Model Evaluation

2.3.1. Performance Evaluation

Our method is evaluated with a cross-validation procedure that splits the available data into training and validation sets. In the following, (X_l, y_l) is a learning set, (X_t, y_t) is a test set, and ŷ_t = f(X_t, ŵ, b̂) refers to the predicted target, where ŵ is estimated from the training set. The performance of the different models is evaluated using ζ, the ratio of explained variance:

ζ(y_t, ŷ_t) = [var(y_t) − var(y_t − ŷ_t)] / var(y_t). (14)

This is the amount of variability in the response that can be explained by the model (perfect prediction yields ζ = 1, while ζ < 0 if prediction is worse than chance).
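In code, this criterion is a one-liner (a minimal sketch):

```python
import numpy as np

def explained_variance_ratio(y_true, y_pred):
    """Ratio of explained variance, as in (14): 1 for perfect prediction,
    negative when the prediction is worse than chance."""
    return (np.var(y_true) - np.var(y_true - y_pred)) / np.var(y_true)
```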

2.3.2. Competing Methods

In our experiments, the proposed algorithms are compared to different state-of-the-art regularization methods.

  1. Elastic Net Regression [27], which requires setting two parameters λ_1 and λ_2. In our analyses, a cross-validation procedure within the training set is used to optimize these parameters. Here, we use λ_1 ∈ {0.2λ̃, 0.1λ̃, 0.05λ̃, 0.01λ̃}, where λ̃ = ‖Xᵀy‖, and λ_2 ∈ {0.1, 0.5, 1, 10, 100}. Note that λ_1 and λ_2 parametrize heterogeneous norms.

  2. Support Vector Regression (SVR) with a linear kernel [28], which is the reference method in neuroimaging. The C parameter is optimized by cross-validation in the range 10⁻³ to 10¹, in multiplicative steps of 10.

  3. Bayesian Ridge Regression (BRR), which is equivalent to MCBR with K = 1 and λ_1 = λ_2 = α_1 = α_2 = 10⁻⁶, that is, weakly informative priors.

  4. Automatic Relevance Determination (ARD), which is equivalent to MCBR with K = p and λ_1 = λ_2 = α_1 = α_2 = 10⁻⁶, that is, weakly informative priors.

All these methods are used after an Anova-based feature selection as this maximizes their performance. Indeed, irrelevant features and redundant information can decrease the accuracy of a predictor [29]. The optimal number of voxels is selected within the range {50,100,250,500}, using a nested cross-validation within the training set. We do not directly select a threshold on P value or cluster size, but rather a predefined number of features. The estimation of the parameters of the learning function is also performed using a nested cross-validation within the training set, to ensure a correct validation and an unbiased comparison of the methods. All methods are developed in C and used in Python. The implementation of elastic net is based on coordinate descent [30], while SVR is based on LibSVM [31]. Methods are used from Python via the Scikit-learn open source package [32].
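As an illustration of this evaluation setup (not the authors' exact code, and using current scikit-learn class names), the Anova-based selection and the nested parameter search can be written as a pipeline:

```python
# Sketch of the evaluation setup: Anova feature selection + SVR, with the number of
# selected voxels and the C parameter optimized by nested cross-validation.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, cross_val_score

pipe = Pipeline([
    ("anova", SelectKBest(score_func=f_regression)),
    ("svr", SVR(kernel="linear")),
])
param_grid = {
    "anova__k": [50, 100, 250, 500],
    "svr__C": [1e-3, 1e-2, 1e-1, 1e0, 1e1],
}
model = GridSearchCV(pipe, param_grid, cv=5)   # inner loop: parameter selection

# Outer loop: unbiased estimate of prediction accuracy (placeholder data below).
rng = np.random.RandomState(0)
X, y = rng.randn(60, 1000), rng.randn(60)
scores = cross_val_score(model, X, y, cv=5, scoring="explained_variance")
print(scores.mean())
```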

For VB-MCBR and Gibbs-MCBR, in order to avoid a costly internal cross-validation, we select 500 voxels, and this selection is performed on the training set. The number of iterations is fixed to 5000 (burn-in of 4000 iterations) for Gibbs-MCBR and 500 for VB-MCBR. Preliminary results on both simulated and real data showed that these values are sufficient for an accurate inference of the model. As explained previously, we set K = 9, with weakly informative priors λ_{1,k} = 10^{k−4}, k ∈ [1, …, K] and λ_{2,k} = 10⁻², k ∈ [1, …, K]. Moreover, we set α_1 = α_2 = 1, and we randomly initialize q(z) for VB-MCBR (or z for Gibbs-MCBR).

3. Experiments and Results

3.1. Experiments on Simulated Data

We now evaluate and illustrate MCBR on two different sets of simulated data.

3.1.1. Details on Simulated Regression Data

We first test MCBR on a simulated data set, designed for the study of ill-posed regression problems, that is, n ≪ p. Data are simulated as follows:

X ~ 𝒩(0, 1)  with  ϵ ~ 𝒩(0, 1),  y = 2(X_1 + X_2 − X_3 − X_4) + 0.5(X_5 + X_6 − X_7 − X_8) + ϵ. (15)

We have p = 200 features, n l = 50 images for the training set, and n t = 50 images for the test set. We compare MCBR to the reference methods, but we do not use feature selection, as the number of features is not very high.
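A minimal sketch of this simulation, following (15):

```python
import numpy as np

rng = np.random.RandomState(42)
p, n_train, n_test = 200, 50, 50
X = rng.randn(n_train + n_test, p)
eps = rng.randn(n_train + n_test)
y = (2.0 * (X[:, 0] + X[:, 1] - X[:, 2] - X[:, 3])
     + 0.5 * (X[:, 4] + X[:, 5] - X[:, 6] - X[:, 7]) + eps)
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]
```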

3.1.2. Results on Simulated Regression Data

We average the results of 15 different trials, and the average explained variance is shown in Table 1. Gibbs-MCBR outperforms the other approaches, yielding higher prediction accuracy than the reference elastic net and ARD methods. Its prediction accuracy is also more stable than that of the other methods. VB-MCBR falls into a local maximum of ℱ and does not yield an accurate prediction. BRR has a low prediction accuracy compared to other methods such as ARD. Indeed, it cannot finely adapt the weights of the relevant features, as these features are regularized in the same way as the irrelevant ones. SVR also has low accuracy, due to the fact that we do not perform any feature selection here. Thus, SVR suffers from the curse of dimensionality, unlike methods such as ARD or elastic net, which perform feature selection and model estimation jointly.

Table 1.

Simulated regression data. Explained variance ζ for different methods (average of 15 different trials). The P-values are computed using a paired t-test.

Methods Mean ζ Std ζ P-value to Gibbs-MCBR
SVR 0.11 0.1 .0**
Elastic net 0.77 0.11 .0004**
BRR 0.19 0.14 .0**
ARD 0.79 0.06 .0**
Gibbs-MCBR 0.89 0.04
VB-MCBR 0.04 0.05 .0**

**Level of significance of the P-values between 0.01 and 0.05.

In Figure 2, we represent the probability density function of the distributions of the weights obtained with BRR (a), Gibbs-MCBR (b), and ARD (c). With BRR, the weights are grouped in a monomodal density. ARD is far more adaptive and sets many weights to zero. The Gibbs-MCBR algorithm creates a multimodal distribution: many weights are highly regularized (pink distributions), while informative features are allowed to have higher weights (blue distributions).

Figure 2.

Results on simulated regression data. Probability density function of the weight distributions obtained with BRR (a), Gibbs-MCBR (b), and ARD (c). Each color represents a different component of the mixture model.

With MCBR, weights are clustered into different groups, depending on their predictive power, which is interesting in applications such as fMRI inverse inference, as it can yield more interpretable models. Indeed, the class to which the features with the highest weights ({X_1, X_2, X_3, X_4}) belong is small (average size of 6 features) but has a high purity (percentage of relevant features in the class) of 74%.

3.1.3. Comparison between VB-MCBR and Gibbs-MCBR

We now look at the values of w 1 and w 2 for the different steps of the two algorithms (see Figure 3). We can see that VB-MCBR (b) quickly falls into a local maximum, while Gibbs-MCBR (a) visits the space and reaches the region of the correct set of parameters (red dot). VB-MCBR is not optimal in this case.

Figure 3.

Results on simulated regression data. Weights of the first two features found for the different steps of Gibbs-MCBR (a) and VB-MCBR (b). The red dot represents the ground truth of both weights, and the green dot represents the final state found by the two algorithms. VB-MCBR is stuck in a local maximum, and Gibbs-MCBR finds the correct weights.

3.2. Simulated Neuroimaging Data

3.2.1. Details on Simulated Neuroimaging Data

The simulated data set X consists of n = 100 images (size 12 × 12 × 12 voxels) with a set of four square regions of interest (ROI) (size 2 × 2 × 2). We call ℛ the support of the ROIs (i.e., the 32 resulting voxels of interest). Each of the four ROIs has a fixed weight in {−0.5, 0.5, −0.5, 0.5}. We call w_{i,j,k} the weight of the (i, j, k) voxel. The resulting images are smoothed with a Gaussian kernel with a standard deviation of 2 voxels, to mimic the correlation structure observed in real fMRI data. To simulate the spatial variability between images (intersubject variability, movement artifacts in intrasubject variability), we define a new support of the ROIs, called ℛ̃, such that, for each lth image, 50% (randomly chosen) of the weights w are set to zero. Thus, we have ℛ̃ ⊂ ℛ. We simulate the target y for the lth image as

y_l = ∑_{(i,j,k) ∈ ℛ̃} w_{i,j,k} X_{i,j,k,l} + ϵ_l (16)

with the signal in the (i, j, k) voxel of the lth image simulated as

X_{i,j,k,l} ~ 𝒩(0, 1), (17)

and ϵ l ~ 𝒩(0, γ) is a Gaussian noise with standard deviation γ > 0. We choose γ in order to have a signal-to-noise ratio of 5 dB.
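A sketch of this simulation in numpy/scipy (ours; the positions of the four ROIs are not specified in the text and are chosen arbitrarily here):

```python
# Sketch of the simulated neuroimaging data: 12x12x12 images, four 2x2x2 ROIs with
# weights (-0.5, 0.5, -0.5, 0.5), Gaussian smoothing (sigma = 2 voxels), and 50% of
# the ROI weights dropped at random for each image. ROI positions are assumed.
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.RandomState(0)
n, shape = 100, (12, 12, 12)

w = np.zeros(shape)
corners = [(2, 2, 2), (2, 8, 8), (8, 2, 8), (8, 8, 2)]     # assumed ROI corners
for (i, j, k), value in zip(corners, [-0.5, 0.5, -0.5, 0.5]):
    w[i:i + 2, j:j + 2, k:k + 2] = value
support = w != 0                                           # R: the 32 voxels of interest

X = np.array([gaussian_filter(rng.randn(*shape), sigma=2) for _ in range(n)])
y = np.empty(n)
for l in range(n):
    keep = support & (rng.rand(*shape) < 0.5)              # per-image support R-tilde
    y[l] = np.sum(w[keep] * X[l][keep])
noise_std = y.std() / 10 ** (5.0 / 20.0)                   # 5 dB power SNR -> amplitude ratio
y += noise_std * rng.randn(n)
```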

3.2.2. Results on Simulated Neuroimaging Data

We compare VB-MCBR and Gibbs-MCBR with the different competing algorithms. The resulting images of weights are given in Figure 4, with the true weights (a) and the resulting Anova F-scores (b). The reference methods can detect the truly informative regions (ROIs), but elastic net (f) and ARD (h) retrieve only part of the support of the weights. Moreover, elastic net yields an overly sparse solution. BRR (g) also retrieves the ROIs but does not yield a sparse solution, as all the features are regularized in the same way. We note that the weights in the feature space estimated by SVR (e) are nonzero everywhere and do not outline the support of the ground truth. VB-MCBR (c) converges to a local maximum similar to the solution found by BRR (g); that is, it creates only one nonempty class and thus regularizes all the features similarly. We can thus clearly see that, in this model, the variational Bayes approach is very sensitive to the initialization and can fall into nonoptimal local maxima when the support of the weights is very sparse. Finally, Gibbs-MCBR (d) retrieves most of the true support of the weights by performing an adapted regularization.

Figure 4.

Two-dimensional slices of the three-dimensional volume of simulated data. Weights found by different methods, the true target (a) and F-score (b). The Gibbs-MCBR method (d) retrieves almost the whole spatial support for the weights. The sparsity-promoting reference methods, elastic net (f) and ARD (h), find an overly sparse support of the weights. VB-MCBR (c) converges to a local maximum similar to BRR (g) and thus does not yield a sparse solution. SVR (e) yields smooth maps that are not similar to the ground truth.

3.3. Experiments and Results on Real fMRI Data

In this section, we assess the performance of MCBR in an experiment on the mental representation of object size, where the aim is to predict the size of an object seen by the subject during the experiment, in both intrasubject and intersubject cases. The size (or scale parameter) of the object will be the target variable y.

3.3.1. Details on Real Data

We apply the different methods on a real fMRI dataset related to an experiment studying the representation of objects, on ten subjects, as detailed in [33]. During this experiment, ten healthy volunteers viewed objects of 4 shapes in 3 different sizes (yielding 12 different experimental conditions), with 4 repetitions of each stimulus in each of the 6 sessions. We pooled data from the 4 repetitions, resulting in a total of n = 72 images per subject (one image of each stimulus per session). Functional images were acquired on a 3-T MR system with an eight-channel head coil (Siemens Trio, Erlangen, Germany) as T2*-weighted echo-planar image (EPI) volumes. Twenty transverse slices were obtained with a repetition time of 2 s (echo time: 30 ms; flip angle: 70°; 2 × 2 × 2-mm voxels; 0.5 mm gap). Realignment, normalization to MNI space, and general linear model (GLM) fit were performed with the SPM5 software (http://www.fil.ion.ucl.ac.uk/spm/software/spm5/). The normalization is the conventional method of SPM (implying affine and nonlinear transformations) and not the one using unified segmentation. The normalization parameters are estimated on the basis of a whole-head EPI acquired in addition and are then applied to the partial EPI volumes. The data are not smoothed. In the GLM, the effect of each of the 12 stimuli convolved with a standard hemodynamic response function was modeled separately, while accounting for serial autocorrelation with an AR(1) model and removing low-frequency drift terms using a high-pass filter with a cutoff of 128 s. The GLM is fitted separately in each session for each subject, and we used in the present work the resulting session-wise parameter estimate images (the β-maps are used as rows of X). Data were pooled across the four different object shapes for each of the three sizes, as we are interested in finding discriminative information on sizes. This reduces to a regression problem, in which our goal is to predict a simple scalar factor (the size of an object). All the analyses are performed without any prior selection of regions of interest and use the whole acquired volume.

Intrasubject Regression Analysis —

First, we perform an intrasubject regression analysis. Each subject is evaluated independently, in a 12-fold cross-validation. The dimensions of the real data set for one subject are p ~ 7 × 10⁴ and n = 72 (divided into 3 different sizes, 24 images per size). We evaluate the performance of the method by a leave-one-condition-out cross-validation (i.e., leaving 6 images out), and in doing so the GLM is performed separately for the training and test sets. The parameters of the reference methods are optimized with a nested leave-one-condition-out cross-validation within the training set, in the ranges given before.

Intersubject Regression Analysis —

Additionally, we perform an intersubject regression analysis on the sizes. The intersubject analysis relies on subject-specific fixed-effect activations: that is, for each condition, the 6 activation maps corresponding to the 6 sessions are averaged together. This yields a total of 12 images per subject, one for each experimental condition. The dimensions of the real data set are p ~ 7 × 10⁴ and n = 120 (divided into 3 different sizes). We evaluate the performance of the method by cross-validation (leave-one-subject-out). The parameters of the reference methods are optimized with a nested leave-one-subject-out cross-validation within the training set, in the ranges given before.
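The intersubject scheme is a grouped cross-validation in which the group is the subject; a sketch with scikit-learn (illustrative, with placeholder data standing in for the activation maps):

```python
# Leave-one-subject-out cross-validation sketch (placeholder data, not the study's maps).
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVR

rng = np.random.RandomState(0)
n_subjects, images_per_subject, p = 10, 12, 500
X = rng.randn(n_subjects * images_per_subject, p)           # placeholder activation maps
y = np.tile(np.repeat([1.0, 2.0, 3.0], 4), n_subjects)      # 3 sizes, 4 conditions each
subjects = np.repeat(np.arange(n_subjects), images_per_subject)

logo = LeaveOneGroupOut()                                    # leave one subject out
scores = cross_val_score(SVR(kernel="linear"), X, y, groups=subjects, cv=logo,
                         scoring="explained_variance")
print(scores.mean())
```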

3.3.2. Results on Real Data

Intrasubject Regression Analysis —

The results obtained by the different methods are given in Table 2. The P-values are computed using a paired t-test across subjects. VB-MCBR outperforms the other methods. As on the simulated data, VB-MCBR still falls into a local maximum similar to Bayesian ridge regression, which performs well in this experiment. Moreover, both Gibbs-MCBR and VB-MCBR are more stable than the reference methods.

Table 2.

Intrasubject analysis. Explained variance ζ for the different methods. The P-values are computed using a paired t-test. VB-MCBR yields the best prediction accuracy, while being more stable than the reference methods.

Methods Mean ζ Std ζ P-val/Gibbs-MCBR
SVR 0.82 0.07 .0006***
Elastic net 0.9 0.02 .001***
BRR 0.92 0.02 .0358**
ARD 0.89 0.03 .0015***
Gibbs-MCBR 0.93 0.01
VB-MCBR 0.94 0.01 .99

**Level of significance of the P-values between 0.01 and 0.05.

***Level of significance of the P-values below 0.01.

Intersubject Regression Analysis —

The results obtained with the different methods are given in Table 3. As in the intrasubject analysis, both MCBR approaches outperform the reference methods, SVR, BRR, and ARD. However, the prediction accuracy is similar to that of elastic net. In this case, Gibbs-MCBR performs slightly better than VB-MCBR, but the difference is not significant.

One major asset of MCBR (and more particularly of Gibbs-MCBR, as VB-MCBR often falls into a one-class local maximum) is that it creates a clustering of the features, based on their relevance in the predictive model. This clustering can be accessed through the variable z, which drives the regularization applied to the different features. In Figure 5, we give the histogram of the weights of Gibbs-MCBR for the intersubject analysis. We keep the weights and the values of z of the last iteration; the different classes are represented as dots of different colors and are superimposed on the histogram. We can notice that the pink distribution represented at the bottom of the histogram corresponds to relevant features. This cluster is very small (19 voxels), compared to the two blue classes represented at the top of the histogram, which contain many voxels (746 voxels) that are highly regularized, as they are noninformative.

The maps of weights found by the different methods are detailed in Figure 6. The methods are combined with an Anova-based univariate feature selection (2500 voxels selected, in order to have a good support of the weights). Like elastic net, Gibbs-MCBR yields a sparse solution but extracts a few more voxels. The map found by elastic net is not easy to interpret, with very few informative voxels scattered across the whole occipital cortex. The map found by SVR is not sparse in the feature space and is thus difficult to interpret, as the spatial layout of the neural code is not clearly extracted. VB-MCBR does not yield a sparse map either, all the features having nonzero weights.

Table 3.

Intersubject analysis. Explained variance ζ for the different methods. The P-values are computed using a paired t-test. MCBR yields higher prediction accuracy than the two other Bayesian regularization schemes, BRR and ARD.

Methods Mean ζ Std ζ P-val/Gibbs-MCBR
SVR 0.77 0.11 .14
Elastic net 0.78 0.1 .75
BRR 0.72 0.1 .01**
ARD 0.52 0.33 .02*
Gibbs-MCBR 0.79 0.1
VB-MCBR 0.78 0.1 0.4

*Level of significance of the P-values.

**Level of significance of the P-values between 0.01 and 0.05.

Figure 5.

Intersubject analysis. Histogram of the weights found by Gibbs-MCBR and corresponding z values (each color of dots represents a different class), for the intersubject analyses. We can see that Gibbs-MCBR creates clusters of informative and noninformative voxels and that the different classes are regularized differently, according to the relevance of the features in each of them.

Figure 6.

Intersubject analysis. Maps of weights found by the different methods on the 2500 most relevant features by Anova. The map found by elastic net is difficult to interpret as the very few relevant features are scattered within the whole brain. SVR and VB-MCBR do not yield a sparse solution. Gibbs-MCBR, by performing an adaptive regularization, draws a compromise between the other approaches and yields a sparse solution, but also extracts small groups of relevant features.

4. Discussion

It is well known that, in high-dimensional problems, regularization of the feature loadings significantly increases the generalization ability of the predictive model. However, this regularization has to be adapted to each particular dataset. In place of costly cross-validation procedures, we cast regularization in a Bayesian framework and treat the regularization weights as hyperparameters. The proposed approach yields an adaptive and efficient regularization and can be seen as a compromise between a global regularization (Bayesian Ridge Regression), which does not take into account the sparse or focal distribution of the information, and Automatic Relevance Determination. Additionally, MCBR creates a clustering of the features based on their relevance and thus explicitly extracts groups of informative features.

Moreover, MCBR can cope with the different issues of ARD. ARD is subject to an underfitting in the hyperparameter space that corresponds to an underfitting in model selection (i.e., on the features to be pruned) [19]. Indeed, as ARD is estimated by maximizing the evidence, models with fewer selected features are preferred, as the integration is done over fewer dimensions, and thus the evidence is higher. ARD will choose the sparsest model among models with similar accuracy. A contrario, MCBR requires far fewer hyperparameters (2 × K, with K ≪ p) and suffers less from this issue, as the sparsity of the model is defined by groups. Moreover, a full Bayesian framework for estimating ARD requires setting priors on the hyperparameters (e.g., α_1 and α_2), and it may be sensitive to the specific choice of these hyperparameters. A solution is to use an internal cross-validation for optimizing these parameters, but this approach can be computationally expensive. In the case of MCBR, the distributions of the hyperparameters are bound to a class and not to each feature. Thus, the proposed approach is less sensitive to the choice of the hyperparameters: the choice of good hyperparameters for the features is dealt with at the class level.

On simulated data, our approach performs better than other classical methods such as SVR, BRR, ARD, and elastic net and yields a more stable prediction accuracy. Moreover, by adapting the regularization to different groups of voxels, MCBR retrieves the true support of the weights and recovers a sparse solution. Results on real data show that MCBR yields more accurate predictions than other regularization methods. As it yields a less sparse solution than elastic net, it gives access to more plausible loading maps, which are necessary for understanding the spatial organization of brain activity, that is, retrieving the spatial layout of the neural coding. On real fMRI data, the explicit clustering of Gibbs-MCBR is also an interesting aspect of the model, as it can extract a few groups of relevant features from many voxels.

In some experiments, the variational Bayes algorithm yields less accurate predictions than the Gibbs sampling approach, which can be explained by the difficulty of initializing the different variables (especially z) when the support of the weights is overly sparse. Moreover, the VB-MCBR algorithm relies on a variational Bayes approach, which may not be optimal, due to strong approximations in the model inference. A contrario, Gibbs-MCBR is more time-consuming but yields a better model inference. Finally, the variability in the results may be explained by the difficulty of estimating the model (optimality is not ensured).

The question of model selection (i.e., the number of classes K) has not been addressed in this paper. One can use the free energy ℱ in order to select the best model, but due to the instability of VB-MCBR, this approach does not seem promising. A more interesting method is the one detailed in [34], which can be used with the Gibbs sampling algorithm. Here, model selection is performed implicitly by emptying classes that do not fit the data well. In that respect, the choice of heterogeneous priors for the different classes is crucial: replacing our priors with class-independent priors (i.e., λ_{1,k} = 10⁻², k ∈ [1, …, K]) in the intersubject analysis on size prediction leads Gibbs-MCBR to a local maximum similar to VB-MCBR.

Finally, this model is not restricted to regression and can also be used for classification, within a probit or logit model [35, 36]. The proposed model may thus be used for diagnosis in medical imaging, for the prediction of both continuous and discrete variables.

5. Conclusion

In this paper, we have proposed a model for adaptive regression, called MCBR. The proposed method integrates, in the same Bayesian framework, BRR and ARD and performs a different regularization for relevant and irrelevant features. It can tune the regularization to the different levels of sparsity encountered in fMRI data analysis, and it yields interpretable information for fMRI inverse inference, namely, the z variable (latent class variable). Experiments on both simulated and real data show that our approach is well suited for neuroimaging, as it yields accurate and stable predictions compared to state-of-the-art methods.

Acknowledgment

The authors acknowledge support from the ANR Grant ViMAGINE ANR-08-BLAN-0250-02.

Appendices

A. VB-MCBR Algorithm

The variational Bayes approach yields the following variational distributions:

  • (i) q(w) ~ 𝒩(w | μ, Σ) with
    Ā = diag(l̄_1, …, l̄_p)  with  l̄_j = ∑_{k=1}^{K} q(z_j = k) l_{1,k}/l_{2,k},  ∀j ∈ {1, …, p}, (A.1)
    Σ = ((a_1/a_2) XᵀX + Ā)⁻¹, (A.2)
    μ = (a_1/a_2) Σ Xᵀy; (A.3)
  • (ii) q(λ_k) ~ Γ(l_{1,k}, l_{2,k}) with
    l_{1,k} = λ_{1,k} + (1/2) ∑_{j=1}^{p} q(z_j = k), (A.4)
    l_{2,k} = λ_{2,k} + (1/2) ∑_{j=1}^{p} (μ_j² + Σ_{jj}) q(z_j = k); (A.5)
  • (iii) q(α) ~ Γ(a_1, a_2) with
    a_1 = α_1 + n/2, (A.6)
    a_2 = α_2 + (1/2)(y − Xμ)ᵀ(y − Xμ) + (1/2) Tr(Σ XᵀX); (A.7)
  • (iv) q(z_j = k) ∝ exp(ρ_{jk}) with
    ρ_{jk} = −(1/2)(μ_j² + Σ_{jj}) l_{1,k}/l_{2,k} + ln(π_k) + (1/2)(Ψ(l_{1,k}) − log(l_{2,k})), (A.8)
    π_k = exp{Ψ(d_k) − Ψ(∑_{k'=1}^{K} d_{k'})}, (A.9)
    d_k = η_k + ∑_{j=1}^{p} q(z_j = k), (A.10)
    where Ψ is the digamma function Ψ(x) = Γ′(x)/Γ(x). The VB-MCBR algorithm is provided in pseudocode in Algorithm 1.

Algorithm 1.

VB-MCBR algorithm.
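For reference, a direct numpy transcription of the updates (A.1)–(A.10) might look as follows (a sketch, not the authors' implementation; the free energy ℱ is not monitored here):

```python
# Sketch of the VB-MCBR coordinate updates (A.1)-(A.10); no free-energy monitoring.
import numpy as np
from scipy.special import digamma

def vb_mcbr(X, y, K=9, n_iter=500, seed=0):
    rng = np.random.RandomState(seed)
    n, p = X.shape
    lam1_0 = 10.0 ** (np.arange(1, K + 1) - 4.0)   # lambda_{1,k}
    lam2_0 = 1e-2 * np.ones(K)                     # lambda_{2,k}
    alpha1_0 = alpha2_0 = 1.0
    eta = np.ones(K)

    q_z = rng.dirichlet(np.ones(K), size=p)        # random initialization of q(z)
    l1, l2 = lam1_0.copy(), lam2_0.copy()
    a1, a2 = alpha1_0 + 0.5 * n, alpha2_0          # a1 is constant, cf. (A.6)
    log_pi = np.log(np.ones(K) / K)
    XtX, Xty = X.T @ X, X.T @ y

    for _ in range(n_iter):
        # (A.1)-(A.3): q(w) = N(mu, Sigma)
        A_bar = q_z @ (l1 / l2)
        Sigma = np.linalg.inv((a1 / a2) * XtX + np.diag(A_bar))
        mu = (a1 / a2) * Sigma @ Xty
        m2 = mu ** 2 + np.diag(Sigma)              # E[w_j^2]
        # (A.4)-(A.5): q(lambda_k)
        l1 = lam1_0 + 0.5 * q_z.sum(axis=0)
        l2 = lam2_0 + 0.5 * (q_z * m2[:, None]).sum(axis=0)
        # (A.7): q(alpha)
        resid = y - X @ mu
        a2 = alpha2_0 + 0.5 * resid @ resid + 0.5 * np.trace(Sigma @ XtX)
        # (A.8)-(A.10): q(z) and the Dirichlet weights
        rho = (-0.5 * np.outer(m2, l1 / l2) + log_pi
               + 0.5 * (digamma(l1) - np.log(l2)))
        q_z = np.exp(rho - rho.max(axis=1, keepdims=True))
        q_z /= q_z.sum(axis=1, keepdims=True)
        d = eta + q_z.sum(axis=0)
        log_pi = digamma(d) - digamma(d.sum())
    return mu, q_z
```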

B. Gibbs-MCBR Algorithm

With Θ = [w, λ, α, z, π], we have the following candidate distributions (i.e., the distributions used for the sampling of the different parameters):

  • (i) p(w | Θ − {w}) ∝ 𝒩(w | μ, Σ) with
    Σ = (α XᵀX + A)⁻¹  with  A = diag(λ_{z_1}, …, λ_{z_p}), (B.1)
    μ = α Σ Xᵀy; (B.2)
  • (ii) p(λ | Θ − {λ}) ∝ ∏_{k=1}^{K} Γ(λ_k | l_{1,k}, l_{2,k}) with
    l_{1,k} = λ_{1,k} + (1/2) ∑_{j=1}^{p} δ(z_j = k), (B.3)
    l_{2,k} = λ_{2,k} + (1/2) ∑_{j=1}^{p} δ(z_j = k) w_j²; (B.4)
  • (iii) p(α | Θ − {α}) ∝ Γ(a_1, a_2) with
    a_1 = α_1 + n/2, (B.5)
    a_2 = α_2 + (1/2)(y − Xμ)ᵀ(y − Xμ); (B.6)
  • (iv) p(z_j | Θ − {z}) ∝ Mult(exp ρ_{j,1}, …, exp ρ_{j,K}) with
    ρ_{jk} = −(1/2) w_j² λ_k + ln(π_k) + (1/2) log λ_k; (B.7)
  • (v) p(π_k | Θ − {π}) ∝ Dir(d_k) with
    d_k = η_k + ∑_{j=1}^{p} δ(z_j = k). (B.8)

The algorithm is provided in pseudocode in Algorithm 2.

Algorithm 2.

Gibbs-MCBR algorithm.

References

  • 1. Friston KJ, Holmes AP, Worsley KJ, Poline JP, Frith CD, Frackowiak RSJ. Statistical parametric maps in functional imaging: a general linear approach. Human Brain Mapping. 1994;2(4):189–210.
  • 2. Dehaene S, Le Clec'H G, Cohen L, Poline JB, van de Moortele PF, Le Bihan D. Inferring behavior from functional brain images. Nature Neuroscience. 1998;1(7):549–550. doi: 10.1038/2785.
  • 3. Cox DD, Savoy RL. Functional magnetic resonance imaging (fMRI) "brain reading": detecting and classifying distributed patterns of fMRI activity in human visual cortex. NeuroImage. 2003;19(2):261–270. doi: 10.1016/s1053-8119(03)00049-1.
  • 4. Dayan P, Abbott LF. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press; 2001.
  • 5. Haynes JD, Rees G. Predicting the stream of consciousness from activity in human visual cortex. Current Biology. 2005;15(14):1301–1307. doi: 10.1016/j.cub.2005.06.026.
  • 6. Mitchell TM, Hutchinson R, Niculescu RS, et al. Learning to decode cognitive states from brain images. Machine Learning. 2004;57(1-2):145–175.
  • 7. LaConte S, Strother S, Cherkassky V, Anderson J, Hu X. Support vector machines for temporal classification of block design fMRI data. NeuroImage. 2005;26(2):317–329. doi: 10.1016/j.neuroimage.2005.01.048.
  • 8. Mourão-Miranda J, Bokde ALW, Born C, Hampel H, Stetter M. Classifying brain states and determining the discriminating activation patterns: Support Vector Machine on functional MRI data. NeuroImage. 2005;28(4):980–995. doi: 10.1016/j.neuroimage.2005.06.070.
  • 9. Hanson SJ, Halchenko YO. Brain reading using full brain support vector machines for object recognition: there is no face identification area. Neural Computation. 2008;20(2):486–503. doi: 10.1162/neco.2007.09-06-340.
  • 10. Yamashita O, Sato MA, Yoshioka T, Tong F, Kamitani Y. Sparse estimation automatically selects voxels relevant for the decoding of fMRI activity patterns. NeuroImage. 2008;42(4):1414–1429. doi: 10.1016/j.neuroimage.2008.05.050.
  • 11. Ryali S, Supekar K, Abrams DA, Menon V. Sparse logistic regression for whole-brain classification of fMRI data. NeuroImage. 2010;51(2):752–764. doi: 10.1016/j.neuroimage.2010.02.040.
  • 12. De Martino F, Valente G, Staeren N, Ashburner J, Goebel R, Formisano E. Combining multivariate voxel selection and support vector machines for mapping and classification of fMRI spatial patterns. NeuroImage. 2008;43(1):44–58. doi: 10.1016/j.neuroimage.2008.06.037.
  • 13. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Machine Learning. 2002;46(1–3):389–422.
  • 14. Chu C, Ni Y, Tan G, Saunders CJ, Ashburner J. Kernel regression for fMRI pattern prediction. NeuroImage. 2011;56(2):662–673. doi: 10.1016/j.neuroimage.2010.03.058.
  • 15. Liu H, Palatucci M, Zhang J. Blockwise coordinate descent procedures for the multi-task Lasso, with applications to neural semantic basis discovery. In: Proceedings of the 26th International Conference on Machine Learning (ICML '09); June 2009; pp. 649–656.
  • 16. Carroll MK, Cecchi GA, Rish I, Garg R, Rao AR. Prediction and interpretation of distributed neural activity with sparse models. NeuroImage. 2009;44(1):112–122. doi: 10.1016/j.neuroimage.2008.08.020.
  • 17. Bishop CM. Pattern Recognition and Machine Learning. 1st edition. Berlin, Germany: Springer; 2007. (Information Science and Statistics).
  • 18. Tipping M. The Relevance Vector Machine. Morgan Kaufmann; 2000.
  • 19. Qi Y, Minka TP, Picard RW, Ghahramani Z. Predictive automatic relevance determination by expectation propagation. In: Proceedings of the 21st International Conference on Machine Learning (ICML '04); 2004; ACM Press.
  • 20. Wipf D, Nagarajan S. A new view of automatic relevance determination. In: Advances in Neural Information Processing Systems. Vol. 20. MIT Press; 2008. pp. 1625–1632.
  • 21. Ni Y, Chu C, Saunders CJ, Ashburner J. Kernel methods for fMRI pattern prediction. In: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '08); 2008; pp. 692–697.
  • 22. Uğurbil K, Toth L, Kim DS. How accurate is magnetic resonance imaging of brain function? Trends in Neurosciences. 2003;26(2):108–114. doi: 10.1016/S0166-2236(02)00039-5.
  • 23. Friston K, Chu C, Mourão-Miranda J, et al. Bayesian decoding of brain images. NeuroImage. 2008;39(1):181–205. doi: 10.1016/j.neuroimage.2007.08.013.
  • 24. Steck H, Jaakkola TS. On the Dirichlet prior and Bayesian regularization. Advances in Neural Information Processing Systems. 2002;15:697–704.
  • 25. George EI, McCulloch RE. Variable selection via Gibbs sampling. Journal of the American Statistical Association. 1993;88(423):881–889.
  • 26. Geman S, Geman D. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. Morgan Kaufmann; 1987.
  • 27. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society. Series B. 2005;67(2):301–320.
  • 28. Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20(3):273–297.
  • 29. Hughes G. On the mean accuracy of statistical pattern recognizers. IEEE Transactions on Information Theory. 1968;14(1):55–63.
  • 30. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33(1):1–22.
  • 31. Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  • 32. scikit-learn, version 0.2, 2010, http://scikit-learn.sourceforge.net/
  • 33. Eger E, Kell CA, Kleinschmidt A. Graded size sensitivity of object-exemplar-evoked activity patterns within human LOC subregions. Journal of Neurophysiology. 2008;100(4):2038–2047. doi: 10.1152/jn.90305.2008.
  • 34. Chib S, Jeliazkov I. Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association. 2001;96(453):270–281.
  • 35. Albert JH, Chib S. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association. 1993;88(422):669–679.
  • 36. McCulloch RE, Polson NG, Rossi PE. A Bayesian analysis of the multinomial probit model with fully identified parameters. Journal of Econometrics. 2000;99(1):173–193.
