Abstract
Recent advances in medical imaging technologies generate a high volume of imaging data. Classification of cognitive outcomes and disease status based on brain images is one of the most important tasks in neuroimaging studies. However, it poses great challenges to current classification methods due to the extremely high dimensionality and low signal-to-noise ratio of brain image data. In this article we propose a tensor boosting algorithm for classification based on neuroimaging data. The method is off-the-shelf, computationally simple, and amenable to various modalities of neuroimaging data. Applied to an EEG data set from an alcoholism study and an MRI data set from the ADHD-200 Global Competition, it shows significantly improved classification performance.
I. Introduction
Modern technology is producing an enormous amount of medical image data, such as electroencephalography (EEG), anatomical magnetic resonance imaging (MRI) and functional magnetic resonance imaging (fMRI). A common task in neuroimaging studies is to classify subjects into different cognitive outcomes and disease statuses based on brain images. There is much hope that the huge amount of information inherent in these images brings discriminant power to classification, although how this can be achieved remains an active research area. Medical imaging data often come in the form of multidimensional arrays, also known as tensors. This poses great challenges to traditional classifiers. For instance, a typical anatomical MRI image of size $128 \times 128 \times 128$ contains $128^3 = 2{,}097{,}152$ voxels. Both the computability and the performance of commonly used classifiers are compromised by this ultra-high dimensionality. More seriously, traditional classifiers take a vector of features as input, and vectorizing an array before applying a classification algorithm destroys the inherent spatial structure of the image, which possesses a wealth of information.
Typical classification methods in the current neuroimaging literature consist of three components: feature extraction, feature dimension reduction, and feature-based classification. A set of effective features is first extracted from the original image data; this step often requires substantial domain knowledge. Once the features have been defined and extracted, dimension reduction may be necessary to further reduce the dimensionality, by principal component analysis (PCA) [1]–[3], independent component analysis (ICA) [4], [5], or their variants [6]. The final summary features are then fed into commonly used classifiers, such as Fisher linear discriminant analysis (LDA) [1], [2], quadratic discriminant analysis (QDA) [6], [7], logistic regression [8], or projection pursuit [5].
Although these methods have been widely used in practice, a feature selection step needs to be conducted first. Combining feature selection and feature-based classification in a unified framework avoids the step of selecting the right set of features. In this paper, we propose a tensor LogitBoost method, which organically combines two methods. The first is boosting, which has been called the “best off-the-shelf classifier in the world” [9]. The second is the recently proposed tensor regression method [10], which handles tensor-valued data efficiently in the regression framework. One advantage of the proposed tensor LogitBoost method lies in its ability to keep the tensor structure during the classification stage. By respecting the tensor structure in the tensor regression step, tensor LogitBoost improves on the performance of regular LogitBoost. It is “off-the-shelf” and thus applies to various modalities of image data, which all come in the form of a matrix or tensor. It is also computationally simple, thanks to an efficient estimation algorithm for the tensor regression.
The rest of the paper is organized as follows. In Section II, we introduce some notation. In Section III, we briefly review the two essential components of our algorithm, boosting and tensor regression, and then present our new tensor LogitBoost method in detail; tensor PCA and the tuning strategy for tensor boosting are also discussed in this section. In Sections IV and V, tensor LogitBoost is applied to the analysis of an EEG data set and an MRI data set for further illustration. We conclude the paper in Section VI with a discussion of future extensions.
II. Notation
Given two matrices $A = [a_1, \ldots, a_n] \in \mathbb{R}^{m \times n}$ and $B = [b_1, \ldots, b_q] \in \mathbb{R}^{p \times q}$ with the same number of columns, $n = q$, the Khatri-Rao product is defined as the $mp \times n$ columnwise Kronecker product

$$A \odot B = [a_1 \otimes b_1, a_2 \otimes b_2, \ldots, a_n \otimes b_n].$$
The mode-$d$ matricization, $B_{(d)}$, maps a tensor $B \in \mathbb{R}^{p_1 \times \cdots \times p_D}$ to a matrix whose columns are the mode-$d$ fibers of $B$. More precisely, the $(i_1, \ldots, i_D)$ element of the tensor $B$ maps to the $(i_d, j)$ element of the matrix $B_{(d)}$, where $j = 1 + \sum_{d' \neq d} (i_{d'} - 1) \prod_{d'' < d',\, d'' \neq d} p_{d''}$. With $d = 1$, we observe that $\mathrm{vec}\,B$ is the same as vectorizing the mode-1 matricization $B_{(1)}$, where $\mathrm{vec}$ is the vectorization operator that stacks the columns of a matrix below one another into a vector.
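These two definitions translate directly into NumPy. The helper names `khatri_rao` and `unfold` below are our own (SciPy also ships `scipy.linalg.khatri_rao`); both helpers are reused in later sketches.

```python
import numpy as np

def khatri_rao(A, B):
    """Columnwise Kronecker product of A (m x n) and B (p x n) -> (mp x n)."""
    m, n = A.shape
    p, q = B.shape
    assert n == q, "A and B must have the same number of columns"
    # column k is kron(A[:, k], B[:, k])
    return np.einsum('ik,jk->ijk', A, B).reshape(m * p, n)

def unfold(T, d):
    """Mode-d matricization (0-based d): columns are the mode-d fibers of T."""
    # Moving axis d to the front and flattening in column-major (Fortran)
    # order reproduces the index map j = 1 + sum_{d' != d} (i_{d'} - 1) prod(...).
    return np.reshape(np.moveaxis(T, d, 0), (T.shape[d], -1), order='F')

# vec(T) coincides with vectorizing the mode-1 unfolding column by column
T = np.arange(24, dtype=float).reshape(2, 3, 4)
assert np.allclose(unfold(T, 0).reshape(-1, order='F'),
                   T.reshape(-1, order='F'))
```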
An outer product of $D$ vectors $b_d \in \mathbb{R}^{p_d}$, $d = 1, \ldots, D$, written $b_1 \circ b_2 \circ \cdots \circ b_D$, is a $p_1 \times \cdots \times p_D$ tensor with entries $(b_1 \circ \cdots \circ b_D)_{i_1 \cdots i_D} = \prod_{d=1}^{D} b_{d, i_d}$.
We say a tensor $B \in \mathbb{R}^{p_1 \times \cdots \times p_D}$ admits a rank-$R$ CANDECOMP/PARAFAC (CP) decomposition if

$$B = \sum_{r=1}^{R} b_1^{(r)} \circ b_2^{(r)} \circ \cdots \circ b_D^{(r)}, \tag{1}$$

which we abbreviate as

$$B = [\![ B_1, B_2, \ldots, B_D ]\!], \tag{2}$$

where $b_d^{(r)} \in \mathbb{R}^{p_d}$, $d = 1, \ldots, D$, $r = 1, \ldots, R$, are all column vectors and $B_d = [b_d^{(1)}, \ldots, b_d^{(R)}] \in \mathbb{R}^{p_d \times R}$, $d = 1, \ldots, D$.
The mode-$d$ matricization of a tensor $B$ admitting the CP decomposition (2) can be expressed as

$$B_{(d)} = B_d (B_D \odot \cdots \odot B_{d+1} \odot B_{d-1} \odot \cdots \odot B_1)^{\mathsf{T}}.$$

Also note that

$$\mathrm{vec}\,B = (B_D \odot \cdots \odot B_1) 1_R,$$

where $1_R$ is the vector of $R$ ones.
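Both identities can be checked numerically with the helpers above; the tensor sizes here are arbitrary.

```python
# Numerical check of the CP identities for a random rank-R tensor
rng = np.random.default_rng(0)
p1, p2, p3, R = 4, 5, 6, 3
B1 = rng.normal(size=(p1, R))
B2 = rng.normal(size=(p2, R))
B3 = rng.normal(size=(p3, R))

# B = sum_r b1_r o b2_r o b3_r
B = np.einsum('ir,jr,kr->ijk', B1, B2, B3)

# vec(B) = (B_3 ⊙ B_2 ⊙ B_1) 1_R
vecB = B.reshape(-1, order='F')                  # column-major vectorization
assert np.allclose(vecB, khatri_rao(B3, khatri_rao(B2, B1)) @ np.ones(R))

# B_(1) = B_1 (B_3 ⊙ B_2)^T
assert np.allclose(unfold(B, 0), B1 @ khatri_rao(B3, B2).T)
```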
III. Method
There are multiple versions of boosting methods developed for different application scenarios. The gradient boosting machine point of view of [11] is particularly productive and drives the derivation of many variants of the boosting algorithm. See [12] for a good review.
The LogitBoost algorithm can be treated as a “statistical” version of the well-known AdaBoost [13] because it learns the set of regression functions $\{f_l(x)\}_{l=1, \ldots, L}$ by minimizing the negative binomial log-likelihood instead of the exponential loss. Consider a two-class classification problem with vector-valued predictors $x_i \in \mathbb{R}^p$ and class labels $y_i \in \{0, 1\}$. The probability of $x$ being in class 1 is represented by

$$p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}},$$

where $F(x) = \sum_{l=1}^{L} f_l(x)$. The LogitBoost algorithm uses Newton steps to fit this model by maximizing the binomial log-likelihood of the data,

$$\ell = \sum_{i=1}^{n} \left[ y_i \log p(x_i) + (1 - y_i) \log\{1 - p(x_i)\} \right].$$
The details of the LogitBoost algorithm are presented in Algorithm 1. The $\nu$ in the update step of Algorithm 1 is the shrinkage parameter. Its natural value is 1, but a smaller value is often a better choice: it makes the algorithm slower, since more iterations are needed, but more stable, since the steps taken are smaller. An implementation safeguard is also necessary, enforcing thresholds on the weights $w_i$ and working responses $z_i$. In our setting, we follow the suggestion in [11] and use the thresholds $w_i = \max\{w_i, 10^{-10}\}$ and $z_i = \min\{\max\{z_i, -3\}, 3\}$.
Algorithm 1: LogitBoost algorithm for the two-class classification problem
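As a concrete illustration, here is a minimal Python sketch of this loop, consistent with the description above; the function names and the pluggable weak-learner interface `fit_weak_learner(Xs, z, w) -> f` are ours, not the authors'.

```python
import numpy as np

def logitboost(Xs, y, fit_weak_learner, L=100, nu=0.1):
    """Two-class LogitBoost (sketch). y in {0,1};
    fit_weak_learner(Xs, z, w) returns a callable f with f(Xs) -> real scores."""
    n = len(y)
    F = np.zeros(n)                              # additive model F(x), initialized at 0
    p = np.full(n, 0.5)                          # P(y = 1 | x), initialized at 1/2
    learners = []
    for _ in range(L):
        w = np.maximum(p * (1 - p), 1e-10)       # Newton weights, thresholded
        z = np.clip((y - p) / w, -3.0, 3.0)      # working response, clamped to [-3, 3]
        f = fit_weak_learner(Xs, z, w)           # weighted least squares base fit
        learners.append(f)
        F += nu * 0.5 * f(Xs)                    # shrunken half Newton step
        p = 1.0 / (1.0 + np.exp(-2.0 * F))       # p = e^F / (e^F + e^{-F})
    return learners

def logitboost_predict(learners, Xs, nu=0.1):
    F = sum(nu * 0.5 * f(Xs) for f in learners)
    return (F > 0).astype(int)                   # classify by the sign of F
```

With a plain linear weighted least squares base fit this reproduces regular LogitBoost; the tensor variant below only swaps in a different base fit.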
An essential ingredient of the LogitBoost algorithm is the weighted least squares solver. For image data, a naive approach is to vectorize the voxels and then apply a regular weighted least squares solver. Two pitfalls rule out this strategy. First, the number of voxels far exceeds the number of observations, so there is no unique solution. Second, vectorizing the voxels destroys the tensor structure of the images, which itself carries a wealth of information. Ideally the procedure should respect the tensor structure to retain as much spatial information as possible. Our numerical results show that respecting the tensor structure significantly improves the classification performance of the boosting algorithm.
The recent tensor regression model developed in [10] supplies a natural regression method for tensor-valued predictors. Consider the special case of the weighted least squares criterion

$$L(B) = \sum_{i=1}^{n} w_i \left( y_i - \langle \mathrm{vec}\,B, \mathrm{vec}\,X_i \rangle \right)^2,$$

where $y_i \in \mathbb{R}$ are scalar responses, $X_i \in \mathbb{R}^{p_1 \times \cdots \times p_D}$ are tensor-valued predictors, and $B \in \mathbb{R}^{p_1 \times \cdots \times p_D}$ is the tensor regression parameter. The limited sample size prevents estimation of the full tensor parameter $B$, which has the same size as $X_i$.
In the rank-$R$ tensor regression introduced in [10], we consider the criterion

$$L(B_1, \ldots, B_D) = \sum_{i=1}^{n} w_i \left( y_i - \langle (B_D \odot \cdots \odot B_1) 1_R, \mathrm{vec}\,X_i \rangle \right)^2, \tag{4}$$

where $B_d \in \mathbb{R}^{p_d \times R}$, $\odot$ denotes the Khatri-Rao product, and $1_R$ is the vector of $R$ ones.
The dimensionality of this tensor regression model is $R \sum_d p_d$, usually substantially smaller than that of the full model, $\prod_d p_d$. In the ADHD-200 Global Competition, for example, the dimensionality of the MRI data is reduced from $128^3 = 2{,}097{,}152$ to $128 \times 3 = 384$ by a rank $R = 1$ model. More importantly, the multilinear form in (4) generalizes the linear form in classical linear regression while still respecting the tensor structure in $X$.
The parameters $(B_1, \ldots, B_D)$ can be efficiently estimated by an alternating least squares (ALS) algorithm, summarized in Algorithm 2, which solves the weighted tensor least squares problem (4).
Algorithm 2: Alternating least squares (ALS) algorithm for minimizing the weighted tensor least squares criterion (4)
Note that the update of block $B_d$ is a regular weighted least squares problem with $Rp_d$ parameters, which admits an explicit solution. Therefore, the ALS algorithm is extremely simple to implement. The ALS iterations monotonically decrease the objective value and, under mild regularity conditions, converge to a stationary point of $L$.
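Because each block update is itself an ordinary weighted least squares, one plausible implementation for general $D$ is short. This sketch reuses the `khatri_rao` and `unfold` helpers from Section II; the stopping rule, initialization, and names are our own, not the authors' exact code.

```python
def weighted_tensor_ls(Xs, y, w, R, shape, n_iter=20, seed=0):
    """ALS for min_B sum_i w_i (y_i - <B, X_i>)^2 with B = [[B_1,...,B_D]] of rank R.
    Xs: array (n, p_1, ..., p_D). Returns the factor matrices [B_1, ..., B_D]."""
    rng = np.random.default_rng(seed)
    D = len(shape)
    Bs = [rng.normal(size=(p, R)) for p in shape]
    sw = np.sqrt(w)
    for _ in range(n_iter):
        for d in range(D):
            # K_d = B_D ⊙ ... ⊙ B_{d+1} ⊙ B_{d-1} ⊙ ... ⊙ B_1
            others = [Bs[k] for k in range(D) if k != d]
            Kd = others[0]
            for Bk in others[1:]:
                Kd = khatri_rao(Bk, Kd)
            # <B, X_i> = vec(B_d)^T vec(X_i(d) K_d): a regular WLS in R*p_d unknowns
            Z = np.stack([(unfold(Xi, d) @ Kd).reshape(-1, order='F') for Xi in Xs])
            beta, *_ = np.linalg.lstsq(sw[:, None] * Z, sw * y, rcond=None)
            Bs[d] = beta.reshape(shape[d], R, order='F')
        # (a convergence check on the objective value is omitted for brevity)
    return Bs
```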
For classification with tensor-valued predictors, our proposal is to replace the regular weighted least squares in Algorithm 1 by the weighted tensor least squares. Let $\{X_i, y_i\}_{i=1, \ldots, n}$ be the training set, where $X_i \in \mathbb{R}^{p_1 \times \cdots \times p_D}$ and $y_i \in \{0, 1\}$. The tensor LogitBoost is summarized in Algorithm 3.
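In code, this amounts to plugging the ALS solver into the LogitBoost loop sketched earlier as the weak learner; a sketch under our own naming follows.

```python
def make_tensor_learner(R, shape, n_iter=20):
    """Weak learner: rank-R weighted tensor least squares via ALS."""
    def fit(Xs, z, w):
        Bs = weighted_tensor_ls(Xs, z, w, R, shape, n_iter=n_iter)
        vecB = Bs[0]
        for Bk in Bs[1:]:
            vecB = khatri_rao(Bk, vecB)          # builds B_D ⊙ ... ⊙ B_1
        vecB = vecB @ np.ones(R)                 # vec(B) = (B_D ⊙ ... ⊙ B_1) 1_R
        # predictions are <B, X_i> = vec(B)^T vec(X_i)
        return lambda Xn: np.stack([Xi.reshape(-1, order='F') for Xi in Xn]) @ vecB
    return fit

# e.g., a rank-1 learner on 256 x 64 EEG matrices:
# learners = logitboost(Xs, y, make_tensor_learner(R=1, shape=(256, 64)),
#                       L=200, nu=0.05)
```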
Algorithm 3: Tensor LogitBoost

In practice we often need to scale down the image data such that the effective model size of the tensor least squares, $R(p_1 + p_2) - R^2$ for $D = 2$ or $R(\sum_d p_d - D + 1)$ for $D > 2$, is less than the sample size. The success of tensor LogitBoost lies in its ability to retain the tensor structure during the classification stage; therefore we would like to keep the tensor structure in the preprocessing step as well. In particular, we use tensor PCA to reduce the sizes of the image data.
Define the first two sample moments of the mode-$d$ matricization of the data as

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad S^{(d)} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})_{(d)} (X_i - \bar{X})_{(d)}^{\mathsf{T}}, \quad d = 1, \ldots, D.$$

Suppose the symmetric matrix $S^{(d)} \in \mathbb{R}^{p_d \times p_d}$ admits the eigendecomposition $S^{(d)} = U_d \Lambda_d U_d^{\mathsf{T}}$ for $d = 1, \ldots, D$, and the target dimension is $q_1 \times \cdots \times q_D$, where $q_d \leq p_d$ for all $d$. Then the reduced image for each subject is

$$\tilde{X}_i = X_i \times_1 \tilde{U}_1^{\mathsf{T}} \times_2 \cdots \times_D \tilde{U}_D^{\mathsf{T}},$$

where $\tilde{U}_d \in \mathbb{R}^{p_d \times q_d}$ contains the top $q_d$ eigenvectors of $S^{(d)}$ in its columns and $\times_d$ denotes the mode-$d$ product. The follow-up tensor LogitBoost is then performed on the reduced tensor data $\tilde{X}_i \in \mathbb{R}^{q_1 \times \cdots \times q_D}$. Note that, for $D = 1$, tensor PCA reduces to classical PCA.
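A sketch of this reduction step, reusing `unfold` from Section II; we project the raw images here, and the centering convention may differ from the authors'.

```python
def tensor_pca(Xs, q):
    """Mode-wise PCA: reduce (n, p_1, ..., p_D) images to (n, q_1, ..., q_D)."""
    Xc = Xs - Xs.mean(axis=0)                     # center around the sample mean
    Us = []
    for d in range(1, Xs.ndim):                   # tensor modes, skipping the sample axis
        # S^(d) = (1/n) sum_i (X_i - Xbar)_(d) (X_i - Xbar)_(d)^T
        S = sum(unfold(Xi, d - 1) @ unfold(Xi, d - 1).T for Xi in Xc) / len(Xs)
        vals, vecs = np.linalg.eigh(S)            # ascending eigenvalues
        Us.append(vecs[:, ::-1][:, :q[d - 1]])    # top q_d eigenvectors
    # X_i reduced = X_i x_1 U_1^T x_2 ... x_D U_D^T (mode-d products)
    Xr = Xs
    for d, U in enumerate(Us, start=1):
        Xr = np.moveaxis(np.tensordot(Xr, U, axes=(d, 0)), -1, d)
    return Xr, Us

# e.g., downsizing EEG matrices from 256 x 64 to 32 x 16:
# Xr, Us = tensor_pca(Xs, q=(32, 16))
```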
Also, in practice the tensor LogitBoost algorithm has to be tuned on a training data set for better performance on the testing data. The parameters subject to tuning include the number of boosting steps $L$, the rank of the tensor regression $R$, and the shrinkage parameter $\nu$ in the boosting algorithm [11]. In the following examples, we apply cross-validation on the training data to choose the best combination over the following ranges (a grid-search sketch follows the list):

- Number of boosting steps $L$ between 1 and 1000.
- Rank $R \in \{1, 2, 3\}$.
- Shrinkage $\nu \in \{0.01, 0.05, 0.1\}$.
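A minimal grid-search sketch over these ranges; the coarse sub-grid of $L$ values and the fold assignment are our simplifications, and in practice one would track the cross-validation error along the boosting path rather than refit for each $L$.

```python
from itertools import product

def tune_tensor_logitboost(Xs, y, shape, n_folds=5):
    """Pick (L, R, nu) by K-fold cross-validation on the training data."""
    folds = np.arange(len(y)) % n_folds           # simple fold assignment
    best, best_err = None, np.inf
    for L, R, nu in product([50, 200, 1000], [1, 2, 3], [0.01, 0.05, 0.1]):
        errs = []
        for k in range(n_folds):
            tr, te = folds != k, folds == k
            learners = logitboost(Xs[tr], y[tr],
                                  make_tensor_learner(R, shape), L=L, nu=nu)
            yhat = logitboost_predict(learners, Xs[te], nu=nu)
            errs.append(np.mean(yhat != y[te]))
        if np.mean(errs) < best_err:
            best, best_err = (L, R, nu), float(np.mean(errs))
    return best
```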
IV. EEG Data Analysis
We now apply the proposed tensor LogitBoost to analyze an EEG data set, comparing it with regular LogitBoost applied to the vectorized predictors (regular LogitBoost). The study examines EEG correlates of genetic predisposition to alcoholism and involved two groups of subjects: an alcoholic group of 77 subjects and a control group of 45 subjects. Each subject was exposed to either one stimulus or two stimuli. During an exposure, voltage values were measured from 64 channels of electrodes at 256 time points. The 64 electrodes were placed at different locations on the subject’s scalp. The stimuli were pictures chosen from a picture set. When two pictures were shown, they were displayed in either a matched condition, where the two pictures were identical, or an unmatched condition, where they were different. Each subject had 120 trials under these three conditions: single stimulus, two matched stimuli, and two unmatched stimuli. The primary interest was to study the association between alcoholism and the pattern of voltage values over time and channels.
To keep matters simple, in this paper we used only part of the data set: we included only the single-stimulus condition and, for each subject, took the average of all trials under that condition. That is, the portion of the data we used consists of $(X_1, Y_1), \ldots, (X_{122}, Y_{122})$, where $X_i$ is a $256 \times 64$ matrix whose entries are the mean voltage values of subject $i$ at each combination of a time point and a channel, averaged over all trials under the single-stimulus condition, and $Y_i$ is a binary random variable indicating whether the $i$th subject is alcoholic ($Y_i = 1$) or nonalcoholic ($Y_i = 0$). To visualize and illustrate the results in a meaningful way, we set pl = dr = 2.
Instead of directly applying the classification method to the original EEG data, we preprocessed the data with tensor PCA to reduce the dimension. We partitioned the data into $k \in \{5, 10, 122\}$ portions. Each portion was held out once as the testing data set, and the remaining $k - 1$ portions together formed the training data set. Within the training set, a further five-fold cross-validation was applied to choose the tuning parameters: the number of boosting steps $L$, the rank of the tensor regression $R$, and the shrinkage parameter $\nu$. We experimented with several scales of downsizing and obtained similar results.
Table I shows the superior performance of tensor LogitBoost compared to the regular LogitBoost under all scenarios. The best results are highlighted in boldface.
Table I: Misclassification error rates on the EEG data.

Method | Leave-one-out | Five-fold | Ten-fold
---|---|---|---
Tensor LogitBoost | **0.167** | **0.221** | **0.157**
Regular LogitBoost | 0.288 | 0.287 | 0.290
V. MRI Data Analysis
Attention deficit hyperactivity disorder (ADHD) is one of the most commonly diagnosed behavioral disorders among children. Its primary symptoms fall into three groups: developmentally inappropriate inattention, impulsive behavior, and hyperactivity. Children with these symptoms do not know how to control their behavior or have trouble organizing their thoughts. About 5% of U.S. children aged 6–17 are affected by ADHD. The levels of these primary symptoms are widely used in the diagnosis and treatment evaluation of ADHD, yet the symptoms are typically rated by the children’s teachers or parents, which is inherently subjective. More objective methods are therefore greatly desired.
The goal of the ADHD-200 Global Competition was to develop novel strategies for predicting ADHD diagnostic status based on an individual’s MRI or fMRI data. The ADHD-200 initiative organized the public release of MRI and fMRI data for 776 children (285 children with ADHD and 491 controls) across 8 independent sites. A testing set of 195 unlabeled children was then used to measure the performance of the classifiers developed by the teams. Competition results revealed that the prediction accuracy varied from 43.08% to 61.54%, with a mean of 56.02%. The mean accuracy in predicting control subjects was 71.77%, which is larger than that in predicting children with ADHD (37.44%).
We included only the MRI data in our analysis. The MRI data set that we used is the preprocessed version of the ADHD-200 Global Competition MRI data set released by the Neuro Bureau; SPM8 was used for the preprocessing.
The preprocessed MRI data comprise 776 labeled children and 195 unlabeled children. The labeled training set contains 285 children with ADHD and 491 typically developing children (TDC). Not all children in the training and testing sets were used in our analysis: for some children, the MRI data are fragmentary and of low quality, and the diagnostic labels for the 26 participants from Brown University are unavailable. As a result, 770 training children and 171 testing children were used in our study. The number of children from each site is listed in Table II.
Table II: Number of children from each site.

Site Name | Training | Testing
---|---|---
Peking University | 194 | 51
Brown University | 0 | 26
Kennedy Krieger Institute | 83 | 11
NeuroImage | 48 | 25
New York University | 217 | 41
Oregon Health & Science University | 79 | 34
University of Pittsburgh | 89 | 9
Washington University | 60 | 0
Total | 770 | 171
Prior to classification, we performed tensor PCA to downsize the MRI data, whose original size is 121 × 145 × 121, and we experimented with four different scales of PCA. We then applied tensor LogitBoost and regular LogitBoost to classify ADHD status based on each subject’s downsized MRI data. On the 770 training subjects, we employed five-fold cross-validation to tune the number of boosting steps $L$, the rank of the tensor regression $R$, and the shrinkage parameter $\nu$ in tensor LogitBoost; the tuning parameters of regular LogitBoost were chosen using the same strategy. We then applied the tuned models to the testing data and evaluated the misclassification error rate.
The classification errors on the testing set are reported in Table III. Again, we observe the superior performance of tensor LogitBoost compared to regular LogitBoost. The best prediction accuracy, 69.00%, is achieved when the 43 × 51 × 43 downsized MRI is used as the input to tensor LogitBoost. Recall that the ADHD-200 Global Competition results showed prediction accuracies varying from 43.08% to 61.54% with a mean of 56.02%. Hence, the results demonstrate convincingly the advantage of our new classification procedure.
Table III: Misclassification error rates on the ADHD-200 testing set.

Reduced dimension | Regular LogitBoost | Tensor LogitBoost
---|---|---
25 × 29 × 25 | 0.37 | 0.33
31 × 37 × 31 | 0.39 | 0.32
37 × 44 × 37 | 0.39 | 0.33
43 × 51 × 43 | 0.45 | 0.31
VI. Conclusion
Modern brain image data pose various challenges to traditional classification methods due to their high dimensionality and complex structure. To address these challenges, we propose the tensor LogitBoost algorithm as an “off-the-shelf” classification tool for tensor-valued brain image data.
Compared to existing classification tools in the brain imaging literature, tensor LogitBoost is straightforward to implement and applies to a variety of brain image data, such as MRI, DTI, fMRI, PET, EEG, and MEG. By maintaining the tensor structure along the pipeline, it retains the rich structural information in the image data and improves on the performance of regular boosting machines. Our applications to the EEG and MRI data demonstrate its competitiveness.
We have concentrated on the two-class classification problem; extensions to the multi-class setting are worth further research. The multi-class LogitBoost algorithm would be a good starting point: a multi-class tensor LogitBoost algorithm would be similar to the original multi-class LogitBoost, except for a few differences at the level of the weak learners, whose regression functions should be learned by weighted tensor least squares.
Algorithm 4: LogitBoost algorithm for the multi-class classification problem
Acknowledgment
The authors would like to thank Professor Lexin Li of the University of California, Berkeley, for his constructive and inspiring comments and suggestions on our research on tensor LogitBoost with images as predictors.
References
- [1] Pardo PJ, Georgopoulos AP, Kenny JT, Stuve TA, Findling RL, and Schulz SC, “Classification of adolescent psychotic disorders using linear discriminant analysis,” Schizophrenia Research, vol. 87, pp. 297–306, 2006.
- [2] Ford J, Farid H, Makedon F, Flashman LA, McAllister TW, Megalooikonomou V, and Saykin AJ, “Patient classification of fMRI activation maps,” in Proc. 6th Annual International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI ’03), 2003, pp. 58–65.
- [3] Sung C, Woo J, Goodman M, Huffman T, and Choe Y, “Scalable, incremental learning with MapReduce parallelization for cell detection in high-resolution 3D microscopy data,” in The 2013 International Joint Conference on Neural Networks (IJCNN), IEEE, 2013, pp. 1–7.
- [4] Calhoun VD, Maciejewski PK, Pearlson GD, and Kiehl KA, “Temporal lobe and default hemodynamic brain modes discriminate between schizophrenia and bipolar disorder,” Human Brain Mapping, vol. 29, pp. 1265–1275, 2008.
- [5] Demirci O, Clark V, and Calhoun VD, “A projection pursuit algorithm to classify individuals using fMRI data: Application to schizophrenia,” NeuroImage, vol. 39, pp. 1774–1782, 2008.
- [6] Li B, Kim MK, and Altman N, “On dimension folding of matrix- or array-valued statistical objects,” Annals of Statistics, vol. 38, pp. 1094–1121, 2010.
- [7] Li B, Artemiou A, and Li L, “Principal support vector machines for linear and nonlinear sufficient dimension reduction,” Annals of Statistics, vol. 39, no. 6, pp. 3182–3210, 2011.
- [8] Zhang B and Wang L, “Structure preserving dimension reduction with 2D images as predictors,” in 2016 IEEE International Conference on Big Data (Big Data), 2016, pp. 3619–3624.
- [9] Breiman L, “Arcing classifier (with discussion and a rejoinder by the author),” The Annals of Statistics, vol. 26, no. 3, pp. 801–849, 1998.
- [10] Zhou H, Li L, and Zhu H, “Tensor regression with applications in neuroimaging data analysis,” Journal of the American Statistical Association, vol. 108, no. 502, pp. 540–552, 2013.
- [11] Friedman JH, “Greedy function approximation: A gradient boosting machine,” The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
- [12] Ridgeway G, “The state of boosting,” Computing Science and Statistics, vol. 31, pp. 172–181, 1999.
- [13] Freund Y and Schapire RE, “A decision-theoretic generalization of on-line learning and an application to boosting,” in European Conference on Computational Learning Theory, Springer, 1995, pp. 23–37.