Abstract
In this paper, we propose a new nonparametric Bayesian framework to cluster white matter fiber tracts into bundles using a hierarchical Dirichlet processes mixture (HDPM) model. The number of clusters is automatically learnt from data with a Dirichlet process (DP) prior instead of being manually specified. After the models of bundles have been learnt from training data without supervision, they can be used as priors to cluster/classify fibers of new subjects. When clustering fibers of new subjects, new clusters can be created for structures not observed in the training data. Our approach does not require computing pairwise distances between fibers and can cluster a huge set of fibers across multiple subjects without subsampling. We present results on multiple data sets, the largest of which has more than 120,000 fibers.
1 Introduction
Diffusion Magnetic Resonance Imaging (dMRI) is an MRI modality that has gained tremendous popularity over the past five years and is one of the first methods that made it possible to visualize and quantify the organization of white matter in the human brain in vivo. Extracting connectivity information from dMRI, termed “tractography”, is an especially active area of research, as it promises to model the pathways of white matter tracts in the brain by connecting local diffusion measurements into global trace-lines. In neurological studies of white matter using tractography, it is often important to identify anatomically meaningful fiber bundles. Similar fibers form clusters, and each cluster is identified as a “fiber bundle”.
In this paper, we propose a nonparametric Bayesian framework to cluster fibers into bundles. The 3D space of the brain is quantized into voxels. A bundle is modeled as a multinomial distribution over voxels and orientations. This probabilistically models the spatial variation of the pathways of fibers. The models of bundles are learnt from how voxels are connected by fibers instead of comparing distances between fibers. If two voxels are connected by many fibers, both of the voxels have large weights in the model of the same bundle, which means that they are on the same pathway of white matter tracts. Many existing approaches have difficulty in determining the number of clusters and in clustering a very large set of fibers. Our approach automatically learns the number of clusters from data with a Dirichlet process (DP) prior [1]. While the space and time complexities of existing distance-based fiber clustering approaches are at least O(M²), where M is the number of fibers, the space complexity of our approach is O(M) since it does not compute and store pairwise distances between fibers.
After the models of bundles have been learnt from training data without supervision, they are used as priors to cluster/classify new fibers. When clustering fibers of new subjects, our approach adapts the models of bundles to the new data and creates new clusters for structures which are not observed in the training data, instead of fixing the number of clusters as current methods do. Our framework can be extended to multiscale clustering: fibers are first clustered using large voxels, so that bundles correspond to structures at a large scale; each bundle can then be further clustered using smaller voxels, leading to structures at a finer scale. An example is shown in Figure 1. Multiscale clustering makes it easier for experts to identify white matter structures across different scales.
Fig. 1.
An example of multiscale clustering. The spatial range of the whole brain is 200 × 200 × 200. (a): The clustering result when the space is quantized into voxels of size 11 × 11 × 11. The bundles correspond to structures at a large scale. (b): One bundle from (a). (c): The space is quantized into voxels of size 3 × 3 × 3 and the bundle in (b) is further clustered into smaller bundles corresponding to structures at a finer scale.
1.1 Related Work
Automatically clustering fibers has drawn a lot of attention in recent years. A typical framework is to first define a pairwise similarity/distance between fibers and then input the similarity matrix to standard clustering algorithms. Brun et al. [2] computed the Euclidean distances between 9-D fiber shape descriptors. Jonasson et al. [3] measured the similarity between two fibers by counting the number of points sharing the same voxel. Gerig et al. [4] proposed three measures related to the Hausdorff distance: closest point distance, mean of closest distances, and Hausdorff distance. Various clustering algorithms, such as hierarchical clustering (single-link and complete-link) [4,5], fuzzy c-means [6], k-nearest neighbors [7], normalized cuts [2] and spectral clustering [3,8], have been used. Among these choices, the mean of closest distances and spectral clustering have been the most popular [8,9].
These clustering algorithms required manually specifying the number of clusters or a threshold for deciding when to stop merging/splitting clusters, both of which are difficult to know, especially when the data sets are complicated and noisy. Moberts et al. [9] showed that the performance of clustering varied dramatically when different numbers of clusters were chosen. To avoid this difficulty, O’Donnell and Westin [8] first chose a large cluster number for spectral clustering and then manually merged clusters to obtain models for white matter structures.
Another drawback of this framework is the high space and time complexities of computing pairwise distances between fibers when the data set is large. Whole brain tractography produces between 10,000 and 100,000 fibers per subject. It is difficult to compute a 100,000 × 100,000 similarity matrix or even to store it in memory. Some clustering algorithms, such as spectral clustering, need to compute the eigenvectors of this huge similarity matrix. This problem becomes more serious when clustering fibers of multiple subjects. The current solutions are to cluster only a small portion of the whole data set after subsampling or to do some numerical approximation based on the sampled subset [8]. However, important information from the full data set may be lost after subsampling.
Maddah et al. [10] proposed a probabilistic approach to cluster fibers without computing pairwise distances. They used a Dirichlet distribution¹ as a prior to incorporate anatomical information. This approach differs from ours: it used a parametric model, assumed that the number of clusters is known, and required manual initialization of cluster centers. It also required establishing point correspondences between fibers, which is difficult; our approach does not.
Dirichlet process mixture (DPM) models have been applied to medical image analysis in recent years because of their capability to learn the number of clusters and their flexibility to adapt to a wide variety of data. Ferreira da Silva [11] used a DPM model for brain MRI tissue classification. In [12,13], DPM models were used to model spatial brain activation patterns in functional magnetic resonance imaging. In [14], Jbabdi et al. modeled the connectivity profiles of a brain region as an infinite mixture of multivariate Gaussian distributions with a DP prior. To the best of our knowledge, our work is the first to use HDPM for tractography segmentation to automatically learn the number of clusters from data. Our approach is related to the work in [15], where HDPM models were used for word-document analysis. HDPM was also used for trajectory analysis in visual surveillance [16].
2 Method
We begin by introducing DP in Section 2.1. In Section 2.2 we propose our HDPM model for clustering fibers, and in Section 2.3 we use Gibbs sampling for inference. In Section 2.4, we explain how to use the learnt models of bundles as priors to cluster new data.
2.1 Dirichlet Process
DP [1] is used as a prior to sample probability measures. It is defined by a concentration parameter α, which is a positive scalar, and a base probability measure H. A probability measure G randomly drawn from DP(α, H) is always a discrete distribution,
$$G = \sum_{k=1}^{\infty} \pi_k\, \delta_{\phi_k}, \qquad (1)$$
which can be obtained from a stick-breaking construction [17]. In Eq (1), ϕk is a parameter vector sampled from H, δ_{ϕk} is a Dirac delta function centered at ϕk, and π_k is a non-negative scalar constructed by the stick-breaking process $\pi_k = \beta_k \prod_{l=1}^{k-1}(1-\beta_l)$ with $\beta_k \sim \mathrm{Beta}(1, \alpha)$, so that $\sum_{k=1}^{\infty}\pi_k = 1$.
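As a concrete illustration, here is a minimal Python sketch of this truncated stick-breaking construction; the truncation level T and the value of α are illustrative choices, not from the paper.

```python
import numpy as np

def stick_breaking_weights(alpha, T, rng):
    """Draw the first T stick-breaking weights pi_k of G ~ DP(alpha, H)."""
    betas = rng.beta(1.0, alpha, size=T)                # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining                            # pi_k = beta_k * prod_{l<k}(1 - beta_l)

rng = np.random.default_rng(0)
pi = stick_breaking_weights(alpha=1.0, T=1000, rng=rng)
print(pi[:5], pi.sum())   # weights decay quickly; the sum approaches 1 as T grows
```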
G can be used as a prior for infinite mixture models. Let {wi} be a set of observed data points. wi is sampled from a density function p(·|θi) parameterized by θi, and θi (which is one of the ϕk's in Eq (1)) is sampled from G. Data points sharing the same parameter vector ϕk are clustered together under this mixture model. Given the parameter vectors θ1,…,θN of N data points, the parameter vector θN+1 of data point wN+1 can be sampled from a prior by integrating out G,
$$\theta_{N+1} \mid \theta_1, \ldots, \theta_N \;\sim\; \frac{1}{N+\alpha}\left(\sum_{k=1}^{K} n_k\, \delta_{\phi_k} + \alpha H\right). \qquad (2)$$
There are K distinct parameter vectors ϕ1,…,ϕK (identifying K components) among θ1,…,θN, and n_k is the number of points with parameter vector ϕk. θN+1 can be assigned to one of the existing components (wN+1 is assigned to one of the existing clusters) or can sample a new component from H (a new cluster is created for wN+1). The posterior of θN+1 is
$$p(\theta_{N+1} = \phi \mid w_{N+1}, \theta_1, \ldots, \theta_N) \;\propto\; \begin{cases} n_k\, p(w_{N+1} \mid \phi_k), & \phi = \phi_k,\ 1 \le k \le K,\\[4pt] \alpha \int p(w_{N+1} \mid \phi)\, \mathrm{d}H(\phi), & \phi \sim H \ \text{(new component)}. \end{cases} \qquad (3)$$
The Dirichlet process mixture (DPM) model is thus likely to create a new component when the existing components cannot explain the data well, and there is no limit on the number of components. These properties make DP ideal for modeling clustering problems in which the number of clusters is not well-defined in advance.
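The predictive rule of Eq (2) is easy to simulate. The following sketch (α and the number of points are illustrative) shows the rich-get-richer behavior that allows the number of clusters to grow with the data:

```python
import numpy as np

def crp_assign(counts, alpha, rng):
    """Sample a cluster for the next point under Eq (2).

    counts[k] = n_k, the number of points already in cluster k.
    Returns an existing index, or len(counts) to open a new cluster.
    """
    weights = np.append(counts, alpha)       # n_1, ..., n_K, alpha
    return rng.choice(len(weights), p=weights / weights.sum())

rng = np.random.default_rng(0)
counts = []
for _ in range(50):
    k = crp_assign(np.array(counts, dtype=float), alpha=1.0, rng=rng)
    if k == len(counts):
        counts.append(1)                     # a new cluster is created
    else:
        counts[k] += 1
print(counts)                                # a few large clusters, several small ones
```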
2.2 Hierarchical Dirichlet Process Mixture Model
In probability theory, statistics, and machine learning, a graphical model is a graph that represents conditional independencies among random variables. The graphical model of our HDPM model is shown in Figure 2. There are M fibers and each fiber j has Nj points which are ordered sequentially. oji = (uji, Δuji) consists of the observed 3D coordinate uji = (xji, yji, zji) and the shift Δuji = u_{j,i+1} − uji of point i on fiber j. The 3D space of the brain is uniformly quantized into voxels and shifts are quantized into three orientations Δu1 = (1, 0, 0)T, Δu2 = (0, 1, 0)T and Δu3 = (0, 0, 1)T. A codebook is built, in which codes (entries of the codebook) are indices of voxels and orientations. Let uw be the centroid of the voxel and dw be the index of the orientation vector corresponding to code w. Quantization is done in a probabilistic way,
$$p(o_{ji} \mid w_{ji} = w) = p(u_{ji} \mid w)\, p(\Delta u_{ji} \mid w), \qquad (4)$$
$$p(u_{ji} \mid w) \propto \exp\!\left(-\frac{\|u_{ji} - u_w\|^2}{\sigma^2}\right), \qquad (5)$$
$$p(\Delta u_{ji} \mid w) \propto \left|\frac{\Delta u_{ji}^{T}\, \Delta u_{d_w}}{\|\Delta u_{ji}\|\,\|\Delta u_{d_w}\|}\right|. \qquad (6)$$
Since we do not distinguish the starting and ending points of a fiber, the sign of the correlation between Δuji and Δu_{dw} is ignored in Eq (6). The statistical model ϕk of a bundle is a multinomial distribution over voxels and orientations. Optionally, if the symmetry across hemispheres is considered, we can do bilateral clustering as in [8]. Assuming that the brain is aligned and x = 0 is the midsagittal plane, we modify the observed 3D coordinates as uji = (|xji|, yji, zji), ignoring the signs of the x coordinates. Thus, learnt models of bundles are symmetric under midsagittal reflection.
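A sketch of this probabilistic quantization under the Gaussian and absolute-cosine forms reconstructed in Eqs (5)-(6); the value of σ and the function names are illustrative assumptions, not from the paper.

```python
import numpy as np

ORIENTS = np.eye(3)   # the three quantized orientations; Eq (6) ignores sign

def code_posterior(u, du, centroids, sigma=5.0):
    """p(o | w) over all (voxel, orientation) codes for one point o = (u, du).

    centroids: (L, 3) array of voxel centroids u_w for the L spatial codes.
    Returns an (L, 3) array, normalized over all code pairs.
    """
    # Eq (5): spatial term, Gaussian around each voxel centroid u_w
    spatial = np.exp(-np.sum((centroids - u) ** 2, axis=1) / sigma ** 2)
    # Eq (6): |cosine| between du and each quantized direction, sign ignored
    orient = np.abs(ORIENTS @ du) / (np.linalg.norm(du) + 1e-12)
    probs = spatial[:, None] * orient[None, :]
    return probs / probs.sum()
```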
Fig. 2.
The graphical model of our HDPM model. The right side lists the distributions from which the random variables are sampled. G0 is a prior on the whole data set. Gj is a prior on fiber j. Both G0 and Gj are sampled from DPs. θji is the model of a bundle sampled for a point. wji and oji are the code and observation of a point.
A prior G0 on the whole data set is sampled from a DP, G0 ~ DP(γ, H), where the base measure H is a Dirichlet distribution. $G_0 = \sum_{k=1}^{\infty} \pi_{0k}\, \delta_{\phi_k}$ is an infinite mixture in which the components {ϕk} are models of bundles. For a fiber j, a prior Gj is sampled from a DP, Gj ~ DP(α, G0). It was shown in [15] that in HDPM all the Gj share the same set of components {ϕk} as G0; however, they have different weights πj over {ϕk}, i.e. $G_j = \sum_{k=1}^{\infty} \pi_{jk}\, \delta_{\phi_k}$. Thus the models of bundles are learnt from all the fibers, and fibers have different distributions over bundles. For a point i on fiber j, the model θji (θji ∈ {ϕk}) of a bundle is sampled from Gj, θji ~ Gj. Its voxel and orientation index wji is sampled from the model of the bundle, wji ~ Discrete(θji). The observation oji is sampled from p(oji|wji). The concentration parameters γ and α are sampled from gamma priors, γ ~ Gamma(a1, b1) and α ~ Gamma(a2, b2). In Figure 2, H, a1, b1, a2 and b2 are hyperparameters; the clustering performance is quite robust to the choice of their values over a large range. {oji} are observations, and the remaining variables are hidden and must be inferred. A fiber j is assigned to the bundle k with maximum πjk.
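As a sanity check on this generative process, the following toy forward simulation of Figure 2 uses a truncated stick-breaking approximation of G0; all sizes and hyperparameter values are illustrative.

```python
import numpy as np

def stick_breaking(conc, T, rng):
    b = rng.beta(1.0, conc, size=T)
    return b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))

def generate_hdpm(M=5, N=20, L=50, gamma=1.0, alpha=1.0, h=0.3, T=30, seed=0):
    """Toy forward simulation of the HDPM in Figure 2 (truncated to T bundles)."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(L, h), size=T)     # bundle models phi_k ~ Dir(h,...,h)
    pi0 = stick_breaking(gamma, T, rng)            # weights of G_0 ~ DP(gamma, H)
    pi0 /= pi0.sum()                               # renormalize after truncation
    fibers = []
    for j in range(M):
        pij = rng.dirichlet(alpha * pi0 + 1e-8)    # finite approx. of G_j ~ DP(alpha, G_0)
        c = rng.choice(T, size=N, p=pij)           # bundle of each point (theta_ji)
        w = np.array([rng.choice(L, p=phi[k]) for k in c])  # codes w_ji ~ Discrete(theta_ji)
        fibers.append((c, w))
    return phi, pi0, fibers
```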
The data likelihood is higher when the distribution of a fiber concentrates on fewer bundles instead of being uniform. So if two voxels are connected by many fibers, both of them have large weights in the model of the same bundle.
The size of the voxels determines the scale of the structures to be learnt, which makes the framework naturally multiscale: fibers are first clustered using large voxels, so that bundles correspond to structures at a large scale, and each bundle is then further clustered using smaller voxels, showing structures at a finer scale. Multiscale clustering makes it easier for experts to identify white matter structures across different scales.
2.3 Inference
We use the Gibbs sampling inference proposed in [15], which is based on the Chinese restaurant franchise. First we introduce some notation. cji is the index of the bundle assigned to point i on fiber j. njk is the number of points on fiber j assigned to bundle k, and nj is the number of points on fiber j. mkw is the number of points with code w assigned to bundle k, and mk is the total number of points assigned to bundle k. $n_{jk}^{-ji}$, $m_{kw}^{-ji}$ and $m_k^{-ji}$ denote the same statistics computed without counting cji. H = Dir(h,…,h) is a flat Dirichlet prior, and L is the size of the codebook.
During the sampling procedure, suppose that K models of bundles (clusters) have been created and assigned to data. Then,
$$G_0 = \sum_{k=1}^{K} \pi_{0k}\, \delta_{\phi_k} + \pi_{0u}\, G_u, \qquad G_u \sim \mathrm{DP}(\gamma, H), \qquad (7)$$
where $\pi_{0u} = 1 - \sum_{k=1}^{K} \pi_{0k}$ and $G_u$ collects the components that have not yet been assigned to any data.
The Gibbs sampling scheme proposed in [15] integrated out {πjk} and {ϕk} without sampling them. The posterior of cji is given by
$$p(c_{ji} = k \mid \mathbf{c}^{-ji}, \mathbf{w}, \boldsymbol{\pi}_0, \alpha) \;\propto\; \begin{cases} \left(n_{jk}^{-ji} + \alpha\,\pi_{0k}\right) \dfrac{m_{k w_{ji}}^{-ji} + h}{m_k^{-ji} + Lh}, & 1 \le k \le K,\\[8pt] \alpha\,\pi_{0u}\, \dfrac{1}{L}, & k = K + 1. \end{cases} \qquad (8)$$
This posterior is the product of two terms, which capture how many points on fiber j are assigned to bundle k, and how well the code wji fits the model of an existing bundle or a flat distribution (1/L) when creating a new bundle. This shows that points on the same fiber tend to choose the same bundle. The posteriors of {π0k}, γ and α involve more details of the Chinese restaurant franchise and can be found in [15]. We need one more step to sample wji, which is not observed in our model but is observed in [15]:
$$p(w_{ji} = w \mid c_{ji} = k, o_{ji}, \mathbf{w}^{-ji}) \;\propto\; p(o_{ji} \mid w)\, \frac{m_{kw}^{-ji} + h}{m_k^{-ji} + Lh}, \qquad 1 \le k \le K, \qquad (9)$$
$$p(w_{ji} = w \mid c_{ji} = K+1, o_{ji}) \;\propto\; p(o_{ji} \mid w)\, \frac{1}{L}, \qquad (10)$$
where p(oji|wji) is given by Eq (4).
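For concreteness, here is a sketch of one Gibbs draw of c_ji according to Eq (8); the argument names are hypothetical, and the sufficient statistics are assumed to already exclude point (j, i).

```python
import numpy as np

def sample_cji(w, n_jk, m_kw, m_k, pi0, pi0_u, alpha, h, L, rng):
    """One Gibbs draw of c_ji from Eq (8).

    n_jk (K,): points of fiber j per bundle; m_kw (K, L): code counts per
    bundle; m_k (K,): bundle totals -- all computed without point (j, i).
    pi0 (K,) and pi0_u: weights of G_0 in Eq (7); w: code of the point.
    Returns k in 0..K-1 for an existing bundle, or K to create a new one.
    """
    existing = (n_jk + alpha * pi0) * (m_kw[:, w] + h) / (m_k + L * h)
    new = alpha * pi0_u / L                  # flat 1/L likelihood for a new bundle
    probs = np.append(existing, new)
    return rng.choice(len(probs), p=probs / probs.sum())
```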
Although {ϕk} and {πjk} are not explicitly sampled during the Gibbs sampling procedure, they can be estimated from any single sample,
$$\hat{\phi}_{kw} = \frac{m_{kw} + h}{m_k + Lh}, \qquad \hat{\pi}_{jk} = \frac{n_{jk} + \alpha\,\pi_{0k}}{n_j + \alpha}.$$
The space complexity of our approach is O(M), and the time complexity of each Gibbs sampling iteration is also O(M). It is difficult to provide a theoretical analysis of the convergence of Gibbs sampling; in practice, we stop the burn-in when the data likelihood converges. From our empirical observations, the overall running time of our approach is much lower than O(M²). Recently, more efficient inference approaches, such as variational inference [18] and parallel sampling [19], have been proposed and applied to DPM and HDPM models. In future work, we will study how to improve the inference of our model using these schemes.
2.4 Clustering New Data
After the models of bundles have been learnt from training data without supervision, we can fix G0 and the number of clusters to classify new fibers. Thus our model is converted to a parametric model.
Optionally, we can use the models learnt from training data as priors to cluster, rather than classify, new data. The pre-learnt models can adapt to the new data, and new clusters can be created for structures not observed in the training data. Our HDPM model for clustering testing data is shown in Figure 3. Suppose that G0 as represented in Eq (7) has been learnt from training data and K clusters have been created. A prior $G^*_0$ on testing data is to be learnt. Different from the model shown in Figure 2, where G0 is generated from a DP with a flat base measure H, $G^*_0$ is generated from DP(γ*, F), where the base measure F is constructed from G0 and includes the models learnt from training data:
$$F = \omega^* \sum_{k=1}^{K} \hat{\pi}_{0k}\, \mathrm{Dir}(\epsilon\,\phi_k) + (1 - \omega^*)\, \mathrm{Dir}(h, \ldots, h). \qquad (11)$$
F is composed of two parts: the models learnt from training data and a flat prior. ω* is a scalar between 0 and 1. {π̂0k} are the normalized weights in G0, $\hat{\pi}_{0k} = \pi_{0k} / \sum_{l=1}^{K} \pi_{0l}$.
This assumes that before observing any testing data, there already exist K models of bundles $\{\phi^*_k\}_{k=1}^{K}$. However, instead of letting $\phi^*_k$ be equal to ϕk in Eq (7), we sample $\phi^*_k$ from a Dirichlet distribution with ϕk as its prior,
$$\phi^*_k \sim \mathrm{Dir}(\epsilon\,\phi_k),$$
where ε is a positive scalar. Thus the models of bundles can adapt to the testing data instead of being fixed.
Fig. 3.
The graphical model of our HDPM model for clustering testing data. F is used as the base measure to sample the prior $G^*_0$ on the new data set. F includes the models of bundles in G0 learnt from training data.
The choice of γ*, ω* and ε controls how much the models learnt from the training data affect the clustering of the testing data. The two extreme cases are ω* = 0, where the pre-learnt models have no effect on clustering the new data, and ω* = 1 with ε → ∞, where the models learnt from the new data are exactly the same as those learnt from the training data.
Suppose there are K* models of bundles assigned to the testing data. Then an explicit construction of $G^*_0$ is given by
$$G^*_0 = \sum_{k=1}^{K^*} \pi^*_{0k}\, \delta_{\phi^*_k} + \pi^*_{0u}\, G^*_u, \qquad G^*_u \sim \mathrm{DP}(\gamma^*, F). \qquad (12)$$
Models $\{\phi^*_k\}_{k=1}^{K}$ have been seen in the training data: they are sampled from the priors Dir(εϕk) and are updated using the testing data. Models $\{\phi^*_k\}_{k=K+1}^{K^*}$ are new models not found in the training data: they are sampled from the flat prior H = Dir(h,…,h). The remaining parts are the same as described in Section 2.2.
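A small sketch of this adaptation step; the toy codebook size and the ε values are illustrative.

```python
import numpy as np

def sample_adapted_bundle(phi_k, eps, rng):
    """Draw phi*_k ~ Dir(eps * phi_k): a testing-data bundle model that stays
    near the training-data model phi_k; eps controls how tightly."""
    return rng.dirichlet(eps * phi_k + 1e-12)   # small floor keeps parameters positive

rng = np.random.default_rng(0)
phi_k = np.array([0.7, 0.2, 0.1])               # toy bundle model over a 3-code codebook
for eps in (10.0, 1000.0):
    print(eps, sample_adapted_bundle(phi_k, eps, rng))
# larger eps keeps phi*_k closer to phi_k, i.e., less adaptation to the new data
```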
3 Results
We evaluate our approach on multiple data sets. The spatial range of the whole brain is roughly 200 × 200 × 200. The size of voxels is 11 × 11 × 11. We choose the hyperparameters in Figure 2 as a1 = a2 = b1 = b2 = 1 and h = 0.3, and we do bilateral clustering. Running on a computer with a 3 GHz CPU, it takes around one minute to cluster 1,000 fibers and around four hours to cluster 60,000 fibers.
The first data set has 3,152 fibers with ground truth, manually labeled into six anatomical structures. Figure 4(a)-(d) plots the clustering results of our approach and a spectral clustering approach, compared with the ground truth. Colors are used to distinguish clusters. Since clusters may be permuted in different results, the meaning of colors is not consistent across results. The spectral clustering approach uses the mean of closest distances as the distance measure, which was found to be the most effective in previous studies [8,9]. The clustering result of our approach is close to the ground truth. Although the correct number of clusters has been set, two anatomical structures are merged in the result of the spectral clustering approach, and a few outlier fibers form a small cluster. When the number of clusters is increased to 7, the two anatomical structures still cannot be separated; instead, another structure splits into two clusters.
Fig. 4.
Comparison of the results of two clustering approaches with the ground truth on a data set with 3,152 fibers. Two views are plotted for each result. (a) Ground truth. (b) Our approach. (c) Spectral clustering when the number of clusters is 6. (d) Spectral clustering when the number of clusters is 7. (e) The completeness and correctness accuracies of spectral clustering and our approach (HDPM).
There are two important aspects, called correctness and completeness, to consider when comparing a clustering result with the ground truth [9]. Correctness means that fibers of different anatomical structures are not clustered together. Completeness means that fibers of the same anatomical structure are clustered together. Putting all the fibers into the same cluster gives 100% completeness and 0% correctness; putting every fiber into a singleton cluster gives 100% correctness and 0% completeness. To measure correctness, we randomly sample 5,000 pairs of fibers which are in different anatomical structures according to the ground truth and calculate the fraction (r_correct) that are also in different clusters according to the clustering result. To measure completeness, we randomly sample 5,000 pairs of fibers which are in the same anatomical structure and calculate the fraction (r_complete) that are also in the same cluster. We also compute r_average = (r_correct + r_complete)/2. The accuracies of our approach and spectral clustering are plotted in Figure 4(e). As the number of clusters of spectral clustering increases from 2 to 25, its correctness increases and its completeness decreases. Its best r_average occurs when the number of clusters is five, which is close to the ground truth, and even this value is lower than the r_average of our approach. The correctness of our approach is almost consistently better than that of spectral clustering until spectral clustering chooses more than 20 clusters, and the completeness of our approach is significantly better when spectral clustering uses more than 5 clusters.
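The pair-sampling evaluation can be written in a few lines; the oversample-then-filter strategy and the function name below are our own choices, not from the paper.

```python
import numpy as np

def pairwise_accuracy(gt, pred, n_pairs=5000, seed=0):
    """Estimate r_correct, r_complete and r_average by sampling fiber pairs.

    gt and pred are integer label arrays of length M (ground-truth structure
    and cluster of each fiber). Pairs are drawn at random and filtered.
    """
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(gt), size=20 * n_pairs)   # oversample, then filter
    j = rng.integers(0, len(gt), size=20 * n_pairs)
    valid = i != j
    diff = valid & (gt[i] != gt[j])                   # different structures in gt
    same = valid & (gt[i] == gt[j])                   # same structure in gt
    r_correct = np.mean((pred[i] != pred[j])[diff][:n_pairs])
    r_complete = np.mean((pred[i] == pred[j])[same][:n_pairs])
    return r_correct, r_complete, (r_correct + r_complete) / 2
```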
We compare our approach with the approach proposed in [8] on a larger data set with 12,420 fibers. In [8], fibers were first grouped into a large number of clusters (200) and experts then merged these clusters to obtain anatomical structures. In this data set there are 10 anatomical structures. Our approach clusters the fibers into 27 clusters. We also manually merge them into the 10 anatomical structures; however, this takes much less effort than [8] since the number of clusters is smaller. Figure 5 shows some of the anatomical structures obtained by the two approaches. 83.2% of the fibers have consistent anatomical labels according to the two results. To evaluate the sensitivity of our approach to initialization, we run 50 trials of Gibbs sampling with random initializations. Figure 5(g) plots the frequency of the numbers of clusters learnt from the data.
Fig. 5.
Comparison of the results of our approach and the approach proposed in [8], in which experts manually merged the clusters from spectral clustering to obtain anatomical structures. (a) Clustering all the fibers using our approach. (b1)-(f1): anatomical structures obtained by merging clusters from our approach (27 clusters in total). (b2)-(f2): anatomical structures obtained by merging clusters from spectral clustering (200 clusters in total). Colors are used to distinguish clusters. (g) The frequency of the numbers of clusters learnt by our approach over 50 trials of Gibbs sampling with random initializations.
Figure 6 shows the results of clustering fibers across multiple subjects. The training data has 63,751 fibers of two subjects. The models of bundles are learnt from all these fibers. The testing data has 61,572 fibers of two subjects.
Fig. 6.
Cluster fibers across multiple subjects
4 Conclusion
We propose a nonparametric Bayesian framework for tractography segmentation. The number of clusters is automatically learnt from data through DP. The method has much lower space complexity than distance-based clustering methods and can cluster a very large set of fibers. Our Bayesian model is flexible enough to include knowledge from experts as priors. In future work, we plan to incorporate anatomical information in the model to guide tractography segmentation.
Acknowledgments
This work was supported by NIH grants NIH R01 MH074794, NIH P41 RR13218, NIH U54 EB005149, and NIH U41 RR019703. The authors also want to acknowledge valuable discussions with Lauren O’Donnell regarding techniques for clustering fiber tracts.
Footnotes
1. The Dirichlet distribution is used as a prior for finite mixture models, which can only adapt well to data from particular distributions. The Dirichlet process in our approach is used as a prior for infinite mixture models, which can adapt well to a wide variety of data.
Contributor Information
Xiaogang Wang, Email: xgwang@csail.mit.edu.
W. Eric L. Grimson, Email: welg@csail.mit.edu.
Carl-Fredrik Westin, Email: westin@bwh.harvard.edu.
References
1. Ferguson TS. A Bayesian analysis of some nonparametric problems. The Annals of Statistics. 1973;1:209–230.
2. Brun A, Knutsson H, Park HJ, Shenton ME, Westin CF. Clustering fiber traces using normalized cuts. In: Barillot C, Haynor DR, Hellier P, editors. MICCAI 2004. LNCS, vol. 3216. Heidelberg: Springer; 2004. pp. 368–375.
3. Jonasson L, Hagmann P, Thiran JP, Wedeen VJ. Fiber tracts of high angular resolution diffusion MRI are easily segmented with spectral clustering. In: Proc. of International Society for Magnetic Resonance in Medicine; 2005. p. 1310.
4. Gerig G, Gouttard S, Corouge I. Analysis of brain white matter via fiber tract modeling. In: Proc. of IEEE Engineering in Medicine and Biology; 2004. pp. 4421–4424.
5. Xia Y, Turken AU, Whitfield-Gabrieli SL, Gabrieli JD. Knowledge-based classification of neuronal fibers in entire brain. In: Gerig G, editor. MICCAI 2005. LNCS, vol. 3749. Heidelberg: Springer; 2005. pp. 205–212.
6. Maddah M, Grimson WEL, Warfield SK, Wells WM. A unified framework for clustering and quantitative analysis of white matter fiber tracts. Medical Image Analysis. 2008;12:191–202. doi: 10.1016/j.media.2007.10.003.
7. Ding Z, Gore JC, Anderson AW. Classification and quantification of neuronal fiber pathways using diffusion tensor MRI. Magnetic Resonance in Medicine. 2003;49:716–721. doi: 10.1002/mrm.10415.
8. O'Donnell LJ, Westin CF. Automatic tractography segmentation using a high-dimensional white matter atlas. IEEE Trans. on Medical Imaging. 2007;26:1562–1575. doi: 10.1109/TMI.2007.906785.
9. Moberts B, Vilanova A, van Wijk JJ. Evaluation of fiber clustering methods for diffusion tensor imaging. In: Proc. of IEEE Visualization; 2005. pp. 65–72.
10. Maddah M, Zollei L, Grimson WEL, Wells WM III. Modeling of anatomical information in clustering of white matter fiber trajectories using Dirichlet distribution. In: Proc. of MMBIA; 2008. pp. 1–7. doi: 10.1109/CVPRW.2008.4563003.
11. Ferreira da Silva AR. A Dirichlet process mixture model for brain MRI tissue classification. Medical Image Analysis. 2006;11:169–182. doi: 10.1016/j.media.2006.12.002.
12. Kim S, Smyth P. Hierarchical Dirichlet processes with random effects. In: Proc. of NIPS; 2006. pp. 697–704.
13. Thirion B, Tucholka A, Keller M, Pinel P. High level group analysis of fMRI data based on Dirichlet process mixture models. In: Karssemeijer N, Lelieveldt B, editors. IPMI 2007. LNCS, vol. 4584. Heidelberg: Springer; 2007. pp. 482–494.
14. Jbabdi S, Woolrich MW, Behrens T. Multiple-subjects connectivity-based parcellation using hierarchical Dirichlet process mixture models. NeuroImage. 2009;44:373–384. doi: 10.1016/j.neuroimage.2008.08.044.
15. Teh YW, Jordan MI, Beal MJ, Blei DM. Hierarchical Dirichlet processes. Journal of the American Statistical Association. 2006;101:1566–1581.
16. Wang X, Ma KT, Ng GW, Grimson WEL. Trajectory analysis and semantic region modeling using a nonparametric Bayesian model. In: Proc. of CVPR; 2008. pp. 1–8.
17. Sethuraman J. A constructive definition of Dirichlet priors. Statistica Sinica. 1994;4:639–650.
18. Blei DM, Jordan MI. Variational inference for Dirichlet process mixtures. Bayesian Analysis. 2006;1:121–144.
19. Asuncion A, Smyth P, Welling M. Asynchronous distributed learning of topic models. In: Proc. of NIPS; 2008.