Abstract
The segmentation of infant brain tissue images into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF) plays an important role in studying early brain development. In the isointense phase (approximately 6–8 months of age), WM and GM exhibit similar intensity levels in both T1 and T2 MR images, resulting in extremely low tissue contrast and thus making tissue segmentation very challenging. Existing methods for tissue segmentation in this isointense phase usually employ patch-based sparse labeling on a single T1, T2 or fractional anisotropy (FA) modality, or on their simply stacked combination, without fully exploiting the multi-modality information. To address this challenge, in this paper we propose to use fully convolutional networks (FCNs) for the segmentation of isointense-phase brain MR images. Instead of simply stacking the three modalities, we train one network for each modality and then fuse their high-layer features for the final segmentation. Specifically, we construct one convolution-pooling stream for each of the T1, T2 and FA modalities, and then combine their high-layer features to generate the final segmentation maps. We compared the performance of our approach with that of commonly used segmentation methods on a set of manually segmented isointense-phase brain images. The results show that our proposed model significantly outperforms previous methods in terms of accuracy. In addition, our results indicate that fusing high-layer features is a better way of integrating multi-modality images and leads to improved performance.
Index Terms: FCN, multi-modality, brain image, segmentation
1. INTRODUCTION
The first year of life is the most dynamic phase of postnatal human brain development, with rapid tissue growth and the development of a wide range of cognitive and motor functions. Accurate tissue segmentation of infant brain MR images into white matter (WM), gray matter (GM) and cerebrospinal fluid (CSF) in this phase is of great importance for studying normal and abnormal early brain development. It is well known that segmentation of infant brain MRI is considerably more difficult than that of the adult brain, due to reduced tissue contrast [1], increased noise, severe partial volume effects [2], and ongoing WM myelination [1, 3] in infant images. As an illustration, the first two images in Fig. 1 show examples of T1 and T2 images at around 6 months of age. It can be observed that WM and GM exhibit almost the same intensity level (especially in the cortical regions), resulting in extremely low image contrast and hence significant difficulty for tissue segmentation.
Fig. 1.
The original multi-modality data (T1, T2 and FA) of an infant subject scanned at 6 months of age (isointense phase).
Although many methods have been proposed for infant brain image segmentation, most of them focus either on neonatal images (≤3 months) or infant images (>12 months) using a single T1 or T2 modality [4–8], which demonstrate relatively good contrast between WM and GM. Few studies have addressed the difficulties of segmenting isointense-phase images. Shi et al. [9] first proposed a 4D joint registration and segmentation framework for segmenting infant MR images in the first year of life. In this method, longitudinal images in both the infantile and early adult-like phases were used to guide the segmentation of images in the isointense phase. A similar strategy was later adopted in [10]. The major limitation of these methods is that they fully depend on the availability of longitudinal datasets [11]. Considering that the majority of infant images are acquired at a single time point, a standalone method that works on cross-sectional, single-time-point images is highly desirable. Zhang et al. [12] proposed deep convolutional neural networks (CNNs) that learn a hierarchy of increasingly complex features from T1, T2 and FA images for segmentation of isointense-phase brain images. However, their method predicts only the label of the center voxel of each patch during learning. Consequently, it is somewhat sensitive to the patch size, especially for voxels on the WM/GM boundaries. Moreover, their network contains a huge number of parameters, which makes it difficult to converge.
To overcome the above-mentioned difficulties, we propose to employ fully convolutional networks (FCNs) [13] for pixel-level segmentation of infant brain images. FCNs [13] are a special case of convnets that are trained end-to-end, pixel-to-pixel, and have far fewer network parameters. FCNs consist of multiple convolution and pooling layers and use no fully connected layers, which greatly reduces the number of network parameters, simplifies and speeds up learning and inference, and makes the learning problem much easier. FCNs can take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning; they generate dense pixel-wise output by interpolating the coarse output via a deconvolution layer. FCNs have achieved state-of-the-art performance for semantic segmentation on multiple public datasets. Moreover, we employ multi-modality information from T1, T2 and FA images to address the low tissue contrast. Different from the CNN in [12], we train a separate network for each modality and then fuse their features in the higher layers of the networks. This allows the weights and biases of each network to be optimized specifically for its modality and the corresponding kernel sizes.
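For illustration, the following minimal sketch shows the FCN principle described above: convolution and pooling layers without any fully connected layer, followed by a deconvolution (transposed convolution) layer that upsamples the coarse maps back to the input resolution. The layer sizes are arbitrary and PyTorch is used only for exposition; our actual models are described in Section 3 and implemented in Caffe.

```python
# Illustrative FCN sketch: convolution/pooling without fully connected layers,
# followed by a transposed convolution that restores the input resolution and
# yields a dense, pixel-wise class score map. Layer sizes are arbitrary.
import torch
import torch.nn as nn

class ToyFCN(nn.Module):
    def __init__(self, in_channels=1, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # halves spatial resolution
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.upsample = nn.ConvTranspose2d(64, num_classes, 4, stride=2, padding=1)

    def forward(self, x):
        return self.upsample(self.features(x))            # (N, num_classes, H, W)

scores = ToyFCN()(torch.randn(1, 1, 64, 64))              # arbitrary input size
print(scores.shape)                                       # torch.Size([1, 4, 64, 64])
```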
2. SUBJECTS
In our experiment, we acquired T1, T2, and diffusion-weighted MR images of 10 healthy infants using a Siemens 3T head-only MR scanner. T2 images and fractional anisotropy (FA) images, derived from distortion-corrected DWI, were first rigidly aligned with the T1 image and further up-sampled onto an isotropic grid with a resolution of 1 × 1 × 1 mm³. Using the brain mask, we then removed the skull, cerebellum and brain stem from the aligned T2 and FA images as well.
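As an illustration of the up-sampling step only, the sketch below resamples a volume onto a 1 × 1 × 1 mm isotropic grid with linear interpolation. Rigid alignment and skull stripping are not shown; scipy is used purely for exposition (the text does not prescribe a particular tool), and the voxel spacing in the example is hypothetical.

```python
# Sketch of isotropic resampling (illustration only; not the tool used in the paper).
import numpy as np
from scipy.ndimage import zoom

def resample_to_isotropic(volume, spacing_mm, target_mm=1.0, order=1):
    """Resample a 3-D volume with voxel spacing (dz, dy, dx) in mm to an isotropic grid."""
    factors = [s / target_mm for s in spacing_mm]
    return zoom(volume, factors, order=order)     # order=1: linear interpolation

fa = np.random.rand(60, 128, 128)                 # toy FA volume, hypothetical 2 x 1.25 x 1.25 mm
fa_iso = resample_to_isotropic(fa, (2.0, 1.25, 1.25))
print(fa_iso.shape)                               # (120, 160, 160)
```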
To generate the manual segmentations, an initial segmentation was first obtained with a publicly available infant brain segmentation software, iBEAT (Dai et al., 2013). Manual editing was then carefully performed by an experienced rater, guided by the T1, T2 and FA images, to correct possible segmentation errors.
3. METHOD
Deep learning models can learn a hierarchy of features, with high-level features built upon low-level ones. CNNs [16, 17] are a type of deep model in which trainable filters and local neighborhood pooling operations are applied in an alternating sequence, starting from the raw input images. When trained with appropriate regularization, CNNs can achieve superior performance on visual object recognition and image classification tasks [14].
In this paper, we employ FCNs [13] to segment isointense-phase brain images using T1, T2 and FA modality information. Instead of simply combining the three modalities at the level of the original (low-level) feature maps, we propose a deep architecture that effectively fuses their high-level information.
In the following, we will first introduce our FCNs architecture for single modality, and then present multi-FCNs (mFCNs) for multiple modalities to effectively fuse their complementary information. Training details will be provided in Sections 3.3 and 3.4.
3.1 FCNs architecture for single modality
One challenge in using a deep learning framework is the design of the architecture. Inspired by the work of Simonyan and Zisserman [15], we design our single-modality architecture with three groups of convolutional layers and two deconvolutional layers, as shown in Fig. 2. A softmax layer is applied at the top of the network.
Fig. 2.
FCNs architecture for single modality.
The 1st convolutional layer group consists of two convolutional layers followed by a pooling layer, and produces feature maps of size 32 × 32. These feature maps are fed into the 2nd group of three convolutional layers followed by a pooling layer, leading to feature maps of size 16 × 16 × 64. In the 3rd layer group, one convolutional layer with a filter size of 3 × 3 is applied to the outputs of the 2nd layer group. We use a stride of 1 pixel and a padding of 1 pixel for all convolutional layers.
Then, the output feature maps of the 3rd layer group are up-sampled through a deconvolution layer to form the 4th layer group, which has 64 filters of size 4 × 4; thus, 64 feature maps of size 32 × 32 are generated. The 5th layer group applies another deconvolution layer to these feature maps; it consists of 4 filters of size 4 × 4, corresponding to the 4 categories: CSF, GM, WM, and background.
Finally, the top layer consists of softmax units that predict one of the 4 possible labels for each pixel.
The activation functions in our network are all rectified linear units (ReLUs). The network minimizes the softmax loss between the predicted and ground-truth labels. In total, this architecture has 208,548 trainable parameters, about 96% fewer than the 5,332,995 parameters of the network in [12].
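The following PyTorch-style sketch summarizes this single-modality architecture (our implementation uses Caffe [18]). The per-layer filter counts are not all stated explicitly above; the counts assumed here (32, 32, 64, 64, 64, 64) are one configuration consistent with the description and reproduce the reported total of 208,548 trainable parameters.

```python
# Sketch of the single-modality FCN in Fig. 2 (illustration only; the paper's
# implementation is in Caffe [18]). Filter counts marked below are assumptions.
import torch
import torch.nn as nn

class SingleModalityFCN(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        # Group 1: two 3x3 convolutions + pooling, 64x64 -> 32x32
        self.group1 = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(2))
        # Group 2: three 3x3 convolutions + pooling, 32x32 -> 16x16x64
        self.group2 = nn.Sequential(
            nn.Conv2d(32, 64, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(2))
        # Group 3: one 3x3 convolution
        self.group3 = nn.Sequential(
            nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.ReLU())
        # Group 4: deconvolution with 64 filters of size 4x4, 16x16 -> 32x32
        self.deconv1 = nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1)
        # Group 5: deconvolution with 4 filters of size 4x4, 32x32 -> 64x64;
        # the 4 output maps correspond to CSF, GM, WM and background, and the
        # softmax loss is applied on top of them.
        self.deconv2 = nn.ConvTranspose2d(64, num_classes, 4, stride=2, padding=1)

    def forward(self, x):                                  # x: (N, 1, 64, 64)
        x = self.group3(self.group2(self.group1(x)))
        return self.deconv2(torch.relu(self.deconv1(x)))   # (N, 4, 64, 64)

net = SingleModalityFCN()
print(sum(p.numel() for p in net.parameters()))            # 208548
```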
3.2 mFCNs architecture for multiple modalities
To effectively exploit the multiple modalities T1, T2 and FA, we propose a new multi-FCNs (mFCNs) architecture in which one network is trained for each modality and the multi-modality features are then fused at the high layers of the networks.
The deep architectures of both Fig. 2 and Zhang et al.'s CNN [12] are designed to learn complementary information from the T1, T2 and FA images. However, T1, T2 and FA may share little direct complementary information in their original image spaces, owing to their different generation techniques [16]. Srivastava and Salakhutdinov [16] proposed multimodal learning with Deep Boltzmann Machines, in which a separate pathway is trained for each modality, and their experiments showed that this framework yields a very good feature representation for multimodal data. Inspired by their work, we propose a new mFCNs architecture that learns a separate FCN for each modality and then fuses features from the higher layers, based on the assumption that high-level representations of different modalities are more complementary to each other. The new architecture is presented in Fig. 3.
Fig. 3.
Multi-FCNs (mFCNs) architecture for multiple modalities, i.e., T1, T2 and FA.
In the new architecture, the lower-level network in each pathway can be of a different type to account for the different input distributions, and a fusion layer combines the outputs of the different streams. The layer group settings follow the descriptions in Section 3.1.
The intuition behind our model is as follows. The statistical properties of the three modalities are very different, which makes it difficult for a single model (such as the one in Fig. 2) to directly find correlations across modalities. In the new model, this gap is largely bridged by fusing their higher-layer features.
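The sketch below illustrates the mFCNs idea: one convolution-pooling pathway per modality, fused at a high layer before the deconvolution layers. The exact fusion point and operation are not fixed by the description above; concatenation followed by a 1 × 1 convolution after the third convolutional group is assumed here, and PyTorch is again used only for exposition.

```python
# Sketch of the mFCNs in Fig. 3 (illustration only). The fusion layer below
# (concatenation + 1x1 convolution) is an assumption, not the reference design.
import torch
import torch.nn as nn

def pathway():
    # Groups 1-3 of the single-modality FCN (Section 3.1), 64x64 -> 16x16x64.
    return nn.Sequential(
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())

class MultiFCN(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.t1, self.t2, self.fa = pathway(), pathway(), pathway()
        self.fuse = nn.Conv2d(3 * 64, 64, 1)               # assumed fusion layer
        self.deconv1 = nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1)
        self.deconv2 = nn.ConvTranspose2d(64, num_classes, 4, stride=2, padding=1)

    def forward(self, t1, t2, fa):
        # High-layer features of the modality-specific streams are fused, instead
        # of stacking the raw T1/T2/FA images at the input.
        h = torch.cat([self.t1(t1), self.t2(t2), self.fa(fa)], dim=1)
        h = torch.relu(self.fuse(h))
        return self.deconv2(torch.relu(self.deconv1(h)))   # (N, 4, 64, 64)

x = torch.randn(1, 1, 64, 64)
print(MultiFCN()(x, x, x).shape)                           # torch.Size([1, 4, 64, 64])
```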
3.3 Weighting the Loss
Since the amount of data differs greatly across categories, we apply a class-balancing strategy during training. Specifically, we weight the loss: the loss of each class is weighted inversely proportional to the fraction of training samples belonging to that class, as sketched below.
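A minimal sketch of this weighting scheme follows, assuming per-class weights computed as the inverse of each class's frequency in the training labels and plugged into a weighted softmax (cross-entropy) loss; the label ordering and the use of PyTorch's CrossEntropyLoss are illustrative assumptions.

```python
# Class-balanced loss weighting: weights inversely proportional to class frequency.
# Label ordering (0=background, 1=CSF, 2=GM, 3=WM) is assumed for illustration.
import torch
import torch.nn as nn

def class_weights(labels, num_classes=4):
    counts = torch.bincount(labels.flatten(), minlength=num_classes).float()
    counts = counts.clamp(min=1)                  # guard against empty classes
    weights = counts.sum() / counts               # inverse class frequency
    return weights * num_classes / weights.sum()  # normalize to mean 1

labels = torch.randint(0, 4, (8, 64, 64))         # toy label patches
criterion = nn.CrossEntropyLoss(weight=class_weights(labels))
scores = torch.randn(8, 4, 64, 64, requires_grad=True)
loss = criterion(scores, labels)                  # class-weighted per-pixel loss
```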
3.4 Training of FCNs
FCNs typically expect full images as input in order to have a broad receptive field, but the dataset in our experiment is very small. We therefore trade receptive field for dataset size by extracting patches of size 64 × 64 from both the original images and the manually segmented images.
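The sampling scheme (patch stride, overlap, and handling of background-only patches) is not specified above; the sketch below assumes a regular grid with a configurable stride and skips patches that contain only background.

```python
# Sketch of 64x64 patch extraction from an image slice and its label map
# (the grid stride and background filtering are assumptions).
import numpy as np

def extract_patches(image, label, patch_size=64, stride=32):
    """Return (image_patch, label_patch) pairs from a 2-D slice and its label map."""
    pairs = []
    h, w = image.shape
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            lab_p = label[y:y + patch_size, x:x + patch_size]
            if np.any(lab_p > 0):                 # skip patches that are all background
                pairs.append((image[y:y + patch_size, x:x + patch_size], lab_p))
    return pairs

slice_t1 = np.random.rand(192, 192)               # toy 2-D slice
slice_lab = np.random.randint(0, 4, (192, 192))   # toy label map
patches = extract_patches(slice_t1, slice_lab)
```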
With the deep architecture shown in Fig. 3, we need to set appropriate model hyper-parameters. We initialize the weights using the Xavier algorithm [17], which automatically determines the scale of initialization based on the numbers of input and output neurons, and we initialize the network biases to 0. We perform a coarse line search to determine the initial learning rate and the weight decay parameter, and decrease the learning rate during training. We train the proposed model using Caffe [18], a commonly used deep-learning framework.
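The following sketch mirrors this training setup (Xavier-initialized weights, zero biases, SGD with weight decay and a decaying learning rate). The stand-in network and all numeric values are placeholders rather than the settings actually used; training is performed in Caffe [18].

```python
# Training-setup sketch: Xavier initialization, zero biases, SGD + weight decay,
# step-wise learning-rate decay. Placeholder values, not the paper's settings.
import torch.nn as nn
import torch.optim as optim

def init_weights(m):
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.xavier_uniform_(m.weight)   # Xavier initialization [17]
        nn.init.zeros_(m.bias)              # biases initialized to 0

# Tiny stand-in for the mFCNs of Section 3.2, so this snippet is self-contained.
model = nn.Sequential(
    nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 4, 4, stride=2, padding=1))
model.apply(init_weights)

optimizer = optim.SGD(model.parameters(), lr=0.01,            # placeholder values
                      momentum=0.9, weight_decay=0.0005)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
# During training: optimizer.step() per batch, scheduler.step() per epoch.
```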
4. EXPERIMENTAL RESULTS
In the experiments, we focused on evaluating our proposed deep architectures for segmenting the three types of infant brain tissues. We formulated the prediction of brain tissue classes as a four-class classification task (WM, GM, CSF and background). As the dataset and task are the same as in [12], we use their reported performance for comparison. To test the effectiveness of the proposed mFCNs, we also compared them with an FCN model that takes the three stacked modalities as input feature maps. As in the CNN work [12], we validated our method with a leave-one-out strategy.
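Schematically, the leave-one-out protocol can be written as below; train_model and evaluate_dice are hypothetical stand-ins for the training procedure of Section 3.4 and the Dice evaluation of this section, and the subject IDs are likewise hypothetical.

```python
# Leave-one-out validation schematic: hold out each subject in turn, train on the
# rest, evaluate on the held-out subject. Both helpers are hypothetical stubs.
def train_model(train_subjects):
    ...  # train the mFCNs on patches from these subjects (Section 3.4)

def evaluate_dice(model, subject):
    ...  # per-tissue Dice ratios on the held-out subject
    return {"CSF": 0.0, "GM": 0.0, "WM": 0.0}

subjects = [f"sub-{i:02d}" for i in range(1, 11)]   # hypothetical subject IDs
results = {s: evaluate_dice(train_model([t for t in subjects if t != s]), s)
           for s in subjects}
```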
To qualitatively demonstrate the advantage of the proposed mFCNs, we first show the segmentation results for the different tissues of one subject in Fig. 4.
Fig. 4.
Comparison of the segmentation results of FCNs and mFCNs with manual ground-truth on the subject shown in Fig. 1.
To quantitatively evaluate segmentation performance, we use the Dice ratio (DR) to measure the overlap between the automated and manual segmentation results (a sketch of this computation is given after Table 1). We report the segmentation performance under leave-one-out validation in Table 1. We can observe that both FCNs and mFCNs outperformed the other methods. In general, mFCNs outperformed FCNs, especially for GM and CSF segmentation. Specifically, mFCNs achieved average Dice ratios of 0.852 for CSF, 0.873 for GM, and 0.887 for WM over the 8 subjects. In contrast, FCNs achieved average Dice ratios of 0.838, 0.861, and 0.885 for CSF, GM and WM, respectively. Zhang et al. [12] reported that their CNN achieved average Dice ratios of 0.835, 0.852 and 0.864 for CSF, GM and WM, respectively.
Table 1.
Segmentation performance in terms of Dice ratio achieved by CNN [12], random forest (RF), integration of multi-modality representation and geometrical constraint (RGC) [19], FCNs, and mFCNs. The highest mean Dice ratio for each tissue class is shown in bold.
| Tissue | Method | Sub.1 | Sub.2 | Sub.3 | Sub.4 | Sub.7 | Sub.8 | Sub.9 | Sub.10 | Mean(std) |
|---|---|---|---|---|---|---|---|---|---|---|
| CSF | CNN | .832 | .831 | .830 | .837 | .848 | .849 | .821 | .834 | .835(.009) |
| CSF | RF | .819 | .813 | .832 | .809 | .830 | .846 | .790 | .795 | .817(.019) |
| CSF | RGC | .760 | .795 | .780 | .792 | .788 | .762 | .815 | .774 | .783(.018) |
| CSF | FCNs | .835 | .843 | .840 | .852 | .835 | .834 | .822 | .839 | .838(.009) |
| CSF | mFCNs | .844 | .850 | .796 | .861 | .860 | .870 | .869 | .866 | **.852(.024)** |
| GM | CNN | .853 | .857 | .885 | .818 | .812 | .865 | .863 | .861 | .852(.025) |
| GM | RF | .829 | .848 | .877 | .808 | .798 | .850 | .846 | .835 | .836(.025) |
| GM | RGC | .835 | .862 | .870 | .842 | .820 | .821 | .862 | .852 | .846(.019) |
| GM | FCNs | .856 | .874 | .894 | .843 | .829 | .854 | .864 | .876 | .861(.020) |
| GM | mFCNs | .869 | .884 | .909 | .847 | .834 | .873 | .874 | .891 | **.873(.024)** |
| WM | CNN | .880 | .812 | .882 | .849 | .869 | .868 | .874 | .876 | .864(.023) |
| WM | RF | .861 | .782 | .869 | .837 | .848 | .858 | .839 | .835 | .841(.027) |
| WM | RGC | .886 | .859 | .868 | .891 | .888 | .864 | .876 | .875 | .876(.011) |
| WM | FCNs | .892 | .863 | .891 | .885 | .897 | .884 | .876 | .893 | .885(.011) |
| WM | mFCNs | .897 | .869 | .931 | .870 | .876 | .884 | .869 | .896 | **.887(.021)** |
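For completeness, a sketch of the Dice ratio computation is given below: DR = 2|A ∩ B| / (|A| + |B|) for each tissue class, where A and B are the automated and manual segmentations; the label encoding in the example is assumed.

```python
# Dice ratio (DR) per tissue class between automated and manual segmentations.
# Label encoding (1=CSF, 2=GM, 3=WM) is assumed for illustration.
import numpy as np

def dice_ratio(auto_seg, manual_seg, label):
    a = (auto_seg == label)
    b = (manual_seg == label)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

auto = np.random.randint(0, 4, (64, 64, 64))      # toy automated label volume
manual = np.random.randint(0, 4, (64, 64, 64))    # toy manual label volume
for name, lab in [("CSF", 1), ("GM", 2), ("WM", 3)]:
    print(name, dice_ratio(auto, manual, lab))
```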
5. CONCLUSION
In this paper, we proposed mFCNs to segment isointense-phase brain images using three modality images (T1, T2 and FA). We trained the mFCNs end-to-end, pixel-to-pixel, to achieve pixel-level segmentation. With a greatly reduced number of parameters, our method is easier and faster to train. We compared the proposed method with several commonly used segmentation methods, and the results show that it outperforms previous methods on isointense-phase brain segmentation.
References
1. Weisenfeld NI, Warfield SK. Automatic segmentation of newborn brain MRI. Neuroimage. 2009;47(2):564–72. doi: 10.1016/j.neuroimage.2009.04.068.
2. Xue H, et al. Automatic segmentation and reconstruction of the cortex from neonatal MRI. Neuroimage. 2007;38(3):461–477. doi: 10.1016/j.neuroimage.2007.07.030.
3. Gui L, et al. Morphology-driven automatic segmentation of MR images of the neonatal brain. Med Image Anal. 2012;16(8):1565–79. doi: 10.1016/j.media.2012.07.006.
4. Prastawa M, et al. Automatic segmentation of MR images of the developing newborn brain. Med Image Anal. 2005;9(5):457–66. doi: 10.1016/j.media.2005.05.007.
5. Wang L, et al. LINKS: learning-based multi-source IntegratioN frameworK for Segmentation of infant brain images. Neuroimage. 2015;108:160–72. doi: 10.1016/j.neuroimage.2014.12.042.
6. Wang L, et al. Segmentation of neonatal brain MR images using patch-driven level sets. Neuroimage. 2014;84:141–158. doi: 10.1016/j.neuroimage.2013.08.008.
7. Wang L, et al. Automatic segmentation of neonatal images using convex optimization and coupled level sets. Neuroimage. 2011;58(3):805–817. doi: 10.1016/j.neuroimage.2011.06.064.
8. Warfield SK, et al. Adaptive, template moderated, spatially varying statistical classification. Med Image Anal. 2000;4(1):43–55. doi: 10.1016/s1361-8415(00)00003-7.
9. Shi F, et al. Spatial-temporal constraint for segmentation of serial infant brain MR images. Medical Imaging and Augmented Reality. 2010;6326:42–50.
10. Wang L, et al. 4D multi-modality tissue segmentation of serial infant images. PLoS One. 2012;7(9):e44596. doi: 10.1371/journal.pone.0044596.
11. Kim SH, et al. Adaptive prior probability and spatial temporal intensity change estimation for segmentation of the one-year-old human brain. J Neurosci Methods. 2013;212(1):43–55. doi: 10.1016/j.jneumeth.2012.09.018.
12. Zhang W, et al. Deep convolutional neural networks for multi-modality isointense infant brain image segmentation. Neuroimage. 2015;108:214–24. doi: 10.1016/j.neuroimage.2014.12.061.
13. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. arXiv preprint arXiv:1411.4038. 2014. doi: 10.1109/TPAMI.2016.2572683.
14. LeCun Y, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998;86(11):2278–2324.
15. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 2014.
16. Srivastava N, Salakhutdinov R. Multimodal learning with Deep Boltzmann Machines. Journal of Machine Learning Research. 2014;15:2949–2980.
17. Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. International Conference on Artificial Intelligence and Statistics. 2010.
18. Jia Y, et al. Caffe: Convolutional architecture for fast feature embedding. Proceedings of the ACM International Conference on Multimedia. ACM; 2014.
19. Wang L, et al. Integration of sparse multi-modality representation and anatomical constraint for isointense infant brain MR image segmentation. Neuroimage. 2014;89:152–64. doi: 10.1016/j.neuroimage.2013.11.040.