Abstract
Compared to computed tomography (CT), magnetic resonance imaging (MRI) delineation of craniomaxillofacial (CMF) bony structures avoids harmful radiation exposure. However, bony boundaries are blurry in MRI, and structural information needs to be borrowed from CT during training. This is challenging since paired MRI-CT data are typically scarce. In this paper, we propose to make full use of unpaired data, which are typically abundant, along with a single MRI-CT pair to construct a one-shot generative adversarial model for automated MRI segmentation of CMF bony structures. Our model consists of a cross-modality image synthesis sub-network, which learns the mapping between CT and MRI, and an MRI segmentation sub-network. These two sub-networks are trained jointly in an end-to-end manner. Moreover, in the training phase, a neighbor-based anchoring method is proposed to reduce the ambiguity problem inherent in cross-modality synthesis, and a feature-matching-based semantic consistency constraint is proposed to encourage segmentation-oriented MRI synthesis. Experimental results demonstrate the superiority of our method, both qualitatively and quantitatively, in comparison with state-of-the-art MRI segmentation methods.
Keywords: Craniomaxillofacial Bone Segmentation, MRI, Generative Adversarial Learning, One-Shot Learning
I. Introduction
Accurate segmentation of facial bony structures to construct a precise skeletal model is important in surgical planning for patients with craniomaxillofacial (CMF) deformities, e.g., congenital defects and trauma [1], [2]. Computed tomography (CT) is commonly used for delineating bony structures; however, it exposes patients to harmful radiation. A safer, radiation-free imaging modality, e.g., magnetic resonance imaging (MRI), is therefore desirable. Unfortunately, accurate annotation of bony structures based on MRI is challenging even for experienced experts, due to 1) unclear bony boundaries, 2) low signal-to-noise ratio, and 3) partial volume effects.
Deep convolutional neural networks (CNNs) have achieved great success in segmentation tasks [3]–[8] due to their ability to learn high-level task-specific imaging features. Existing CNN methods for automatic MRI annotation of bony structures usually transfer detailed bony structural information from CT to guide the training of MRI segmentation networks. Nie et al. [9] proposed a cascaded framework (i.e., a 3D U-Net followed by a cascaded CNN) for segmenting bony structures from MRI images using bone annotations with reference to CT images. Zhao et al. [10] proposed to first estimate a realistic CT image from an MRI image, and then combine them to provide complementary information for the segmentation of CMF bony structures. In these approaches, accuracy and generalizability are greatly limited by the lack of MRI-CT training pairs, since only CT is routinely acquired for patients with CMF deformities.
In this work, we propose a model that makes full use of both paired and unpaired MRI-CT data for MRI segmentation of CMF bony structures. We design an end-to-end deep neural network that combines one-shot learning [11] and generative adversarial learning [12]. Our network consists of two components: 1) a cross-modality image synthesis sub-network and 2) an MRI segmentation sub-network. Inspired by generative adversarial networks (GANs) [13]–[15], our cross-modality synthesis sub-network extends CycleGAN [13] with a one-shot learning strategy. Unlike CycleGAN training, which is unsupervised and uses only unpaired data, our synthesis sub-network is trained in a semi-supervised manner using a single pair of MRI-CT images via one-shot learning. Based on the paired data, we use a novel neighbor-based anchoring method to narrow down the space of possible cross-modality mappings for unpaired images, thus alleviating the intrinsic ambiguity problem of CycleGAN [13]. Moreover, a feature-matching-based semantic consistency constraint is designed to encourage high-level semantic consistency between real and synthetic MRI images, i.e., to improve the effectiveness of synthetic MRI images for CMF bony structure segmentation. Both qualitative and quantitative experimental results demonstrate that our method can effectively improve segmentation performance compared with state-of-the-art segmentation models.
The main contributions of this paper are three-fold:
We propose a one-shot generative adversarial learning framework to address the problem of limited paired training data. Specifically, we design a semi-supervised synthesis sub-network to learn the cross-modality mappings between MRI and CT, in which synthetic MRI images are generated to augment the training set, and CT structural information is transferred to guide the training of the MRI segmentation sub-network. The experimental results demonstrate that the proposed method can effectively improve segmentation performance in terms of both the Dice Similarity Coefficient (DSC) and the Average Symmetric Surface Distance (ASSD).
To reduce ambiguity in cross-modality synthesis, we propose a novel neighbor-based anchoring method based on one-shot learning to effectively reduce the space of possible translation mappings for unpaired images.
To ensure the effectiveness of synthetic images for the segmentation task, a feature-matching-based consistency constraint is proposed to encourage high-level semantic consistency between real and synthetic images.
The rest of the paper is organized as follows. In Section II, we briefly review previous studies on one-shot learning and adversarial image synthesis. Then, the proposed method is described in detail in Section III, followed by the experiments and results in Section IV. Finally, conclusion and discussion are presented in Section V.
II. Related Works
A. One-Shot Learning
One-shot learning, or more generally, few-shot learning, aims to improve the generalization capacity of a learning model when only a few or even a single training sample is available [11]. The key idea is to make full use of the knowledge learned from unlabeled data or other classification tasks to assist model learning with only a few labeled data. Over the past decades, one-shot learning has found numerous applications in computer vision tasks where labeled data are very limited. For example, Li et al. [11] successfully used a handful of training samples to develop a generative object category model in a Bayesian learning framework. Wan et al. [16] proposed a bag-of-features model to recognize gestures using only one training sample per class. Shaban et al. [17] proposed a two-branch approach for semantic image segmentation, in which only a single image with its corresponding pixel-level annotations is required for each class. One/few-shot learning has also contributed to medical image analysis, where training samples are even more limited, considering that ground-truth annotation is not only time-consuming but also relies on domain knowledge and skills. Recently, Mondal et al. [18] applied GANs to MRI brain tissue segmentation using unlabeled data and only a few labeled data. However, this method synthesizes images from random noise to assist the segmentation task in an adversarial learning manner, which does not address the label limitation problem (i.e., the labels utilized in training are still scarce). In contrast, our method synthesizes images from another modality (i.e., CT) to obtain extra annotation information and thereby improve segmentation.
B. Adversarial Image Synthesis
As a form of unsupervised learning, generative adversarial networks (GANs) consist of two components that compete with each other, i.e., a generative network that generates synthetic data, and a discriminative network that differentiates between synthetic and real data. GANs have been successfully applied in many studies to provide extra discrimination-based supervision [19]–[23], and the synthetic images generated by GANs are often used as supplementary training data [24]–[28] to improve the generalization capacity of trained models. For example, Zhu et al. [13] proposed a cycle-consistent adversarial network (CycleGAN) to tackle the image-to-image translation problem, which requires only unpaired training images. Xiang et al. [15] introduced a novel cross-modality medical image synthesis method that can be trained using only unpaired MRI-CT data. Zhang et al. [29] proposed to utilize the synthetic data generated by a cross-modality synthesis model to improve image segmentation. Huo et al. [30] proposed a CycleGAN-based approach in which synthetic data are used to train a segmentation network for a target imaging modality without ground-truth annotations. These methods demonstrate that CycleGAN-based cross-modality image synthesis is promising and can also assist image segmentation. However, as these methods use only unpaired images during training, the cycle-consistency constraint cannot guarantee a reliable synthetic result (i.e., there exist numerous feasible mappings that satisfy the cycle-consistency constraint), thus leading to ambiguity.
In this work, using unpaired data and a single pair of MRI-CT images, we combine one-shot learning and generative adversarial learning to develop an end-to-end deep neural network for CMF bony structure segmentation from MRI images. The ambiguity problem in cross-modality image synthesis is tackled by a neighbor-based anchoring method, and the effectiveness of synthetic images for the segmentation task is improved by a feature-matching-based semantic consistency constraint.
III. Methods
In this section, we introduce the proposed method in detail. We first present an overview of the proposed method in Section III-A, and then elaborate the architecture of our network in Section III-B. In Sections III-C and III-D, we respectively describe the proposed neighbor-based anchoring method and feature-matching-based semantic consistency constraint for one-shot generative adversarial learning. Finally, the implementation details of our proposed method are presented in Section III-E.
A. Overview
In this work, we aim to automate MRI bony structure segmentation, overcoming challenges associated with unclear bony boundaries in MRI and limited ground-truth annotations. We propose a deep neural network consisting of two parts, i.e., a cross-modality synthesis sub-network and a segmentation sub-network (see Fig. 1). We establish the cross-modality mappings between MRI and CT through generative adversarial learning, which not only generates realistic MRI images from CT images to augment the training data for the subsequent segmentation sub-network, but also implicitly transfers CT bony structural information to assist the segmentation in the MRI space.
Fig. 1.
The framework of the proposed method. The cross-modality synthesis sub-network learns mappings between MRI and CT using both paired and unpaired MRI-CT patches. The synthetic MRI patches from unpaired CT along with the real MRI patches from paired dataset are then used to train the segmentation sub-network.
Specifically, the sub-network for cross-modality image synthesis consists of two generators and two discriminators. The generator GC→M learns to synthesize an MRI image for each input CT image. In line with the original CycleGAN, the generator GM→C learns the complementary inverse translation (i.e., from MRI to CT). These two translations form a cycle that enforces the cycle-consistency constraint for unpaired data synthesis. The two discriminators, DM and DC, are trained to distinguish between real and synthetic MRI images and CT images, respectively. Based on the synthesis sub-network, the segmentation sub-network S learns to automatically annotate the bony structures of any input MRI image. The synthesis and segmentation sub-networks can be jointly optimized in an end-to-end manner, and their detailed structures are elaborated in Section III-B.
Notably, our model is trained in a different and more effective way than the original CycleGAN-based models. First, in contrast to the original CycleGAN, which uses only unpaired multi-modal data, we make full use of both unpaired data and a single pair of MRI-CT images to learn more precise cross-modality image synthesis. Second, in this one-shot learning setting, we propose a neighbor-based anchoring method to reduce the ambiguity problem during cross-modality synthesis, and a feature-matching-based semantic consistency constraint to improve the effectiveness of synthetic images for the segmentation task. These two critical components are introduced in Sections III-C and III-D, respectively.
A total of seven losses are designed to jointly optimize all components of our network; they are introduced in detail in the following subsections. The symbols used in the following sections are summarized in Table I.
TABLE I.
Summary of Symbols
Symbol | Meaning |
---|---|
M_p, C_p | paired MRI and CT datasets |
M_u, C_u | unpaired MRI and CT datasets |
M | MRI dataset |
C | CT dataset |
B_p | bone annotation dataset corresponding to C_p |
B_u | bone annotation dataset corresponding to C_u |
GM→C | MRI-to-CT generator |
GC→M | CT-to-MRI generator |
DM | MRI discriminator |
DC | CT discriminator |
S | Segmentor |
B. Network
In this subsection, we introduce the architectures of our synthesis and segmentation sub-networks, as well as the loss functions used to train them.
1). Generators:
Fig. 2 illustrates the architecture of the generator. The two generators (i.e., GC→M and GM→C) share the same architecture, a U-Net-like [5] 3D fully convolutional network (FCN) composed of: 1) an encoder, 2) a decoder, 3) a series of residual blocks [31] that bridge the encoder and decoder, and 4) several long-range skip connections that link the encoder and decoder layers at the same resolution level. The encoder takes 3D patches (size: 32×32×32) in the source domain as input. It consists of three blocks, where each encoder block is made up of two 3×3×3 convolutional layers with strides 1 and 2, respectively. The numbers of convolutional filters for the three encoder blocks are 32, 64, and 128, respectively. The decoder is symmetric to the encoder; it consists of three blocks and an output layer that produces 3D patches (size: 32×32×32) in the target domain. Each decoder block contains a deconvolutional layer [32] followed by a convolutional layer with the same number of filters. Specifically, the numbers of filters for the three decoder blocks are 128, 64, and 32, respectively. A total of three residual blocks [31] bridge the encoder and decoder to improve the nonlinear modeling capability of our generators. Similar to U-Net [5], we also use long-range shortcuts to connect the encoder and decoder layers at the same resolution level to achieve smooth results and fast convergence. In addition, instance normalization [33] and ReLU [34] are applied after each convolutional and deconvolutional layer, except for the output layer after the last decoder block, where a tanh activation is used to normalize the output.
Fig. 2.
The architecture of the generator.
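For concreteness, the following is a minimal PyTorch sketch of a generator with the layout described above (three encoder blocks with 32/64/128 filters, three residual blocks, a symmetric 128/64/32 decoder, instance normalization and ReLU, and a tanh output). The residual-block design and the use of channel concatenation for the long-range skip connections are our assumptions, as the paper does not specify them.

```python
import torch
import torch.nn as nn

def enc_block(in_ch, out_ch):
    # Two 3x3x3 convolutions with strides 1 and 2 (the second halves the resolution),
    # each followed by instance normalization and ReLU.
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, 3, 1, 1), nn.InstanceNorm3d(out_ch), nn.ReLU(True),
        nn.Conv3d(out_ch, out_ch, 3, 2, 1), nn.InstanceNorm3d(out_ch), nn.ReLU(True))

def dec_block(in_ch, out_ch):
    # A deconvolution (upsampling by 2) followed by a convolution with the same filter count.
    return nn.Sequential(
        nn.ConvTranspose3d(in_ch, out_ch, 2, 2), nn.InstanceNorm3d(out_ch), nn.ReLU(True),
        nn.Conv3d(out_ch, out_ch, 3, 1, 1), nn.InstanceNorm3d(out_ch), nn.ReLU(True))

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(ch, ch, 3, 1, 1), nn.InstanceNorm3d(ch), nn.ReLU(True),
            nn.Conv3d(ch, ch, 3, 1, 1), nn.InstanceNorm3d(ch))

    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    """U-Net-like 3D FCN: 32/64/128 encoder, 3 residual blocks, 128/64/32 decoder, tanh output."""
    def __init__(self, in_ch=1, out_ch=1):
        super().__init__()
        self.e1, self.e2, self.e3 = enc_block(in_ch, 32), enc_block(32, 64), enc_block(64, 128)
        self.res = nn.Sequential(*[ResBlock(128) for _ in range(3)])
        self.d1 = dec_block(128, 128)
        self.d2 = dec_block(128 + 64, 64)   # + 64 channels from the long skip to encoder level 2
        self.d3 = dec_block(64 + 32, 32)    # + 32 channels from the long skip to encoder level 1
        self.out = nn.Sequential(nn.Conv3d(32, out_ch, 3, 1, 1), nn.Tanh())

    def forward(self, x):                        # x: (B, 1, 32, 32, 32)
        e1 = self.e1(x)                          # (B, 32, 16, 16, 16)
        e2 = self.e2(e1)                         # (B, 64, 8, 8, 8)
        e3 = self.e3(e2)                         # (B, 128, 4, 4, 4)
        d1 = self.d1(self.res(e3))               # (B, 128, 8, 8, 8)
        d2 = self.d2(torch.cat([d1, e2], 1))     # (B, 64, 16, 16, 16)
        d3 = self.d3(torch.cat([d2, e1], 1))     # (B, 32, 32, 32, 32)
        return self.out(d3)                      # intensities normalized to [-1, 1]
```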
2). Discriminators:
Similar to the encoder of the generators, the discriminator also contains three encoding blocks. Each encoding block consists of two convolutional layers with strides 1 and 2, respectively, gradually reducing the size of the feature maps. Following these encoding blocks, a convolutional layer (32 kernels of size 3×3×3) and a fully connected layer are appended to generate a single-element output that indicates whether the input patch is synthetic or real. Different from the generators, and following the suggestion in [36], we adopt Leaky ReLU [35] with a slope of 0.2 as the activation function for each convolutional layer in the discriminators.
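A corresponding PyTorch sketch of the discriminator is given below; the filter counts of the three encoding blocks (assumed to mirror the generator's encoder) and the absence of normalization layers are our assumptions.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Three encoding blocks, an extra 3x3x3 convolution with 32 kernels, and a
    fully connected layer producing a single real-vs-synthetic score."""
    def __init__(self, in_ch=1):
        super().__init__()
        def block(i, o):
            # Two convolutions with strides 1 and 2; Leaky ReLU (slope 0.2) after each.
            return nn.Sequential(
                nn.Conv3d(i, o, 3, 1, 1), nn.LeakyReLU(0.2, True),
                nn.Conv3d(o, o, 3, 2, 1), nn.LeakyReLU(0.2, True))
        self.features = nn.Sequential(
            block(in_ch, 32), block(32, 64), block(64, 128),        # 32^3 -> 4^3
            nn.Conv3d(128, 32, 3, 1, 1), nn.LeakyReLU(0.2, True))
        self.fc = nn.Linear(32 * 4 * 4 * 4, 1)                      # one output element

    def forward(self, x):                                           # x: (B, 1, 32, 32, 32)
        return self.fc(self.features(x).flatten(1))                 # unnormalized score (logit)
```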
3). Segmentor:
The segmentation sub-network has a structure very similar to that of the generators, except for the activation after the last convolutional layer. Specifically, our segmentor is a 3D FCN with a U-Net architecture, as shown in Fig. 2, where a softmax activation is applied after the last convolutional layer to predict the probability of each voxel belonging to bone or background. During training, it processes both the real MRI patches (from the paired MRI-CT data) and the synthetic MRI patches (generated from the unpaired CT) to estimate the corresponding bony structures, using the bone annotations on CT images as the ground truth. After training, our segmentation sub-network takes 3D patches from any unseen MRI image as input and outputs the corresponding CMF bony structures.
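Since the segmentor shares the generator backbone, it can be sketched by simply replacing the output head of the Generator class above. Deriving it by subclassing, and having it return two-class logits (with softmax deferred to inference and cross-entropy applied to the logits during training), are implementation choices of ours.

```python
class Segmentor(Generator):
    """Same U-Net-like backbone as the generators, but with a two-class output head
    (bone vs. background). The head returns logits; softmax is applied at inference."""
    def __init__(self):
        super().__init__(in_ch=1, out_ch=2)
        self.out = nn.Conv3d(32, 2, 3, 1, 1)     # replaces the tanh head of the generator
```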
4). Loss Functions:
Using both paired and unpaired MRI-CT images, our synthesis and segmentation sub-networks (i.e., two generators, two discriminators, and a segmentor) are trained alternately by minimizing several general loss functions. Specifically, to train the cross-modality generators, we apply the cycle-consistency constraint on unpaired training data, forming the indirect supervision proposed in the original CycleGAN [13]:
$$\mathcal{L}_{cyc} = \mathbb{E}_{x \in C_u}\big[\,\|G_{M\to C}(G_{C\to M}(x)) - x\|_1\big] + \mathbb{E}_{y \in M_u}\big[\,\|G_{C\to M}(G_{M\to C}(y)) - y\|_1\big] \qquad (1)$$
where x ∈ C_u and y ∈ M_u denote 3D patches from the unpaired CT set C_u and the unpaired MRI set M_u, respectively. On the other hand, we also apply direct supervision on the paired data, defined as:
$$\mathcal{L}_{pair} = \|G_{C\to M}(x) - y\|_1 + \|G_{M\to C}(y) - x\|_1 \qquad (2)$$
where x ∈ C_p and y ∈ M_p denote the paired CT-MRI patches. Besides the direct and indirect supervision, the adversarial loss generally used in GANs is also included to train the generators, with the parameters of the discriminators fixed:
$$\mathcal{L}_{adv} = \mathbb{E}_{x \in C_u}\big[\log\big(1 - D_M(G_{C\to M}(x))\big)\big] + \mathbb{E}_{y \in M_u}\big[\log\big(1 - D_C(G_{M\to C}(y))\big)\big] \qquad (3)$$
To train the discriminators for both modalities, we use the loss function defined as
$$\mathcal{L}_{D} = -\,\mathbb{E}_{y \in M_u}\big[\log D_M(y)\big] - \mathbb{E}_{x \in C_u}\big[\log\big(1 - D_M(G_{C\to M}(x))\big)\big] - \mathbb{E}_{x \in C_u}\big[\log D_C(x)\big] - \mathbb{E}_{y \in M_u}\big[\log\big(1 - D_C(G_{M\to C}(y))\big)\big] \qquad (4)$$
where the generator parameters are fixed during the training of discriminators.
Both real and synthetic MRI patches are used to train the segmentation sub-network, by minimizing the general cross-entropy loss, defined as
$$\mathcal{L}_{seg}^{real} = \mathrm{CE}\big(S(y),\, b_p\big) \qquad (5)$$
$$\mathcal{L}_{seg}^{syn} = \mathrm{CE}\big(S(G_{C\to M}(x)),\, b_u\big) \qquad (6)$$
where $\mathcal{L}_{seg}^{real}$ and $\mathcal{L}_{seg}^{syn}$ represent the segmentation losses on real and synthetic MRI data, respectively, and CE(·, ·) denotes the voxel-wise cross-entropy. b_p ∈ B_p and b_u ∈ B_u are the ground-truth segmentations of bony structures defined on CT, y ∈ M_p denotes the MRI with paired CT, and x ∈ C_u denotes the CT without paired MRI.
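The sketch below shows how losses (1)-(6) can be assembled in PyTorch for one mini-batch, reusing the Generator, Discriminator, and Segmentor classes sketched above. The binary cross-entropy adversarial terms use the non-saturating form common in practice, and the segmentor is assumed to return per-voxel class logits (so that softmax is folded into the cross-entropy); both are our assumptions rather than details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

bce = F.binary_cross_entropy_with_logits   # discriminators output unnormalized scores

def generator_losses(G_c2m, G_m2c, D_m, D_c, S, x_u, y_u, x_p, y_p, b_p, b_u):
    """Losses (1)-(3), (5), (6) for one mini-batch.
    x_*: CT patches, y_*: MRI patches, b_*: CT-defined bone label maps (long tensors);
    subscripts p/u mark patches from the paired/unpaired sets."""
    fake_m, fake_c = G_c2m(x_u), G_m2c(y_u)
    # (1) cycle consistency on unpaired patches
    l_cyc = F.l1_loss(G_m2c(fake_m), x_u) + F.l1_loss(G_c2m(fake_c), y_u)
    # (2) direct supervision on the single paired sample
    l_pair = F.l1_loss(G_c2m(x_p), y_p) + F.l1_loss(G_m2c(y_p), x_p)
    # (3) adversarial loss for the generators (non-saturating form; D parameters are frozen)
    score_m, score_c = D_m(fake_m), D_c(fake_c)
    l_adv = bce(score_m, torch.ones_like(score_m)) + bce(score_c, torch.ones_like(score_c))
    # (5)/(6) voxel-wise cross-entropy on real and synthetic MRI
    l_seg_real = F.cross_entropy(S(y_p), b_p)
    l_seg_syn = F.cross_entropy(S(fake_m), b_u)
    return l_cyc, l_pair, l_adv, l_seg_real, l_seg_syn

def discriminator_loss(D, real, fake):
    # (4): score real patches as 1 and (detached) synthetic patches as 0
    s_real, s_fake = D(real), D(fake.detach())
    return bce(s_real, torch.ones_like(s_real)) + bce(s_fake, torch.zeros_like(s_fake))
```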
However, training the proposed network with the above losses, i.e., (1) to (6), has two main limitations. First, the cycle-consistency loss defined by (1) cannot ensure reliable mappings between unpaired data across different domains. Second, the synthetic MRI images might not be well coordinated with the segmentation sub-network, considering that there is no explicit constraint for segmentation-oriented image synthesis. In the following two subsections, we present our specific designs to deal with these two limitations, respectively.
C. Neighbor-Based Anchoring Method
As discussed above, the original CycleGAN is trained on unpaired data, and using only the cycle-consistency constraint suffers from the ambiguity problem. That is, there may exist numerous feasible mappings satisfying the cycle-consistency requirement. Although we include a pair of MRI-CT images to train our model in the one-shot learning framework, the direct supervision on the paired data defined by (2) has very limited influence on the mappings of unpaired data, especially considering that unpaired training images far outnumber paired images in our case.
To tackle this challenge, based on the guidance provided by the paired MRI-CT data, we propose a neighbor-based anchoring method to mitigate the ambiguity problem of cross-modality mappings learned on unpaired data. Our basic assumption is that the source and target domains should lie on consistent manifolds with similar local geometry. For example, if the images for two different subjects have similar appearance in the source domain, such similarity should also be preserved in the target domain. Based on this assumption, we propose to use paired MRI-CT patches as anchor points to locate the desired synthetic results for unpaired MRI or CT patches.
Specifically, we build two sets of anchor points in the MRI and CT domains, respectively, with one-to-one correspondence between anchor points from different domains. As illustrated in Fig. 3, given any image patch x in the source domain (e.g., CT domain), we first find its K nearest anchor points (e.g., {x1, … , xK }) in the source domain set. These anchor points, called neighbors, have their counterparts (e.g., {y1, … , yK }) in the target domain set (i.e., MRI domain), which are then used to locate the desired output y in the target domain for x, under the constraint that {y1, … , yK } should also form the neighborhood for y. In this way, we effectively reduce the space of possible mappings during synthesis, thus alleviating the ambiguity problem. The loss function to this end is defined as
$$\mathcal{L}_{anchor} = \sum_{i=1}^{K} \omega_i^{x}\,\big\|G_{C\to M}(x) - T(x_i)\big\|_1 + \sum_{i=1}^{K} \omega_i^{y}\,\big\|G_{M\to C}(y) - T(y_i)\big\|_1 \qquad (7)$$
where x and y denote the 3D patches from the unpaired CT and MRI sets, respectively, and x_i and y_i denote their neighbors in the respective domains. For an anchor point in a given domain, the operation T(·) returns the corresponding anchor point in the other domain, e.g., T(x_i) = y_i. The coefficients ω_i^x and ω_i^y balance the importance of the respective neighbors for x and y based on the distances ‖x − x_i‖ and ‖y − y_i‖, respectively, and α is a smoothing parameter.
Fig. 3.
Assuming that patches in the source and target domains lie on manifolds with locally similar geometry. Then given an input x with its neighboring anchor points {x1, x2, x3} in the source domain, its corresponding output y in the target domain can be located near the {y1, y2, y3} (i.e., anchor points in the target domain corresponding to {x1, x2, x3}).
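The CT→MRI term of the anchoring loss can be sketched as follows (the MRI→CT term is symmetric). The distance-based weighting shown here, a normalized exp(−d/α) over the K nearest anchors, is our assumption: the paper states only that the coefficients depend on the distances through the smoothing parameter α.

```python
import torch
import torch.nn.functional as F

def anchoring_loss_ct2mr(G_c2m, x, anchors_ct, anchors_mr, K=3, alpha=1.0):
    """CT->MRI half of Eq. (7).
    x: one unpaired CT patch of shape (1, 32, 32, 32).
    anchors_ct / anchors_mr: corresponding anchor patches from the two domains,
    stacked as tensors of shape (N, 1, 32, 32, 32)."""
    # distances between the unpaired CT patch and every CT-domain anchor
    d = torch.stack([torch.norm(x - a) for a in anchors_ct])
    dist, idx = torch.topk(d, k=K, largest=False)          # K nearest anchors
    w = torch.softmax(-dist / alpha, dim=0)                 # closer anchors weigh more
    y_hat = G_c2m(x.unsqueeze(0)).squeeze(0)                # synthetic MRI for x
    # pull the synthetic result toward the MRI counterparts T(x_i) of the nearest anchors
    return sum(w[i] * F.l1_loss(y_hat, anchors_mr[idx[i]]) for i in range(K))
```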
Furthermore, considering that the very limited pairs of MRI-CT images cannot provide enough anchor points to cover the entire manifolds, we propose to progressively augment the anchor point sets by adding patches from the unpaired dataset together with their synthetic results. Specifically, we initialize the anchor point sets with paired patches randomly cropped from the paired MRI-CT images. Then, during training, we record the distances between the unpaired patches and their nearest anchor points. The 20% of unpaired patches that are farthest from the existing anchor points (together with their synthetic results) are added to the anchor point sets every 5,000 training iterations, so as to progressively expand the coverage of the anchor points on the manifolds.
Finally, to speed up the process of finding neighbors, we adopt a location-based clustering method. Specifically, all MRI and CT images were linearly aligned onto a common template and cropped to the same size. Each linearly-aligned image was then evenly split into 1,000 small rectangular prisms (i.e., 10 sections along each coordinate axis). We regard each small rectangular prism as a cluster: if an unpaired MRI or CT patch falls within a specific cluster, we look for its neighboring anchor patches only within that cluster. In this way, we can roughly constrain each unpaired patch and its neighboring anchors to have consistent anatomical meaning, while effectively reducing the computational complexity compared with searching over the whole scan.
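A minimal sketch of this location-based clustering is shown below: a patch is mapped to one of the 10×10×10 prisms by the position of its center voxel, and anchor search is restricted to patches sharing the same index. The exact binning convention is our assumption.

```python
def cluster_index(center, image_shape, sections=10):
    """Map a patch (by its center voxel) to one of sections^3 rectangular prisms."""
    bins = [min(int(c / s * sections), sections - 1) for c, s in zip(center, image_shape)]
    return bins[0] * sections * sections + bins[1] * sections + bins[2]

# All images are aligned and cropped to 148 x 180 x 175 voxels (Section IV-A), so a patch
# centered at voxel (74, 90, 87) falls into the prism with index:
idx = cluster_index(center=(74, 90, 87), image_shape=(148, 180, 175))   # -> 554
```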
D. Feature-Matching-Based Semantic Consistency
We expect the synthetic results not only to be visually realistic but also to preserve the anatomical structures needed for the segmentation task. However, this is not guaranteed in the original CycleGAN, since the synthetic result can be geometrically distorted, as discussed in [29]. To address this problem, we propose a feature-matching-based semantic consistency constraint that limits the geometric distortion of the synthetic images by utilizing the information from the limited paired data. Specifically, let fS(·) denote the operation that extracts the high-level feature map from the last activation layer (before the final convolutional layer) of the segmentor S. Our semantic consistency constraint forces the semantic difference between the real and synthetic data to be small by minimizing the loss function defined as:
$$\mathcal{L}_{sc} = \big\|f_S(y) - f_S(G_{C\to M}(x))\big\|_1 \qquad (8)$$
where x ∈ C_p and y ∈ M_p denote the paired CT-MRI patches. In this way, the segmentation sub-network can provide critical semantic guidance to regularize the training of the synthesis sub-network, and the two sub-networks can be jointly trained and coordinated with each other.
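In code, the constraint reduces to an L1 match between segmentor features of the real MRI and of the MRI synthesized from its paired CT. The sketch below assumes a hypothetical helper S_features(·) that returns the activation of the segmentor's last layer before the final convolution (e.g., exposed via a forward hook); the paper does not prescribe how f_S is implemented.

```python
def semantic_consistency_loss(S_features, G_c2m, x_p, y_p):
    """Eq. (8): feature-matching between the real paired MRI y_p and the MRI
    synthesized from its paired CT x_p, using the segmentor's high-level features."""
    return F.l1_loss(S_features(G_c2m(x_p)), S_features(y_p))
```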
E. Implementation Details
Due to GPU memory limitation, we processed the 3D images in small sub-regions (i.e., patches) with the size of 32 × 32 × 32 during the training and testing. The three components of our network (i.e., two generators GM→C and GC→M, two discriminators DM and DC, and the segmentor S) were trained in turn at each iteration. To be specific, we first trained the two discriminators DM and DC with the generators and segmentor fixed. Then, we trained the two generators GC→M and GM→C with the discriminators and segmentor fixed. Finally, the segmentor was trained with the synthesis sub-network fixed. The training procedure is detailed in Algorithm 1.
For fast convergence, we first pre-trained each sub-network separately using only the paired training data for 100,000 iterations, and then trained the whole network jointly using all training data for another 100,000 iterations according to Algorithm 1. We also found that training the whole network directly from scratch produces similar results, although in this case the neighbor-based anchoring method should be frozen during the first few iterations to avoid expanding the anchor point sets with low-quality synthetic data. The model was trained with mini-batches of size 4, each consisting of 2 paired and 2 unpaired MRI-CT patches randomly cropped from the paired and unpaired images. We adopted Adam [38] as the optimizer with a learning rate of 10⁻⁴. Other hyper-parameters were empirically set as K = 3, α = 1, λ1 = 0.5, λ2 = 0.05, λ3 = 0.1, λ4 = 0.5, and λ5 = 0.5.
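Algorithm 1 is not reproduced here, but the alternating scheme can be sketched with the helper functions above, assuming the model instances (G_c2m, G_m2c, D_m, D_c, S), the data loader, the anchor sets, and the S_features hook have already been constructed. Which of the reported λ values weights which loss term is not spelled out in the text, so the assignment below is illustrative; each optimizer only updates its own parameters, so explicit parameter freezing is omitted.

```python
import torch

opt_D = torch.optim.Adam(list(D_m.parameters()) + list(D_c.parameters()), lr=1e-4)
opt_G = torch.optim.Adam(list(G_c2m.parameters()) + list(G_m2c.parameters()), lr=1e-4)
opt_S = torch.optim.Adam(S.parameters(), lr=1e-4)
lam1, lam2, lam3, lam4, lam5 = 0.5, 0.05, 0.1, 0.5, 0.5   # illustrative mapping of the lambdas

for x_p, y_p, b_p, x_u, y_u, b_u in loader:                # 2 paired + 2 unpaired patches per batch
    # 1) train the discriminators (generators and segmentor fixed)
    opt_D.zero_grad()
    (discriminator_loss(D_m, y_u, G_c2m(x_u)) +
     discriminator_loss(D_c, x_u, G_m2c(y_u))).backward()
    opt_D.step()

    # 2) train the generators (discriminators and segmentor fixed)
    opt_G.zero_grad()
    l_cyc, l_pair, l_adv, _, l_seg_syn = generator_losses(
        G_c2m, G_m2c, D_m, D_c, S, x_u, y_u, x_p, y_p, b_p, b_u)
    l_anchor = anchoring_loss_ct2mr(G_c2m, x_u[0], anchors_ct, anchors_mr)
    l_sc = semantic_consistency_loss(S_features, G_c2m, x_p, y_p)
    (l_cyc + lam1 * l_pair + lam2 * l_adv +
     lam3 * l_anchor + lam4 * l_sc + lam5 * l_seg_syn).backward()
    opt_G.step()

    # 3) train the segmentor (synthesis sub-network fixed)
    opt_S.zero_grad()
    (F.cross_entropy(S(y_p), b_p) +
     F.cross_entropy(S(G_c2m(x_u).detach()), b_u)).backward()
    opt_S.step()
```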
IV. Experiments and Results
In this section, we evaluate the effectiveness of the proposed method both qualitatively and quantitatively. We first describe the detailed experimental settings in Section IV-A. We then evaluate the performance of the proposed method on cross-modality image synthesis and MRI bony structure segmentation in Sections IV-B and IV-C, respectively. Finally, an ablation study is conducted in Section IV-D to analyze the effects of the proposed neighbor-based anchoring method and feature-matching-based semantic consistency constraint.
A. Experimental Settings
The dataset used in our experiments consisted of 8 pairs of MRI-CT scans, 50 unpaired MRI scans, and 50 unpaired CT scans. All MRI scans were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset [39] and acquired with a Siemens Trio TIM 3T scanner with a voxel size of 1.2 × 1.2 × 1 mm³, TE = 2.09 ms, TR = 2300 ms, and flip angle = 9°. The CT scans in the paired subset were also obtained from the ADNI dataset [39] and acquired with a Siemens Somatom scanner with a voxel size of 0.59 × 0.59 × 3 mm³. The unpaired CT scans were from the CQ500 dataset [40], collected from different centers using various scanners with voxel sizes around 0.45 × 0.45 × 0.65 mm³ (refer to [40] for more details). All images were linearly aligned onto a common template using FLIRT (FMRIB's Linear Image Registration Tool) [41] and cropped to the same size (i.e., 148 × 180 × 175). The intensities of all images were linearly scaled into the range of [−1, 1]. Fig. 4 shows an example of the paired and unpaired MRI-CT images used in this study.
Fig. 4.
An example of paired and unpaired MRI-CT data.
The ground-truth annotations of bony structures were defined on the CT scans by intensity thresholding and further refined by an expert as needed. The proposed method was evaluated in a one-shot setting, i.e., we trained and validated our model 8 times, corresponding to the 8 pairs of MRI-CT images. In each experiment, only a single pair of MRI-CT images, along with all unpaired images, was used for training, leaving the remaining 7 pairs for testing.
B. Synthesis Performance
We qualitatively analyzed the cross-modality image synthesis performance of the proposed method, and compared it with the Baseline method, i.e., training GC→M and GM→C separately using only paired data. Moreover, we also trained the CycleGAN [13] on our unpaired MRI-CT data for comparison.
Fig. 5 shows the representative synthetic results of MRI-to-CT and CT-to-MRI translations, where the five columns represent (a) the input, (b) the results of the Baseline method, (c) the results of CycleGAN [13], (d) the results of the proposed method, and (e) the ground truth. The synthetic MRI and CT images generated by the Baseline method tend to be blurry and contain more artifacts, implying that the generator can hardly generalize well when trained on only limited paired data. On the other hand, the CycleGAN method yielded unsatisfactory synthesis results, mainly because using solely the cycle-consistent constraint on unpaired data cannot provide enough contextual guidance to ensure reliable cross-modality synthesis, while small image patches with limited global information further exacerbated this issue. In contrast, our method is able to generate more realistic MRI and CT outputs. This can be attributed to the utilization of both unpaired and paired training data to better learn the data distribution in the target domain through one-shot generative adversarial learning.
Fig. 5.
Visual results of cross-modality image synthesis.
Fig. 5 also shows that synthesizing MRI from CT is more challenging than synthesizing CT from MRI. This is expected as soft tissues are much less visible on the CT images, making CT-to-MRI translation a highly ill-posed problem. However, it is worth noting that CMF surgery is a bony surgical procedure and soft-tissue information is unnecessary. Our method allows precise synthesis of bony structures (e.g., marked by the red boxes) and effective transfer of CT information to guide segmentation training in the MRI space.
C. Segmentation Performance
We evaluated the segmentation performance of our method qualitatively and quantitatively. The proposed method was compared with two state-of-the-art medical image segmentation methods, i.e., 3D U-Net [42] and Deep-supGAN [10]. In addition, two variants of the proposed method were also included in the comparison: the Baseline method, which trained the segmentor using only the paired data, and the Two-stage method, which trained the synthesis and segmentation sub-networks independently. Using the manual annotations of the CT images as the ground truth, the segmentation performance was quantified using both the Dice Similarity Coefficient (DSC) and the Average Symmetric Surface Distance (ASSD).
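For reference, both metrics can be computed from binary volumes as sketched below; the use of scipy distance transforms for the surface distances and a default isotropic spacing of 1.0 are our implementation choices.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dsc(pred, gt):
    """Dice Similarity Coefficient (%) between two binary volumes."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 100.0 * 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum())

def assd(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """Average Symmetric Surface Distance between two binary volumes (units of `spacing`)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    pred_surf = pred ^ binary_erosion(pred)                    # boundary voxels of each mask
    gt_surf = gt ^ binary_erosion(gt)
    d_to_gt = distance_transform_edt(~gt_surf, sampling=spacing)[pred_surf]
    d_to_pred = distance_transform_edt(~pred_surf, sampling=spacing)[gt_surf]
    return (d_to_gt.sum() + d_to_pred.sum()) / (len(d_to_gt) + len(d_to_pred))
```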
From Table II, two observations can be made. First, the Baseline method, 3D U-Net [42], and Deep-supGAN [10] yielded relatively poor results, implying that they suffer from overfitting due to training with very limited (paired MRI-CT) data. In contrast, our method successfully improves the segmentation performance in terms of both DSC (85.66 ± 3.26) and ASSD (1.04 ± 0.19), owing to 1) the augmentation of the training data with the synthetic MRI images generated from the unpaired CT images, and 2) the implicit transfer of CT structural information to assist the training of the MRI segmentation model. Second, the proposed method also outperformed its two-stage variant (by 1.81 and 0.08 in terms of DSC and ASSD, respectively), demonstrating the effectiveness of training the synthesis and segmentation sub-networks jointly in an end-to-end manner.
TABLE II.
Comparison of Segmentation Performance
Method | DSC (%) | ASSD |
---|---|---|
Baseline | 78.68 ± 5.42 | 1.33 ± 0.25 |
3D U-Net [42] | 77.41 ± 5.41 | 1.34 ± 0.20 |
Deep-supGAN [10] | 77.82 ± 3.18 | 1.49 ± 0.20 |
Two-Stage | 83.85 ± 4.03 | 1.12 ± 0.22 |
Proposed | 85.66 ± 3.26 | 1.04 ± 0.19 |
Mean ± Standard Deviation
We also visually compared the segmentation results obtained by the different methods, with representative examples shown in Fig. 6, where the six columns respectively represent (a) the input MRI images, (b) the results of the Baseline method, (c) the results of 3D U-Net [42], (d) the results of Deep-supGAN [10], (e) the results of the proposed method, and (f) the ground truth. We can observe that, compared with all competing methods, the bones estimated by our proposed method are smoother and more complete, as indicated by the red boxes, especially for small bony structures. These visual evaluations are consistent with the quantitative results presented in Table II, where the bold numbers indicate the best performance.
Fig. 6.
Visual results of bony structures segmentation.
D. Ablation Study
We evaluated the effectiveness of two important strategies used to train our method, i.e., the neighbor-based anchoring method (Section III-C) and the feature-matching-based semantic consistency constraint (Section III-D). For this purpose, we trained two variants without using the anchor guidance (denoted as w/o Anchor) and semantic consistency constraint (denoted as w/o SC), respectively.
We quantitatively compared the segmentation performance of w/o Anchor and w/o SC with that of the proposed method in Table III. The comparison results demonstrate that the proposed anchoring method and the semantic consistency constraint effectively improved the segmentation performance, by mitigating the ambiguity problem during image synthesis and enhancing the semantic significance of synthetic images in a segmentation-oriented manner.
TABLE III.
Ablation Study
Methods | DSC (%) | ASSD |
---|---|---|
w/o Anchor | 84.81 ± 3.44 | 1.08 ± 0.20 |
w/o SC | 84.27 ± 3.54 | 1.16 ± 0.20 |
Proposed | 85.66 ± 3.26 | 1.04 ± 0.19 |
Mean ± Standard Deviation
We analyzed the impact of the parameter K (i.e., the number of nearest anchor points) on the segmentation performance. Fig. 7 illustrates the segmentation performance of the proposed method for K ranging from 1 to 11 with a stride of 2. The segmentation performance first improved as K increased from 1 to 3, and then remained stable for K ∈ {3, 5, 7}. Further increasing K (>7) led to a slight drop in segmentation performance, potentially because including farther anchor points can be counterproductive during synthesis. We therefore set K = 3 in all our experiments.
Fig. 7.
Segmentation performance of the proposed method with different K.
V. Conclusion and Discussion
In this paper, we propose an end-to-end neural network to tackle the problem of segmenting CMF bony structures in MRI images. Considering the difficulty of annotating bones in MRI images, we propose to transfer bone annotations from CT to the MRI space to guide the training of the MRI-based segmentation network via generative adversarial learning. Specifically, the proposed network consists of two sub-networks, i.e., a cross-modality image synthesis sub-network and an MRI segmentation sub-network. The image synthesis sub-network learns the mappings between MRI and CT images using both paired (a single set) and unpaired (multiple sets) MRI-CT data in a one-shot learning framework, and the synthetic data are then used to assist the training of the segmentation sub-network and reduce the risk of over-fitting. To mitigate the ambiguity problem in synthesis, we present a novel neighbor-based anchoring method that reduces the space of possible mappings by making full use of the single paired dataset. Moreover, we also propose a feature-matching-based semantic consistency constraint to improve the effectiveness of synthetic MRI images for CMF bony structure segmentation by encouraging high-level semantic consistency between real and synthetic MRI images. The qualitative and quantitative experimental results demonstrate that the proposed method can effectively improve the generalizability of the segmentation sub-network, leading to improved segmentation performance. The effectiveness of the proposed neighbor-based anchoring method and feature-matching-based semantic consistency constraint is also verified through an ablation study. In addition, although our proposed method was implemented for segmenting all the bones, it can also be applied to particular bones, such as CMF bony sub-structures, mainly because our model is trained on small 3D patches and is thus applicable to partial CMF scans.
In the future, we will explore StarGAN-based [43] methods to extend our CycleGAN-based approach to multi-modality cases, especially with partially missing data. This would make it possible to estimate CMF bony structures from both T1- and T2-weighted MRI images by transferring annotations from CT and CBCT images, even with an incomplete image dataset in a given modality (e.g., partial jaws missing in a head CT scan).
Acknowledgments
This work was in part supported by NIH/NIDCR Grants (DE027251 and DE022676).
Contributor Information
Hannah Deng, Department of Oral and Maxillofacial Surgery, Houston Methodist Research Institute, TX, USA.
Steve H. Fung, Department of Radiology, Houston Methodist Research Institute, TX, USA.
Dinggang Shen, Department of Radiology and Biomedical Research Imaging Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
REFERENCES
- [1]. Kraft A, Abermann E, Stigler R, Zsifkovits C, Pedross F, Kloss F, and Gassner R, "Craniomaxillofacial trauma: synopsis of 14,654 cases with 35,129 injuries in 15 years," Craniomaxillofacial Trauma & Reconstruction, vol. 5, no. 1, p. 41, 2012.
- [2]. Huynh T, Gao Y, Kang J, Wang L, Zhang P, Lian J, and Shen D, "Estimating CT image from MRI data using structured random forest and auto-context model," IEEE Transactions on Medical Imaging, vol. 35, no. 1, pp. 174–183, 2015.
- [3]. Acharya UR, Oh SL, Hagiwara Y, Tan JH, and Adeli H, "Deep convolutional neural network for the automated detection and diagnosis of seizure using EEG signals," Computers in Biology and Medicine, vol. 100, pp. 270–278, 2018.
- [4]. Milletari F, Navab N, and Ahmadi S-A, "V-net: Fully convolutional neural networks for volumetric medical image segmentation," in 2016 Fourth International Conference on 3D Vision (3DV). IEEE, 2016, pp. 565–571.
- [5]. Ronneberger O, Fischer P, and Brox T, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
- [6]. Moeskops P, Wolterink JM, van der Velden BH, Gilhuijs KG, Leiner T, Viergever MA, and Išgum I, "Deep learning for multi-task medical image segmentation in multiple modalities," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 478–486.
- [7]. Lian C, Liu M, Zhang J, et al., "Hierarchical fully convolutional network for joint atrophy localization and Alzheimer's disease diagnosis using structural MRI," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
- [8]. Lian C, Zhang J, Liu M, et al., "Multi-channel multi-scale fully convolutional network for 3D perivascular spaces segmentation in 7T MR images," Medical Image Analysis, vol. 46, pp. 106–117, 2018.
- [9]. Nie D, Wang L, Trullo R, Li J, Yuan P, Xia J, and Shen D, "Segmentation of craniomaxillofacial bony structures from MRI with a 3D deep-learning based cascade framework," in International Workshop on Machine Learning in Medical Imaging. Springer, 2017, pp. 266–273.
- [10]. Zhao M, Wang L, Chen J, Nie D, Cong Y, Ahmad S, Ho A, Yuan P, Fung SH, Deng HH, et al., "Craniomaxillofacial bony structures segmentation from MRI with deep-supervision adversarial learning," in International Conference on Medical Image Computing and Computer Assisted Intervention. Springer, 2018, pp. 720–727.
- [11]. Fei-Fei L, Fergus R, and Perona P, "One-shot learning of object categories," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 594–611, 2006.
- [12]. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, and Bengio Y, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
- [13]. Zhu J-Y, Park T, Isola P, and Efros AA, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
- [14]. Yang H, Sun J, Carass A, Zhao C, Lee J, Xu Z, and Prince J, "Unpaired brain MR-to-CT synthesis using a structure-constrained CycleGAN," in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2018, pp. 174–182.
- [15]. Xiang L, Li Y, Lin W, Wang Q, and Shen D, "Unpaired deep cross-modality synthesis with fast training," in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2018, pp. 155–164.
- [16]. Wan J, Ruan Q, Li W, and Deng S, "One-shot learning gesture recognition from RGB-D data using bag of features," The Journal of Machine Learning Research, vol. 14, no. 1, pp. 2549–2582, 2013.
- [17]. Shaban A, Bansal S, Liu Z, Essa I, and Boots B, "One-shot learning for semantic segmentation," arXiv:1709.03410, 2017.
- [18]. Mondal AK, Dolz J, and Desrosiers C, "Few-shot 3D multi-modal medical image segmentation using generative adversarial learning," arXiv:1810.12241, 2018.
- [19]. Luc P, Couprie C, Chintala S, and Verbeek J, "Semantic segmentation using adversarial networks," arXiv:1611.08408, 2016.
- [20]. Huo Y, Xu Z, Bao S, Bermudez C, Plassard AJ, Liu J, Yao Y, Assad A, Abramson RG, and Landman BA, "Splenomegaly segmentation using global convolutional kernels and conditional generative adversarial networks," in Medical Imaging 2018: Image Processing, vol. 10574. International Society for Optics and Photonics, 2018, p. 1057409.
- [21]. Yang D, Xu D, Zhou SK, Georgescu B, Chen M, Grbic S, Metaxas D, and Comaniciu D, "Automatic liver segmentation using an adversarial image-to-image network," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 507–515.
- [22]. Kohl S, Bonekamp D, Schlemmer H-P, Yaqubi K, Hohenfellner M, Hadaschik B, Radtke J-P, and Maier-Hein K, "Adversarial networks for the detection of aggressive prostate cancer," arXiv:1702.08014, 2017.
- [23]. Xue Y, Xu T, Zhang H, Long LR, and Huang X, "SegAN: Adversarial network with multi-scale L1 loss for medical image segmentation," Neuroinformatics, pp. 1–10, 2018.
- [24]. Huo Y, Xu Z, Bao S, Assad A, Abramson RG, and Landman BA, "Adversarial synthesis learning enables segmentation without target modality ground truth," in 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI). IEEE, 2018, pp. 1217–1220.
- [25]. Iglesias JE, Konukoglu E, Zikic D, Glocker B, Van Leemput K, and Fischl B, "Is synthesizing MRI contrast useful for inter-modality analysis?" in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2013, pp. 631–638.
- [26]. Shrivastava A, Pfister T, Tuzel O, Susskind J, Wang W, and Webb R, "Learning from simulated and unsupervised images through adversarial training," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2107–2116.
- [27]. Xue J, Zhang H, Dana K, and Nishino K, "Differential angular imaging for material recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 5, 2017.
- [28]. Chartsias A, Joyce T, Dharmakumar R, and Tsaftaris SA, "Adversarial image synthesis for unpaired multi-modal cardiac data," in International Workshop on Simulation and Synthesis in Medical Imaging. Springer, 2017, pp. 3–13.
- [29]. Zhang Z, Yang L, and Zheng Y, "Translating and segmenting multimodal medical volumes with cycle- and shape-consistency generative adversarial network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9242–9251.
- [30]. Huo Y, Xu Z, Moon H, Bao S, Assad A, Moyo TK, Savona MR, Abramson RG, and Landman BA, "SynSeg-Net: Synthetic segmentation without target modality ground truth," IEEE Transactions on Medical Imaging, 2018.
- [31]. He K, Zhang X, Ren S, and Sun J, "Identity mappings in deep residual networks," in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
- [32]. Zeiler MD, Taylor GW, and Fergus R, "Adaptive deconvolutional networks for mid and high level feature learning," in 2011 IEEE International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2018–2025.
- [33]. Ulyanov D, Vedaldi A, and Lempitsky V, "Instance normalization: The missing ingredient for fast stylization," arXiv:1607.08022, 2016.
- [34]. Nair V and Hinton GE, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
- [35]. Maas AL, Hannun AY, and Ng AY, "Rectifier nonlinearities improve neural network acoustic models," in Proc. ICML, vol. 30, no. 1, 2013, p. 3.
- [36]. Radford A, Metz L, and Chintala S, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv:1511.06434, 2015.
- [37]. Mao X, Li Q, Xie H, Lau RY, Wang Z, and Paul Smolley S, "Least squares generative adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2794–2802.
- [38]. Kingma DP and Ba J, "Adam: A method for stochastic optimization," arXiv:1412.6980, 2014.
- [39]. Trzepacz PT, Yu P, Sun J, Schuh K, Case M, Witte MM, Hochstetler H, Hake A, and the Alzheimer's Disease Neuroimaging Initiative, "Comparison of neuroimaging modalities for the prediction of conversion from mild cognitive impairment to Alzheimer's dementia," Neurobiology of Aging, vol. 35, no. 1, pp. 143–151, 2014.
- [40]. Chilamkurthy S, Ghosh R, Tanamala S, Biviji M, Campeau NG, Venugopal VK, Mahajan V, Rao P, and Warier P, "Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study," The Lancet, vol. 392, no. 10162, pp. 2388–2396, 2018.
- [41]. Jenkinson M, Bannister P, Brady M, and Smith S, "Improved optimization for the robust and accurate linear registration and motion correction of brain images," NeuroImage, vol. 17, no. 2, pp. 825–841, 2002.
- [42]. Çiçek Ö, Abdulkadir A, Lienkamp SS, Brox T, and Ronneberger O, "3D U-Net: learning dense volumetric segmentation from sparse annotation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 424–432.
- [43]. Choi Y, Choi M, Kim M, Ha JW, Kim S, and Choo J, "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8789–8797.