SynSeg-Net: Synthetic Segmentation Without Target Modality Ground Truth

Yuankai Huo; Zhoubing Xu; Hyeonsoo Moon; Shunxing Bao; Albert Assad; Tamara K Moyo; Michael R Savona; Richard G Abramson; Bennett A Landman

doi:10.1109/TMI.2018.2876633

Author manuscript; available in PMC: 2020 Apr 17.

Published before final editing as: IEEE Trans Med Imaging. 2018 Oct 17:10.1109/TMI.2018.2876633. doi: 10.1109/TMI.2018.2876633

PMCID: PMC6504618 NIHMSID: NIHMS1526444 PMID: 30334788

Abstract

A key limitation of deep convolutional neural network (DCNN)-based image segmentation methods is the lack of generalizability. Manually traced training images are typically required when segmenting organs in a new imaging modality or from a distinct disease cohort. The manual effort can be alleviated if the manually traced images in one imaging modality (e.g., MRI) can be used to train a segmentation network for another imaging modality (e.g., CT). In this paper, we propose an end-to-end synthetic segmentation network (SynSeg-Net) to train a segmentation network for a target imaging modality without manual labels in that modality. SynSeg-Net is trained using: 1) unpaired intensity images from the source and target modalities and 2) manual labels only from the source modality. SynSeg-Net is enabled by recent advances in cycle generative adversarial networks and DCNNs. We evaluate the performance of SynSeg-Net in two experiments: 1) MRI to CT splenomegaly synthetic segmentation for abdominal images and 2) CT to MRI total intracranial volume synthetic segmentation for brain images. The proposed end-to-end approach achieved superior performance to two-stage methods. Moreover, SynSeg-Net achieved comparable performance to the traditional segmentation network using target modality labels in certain scenarios. The source code of SynSeg-Net is publicly available.¹

Keywords: Synthesis, segmentation, splenomegaly, TICV, synthetic segmentation, GAN, adversarial, DCNN, convolutional

I. Introduction

Deep learning techniques have proven effective for medical image synthesis across (1) different sequencing types within the same image modality (e.g., between T1w, T2w, PD, FLAIR, etc.) and (2) different imaging modalities (e.g., MRI to CT, CT to MRI, etc.) [1]. While there are impediments to using synthetic images directly in clinical practice, synthetic images have been shown to be an effective intermediate representation for image processing, including registration [2], data augmentation [3], and segmentation [4]. Historically, paired training data for both imaging modalities were typically required for image synthesis. Recent advances with cycle generative adversarial networks (CycleGAN) [5] have demonstrated high quality cross-modality image synthesis without paired data.

In this paper, we propose an end-to-end synthetic segmentation network (SynSeg-Net) to train a DCNN segmentation network without manual labels on the target imaging modality. The network is trained on unpaired source and target modality images with manual segmentations only on the source modality (Figure 1). This method alleviates the manual segmentation effort in medical image analysis by taking advantage of cross-modality image synthesis learning.

Fig. 1. The proposed synthetic segmentation network (SynSeg-Net) is able to train a CT splenomegaly segmentation network from unpaired MRI and CT training images without using manual CT labels.

To evaluate the segmentation performance of the proposed SynSeg-Net, two experiments were employed. The first experiment performed CT splenomegaly (extraordinarily large spleen) synthetic segmentation without any spleen labels on CT images. The second experiment performed MRI total intracranial volume (TICV) synthetic segmentation without any TICV labels on MRI. In the empirical validations, the proposed end-to-end approach achieved superior performance to the two-stage methods. Moreover, SynSeg-Net achieved comparable performance to the traditional way of training a segmentation network using target modality labels in certain scenarios. Note that “comparable performance” in this paper means that two methods do not show statistically significant differences in segmentation performance.

This work extends our previous conference paper [6] with the following new efforts: (1) the methodology is presented in greater detail, (2) new external validations (MRI to CT) were provided for CT splenomegaly synthetic segmentation, (3) total intracranial volume segmentation was provided as a new experiment (CT to MRI), and (4) the source code of SynSeg-Net has been made publicly available at https://github.com/MASILab/SynSeg-Net.

II. Related Works

A. Cross-Modality Image Synthesis

Medical image synthesis is defined as the generation of realistic images through learning models [1]. From a technical perspective, image synthesis can be achieved from a generative model (e.g., from noise) or a cross-modality adaptation model (e.g., from MRI to CT). Our work is mostly related to the cross-modality image synthesis approaches, in which a synthetic image in target imaging modality is synthesized from a real image in source imaging modality.

Historically, cross-modality image synthesis methods can be grouped into three categories: (1) registration-based methods, (2) intensity-based methods, and (3) deep learning based methods. The registration-based cross-modality image synthesis methods were inspired by Miller et al. [2], in which the synthetic images were obtained by registering a subject image to a collection of co-registered images. Then, Burgos et al. [7] extended this idea to a multi-atlas information propagation scheme by integrating multi-atlas registration and intensity fusion, and applied it to MRI to CT synthesis. Cardoso et al. [8] proposed a variant of this approach by introducing a multi-atlas generative model for image synthesis and outlier detection. The second family of cross-modality image synthesis approaches is intensity-based methods, whose principle is to learn an intensity transformation function that maps source intensities to target intensities [9-16].

Herein, we focus on the third family: deep learning based image synthesis methods. In [17], a location-sensitive deep synthesis method was introduced to utilize both intensity and spatial information between modalities during the training stage. Sevetlidis et al. proposed a deep encoder-decoder network [18] trained in a patch-based fashion. Xiang et al. [19] proposed a deep embedding convolutional neural network, which utilizes the intermediate feature maps between MRI and CT scans. Nie et al. [20] proposed a context-aware generative adversarial network to generate CT images from MRI images.

Recently, Goodfellow et al. [21] proposed generative adversarial networks (GANs), which provided a new perspective on image synthesis and domain adaptation using either paired training images [22] or unpaired images [5]. GAN-based methods have been successfully applied to a variety of computer vision problems [23, 24] and have been adapted by the medical imaging community [20, 25-27]. Compared with previous adversarial learning based synthesis methods, the cycle-consistent loops lead to more representative synthetic images.

B. Synthetic Segmentation

One major application of image synthesis is to improve segmentation performance. Iglesias et al. [4] demonstrated that synthesized MRI images could improve segmentation performance (Figure 2a). Several studies used adversarial learning as an extra GAN-based supervision on medical image segmentation networks [28-31]. In this study, we focus on synthetic segmentation, which uses synthetic images as training images to train a segmentation network for the target imaging modality.

Fig. 2. This figure illustrates the prevalent strategies for performing segmentation without ground truth in the target modality. “Mod. S” means source modality images while “Mod. T” represents target modality images. “Syn. 1” is a source to target transformation generator, while “Syn. 2” is a target to source transformation generator. “Seg. T” is the segmentation network for the target modality. (a) is a two-stage framework that treats synthesis (left side of the red dashed line) and segmentation (right side of the red dashed line) as two independent training stages. (b) connects the synthesis and segmentation networks in an end-to-end fashion. (c) employs the CycleGAN framework as the synthesis network for unpaired cross-modality image synthesis (left side of the red dashed line), and then performs another independent training stage for segmentation (right side of the red dashed line). (d) is the proposed method, which integrates cycle adversarial synthesis and segmentation into an end-to-end framework.

Figure 2 presents the different strategies for synthetic segmentation. Kamnitsas et al. [32] introduced unsupervised domain adaptation for brain lesion segmentation (Figure 2b). It revealed the possibility of training a lesion segmentation network using cross-modality synthetic segmentation. However, (1) the source imaging sequence (GE) and target imaging sequence (SWI) were still from the same MRI modality, and (2) overlapped image modalities (e.g., FLAIR, T2, PD, MPRAGE) were used in both source and target imaging modalities to ensure performance. Cross-modality synthetic segmentation on two independent imaging mechanisms (e.g., MRI to CT) without overlapped imaging modalities is appealing.

Recently, cycle generative adversarial networks (CycleGAN) [5] provided a promising tool for cross-modality synthesis from unpaired training images [33, 34]. With CycleGAN, one is able to synthesize images for one imaging modality (e.g., MRI) while targeting another imaging modality (e.g., CT). Using CycleGAN, Chartsias et al. [35] proposed a CT to MRI synthesis method, and then trained another independent MRI segmentation network (called “Seg.”) using the synthetic MRI images (Figure 2c). Although still using manual labels for both modalities, this two-stage framework (which we refer to as “CycleGAN+Seg.”) revealed a promising direction for integrating cycle adversarial networks in synthetic segmentation.

Building upon CycleGAN, Zhang et al. [3] and our group [6] proposed end-to-end synthesis and segmentation networks. Zhang et al. [3] focused on improving both synthesis and segmentation performance simultaneously using both true images and manual labels on both MRI and CT. Therefore, manual segmentations on the target imaging modality were still used. By contrast, Huo et al. [6] introduced an end-to-end synthesis and segmentation network that did not use manual labels in the target imaging modality. In this paper, we describe this method in greater detail. Moreover, external validation and new experiments were employed to evaluate the proposed method as well as the baseline methods.

III. Method

Figure 3 introduces the network design, while preprocessing, postprocessing, hyperparameters and the experimental platforms are presented below.

Fig. 3. The upper panel shows the network structure of the proposed SynSeg-Net during the training stage. The left side is the CycleGAN synthesis subnet, where S is MRI and T is CT. G₁ and G₂ are the generators while D₁ and D₂ are the discriminators. The right subnet is the segmentation subnet *Seg* for end-to-end training. Loss functions were added to optimize SynSeg-Net. The lower panel shows the network structure of SynSeg-Net during the testing stage. Only the trained subnet *Seg* was used to segment a testing image from the target imaging modality.

A. Preprocessing

The intensities of every input MRI scan were normalized to a 0-1 scale such that the highest 2.5% and lowest 2.5% of intensities were excluded from the normalization to reduce the effects of outliers. For CT, voxels whose HU values were greater than 1000 were set to 1000, and voxels whose HU values were less than −1000 were set to −1000. Then, the intensities between −1000 and 1000 were normalized to a 0-1 scale. Next, the axial slices from the normalized intensity image volume (both MRI and CT) were resampled to 256 × 256 using bilinear interpolation, while the corresponding segmentation axial slices were resampled to the same resolution using nearest neighbor interpolation. Hence, the image dimensions (256 × 256) were matched for both modalities, following CycleGAN [5].
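The intensity normalization described above can be sketched as follows. This is a minimal illustration, not the authors' MATLAB preprocessing code: the function names are hypothetical, and reading "excluding the highest and lowest 2.5%" as clipping at the 2.5th and 97.5th percentiles is an assumption. The resampling to 256 × 256 is not shown.

```python
import numpy as np

def normalize_mri(volume, lower_pct=2.5, upper_pct=97.5):
    """Scale MRI intensities to [0, 1].

    Assumption: the extreme 2.5% tails are handled by clipping at the
    2.5th and 97.5th percentiles before rescaling.
    """
    volume = np.asarray(volume, dtype=np.float64)
    lo, hi = np.percentile(volume, [lower_pct, upper_pct])
    clipped = np.clip(volume, lo, hi)
    # Guard against a constant image (hi == lo).
    return (clipped - lo) / max(hi - lo, 1e-8)

def normalize_ct(volume, hu_min=-1000.0, hu_max=1000.0):
    """Clamp CT values to [-1000, 1000] HU, then scale to [0, 1]."""
    volume = np.asarray(volume, dtype=np.float64)
    clipped = np.clip(volume, hu_min, hu_max)
    return (clipped - hu_min) / (hu_max - hu_min)
```

With this convention, air (−1000 HU and below) maps to 0, 0 HU maps to 0.5, and anything at or above 1000 HU maps to 1.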

B. SynSeg-Net

Figure 3 presents the network structure of SynSeg-Net, where “S” indicates the source imaging modality (e.g., MRI), while “T” indicates the target imaging modality (e.g., CT). SynSeg-Net consisted of two major portions: the cycle synthesis subnet and the segmentation subnet.

1). Cycle Synthesis Subnet:

The 9 block ResNet (defined in [5] and [36]) was employed as the two generators G₁ and G₂. The generator G₁ transferred a real image x in modality S to a synthetic image G₁(x) in modality T, while the generator G₂ transferred a real image y in modality T to a synthetic image G₂(y) in modality S. Next, the PatchGAN (defined in [5] and [37]) was used as the two adversarial discriminators D₁ and D₂. D₁ determined whether a provided image was a synthetic image G₁(x) or a real image y, while D₂ judged whether a provided image was a synthetic image G₂(y) or a real image x. When deploying such a network on unpaired images from modalities S and T, two forward training paths (Path A and Path B) were used (Figure 3).

2). Segmentation Subnet:

Since the final aim of the proposed SynSeg-Net was to perform end-to-end synthetic segmentation, we concatenated a segmentation network “Seg” directly after G₁, as an extension of training Path A. To be consistent with the cycle synthesis subnet, the same 9 block ResNet [5, 36] was used as Seg, whose network structure was identical to that of G₁.

3). Loss Functions:

In SynSeg-Net, five loss functions have been used during the training stage. After discriminators D₁ and D₂, two adversarial loss functions were used to train the adversarial generators G₁ and G₂.

L_{GAN} (G_{1}, D_{1}, S, T) = E_{y \sim T} [\log D_{1} (y)] + E_{x \sim S} [\log (1 - D_{1} (G_{1} (x)))]

(1)

L_{GAN} (G_{2}, D_{2}, T, S) = E_{x \sim S} [\log D_{2} (x)] + E_{y \sim T} [\log (1 - D_{2} (G_{2} (y)))]

(2)
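As a numerical illustration of Eqs. (1) and (2), the sketch below evaluates the adversarial objective from discriminator outputs on a batch. It is a hedged example rather than the released implementation: the function name is hypothetical, the discriminator outputs are assumed to be probabilities in [0, 1], and the expectations are approximated by batch means.

```python
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-12):
    """Batch estimate of E[log D(real)] + E[log(1 - D(G(input)))].

    d_real: discriminator outputs on real images (assumed in [0, 1]).
    d_fake: discriminator outputs on synthetic images (assumed in [0, 1]).
    eps is a hypothetical numerical guard against log(0).
    """
    d_real = np.clip(np.asarray(d_real, dtype=np.float64), eps, 1.0)
    d_fake = np.clip(np.asarray(d_fake, dtype=np.float64), 0.0, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))
```

The value is 0 when the discriminator is perfect (real images scored 1, synthetic images scored 0) and becomes more negative as the generator fools it; an undecided discriminator that outputs 0.5 everywhere gives 2 log 0.5.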

Meanwhile, two cycle consistent loss functions were used to minimize the difference between true images and cycle reconstructed images.

L_{cycle} (G_{1}, G_{2}, S) = E_{x \sim S} [\| G_{2} (G_{1} (x)) - x \|_{1}]

(3)

L_{cycle} (G_{2}, G_{1}, T) = E_{y \sim T} [\| G_{1} (G_{2} (y)) - y \|_{1}]

(4)
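The cycle consistency terms in Eqs. (3) and (4) can be illustrated with a small sketch. The function name and the toy generators in the usage note are hypothetical, and the expectation is approximated by a mean over the batch of per-image L1 norms; practical implementations often average over pixels instead, which only rescales the term.

```python
import numpy as np

def cycle_consistency_loss(batch, forward, backward):
    """Mean over the batch of || backward(forward(x)) - x ||_1.

    batch: array of shape (N, H, W) holding N real images.
    forward, backward: callables standing in for the two generators.
    """
    batch = np.asarray(batch, dtype=np.float64)
    reconstructed = backward(forward(batch))
    # L1 norm per image, then average over the batch.
    per_image_l1 = np.abs(reconstructed - batch).reshape(batch.shape[0], -1).sum(axis=1)
    return float(per_image_l1.mean())
```

For example, if the toy generators are exact inverses of each other (`lambda x: 2 * x` and `lambda x: x / 2`), the loss is zero, which is the behavior the cycle terms encourage in G₁ and G₂.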

The last loss function was the segmentation loss, a weighted cross-entropy loss.

L_{seg} (Seg, G_{1}, S) = - \sum_{i} m_{i} \log (Seg (G_{1} (x_{i})))

(5)
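A weighted cross-entropy of the kind named for Eq. (5) can be sketched as below. Several details are assumptions made for illustration and are not stated in the text: m_i is taken to be the source-modality manual label of pixel i, Seg(G₁(x_i)) is taken to be a vector of class probabilities, and the per-class weights (`class_weights`) are hypothetical values.

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights, eps=1e-12):
    """Sum over pixels of -w[label] * log(prob of the labeled class).

    probs: (N, C) predicted class probabilities for N pixels.
    labels: (N,) integer manual labels in {0, ..., C-1}.
    class_weights: (C,) per-class weights (hypothetical values).
    """
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    weights = np.asarray(class_weights, dtype=np.float64)
    # Probability the network assigns to each pixel's labeled class.
    picked = probs[np.arange(labels.size), labels]
    return float(np.sum(-weights[labels] * np.log(np.clip(picked, eps, 1.0))))
```

Giving small structures a larger weight is one common reason for weighting, since background pixels otherwise dominate the sum.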

After defining five loss functions, we added them together by assigning different weights.

L_{total} = \lambda_{1} \cdot L_{GAN} (G_{1}, D_{1}, S, T) + \lambda_{2} \cdot L_{GAN} (G_{2}, D_{2}, T, S) + \lambda_{3} \cdot L_{cycle} (G_{1}, G_{2}, S) + \lambda_{4} \cdot L_{cycle} (G_{2}, G_{1}, T) + \lambda_{5} \cdot L_{seg} (Seg, G_{1}, S)

(6)

C. Training and Testing

In all experiments, the weights were empirically set to λ₁ = 1, λ₂ = 1, λ₃ = 10, λ₄ = 10, λ₅ = 1. The values of λ₁ to λ₄ followed the original CycleGAN paper [5], while λ₅ was simply set to 1 without tuning for the different applications in this study. The Adam optimizer [5] was used to minimize L_total. All networks had one input channel and one output channel, except Seg, which had seven output channels. The Adam learning rate was 0.0001 for G₁, G₂ and Seg and 0.0002 for D₁ and D₂.
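To make the weighting in Eq. (6) concrete, the sketch below combines five already-computed loss values using the weights and learning rates reported above. The dictionary keys and the function name are hypothetical labels chosen for this example, and the snippet only shows the arithmetic of the weighted sum, not the alternating generator and discriminator updates of adversarial training.

```python
# Weights from the text: lambda_1..lambda_5 = 1, 1, 10, 10, 1.
LAMBDAS = {
    "gan_1": 1.0,     # L_GAN(G1, D1, S, T)
    "gan_2": 1.0,     # L_GAN(G2, D2, T, S)
    "cycle_s": 10.0,  # L_cycle(G1, G2, S)
    "cycle_t": 10.0,  # L_cycle(G2, G1, T)
    "seg": 1.0,       # L_seg(Seg, G1, S)
}

# Adam learning rates from the text, keyed by network.
LEARNING_RATES = {"G1": 1e-4, "G2": 1e-4, "Seg": 1e-4, "D1": 2e-4, "D2": 2e-4}

def total_loss(terms, lambdas=LAMBDAS):
    """Weighted sum of the five loss terms, as in Eq. (6)."""
    missing = set(lambdas) - set(terms)
    if missing:
        raise KeyError("missing loss terms: %s" % sorted(missing))
    return sum(lambdas[name] * terms[name] for name in lambdas)
```

For instance, terms of 0.5, 0.5, 0.1, 0.2, and 1.0 (in the order listed) combine to 0.5 + 0.5 + 1.0 + 2.0 + 1.0 = 5.0, which shows how the cycle terms are emphasized by their weight of 10.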

In the testing stage, only the segmentation network Seg of SynSeg-Net was employed (Figure 3). To segment a testing scan in the target modality, the scan was first normalized to 0-1. Next, the axial slices from the normalized testing image volume were resampled to 256 × 256 using bilinear interpolation and segmented by Seg. Last, the final segmentation slices were resampled to the original resolution using nearest neighbor interpolation and were concatenated. During training, the 2D slices were sampled randomly across all scans without forcing each batch to have only consecutive slices or only slices from the same subject.
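The last testing step, resampling the predicted label slices back to the original in-plane resolution and concatenating them, can be sketched as follows. This is an illustrative NumPy version with hypothetical function names; the bilinear resampling of the intensity slices and the network forward pass are not shown. Nearest neighbor interpolation is used for labels because it never creates label values that were not predicted.

```python
import numpy as np

def resize_nearest(image, out_shape):
    """Nearest neighbor resampling of a 2D array to out_shape (rows, cols)."""
    image = np.asarray(image)
    # Map each output pixel center back to the closest input pixel index.
    rows = np.floor((np.arange(out_shape[0]) + 0.5) * image.shape[0] / out_shape[0]).astype(int)
    cols = np.floor((np.arange(out_shape[1]) + 0.5) * image.shape[1] / out_shape[1]).astype(int)
    return image[np.ix_(rows, cols)]

def restore_segmentation(label_slices, original_shape):
    """Resample predicted label slices and stack them into a volume.

    label_slices: iterable of 2D label maps (e.g., 256 x 256 network outputs).
    original_shape: (rows, cols) of the original axial slices.
    """
    restored = [resize_nearest(s, original_shape) for s in label_slices]
    return np.stack(restored, axis=-1)
```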

The experiments were performed on an Ubuntu workstation with an NVIDIA Titan GPU (12 GB memory) and CUDA 8.0. The code for preprocessing and processing was implemented in MATLAB 2016a (www.mathworks.com), while the SynSeg-Net code was implemented in Python 2.7 (www.python.org). For the DCNN methods, PyTorch version 0.2 (www.pytorch.org) was used to establish the network structures and perform training.

D. Evaluation Metrics

The Dice similarity coefficient (DSC) was employed to evaluate different approaches by comparing their segmentation results against the ground truth voxel-by-voxel. Differences between methods were evaluated by Wilcoxon signed rank test [38] with a significance threshold of p<0.05.
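For reference, the DSC between a predicted mask A and a ground truth mask B is 2|A ∩ B| / (|A| + |B|). A minimal sketch is given below; the function name is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here rather than one stated in the text.

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)
```

For a multi-label output such as that of Seg, the same function can be applied per organ, for example `dice_coefficient(pred == k, truth == k)` for label k.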

IV. Experimental Design and Results

We conducted experiments on two different applications to evaluate the relative effectiveness of different approaches. The first application was MRI to CT splenomegaly synthetic segmentation. The second application was CT to MRI TICV synthetic segmentation. In the first experiment, we first employed the target abdominal CT intensity images in training (without using the manual labels), which would provide the best synthetic segmentation performance since the target intensity images were used in the synthesis learning. Then, we used an independent CT cohort for validation.

A. MRI-to-CT Splenomegaly Synthetic Segmentation for Abdomen

1). Data:

A collection of 60 clinically acquired whole abdomen MRI T2w scans as well as 19 clinically acquired whole abdomen CT scans from splenomegaly patients were used as the training and testing data. The MRI and CT scans were acquired in the axial plane. In total, 3262 MRI slices and 1874 CT slices were used in the experiments.

2). Experimental Design:

CT Segmentation with CT Manual Labels

First, our previously developed spleen segmentation network (SSNet) [31] (trained on 75 normal spleen CT scans) was employed to assess the performance of a network trained on normal spleens when applied to splenomegaly scans.

Then, multi-atlas segmentation and a residual FCN network were used as two baseline methods, which used traditional segmentation strategies: both were trained on 19 splenomegaly CT scans and the corresponding manual spleen labels in a leave-one-out cross-validation manner. Briefly, adaptive Gaussian mixture model multi-atlas segmentation (AGMM MAS) was used as the first baseline method, which has shown superior performance on splenomegaly segmentation [39]. The second baseline approach employed the 9 block ResNet FCN [5, 36]. To compare with the synthetic segmentation methods, the network structure and the hyperparameters of the ResNet were kept exactly the same as those of the generators and segmentation networks in SynSeg-Net. This method evaluated the performance of traditional supervised DCNN segmentation, which used the manual labels in the target imaging modality during training. Since only spleen manual labels were available in the CT domain, the supervised learning methods using CT manual labels provided only spleen segmentation results.

CT Segmentation Without CT Manual Labels

Then, we evaluated the performance of synthetic segmentation, which did not use the manual labels in the target imaging modality during training. In this section, the two-stage CycleGAN+Seg. strategy proposed by Chartsias et al. [35] as well as the proposed end-to-end SynSeg-Net were evaluated. For a fair comparison, the network structures of CycleGAN+Seg. and SynSeg-Net were the same, except that SynSeg-Net connected the synthesis and segmentation networks in an end-to-end training. Briefly, the CycleGAN+Seg. strategy first trained the CycleGAN network to generate 60 synthetic CT scans from 60 real MRI scans. Then the manual labels from the real MRI scans as well as the corresponding synthetic CT scans were used to train an independent 9 block ResNet network. Hence, two independent training phases were used.

By contrast, the proposed SynSeg-Net integrated the synthesis and segmentation training phases into an end-to-end training framework. Examples of real, synthesized, reconstructed and segmentation images for Path A and Path B are shown in Figure 4.

Fig. 4. The intermediate results of the real, synthesized, and reconstructed images as well as segmentations in training Path A and Path B.

We also performed an experiment that trained the SynSeg-Net only using the source to target path (from MRI to CT) without the target to source path. The experiment only used G₁ and T in the half cycle (HC), which was called SynSeg-Net-HC. This experiment presented the segmentation performance with/without the complete cycle.

All networks were trained and validated for 100 epochs. The epoch with the highest mean DSC between predicted and manual segmentations on the 19 splenomegaly CT scans was reported in the results. The best performance of ResNet (epoch=90) was obtained from leave-one-subject-out validation. The best performance of SSNet (epoch=10), SynSeg-Net-HC (epoch=10), CycleGAN+Seg. (epoch=50) and SynSeg-Net (epoch=40) was evaluated from the external validation since labels for the 19 splenomegaly CT scans were never used in the training. Since liver, left kidney, right kidney and stomach manual labels were available in addition to spleen labels in MRI, the corresponding automatic organ segmentation results were also presented qualitatively in Figure 5 for SynSeg-Net-HC, CycleGAN+Seg., and SynSeg-Net. However, we did not quantitatively evaluate the results for organs other than the spleen since (1) we did not have manual labels for the remaining organs in the CT domain, and (2) the purpose of this experiment was to perform spleen segmentation.

Fig. 5. The qualitative results are presented in this figure, including (1) three canonical methods using CT manual labels in CT segmentation, and (2) CycleGAN+Seg. and the proposed SynSeg-Net methods without using CT manual labels. The splenomegaly CT labels were only used in validation and excluded from training for (2). Moreover, the latter methods not only performed spleen segmentation but also estimated labels for other organs, which were not provided by the canonical methods when such labels were not available on CT.

3). Results:

The qualitative and quantitative results are shown in Figures 5 and 6, respectively. Three subjects with the largest, median and smallest DSC of SynSeg-Net are presented. From the results, SynSeg-Net was not only able to perform spleen segmentation, but also estimated segmentations of the liver, left kidney, right kidney and stomach. The “*” indicates that the difference between methods was significant, while “N.S.” means not significant. Average surface distance (ASD) measurements (median, mean, and standard deviation (Std)) are presented together with the DSC measurements in Table I.

Fig. 6. The boxplot results of all CT splenomegaly testing images, where “*” means the difference is significant at p<0.05, while “N.S.” means not significant.

TABLE I.

Dice similarity coefficient (DSC) and average surface distance (ASD) for CT splenomegaly testing images.

	SSNet	AGMM MAS	Seg.	SynSeg-Net-HC	CycleGAN+Seg.	SynSeg-Net
Median DSC	0.679	0.912	0.911	0.628	0.880	0.919
Mean±Std DSC	0.630±0.269	0.861±0.101	0.911±0.040	0.605±0.084	0.878±0.056	0.895±0.063
Median ASD	8.882	3.164	2.005	15.181	5.835	2.864
Mean±Std ASD	18.340±27.991	6.726±7.710	3.004±2.797	14.383±4.521	5.600±3.619	3.898±3.397

The unit for ASD-related measurements is millimeters (mm).

Without using CT labels, SynSeg-Net achieved significantly superior performance compared with the CycleGAN+Seg. and SynSeg-Net-HC methods, while achieving comparable performance to the baseline ResNet segmentation network using CT labels.

B. External Validation for MRI-to-CT Splenomegaly Synthetic Segmentation

In the previous experiment, the target images were used in the training stage (only intensity images were used and the labels were excluded). This strategy would provide the best performance of the proposed SynSeg-Net since the target images were used to model the target distributions. However, the training stage needs to be performed again for an unseen target image. A more general strategy is to apply the trained model to new target images directly. In this experiment, we employed an independent external validation cohort to evaluate the performance of the baseline segmentation network, the two-stage CycleGAN+Seg. network, and the proposed SynSeg-Net.