Abstract
Deep networks are now ubiquitous in large-scale multi-center imaging studies. However, the direct aggregation of images across sites is contraindicated for downstream statistical and deep learning-based image analysis due to inconsistent contrast, resolution, and noise. To this end, in the absence of paired data, variations of Cycle-consistent Generative Adversarial Networks have been used to harmonize image sets between a source and target domain. However, these methods are prone to instability, contrast inversion, intractable manipulation of pathology, and steganographic mappings, which limit their reliable adoption in real-world medical imaging. In this work, based on an underlying assumption that morphological shape is consistent across imaging sites, we propose a segmentation-renormalized image translation framework to reduce inter-scanner heterogeneity while preserving anatomical layout. We replace the affine transformations used in the normalization layers within generative networks with trainable scale and shift parameters conditioned on jointly learned anatomical segmentation embeddings to modulate features at every level of translation. We evaluate our methodologies against recent baselines across several imaging modalities (T1w MRI, FLAIR MRI, and OCT) on datasets with and without lesions. Segmentation-renormalization for translation GANs yields superior image harmonization as quantified by Inception distances, improved downstream utility as measured by post-hoc segmentation accuracy, and improved robustness to translation perturbations and self-adversarial attacks.
Keywords: unpaired image translation, conditional normalization, generative adversarial networks, image segmentation, image harmonization
I. Introduction
Large-scale multi-center imaging studies acquire data over several years and analyze them jointly to track potential biomarkers for specific diseases. However, data collection may involve the upgrade of imaging devices over time or the use of multiple scanners in parallel, introducing non-biological variability and batch-effects into any statistical analysis. For example, magnetic resonance images acquired from 1.5 T and 3 T scanners differ significantly in intensity, contrast, and noise distribution (see Fig. 1 (a)), with even higher inter-scanner variability observed in Optical Coherence Tomography (OCT) as in Fig. 1 (b). Therefore, harmonization to a consistent imaging standard is critical to downstream tasks. Further, harmonization enables the introduction of legacy data into new studies, thereby accelerating collaborative imaging research by increasing statistical power and reducing costs.
Fig. 1:

Unpaired samples across medical imaging scanners illustrating inter-device appearance variability.
The multi-site imaging harmonization problem has been studied extensively, with varying approaches [1]–[3]. One strategy takes an Empirical Bayes approach [4] to harmonize image-derived statistical values such as cortical thickness across sites to an intermediate domain between source and target. Another image-specific approach attempts to normalize intensity between scanners via non-parametric density flows [5]. Given paired images, i.e. the same subject scanned on different devices, supervised learning with convolutional networks can be applied to learn mappings between medical imaging domains [6], [7]. However, acquiring paired images is expensive and logistically challenging for most study designs.
Recently, Cycle-consistent Generative Adversarial Networks [8] (CycleGAN) and its derivatives have been successful in unpaired image-to-image translation with direct applicability to the unpaired harmonization task when the imaging sites have roughly similar subject demographics. For example, [9] applies CycleGAN to OCT harmonization and [10] uses spherical U-Nets within CycleGAN for cortical thickness harmonization. However, cycle-consistent adversarial methods are prone to instability, can invert contrast, and can manipulate or introduce pathological lesions [11], [12]. Further, as translation between domains with differing amounts of structural information is ill-posed (as in image harmonization), a cycle-consistent generator may employ self-adversarial noise and learn mappings that are highly susceptible to noise [13], [14] making its outputs unreliable for downstream tasks.
Current work has sought to improve translation quality and robustness via multi-task strategies. For instance, translation can be improved by imposing a segmentation consistency loss between input and output [15], [16], or by further providing an instance mask coupled with the input to the generator [17]. In contrast to the goal of leveraging segmentation for high-quality robust image translation, another line of work proposes to segment target modalities without labels by utilizing translation GANs and source modality labels [18], [19]. While these methods regularize translation at the image (input/output) level, they do not consider feature-level regularization.
Common GAN generators are constructed as stacks of repeating convolutions, normalizations, and activations. Unfortunately, the normalization layers can ‘wash away’ spatial context [20] and may be detrimental to harmonization given that anatomical structure should remain consistent between domains. Feature-level segmentation-conditioning for paired translation was proposed in [20]. However, it learns affine parameters for each spatial index of the feature tensor and thus does not scale to unpaired problems, which require a reverse translation.
In this work, we improve unpaired image-to-image translation and hence image harmonization by conditionally normalizing cycle-consistent adversarial methods at the feature level via segmentation-derived linear modulation [21], [22]. In the normalization layers of the generator, we first normalize features to zero mean and unit standard deviation, and then renormalize them with affine scale and shift transformations whose parameters are learned from a jointly trained semantic segmentation-derived embedding branch.
Importantly, our approach is generic and could be applied to any other image translation framework where segmentation labels are available. Our contributions are as follows,
We improve on cycle-consistent adversarial methods in the unpaired multi-site harmonization task with a novel segmentation-aware renormalization layer;
Our proposed method regularizes mappings towards maintaining anatomical segmentation consistency while eliminating non-biological batch-effects;
We evaluate the proposed methodologies against both standard and segmentation-aware baselines across diverse imaging modalities (T1w MRI, FLAIR MRI, and OCT) via sample fidelity, post-hoc segmentation accuracy scores, and sensitivity to translation perturbation.
Finally, we demonstrate the robustness of our methods in harmonization tasks both in the context of pathological image translation where standard methods hallucinate lesions (or lack thereof), and in robustness to adversarial attacks in image translations.
Our image translation code is publicly available1.
II. METHODS
A. Methodological Overview
We formalize unpaired multi-site harmonization as an image-to-image translation problem between two domains X and Y. An overview of the proposed pipeline is shown in Fig. 2. In canonical cycle-consistent translation methods, the forward mapping is defined as G: X → Y, harmonizing images from source scanner X to target scanner Y, such that G(X) is indistinguishable from Y to the discriminator DY. As G is learned from unpaired samples, the framework additionally learns an inverse mapping F: Y → X and enforces a cycle-consistency constraint F(G(X)) ≈ X to achieve unpaired translation consistent with the input.
Fig. 2:

A high-level overview of the forward/backward translation in our framework. The generator learns cross-domain translations with its normalized layer-wise feature-maps affinely modulated by a jointly learned shared segmentation-embedding derived from the semantic extractor subnetwork. Seg Renorm Res Block denotes the proposed segmentation-renormalized residual block detailed in Figure 3. Green subnetworks are source domain-specific, blue subnetworks are target domain-specific, and gray boxes correspond to our objective functions.
We begin our improvements by incorporating segmentation networks S and Q in the CycleGAN setting as shown in Fig. 2, where S is trained with (image, segmentation)-pairs from the source domain, and Q is trained with pairs from the target domain. Motivated by the assumption that morphological shape is consistent across imaging domains, S and Q are encouraged to produce the same segmentation via a segmentation-objective as in [15], [16].
In addition to a loss-based approach, we further use segmentation information to adaptively control the scales and biases of each layer within the generator. Using a semantic extractor subnetwork (shown in Fig. 2), we derive a learned structural embedding from the predicted segmentation map, which is then used to condition the normalized output of each convolutional layer of the generator channel-wise via Feature-wise Linear Modulation (FiLM) [22]. We hypothesize that feature-level segmentation-conditioning regularizes the generator's activations more effectively than input/output-level segmentation regularization alone. Importantly, as the structural embedding is learned using a convolutional network (which in practice is neither translation-equivariant [23], [24] nor deformation-equivariant), the embedding still contains information about the relative positioning of anatomical structures.
B. Segmentation-renormalization.
We incorporate anatomical priors by replacing the convolutional residual blocks in the generator subnetworks with the proposed segmentation-renormalized residual block as illustrated in Fig. 3, modulating intermediate features with an affine transformation conditioned on a learned segmentation. Formally, a standardization and renormalization is inserted into each residual block of generators G and F between the convolutional layers and their activations. Features in the generator are standardized in a channel-wise manner and linearly modulated with learnable scales and shifts defined as,
$$\hat{h}^{\,i}_{n,c} \;=\; \gamma^{i}_{c}\,\frac{h^{i}_{n,c} - \mu^{i}_{n,c}}{\sigma^{i}_{n,c}} \;+\; \beta^{i}_{c}, \tag{1}$$

where $h^{i}_{n,c}$ is the 2D spatial output of the convolutional layer and $\mu^{i}_{n,c}$, $\sigma^{i}_{n,c}$ are the mean and standard deviation of channel $c$ in the $i$th block from the $n$th sample in the batch. Fig. 3 shows a comparison between a standard residual block [25] and the proposed residual block. $\gamma^{i}_{c}$ and $\beta^{i}_{c}$ are renormalization parameters that scale and shift the normalized feature at each channel. In contrast to standard normalization, we learn $\gamma^{i}$ and $\beta^{i}$ using FiLM layers (i.e., two fully connected layers per block in our implementation) applied to the structure embedding $f(m)$ derived from a predicted segmentation mask $m$, such that $\gamma^{i} = \mathrm{FC}^{i}_{\gamma}(f(m))$ and $\beta^{i} = \mathrm{FC}^{i}_{\beta}(f(m))$. To enable feature reuse, this structural embedding is shared across all blocks as shown in Figs. 2 and 4 and is then linearly modulated for each translation layer separately.
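For concreteness, a minimal PyTorch sketch of Eq. (1) and of the corresponding residual block is given below. It is an illustration under stated assumptions (a pooled embedding vector f(m), hypothetical class names, and arbitrary layer widths) rather than a reproduction of the released implementation.

```python
import torch
import torch.nn as nn


class SegRenorm(nn.Module):
    """Channel-wise standardization followed by FiLM-style renormalization (Eq. 1)."""

    def __init__(self, num_channels: int, embed_dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # Two fully connected layers map the shared structural embedding f(m)
        # to per-channel scales (gamma) and shifts (beta).
        self.to_gamma = nn.Linear(embed_dim, num_channels)
        self.to_beta = nn.Linear(embed_dim, num_channels)

    def forward(self, h: torch.Tensor, f_m: torch.Tensor) -> torch.Tensor:
        # h: (N, C, H, W) convolutional features; f_m: (N, embed_dim) embedding.
        mu = h.mean(dim=(2, 3), keepdim=True)
        sigma = ((h - mu) ** 2).mean(dim=(2, 3), keepdim=True).sqrt()
        h_hat = (h - mu) / (sigma + self.eps)                 # standardize each channel
        gamma = self.to_gamma(f_m)[:, :, None, None]          # (N, C, 1, 1)
        beta = self.to_beta(f_m)[:, :, None, None]
        return gamma * h_hat + beta                           # renormalize


class SegRenormResBlock(nn.Module):
    """Residual block with each normalization replaced by SegRenorm (Fig. 3, bottom)."""

    def __init__(self, channels: int, embed_dim: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.renorm1 = SegRenorm(channels, embed_dim)
        self.renorm2 = SegRenorm(channels, embed_dim)
        self.act = nn.ReLU(inplace=True)

    def forward(self, h: torch.Tensor, f_m: torch.Tensor) -> torch.Tensor:
        out = self.act(self.renorm1(self.conv1(h), f_m))
        out = self.act(self.renorm2(self.conv2(out), f_m))
        return h + out                                        # additive skip connection
```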
Fig. 3:

A comparison between (top) standard residual blocks and (bottom) our proposed segmentation-renormalized residual blocks (Seg Renorm Res Block).
Fig. 4:

Overall translator architecture with subnetworks A. segmentation net, B. semantic extractor, and C. segmentation-renormalized generator. n is the number of classes in the segmentation. For the IXI experiments, input spatial resolutions are 128 × 128 and the AvgPool(2,4) in B. was replaced by AvgPool(1,2).
We note that conditional normalization has been extensively studied for both image translation [20], [21] and medical image segmentation [26], [27]. We instead propose to conditionally normalize generative unpaired translation networks based on learned segmentation embeddings, leading to both improved translation fidelity and downstream utility as shown in Section III.
C. Architecture and Design Details
The proposed framework as shown in Fig. 2 can be decomposed into segmentation networks, semantic extractors, generators, and discriminators. Briefly, the segmentation network, semantic extractor, and generator jointly form the image-to-image translator with the discriminator trained in combination to calculate the adversarial objective. Architectural details are given in Figs. 4 and 5 using the following nomenclature:
ResBlock(k) represents two repeating {k 3 × 3 convolutions, batch normalization, ReLU} blocks with an additive skip connection as illustrated in Fig. 3.
Analogously, SegRenormResBlock(k) denotes the proposed residual block where each normalization is replaced by the segmentation-renormalization illustrated in Fig. 3.
TConv represents a transposed convolutional layer.
Lastly, AvgPool(k,s) denotes an average pooling operation with kernel size k and stride size s.
Fig. 5:

Patch discriminator network architecture. For the IXI experiments, the input image resolution is 128 × 128 × 1.
1). Segmentation nets:
U-Nets [28], detailed in Fig. 4 A, are used to produce segmentation masks from input images, both to regularize the image translation via segmentation-consistency and to provide an input to the semantic extractor detailed below. These auxiliary networks are trained with a weighted cross-entropy cost, chosen for its robustness to the label noise induced by imperfect segmentations [29].
2). Semantic Extractors:
Taking the predicted segmentation as input, a shallow CNN (detailed in Fig. 4 B) extracts a semantic embedding that serves as input to the FiLM layers which modulate the generator features as described in Section II-B.
3). Generators:
Residual U-Nets (detailed in Fig. 4 C) are used for the generator networks, with their residual blocks replaced by our proposed segmentation-renormalized residual blocks. The proposed renormalization is used throughout the generator to propagate the learned segmentation embedding at multiple resolutions during synthesis.
4). Discriminators:
We adopt PatchGAN discriminators [8] as described in Fig. 5, distinguishing between real and synthesized images at the patch level with patch size determined by the network receptive field (34 × 34 in our implementation). Spectral normalization [30] was used in the discriminator for training stability.
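A hedged sketch of such a spectrally-normalized patch discriminator follows; the layer widths and strides are illustrative and are not tuned to reproduce the exact 34 × 34 receptive field.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm


def patch_discriminator(in_channels: int = 1, base: int = 64) -> nn.Sequential:
    """PatchGAN-style discriminator producing a map of per-patch real/fake scores."""
    def conv(cin: int, cout: int, stride: int) -> nn.Sequential:
        return nn.Sequential(
            spectral_norm(nn.Conv2d(cin, cout, kernel_size=4, stride=stride, padding=1)),
            nn.LeakyReLU(0.2, inplace=True),
        )

    return nn.Sequential(
        conv(in_channels, base, 2),
        conv(base, base * 2, 2),
        conv(base * 2, base * 4, 1),
        spectral_norm(nn.Conv2d(base * 4, 1, kernel_size=4, stride=1, padding=1)),
    )
```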
D. Learning objectives
The framework is trained end-to-end with multiple losses.
1). Adversarial terms:
We employ a least squares objective for adversarial training [31], due to its improved stability over a cross-entropy objective in our task. The two-player adversarial game is optimized as,
$$\mathcal{L}_{GAN}(G, D_Y) = \mathbb{E}_{y\sim p(y)}\big[(D_Y(y)-1)^2\big] + \mathbb{E}_{x\sim p(x)}\big[D_Y(G(x))^2\big], \tag{2}$$

where $D_Y$ is trained to minimize Eq. 2 while $G$ is trained to minimize $\mathbb{E}_{x\sim p(x)}\big[(D_Y(G(x))-1)^2\big]$,
with analogous optimization applied for F and DX.
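The least-squares terms can be written as two small helper functions; the sketch below assumes the discriminator outputs a map of patch scores, as in Fig. 5.

```python
import torch


def lsgan_discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # D_Y pushes scores on real target images toward 1 and on translations G(x) toward 0.
    return ((d_real - 1.0) ** 2).mean() + (d_fake ** 2).mean()


def lsgan_generator_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # G pushes D_Y's scores on its translations toward 1.
    return ((d_fake - 1.0) ** 2).mean()
```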
2). Segmentation terms:
The segmentation networks S and Q shown in Fig. 2 for source and target domains, respectively, are trained under a weighted cross-entropy objective. In the forward cycle, the source domain sample x has a groundtruth segmentation mask which is used as a reference for subnetwork S. However, once x is translated to the target domain, G(x) does not have expert annotation. Motivated by our assumption that anatomical layout is consistent across domains, we give the segmentation subnetwork Q the same segmentation reference as S (the groundtruth source domain labels). Analogous reasoning follows for the backward cycle, where subnetwork S is given the same reference as Q (the groundtruth target domain labels). The loss is defined as,
$$\mathcal{L}_{seg}(S, Q, G, F) = -\,\mathbb{E}_{x\sim p(x)}\sum_{c=0}^{n-1}\lambda_c\, s_c\big[\log S(x,c) + \log Q(G(x),c)\big] \;-\; \mathbb{E}_{y\sim p(y)}\sum_{c=0}^{n-1}\lambda_c\, q_c\big[\log Q(y,c) + \log S(F(y),c)\big], \tag{3}$$

where $n$ is the number of classes, $S(x,c)$ is the prediction for class $c$ from input $x$, $s_c$ is the groundtruth mask for class $c$ acquired from the source domain, $\lambda_0$ is the weight applied to negative samples (background), and $\lambda_1, \ldots, \lambda_{n-1}$ are weights for positive samples (foreground). $Q(y,c)$ and $q_c$ are analogously defined in the target domain.
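A minimal sketch of one term of Eq. (3), assuming integer-valued label maps and logits from the segmentation networks:

```python
import torch
import torch.nn.functional as nnF


def weighted_ce(logits: torch.Tensor, labels: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Weighted cross-entropy for one (prediction, reference) pair.
    logits: (N, n, H, W) class scores; labels: (N, H, W) integer masks;
    weights: (n,) per-class weights lambda_0, ..., lambda_{n-1}."""
    return nnF.cross_entropy(logits, labels, weight=weights)


# Forward cycle (illustrative): S(x) and Q(G(x)) are both supervised with the
# source labels s, reflecting the assumption that anatomy is preserved:
#   loss_fwd = weighted_ce(S(x), s, w) + weighted_ce(Q(G(x)), s, w)
# The backward cycle mirrors this with the target labels q.
```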
3). Cycle-consistency terms:
To reduce the space of possible mapping functions and enable training with an unpaired dataset, a cycle consistency loss is defined as,
$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x\sim p(x)}\big[\lVert F(G(x)) - x\rVert_1\big] + \mathbb{E}_{y\sim p(y)}\big[\lVert G(F(y)) - y\rVert_1\big]. \tag{4}$$
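A sketch of Eq. (4), where G and F are the trained generator modules:

```python
import torch


def cycle_consistency_loss(x: torch.Tensor, y: torch.Tensor, G, F) -> torch.Tensor:
    # L1 reconstruction in both directions: F(G(x)) should recover x, G(F(y)) should recover y.
    return (F(G(x)) - x).abs().mean() + (G(F(y)) - y).abs().mean()
```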
4). Total objective:
The complete objective function of our model to minimize is summarized as
$$\mathcal{L} = \lambda_{GAN}\big(\mathcal{L}_{GAN}(G, D_Y) + \mathcal{L}_{GAN}(F, D_X)\big) + \lambda_{cyc}\,\mathcal{L}_{cyc}(G, F) + \lambda_{seg}\,\mathcal{L}_{seg}(S, Q, G, F). \tag{5}$$
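Assembled from the helper terms sketched above, a training step would weight the losses as follows (the weight values are those reported in Section III-B):

```python
# Illustrative weighting of the total objective (Eq. 5); loss_gan_G, loss_gan_F,
# loss_cyc, and loss_seg stand for the adversarial, cycle, and segmentation terms.
lambda_gan, lambda_cyc, lambda_seg = 1.0, 10.0, 1.0
# total_loss = (lambda_gan * (loss_gan_G + loss_gan_F)
#               + lambda_cyc * loss_cyc
#               + lambda_seg * loss_seg)
```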
III. Experiments
A. Data and preprocessing:
We tested our method across several modalities, T1w MRI, FLAIR MRI, and OCT, using three public datasets: IXI [32], MS-SEG [33], and RETOUCH [34], respectively. For IXI and RETOUCH, we used a 70/30 train/test split at the subject level. For MS-SEG, we performed leave-one-subject-out cross-validation as only five subjects were imaged per scanner. In all datasets, no individual was scanned on more than one device.
1). MS-SEG:
FLAIR MRI of subjects with Multiple Sclerosis (MS) were collected for an MS lesion segmentation challenge [33]. Five non-overlapping subjects per scanner were imaged on three different MR scanners: Siemens Aera 1.5 T (SA), Philips Ingenia 3 T (PI), and Siemens Verio 3 T (SV). The scanner specifications are detailed in Table I. See Fig. 6 A for samples from the three scanners, which illustrate their markedly different image appearances.
TABLE I:
Comparison between MS-SEG scanners.
| Scanner | Field Strength | Voxel Size (mm) | Resolution |
|---|---|---|---|
| SA | 1.5 T | 1.2 × 1 × 1 | 128 × 224 × 256 |
| PI | 3 T | 1.1 × 0.5 × 0.5 | 144 × 512 × 512 |
| SV | 3 T | 0.7 × 0.7 × 0.7 | 261 × 336 × 336 |
Fig. 6:

Harmonization results on test 2D multi-site FLAIR slices from MS-SEG. Downstream segmentation results are shown below translated images, annotated with Dice Coefficients (DC) for the visualized slices. A. Example slices from three sites, highlighting their varying contrasts; B. SA → SV translation; C. SV → PI translation; D. SA → PI translation. Severe prediction artefacts appear in baseline outputs whereas the proposed model preserves semantic layout and appearance. For example, note the contrast inversion of tissue in B and the lesion in D (see yellow insets) using CycleGAN and overall contrast inversion in C using S-CycleGAN. Strong decreases in downstream segmentation performance appear across baselines with false-positive and false-negative examples marked by white and yellow arrows, respectively. In all settings, the proposed methodology demonstrates significant improvements in both translation fidelity and post-hoc segmentation performance.
We performed harmonization between each pair of scanners (i.e., SA to SV, SA to PI, and SV to PI), where the target domains were selected to be the scanners with overall higher image quality (e.g., higher field strength or smaller voxel size). All images were brain-extracted, denoised, and bias-field corrected by the challenge organizers, and we then affinely registered them to MNI space [35]. The groundtruth segmentation for each image was constructed from a consensus of 7 expert-delineated parenchymal lesion masks.
2). RETOUCH:
We further tested our method by repurposing the Retinal OCT Fluid Challenge dataset [34], originally for multi-scanner pathology segmentation. Compared to MRI, distinct OCT scanners show higher imaging variability. The Cirrus (CR) and Spectralis (SP) scanners were treated as source and target respectively, with 24 unpaired scans each. Scanner differences are shown in Table II.
TABLE II:
Comparison between OCT scanners.
| Scanner | Cirrus (CR) | Spectralis (SP) |
|---|---|---|
| Resolution | 512 × 1024 × 128 | 256 × 496 × 49 |
| B-scans | 128 | 49 |
| Voxel Size(mm) | 0.01 × 0.001 × 0.05 | 0.01 × 0.004 × 0.1 |
Manual expert segmentation annotations were provided delineating three classes of abnormalities: Intraretinal Fluid (IRF), Subretinal Fluid (SRF) and Pigment Epithelium Detachments (PED). Following [9], we used nearest-neighbors resampling on CR OCTs to match target dimensionality.
3). IXI:
T1w MRI of non-overlapping healthy subjects from two distinct sites were collected from IXI [32]. After excluding 18 subject scans with ringing effects or outlier intensity distributions, 241 scans remained: 74 obtained on a GE Healthcare 1.5 T scanner (GE) and 167 from a Philips 3 T scanner (PL). Detailed scanner comparisons are given in Table III.
TABLE III:
Comparison between IXI scanners.
| Scanner | GE Healthcare | Philips (PL) |
|---|---|---|
| Field Strength | 1.5T | 3T |
| Resolution | 146 × 256 × 256 | 150 × 256 × 256 |
| Voxel Size(mm) | 1.2 × 0.9 × 0.9 | 1.2 × 0.9 × 0.9 |
We aim to harmonize GE to PL, which displays higher image quality. All T1w scans underwent brain extraction with ROBEX [36] and bias-field correction to minimize intensity inhomogeneity. As the dataset does not provide manual segmentations, we simulated a groundtruth segmentation for each image using FSL FAST [37] for whole-brain segmentation into three tissue types: grey matter (GM), white matter (WM), and cerebrospinal fluid (CSF). Similar pseudo-groundtruth estimation approaches have been taken for large-scale medical deep learning [38].
B. Implementation details:
The learning rates for the segmentation nets, generators, and discriminators were set to 2 × 10−4, 2 × 10−4, and 1 × 10−4, respectively, and Adam optimizers were adopted with β1 = 0.5 and β2 = 0.999. For IXI, networks were trained on randomly cropped 2D axial slices of size 128 × 128 with batch size 4. For RETOUCH and MS-SEG, the crop size was increased to 256 × 256 with batch size 2. We empirically set the background/foreground segmentation weights to 0.5/0.5 for IXI, 0.3/0.7 for RETOUCH, and 0.2/0.8 for MS-SEG, given the imbalance between positive (lesion) and negative segmentation labels. We set λGAN = 1, λcyc = 10, λseg = 1 across datasets.
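For reference, a sketch of the optimizer configuration matching the reported hyperparameters (the module names are placeholders for the corresponding parameter groups):

```python
import torch
import torch.nn as nn


def build_optimizers(seg_nets: nn.Module, generators: nn.Module, discriminators: nn.Module):
    """Adam optimizers with the learning rates and betas used in our experiments."""
    betas = (0.5, 0.999)
    return (
        torch.optim.Adam(seg_nets.parameters(), lr=2e-4, betas=betas),
        torch.optim.Adam(generators.parameters(), lr=2e-4, betas=betas),
        torch.optim.Adam(discriminators.parameters(), lr=1e-4, betas=betas),
    )
```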
C. Evaluation scores and baseline methods
In the absence of paired data, we evaluated harmonization performance from three different perspectives: visual sample fidelity (Sec. III-D), downstream task usability (Sec. III-E), and sensitivity to translation perturbation and self-adversarial attacks (Sec. III-F). We conducted comparisons against the original CycleGAN [8] and two methods which incorporate segmentation information into cycle-consistent translation: S-CycleGAN [15] and SemGAN [16]. We used the official CycleGAN implementation from [39] and reimplemented the other methods as they do not have public code repositories. S-CycleGAN [15] uses additional segmentation networks on the translated outputs, with a segmentation loss as image-level regularization. For consistent comparison, a 2D version of S-CycleGAN was used in our experiments. Building on S-CycleGAN, SemGAN [16] further uses semantic dropout during training by randomly masking out object classes from the inputs to enhance class-to-class translation. As the MS-lesion masks in MS-SEG are sparse, they cannot be meaningfully dropped out, and we therefore excluded SemGAN from the MS-SEG comparison. Sections III-D, III-E, and III-F provide detailed analysis and comparison. All comparisons were conducted on a held-out test set split at the subject level.
D. Post-harmonization visual fidelity results
Representative harmonization results are qualitatively shown in Figs. 6, 7, 8 for MS-SEG, RETOUCH, and IXI respectively. As observed from the visualizations, baseline methods often produce severe artefacts and lose semantic information crucial to medical image analysis. For example, CycleGAN introduced either global contrast inversion (Fig. 6 B) or localized contrast inversion to the MS lesion (Fig. 6 D). In Fig. 6 C, CycleGAN hallucinated artefacts and S-CycleGAN incorrectly removed the MS lesion, whereas our method preserved semantic layout and appearance, with significantly improved downstream segmentation performance via Dice coefficient. In Fig. 8 row I, we display zoomed-in periventricular regions for comparison: S-CycleGAN creates strong checkerboard-like artefacts, whereas our method displays improved image quality in terms of matching the fidelity, sharpness, and contrast of the target domain. In OCT harmonization, we observed improved image sharpness and fidelity especially within the red insets of Fig. 7 as compared to other methods. While we observe no strong artefacts in structured retinal regions across methods, baseline methods hallucinated stripe-like artefacts in the background indicated by cyan arrows in Fig. 7, whereas our framework did not.
Fig. 7:

Harmonization results on test multi-site OCT images on RETOUCH. Columns {A, F} show the input with a zoomed region-of-interest above and its annotation below; {B, G}, {C, H}, and {D, I} display CycleGAN, S-CycleGAN, SemGAN harmonizations and post-hoc segmentations, respectively; {E, J} present our results. Generalized Dice Coefficient (GDC) [40] is displayed for visualized slices. White and yellow arrows show false positive and negative downstream segmentation predictions. Cyan arrows indicate artefacts introduced by translation methods. Readers are encouraged to zoom in for inspection.
Fig. 8:

Harmonization results on test multi-site MRI from IXI. Column A shows a T1w axial slice from the source domain with its segmentation groundtruth below; B, C, D show harmonized images from CycleGAN, S-CycleGAN and SemGAN, respectively, alongside the results of a segmentation network trained on the original target domain images; E shows results from our method together with the output from the same segmentation network, revealing an improved post-hoc segmentation (see yellow insets); Generalized Dice Coefficient (GDC) [40] is annotated for each visualized segmentation prediction. F shows an example unpaired target scan. To show finegrained changes, row I shows a zoomed-in region corresponding to the red insets outlined in row II. All images were contrast enhanced with gamma correction for improved visualization.
Further, non-expert human perception may not catch subtle artefacts and distortions present in CNN-synthesized images across datasets, which are still detectable via algorithmic means [41]. We observed empirical support for this hypothesis, as the different translation methods led to different Inception distances and different segmentation results (analyzed in Section III-E) when predicted by the same separately trained segmentation model, illustrating underlying differences that affect downstream performance.
To quantify the visual fidelity of the harmonized results in the absence of paired data, we measured similarity to real unpaired target images in the feature space of a pre-trained network, as is commonly done with the Fréchet and Kernel Inception Distances (FID and KID) [42], [43]. We compared the similarity between source and target domains as a baseline and then between harmonized and target domains. FID measures the similarity of real and synthesized distributions by fitting multivariate Gaussians to feature-space embeddings and calculating the Wasserstein-2 distance between them. Though commonly used, FID has a strongly biased estimator [43], motivating the use of KID, which does not assume a parametric distribution on the embeddings and applies a polynomial kernel to samples independently drawn from each distribution. For natural images, representations are obtained by passing each set of images through an ImageNet-pretrained Inception-v3 [44] network and extracting its last pooling-layer features.
Yet, ImageNet-derived features may not yield good representations for comparing the distributions of embedded medical images. To this end, we trained a multitask autoencoder as the feature extractor to calculate FID and KID. Note that we refer to these scores as FID and KID despite not using Inception-v3 as a feature extractor to maintain consistency with the literature. The feature extractor network is composed of a residual encoder-decoder for image reconstruction, with a domain-classification branch trained on its latent features as in [45]. We did not use skip connections between encoder and decoder to enforce a meaningful bottleneck representation. Network configurations and training details are presented in Appendix A. As the network both classifies and reconstructs its input, its bottleneck representation is jointly discriminative and reconstructive. Once trained, data from both domains, and the harmonized data are fed into the network for feature extraction.
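Given feature matrices extracted by this autoencoder (one row per image), the two distances can be computed as in the sketch below; the KID estimator here omits the subset averaging commonly used in practice.

```python
import numpy as np
from scipy import linalg


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fit a Gaussian to each embedding set and return the Wasserstein-2 distance (FID)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    cov_sqrt, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
    if np.iscomplexobj(cov_sqrt):
        cov_sqrt = cov_sqrt.real           # discard numerical imaginary residue
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * cov_sqrt))


def kernel_distance(feats_a: np.ndarray, feats_b: np.ndarray, degree: int = 3) -> float:
    """Unbiased squared MMD with a polynomial kernel (KID)."""
    d = feats_a.shape[1]
    kern = lambda u, v: (u @ v.T / d + 1.0) ** degree
    k_aa, k_bb, k_ab = kern(feats_a, feats_a), kern(feats_b, feats_b), kern(feats_a, feats_b)
    m, n = len(feats_a), len(feats_b)
    term_aa = (k_aa.sum() - np.trace(k_aa)) / (m * (m - 1))
    term_bb = (k_bb.sum() - np.trace(k_bb)) / (n * (n - 1))
    return float(term_aa + term_bb - 2.0 * k_ab.mean())
```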
KID and FID results are presented in Table IV. The large domain gap between source and target domains, as measured by both KID and FID, is greatly reduced after harmonization. Compared to baselines, our method achieves significantly improved KID on both MS-SEG and RETOUCH, and comparable KID on IXI. For FID, our method showed consistent improvements on IXI over all compared baselines. We omit FID on RETOUCH and MS-SEG as the FID estimator exhibits strong bias for small sample sizes [43].
TABLE IV:
Kernel (KID) and Fréchet (FID) Inception distances before and after harmonization (lower is better). Given that the FID estimator is strongly biased when the sample size is small [43], we skip FID evaluation on RETOUCH and MS-SEG.
| KID (↓) \ Dataset | MS-SEG (SA → SV) | MS-SEG (SA → PI) | MS-SEG (SV → PI) | RETOUCH | IXI | FID (↓) \ Dataset | IXI |
|---|---|---|---|---|---|---|---|
| Source, Target | 2.37±0.17 | 0.370±7e-4 | 3.09±1e-2 | 4.65±1e-6 | 1.140±4e-3 | Source, Target | 385.03 |
| Harmonized, Target [8] | 1.28±0.01 | 0.034±5e-4 | 0.18±3e-3 | 1.81±1e-6 | 0.009±2e-4 | Harmonized, Target [8] | 54.9 |
| Harmonized, Target [15] | 2.67±0.25 | 0.024±3e-4 | 0.87±2e-2 | 2.56±1e-6 | 0.013±4e-4 | Harmonized, Target [15] | 4.66 |
| Harmonized, Target [16] | - | - | - | 1.87±1e-6 | 0.014±5e-4 | Harmonized, Target [16] | 5.02 |
| Harmonized, Target (Ours) | 0.48±0.02 | 0.024±3e-4 | 0.05±3e-3 | 0.83±1e-6 | 0.010±2e-4 | Harmonized, Target (Ours) | 3.32 |
E. Post-harmonization segmentation accuracy results
If an image is correctly translated, then a segmentation network trained on the original target domain images should generalize to the harmonized images. Therefore, we assess the downstream utility of the harmonized images by applying domain-specific segmentation networks (trained separately post-hoc on the original domains, outside of any translation framework) to the harmonized images produced by each compared method. Dice Coefficient (DC) and Intersection over Union (IoU) were used as criteria to assess segmentation performance and thus image harmonization.
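The two overlap scores are computed per class as in the following sketch (binary masks; a small constant avoids division by zero on empty masks):

```python
import numpy as np


def dice_and_iou(pred: np.ndarray, ref: np.ndarray, eps: float = 1e-8):
    """Dice coefficient and Intersection over Union for one binary class mask."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    intersection = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    dice = 2.0 * intersection / (pred.sum() + ref.sum() + eps)
    iou = intersection / (union + eps)
    return float(dice), float(iou)
```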
Figs. 6, 7, 8 show qualitative comparisons of post-hoc segmentation performance on the harmonized images generated by all methods. Our harmonized images were accurately segmented by the networks trained on the original target domain images, whereas baseline methods failed due to sub-optimal translation patterns such as contrast inversion (Fig. 6 B, D) and semantic loss (Fig. 6 C showing the disappearance of a lesion) and generally result in high segmentation error with false positive (examples marked with white arrows) or false negative predictions (examples marked by yellow arrows). In Fig. 8 where labels are densely annotated, we observe minor improvements on the segmentation continuity especially within white matter (see blue labels within yellow insets).
Quantitatively, DC and IoU are shown in Table V for all datasets. As baseline upper bounds, we first performed in-domain evaluation (i.e., without harmonization) using two segmentation networks trained on the source and target domains, denoted as ‘Source Segmentor→Source Image’ and ‘Target Segmentor→Target Image’. For MS-SEG, we trained three segmentation networks for the three domains. We then performed a lower-bound cross-domain test by segmenting the source images with the target-domain-trained network without harmonization (‘Target Segmentor→Source Image’), observing the expected large performance drop across all scores due to domain shift. When using this target-domain-trained network to segment harmonized images (‘Target Segmentor→Harmonized Image’), we expect performance similar to ‘Source Segmentor→Source Image’, given that the Harmonized Image is generated from the Source Image. Harmonized images produced by our method achieved higher-quality segmentation than the other baselines in the vast majority of evaluated settings, indicating higher downstream utility via a smaller domain gap to the target domain.
TABLE V:
Post-hoc segmentation results on the 3 settings of the MS-SEG (lesion) dataset, the IXI MRI dataset (CSF: cerebrospinal fluid, GM: grey matter, WM: white matter), and the RETOUCH OCT dataset (IRF: intraretinal fluid, SRF: subretinal fluid, PED: pigment epithelial detachment). A→B indicates that the segmentation network is trained on domain A and tested on domain B. DC and IoU are Dice coefficient and Intersection over Union, respectively.
| MS-SEG Results (Leave-one-subject-out cross-validation) | SA → SV DC (↑) | SA → SV IoU (↑) | SA → PI DC (↑) | SA → PI IoU (↑) | SV → PI DC (↑) | SV → PI IoU (↑) |
|---|---|---|---|---|---|---|
| Source Segmentor → Source Image | 0.53 | 0.39 | 0.53 | 0.39 | 0.60 | 0.45 |
| Target Segmentor → Target Image | 0.60 | 0.45 | 0.57 | 0.41 | 0.57 | 0.41 |
| Target Segmentor → Source Image | 0.37 | 0.26 | 0.33 | 0.23 | 0.43 | 0.30 |
| Target Segmentor → Harmonized Image (CycleGAN [8]) | 0.42 | 0.30 | 0.31 | 0.19 | 0.26 | 0.18 |
| Target Segmentor → Harmonized Image (S-CycleGAN [15]) | 0.40 | 0.29 | 0.39 | 0.28 | 0.22 | 0.15 |
| Target Segmentor → Harmonized Image (Ours) | 0.50 | 0.38 | 0.49 | 0.35 | 0.51 | 0.37 |
| RETOUCH OCT Results | IRF DC (↑) | IRF IoU (↑) | SRF DC (↑) | SRF IoU (↑) | PED DC (↑) | PED IoU (↑) |
|---|---|---|---|---|---|---|
| Source Segmentor → Source Image | 0.71 | 0.55 | 0.54 | 0.42 | 0.55 | 0.40 |
| Target Segmentor → Target Image | 0.72 | 0.56 | 0.70 | 0.54 | 0.11 | 0.06 |
| Target Segmentor → Source Image | 0.63 | 0.47 | 0.54 | 0.41 | 0.34 | 0.24 |
| Target Segmentor → Harmonized Image (CycleGAN [8]) | 0.58 | 0.41 | 0.57 | 0.43 | 0.51 | 0.40 |
| Target Segmentor → Harmonized Image (S-CycleGAN [15]) | 0.59 | 0.42 | 0.56 | 0.43 | 0.51 | 0.40 |
| Target Segmentor → Harmonized Image (SemGAN [16]) | 0.58 | 0.42 | 0.55 | 0.42 | 0.51 | 0.41 |
| Target Segmentor → Harmonized Image (Proposed) | 0.67 | 0.51 | 0.62 | 0.48 | 0.52 | 0.43 |
| IXI MRI Results | CSF DC (↑) | CSF IoU (↑) | GM DC (↑) | GM IoU (↑) | WM DC (↑) | WM IoU (↑) |
|---|---|---|---|---|---|---|
| Source Segmentor → Source Image | 0.90 | 0.83 | 0.91 | 0.83 | 0.88 | 0.81 |
| Target Segmentor → Target Image | 0.90 | 0.83 | 0.89 | 0.82 | 0.88 | 0.81 |
| Target Segmentor → Source Image | 0.85 | 0.75 | 0.82 | 0.72 | 0.77 | 0.70 |
| Target Segmentor → Harmonized Image (CycleGAN [8]) | 0.79 | 0.66 | 0.77 | 0.64 | 0.73 | 0.64 |
| Target Segmentor → Harmonized Image (S-CycleGAN [15]) | 0.61 | 0.46 | 0.64 | 0.48 | 0.71 | 0.61 |
| Target Segmentor → Harmonized Image (SemGAN [16]) | 0.80 | 0.68 | 0.76 | 0.64 | 0.72 | 0.64 |
| Target Segmentor → Harmonized Image (Ours) | 0.85 | 0.75 | 0.84 | 0.74 | 0.81 | 0.73 |
F. Sensitivity to self-adversarial attacks
As translation between domains with differing amounts of structural information is ill-posed, a translation GAN trained under cycle-consistency may learn mappings that are highly susceptible to noise as shown in [13], [14], where a complete collapse of cycle-consistent reconstruction is observed when low-amplitude noise is added to the intermediate translation. This susceptibility has been linked to the generator learning to inject human-imperceptible high-frequency noise into the translations as an adversarial attack on the discriminator. Therefore, despite the visual appeal of GAN-translations, their adoption for tasks in real-world medical imaging is still limited by these fragile generator mappings which make their translations unreliable for downstream tasks. This problem is exacerbated for medical image harmonization, where the translations are typically further processed by other algorithms for surface extraction, lesion detection, etc.
We hypothesize that the proposed segmentation-renormalization empirically improves robustness to such attacks via semantic regularization. We therefore evaluated the sensitivity of the generator networks to self-adversarial attacks by adding zero-mean Gaussian noise with increasing standard deviation σ to the intermediate translation, as in [14], and measured the cycle-reconstruction error (lower is better) as in Eq. 6, where N is the number of 2D slices,

$$\mathcal{S}_{MSE}(\sigma) = \frac{1}{N}\sum_{i=1}^{N}\big\lVert F\big(G(x_i) + \epsilon_i\big) - x_i \big\rVert_2^2, \qquad \epsilon_i \sim \mathcal{N}(0,\, \sigma^2 I). \tag{6}$$
We further propose to use the structural similarity index (SSIM) to evaluate reconstruction quality (higher is better), as it correlates better with human perception than MSE [46],
$$\mathcal{S}_{SSIM}(\sigma) = \frac{1}{N}\sum_{i=1}^{N}\mathrm{SSIM}\big(F\big(G(x_i) + \epsilon_i\big),\, x_i\big), \qquad \epsilon_i \sim \mathcal{N}(0,\, \sigma^2 I). \tag{7}$$
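The two sensitivity scores can be estimated as in the sketch below, where G and F are assumed to wrap the trained generators and operate on 2D slices scaled to [0, 1]:

```python
import numpy as np
from skimage.metrics import structural_similarity


def sensitivity_scores(x_slices, G, F, sigma: float):
    """Cycle-reconstruction MSE (Eq. 6) and SSIM (Eq. 7) under Gaussian perturbation
    of the intermediate translation."""
    mse_vals, ssim_vals = [], []
    for x in x_slices:
        y_hat = G(x)                                              # harmonized slice
        noisy = y_hat + np.random.normal(0.0, sigma, size=y_hat.shape)
        x_rec = F(noisy)                                          # cycle reconstruction
        mse_vals.append(np.mean((x_rec - x) ** 2))
        ssim_vals.append(structural_similarity(x_rec, x, data_range=1.0))
    return float(np.mean(mse_vals)), float(np.mean(ssim_vals))
```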
Fig. 9a shows the quantitative effects of linearly increasing σ from 0 to 0.5, where we found that the proposed model outperforms the compared models across all datasets in terms of both $\mathcal{S}_{MSE}$ and $\mathcal{S}_{SSIM}$. Interestingly, we found that S-CycleGAN and SemGAN are more sensitive to noise than the baseline CycleGAN, whereas our approach yields improved SSIM and MSE across all datasets.
Fig. 9:
(a) Sensitivity curves of cycle-consistent reconstruction in terms of mean-squared-error (top row, lower is better) and SSIM (bottom row, higher is better) under increasing zero-mean Gaussian perturbation. (a) MS-SEG (SA → SV), (b) MS-SEG (SA → PI), (c) MS-SEG (SV → PI), (d) RETOUCH, (e) IXI.
(b) Reconstructions from CycleGAN, S-CycleGAN, and our method on the MS-SEG dataset, under increasing perturbation to the intermediate translation. Columns 4–6 show reconstruction results with different levels of Gaussian noise injection with zero mean and standard deviation 0.1, 0.2, and 0.5 applied to the harmonized/translated images. Our method demonstrates increased robustness to perturbation over competing methods.
Qualitatively, in Fig. 9b we observed a sharp decline in the perceptual quality of the reconstructions as the noise variance increases. However, the proposed method maintained consistency with its input to a perceptually larger extent as compared to S-CycleGAN and CycleGAN as shown in the rightmost three columns of Fig. 9b. These results illustrate that fine-grained control over the network features via linear scales and shifts based on segmentation empirically makes the translation more robust to self-adversarial attacks which may yield improved downstream task performance.
IV. Discussion
We present an anatomically-regularized unpaired image-to-image translation framework with a novel segmentation-renormalization. As quantified by improved KID/FID scores, our method reduces image batch variability between source and target domains across diverse imaging modalities, while also proving to be more effective for downstream tasks such as structural or lesion segmentation as compared to existing translation methods. Further, as cycle-consistent GANs may produce translations that are corrupted by imperceptible high-frequency self-adversarial noise, we evaluated the sensitivity of all methods to this phenomenon to assess the downstream utility of their translations. We find that our proposed framework outperforms all evaluated baselines, closing the gap towards reliable real-world medical image translation adopted in future studies and biomedical practice.
Some open issues exist and will be addressed in future work,
We assumed stable subject demographics across scanners such that the harmonization experiments focused on imaging batch effects as opposed to biological batch effects. However, image harmonization between two different groups (e.g., MRI with scanner A imaging neonatal brains and scanner B imaging pathological adult brains) may be contraindicated. While the proposed model preserves subject-level morphology via individual segmentation information, retaining group-level differences (e.g., age) is not directly addressed. In general, removal of non-biological batch effects while retaining biological batch effects is difficult statistically for even scalar measurements [47] and such an extension for image translation GANs will be explored in future work.
The presented framework was developed in 2D for general modality-agnostic applicability. For example, OCT images are highly anisotropic thereby making 3D networks inapplicable. Sequential application of 2D translators to slices from nearly-isotropic volumes (e.g., MRI) may, at times, yield volumetric translations with slice-wise intensity inconsistencies. However, slice consistency can be addressed in a straightforward manner via post-processing extensions such as multi-view fusion [48], [49] which will be incorporated into future work.
We use a pseudo-ground truth obtained algorithmically [37] in the IXI experiments instead of dense expert annotations which are infeasible to obtain for hundreds of volumes. However, when we do have expert labels (RETOUCH and MS-SEG), we still observe a strong increase in post-harmonization segmentation performance, indicating translation improvements that are relatively insensitive to segmentation quality. We will explore more effective dense whole-brain segmentation approaches such as multi-atlas methods [50].
We conducted all experiments on public datasets without paired data (i.e. the same subject scanned on two devices). Therefore, qualitative and quantitative comparisons need careful interpretation and analysis. We use surrogate scores of translation quality based on fidelity, distribution matching, segmentation, and robustness in this work. We believe that ideal harmonization validation should include paired held-out subjects scanned in both domains. However, to our knowledge, no such large-scale database of subjects on multiple scanners currently exists.
In summary, our proposed harmonization method improves on previous cycle-consistent adversarial methods, reduces batch effects in multi-center imaging studies, and enables the introduction of large amounts of legacy data into new studies. The presented methodology is fully generic and can be applied in various image translation tasks and is architecture-agnostic, requiring only that the network use normalization layers.
Acknowledgments
This work was supported by National Institutes of Health under grants 1R01DA038215-01A1, R01-HD055741-12, 1R01HD088125-01A1, 1R01MH118362-01, R01EB021391, 2R01EY013178-15, R01EY030770-01A1, R01ES032294, R01MH122447, and 1R34DA050287.
V. Appendix
A. Multitask autoencoder for FID/KID
The multitask autoencoder described in Fig. 10 was trained on data from both source and target domains with a binary cross entropy loss for the domain classification branch and an L1 loss for the image reconstruction branch. We used the Adam optimizer with a batch size of 64. Network weights were initialized from N(0,0.02). The learning rate of the network was set to 0.001 for the first 20 epochs and then linearly decayed to zero over the next 100 epochs.
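A sketch of the multitask objective used to train this extractor; the relative weighting of the two branches is an assumption, as it is not reported above:

```python
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # domain-classification branch (source vs. target)
l1 = nn.L1Loss()               # image-reconstruction branch


def autoencoder_loss(recon, image, domain_logit, domain_label, alpha: float = 1.0):
    """Multitask loss: L1 reconstruction plus binary domain classification."""
    return l1(recon, image) + alpha * bce(domain_logit, domain_label)
```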
Fig. 10:

Network details of the multitask autoencoder for feature extraction. For IXI, the input/output resolutions are 128 × 128 × 1 and AvgPool(32,32) was replaced by AvgPool(64,64) to accommodate for the output feature size.
Footnotes
References
- [1].Pomponio R et al. , “Harmonization of large MRI datasets for the analysis of brain imaging patterns throughout the lifespan,” NeuroImage, vol. 208, p. 116450, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [2].Garcia-Dias R et al. , “Neuroharmony: A new tool for harmonizing volumetric MRI data from unseen scanners,” NeuroImage, vol. 220, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [3].Blumberg SB, Palombo M, Khoo CS, Tax CMW, Tanno R, and Alexander DC, “Multi-stage prediction networks for data harmonization,” in Medical Image Computing and Computer Assisted Intervention. Cham: Springer International Publishing, 2019, pp. 411–419. [Google Scholar]
- [4].Johnson WE, Li C, and Rabinovic A, “Adjusting batch effects in microarray expression data using empirical Bayes methods,” Biostatistics, vol. 8, no. 1, pp. 118–127, 04 2006. [DOI] [PubMed] [Google Scholar]
- [5].Fortin J-P et al. , “Harmonization of cortical thickness measurements across scanners and sites,” NeuroImage, vol. 167, pp. 104–120, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [6].Dewey BE, Zhao C, Carass A, Oh J, Calabresi PA, van Zijl PCM, and Prince JL, “Deep harmonization of inconsistent MR data for consistent volume segmentation,” in Simulation and Synthesis in Medical Imaging. Cham: Springer International Publishing, 2018, pp. 20–30. [Google Scholar]
- [7].Dewey BE et al. , “DeepHarmony: A deep learning approach to contrast harmonization across scanner changes,” Magnetic Resonance Imaging, vol. 64, pp. 160–170, 2019, artificial Intelligence in MRI. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [8].Zhu J-Y, Park T, Isola P, and Efros AA, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in The IEEE International Conference on Computer Vision (ICCV), Oct 2017. [Google Scholar]
- [9].Seeböck P et al., “Using CycleGANs for effectively reducing image variability across OCT devices and improving retinal fluid segmentation,” in IEEE 16th International Symposium on Biomedical Imaging, 2019. [Google Scholar]
- [10].Zhao F et al. , “Harmonization of infant cortical thickness using surface-to-surface cycle-consistent adversarial networks,” in Medical Image Computing and Computer Assisted Intervention – MICCAI, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].Zhang R, Pfister T, and Li J, “Harmonic Unpaired Image-to-image Translation,” in International Conference on Learning Representations, 2018. [Google Scholar]
- [12].Cohen JP, Luck M, and Honari S, “Distribution matching losses can hallucinate features in medical image translation,” in International conference on medical image computing and computer-assisted intervention. Springer, 2018, pp. 529–536. [Google Scholar]
- [13].Chu C, Zhmoginov A, and Sandler M, “CycleGAN, a master of steganography,” arXiv preprint arXiv:1712.02950, 2017. [Google Scholar]
- [14].Bashkirova D, Usman B, and Saenko K, “Adversarial self-defense for cycle-consistent GANs,” NeurIPS, pp. 635–645, 2019. [Google Scholar]
- [15].Zhang Z, Yang L, and Zheng Y, “Translating and segmenting multimodal medical volumes with cycle- and shape-consistency generative adversarial network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. [Google Scholar]
- [16].Cherian A and Sullivan A, “Sem-GAN: Semantically-consistent image-to-image translation,” in IEEE Winter Conference on Applications of Computer Vision, 2019. [Google Scholar]
- [17].Mo S, Cho M, and Shin J, “Instance-aware image-to-image translation,” in International Conference on Learning Representations, 2019. [Google Scholar]
- [18].Huo Y, Xu Z, Bao S, Assad A, Abramson RG, and Landman BA, “Adversarial synthesis learning enables segmentation without target modality ground truth,” in 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018). IEEE, 2018, pp. 1217–1220. [Google Scholar]
- [19].Chen C, Dou Q, Chen H, Qin J, and Heng PA, “Unsupervised bidirectional cross-modality adaptation via deeply synergistic image and feature alignment for medical image segmentation,” IEEE Transactions on Medical Imaging, 2020. [DOI] [PubMed] [Google Scholar]
- [20].Park T, Liu M-Y, Wang T-C, and Zhu J-Y, “Semantic image synthesis with spatially-adaptive normalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2019. [Google Scholar]
- [21].Huang X and Belongie SJ, “Arbitrary style transfer in real-time with adaptive instance normalization,” CoRR, vol. abs/1703.06868, 2017. [Online]. Available: http://arxiv.org/abs/1703.06868 [Google Scholar]
- [22].Perez E, Strub F, De Vries H, Dumoulin V, and Courville A, “FiLM: Visual Reasoning with a General Conditioning Layer,” in AAAI Conference on Artificial Intelligence, 2018. [Google Scholar]
- [23].Kayhan OS and Gemert J. C. v., “On translation invariance in cnns: Convolutional layers can exploit absolute spatial location,” in IEEE Conference on Computer Vision and Pattern Recognition, 2020. [Google Scholar]
- [24].Zhang R, “Making Convolutional Networks Shift-Invariant Again,” ser. Proceedings of Machine Learning Research, vol. 97. PMLR, 2019. [Google Scholar]
- [25].He K, Zhang X, Ren S, and Sun J, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. [Google Scholar]
- [26].Chartsias A, Papanastasiou G, Semple S, Williams M, Newby D, Dharmakumar R, and Tsaftaris S, “Disentangled representation learning in cardiac image analysis,” Medical Image Analysis, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [27].Jacenków G, O’Neil AQ, Mohr B, and Tsaftaris SA, “Inside: Steering spatial attention with non-imaging information in cnns,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2020. Cham: Springer International Publishing, 2020, pp. 385–395. [Google Scholar]
- [28].Ronneberger O, Fischer P, and Brox T, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241. [Google Scholar]
- [29].Rolnick D, Veit A, Belongie S, and Shavit N, “Deep learning is robust to massive label noise,” arXiv preprint arXiv:1705.10694, 2017. [Google Scholar]
- [30].Miyato T, Kataoka T, Koyama M, and Yoshida Y, “Spectral Normalization for Generative Adversarial Networks,” CoRR, vol. abs/1802.05957, 2018. [Online]. Available: http://arxiv.org/abs/1802.05957 [Google Scholar]
- [31].Mao X, Li Q, Xie H, Lau RY, Wang Z, and Paul Smolley S, “Least squares generative adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2794–2802. [Google Scholar]
- [32].“IXI brain database,” http://brain-development.org/ixi-dataset/, accessed: 2020–03-14.
- [33].Commowick O et al. , “Objective Evaluation of Multiple Sclerosis Lesion Segmentation using a Data Management and Processing Infrastructure,” Scientific Reports, vol. 8, no. 1, p. 13650, December. 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [34].Bogunović H et al., “RETOUCH - The Retinal OCT Fluid Detection and Segmentation Benchmark and Challenge,” IEEE Transactions on Medical Imaging, vol. 38, no. 8, pp. 1858–1874, August 2019. [DOI] [PubMed] [Google Scholar]
- [35].Fonov V, Evans A, Mckinstry R, Almli C, and Collins L, “Unbiased nonlinear average age-appropriate brain templates from birth to adulthood,” Neuroimage, vol. 47, July 2009. [Google Scholar]
- [36].Iglesias JE, Liu C, Thompson PM, and Tu Z, “Robust brain extraction across datasets and comparison with publicly available methods,” IEEE Transactions on Medical Imaging, vol. 30, no. 9, pp. 1617–1634, September. 2011. [DOI] [PubMed] [Google Scholar]
- [37].Zhang Y, Brady M, and Smith S, “Segmentation of brain MR images through a hidden markov random field model and the expectation-maximization algorithm,” IEEE Transactions on Medical Imaging, vol. 20, no. 1, pp. 45–57, January 2001. [DOI] [PubMed] [Google Scholar]
- [38].Dalca AV, Guttag J, and Sabuncu MR, “Anatomical priors in convolutional networks for unsupervised biomedical segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9290–9299. [Google Scholar]
- [39].“CycleGAN Implementation,” https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix, accessed: 2020–08-19.
- [40].Crum WR, Camara O, and Hill DLG, “Generalized overlap measures for evaluation and validation in medical image analysis,” IEEE Transactions on Medical Imaging, vol. 25, no. 11, pp. 1451–1461, 2006. [DOI] [PubMed] [Google Scholar]
- [41].Wang S-Y, Wang O, Zhang R, Owens A, and Efros AA, “Cnn-generated images are surprisingly easy to spot... for now,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. [Google Scholar]
- [42].Heusel M, Ramsauer H, Unterthiner T, Nessler B, and Hochreiter S, “GANs trained by a two time-scale update rule converge to a local nash equilibrium,” 2017. [Google Scholar]
- [43].Bińkowski M, Sutherland DJ, Arbel M, and Gretton A, “Demystifying MMD GANs,” in International Conference on Learning Representations, 2018. [Google Scholar]
- [44].Szegedy C, Vanhoucke V, Ioffe S, Shlens J, and Wojna Z, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. [Google Scholar]
- [45].Le L, Patterson A, and White M, “Supervised autoencoders: Improving generalization performance with unsupervised regularizers,” in Advances in Neural Information Processing Systems, 2018, pp. 107–117. [Google Scholar]
- [46].Wang Z, Bovik AC, Sheikh HR, and Simoncelli EP, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004. [DOI] [PubMed] [Google Scholar]
- [47].Nygaard V, Rødland EA, and Hovig E, “Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses,” Biostatistics, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [48].Schilling KG, Blaber J, Huo Y, Newton A, Hansen C, Nath V, Shafer AT, Williams O, Resnick SM, Rogers B et al. , “Synthesized b0 for diffusion distortion correction (synb0-disco),” Magnetic resonance imaging, vol. 64, pp. 62–70, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [49].Yun J, Lee M, Park H, Lee J, Seo J, and Namkug, “Improvement of fully automated airway segmentation on computed tomographic images using 2.5 d and 3d convolutional neural net,” Medical Image Analysis, vol. 51, pp. 13–20, January 2019. [DOI] [PubMed] [Google Scholar]
- [50].Wang H, Suh JW, Das SR, Pluta JB, Craige C, and Yushkevich PA, “Multi-atlas segmentation with joint label fusion,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 3, pp. 611–623, 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]

