. 2025 Mar 18;52(6):4299–4317. doi: 10.1002/mp.17757

Histogram matching‐enhanced adversarial learning for unsupervised domain adaptation in medical image segmentation

Xiaoxue Qian 1, Hua‐Chieh Shao 1, Yunxiang Li 1, Weiguo Lu 1, You Zhang 1
PMCID: PMC12149727  PMID: 40102198

Abstract

Background

Unsupervised domain adaptation (UDA) seeks to mitigate the performance degradation of deep neural networks when applied to new, unlabeled domains by leveraging knowledge from source domains. In medical image segmentation, prevailing UDA techniques often utilize adversarial learning to address domain shifts for cross‐modality adaptation. Current research on adversarial learning tends to adopt increasingly complex models and loss functions, making the training process highly intricate and less stable/robust. Furthermore, most methods focus primarily on segmentation accuracy while neglecting the associated confidence levels and uncertainties.

Purpose

To develop a simple yet effective UDA method based on histogram matching‐enhanced adversarial learning (HMeAL‐UDA), and provide comprehensive uncertainty estimations of the model predictions.

Methods

Aiming to bridge the domain gap while reducing the model complexity, we developed a novel adversarial learning approach to align multi‐modality features. The method, termed HMeAL‐UDA, integrates a plug‐and‐play histogram matching strategy to mitigate domain‐specific image style biases across modalities. We employed adversarial learning to constrain the model in the prediction space, enabling it to focus on domain‐invariant features during segmentation. Moreover, we quantified the model's prediction confidence using Monte Carlo (MC) dropouts to assess two voxel‐level uncertainty estimates of the segmentation results, which were subsequently aggregated into a volume‐level uncertainty score, providing an overall measure of the model's reliability. The proposed method was evaluated on three public datasets (Combined Healthy Abdominal Organ Segmentation [CHAOS], Beyond the Cranial Vault [BTCV], and Abdominal Multi‐Organ Segmentation Challenge [AMOS]) and one in‐house clinical dataset (UTSW). We used 30 MRI scans (20 from the CHAOS dataset and 10 from the in‐house dataset) and 30 CT scans from the BTCV dataset for UDA‐based, cross‐modality liver segmentation. Additionally, 240 CT scans and 60 MRI scans from the AMOS dataset were utilized for cross‐modality multi‐organ segmentation. The training and testing sets for each modality were split with ratios of approximately 4:1–3:1.

Results

Extensive experiments on cross‐modality medical image segmentation demonstrated the superiority of HMeAL‐UDA over two state‐of‐the‐art approaches. HMeAL‐UDA achieved a mean (± s.d.) Dice similarity coefficient (DSC) of 91.34% ± 1.23% and an HD95 of 6.18 ± 2.93 mm for cross‐modality (from CT to MRI) adaptation of abdominal multi‐organ segmentation, and a DSC of 87.13% ± 3.67% with an HD95 of 2.48 ± 1.56 mm for segmentation adaptation in the opposite direction (MRI to CT). These results approach, or even outperform, those of supervised methods trained with “ground‐truth” labels in the target domain. In addition, we provide a comprehensive assessment of the model's uncertainty, which can aid the understanding of segmentation reliability to guide clinical decisions.

Conclusion

HMeAL‐UDA provides a powerful segmentation tool to address cross‐modality domain shifts, with the potential to generalize to other deep learning applications in medical imaging.

Keywords: adversarial learning, cross‐modality medical image segmentation, histogram matching, unsupervised domain adaptation (UDA)

1. INTRODUCTION

Learning generalized data representations is one of the key goals of deep neural networks, but it relies heavily on extensive labeled data for training. 1 , 2 , 3 In medical image segmentation, obtaining pixel‐wise labels can be time‐consuming, labor‐intensive, and often impractical. 4 , 5 Furthermore, most networks assume that the training and test data are drawn from the same domain and share similar joint probability distributions. 6 , 7 However, this assumption is often violated in clinical applications, as training and test data can originate from different scanner hardware, imaging protocols, performance sites, or subject cohorts, resulting in divergent data distributions and styles. 8 , 9 As a result, when the training dataset (source domain) does not accurately reflect the characteristics of the test dataset (target domain), model performance can degrade significantly. A straightforward remedy is transfer learning with supervised fine‐tuning, 10 where a model pre‐trained on the source domain is retrained on labeled target‐domain data, allowing its parameters to adapt to the new data distribution. However, annotating data for each new domain for supervised learning is costly and often impractical. Alternatively, unsupervised domain adaptation (UDA) offers a more practical and appealing solution: it combines labeled source‐domain data with unlabeled target‐domain data to train a model that generalizes well to the target domain. Correspondingly, various UDA techniques have been proposed, 11 , 12 , 13 aiming to develop networks that leverage the domain‐invariant features shared by the source and target domains to generalize their segmentation capabilities.

The primary objective of domain adaptation is to bridge distributional discrepancies and enhance model generalizability by leveraging knowledge from related domains. For domain adaptation in medical image segmentation, most studies use discrepancy‐based, 13 , 14 reconstruction‐based, 15 , 16 and adversarial‐based 17 , 18 , 19 methods to minimize the impact of the domain gap on model performance. Discrepancy‐based methods aim to reduce domain shifts by minimizing feature differences between the source and target domains in cross‐modality image segmentation. To achieve this, various metrics have been introduced to quantify distributional differences. By measuring these differences, the model can iteratively adjust and better align the feature distributions, enabling effective segmentation performance on unlabeled target data. Common metrics include maximum mean discrepancy, 20 Kullback–Leibler divergence, 21 and Wasserstein distance. 22 The efficacy of discrepancy‐based methods is highly dependent on the choice of discrepancy metrics and the actual domain distribution discrepancy. Reconstruction‐based domain adaptation methods for image segmentation, such as SAE 14 and SM‐VAE, 15 map the source and target domains into a shared latent space and minimize reconstruction errors for both domains in that shared space to learn domain‐invariant features. The aligned latent space enhances the generalization of the segmentation model. However, these methods may struggle to capture generalized representations based solely on the reconstruction error, especially when handling complex domain shifts. For UDA in medical image segmentation, many studies leverage adversarial learning, either to generate target‐like images from labeled source‐domain images to train the segmentation model, or to directly guide the model to focus on domain‐invariant information. Adversarial learning typically involves two competing models: a generator and a discriminator.
At Nash equilibrium, the discriminator can no longer distinguish between real and generated samples, while the generator has learned to extract domain‐invariant features. 16 , 17 These characteristics make adversarial learning more flexible and effective than discrepancy‐based and reconstruction‐based methods for domain adaptation in image segmentation.

Adversarial learning‐based domain adaptation methods tend to focus on image‐level 18 and feature‐level 19 alignment through adversarial training to learn generalizable representations. Image‐level alignment uses image‐to‐image (I2I) transformation models to transfer the image appearance between domains for generating target‐like images, followed by training a segmentation network on the translated images. The cycle‐consistent generative adversarial network with cycle‐consistency constraints 20 and contrastive unpaired image translation network with contrastive losses 21 are the two most commonly‐used I2I networks in domain adaptation. Feature‐level alignment applies adversarial learning directly to the feature or output spaces of source and target domains, suppressing domain‐specific information and emphasizing domain‐invariant representation. 22 Dou et al. tackled the domain shift by an adversarial domain adaptation network for aligning the feature spaces of source and target domains at multiple scales in an unsupervised manner. 19 Tsai et al. introduced AdaptSeg, 23 which incorporated two adaptation modules to perform output space adversarial learning at multiple levels. Yan et al. incorporated the Canny edge detector into an adversarial learning‐based adaptation network, enhancing the model's attention to edges during adversarial training. 24 Since feature and image alignment can be complementary, a variety of works, such as synergistic image and feature alignment (SIFA) 25 and symmetric domain adaptation (SymDA), 26 have exploited alignment of domains from both image and feature perspectives to improve domain adaptation performance. Other studies further integrated discrepancy‐based, reconstruction‐based, adversarial‐based, and disentanglement learning 27 , 28 for a more complicated domain adaptation strategy. 
However, existing methods often involve cumbersome loss functions and multiple rounds of adversarial learning, leading to convoluted and inefficient training, especially for UDA. Continuously increasing model and loss function complexity to enhance domain adaptation performance has several drawbacks. First, these approaches introduce multiple interdependent hyper‐parameters, requiring careful adjustments to achieve optimal performance. Second, the training process becomes increasingly intricate and prone to instability, leading to higher computational demands and challenges in model convergence. Third, redundant model complexity may weaken model generalizability. Finally, it renders the model less robust or reproducible, ultimately affecting its practical applications.

To address the above challenges, we introduced a plug‐and‐play histogram matching strategy to reduce the domain gap during adversarial learning, replacing complicated model or loss function designs. Experimental results have indicated that adversarial learning is more effective with a smaller domain gap than with a larger one; 11 , 12 , 13 thus, introducing histogram matching helps to pre‐align the two domains to facilitate adversarial learning. Histogram matching 29 , 30 is a classical image processing technique that adjusts the pixel intensity distribution of one image to match the histogram of another, thereby making the two images more similar in terms of visual characteristics. It has been widely applied in the field of medical image processing for style transfer and image enhancement. 31 , 32 , 33 Different histogram matching strategies, for example, randomized histogram matching 34 and batch momentum update‐based histogram matching, 35 have been used to suppress domain‐specific visual characteristics, such as brightness, contrast, texture, and appearance. In this study, we used histogram matching to adjust the voxel values of a source‐domain image, aligning its histogram with that of the target‐domain image. This adjustment enabled the histogram‐matched source‐domain volumes to exhibit an image style similar to that of the target‐domain volumes in terms of intensity distribution. Subsequently, we used the histogram‐matched volumes instead of the original source‐domain volumes in the adversarial learning, significantly narrowing the domain gap and enhancing the performance of adversarial learning for cross‐modality medical images.

Additionally, most studies on segmentation methods focus on model performance evaluation, ignoring the confidence or uncertainty of model predictions, which can be essential information to guide clinical applications. 36 , 37 Uncertainty estimation can serve as an indicator of algorithmic robustness, reflecting the reliability of segmentation results in the absence of “ground‐truth” data. 37 , 38 , 39 Previous studies have explored two types of uncertainty: aleatoric (data) uncertainty and epistemic (model) uncertainty, 40 which can be estimated from the distribution of the network's weights and its predictions. Monte Carlo (MC) dropout has been demonstrated to approximate the Bayesian posterior of the network's weight distribution, making it widely used for uncertainty estimation across various models. 41 Therefore, in this paper, we estimated two voxel‐level uncertainties based on the MC dropout, as well as a volume‐level uncertainty score derived from the voxel‐level uncertainties, to quantify the reliability of a segmentation model's predictions.

In summary, we proposed a simple, yet effective UDA method based on histogram matching‐enhanced adversarial learning (HMeAL), and provided uncertainty estimations to assess the confidence of model predictions. The primary contributions can be summarized as follows:

  1. We proposed a novel adversarial learning framework (HMeAL‐UDA) to improve the performance of UDA in cross‐modality medical image segmentation.

  2. We utilized a histogram matching strategy to minimize the style bias between different modalities, reducing the domain gap and enhancing the effectiveness of adversarial learning.

  3. We provided two voxel‐level uncertainty estimations and an aggregated uncertainty score to assess the confidence of segmentation predictions, which can be incorporated into clinical workflows to guide decision‐making.

2. MATERIALS AND METHODS

2.1. Overview of HMeAL‐UDA

Given the source domain data X_s with corresponding labels Y_s, and the target domain data X_t without labels, the objective of UDA is to develop a model that performs well on the target domain. Figure 1 illustrates an overview of the proposed HMeAL‐UDA framework for addressing the domain gaps in medical image segmentation. HMeAL‐UDA is composed of two core components: adversarial learning and histogram matching. The adversarial learning was designed to align source domain predictions with target domain predictions, thereby enforcing geometric consistency between source and target domain representations. The histogram matching was employed to transfer the 3D image styles from the source domain to the target domain while preserving the anatomical information of source labels, thus reducing the domain gap to enhance the efficacy of adversarial learning. Details are provided in the following sections.

FIGURE 1.

FIGURE 1

Overview of the HMeAL‐UDA framework for medical image segmentation. The training iterates between two steps: (1) segmentor training and (2) discriminator training. During step 1, the discriminator parameters were frozen and during step 2, the segmentor parameters were not updated. HM: histogram matching.

2.1.1. Adversarial learning

Adversarial learning aims to assist cross‐modality medical image segmentation in addressing the domain gap, focusing on domain‐invariant characteristics through the synergistic optimization of segmentation, adversarial, and discriminator losses. To implement the adversarial learning process, a segmentor, also serving as the generator, is concatenated with a discriminator. The segmentor uses a U‐Net structure, and the discriminator consists of five fully connected layers. During training, the segmentation and adversarial losses encourage the segmentor/generator to learn domain‐invariant features and generate domain‐consistent predictions to confuse the discriminator. Meanwhile, the discriminator is optimized using the discriminator loss to distinguish between the features of source and target predictions. As the segmentor gradually learns to generate domain‐invariant segmentations, the discriminator is progressively challenged to differentiate the predictions from the two domains. This dynamic interaction drives the model toward more robust domain‐invariant representations and predictions, effectively minimizing the domain gap and enhancing generalization on the target domain. Thus, this adversarial learning approach can be formulated as a minimax optimization problem min_G max_D F(G, D), where the generator G works to generate segmentations that minimize the discriminator D's ability to distinguish segmentations between the source and the target domains, while the discriminator seeks to maximize this distinguishing power. F represents the optimization function. This process was implemented by alternately optimizing the segmentor and discriminator in two steps.

In step 1 of the adversarial learning (Figure 1), the source domain data (Xs,Ys) and target domain volumes Xt were fed into the segmentor to generate voxel‐wise mask probabilities. The outputs of source domain volumes were used to calculate the segmentation loss Lseg (where ‘ground‐truth’ label exists), and the outputs of target domain volumes were fed into the discriminator to produce a vector for computing the adversarial loss Ladv, as follows:

L_seg = 1 − 2|y_i^s ∩ G(x_i^s)| / (|y_i^s| + |G(x_i^s)|)   (1)
L_adv = (1/2) E_{x_j^t ∼ X_t}[(D(G(x_j^t)) − 1)^2]   (2)

where L_seg is the Dice loss, (x_i^s, y_i^s) is the i‐th sample from the source domain (X_s, Y_s), x_j^t is the j‐th volume from the target domain X_t, and E_{x_j^t ∼ X_t} denotes the expected value (or mean) calculated over all target domain samples.

The total loss L_G drives the optimization of the segmentor/generator, while the discriminator's parameters are frozen:

L_G = L_seg + λ·L_adv   (3)

where λ = 0.01 is an empirically‐determined weighting parameter to balance the two loss terms.

In step 2, the segmentor's parameters were fixed, and the discriminator's parameters were updated with the discriminator loss LD. In this step, both the outputs of source and target domains were fed into the discriminator to generate two vectors for computing the discriminator loss, as follows:

L_D = (1/2) E_{x_i^s ∼ X_s}[(D(G(x_i^s)) − 1)^2] + (1/2) E_{x_j^t ∼ X_t}[(D(G(x_j^t)) − 0)^2]   (4)

In adversarial learning‐based UDA, the essence lies in the alternating treatment of domain outputs as real or fake to drive the model toward domain‐invariant feature learning. Specifically, in step 1, the output of the target domain is considered real, pushing the generator toward generating domain‐invariant segmentations. In step 2, the target domain is labeled as fake, while the source domain output is treated as real, encouraging the discriminator to recognize the differences of the predictions between the two domains, as shown in Figure 1. By alternately optimizing step 1 and step 2, the segmentor gradually learns to make the discriminator unable to distinguish between segmentations from the source and target domains. This strategy allows the model to acquire more domain‐invariant information, thereby improving segmentation performance in the label‐less target domain for UDA.
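The alternating two-step optimization above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation: the toy `segmentor` and `discriminator` modules, the 2D tensor shapes, and the `dice_loss` helper are stand-ins (the paper's segmentor is a 3D U-Net), but the loss terms follow Equations (1)-(4).

```python
import torch
import torch.nn as nn

# Toy stand-ins for the segmentor/generator G and discriminator D
# (the paper uses a 3D U-Net segmentor; shapes here are illustrative).
segmentor = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1), nn.Sigmoid())
discriminator = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16, 1))

opt_g = torch.optim.Adam(segmentor.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-5)
lam = 0.01  # weighting parameter in Eq. (3)

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice loss, Eq. (1): 1 - 2|y ∩ G(x)| / (|y| + |G(x)|)
    inter = (pred * target).sum()
    return 1 - 2 * inter / (pred.sum() + target.sum() + eps)

x_s = torch.rand(1, 1, 16, 16)                   # (histogram-matched) source volume
y_s = (torch.rand(1, 1, 16, 16) > 0.5).float()   # source label
x_t = torch.rand(1, 1, 16, 16)                   # unlabeled target volume

# Step 1: update the segmentor; discriminator parameters frozen.
for p in discriminator.parameters():
    p.requires_grad_(False)
l_seg = dice_loss(segmentor(x_s), y_s)                           # Eq. (1)
l_adv = 0.5 * (discriminator(segmentor(x_t)) - 1).pow(2).mean()  # Eq. (2)
l_g = l_seg + lam * l_adv                                        # Eq. (3)
opt_g.zero_grad(); l_g.backward(); opt_g.step()

# Step 2: update the discriminator; segmentor not updated.
for p in discriminator.parameters():
    p.requires_grad_(True)
with torch.no_grad():
    pred_s, pred_t = segmentor(x_s), segmentor(x_t)
l_d = 0.5 * (discriminator(pred_s) - 1).pow(2).mean() \
    + 0.5 * (discriminator(pred_t) - 0).pow(2).mean()            # Eq. (4)
opt_d.zero_grad(); l_d.backward(); opt_d.step()
```

In practice, these two steps would repeat once per training iteration over randomly paired source/target volumes.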

2.1.2. Histogram matching

The style differences among cross‐modality medical images are a key feature of domain gaps. To address these style differences and reduce their burden on UDA, we adopted a plug‐and‐play histogram matching module 29 to achieve image style transfer without additional training, which is particularly useful for aligning the intensity, brightness, and contrast of a source image (e.g., MRI) to a target image (e.g., CT). First, we calculated the histograms of both the source and target images, followed by the construction of a mapping function that transforms the source image to approach the target image in intensity, brightness, and contrast, while maintaining the spatial and semantic structures of the source image. In HMeAL‐UDA, we applied histogram matching to transform the source domain images, producing many target‐domain‐like volumes while preserving the original source domain labels. Figure 2 illustrates the histogram matching process, using MRI (source) to CT (target) matching as an example.
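As a minimal illustration of the matching step, the sketch below implements histogram matching as empirical quantile mapping in NumPy. The `histogram_match` helper and random toy volumes are hypothetical; a library routine such as `skimage.exposure.match_histograms` provides equivalent plug‐and‐play functionality.

```python
import numpy as np

def histogram_match(source, target, n_quantiles=256):
    """Map source-volume intensities onto the target volume's intensity
    distribution via empirical quantile mapping; the spatial (anatomical)
    structure of `source` is unchanged."""
    q = np.linspace(0.0, 1.0, n_quantiles)
    src_q = np.quantile(source, q)   # source intensity quantiles
    tgt_q = np.quantile(target, q)   # target intensity quantiles
    # Each source voxel is re-mapped from source quantiles to target quantiles.
    return np.interp(source, src_q, tgt_q)

rng = np.random.default_rng(0)
# Toy stand-ins for a source (e.g., MRI) and target (e.g., CT) volume.
source = rng.normal(100.0, 20.0, size=(8, 32, 32))
target = rng.normal(-50.0, 80.0, size=(8, 32, 32))

matched = histogram_match(source, target)
# `matched` now follows the target's intensity statistics while keeping
# the voxel ordering (and hence anatomy) of `source`.
```

Because the mapping is monotonic in intensity, labels drawn on the source volume remain valid for the matched volume.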

FIGURE 2.

FIGURE 2

The histogram matching workflow from MRI to CT for image style transfer.

As illustrated in Figure 1, we used the matched source volumes (to the target domain) instead of the original source volumes for adversarial learning, yielding the HMeAL‐UDA. During training, in each iteration, a source domain image and a target domain image were randomly selected from their respective datasets. The histogram of the source domain image was matched to the intensity distribution of the target domain image. The histogram matching resulted in a matched image that retained the anatomy/label of the source image while adopting the style/representation of the target image. The matched image was then fed into the network along with the target domain image for adversarial learning. This process was repeated for each iteration until the model converged. By incorporating histogram matching, the domain gap between source and target domains was substantially reduced, mitigating the difficulties faced by the adversarial learning process in capturing domain‐invariant features. On the foundation of histogram‐matched images, adversarial learning further minimized the discrepancies between the source and target domains, and used the labeled source domain data to learn segmentations for the label‐less target domain. As a result, HMeAL‐UDA significantly boosts the performance of cross‐modality medical image segmentation.

2.2. Uncertainty estimation

In this study, we primarily focus on the epistemic uncertainty 38 that reflects a model's reliability when tested on heterogeneous data. To quantify the confidence of model segmentation results, we used MC dropout 41 on HMeAL‐UDA's predictions. MC dropout provides a practical approximation to Bayesian inference by enabling dropout during both training and inference, effectively simulating sampling from the posterior distribution of the network's weights. It offers a feasible method for estimating epistemic uncertainty in deep learning models. Specifically, during testing, a target image volume was put into the segmentor multiple times to generate T predictions as MC samples, which were used to calculate two voxel‐level uncertainties: variance uncertainty and entropy uncertainty. The variance uncertainty was then aggregated to compute a volume‐level uncertainty score. Figure 3 provides a flowchart depicting the process of uncertainty estimations.

FIGURE 3.

FIGURE 3

Flowchart of the MC dropout‐based uncertainty estimations. MC: Monte Carlo.

As shown in Figure 3, we performed MC dropouts on multiple convolutional layers of the segmentor over T = 50 repetitions, using a fixed dropout rate of p = 0.2. The voxel‐level variance uncertainty U_{V,i} is estimated as follows:

U_{V,i} = (1/T) Σ_{t=1}^{T} (ŷ_{i,t} − (1/T) Σ_{t=1}^{T} ŷ_{i,t})^2   (5)

where ŷ_{i,t} denotes the post‐softmax prediction of the i‐th voxel in the t‐th MC sample.

The variance uncertainty U_{V,i} captures the inconsistency across different MC samples, while the entropy‐based uncertainty U_{H,i} reflects the uncertainty within the model's predicted probability distribution across all MC samples. The entropy uncertainty was computed in two steps: we first calculated the average predicted probabilities ŷ_{i,avg} from the T MC samples, and then computed the entropy of the average prediction over all classes:

U_{H,i} = −Σ_{c=1}^{C} ŷ_{i,avg}^c log ŷ_{i,avg}^c   (6)
U_{H,i} = −ŷ_{i,avg} log ŷ_{i,avg} − (1 − ŷ_{i,avg}) log(1 − ŷ_{i,avg})   (7)

where C denotes the number of classes. Equation (7) is derived from Equation (6) for the binary classification case.
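Equations (5)-(7) can be sketched on simulated MC samples as follows; the array `y_hat` of random foreground probabilities is an illustrative stand-in for the T post-softmax segmentor outputs that MC dropout would produce at test time.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50  # number of MC dropout forward passes
# Toy stand-in: T post-softmax foreground probabilities at 1000 voxels;
# in practice these come from the segmentor with dropout active at test time.
y_hat = rng.uniform(0.0, 1.0, size=(T, 1000))

# Eq. (5): voxel-level variance uncertainty across the T MC samples.
u_var = ((y_hat - y_hat.mean(axis=0)) ** 2).mean(axis=0)

# Eqs. (6)-(7): entropy of the average prediction, binary case.
y_avg = y_hat.mean(axis=0)
eps = 1e-12  # guard against log(0)
u_ent = -(y_avg * np.log(y_avg + eps)
          + (1.0 - y_avg) * np.log(1.0 - y_avg + eps))
```

Both maps are per-voxel: `u_var` is largest where the MC samples disagree, and `u_ent` is largest where the averaged probability is near 0.5.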

Following the uncertainty score defined in the BraTS segmentation uncertainty challenge 2020, 42 we aggregated the voxel‐wise variance uncertainty into a volume‐level uncertainty score, to assess the overall confidence in segmentation results. The score is designed to reward high confidence in accurate predictions and low confidence in erroneous predictions, while penalizing low confidence in voxels where the predictions are correct. 37 In detail, we set an uncertainty threshold τ ∈ [0, 1]. For each τ, voxels with an uncertainty value U_{V,i} > τ were marked as uncertain, and the corresponding segmentation results were filtered out. The numbers of true positive and true negative voxels in the remaining segmentations are denoted as TP_τ and TN_τ, respectively. The ratio of filtered true positives (FTP) at threshold τ is defined as follows:

FTP = (TP_1 − TP_τ) / TP_1   (8)
FTN = (TN_1 − TN_τ) / TN_1   (9)

where TP_1 is the number of true positive voxels in the unfiltered segmentation result (i.e., at τ = 1). FTN, the ratio of filtered true negatives, is defined in a similar manner.

The volume‐level uncertainty score U_S is defined as:

U_S = [AUC_1 + (1 − AUC_2) + (1 − AUC_3)] / 3   (10)

where AUC_1 represents the area under the curve for the Dice similarity coefficient (DSC) versus τ, AUC_2 is the area under the curve for FTP versus τ, and AUC_3 is the area under the curve for FTN versus τ.
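Equations (8)-(10) can be sketched as follows. The random masks, the `keep` filtering rule, and the trapezoidal `auc` helper are illustrative assumptions; the exact sweep of thresholds and AUC computation in the challenge protocol may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
pred = rng.uniform(size=n) > 0.5                  # predicted binary mask
truth = pred ^ (rng.uniform(size=n) < 0.1)        # truth, ~10% disagreement
unc = rng.uniform(size=n)                         # voxel variance uncertainty

taus = np.linspace(0.05, 1.0, 20)                 # uncertainty thresholds
tp1 = np.sum(pred & truth)                        # TP_1: unfiltered TPs
tn1 = np.sum(~pred & ~truth)                      # TN_1: unfiltered TNs
dsc_c, ftp_c, ftn_c = [], [], []
for tau in taus:
    keep = unc <= tau                             # drop voxels with U > tau
    tp = np.sum(pred & truth & keep)
    tn = np.sum(~pred & ~truth & keep)
    fp = np.sum(pred & ~truth & keep)
    fn = np.sum(~pred & truth & keep)
    dsc_c.append(2.0 * tp / (2.0 * tp + fp + fn)) # DSC on retained voxels
    ftp_c.append((tp1 - tp) / tp1)                # Eq. (8)
    ftn_c.append((tn1 - tn) / tn1)                # Eq. (9)

def auc(vals):
    """Trapezoidal area under the curve versus tau, normalized to [0, 1]."""
    v = np.asarray(vals)
    return np.sum((v[1:] + v[:-1]) / 2.0 * np.diff(taus)) / (taus[-1] - taus[0])

# Eq. (10): aggregate volume-level uncertainty score.
u_score = (auc(dsc_c) + (1.0 - auc(ftp_c)) + (1.0 - auc(ftn_c))) / 3.0
```

A well-calibrated model keeps DSC high while filtering few correct voxels as τ decreases, pushing `u_score` toward 1.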

2.3. Datasets

In our experiments, we utilized three public datasets (Combined Healthy Abdominal Organ Segmentation [CHAOS], Beyond the Cranial Vault [BTCV], and Abdominal Multi‐Organ Segmentation Challenge [AMOS]) and one in‐house clinical dataset (UTSW): (1) CHAOS dataset 43 : this dataset has 20 T2‐SPIR MRI cases from the ISBI 2019 CHAOS Challenge. (2) UTSW dataset: we collected 10 clinical T2‐SPIR MRI scans of the abdomen from UT Southwestern Medical Center. (3) BTCV dataset 44 : It contains 30 CT scans from the Multi‐Atlas Labeling BTCV Challenge. (4) AMOS dataset 45 : the Multi‐Modality AMOS dataset consists of abdominal CT and MRI scans acquired from various patients. We used 240 CT scans and 60 MRI scans for this study. The characteristics of the four datasets are presented in Table 1.

TABLE 1.

Key characteristics of the four datasets used in this study.

Dataset Name Image modalities Total Training Testing Organs
In‐house UTSW MRI 10 6 4 Liver
Public CHAOS MRI 20 16 4 Liver
BTCV CT 30 22 8 Liver
AMOS MRI 60 48 12 Spleen, Left/Right Kidney, Liver
CT 240 192 48 Spleen, Left/Right Kidney, Liver

Abbreviations: AMOS, abdominal multi‐organ segmentation challenge; BTCV, Beyond the Cranial Vault; CHAOS, Combined Healthy Abdominal Organ Segmentation.

Among the four datasets, we combined the CHAOS dataset and the UTSW dataset into a total of 30 MRI scans. The 30 MRI scans were used along with the 30 CT scans of the BTCV dataset for UDA‐based, bidirectional cross‐modality liver segmentation (MRI to CT, and CT to MRI). From each modality, we randomly selected 22 samples for training and 8 for testing. Similarly, for the AMOS dataset, 240 CT and 60 MRI scans were used for cross‐modality multi‐organ (spleen, left kidney, right kidney, and liver) segmentation. Among the 240 CT scans, 192 were used for training and 48 for testing; of the 60 MRI scans, 48 were used for training and 12 for testing.

For data preprocessing, we resampled CT and MRI scans to 1.5 × 1.5 × 3 mm³ voxel spacing and resized all volumes to 256 × 256 × 64. The intensity values of CT scans were truncated to the range of [−500, 500], and MRI scans were truncated at the 99th percentile. Each scan was then normalized to zero mean and unit variance.
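The intensity preprocessing described above can be sketched as follows. The `preprocess` helper and random toy volumes are hypothetical, and the geometric resampling/resizing step is assumed to be handled separately by an image-processing library.

```python
import numpy as np

def preprocess(volume, modality):
    """Intensity preprocessing sketch: truncate, then normalize to zero
    mean and unit variance (resampling to 1.5 x 1.5 x 3 mm^3 and resizing
    to 256 x 256 x 64 are assumed to be done beforehand)."""
    v = volume.astype(np.float32)
    if modality == "CT":
        v = np.clip(v, -500.0, 500.0)            # truncate CT to [-500, 500]
    else:
        v = np.minimum(v, np.percentile(v, 99))  # cap MRI at 99th percentile
    return (v - v.mean()) / (v.std() + 1e-8)     # zero mean, unit variance

rng = np.random.default_rng(0)
ct = preprocess(rng.normal(0.0, 400.0, size=(64, 64, 16)), "CT")
mri = preprocess(rng.gamma(2.0, 100.0, size=(64, 64, 16)), "MRI")
```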

2.4. Implementation details

We used a 3D U‐Net 46 with a 5‐layer encoder and a 4‐layer decoder as the segmentor. The discriminator for domain adaptation consisted of 5 convolutional layers, with the number of channels set to {64, 128, 256, 512, 1}. We used the Adam optimizer with learning rates of {1e‐4, 1e‐5} to train the segmentor and the discriminator sequentially in each iteration, for a total of 500 epochs. The batch size was set to 1. The weighting parameter in the adversarial loss was λ = 0.01. MC dropout was applied to the segmentor, with dropout added to layers {2, 3, 4} of the encoder and layers {1, 2, 3} of the decoder, using a dropout rate of p = 0.2 during both training and testing, with T = 50 MC samples for uncertainty estimations.
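One common way to keep dropout active at test time in PyTorch, as MC dropout requires, is to put the model in eval mode but switch only the `nn.Dropout` modules back to train mode. The sketch below uses a toy model rather than the paper's 3D U-Net; `enable_mc_dropout` is an illustrative helper, not the authors' code.

```python
import torch
import torch.nn as nn

def enable_mc_dropout(model):
    """Put the model in eval mode but keep Dropout layers stochastic,
    so repeated forward passes on the same input yield MC samples."""
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()  # dropout stays active at inference

torch.manual_seed(0)
# Toy stand-in for the segmentor with one dropout layer (p = 0.2).
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(),
                      nn.Dropout(p=0.2), nn.Linear(8, 1))
enable_mc_dropout(model)

x = torch.randn(1, 8)
with torch.no_grad():
    preds = torch.stack([model(x) for _ in range(50)])  # T = 50 MC samples
```

The spread of `preds` across the 50 passes is what feeds the variance and entropy uncertainty estimates.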

We compared HMeAL‐UDA with four methods to assess its efficacy: (1) Supervision: Supervised training and testing within the same domain. (2) W/o Adaptation: Supervised training in the source domain, followed by testing in the target domain without adaptation. (3) SIFA 25 : This method aligned both image and feature spaces using adversarial learning to drive UDA. (4) FS‐UDA 47 : It utilized frequency and spatial knowledge distillation with a multi‐teacher architecture for UDA. All models were implemented in PyTorch version 2.2.1, and experiments were conducted on an NVIDIA GeForce RTX 4090 GPU with 24 GB of memory.

2.5. Evaluation metrics

To evaluate the model's performance, we used the mean (± s.d.) DSC to verify structure segmentation accuracy, and the mean (± s.d.) 95th‐percentile Hausdorff distance (HD95) to evaluate segmentation boundary quality. Additionally, we conducted statistical significance tests using the Wilcoxon signed‐rank test with Bonferroni correction. Finally, we evaluated model confidence across different approaches by analyzing two voxel‐level uncertainties, variance and entropy, as well as the volume‐level uncertainty score.

Specifically, the DSC measures the similarity between the predicted segmentation Yp and "ground‐truth" Yg by evaluating their spatial overlap, as follows:

DSC = 2|Y_p ∩ Y_g| / (|Y_p| + |Y_g|),   (11)

where DSC = 1 indicates that Yp and Yg are identical, and DSC = 0 means that Yp and Yg have no overlap.
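A direct NumPy implementation of Equation (11) on binary masks; the `dsc` helper and the toy masks are illustrative.

```python
import numpy as np

def dsc(pred, truth):
    """Dice similarity coefficient between two binary masks, Eq. (11)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    # Convention: two empty masks are treated as identical (DSC = 1).
    return 2.0 * np.logical_and(pred, truth).sum() / denom if denom else 1.0

a = np.zeros((4, 4), dtype=bool); a[:2] = True   # rows 0-1 (8 voxels)
b = np.zeros((4, 4), dtype=bool); b[1:3] = True  # rows 1-2 (8 voxels, 4 overlap)
print(dsc(a, b))  # -> 0.5
```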

Based on the asymmetric surface distances between the predicted segmentation boundary Y_p and the ‘ground‐truth’ boundary Y_g, HD95 takes the 95th percentile of all boundary distances:

HD95 = max{P_95(d(y_p, Y_g)), P_95(d(y_g, Y_p))},   (12)

where P_95 is the 95th‐percentile operator, and d(y_p, Y_g) = min_{y_g ∈ Y_g} ‖y_p − y_g‖ is the distance from a predicted segmentation boundary voxel y_p to the "ground‐truth" boundary Y_g. Similarly, d(y_g, Y_p) is the distance from a "ground‐truth" boundary voxel y_g to the predicted segmentation boundary Y_p.
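Equation (12) can be sketched in NumPy for two boundary point sets; the `hd95` helper and toy coordinates are illustrative, and the brute-force pairwise distance matrix is not optimized for large boundaries.

```python
import numpy as np

def hd95(pts_p, pts_g):
    """95th-percentile Hausdorff distance, Eq. (12), between two boundary
    point sets given as (N, 3) arrays of voxel coordinates."""
    # Pairwise Euclidean distances between the two point sets.
    d = np.linalg.norm(pts_p[:, None, :] - pts_g[None, :, :], axis=-1)
    d_pg = d.min(axis=1)   # each predicted point to the ground-truth set
    d_gp = d.min(axis=0)   # each ground-truth point to the predicted set
    return max(np.percentile(d_pg, 95), np.percentile(d_gp, 95))

g = np.array([[i, 0.0, 0.0] for i in range(10)])  # toy boundary
p = g + np.array([0.5, 0.0, 0.0])                 # shifted by 0.5 voxels
print(hd95(p, g))  # -> 0.5
```

In physical units (mm), the voxel coordinates would first be scaled by the voxel spacing.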

The p‐values for the evaluation metrics (DSC, HD95) are computed using the Wilcoxon signed‐rank test with Bonferroni correction to assess the statistical significance of our method compared to three comparison methods (W/o Adaptation, SIFA, and FS‐UDA). First, the Wilcoxon signed‐rank test statistic W is calculated as follows:

W = Σ_i sgn(x_i − y_i) · R_i,   (13)

where x_i and y_i denote the paired evaluation metrics, R_i represents the rank of |x_i − y_i|, and sgn(·) is the sign function.

Then, the p‐value is calculated based on the standard normal distribution of W:

p = m · (1 − P(Z ≤ |Z_s|)),   (14)

where Z_s = (W − μ_W)/σ_W, with μ_W and σ_W denoting the mean and standard deviation of W under the null hypothesis, and m = 3 is the number of comparison methods for the Bonferroni correction. If p < α, the result is considered statistically significant; here, α = 0.05 is the significance level.
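Equations (13)-(14) can be sketched with a normal-approximation implementation. This is illustrative only: `wilcoxon_bonferroni` is a hypothetical helper that ignores the tie and zero-difference corrections a library routine such as `scipy.stats.wilcoxon` would handle.

```python
import numpy as np
from math import erf, sqrt

def wilcoxon_bonferroni(x, y, m=3):
    """Normal-approximation Wilcoxon signed-rank test (Eqs. 13-14) with
    Bonferroni correction for m comparisons (sketch; no tie handling)."""
    d = np.asarray(x, float) - np.asarray(y, float)
    d = d[d != 0]                                  # drop zero differences
    n = len(d)
    ranks = np.abs(d).argsort().argsort() + 1      # ranks R_i of |d_i|
    w = float(np.sum(np.sign(d) * ranks))          # Eq. (13)
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 6.0)  # s.d. of W under H0 (mean 0)
    z_s = w / sigma
    phi = 0.5 * (1.0 + erf(abs(z_s) / sqrt(2.0)))  # P(Z <= |Z_s|)
    return min(m * (1.0 - phi), 1.0)               # Eq. (14), capped at 1

# Paired DSC values where one method is consistently better -> significant.
x = np.array([90.1, 91.2, 89.8, 90.5, 92.0, 91.1, 90.7, 89.9, 91.5, 90.3])
y = x - 2.0
print(wilcoxon_bonferroni(x, y) < 0.05)  # -> True
```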

3. RESULTS

3.1. Ablation study

To validate the effectiveness of the two key components (adversarial learning and histogram matching) in our method, we conducted ablation studies on a subset of the AMOS training dataset. We randomly selected 48 samples for training (24 MRI and 24 CT scans) and 12 samples for testing (6 MRI and 6 CT scans). Figure 4 shows the comparison results of four different configurations: (1) W/o Adaptation: used as the baseline method, (2) AL‐UDA: used adversarial learning for UDA, (3) HM‐UDA: applied histogram matching to the source domain data, then trained with the resulting data, and (4) HMeAL‐UDA.

FIGURE 4.

FIGURE 4

Segmentation results in each ablation experimental setting for different methods: (a) DSC for MRI‐to‐CT, (b) HD95 for MRI‐to‐CT, (c) DSC for CT‐to‐MRI, and (d) HD95 for CT‐to‐MRI. DSC, Dice similarity coefficient.

Without domain adaptation, the model yielded a mean (± s.d.) DSC of 28.61% ± 17.05% and HD95 of 47.45 ± 14.41 mm for CT‐to‐MRI adaptation, and DSC of 41.73% ± 23.75% and HD95 of 22.69 ± 8.98 mm for MRI‐to‐CT adaptation. By incorporating adversarial learning, AL‐UDA demonstrated a significant improvement, with the DSC increasing to 81.80% ± 1.75% and the HD95 decreasing to 9.38 ± 3.65 mm for CT‐to‐MRI adaptation, and the DSC increasing to 78.68% ± 2.41% and the HD95 decreasing to 9.29 ± 1.72 mm for MRI‐to‐CT adaptation, which demonstrated that adversarial learning between the source and target domains could reduce the domain gap and enhance the model's generalization performance. Image style transfer through histogram matching raised the DSC to 85.06% ± 1.36% and reduced the HD95 to 7.92 ± 2.88 mm for CT‐to‐MRI adaptation, and increased the DSC to 83.07% ± 3.21% and reduced the HD95 to 4.23 ± 1.74 mm for MRI‐to‐CT adaptation, indicating that image style transformations effectively narrowed the domain gap between CT and MRI scans, even without adversarial learning. Finally, by integrating histogram matching into adversarial learning, the proposed HMeAL‐UDA model improved domain adaptation performance compared to all other models in both DSC and HD95, achieving a mean (± s.d.) DSC of 89.34% ± 1.11% and HD95 of 6.52 ± 3.32 mm for CT‐to‐MRI adaptation, and DSC of 85.27% ± 3.07% and HD95 of 2.53 ± 0.86 mm for MRI‐to‐CT adaptation. These results showed that combining image and feature alignment strategies effectively yielded higher accuracy for cross‐modality medical image segmentation with large domain gaps.

3.2. Comparison with other UDA methods

3.2.1. Liver segmentation

We compared the liver segmentation performance of HMeAL‐UDA with four other methods. The results are presented in Table 2. Statistical significance was evaluated using Wilcoxon signed‐rank tests between HMeAL‐UDA and the other methods.
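The Wilcoxon signed-rank test is a paired, non-parametric test applied here to per-case metric values of two methods on the same test cases. A minimal example with `scipy.stats` (the DSC values below are hypothetical, not the study's data):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-case DSC values for two methods on the same 8 test cases
# (illustrative numbers only).
dsc_hmeal = np.array([0.91, 0.89, 0.92, 0.88, 0.90, 0.93, 0.87, 0.91])
dsc_sifa = np.array([0.84, 0.80, 0.85, 0.79, 0.83, 0.86, 0.81, 0.82])

# Paired, non-parametric test on the per-case differences.
stat, p_value = wilcoxon(dsc_hmeal, dsc_sifa)
significant = p_value < 0.05  # the significance threshold used in the paper
```

Because the test ranks paired differences rather than assuming normality, it is well suited to small test sets with non-Gaussian metric distributions.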

TABLE 2.

Mean (± s.d.) DSC (%), HD95 (mm), and Wilcoxon signed‐rank tests of different methods on the liver datasets. Bolded values indicate the best among the comparison methods (Supervision benchmark excluded). Arrows are pointing in the direction of improved accuracy. For Supervision, the training used 22 labeled samples from the target domain and testing used 8 samples from the same domain. For W/o Adaptation, the training used 22 labeled samples from the source domain and testing used 8 samples from the target domain. For domain adaptation methods, all training used 22 labeled samples from the source domain and 22 unlabeled samples from the target domain, and all testing used 8 samples from the target domain.

CT (source) to MRI (target) | MRI (source) to CT (target)
Methods DSC (%) ↑ p‐value HD95 (mm) ↓ p‐value DSC (%) ↑ p‐value HD95 (mm) ↓ p‐value
Supervision 91.51 ± 1.25 4.78 ± 2.53 89.90 ± 2.37 7.19 ± 4.98
W/o Adaptation 30.08 ± 8.94 1e–2 59.90 ± 10.25 1e–2 53.21 ± 5.13 1e–2 43.25 ± 13.94 1e–2
SIFA 82.67 ± 3.28 1e–2 11.75 ± 4.82 2e–2 82.95 ± 2.86 1e–2 16.99 ± 4.21 1e–2
FS‐UDA 85.85 ± 2.74 1e–2 10.60 ± 5.14 2e–2 83.38 ± 3.77 6e–2 13.04 ± 5.10 2e–2
HMeAL‐UDA 90.56 ± 1.67 6.35 ± 2.26 87.26 ± 2.89 8.37 ± 4.01

Abbreviations: HMeAL, histogram matching‐enhanced adversarial learning; MC, Monte Carlo; SIFA, synergistic image and feature alignment; UDA, Unsupervised domain adaptation.

Table 2 presents the mean DSC, HD95, and corresponding p‐values for the evaluated methods. Supervision, which involved supervised training with labeled target‐domain data, served as the benchmark. Without domain adaptation, the model trained on CT and tested directly on MRI scans achieved a mean DSC of only 30.08% ± 8.94% and an HD95 of 59.90 ± 10.25 mm; in the reverse direction, the model achieved a mean DSC of 53.21% ± 5.13% and an HD95 of 43.25 ± 13.94 mm. Compared to Supervision, a significant performance gap existed due to the domain feature shifts between MRI and CT scans, demonstrating the challenge of cross‐modality segmentation when labeled data are only available for one modality. SIFA and FS‐UDA, two state‐of‐the‐art UDA methods, generated better segmentation results than W/o Adaptation. However, by combining adversarial learning with histogram matching, HMeAL‐UDA achieved better performance than SIFA and FS‐UDA, even with a simpler network design. For CT‐to‐MRI adaptation, HMeAL‐UDA achieved a mean DSC of 90.56% ± 1.67% and an HD95 of 6.35 ± 2.26 mm; for MRI‐to‐CT adaptation, the mean DSC reached 87.26% ± 2.89% with an HD95 of 8.37 ± 4.01 mm. Additionally, all statistical tests comparing HMeAL‐UDA with the other methods (except the DSC comparison with FS‐UDA for MRI‐to‐CT adaptation) yielded p‐values below 0.05, indicating that the improvements by HMeAL‐UDA were statistically significant.

Figures 5 and 6 visually compare liver segmentation results for the bidirectional adaptations. HMeAL‐UDA outperformed the other approaches, with segmentation maps closely aligned with the "ground truth". False positives, especially the small islands produced by the SIFA and FS‐UDA methods, were effectively removed. The integration of adversarial learning with histogram matching effectively reduced the domain gap, enhancing HMeAL‐UDA's ability to capture domain‐invariant features and generate high‐quality segmentations.

FIGURE 5. Segmentation maps of different methods for UDA from CT to MRI on the liver datasets. UDA, Unsupervised domain adaptation.

FIGURE 6. Segmentation maps of different methods for UDA from MRI to CT on the liver datasets. UDA, Unsupervised domain adaptation.

3.2.2. Abdominal multi‐organ segmentation

We evaluated the multi‐organ segmentation performance of HMeAL‐UDA on the AMOS dataset. The mean DSC and HD95 for the test set are summarized in Tables 3, 4, 5, 6. Statistical significance was assessed through Wilcoxon signed‐rank tests to compare HMeAL‐UDA with other methods.

TABLE 3.

Mean (± s.d.) DSC and Wilcoxon signed‐rank test results of different methods for CT‐to‐MRI cross‐modality segmentation on the AMOS dataset. Bolded values indicate the best among the comparison methods (Supervision benchmark excluded). Arrows are pointing in the direction of improved accuracy. For Supervision, the training used 48 labeled samples from the target domain and testing used 12 samples from the same domain. For W/o Adaptation, the training used 192 labeled samples from the source domain and testing used 12 samples from the target domain. For domain adaptation methods, all training used 192 labeled samples from the source domain and 48 unlabeled samples from the target domain, and all testing used 12 samples from the target domain.

CT‐to‐MRI / DSC (%) ↑
Methods Spleen Left kidney Right kidney Liver Mean p‐value
Supervision 89.34 ± 5.76 87.99 ± 6.21 89.98 ± 5.33 91.98 ± 4.11 89.82 ± 1.66
W/o Adaptation 26.45 ± 19.05 11.75 ± 11.55 12.10 ± 14.06 60.99 ± 9.97 27.82 ± 23.15 <1e‐3
SIFA 83.56 ± 6.03 91.39 ± 2.04 79.75 ± 8.55 91.74 ± 2.83 86.61 ± 5.93 <1e‐3
FS‐UDA 90.48 ± 1.69 89.02 ± 6.31 87.80 ± 5.23 91.32 ± 2.55 89.66 ± 1.56 <1e‐3
HMeAL‐UDA 90.30 ± 4.61 92.55 ± 1.11 90.25 ± 3.05 92.24 ± 1.82 91.34 ± 1.23

Abbreviations: AMOS, abdominal multi‐organ segmentation challenge; DSC, Dice similarity coefficient; HMeAL, histogram matching‐enhanced adversarial learning; SIFA, synergistic image and feature alignment; UDA, Unsupervised domain adaptation.

TABLE 4.

Mean HD95 and Wilcoxon signed‐rank test results of different methods for CT‐to‐MRI cross‐modality segmentation on the AMOS dataset. Bolded values indicate the best among the comparison methods (Supervision benchmark excluded). Arrows are pointing in the direction of improved accuracy. For Supervision, the training used 48 labeled samples from the target domain and testing used 12 samples from the same domain. For W/o Adaptation, the training used 192 labeled samples from the source domain and testing used 12 samples from the target domain. For domain adaptation methods, all training used 192 labeled samples from the source domain and 48 unlabeled samples from the target domain, and all testing used 12 samples from the target domain.

CT‐to‐MRI / HD95 (mm) ↓
Methods Spleen Left kidney Right kidney Liver Mean p‐value
Supervision 11.11 ± 4.63 4.40 ± 3.89 10.07 ± 4.27 6.70 ± 2.18 8.07 ± 3.09
W/o Adaptation 57.32 ± 26.06 73.20 ± 15.08 59.21 ± 17.28 22.11 ± 16.53 52.96 ± 21.75 <1e‐3
SIFA 9.13 ± 7.81 6.11 ± 4.88 7.58 ± 5.79 7.04 ± 6.29 7.47 ± 4.27 6e‐3
FS‐UDA 13.84 ± 6.74 3.86 ± 2.94 11.82 ± 8.21 5.01 ± 7.68 8.63 ± 4.94 <1e‐3
HMeAL‐UDA 10.66 ± 5.80 2.44 ± 0.87 3.36 ± 1.74 8.27 ± 3.02 6.18 ± 2.93

Abbreviations: AMOS, abdominal multi‐organ segmentation challenge; HMeAL, histogram matching‐enhanced adversarial learning; SIFA, synergistic image and feature alignment; UDA, Unsupervised domain adaptation.

TABLE 5.

Mean (± s.d.) DSC and Wilcoxon signed‐rank test results of different methods for MRI‐to‐CT cross‐modality segmentation on the AMOS dataset. Bolded values indicate the best among the comparison methods (Supervision benchmark excluded). Arrows are pointing in the direction of improved accuracy. For Supervision, the training used 192 labeled samples from the target domain and testing used 48 samples from the same domain. For W/o Adaptation, the training used 48 labeled samples from the source domain and testing used 48 samples from the target domain. For domain adaptation methods, all training used 48 labeled samples from the source domain and 192 unlabeled samples from the target domain, and all testing used 48 samples from the target domain.

MRI‐to‐CT / DSC (%) ↑
Methods Spleen Left kidney Right kidney Liver Mean p‐value
Supervision 88.39 ± 8.15 90.24 ± 4.45 88.63 ± 5.11 94.01 ± 1.42 90.32 ± 2.60
W/o Adaptation 60.25 ± 13.60 20.73 ± 15.24 18.73 ± 14.72 65.18 ± 8.01 41.22 ± 24.91 <1e‐3
SIFA 82.02 ± 7.65 79.11 ± 13.21 84.64 ± 8.39 88.52 ± 4.25 83.57 ± 4.00 <1e‐3
FS‐UDA 83.84 ± 11.00 83.04 ± 6.11 84.37 ± 8.77 90.77 ± 2.86 85.51 ± 4.55 <1e‐3
HMeAL‐UDA 84.86 ± 6.89 87.94 ± 7.10 83.76 ± 5.24 91.94 ± 4.17 87.13 ± 3.67

Abbreviations: AMOS, abdominal multi‐organ segmentation challenge; DSC, Dice similarity coefficient; HMeAL, histogram matching‐enhanced adversarial learning; SIFA, synergistic image and feature alignment; UDA, Unsupervised domain adaptation.

TABLE 6.

Mean (± s.d.) HD95 and Wilcoxon signed‐rank test results of different methods for MRI‐to‐CT cross‐modality segmentation on the AMOS dataset. Bolded values indicate the best among the comparison methods (Supervision benchmark excluded). Arrows are pointing in the direction of improved accuracy. For Supervision, the training used 192 labeled samples from the target domain and testing used 48 samples from the same domain. For W/o Adaptation, the training used 48 labeled samples from the source domain and testing used 48 samples from the target domain. For domain adaptation methods, all training used 48 labeled samples from the source domain and 192 unlabeled samples from the target domain, and all testing used 48 samples from the target domain.

MRI‐to‐CT / HD95 (mm) ↓
Methods Spleen Left kidney Right kidney Liver Mean p‐value
Supervision 2.56 ± 1.03 2.88 ± 1.12 1.53 ± 1.38 0.78 ± 0.47 1.94 ± 0.96
W/o Adaptation 21.35 ± 14.35 18.48 ± 13.12 19.25 ± 8.30 32.00 ± 11.62 22.77 ± 6.27 <1e‐3
SIFA 7.70 ± 5.42 8.22 ± 4.70 7.99 ± 4.62 3.29 ± 1.94 6.80 ± 2.35 <1e‐3
FS‐UDA 4.93 ± 2.91 5.57 ± 3.55 6.23 ± 4.80 2.77 ± 1.36 4.88 ± 1.50 <1e‐3
HMeAL‐UDA 3.05 ± 2.68 2.44 ± 1.94 2.71 ± 2.04 1.72 ± 1.21 2.48 ± 1.56

Abbreviations: AMOS, abdominal multi‐organ segmentation challenge; HMeAL, histogram matching‐enhanced adversarial learning; SIFA, synergistic image and feature alignment; UDA, Unsupervised domain adaptation.

Segmentation results for the spleen, left kidney, right kidney, and liver, for CT‐to‐MRI and MRI‐to‐CT unsupervised segmentation adaptations, are shown in Tables 3, 4, 5, 6. We used Supervision as a reference for performance benchmarking of the different methods. Without adaptation, the model trained on CT and tested on MRI scans achieved a mean DSC of only 27.82% ± 23.15%, and the model trained on MRI and tested on CT achieved only 41.22% ± 24.91%. These results are significantly lower than those of the supervised training models (89.82% ± 1.66% for the MRI domain and 90.32% ± 2.60% for the CT domain), highlighting the severe domain shift within the cross‐modality AMOS dataset. HMeAL‐UDA achieved the highest segmentation accuracy in both adaptation directions. Compared to W/o Adaptation, our method increased the mean DSC by 63.52% (in absolute terms, same below) and reduced the mean HD95 by 46.78 mm for CT‐to‐MRI adaptation. For MRI‐to‐CT adaptation, the mean DSC improved by 45.91%, with a mean HD95 reduction of 20.29 mm. Notably, HMeAL‐UDA's segmentation accuracy even outperformed that of Supervision in the CT‐to‐MRI adaptation, further validating the effectiveness of the proposed method.

3.3. Uncertainty estimations of different methods

We ran 50 MC samples with a dropout rate of 0.2 during testing to assess the uncertainties of the different models, based on two voxel‐level measures and one volume‐level measure of uncertainty on the AMOS dataset. In general, uncertainty from MC dropout samples is higher around organ boundaries and false positives, with the different measures showing similar patterns.
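Concretely, MC-dropout uncertainty is obtained by keeping dropout active at test time, running T stochastic forward passes, and summarizing the per-voxel softmax outputs. A sketch of the two voxel-level measures (predictive entropy of the mean distribution, and per-class variance summed over classes), assuming the network outputs per-class probability maps:

```python
import numpy as np

def mc_uncertainty(prob_samples):
    """Summarize T stochastic forward passes.

    prob_samples: array of shape (T, C, H, W) holding per-class softmax
    probabilities from T MC-dropout passes over one slice.
    """
    mean_p = prob_samples.mean(axis=0)  # (C, H, W) mean prediction
    # Voxel-level predictive entropy of the mean distribution.
    entropy = -np.sum(mean_p * np.log(mean_p + 1e-12), axis=0)
    # Voxel-level variance across MC samples, summed over classes.
    variance = prob_samples.var(axis=0).sum(axis=0)
    return mean_p, entropy, variance
```

With the paper's setting (T = 50, dropout rate 0.2), the mean map yields the final segmentation while the entropy and variance maps visualize uncertainty, as in Figures 7 and 8.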

Figures 7 and 8 present the uncertainty maps for five different models used in multi‐organ segmentation for MRI‐to‐CT and CT‐to‐MRI adaptations, respectively. We also included in the figures the mean segmentation maps of the MC samples. Figures 7a and 8a display the uncertainty estimates for the supervised training model. The results show that prediction uncertainty is especially pronounced around the edges of organs, which is expected. For the segmentation model without adaptation, as shown in Figures 7b and 8b, the results present considerable uncertainties in extensive areas, emphasizing the challenge of accurate segmentation across modalities. The UDA‐based methods show lower uncertainties in both variance and entropy measures. Compared to FS‐UDA (Figures 7d and 8d), SIFA exhibits higher predictive entropy around organ boundaries, indicating greater boundary‐related uncertainty (Figures 7c and 8c). HMeAL‐UDA demonstrates the lowest uncertainty, even along organ edges, as shown in Figures 7e and 8e. Its mean segmentation maps also closely match the “ground truth”, indicating the efficacy of combined adversarial learning and histogram matching in reducing the uncertainty of domain adaptation. Table 7 shows the volume‐based uncertainty scores of different methods, calculated via Equation (10). As expected, the supervised training model has the highest scores. The uncertainty scores of our method achieved mean (± s.d.) 0.936 ± 0.027 for CT‐to‐MRI adaptation and 0.922 ± 0.032 for MRI‐to‐CT adaptation, which are closest to those of Supervision. These results show that HMeAL‐UDA enhances the model's robustness and reliability for UDA‐based cross‐modality segmentation.
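The paper's Equation (10) for the volume-level uncertainty score is defined outside this excerpt. Purely for illustration, one plausible aggregation that yields a score near 1 for confident predictions is the complement of the mean voxel entropy normalized by its maximum, log(C); this is a hypothetical stand-in, not the published formula:

```python
import numpy as np

def volume_uncertainty_score(entropy_map, n_classes, mask=None):
    """Hypothetical volume-level score in [0, 1]: 1 minus the mean entropy
    normalized by log(n_classes). Higher = more confident. This is an
    illustrative aggregation, not the paper's Equation (10)."""
    norm_ent = entropy_map / np.log(n_classes)
    if mask is not None:
        norm_ent = norm_ent[mask]  # e.g., restrict to the predicted foreground
    return 1.0 - float(norm_ent.mean())
```

Any such aggregation collapses the voxel-level maps into a single scalar per volume, which is what allows the per-method comparison reported in Table 7.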

FIGURE 7. Uncertainty maps of different methods for CT‐to‐MRI UDA: (a) Supervision, (b) W/o Adaptation, (c) SIFA, (d) FS‐UDA, and (e) HMeAL‐UDA. (f) shows the input image and (g) shows the "ground‐truth" segmentation. For (a)–(e), the first column shows variance uncertainty maps, the second column shows entropy uncertainty maps, the third column shows the mean segmentation results of multiple MC predictions, and the last column illustrates the error maps between predictions and "ground truth". HMeAL, histogram matching‐enhanced adversarial learning; MC, Monte Carlo; SIFA, synergistic image and feature alignment; UDA, Unsupervised domain adaptation.

FIGURE 8. Uncertainty maps of different methods for MRI‐to‐CT UDA: (a) Supervision, (b) W/o Adaptation, (c) SIFA, (d) FS‐UDA, and (e) HMeAL‐UDA. (f) shows the input image and (g) shows the "ground‐truth" segmentation. For (a)–(e), the first column shows variance uncertainty maps, the second column shows entropy uncertainty maps, the third column shows the mean segmentation results of multiple MC predictions, and the last column illustrates the error maps between predictions and "ground truth". HMeAL, histogram matching‐enhanced adversarial learning; MC, Monte Carlo; SIFA, synergistic image and feature alignment; UDA, Unsupervised domain adaptation.

TABLE 7.

Mean (± s.d.) uncertainty scores of different methods, evaluated on the AMOS dataset. CT‐to‐MRI: For Supervision, the training used 48 labeled samples from the target domain and testing used 12 samples from the same domain. For W/o Adaptation, the training used 192 labeled samples from the source domain and testing used 12 samples from the target domain. For domain adaptation methods, all training used 192 labeled samples from the source domain and 48 unlabeled samples from the target domain, and all testing used 12 samples from the target domain. MRI‐to‐CT: For Supervision, the training used 192 labeled samples from the target domain and testing used 48 samples from the same domain. For W/o Adaptation, the training used 48 labeled samples from the source domain and testing used 48 samples from the target domain. For domain adaptation methods, all training used 48 labeled samples from the source domain and 192 unlabeled samples from the target domain, and all testing used 48 samples from the target domain.

Methods CT‐to‐MRI MRI‐to‐CT
Supervision 0.947 ± 0.035 0.951 ± 0.018
W/o Adaptation 0.674 ± 0.105 0.790 ± 0.099
SIFA 0.858 ± 0.073 0.873 ± 0.064
FS‐UDA 0.887 ± 0.060 0.870 ± 0.057
HMeAL‐UDA 0.936 ± 0.027 0.922 ± 0.032

Abbreviations: AMOS, abdominal multi‐organ segmentation challenge; HMeAL, histogram matching‐enhanced adversarial learning; SIFA, synergistic image and feature alignment; UDA, Unsupervised domain adaptation.

3.4. Model complexity analysis

To assess HMeAL‐UDA's reduction in model complexity, we compared the total number of trainable parameters (weights and biases), the training time per epoch, and the convergence rate (Dice loss curve) across different UDA methods for CT‐to‐MRI adaptation on the liver dataset. The results are shown in Table 8 and Figure 9.

TABLE 8.

Model parameters and training time of different UDA methods.

Methods Param. (×106) Training time per epoch (s)
SIFA 120.2 192
FS‐UDA 87.3 105
HMeAL‐UDA 47.7 45

Abbreviations: HMeAL, histogram matching‐enhanced adversarial learning; SIFA, synergistic image and feature alignment; UDA, Unsupervised domain adaptation.

FIGURE 9. Comparison of training Dice loss curves of different UDA methods: (a) SIFA, (b) FS‐UDA, and (c) HMeAL‐UDA. HMeAL, histogram matching‐enhanced adversarial learning; SIFA, synergistic image and feature alignment; UDA, Unsupervised domain adaptation.

As shown in Table 8, compared to SIFA (with a CycleGAN architecture) and FS‐UDA (using multiple teacher networks), HMeAL‐UDA used fewer model parameters and required less training time, while achieving superior performance. Moreover, the convergence curves in Figure 9 demonstrate that HMeAL‐UDA attained the best convergence efficiency.
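For context, the parameter counts in Table 8 are sums of each layer's trainable weights and biases; for a PyTorch model this is `sum(p.numel() for p in model.parameters() if p.requires_grad)`. A dependency-free sketch for 2D convolutional layers, with hypothetical layer sizes (not the networks compared in Table 8):

```python
def conv2d_params(in_ch, out_ch, k, bias=True):
    """Trainable parameters of one 2-D convolution layer:
    k*k*in_ch weights per output channel, plus one bias per output channel."""
    return k * k * in_ch * out_ch + (out_ch if bias else 0)

# Hypothetical three-layer encoder (channel sizes are illustrative only).
layers = [(1, 64, 3), (64, 128, 3), (128, 256, 3)]
total = sum(conv2d_params(i, o, k) for i, o, k in layers)
print(total)  # 369664 trainable parameters for this toy encoder
```

Counting parameters this way makes the complexity comparison architecture-agnostic: SIFA's CycleGAN generators/discriminators and FS-UDA's multiple teacher networks simply contribute many more such layers than the single segmentation-plus-discriminator pair in HMeAL-UDA.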

4. DISCUSSION

To address the challenges of domain shift in medical images, caused by different scanners or imaging protocols, as well as the lack of annotations, we developed HMeAL‐UDA for cross‐modality medical image segmentation. In this framework, adversarial learning enables feature‐level adaptation through discrimination in the output space, while histogram matching, a straightforward style transfer technique, facilitates image‐level adaptation. These two components work complementarily and synergistically, collectively addressing the domain gap in cross‐modality segmentation. Moreover, the proposed HMeAL‐UDA framework is an end‐to‐end model with a single adversarial learning objective. Unlike UDA methods that require multiple levels of adversarial training, additional regularization, or fine‐tuning, our approach simplifies the training process with a concise loss function, making it easier and more efficient to train (Figure 9 and Table 8). The HMeAL‐UDA framework substantially improves segmentation accuracy for a label‐free target domain by leveraging shared image features and the segmentation labels from another domain, paving the way for potential clinical translation in resource‐limited and labor‐intensive clinical scenarios.

We conducted extensive experiments to validate the performance of the proposed method, focusing on two segmentation tasks: liver segmentation and abdominal multi‐organ segmentation, with bidirectional domain adaptation between CT and MRI. An ablation study was performed to assess the effectiveness of each model component (Figure 4). The ablation tests demonstrate that both the adversarial learning and histogram matching components of HMeAL‐UDA play a crucial role in its performance. Compared to the baseline model without adaptation, applying each component individually yields notable improvements in segmentation accuracy, while their combination allows the adversarial learning to tackle the residual domain gaps after histogram matching and yields better results. Compared to Supervision, W/o Adaptation, and two state‐of‐the‐art UDA approaches, HMeAL‐UDA consistently demonstrates superior performance in both quantitative results (Tables 2, 3, 4, 5, 6) and visual segmentation maps (Figures 5 and 6). Notably, in the CT‐to‐MRI domain adaptation scenario for abdominal multi‐organ segmentation (Tables 3, 4 and Figure 5), HMeAL‐UDA even outperformed the fully supervised model. This performance advantage could originate from the substantially larger source‐domain dataset (240 CTs) compared with the target‐domain dataset (60 MRIs). For CT‐to‐MRI adaptation, HMeAL‐UDA is able to combine a large amount of labeled CT data with unlabeled MRI data to better learn domain‐invariant features for segmentation and adapt them to the MRI target domain. In contrast, for supervised training, the smaller MRI dataset potentially limits the accuracy and robustness of the resulting model. The potential of HMeAL‐UDA to improve few‐shot learning, by combining limited target‐domain data with a large labeled source‐domain dataset, is highly relevant to many data‐scarce clinical scenarios and warrants future investigation.

In cross‐modality UDA segmentation, the model's ability to learn and align features between the source and target domains is influenced by various factors, such as intensity distribution, structural characteristics, and artifact interference. For MRI‐to‐CT UDA segmentation, MRI provides rich soft‐tissue contrast and detailed information, but the lower soft‐tissue contrast and limited detail representation in CT increase the difficulty of feature adaptation, potentially leading to information loss. The histogram‐matched MR images (mapped to the CT domain) are expected to show higher contrast than true CT‐domain images, so the segmentation features and representations learned from these converted images may not fully translate to true CT‐domain images, affecting segmentation accuracy. Conversely, for CT‐to‐MRI UDA segmentation, the histogram‐matched CT‐domain images (mapped to the MR domain) may not fully represent the rich and diverse contrast patterns seen in true MR images. A segmentation model trained on histogram‐matched CT images may neglect the additional contrast and boundaries present in true MR images, leading to over‐segmentation. Another challenge for CT‐to‐MRI adaptation is that MR images can have varying contrast due to sequence/scanner variations, whereas CT intensity is relatively stable across protocols and scanners; the substantial intensity variations within the MRI dataset make it more challenging for the model to learn. These challenges could potentially be addressed through improved feature alignment methods, enhanced boundary learning capabilities, and pseudo‐labeling strategies, which warrant future studies.

Additionally, we evaluated the prediction uncertainties of the different models. As shown in Figures 7 and 8, segmentation models generally exhibit high uncertainties around organ boundaries, as expected. Compared to our method, the other approaches show larger regions of high uncertainty, indicating greater instability. Table 7 further illustrates the differences in uncertainty scores among the methods. The supervised method achieved scores above 0.9, while the W/o Adaptation method (CT‐to‐MRI) scored below 0.7. Among the domain adaptation methods, HMeAL‐UDA's mean (± s.d.) uncertainty scores are 0.936 ± 0.027 for CT‐to‐MRI adaptation and 0.922 ± 0.032 for MRI‐to‐CT adaptation, closely matching those of the supervised model. In comparison, the uncertainty scores of the two comparison methods, SIFA and FS‐UDA, are notably lower (both < 0.9). Based on these experimental results, a score of 0.9 may serve as a reliable uncertainty threshold for segmentation predictions. However, due to the complexity and diversity of segmentation tasks, this threshold should be interpreted within the context of specific clinical applications and task requirements. In real‐world clinical applications, uncertainty estimation not only offers interpretability of the model's predictions but also serves as a tool to automatically highlight areas of high uncertainty to guide clinical decision‐making.

Despite the strengths of the HMeAL‐UDA framework in cross‐modality medical image segmentation, the proposed model has some limitations (the supplementary material presents some low‐quality segmentation cases in S.1 and further validates the model's robustness in S.2). First, aligning complex, heterogeneous multi‐modal features remains challenging for adversarial learning in UDA, and further research is needed to thoroughly analyze the convergence of adversarial learning during segmentation. Second, although histogram matching is useful for image style transfer by aligning intensity distributions, it cannot guarantee that the histogram‐matched source‐domain images will fully reflect the target domain's style. For image modality pairs with even larger domain gaps than CT‐MRI, this strategy may fail to substantially reduce the domain gap for effective adversarial learning. Moreover, histogram matching lacks adaptability and cannot be further optimized within a deep learning framework for end‐to‐end training; its image translation potential is also limited by its fixed mapping scheme, which can be over‐simplified for complex medical images. Future research could therefore combine theory‐driven traditional image translation methods with data‐driven learning methods to gain the advantages of both. Third, since target‐domain data are unlabeled, special attention should be given to setting appropriate stopping criteria or validation strategies during training. In this study, we did not use a validation set; instead, the model from the final iteration was selected for inference, which may not capture the model's optimal performance. Finally, beyond prediction uncertainty, overconfident predictions can lead to biased interpretations of results; these were not investigated in this study but remain an important area for future research. Addressing these limitations in future work could involve exploring advanced style transfer techniques, incorporating anatomical priors, or developing more efficient adversarial training strategies specifically tailored for cross‐modality medical image segmentation.

5. CONCLUSION

The proposed HMeAL‐UDA framework efficiently mitigated domain gaps by learning domain‐invariant features through histogram matching‐enhanced adversarial learning. It reduced the annotation burden and facilitated the use of 3D unpaired cross‐modality medical images to develop cross‐domain models. Extensive experiments on multiple datasets for bidirectional adaptation tasks demonstrated that our method surpassed existing state‐of‐the‐art UDA methods. The uncertainty estimation study further demonstrated that HMeAL‐UDA provides uncertainty maps closely matching those of supervised models, offering more reliable information to guide clinical intervention and decision‐making.

CONFLICT OF INTEREST STATEMENT

The authors declare no conflicts of interest.

Supporting information

MP-52-4299-s001.JPG (33.3KB, JPG)
MP-52-4299-s002.doc (194KB, doc)
MP-52-4299-s003.JPG (36.3KB, JPG)
MP-52-4299-s004.JPG (34KB, JPG)
MP-52-4299-s005.JPG (41KB, JPG)

ACKNOWLEDGMENTS

The study was supported by the US National Institutes of Health (R01 CA240808, R01 CA258987, R01 CA280135, and R01 EB034691).

Qian X, Shao H‐C, Li Y, Lu W, Zhang Y. Histogram matching‐enhanced adversarial learning for unsupervised domain adaptation in medical image segmentation. Med Phys. 2025;52:4299–4317. 10.1002/mp.17757

DATA AVAILABILITY STATEMENT

The CHAOS, BTCV, and AMOS datasets used in this study are publicly available for download and use (https://chaos.grand‐challenge.org/; https://www.synapse.org/Synapse:syn3193805/; https://amos22.grand‐challenge.org/). The UTSW dataset was collected from our institution and is non‐public. According to HIPAA policy, access to the dataset will be granted on a case‐by‐case basis upon submission of a request to the corresponding authors and the institution.

REFERENCES

  • 1. Hendrycks D, Dietterich T. Benchmarking neural network robustness to common corruptions and perturbations. In: Proceedings of the International Conference on Learning Representations (ICLR); 2019.
  • 2. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst. 2012;25:1097‐1105. [Google Scholar]
  • 3. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436‐444. [DOI] [PubMed] [Google Scholar]
  • 4. Litjens G, Kooi T, Bejnordi BE, et al. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60‐88. [DOI] [PubMed] [Google Scholar]
  • 5. Ma J, Zhang Y, Gu S, et al. Abdomenct‐1k: is abdominal organ segmentation a solved problem?. IEEE Trans Pattern Anal Mach Intell. 2021;44(10):6695‐6714. [DOI] [PubMed] [Google Scholar]
  • 6. Moreno‐Torres JG, Raeder T, Alaiz‐Rodríguez R, Chawla NV, Herrera F. A unifying view on dataset shift in classification. Pattern Recognit. 2012;45(1):521‐530. [Google Scholar]
  • 7. Zhou K, Liu Z, Qiao Y, Xiang T, Loy CC. Domain generalization: a survey. IEEE Trans Pattern Anal Mach Intell. 2022;45(4):4396‐4415. [DOI] [PubMed] [Google Scholar]
  • 8. AlBadawy EA, Saha A, Mazurowski MA. Deep learning for segmentation of brain tumors: impact of cross‐institutional training and testing. Med Phys. 2018;45(3):1150‐1158. [DOI] [PubMed] [Google Scholar]
  • 9. Prados F, Ashburner J, Blaiotta C, et al. Spinal cord grey matter segmentation challenge. Neuroimage. 2017;152:312‐329. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Ghafoorian M, Mehrtash A, Kapur T, et al. Transfer learning for domain adaptation in MRI: application in brain lesion segmentation. In: Medical Image Computing and Computer‐Assisted Intervention (MICCAI) 2017 Proceedings, Part III . Quebec City, QC, Canada; 2017:2020‐2027. Springer. [Google Scholar]
  • 11. Kouw WM, Loog M. A review of domain adaptation without target labels. IEEE Trans Pattern Anal Mach Intell. 2019;43(3):766‐785. [DOI] [PubMed] [Google Scholar]
  • 12. Li J, Yu Z, Du Z, Zhu L, Shen HT. A comprehensive survey on source‐free domain adaptation. IEEE Trans Pattern Anal Mach Intell. 2024;46:5743‐5762. [DOI] [PubMed] [Google Scholar]
  • 13. Guan H, Liu M. Domain adaptation for medical image analysis: a survey. IEEE Trans Biomed Eng. 2021;69(3):1173‐1185. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Evan MY, Iglesias JE, Dalca AV, Sabuncu MR. An auto‐encoder strategy for adaptive image segmentation. IEEE Trans Med Imaging. 2020;39(7):2327‐2338.31995477 [Google Scholar]
  • 15. Yao Y, Liu F, Zhou Z, et al. Unsupervised domain adaptation through shape modeling for medical image segmentation. In: International Conference on Medical Imaging with Deep Learning (MIDL) . Proceedings of Machine Learning Research; 2022. [Google Scholar]
  • 16. Ganin Y, Ustinova E, Ajakan H, et al. Domain‐adversarial training of neural networks. J Mach Learn Res. 2016;17(59):1‐35. [Google Scholar]
  • 17. Wang S, Yu L, Yang X, Fu C‐W, Heng P‐A. Patch‐based output space adversarial learning for joint optic disc and cup segmentation. IEEE Trans Med Imaging. 2019;38(11):2485‐2495. [DOI] [PubMed] [Google Scholar]
  • 18. Huo Y, Xu Z, Moon H, et al. Synseg‐net: synthetic segmentation without target modality ground truth. IEEE Trans Med Imaging. 2018;38(4):1016‐1025. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Dou Q, Ouyang C, Chen C, et al. Pnp‐adanet: plug‐and‐play adversarial domain adaptation network at unpaired cross‐modality cardiac segmentation. IEEE Access. 2019;7:99065‐99076. [Google Scholar]
  • 20. Zhu J‐Y, Park T, Isola P, Efros AA. Unpaired image‐to‐image translation using cycle‐consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2017:2242‐2251. IEEE. [Google Scholar]
  • 21. Park T, Efros AA, Zhang R, Zhu J‐Y, Contrastive learning for unpaired image‐to‐image translation. In: Computer Vision–ECCV 2020: 16th European Conference on Computer Vision; 2020; Glasgow, UK. Proceedings, Part IX. Springer. [Google Scholar]
  • 22. Tomar D, Lortkipanidze M, Vray G, Bozorgtabar B, Thiran J‐P. Self‐attentive spatial adaptive normalization for cross‐modality domain adaptation. IEEE Trans Med Imaging. 2021;40(10):2926‐2938. [DOI] [PubMed] [Google Scholar]
  • 23. Tsai Y‐H, Hung W‐C, Schulter S, Sohn K, Yang M‐H, Chandraker M. Learning to adapt structured output space for semantic segmentation. IEEE Trans Pattern Anal Mach Intell. 2018;40(4):1045‐1058. [Google Scholar]
  • 24. Yan W, Wang Y, Xia M, Tao Q. Edge‐guided output adaptor: highly efficient adaptation module for cross‐vendor medical image segmentation. IEEE Signal Process Lett. 2019;26(11):1593‐1597. [Google Scholar]
  • 25. Chen C, Dou Q, Chen H, Qin J, Heng PA. Unsupervised bidirectional cross‐modality adaptation via deeply synergistic image and feature alignment for medical image segmentation. IEEE Trans Med Imaging. 2020;39(7):2494‐2505. [DOI] [PubMed] [Google Scholar]
  • 26. Han X, Qi L, Yu Q, et al. Deep symmetric adaptation network for cross‐modality medical image segmentation. IEEE Trans Med Imaging. 2021;41(1):121‐132. [DOI] [PubMed] [Google Scholar]
  • 27. Xie Q, Li Y, He N, et al. Unsupervised domain adaptation for medical image segmentation by disentanglement learning and self‐training. IEEE Trans Med Imaging. 2022;43(1):4‐14. [DOI] [PubMed] [Google Scholar]
  • 28. Sun Y, Dai D, Xu S. Rethinking adversarial domain adaptation: orthogonal decomposition for unsupervised domain adaptation in medical image segmentation. Med Image Anal. 2022;82:102623. [DOI] [PubMed] [Google Scholar]
  • 29. Jähne B. Digital Image Processing. Springer Science & Business Media; 2005. [Google Scholar]
  • 30. Burger W, Burge MJ. Digital Image Processing: An Algorithmic Introduction. Springer Nature; 2022. [Google Scholar]
  • 31. Roy S, Bhalla K, Patel R. Mathematical analysis of histogram equalization techniques for medical image enhancement: a tutorial from the perspective of data loss. Multimedia Tools Appl. 2024;83(5):14363‐14392. [Google Scholar]
  • 32. Goceri E. Medical image data augmentation: techniques, comparisons and interpretations. Artif Intell Rev. 2023;56(11):12561‐12605. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Mittal H, Pandey AC, Saraswat M, Kumar S, Pal R, Modwel G. A comprehensive survey of image segmentation: clustering methods, performance parameters, and benchmark datasets. Multimedia Tools Appl. 2022;81:35001‐35026. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Yaras C, Kassaw K, Huang B, Bradbury K, Malof JM. Randomized histogram matching: a simple augmentation for unsupervised domain adaptation in overhead imagery. IEEE J Sel Top Appl Earth Observ Remote Sens. 2023;16:3456‐3466. [Google Scholar]
  • 35. Liu S, Yin S, Qu L, Wang M. Reducing domain gap in frequency and spatial domain for cross‐modality domain adaptation on medical image segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence; 2023. AAAI Press. [Google Scholar]
  • 36. Nair T, Precup D, Arnold DL, Arbel T. Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. Med Image Anal. 2020;59:101557. [DOI] [PubMed] [Google Scholar]
  • 37. Yang Z, Lafata K, Vaios E, et al. Quantifying U‐Net uncertainty in multi‐parametric MRI‐based glioma segmentation by spherical image projection. Med Phys. 2024;51(3):1931‐1943. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Gawlikowski J, Tassi CRN, Ali M, et al. A survey of uncertainty in deep neural networks. Artif Intell Rev. 2023;56(Suppl 1):1513‐1589. [Google Scholar]
  • 39. Maruccio FC, Eppinga W, Laves M‐H, et al. Clinical assessment of deep learning‐based uncertainty maps in lung cancer segmentation. Phys Med Biol. 2024;69(3):035001. doi: 10.1088/1361-6560/ad1a26 [DOI] [PubMed] [Google Scholar]
  • 40. Hüllermeier E, Waegeman W. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning. 2021;110(3):457‐506. [Google Scholar]
  • 41. Gal Y, Ghahramani Z. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: Proceedings of the 33rd International Conference on Machine Learning (ICML); 2016:1050‐1059. JMLR.org. [Google Scholar]
  • 42. Mehta R, Filos A, Baid U, et al. QU‐BraTS: mICCAI BraTS 2020 challenge on quantifying uncertainty in brain tumor segmentation‐analysis of ranking scores and benchmarking results. J Mach Learn Biomed Imaging. 2022;1:1‐54. doi: 10.59275/j.melba.2022-354b [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Kavur AE, Gezer NS, Barış M, et al. CHAOS challenge‐combined (CT‐MR) healthy abdominal organ segmentation. Med Image Anal. 2021;69:101950. [DOI] [PubMed] [Google Scholar]
  • 44. Landman B, Xu Z, Iglesias J, Styner M, Langerak T, Klein A. MICCAI multi‐atlas labeling beyond the cranial vault–workshop and challenge. In: Proceedings of the MICCAI Multi‐Atlas Labeling Beyond Cranial Vault—Workshop and Challenge; 2015. Springer. [Google Scholar]
  • 45. Ji Y, Bai H, Ge C, et al. Amos: a large‐scale abdominal multi‐organ benchmark for versatile medical image segmentation. Adv Neural Inf Process Syst. 2022;35:36722‐36732. [Google Scholar]
  • 46. Milletari F, Navab N, Ahmadi S‐A. V‐Net: fully convolutional neural networks for volumetric medical image segmentation. In: Proceedings of the International Conference on 3D Vision (3DV); 2016:565‐571. IEEE. [Google Scholar]
  • 47. Liu S, Yin S, Qu L, Wang M, Song Z. A structure‐aware framework of unsupervised cross‐modality domain adaptation via frequency and spatial knowledge distillation. IEEE Trans Med Imaging. 2023;42(12):3919‐3931. [DOI] [PubMed] [Google Scholar]


Supplementary Materials

Supporting Information

MP-52-4299-s001.JPG (33.3KB, JPG)
MP-52-4299-s002.doc (194KB, doc)
MP-52-4299-s003.JPG (36.3KB, JPG)
MP-52-4299-s004.JPG (34KB, JPG)
MP-52-4299-s005.JPG (41KB, JPG)

Data Availability Statement

The CHAOS, BTCV, and AMOS datasets used in this study are publicly available for download and use (https://chaos.grand‐challenge.org/; https://www.synapse.org/Synapse:syn3193805/; https://amos22.grand‐challenge.org/). The UTSW dataset was collected at our institution and is not publicly available. In accordance with HIPAA policy, access to this dataset will be granted on a case‐by‐case basis upon submission of a request to the corresponding authors and the institution.


Articles from Medical Physics are provided here courtesy of Wiley
