Author manuscript; available in PMC: 2023 Nov 1.
Published in final edited form as: IEEE Trans Med Imaging. 2022 Oct 27;41(11):3445–3453. doi: 10.1109/TMI.2022.3186698

Dual Adversarial Attention Mechanism for Unsupervised Domain Adaptive Medical Image Segmentation

Xu Chen 1, Tianshu Kuang 2, Hannah Deng 3, Steve H Fung 4, Jaime Gateno 5,6, James J Xia 7,8, Pew-Thian Yap 9
PMCID: PMC9748599  NIHMSID: NIHMS1846015  PMID: 35759585

Abstract

Domain adaptation techniques have been demonstrated to be effective in addressing label deficiency challenges in medical image segmentation. However, conventional domain adaptation based approaches often concentrate on matching global marginal distributions between different domains in a class-agnostic fashion. In this paper, we present a dual-attention domain-adaptative segmentation network (DADASeg-Net) for cross-modality medical image segmentation. The key contribution of DADASeg-Net is a novel dual adversarial attention mechanism, which regularizes the domain adaptation module with two attention maps respectively from the space and class perspectives. Specifically, the spatial attention map guides the domain adaptation module to focus on regions that are challenging to align in adaptation. The class attention map encourages the domain adaptation module to capture class-specific instead of class-agnostic knowledge for distribution alignment. DADASeg-Net shows superior performance in two challenging medical image segmentation tasks.

Keywords: Attention Mechanism, Unsupervised Domain Adaptation, Adversarial Learning, Medical Image Segmentation

I. INTRODUCTION

Automatic segmentation of medical images has attracted increasing attention in modern computer-aided diagnosis (CAD) [1], [2]. Although great advancements have been made with the emergence of deep learning techniques, building a reliable segmentation model for medical images remains challenging. The appearance of medical images can vary significantly with imaging modalities or acquisition parameters. Moreover, manually annotating ground truth for medical images is typically challenging and requires professional skills or knowledge. The lack of pixel-wise annotated data is a longstanding issue in medical image segmentation.

To address this problem, a feasible solution is to exploit the annotations from another modality via domain adaptation at the image level. By learning an image-to-image translation model between the target modality and a label-rich source modality, realistic data can be synthesized to train the segmenter for the target modality without pixel-wise annotations [3]–[5]. For instance, Huo et al. [3] presented SynSeg-Net to learn image-level translations between a target domain without annotations and a source domain with annotations. Synthetic data in the target domain are coupled with annotations in the source domain to train the segmentation model. Li et al. [5] proposed a simplified unsupervised image translation method (SUIT) for domain adaptation in semantic segmentation. Another promising solution is to align the hidden representation spaces of different domains via adversarial learning, i.e., feature-level domain adaptation [6]–[9]. For example, Dou et al. [6] introduced PnP-AdaNet to adapt segmentation networks between different modalities with a plug-and-play module.

Most of the above-mentioned domain adaptation methods concentrate on matching the global marginal distributions of different domains at the image or feature level. Such a strategy treats all spatial positions equally in a class-agnostic fashion and (i) ignores the fact that some regions are more easily misaligned than others; (ii) disadvantages minority classes; and (iii) discards domain-specific information that preserves discriminability and is meaningful for downstream tasks.

In this paper, we present a dual-attention domain-adaptative segmentation network (DADASeg-Net) to adapt the knowledge learned from a source domain with annotations to a target domain without annotations. This is realized by aligning the feature-level distributions of the source and target domains via spatial and class attention mechanisms, which regularize the domain adaptation module by adaptively weighting the adversarial loss at the feature level. The former forces the domain adaptation module to focus on the spatial regions that tend to be misaligned. The latter encourages the capturing of class-specific rather than class-agnostic knowledge for adaptation. We confirmed the efficacy of our method based on two challenging cross-modality medical image segmentation tasks: skull segmentation from MRI and cardiac substructure segmentation from CT.

II. RELATED WORK

A. Unsupervised Domain Adaptation

Unsupervised domain adaptation (UDA) is broadly used to address the distributional shift issue between a source domain, where labels are available, and a target domain, where labels are unavailable. UDA transfers knowledge between two domains, with the assumption that the class labels across domains are consistent. UDA is of particular significance in medical image segmentation, where annotations are typically scarce. Earlier UDA methods include reweighting methods for reweighting source samples by their significance [10], [11], iterative methods for iteratively generating pseudo labels for target samples [12]–[14], common-representation methods for building a common feature space for two domains, and hierarchical Bayesian methods for factorizing domain-dependent latent representations [15], [16]. Inspired by generative adversarial networks (GANs) [17], recent UDA methods focus on distribution alignment via adversarial learning at data-level [3], [4], [18], [19] or feature-level [6], [9], [20], [21]. For example, Huo et al. [3] proposed to adapt the visual appearance of two domains with a CycleGAN [22]. Ouyang et al. [20] learned a common domain-agnostic representation space to obtain distribution-matched latent semantic features. Distribution alignment performed jointly at both image and feature levels further improves performance [7], [19]. Chen et al. [21] proposed to hierarchically harmonize transferability and discriminability through feature reweighting for cross-domain object detection. Luo et al. [23] introduced a category-level adversarial learning framework for semantic consistent domain adaptation.

B. Attention Mechanism

Attention mechanisms, originally developed for natural language modeling [24], have been widely adopted for computer vision tasks, e.g., image captioning [25]–[28], semantic segmentation [23], [29]–[32], and object detection [33]–[35]. Attention is usually realized by selectively reweighting features based on their relevance to a specific task. For instance, Chen et al. [25] proposed SCA-CNN to incorporate spatial-wise and channel-wise attention in a CNN for image captioning. Fu et al. [32] presented a dual attention network to adaptively integrate local features with their global dependencies for scene segmentation. Wang et al. [36] introduced a non-local operation to capture long-range spatial dependencies by computing an attention map. Huang et al. [29] proposed CCNet to adaptively capture long-range contextual information from full-image dependencies with a criss-cross attention module. Zhao et al. [37] introduced PSANet to leverage the contextual information for each individual position in a hidden representation space.

III. METHOD

A. Overview

For a source domain image set $\mathcal{X}_S$ with annotation $\mathcal{Y}_S$ and a target domain image set $\mathcal{X}_T$ without annotation, our goal is to construct an auto-segmentation model for the target domain by leveraging the knowledge learned from the source domain.

To this end, a widely used strategy is to adapt the features captured from target domain images to those from source domain images by matching the entire hidden representation spaces of the two domains [6], [7], [38], [39]. However, such a strategy often extracts domain-agnostic representations and discards domain-specific information that not only preserves discriminability but is also meaningful for downstream tasks [23]. In this work, we encourage the domain adaptation module to capture meaningful information by calculating two attention maps: (i) a spatial attention map, and (ii) a class attention map. The former assigns high weights to spatial positions that appear to be poorly aligned in domain adaptation. The latter adjusts the weights for each class individually for discriminative distribution alignment. These two attention maps are finally combined and incorporated into a domain adaptation adversarial loss function, as introduced in Section III-B.

Based on the proposed dual adversarial attention mechanism, we train our cross-modality segmentation model with source domain annotations, as detailed in Section III-C. The network architecture and training details are described in Sections III-D and III-E, respectively.

B. Dual Adversarial Attention Mechanism

The proposed dual adversarial attention mechanism is illustrated in Figure 1. We use a segmenter to perform semantic segmentation for images from either the source or target domain, as well as a discriminator to match the distributions of semantic representations for different domains.

Fig. 1: The proposed dual adversarial attention mechanism.

Specifically, the segmenter consists of two encoders $SE_S$ and $SE_T$ to respectively extract latent representations from the source and target domain images, and a shared decoder $SD$ to produce the segmentation masks from the extracted features. The discriminator $D$ adapts the semantic features produced by $SD$ to a common feature space via adversarial learning. However, unlike conventional domain adaptation methods that treat all elements of semantic features equally, our domain adaptation module focuses on specific regions determined by the spatial and class attention mechanisms.

The spatial attention mechanism encourages the domain adaptation module to focus on the regions that tend to be misaligned in the spatial dimension. To realize this, we incorporate a CycleGAN [22] into our framework to learn the image-level translations between two domains. Ideally, the semantic features extracted from a real image in one domain and its synthetic counterpart in another domain should be consistent. We realize this requirement with the following feature consistency loss function:

$$\mathcal{L}_{fc} = \left\| SE_S(x_s) - SE_T(\hat{x}_{s \to t}) \right\|_2^2 + \left\| SE_T(x_t) - SE_S(\hat{x}_{t \to s}) \right\|_2^2, \tag{1}$$

where $x_s \in \mathcal{X}_S$ and $x_t \in \mathcal{X}_T$ denote real images of the source and target domains, respectively, and $\hat{x}_{s \to t}$ and $\hat{x}_{t \to s}$ are the synthetic images translated from $x_s$ and $x_t$ by the CycleGAN, respectively. In practice, however, the semantic features of real and synthetic images can still differ substantially at some spatial positions. We thus calculate the spatial attention map based on the inconsistency of semantic features captured from real and synthetic images at all spatial positions. The larger the inconsistency, the higher the attention weight. Letting $SD_2(\cdot)$ denote the feature map generated by the layer before the output layer of the segmenter, we calculate the spatial attention map with the operator $f_{spatial}$ defined as

$$f_{spatial}(x_t) = \phi\big(SD_2(SE_T(x_t)),\, SD_2(SE_S(\hat{x}_{t \to s}))\big), \tag{2}$$

where $x_t \in \mathcal{X}_T$ is a real target domain image, and $\phi(\cdot, \cdot)$ measures the dissimilarity between two semantic features across all spatial positions. In this work, $\phi(\cdot, \cdot)$ is defined as the cosine distance.
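
As a rough illustration of Eq. (2), the sketch below computes a per-position cosine distance between two feature grids. It is not the authors' implementation: the nested-list layout (height, width, channels), the function names, and the small `eps` guard are assumptions made for this example, and the text does not say whether the resulting map is rescaled before use.

```python
import math


def cosine_distance(u, v, eps=1e-8):
    """Return 1 - cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps only guards against division by zero for all-zero vectors.
    return 1.0 - dot / max(norm_u * norm_v, eps)


def spatial_attention(feat_real, feat_synth):
    """Per-position cosine distance between two H x W grids of channel vectors.

    feat_real[i][j] and feat_synth[i][j] hold the channel vector at (i, j).
    Larger values mean the real and synthetic features disagree more there.
    """
    return [
        [cosine_distance(u, v) for u, v in zip(row_real, row_synth)]
        for row_real, row_synth in zip(feat_real, feat_synth)
    ]


# Toy 1 x 3 feature grids with 2 channels per position (made-up numbers).
real = [[[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]]
synth = [[[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]]
attention = spatial_attention(real, synth)
```

In this toy input the three positions have parallel, orthogonal, and opposite feature vectors, so the weights grow from left to right, matching the rule that larger inconsistency receives higher attention.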

The class attention mechanism encourages a more discriminative distribution alignment. It produces an attention map for each class individually, which is calculated based on the probability of class occurrence. This is realized by using the operator fclass defined as

$$f_{class}(x_t) = \mathrm{Softmax}\big(SD(SE_T(x_t))\big). \tag{3}$$

Specifically, for each class, we calculate a class attention map based on the Softmax prediction of segmenter output across all spatial positions. The greater the likelihood a position is estimated as belonging to the c-th class, the higher the weight at this position is in the c-th attention map. In this way, the discriminator is guided to capture class-specific rather than class-agnostic knowledge in adaptation, which encourages more discriminative feature alignment as well as reduces the bias caused by inter-class discrepancy.
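
The class attention of Eq. (3) can be sketched as a softmax over the class dimension at every spatial position, followed by a regrouping into one map per class. This is a minimal sketch under assumptions: the logits are stored as a nested list indexed by (row, column, class), and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_attention(logit_map):
    """Convert an H x W grid of class logits into C attention maps of size H x W."""
    probs = [[softmax(pixel) for pixel in row] for row in logit_map]
    num_classes = len(logit_map[0][0])
    # maps[c][i][j] is the predicted probability of class c at position (i, j).
    return [
        [[pixel[c] for pixel in row] for row in probs]
        for c in range(num_classes)
    ]


# Toy 1 x 2 grid with 2 classes (made-up logits).
logits = [[[2.0, 0.0], [0.0, 0.0]]]
maps = class_attention(logits)
```

At the first position the class-0 logit dominates, so the class-0 map receives most of the weight there; at the second position both classes receive 0.5.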

The spatial and class attention maps are combined and incorporated into our domain adaptation adversarial loss function. Specifically, the spatial attention map is tiled and then element-wise multiplied with the class attention map to generate a final activation map. This activation map is then multiplied with the intermediate output of D to produce the final output of size C × H × W, where C denotes the number of classes, and H and W denote the height and width of the training images, respectively. The output of the discriminator can be interpreted as C individual slices, where the values of the c-th slice indicate whether the corresponding voxels of the input features come from the source or the target domain for the c-th class. Following [40], we measure the discrimination and adversarial domain adaptation losses with the least squares loss function as follows:

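$$\mathcal{L}_{dis}^{f}(D) = \big[D(SD_2(SE_S(x_s))) - 1\big]^2 + \big[D(SD_2(SE_T(x_t))) + 1\big]^2, \tag{4}$$

$$\mathcal{L}_{adv}^{f}(SE_T, SD) = \big[f_{spatial}(x_t) \odot f_{class}(x_t)\big]\,\big[D(SD_2(SE_T(x_t))) - 1\big]^2, \tag{5}$$

where $\odot$ represents the element-wise multiplication operator.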
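
To make the roles of Eqs. (4) and (5) concrete, the following sketch evaluates both losses on flattened toy values. Several points are assumptions rather than statements from the text: the mean reduction over elements, the flattened list layout, the element-wise product between the attention weights and the squared error, and the function names.

```python
def discriminator_loss(d_source, d_target):
    """Least-squares loss of Eq. (4): source outputs pushed to +1, target to -1."""
    src = sum((d - 1.0) ** 2 for d in d_source) / len(d_source)
    tgt = sum((d + 1.0) ** 2 for d in d_target) / len(d_target)
    return src + tgt


def attention_weighted_adv_loss(d_target, spatial, class_maps):
    """Attention-weighted adversarial loss in the spirit of Eq. (5).

    d_target and class_maps are lists of C flattened maps (H*W values each);
    spatial is one flattened H*W map that is reused (tiled) for every class.
    """
    total, count = 0.0, 0
    for d_c, cls_c in zip(d_target, class_maps):
        for d, a_cls, a_sp in zip(d_c, cls_c, spatial):
            # The segmenter wants target features to be scored as source (+1).
            total += a_sp * a_cls * (d - 1.0) ** 2
            count += 1
    return total / count


# Toy values: 2 classes, 2 spatial positions (made-up numbers).
spatial = [0.0, 2.0]                      # second position is poorly aligned
class_maps = [[0.5, 0.5], [0.5, 0.5]]
d_target = [[-1.0, -1.0], [-1.0, -1.0]]   # discriminator is sure these are target
adv = attention_weighted_adv_loss(d_target, spatial, class_maps)
dis = discriminator_loss([1.0, 1.0], [-1.0, -1.0])
```

Only the second position contributes to `adv` because its spatial weight is non-zero, which is the intended effect of focusing the adversarial signal on poorly aligned regions.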

C. Domain Adaptive Segmentation

The domain adaptation module forces the segmenter encoders $SE_S$ and $SE_T$ to embed the images of either the source or target domain into a common feature space. Naturally, the source domain annotations can be used to train $SE_S$ and $SD$ with the following loss function:

$$\mathcal{L}_{seg}^{s}(SE_S, SD) = \mathrm{CE}\big(SD(SE_S(x_s)),\, y_s\big), \tag{6}$$

where $x_s \in \mathcal{X}_S$ and $y_s \in \mathcal{Y}_S$ are respectively the source domain image and the corresponding annotation, and CE represents the cross-entropy loss function.

To avoid the segmenter being biased to the source domain, the synthetic target domain images are coupled with the source domain annotations to train $SE_T$ and $SD$. The loss function is formulated as

$$\mathcal{L}_{seg}^{t}(SE_T, SD) = \mathrm{CE}\big(SD(SE_T(\hat{x}_{s \to t})),\, y_s\big), \tag{7}$$

where $\hat{x}_{s \to t}$ is the synthetic target domain image that is translated from the source domain image $x_s$.
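
Both segmentation losses, Eqs. (6) and (7), apply the same cross-entropy to different inputs while reusing the source label map. The sketch below shows a pixel-wise cross-entropy on already-normalized probabilities; the mean reduction, the `eps` clamp, and the toy numbers are assumptions for illustration only.

```python
import math


def pixel_cross_entropy(prob_map, label_map, eps=1e-12):
    """Mean cross-entropy over pixels.

    prob_map[i][j] is a list of class probabilities; label_map[i][j] is the
    integer index of the annotated class at the same position.
    """
    total, count = 0.0, 0
    for prob_row, label_row in zip(prob_map, label_map):
        for probs, label in zip(prob_row, label_row):
            total -= math.log(max(probs[label], eps))
            count += 1
    return total / count


# The same source annotation supervises both predictions (made-up numbers).
labels = [[0, 1]]
pred_on_source = [[[0.9, 0.1], [0.2, 0.8]]]      # stands in for the Eq. (6) input
pred_on_synthetic = [[[0.6, 0.4], [0.5, 0.5]]]   # stands in for the Eq. (7) input
loss_seg_s = pixel_cross_entropy(pred_on_source, labels)
loss_seg_t = pixel_cross_entropy(pred_on_synthetic, labels)
```

Because the synthetic image is translated from the annotated source image, no target domain annotation is needed to compute the second loss.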

D. Network Architecture

Our domain adaptation module consists of two segmenter encoders ($SE_S$ and $SE_T$), a shared segmenter decoder $SD$, and a discriminator $D$. Each segmenter encoder contains three encoding blocks. Each encoding block consists of three convolutional layers, the last of which has a stride of 2 to gradually reduce the resolution of the feature maps. The encoding blocks are followed by three residual blocks [41] to increase non-linear modeling capability. The segmenter decoder comprises three residual blocks, three decoding blocks, and an additional convolutional layer to produce the final output. Each decoding block comprises a transposed convolutional layer with a stride of 2, followed by two convolutional layers. For the discriminator $D$, we adopt the U-Net [42] architecture. In addition, we append group normalization [43] after each convolutional layer and transposed convolutional layer. For the activation function, we choose ReLU for the segmenter and Leaky ReLU with slope 0.2 for the discriminator, as suggested in [44].
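
The effect of the three stride-2 layers on feature-map resolution can be checked with the standard convolution output-size formula. The kernel size of 3 and padding of 1 used below are assumptions (the text does not state them), as is the 176 × 184 input, which is a hypothetical size chosen only for illustration.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Standard convolution output length: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_output_size(size, num_blocks=3):
    """Apply three encoding blocks: two stride-1 layers, then one stride-2 layer."""
    for _ in range(num_blocks):
        size = conv_output_size(size, stride=1)
        size = conv_output_size(size, stride=1)
        size = conv_output_size(size, stride=2)  # halves the resolution
    return size


height = encoder_output_size(176)
width = encoder_output_size(184)
```

Under these assumptions each encoding block halves the resolution, so the encoder reduces each spatial dimension by a factor of 8, and the three stride-2 transposed convolutions in the decoder would be expected to restore it.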

E. Training Details

Our framework consists of a CycleGAN [22] (comprising two generators and two image discriminators), a segmenter, and a feature discriminator. We split all subnetworks into 3 groups and trained them alternately. The training strategy is detailed in Algorithm 1. Specifically, at each iteration we first trained all discriminators with the loss $\mathcal{L}_{dis}^{I} + \mathcal{L}_{dis}^{f}$, where $\mathcal{L}_{dis}^{I}$ is the image discrimination loss in the CycleGAN. Then we trained the segmenter with the loss $\mathcal{L}_{seg}^{s} + \lambda_1 \mathcal{L}_{seg}^{t} + \lambda_2 \mathcal{L}_{adv}^{f}$. Finally, the generators in the CycleGAN were trained with the loss $\mathcal{L}_{cyc} + \lambda_3 \mathcal{L}_{adv}^{I} + \lambda_4 \mathcal{L}_{fc}$, where $\mathcal{L}_{cyc}$ and $\mathcal{L}_{adv}^{I}$ are the cycle-consistency and image adversarial losses in the CycleGAN.

We trained the model with mini-batches. Each batch contains 4 source domain images and 4 target domain images that are randomly selected from $\mathcal{X}_S$ and $\mathcal{X}_T$, respectively. We adopted the Adam algorithm [45] to train our model with a learning rate of 0.0001. The other hyper-parameters were set to $\lambda_1 = 0.5$, $\lambda_2 = 0.01$, $\lambda_3 = 0.01$, and $\lambda_4 = 0.1$.

Algorithm 1.

Training strategy

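Input: Image sets $\mathcal{X}_S$ and $\mathcal{X}_T$, and annotation set $\mathcal{Y}_S$
1: for t = 1 to T do
2:  Train the discriminators with loss $\mathcal{L}_{dis}^{I} + \mathcal{L}_{dis}^{f}$.
3:  Train the segmenter with loss $\mathcal{L}_{seg}^{s} + \lambda_1 \mathcal{L}_{seg}^{t} + \lambda_2 \mathcal{L}_{adv}^{f}$.
4:  Train the generators with loss $\mathcal{L}_{cyc} + \lambda_3 \mathcal{L}_{adv}^{I} + \lambda_4 \mathcal{L}_{fc}$.
5: end for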
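
The alternating schedule of Algorithm 1 can be sketched as a plain loop over three update callables. This is only a skeleton: the function names, the use of one shared batch for all three steps, and the stand-in update functions are assumptions, and no actual network or optimizer is modeled.

```python
def segmenter_objective(l_seg_s, l_seg_t, l_adv_f, lambda1=0.5, lambda2=0.01):
    """Weighted segmenter loss used in step 3 of Algorithm 1."""
    return l_seg_s + lambda1 * l_seg_t + lambda2 * l_adv_f


def generator_objective(l_cyc, l_adv_i, l_fc, lambda3=0.01, lambda4=0.1):
    """Weighted CycleGAN generator loss used in step 4 of Algorithm 1."""
    return l_cyc + lambda3 * l_adv_i + lambda4 * l_fc


def train(num_iterations, next_batch, update_discriminators,
          update_segmenter, update_generators):
    """Run the three update groups in a fixed order at every iteration."""
    log = []
    for t in range(1, num_iterations + 1):
        batch = next_batch()
        d_loss = update_discriminators(batch)  # step 2
        s_loss = update_segmenter(batch)       # step 3
        g_loss = update_generators(batch)      # step 4
        log.append((t, d_loss, s_loss, g_loss))
    return log


# Stand-in update functions that only record the order in which they run.
calls = []


def make_update(name):
    def update(batch):
        calls.append(name)
        return 0.0
    return update


log = train(2, lambda: None, make_update("D"), make_update("S"), make_update("G"))
```

The default weights in the two objective helpers are the values reported in Section III-E; in a real implementation each update callable would compute its loss, backpropagate, and step only the parameters of its own group.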

IV. EXPERIMENTS

A. Experimental Settings

To evaluate the efficacy of DADASeg-Net, we first tested it on a skull segmentation task based on MRI. In this task, we aim to automatically annotate bony structures in MR images by utilizing the bony annotations from CT. To this end, we built the training dataset with 50 CT and 50 MRI scans, which were collected from the CQ500 database [46] and the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database [47], respectively. The testing dataset includes 8 pairs of CT-MRI scans collected from the ADNI database. For data preprocessing, we linearly aligned all CT and MRI scans using FMRIB’s Linear Image Registration Tool (FLIRT) [48]. The aligned scans were then cropped to the same size of 176×184×146, and the intensity values of all scans were linearly scaled to [–1, 1]. The skull annotations of both training and testing data were obtained from CT images by intensity-based thresholding, followed by manual corrections as needed. For the paired CT-MRI data in the testing dataset, the bony annotations of the CT scans are shared with their MRI counterparts for quantitative evaluation.
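
The linear intensity scaling to [–1, 1] mentioned above can be sketched as a min-max mapping. Using the per-scan minimum and maximum as the endpoints is an assumption; the text does not say whether intensities were clipped or which range was used.

```python
def scale_to_unit_range(values):
    """Linearly map intensities so the minimum becomes -1 and the maximum +1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant image has no contrast; map everything to the midpoint.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


# Flattened toy intensities (made-up numbers).
scaled = scale_to_unit_range([0.0, 50.0, 100.0, 200.0])
```

For the toy input, the smallest value maps to -1, the largest to +1, and the value halfway between them maps to 0.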

We also tested DADASeg-Net on cardiac substructure segmentation based on CT. In this task, we aim to automatically delineate four cardiac substructures [the ascending aorta (AA), the left atrium blood cavity (LAC), the left ventricle blood cavity (LVC), and the myocardium of the left ventricle (MYO)] in CT images, while the annotations are defined in the MRI modality. The experimental data came from the Multi-Modality Whole Heart Segmentation (MMWHS) Challenge 2017 dataset [49], which includes 20 CT and 20 MRI unpaired scans. We adopted the preprocessed data as described in [6]. Specifically, the training dataset contains 9,600 CT and 8,400 MRI samples, which were respectively cropped from 16 randomly selected CT and MRI scans. The remaining 4 CT scans were used to build the testing dataset. The intensity values of all experimental data were z-score normalized, i.e., to zero mean and unit variance.
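
Z-score normalization subtracts the mean and divides by the standard deviation. The sketch below uses the population standard deviation and normalizes one flattened image at a time; both choices are assumptions, since the text does not specify them.

```python
import math


def z_score(values):
    """Normalize values to zero mean and unit (population) variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    if std == 0.0:
        # A constant image cannot be rescaled; return zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Flattened toy intensities with mean 5 and standard deviation 2.
normalized = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```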

To confirm the efficacy of our DADASeg-Net, we compared it with 5 state-of-the-art unsupervised domain-adaptative medical image segmentation methods, including SynSeg-Net [3], PnP-AdaNet [6], Ouyang et al. [20], SIFA [7] and ARL-GAN [19]. Segmentation performance was quantitatively measured with the Dice similarity coefficient (DSC) and the average symmetric surface distance (ASSD).
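
For reference, the two evaluation metrics can be sketched as follows, using their commonly used definitions. The flattened binary masks, the pre-extracted surface point lists (in physical units such as mm), and the convention for two empty masks are assumptions of this example; surface extraction itself is not shown.

```python
import math


def dice_coefficient(pred, truth):
    """DSC = 2 |P and G| / (|P| + |G|) for two flattened binary masks."""
    intersection = sum(1 for p, g in zip(pred, truth) if p and g)
    size_sum = sum(1 for p in pred if p) + sum(1 for g in truth if g)
    if size_sum == 0:
        return 1.0  # both masks empty: treated as perfect agreement here
    return 2.0 * intersection / size_sum


def assd(surface_a, surface_b):
    """Average symmetric surface distance between two sets of surface points."""
    d_ab = [min(math.dist(p, q) for q in surface_b) for p in surface_a]
    d_ba = [min(math.dist(q, p) for p in surface_a) for q in surface_b]
    return (sum(d_ab) + sum(d_ba)) / (len(d_ab) + len(d_ba))


# Toy inputs (made-up values).
dsc = dice_coefficient([1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0])
distance = assd([(0.0, 0.0), (0.0, 1.0)], [(1.0, 0.0), (1.0, 1.0)])
```

A higher DSC and a lower ASSD both indicate better agreement with the ground truth, which is how the arrows in the result tables should be read.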

B. Segmentation Results

For skull segmentation, we first trained our method and all competing methods with 50 unpaired CT and MRI scans. All trained models were then tested on the 8 testing MRI scans. Table I shows the quantitative performance of all compared methods. DADASeg-Net yields the best performance, outperforming the second-best method (ARL-GAN) by 3.02 percentage points in DSC and 0.09 mm in ASSD. Note that SIFA and ARL-GAN have frameworks similar to ours: they also perform domain adaptation at the image level and feature level simultaneously, with a cross-modality image synthesis submodule and an adversarial feature discriminator. However, the feature space alignment of our framework is guided by the proposed dual adversarial attention mechanism, which is a key distinction from existing methods. The performance improvement shown in Table I indicates the efficacy of DADASeg-Net in comparison with other methods.

TABLE I:

Quantitative results of skull segmentation. Best results are marked in bold.

| Methods | DSC (%) | ASSD (mm) |
|---|---|---|
| SynSeg-Net | 78.32 ± 2.58 | 1.48 ± 0.17 |
| PnP-AdaNet | 74.01 ± 2.31 | 1.79 ± 0.15 |
| Ouyang et al. | 78.64 ± 3.27 | 1.52 ± 0.23 |
| SIFA | 76.60 ± 1.74 | 1.67 ± 0.09 |
| ARL-GAN | 81.05 ± 1.13 | 1.27 ± 0.08 |
| DADASeg-Net | **84.07 ± 1.51** | **1.18 ± 0.11** |

Fig. 2 shows some representative results of all comparison methods (b)-(h), as well as the (a) input images and (g) ground truth. Compared with other methods, the skull masks produced by DADASeg-Net are more complete and contain fewer artifacts.

Fig. 2: Visual results for skull segmentation. Compared with other methods, DADASeg-Net produces smoother skull segmentation masks.

We also compared DADASeg-Net with other methods in cardiac substructure segmentation. In this task, we trained all methods on unpaired cardiac CT-MRI data with manual annotations defined only in the MRI modality. We then tested the trained models on the 4 testing CT scans. In addition, we trained a U-Net [42] in a fully supervised fashion (i.e., with manual annotations for CT) for reference.

Table II summarizes the quantitative comparison results. Our DADASeg-Net yields the best performance in most cases and on average. However, the superior performance of U-Net implies that the unsupervised domain adaptive segmentation models still have room for improvement. The representative segmentation results shown in Fig. 3 also exhibit the superiority of DADASeg-Net. Compared with other algorithms, DADASeg-Net is able to produce more complete and smooth segmentation masks.

TABLE II:

Quantitative results of cardiac segmentation. Best results (except reference) are marked in bold.

| Metric | Structure | SynSeg-Net | Ouyang et al. | PnP-AdaNet | SIFA | ARL-GAN | Proposed | U-Net |
|---|---|---|---|---|---|---|---|---|
| DSC (%)↑ | AA | 71.6 ± 13.9 | 82.4 ± 6.3 | 74.0 ± 20.9 | 81.3 ± 27.6 | 71.3 ± 8.6 | **87.0 ± 2.1** | 69.2 ± 36.8 |
| DSC (%)↑ | LAC | 69.0 ± 28.0 | 74.2 ± 16.2 | 68.9 ± 14.4 | 79.5 ± 9.0 | 80.6 ± 9.3 | **80.7 ± 9.1** | 91.7 ± 3.3 |
| DSC (%)↑ | LVC | 51.6 ± 22.7 | 61.1 ± 26.9 | 61.9 ± 15.2 | 73.8 ± 6.4 | 69.5 ± 18.1 | **77.9 ± 9.0** | 89.4 ± 6.0 |
| DSC (%)↑ | MYO | 40.8 ± 17.1 | 75.8 ± 8.7 | 50.8 ± 17.6 | 61.6 ± 4.8 | **81.6 ± 7.9** | 61.2 ± 9.1 | 90.6 ± 5.1 |
| DSC (%)↑ | AVG | 58.2 ± 22.3 | 73.4 ± 21.4 | 63.9 ± 18.6 | 74.1 ± 18.8 | 75.7 ± 12.9 | **76.7 ± 12.4** | 85.2 ± 21.0 |
| ASSD (mm)↓ | AA | 11.7 ± 5.0 | **3.3 ± 1.1** | 12.8 ± 12.6 | 7.9 ± 9.3 | 6.3 ± 2.4 | 4.5 ± 3.0 | 4.4 ± 4.6 |
| ASSD (mm)↓ | LAC | 7.8 ± 2.1 | **3.9 ± 1.8** | 6.3 ± 3.5 | 6.2 ± 1.9 | 5.9 ± 2.6 | 5.6 ± 2.7 | 2.0 ± 0.6 |
| ASSD (mm)↓ | LVC | 7.0 ± 1.8 | 9.8 ± 5.3 | 17.4 ± 4.5 | 5.5 ± 1.3 | 6.7 ± 3.9 | **4.7 ± 1.6** | 2.6 ± 1.3 |
| ASSD (mm)↓ | MYO | 9.2 ± 4.3 | 5.6 ± 1.2 | 14.7 ± 4.5 | 8.5 ± 0.7 | 6.5 ± 1.7 | **5.5 ± 2.3** | 1.9 ± 0.6 |
| ASSD (mm)↓ | AVG | 8.9 ± 3.8 | 5.7 ± 3.9 | 12.8 ± 7.6 | 7.0 ± 5.7 | 6.4 ± 2.0 | **5.1 ± 2.5** | 2.7 ± 2.6 |

Fig. 3: Visual results for cardiac segmentation. The objects with colors yellow, green, blue and red respectively indicate AA, LAC, LVC, MYO.

C. Ablation Study

We conducted an ablation study to further analyze the role of several key components of the proposed framework. Specifically, we trained 2 variants of our model by removing the feature consistency loss $\mathcal{L}_{consis}^{f}$ (V1; written $\mathcal{L}_{fc}$ in Eq. (1)) and the feature adversarial learning losses $\mathcal{L}_{adv}^{f} + \mathcal{L}_{dis}^{f}$ (V2). In addition, to measure the effect of the proposed dual adversarial attention mechanism, we trained 3 additional variants by removing the class attention (V3), the spatial attention (V4), and both of them (V5).

Table III summarizes the evaluation results. We can observe that V2 achieves the lowest DSC, which confirms the effectiveness of feature-level adaptation in cross-modality segmentation. Moreover, the feature consistency constraint also contributes to the performance improvement by preserving the anatomical structure information in image-level adaptation. On the other hand, the results also suggest that applying spatial or class attention alone (V3 and V4) yields only marginal improvements over using no attention (V5). However, by integrating the two attention mechanisms, our attention-based feature adversarial learning improves the segmentation accuracy effectively, which further confirms the efficacy of our dual adversarial attention mechanism.

TABLE III:

Effects of loss functions and proposed dual adversarial attention mechanism.

| Methods | DSC (%)↑ | ASSD (mm)↓ |
|---|---|---|
| V1 (w/o $\mathcal{L}_{consis}^{f}$) | 81.72 ± 1.85 | 1.33 ± 0.14 |
| V2 (w/o $\mathcal{L}_{adv}^{f} + \mathcal{L}_{dis}^{f}$) | 79.81 ± 1.60 | 1.34 ± 0.13 |
| V3 (w/o class attention) | 82.08 ± 2.84 | 1.24 ± 0.19 |
| V4 (w/o spatial attention) | 81.53 ± 1.60 | 1.42 ± 0.12 |
| V5 (w/o spatial & class attention) | 81.03 ± 1.65 | 1.35 ± 0.13 |
| Proposed | 84.07 ± 1.51 | 1.18 ± 0.11 |

D. Attention Map Analysis

To further study the role of the proposed dual adversarial attention mechanism, we visually analyzed the generated attention maps. Figures 4 and 5 show representative attention maps calculated in the skull segmentation and cardiac substructure segmentation tasks, respectively. The higher the attention weight, the warmer the color. We can observe that the high weights in the spatial attention map are mainly distributed around the boundaries of segmentation targets. These regions tend to be misclassified and therefore deserve more attention in adaptation. On the other hand, the class attention maps largely cover the segmentation targets, which encourages the domain adaptation module to capture class-specific information from the region of interest for each class individually. Such class-specific knowledge not only preserves discriminability but also contains meaningful information that is beneficial to downstream tasks.

Fig. 4: Representative attention maps in skull segmentation.

Fig. 5: Representative attention maps in cardiac substructure segmentation.

E. Model Analysis

In this section, we further analyze several properties of the proposed method.

  1. Parameter Analysis: As shown in Algorithm 1, four hyper-parameters are involved in model training. We empirically set $\lambda_1 = 0.5$ and $\lambda_3 = 0.01$, and investigated the selection of $\lambda_2$ and $\lambda_4$ on the validation set of the cardiac substructure segmentation task. Specifically, we searched for the optimal $\lambda_2$ and $\lambda_4$ in $\{10^{-1}, 10^{-2}, 10^{-3}\}$. Figure 6 shows the grid search results, according to which we chose $\lambda_2 = 0.01$ and $\lambda_4 = 0.1$.

  2. Sample Complexity Analysis: We performed a sample complexity analysis by training the proposed method with only a fraction of the training samples. As shown in Figure 7, the skull segmentation performance of DADASeg-Net stabilizes at 40%–50% of the training samples. In contrast, the performance for cardiac segmentation keeps increasing as more training samples are used.

  3. Feature Visualization: To further study the effectiveness of DADASeg-Net in reducing the domain gap between different imaging modalities, we extracted the source and target domain features using $SE_S$ and $SE_T$, respectively, and visualized the feature distribution in 2D space with t-SNE [50]. As shown in Figure 8, the domain discrepancy between the source and target modalities is effectively reduced, and the distributions of features from different domains are gradually aligned during model training.
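
The grid search described under Parameter Analysis can be sketched as an exhaustive loop over the nine candidate pairs. The `evaluate` callable is hypothetical: in practice it would train a model with the given weights and return a validation score, whereas the stand-in below is a made-up function that merely peaks at the reported choice and does not reproduce any result from Figure 6.

```python
from itertools import product


def grid_search(evaluate, candidates=(1e-1, 1e-2, 1e-3)):
    """Score every (lambda2, lambda4) pair and return the best pair and all scores."""
    results = {
        (lambda2, lambda4): evaluate(lambda2, lambda4)
        for lambda2, lambda4 in product(candidates, repeat=2)
    }
    best = max(results, key=results.get)  # higher score is better
    return best, results


def stand_in_score(lambda2, lambda4):
    """Made-up score for illustration only; not a real validation metric."""
    return -((lambda2 - 0.01) ** 2 + (lambda4 - 0.1) ** 2)


best_pair, all_scores = grid_search(stand_in_score)
```

With three candidates per weight the search requires nine full training runs, which is why the other two weights were fixed empirically rather than searched.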

Fig. 6: Grid search results for hyper-parameters $\lambda_2$ and $\lambda_4$.

Fig. 7: Performance of DADASeg-Net on skull and cardiac segmentation tasks with respect to training samples.

Fig. 8: Evolution of source (red) and target (blue) domain features during model training, visualized using t-SNE.

V. CONCLUSION

In this work, we presented an unsupervised domain adaptive medical image segmentation framework. The major contribution that differentiates our work from others is its novel dual adversarial attention mechanism. Unlike the commonly used global alignment strategy [6]–[9], DADASeg-Net forces the domain adaptation module to focus on meaningful regions rather than treating all positions equally. This is realized by adaptively weighting the adversarial loss based on the proposed dual attention mechanism. Compared with prior methods with dual attention in spatial and channel dimensions [25], [32], DADASeg-Net aims to adaptively reduce the domain gap between the source and target domains by regularizing the domain adaptation module space-wise and class-wise. Specifically, the spatial attention module concentrates on spatial positions that tend to be misaligned in adaptation, judging based on the inconsistency of semantic features during image-level adaptation. The class attention module encourages the domain adaptation module to capture class-specific knowledge for more discriminative feature-level adaptation. The efficacy of DADASeg-Net is confirmed with experimental results based on challenging cross-modality skull and cardiac substructure segmentation tasks.

Acknowledgments

This work was supported in part by NIH/NIDCR grants DE021863, DE027251, and DE022676.

Contributor Information

Xu Chen, Department of Radiology and Biomedical Research Imaging Center (BRIC), University of North Carolina at Chapel Hill, NC 27599, USA.

Tianshu Kuang, Department of Oral and Maxillofacial Surgery, Houston Methodist Research Institute, TX 77030, USA.

Hannah Deng, Department of Oral and Maxillofacial Surgery, Houston Methodist Research Institute, TX 77030, USA.

Steve H. Fung, Department of Radiology, Houston Methodist Hospital, TX 77030, USA.

Jaime Gateno, Department of Oral and Maxillofacial Surgery, Houston Methodist Research Institute, TX 77030, USA; Department of Surgery (Oral and Maxillofacial Surgery), Weill Medical College, Cornell University, New York, NY, USA.

James J. Xia, Department of Oral and Maxillofacial Surgery, Houston Methodist Research Institute, TX 77030, USA; Department of Surgery (Oral and Maxillofacial Surgery), Weill Medical College, Cornell University, New York, NY, USA.

Pew-Thian Yap, Department of Radiology and Biomedical Research Imaging Center (BRIC), University of North Carolina at Chapel Hill, NC 27599, USA.

REFERENCES

  • [1].Chen X, Hu Y, Zhang Z, Wang B, Zhang L, Shi F, Chen X, and Jiang X, “A graph-based approach to automated eus image layer segmentation and abnormal region detection,” Neurocomputing, 2018. [Google Scholar]
  • [2].Cheng B, Liu M, Zhang D, Munsell BC, and Shen D, “Domain transfer learning for mci conversion prediction,” IEEE Transactions on Biomedical Engineering, vol. 62, no. 7, pp. 1805–1817, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [3].Huo Y, Xu Z, Moon H, Bao S, Assad A, Moyo TK, Savona MR, Abramson RG, and Landman BA, “Synseg-net: Synthetic segmentation without target modality ground truth,” IEEE Transactions on Medical Imaging, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [4].Zhang Z, Yang L, and Zheng Y, “Translating and segmenting multi-modal medical volumes with cycle-and shapeconsistency generative adversarial network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9242–9251. [Google Scholar]
  • [5].Li R, Cao W, Jiao Q, Wu S, and Wong H-S, “Simplified unsupervised image translation for semantic segmentation adaptation,” Pattern Recognition, vol. 105, p. 107343, 2020. [Google Scholar]
  • [6].Dou Q, Ouyang C, Chen C, Chen H, Glocker B, Zhuang X, and Heng P-A, “Pnp-adanet: Plug-and-play adversarial domain adaptation network at unpaired cross-modality cardiac segmentation,” IEEE Access, vol. 7, pp. 99065–99076, 2019. [Google Scholar]
  • [7]. Chen C, Dou Q, Chen H, Qin J, and Heng P, “Unsupervised bidirectional cross-modality adaptation via deeply synergistic image and feature alignment for medical image segmentation,” IEEE Transactions on Medical Imaging, 2020.
  • [8]. Dou Q, Ouyang C, Chen C, Chen H, and Heng P-A, “Unsupervised cross-modality domain adaptation of ConvNets for biomedical image segmentations with adversarial loss,” in Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018, pp. 691–697.
  • [9]. Chen X, Lian C, Wang L, Deng H, Kuang T, Fung SH, Gateno J, Shen D, Xia JJ, and Yap P-T, “Diverse data augmentation for learning image segmentation with cross-modality annotations,” Medical Image Analysis, vol. 71, p. 102060, 2021.
  • [10]. Xia R, Hu X, Lu J, Yang J, and Zong C, “Instance selection and instance weighting for cross-domain sentiment classification via PU learning,” in Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
  • [11]. Liu B, Lee WS, Yu PS, and Li X, “Partially supervised classification of text documents,” in ICML, vol. 2, no. 485. Sydney, NSW, 2002, pp. 387–394.
  • [12]. Gallego A-J, Calvo-Zaragoza J, and Fisher RB, “Incremental unsupervised domain-adversarial training of neural networks,” IEEE Transactions on Neural Networks and Learning Systems, 2020.
  • [13]. Arief-Ang IB, Salim FD, and Hamilton M, “DA-HOC: Semi-supervised domain adaptation for room occupancy prediction using CO2 sensor data,” in Proceedings of the 4th ACM International Conference on Systems for Energy-Efficient Built Environments, 2017, pp. 1–10.
  • [14]. Arief-Ang IB, Hamilton M, and Salim FD, “A scalable room occupancy prediction with transferable time series decomposition of CO2 sensor data,” ACM Transactions on Sensor Networks (TOSN), vol. 14, no. 3–4, pp. 1–28, 2018.
  • [15]. Hajiramezanali E, Dadaneh SZ, Karbalayghareh A, Zhou M, and Qian X, “Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data,” arXiv preprint arXiv:1810.09433, 2018.
  • [16]. Wen J, Zheng N, Yuan J, Gong Z, and Chen C, “Bayesian uncertainty matching for unsupervised domain adaptation,” arXiv preprint arXiv:1906.09693, 2019.
  • [17]. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, and Bengio Y, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
  • [18]. Chen X, Lian C, Wang L, Deng H, Fung SH, Nie D, Thung K-H, Yap P-T, Gateno J, Xia JJ et al., “One-shot generative adversarial learning for MRI segmentation of craniomaxillofacial bony structures,” IEEE Transactions on Medical Imaging, vol. 39, no. 3, pp. 787–796, 2019.
  • [19]. Chen X, Lian C, Wang L, Deng H, Kuang T, Fung S, Gateno J, Yap P-T, Xia JJ, and Shen D, “Anatomy-regularized representation learning for cross-modality medical image segmentation,” IEEE Transactions on Medical Imaging, vol. 40, no. 1, pp. 274–285, 2020.
  • [20]. Ouyang C, Kamnitsas K, Biffi C, Duan J, and Rueckert D, “Data efficient unsupervised domain adaptation for cross-modality image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 669–677.
  • [21]. Chen C, Zheng Z, Ding X, Huang Y, and Dou Q, “Harmonizing transferability and discriminability for adapting object detectors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8869–8878.
  • [22]. Zhu J-Y, Park T, Isola P, and Efros AA, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
  • [23]. Luo Y, Zheng L, Guan T, Yu J, and Yang Y, “Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2507–2516.
  • [24]. Sutskever I, Vinyals O, and Le QV, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
  • [25]. Chen L, Zhang H, Xiao J, Nie L, Shao J, Liu W, and Chua T-S, “SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5659–5667.
  • [26]. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, and Bengio Y, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning. PMLR, 2015, pp. 2048–2057.
  • [27]. Li L, Tang S, Deng L, Zhang Y, and Tian Q, “Image caption with global-local attention,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [28]. Liu M, Li L, Hu H, Guan W, and Tian J, “Image caption generation with dual attention mechanism,” Information Processing & Management, vol. 57, no. 2, p. 102178, 2020.
  • [29]. Huang Z, Wang X, Huang L, Huang C, Wei Y, and Liu W, “CCNet: Criss-cross attention for semantic segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 603–612.
  • [30]. Li H, Xiong P, An J, and Wang L, “Pyramid attention network for semantic segmentation,” arXiv preprint arXiv:1805.10180, 2018.
  • [31]. Li X, Zhong Z, Wu J, Yang Y, Lin Z, and Liu H, “Expectation-maximization attention networks for semantic segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9167–9176.
  • [32]. Fu J, Liu J, Tian H, Li Y, Bao Y, Fang Z, and Lu H, “Dual attention network for scene segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3146–3154.
  • [33]. Oliva A, Torralba A, Castelhano MS, and Henderson JM, “Top-down control of visual attention in object detection,” in Proceedings 2003 International Conference on Image Processing (Cat. No. 03CH37429), vol. 1. IEEE, 2003, pp. I–253.
  • [34]. Ying X, Wang Q, Li X, Yu M, Jiang H, Gao J, Liu Z, and Yu R, “Multi-attention object detection model in remote sensing images based on multi-scale,” IEEE Access, vol. 7, pp. 94508–94519, 2019.
  • [35]. Chen S, Tan X, Wang B, and Hu X, “Reverse attention for salient object detection,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 234–250.
  • [36]. Wang X, Girshick R, Gupta A, and He K, “Non-local neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
  • [37]. Zhao H, Zhang Y, Liu S, Shi J, Loy CC, Lin D, and Jia J, “PSANet: Point-wise spatial attention network for scene parsing,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 267–283.
  • [38]. Javanmardi M and Tasdizen T, “Domain adaptation for biomedical image segmentation using adversarial training,” in 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). IEEE, 2018, pp. 554–558.
  • [39]. Chen C, Xie W, Huang W, Rong Y, Ding X, Huang Y, Xu T, and Huang J, “Progressive feature alignment for unsupervised domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 627–636.
  • [40]. Mao X, Li Q, Xie H, Lau RY, Wang Z, and Paul Smolley S, “Least squares generative adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2794–2802.
  • [41]. He K, Zhang X, Ren S, and Sun J, “Identity mappings in deep residual networks,” in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
  • [42]. Ronneberger O, Fischer P, and Brox T, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
  • [43]. Wu Y and He K, “Group normalization,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.
  • [44]. Radford A, Metz L, and Chintala S, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv:1511.06434, 2015.
  • [45]. Kingma DP and Ba J, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, Bengio Y and LeCun Y, Eds., 2015. [Online]. Available: http://arxiv.org/abs/1412.6980
  • [46]. Chilamkurthy S, Ghosh R, Tanamala S, Biviji M, Campeau NG, Venugopal VK, Mahajan V, Rao P, and Warier P, “Deep learning algorithms for detection of critical findings in head CT scans: A retrospective study,” The Lancet, vol. 392, no. 10162, pp. 2388–2396, 2018.
  • [47]. Trzepacz PT, Yu P, Sun J, Schuh K, Case M, Witte MM, Hochstetler H, Hake A, Alzheimer’s Disease Neuroimaging Initiative et al., “Comparison of neuroimaging modalities for the prediction of conversion from mild cognitive impairment to Alzheimer’s dementia,” Neurobiology of Aging, vol. 35, no. 1, pp. 143–151, 2014.
  • [48]. Jenkinson M, Bannister P, Brady M, and Smith S, “Improved optimization for the robust and accurate linear registration and motion correction of brain images,” Neuroimage, vol. 17, no. 2, pp. 825–841, 2002.
  • [49]. Zhuang X and Shen J, “Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI,” Medical Image Analysis, vol. 31, pp. 77–87, 2016.
  • [50]. van der Maaten L and Hinton G, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 86, pp. 2579–2605, 2008.
