Abstract
Multimodal image registration (MIR) is a fundamental procedure in many image-guided therapies. Recently, unsupervised learning-based methods have demonstrated promising performance in terms of accuracy and efficiency for deformable image registration. However, the deformation fields estimated by existing methods rely solely on the to-be-registered image pair, making it difficult for the networks to be aware of mismatched boundaries and resulting in unsatisfactory organ boundary alignment. In this paper, we propose a novel multimodal registration framework that leverages the deformation fields estimated from both (i) the original to-be-registered image pair and (ii) their corresponding gradient intensity maps, and adaptively fuses them with the proposed gated fusion module. With the help of the auxiliary gradient-space guidance, the network can concentrate more on the spatial relationship of the organ boundaries. Experimental results on two clinically acquired CT-MRI datasets demonstrate the effectiveness of our proposed approach.
Index Terms— Multimodal image registration, gradient guidance, unsupervised registration
1. INTRODUCTION
Deformable image registration (DIR) is an essential procedure in various clinical applications such as radiotherapy, image-guided intervention and preoperative planning. Recent years have witnessed remarkable progress by deep learning-based registration methods in terms of accuracy and efficiency. In general, learning-based methods can be categorized into supervised [1, 2, 3] and unsupervised [4, 5, 6] settings. Among them, the unsupervised strategy has recently become dominant, since the supervised strategy struggles with the limited availability of ground-truth landmarks or deformation fields. Within the unsupervised learning-based pipeline, the network is trained under the guidance of the dissimilarity between the warped and fixed images. However, previous works mainly focus on unimodal DIR; multimodal DIR remains challenging because there is no functional intensity mapping across modalities and the choice of multimodal similarity metrics is limited.
To generalize to multimodal scenarios, some methods introduce multimodal similarity metrics, e.g., mutual information (MI) [7] and the modality independent neighborhood descriptor (MIND) [8]. However, such measures can lack spatial prior knowledge, yielding good global registration but relatively poor local boundary alignment. Another trend is to use Generative Adversarial Networks (GANs) [9] to convert the multimodal problem into a unimodal one [10, 11, 12]. However, image synthesis is a challenging task in itself: the training of GANs is time-consuming and difficult to control, and the synthetic features may introduce mismatches.
In this work, we propose a novel unsupervised multimodal image registration approach with auxiliary gradient guidance. Specifically, we exploit the gradient maps of the images, which highlight biological structural boundaries (shown in Fig. 1), to guide high-quality image registration. In particular, distinct from the typical mono-branch unsupervised image registration network, our approach leverages the deformation fields estimated from the following two branches: (i) an image registration branch that aligns the original moving and fixed images, and (ii) a gradient map registration branch that aligns the corresponding gradient intensity maps. In other words, the auxiliary gradient-space registration imposes second-order supervision on the image registration. The network then adaptively fuses the two fields via the proposed gated fusion module to achieve the best registration performance. The contributions and advantages of our approach can be summarized as follows:
- Our approach leverages the deformation fields estimated from the image registration branch and the gradient map registration branch to concentrate more on the organ boundary and achieve better registration accuracy.
- The proposed gated dual-branch fusion module learns to adaptively fuse the information from the two distinct branches, and can also be extended to other multi-branch fusion tasks.
Fig. 1.

Example CT and MR images and their gradient intensity maps.
- Quantitative and qualitative experimental results on two clinically acquired CT-MRI datasets demonstrate the effectiveness of our proposed approach.
2. METHODS
In this section, we first introduce the overall framework. We then present the details of the multimodal image registration branch, the auxiliary gradient map registration branch and the gated dual-branch fusion module in Sec. 2.2. The loss function of the framework is presented in Sec. 2.3.
2.1. Overview
The whole pipeline of our method is depicted in Fig. 2. Given a moving CT and a fixed MR, we first calculate the corresponding gradient maps gCT and gMR to obtain the auxiliary to-be-registered gradient intensity information. Then, the image registration branch estimates the primary deformation field ϕi for the CT and MR, while the gradient map registration branch takes gCT and gMR as inputs to produce the deformation field ϕg. On top of ϕi and ϕg, a gated dual-branch fusion module is proposed to adaptively fuse the estimated deformation fields into the output ϕig, which is followed by the Spatial Transformer Network (STN) [13]. It is noteworthy that the method is fully unsupervised, requiring neither ground-truth deformations nor segmentation labels.
Fig. 2.

Illustration of (a) our proposed framework and (b) the gated fusion module.
2.2. Dual-branch Image and Gradient Registration Networks
2.2.1. Image Registration Branch
The blue box in Fig. 2 illustrates the general image registration branch, whose inputs are the CT and MR. The CNN architecture follows that of VoxelMorph [4]. Specifically, the CT and MR are concatenated into a single 2-channel 3D input and downsampled by four 3 × 3 × 3 convolutions with a stride of 2 in the encoder. The decoder consists of several 32-filter convolutions and four upsampling operations, which restore the features to full resolution. Skip connections concatenate the corresponding encoder and decoder features. Finally, another four convolutions refine the 3-channel full-resolution deformation field ϕi.
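As a quick sanity check on the encoder geometry, the spatial size after each stride-2 convolution can be computed directly. The helper below is ours, not from the paper, and assumes 'same' padding so that every level halves each dimension:

```python
def encoder_shapes(shape, n_down=4):
    """Spatial size after each of the n_down stride-2 encoder convolutions,
    assuming 'same' padding so every level halves each dimension."""
    shapes = [tuple(shape)]
    for _ in range(n_down):
        shapes.append(tuple(max(1, s // 2) for s in shapes[-1]))
    return shapes
```

For the 144 × 80 × 256 subvolumes used in the experiments below, the bottleneck feature map would be 9 × 5 × 16.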
2.2.2. Gradient Map Registration Branch
The target of the gradient map registration branch is to estimate the auxiliary deformation field ϕg between the moving gradient map of the CT (gCT) and the fixed gradient map of the MR (gMR). Specifically, the gradient map of a 3D volume V can be easily obtained by computing the differences between adjacent voxels via a fixed convolutional layer. The operation can be formulated as follows:
G(V)(x) = √[(V(x+1, y, z) − V(x, y, z))² + (V(x, y+1, z) − V(x, y, z))² + (V(x, y, z+1) − V(x, y, z))²]    (1)
where the elements of G(V) are the gradient lengths at voxels with coordinates x = (x, y, z). In other words, we adopt the gradient intensity maps as the gradient maps without considering the gradient direction, since such maps are sufficient to delineate organ edge sharpness and can be regarded as another image modality. Thus, the registration process of the gradient map registration branch is equivalent to the spatial transformation between the CT edge sharpness and the MR edge sharpness. Furthermore, most areas of the gradient maps are close to zero, so the subnetwork can pay more attention to the spatial relationship of the organ contours. Therefore, this branch is more sensitive in capturing structural dependencies and can in turn serve as a second constraint that provides outline-alignment supervision to the image registration branch. The magnitude of the gradient outline deformation field ϕg implicitly reflects whether each to-be-registered region undergoes a large or slight local deformation.
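A minimal numpy sketch of this gradient-intensity computation follows. We use forward differences with edge replication; the paper implements the operation as a fixed convolutional layer, so the exact stencil here is an assumption:

```python
import numpy as np

def gradient_magnitude(vol):
    """Gradient-intensity map of a 3D volume: forward differences between
    adjacent voxels along each axis, keeping only the length of the
    gradient vector (direction is discarded). Edge voxels are replicated
    so the output has the same shape as the input."""
    dx = np.diff(vol, axis=0, append=vol[-1:, :, :])
    dy = np.diff(vol, axis=1, append=vol[:, -1:, :])
    dz = np.diff(vol, axis=2, append=vol[:, :, -1:])
    return np.sqrt(dx ** 2 + dy ** 2 + dz ** 2)
```

On a synthetic step volume, the map is nonzero only on the slice just before the intensity jump, which is exactly the "organ boundary" behaviour the branch exploits.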
In practice, the subnetwork of this branch adopts the same architecture as the image registration branch but takes the corresponding gradient maps as inputs, directly followed by an STN that applies ϕg to the moving gCT.
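For intuition, warping a volume by a dense displacement field can be sketched as follows. This is our simplification using nearest-neighbour resampling; the actual STN [13] uses differentiable trilinear interpolation so that gradients can flow through the warp:

```python
import numpy as np

def warp_nearest(vol, phi):
    """Nearest-neighbour warp of a 3D volume by a dense displacement field
    phi of shape (3, X, Y, Z): output(x) = vol(x + phi(x)).
    Out-of-bounds samples are clamped to the volume border."""
    X, Y, Z = vol.shape
    grid = np.indices((X, Y, Z)).astype(float)   # identity sampling grid
    coords = np.rint(grid + phi).astype(int)     # displaced coordinates
    cx = np.clip(coords[0], 0, X - 1)
    cy = np.clip(coords[1], 0, Y - 1)
    cz = np.clip(coords[2], 0, Z - 1)
    return vol[cx, cy, cz]
```

A zero field is the identity warp, and a constant +1 displacement along the first axis shifts content by one voxel (with border clamping).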
2.2.3. Gated Dual-branch Fusion Module
Since our main task is to register the original moving CT image to the fixed MR image, effectively fusing the complementary deformation fields ϕi and ϕg from the two branches plays an important role in the final registration. If not fused carefully, the combined deformation may instead degrade performance. For dual-branch deformation fusion, the most related approach [14] uses an average operation. A hard-coded average treats the two deformations equally, which may disregard the relative importance of each voxel in the two streams. Therefore, apart from the hard-coded operation, we further propose a gated dual-branch fusion mechanism that learns to adaptively fuse the two distinct deformations. As shown in Fig. 2 (b), the gated fusion first applies a 6-channel 3D convolution with sigmoid activation, obtaining a 6-channel gating attention weight matrix with values between 0 and 1. The matrix is then split into two separate gating weight maps go and gs. Next, two element-wise multiplications re-weight ϕi and ϕg as ϕ̃i = go ⊙ ϕi and ϕ̃g = gs ⊙ ϕg. Finally, ϕ̃i and ϕ̃g are concatenated and fed to a bottleneck 1 × 1 × 1 convolution with 3 output channels, which produces the final deformation field ϕig.
Intuitively, the learning-based fusion module is more adaptive than a hard-coded average operation. The proposed fusion mechanism can also be easily integrated into other multi-branch fusion networks.
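A numpy sketch of the gated fusion is given below. The paper does not state the kernel size of the gating convolution, so we simplify it to a 1 × 1 × 1 (per-voxel linear) layer; all weight names are ours:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(phi_i, phi_g, W, b, W_out, b_out):
    """Sketch of the gated dual-branch fusion.
    phi_i, phi_g : (3, X, Y, Z) deformation fields from the two branches.
    W (6, 6), b (6,)       : weights of the 6-channel gating 1x1x1 conv.
    W_out (3, 6), b_out (3,): weights of the 3-channel bottleneck 1x1x1 conv."""
    x = np.concatenate([phi_i, phi_g], axis=0)                  # (6, X, Y, Z)
    gates = sigmoid(np.einsum('oc,c...->o...', W, x)
                    + b[:, None, None, None])                   # values in (0, 1)
    g_o, g_s = gates[:3], gates[3:]                             # split gating maps
    fused = np.concatenate([g_o * phi_i, g_s * phi_g], axis=0)  # re-weighted fields
    return np.einsum('oc,c...->o...', W_out, fused) + b_out[:, None, None, None]
```

One appealing property: with zero gating weights every gate equals 0.5, and with a bottleneck that sums the two halves the module reproduces exactly the hard-coded average fusion, so the learned version is a strict generalization of [14].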
2.3. Loss Function
The loss function of previous mono-branch registration networks generally includes two terms: (i) the (dis)similarity between the warped image and the fixed image, and (ii) a regularization term for the deformation field. Distinct from the mono-branch networks, the loss function of our proposed dual-branch framework consists of four parts:
L = Lsim(CT ∘ ϕig, MR) + α · Lsim(gCT ∘ ϕg, gMR) + β · R(ϕig) + γ · R(ϕg)    (2)
where Lsim(CT ∘ ϕig, MR) represents the (dis)similarity between the warped CT image and the fixed MR image, while Lsim(gCT ∘ ϕg, gMR) measures that between the warped gCT and gMR. Likewise, the regularization terms R(ϕig) and R(ϕg) act on the deformation fields ϕig and ϕg respectively. α, β and γ indicate the relative importance of Lsim(gCT ∘ ϕg, gMR), R(ϕig) and R(ϕg). In the experiments, we adopt the L2-norm of the gradients of the deformation fields as the regularizer, as shown below:
R(ϕ) = Σx ‖∇ϕ(x)‖²    (3)
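The regularizer above can be sketched with finite differences. This is a minimal numpy version of a diffusion-style smoothness penalty (ours, averaged over voxels; the exact normalization in the paper is not stated):

```python
import numpy as np

def smoothness_loss(phi):
    """L2-norm of the spatial gradients of a deformation field
    phi with shape (3, X, Y, Z), approximated by forward differences
    and averaged over the number of voxels."""
    total = 0.0
    for axis in (1, 2, 3):              # the three spatial axes
        d = np.diff(phi, axis=axis)     # forward differences
        total += np.sum(d ** 2)
    return total / phi[0].size
```

A constant field costs nothing, while any spatial variation in the displacements is penalized quadratically, discouraging implausible folding of the deformation.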
The choice of similarity metrics for multimodal image registration is not as flexible as in unimodal tasks. Among the limited selection of multimodal metrics, the Modality Independent Neighborhood Descriptor (MIND) [8] has demonstrated promising performance in describing invariant structural representations across modalities. Specifically, MIND is defined as follows:
M(I, x, r) = (1/n) exp(−Dp(I, x, x + r) / V(I, x))    (4)
where M represents the MIND features, I is an image, x is a location in the image, r is a distance vector, n is a normalization constant, V(I, x) is an estimate of the local variance, and Dp(I, x, x + r) denotes the L2 distance between the two image patches of size p centered at x and x + r.
Given the warped CT image (CT ∘ ϕig), the fixed MR image, the warped gradient map gCT ∘ ϕg and the fixed gMR, we minimize the difference of their MIND features:
Lsim(A, B) = (1/N) Σx (1/|R|) Σr∈R |M(A, x, r) − M(B, x, r)|    (5)
where N denotes the number of voxels in input images, R is a non-local region around voxel x.
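To give a feel for the descriptor, here is a deliberately simplified numpy version: six-neighbourhood offsets, single-voxel "patches", per-voxel max-normalisation, and wrap-around boundaries. The paper's actual implementation follows Heinrich et al. [8] and differs in patch size, neighbourhood and boundary handling:

```python
import numpy as np

# Six-neighbourhood offsets (our simplified stand-in for the non-local region R).
OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def mind_features(vol, eps=1e-6):
    """Simplified MIND descriptor of a 3D volume, shape (6, X, Y, Z)."""
    dists = []
    for off in OFFSETS:
        shifted = np.roll(vol, shift=off, axis=(0, 1, 2))  # wrap-around shift
        dists.append((vol - shifted) ** 2)                 # 1-voxel patch distance Dp
    dists = np.stack(dists)                                # (6, X, Y, Z)
    variance = dists.mean(axis=0) + eps                    # local variance estimate V
    feats = np.exp(-dists / variance)
    return feats / feats.max(axis=0, keepdims=True)        # per-voxel normalisation

def mind_loss(a, b):
    """Mean absolute difference of MIND features, in the spirit of Eq. (5)."""
    return np.mean(np.abs(mind_features(a) - mind_features(b)))
```

Because the exponent is normalised by a local variance estimate, the descriptor is nearly invariant to affine intensity remappings of the input, which is what makes it usable as a cross-modality similarity.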
3. EXPERIMENTS
3.1. Implementation Details
3.1.1. Datasets and Evaluation Metrics
Under an IRB-approved study, we obtained two intra-subject CT-MR datasets containing paired CT and MR images.
1). Pig Ex-vivo Kidney Dataset.
This dataset contains 18 pairs of ex-vivo kidney CT-MR scans with segmentation labels for evaluation. We carried out standard preprocessing steps, including resampling, spatial normalization and cropping, for each scan. The images were processed into 144 × 80 × 256 subvolumes with 1 mm isotropic voxels and divided into a training group (13 cases) and a testing group (5 cases).
2). Abdomen Dataset.
This intra-patient CT-MR dataset of 50 pairs was collected from a local hospital and annotated with segmentations of the liver, kidneys and spleen. Similarly, the 3D images were preprocessed with the standard steps and cropped into 144 × 144 × 128 subvolumes at the same resolution (1 mm isotropic). The dataset was divided into a training group (40 cases) and a testing group (10 cases).
3.1.2. Training Strategies
The proposed framework was implemented in Keras with the TensorFlow backend and trained on an NVIDIA Titan X (Pascal) GPU. For both datasets, we adopted the Adam optimizer with a learning rate of 1e-5 and a batch size of 1. α, β and γ were set to 0.4, 0.8 and 0.3 for the pig kidney dataset, and 0.5, 1 and 0.5 for the abdomen dataset.
3.2. Experimental Results
3.2.1. Quantitative Comparison
Within our framework, two fusion mechanisms (denoted Ours(average) and Ours(gated)) were compared. In addition, we compared our approaches with the top-performing conventional method SyN [15] and two VoxelMorph-based [4] mono-branch unsupervised networks (denoted VM and VM(concat)), both using the MIND-based similarity metric. Specifically, VM concatenates the moving and fixed images as a 2-channel input, while VM(concat) additionally concatenates their gradient maps to form a 4-channel input.
We quantitatively evaluated our method with two criteria: the Dice score of the organ segmentation masks (DSC) and the average surface distance (ASD). The results are presented in Table 1. For the pig ex-vivo kidney dataset, the intensity distributions of the ex-vivo organ scans are relatively simple. Both variants of our method slightly outperform the other baselines in terms of DSC, and the performance of the two fusion mechanisms is close. Turning to the more complex abdomen dataset, our approaches achieve significantly higher DSC and lower ASD than the conventional and the other learning-based approaches. Not surprisingly, VM(concat) shows a slight improvement over VM, which demonstrates that gradient-space information provides useful cues for image alignment. With the adaptive dual-branch learning fashion, image registration can further benefit from the gradient information. In particular, the gated fusion mechanism yields larger improvements than the average fusion for most organs. Furthermore, the conventional approach SyN takes more than 5 minutes to estimate a transformation, while all the learning-based methods take less than a second on a GPU.
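The DSC criterion used above is straightforward to compute from binary masks. A minimal numpy sketch follows (ASD additionally requires extracting surface voxels and computing point-to-surface distances, which we omit here):

```python
import numpy as np

def dice_score(seg_a, seg_b):
    """Dice similarity coefficient between two binary segmentation masks:
    2 * |A ∩ B| / (|A| + |B|). Returns 1.0 when both masks are empty."""
    a = seg_a.astype(bool)
    b = seg_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0
    return 2.0 * np.logical_and(a, b).sum() / denom
```

For example, a mask that covers only half of a reference mask of twice its size scores 2/3, matching the intuition that Dice rewards overlap relative to the combined mask sizes.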
Table 1.
Comparison of DSC and ASD for the different methods. The first data column reports Dice on the pig kidney dataset; the remaining columns report Dice and ASD on the abdomen dataset. The best and second-best results are shown in bold and italics respectively.

| Method | Kidney Dice(%) (Pig) | Liver Dice(%) | Spleen Dice(%) | Kidney Dice(%) | Liver ASD(mm) | Spleen ASD(mm) | Kidney ASD(mm) |
|---|---|---|---|---|---|---|---|
| Moving | 84.29 | 77.18 | 78.24 | 80.14 | 4.95 | 1.97 | 2.01 |
| SyN(MI) | 85.17 | 79.18 | 80.21 | 82.91 | 4.81 | 1.54 | 1.92 |
| VM | 91.24 | 84.17 | 82.76 | 83.51 | 3.92 | 1.47 | 1.72 |
| VM(concat) | 91.51 | 85.72 | 84.23 | 83.96 | 3.03 | *1.27* | 1.44 |
| Ours(average) | *92.09* | *86.25* | *85.03* | *84.39* | *2.94* | **1.22** | *1.37* |
| Ours(gated) | **92.13** | **87.66** | **86.83** | **85.07** | **2.73** | 1.31 | **1.25** |
3.2.2. Qualitative Comparison
Fig. 3 visualizes a registration example from the abdomen dataset using our framework with the gated fusion module. In abdominal scans, livers with tumors usually undergo large local deformations due to disease progression, patient motion and insufflation during surgery, while the deformations of the surrounding organs are less obvious.
Fig. 3.

Example registration results within our framework.
Given the original to-be-registered images and their corresponding gradient maps, the dual-branch network not only effectively registers the CT image to the MR image, but also aligns the gradient intensity maps in the gradient map registration branch. The zoomed-in images show that the liver and its gradient outlines are more accurately aligned.
Following the example in Fig. 3, we visually compare our proposed method with the aforementioned baselines in Fig. 4. Consistent with the quantitative results, our methods with the two fusion mechanisms both achieve more accurate organ boundary alignment, especially in the abdominal case. The visual comparison further confirms that our proposed approach indeed benefits from the gradient-space alignment, and that the auxiliary gradient branch helps the network stay sensitive to hard-to-align local regions by considering the geometric organ structures.
Fig. 4.

Qualitative comparison with other methods. The organ segmentations are depicted as outlines (pig kidney dataset) and masks (abdomen dataset).
4. CONCLUSION
In this paper, we propose a novel unsupervised multimodal image registration method with auxiliary gradient guidance to further improve organ boundary alignment. Distinct from the typical mono-branch unsupervised image registration network, our approach leverages not only the original to-be-registered images but also their corresponding gradient intensity maps in a dual-branch adaptive registration fashion. Quantitative and qualitative experimental results on two clinically acquired CT-MR datasets demonstrate the effectiveness of our proposed approach.
5. COMPLIANCE WITH ETHICAL STANDARDS
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.
6. ACKNOWLEDGEMENTS
This project was supported by the National Institutes of Health (Grant No. R01EB025964, R01DK119269, and P41EB015898), the National Key R&D Program of China (No. 2020AAA0108303), NSFC 41876098 and the Overseas Cooperation Research Fund of Tsinghua Shenzhen International Graduate School (Grant No. HW2018008).
7. REFERENCES
- [1]. Lv Jun, Yang Ming, Zhang Jue, and Wang Xiaoying, "Respiratory motion correction for free-breathing 3d abdominal mri using cnn-based image registration: a feasibility study," The British Journal of Radiology, vol. 91, no. 20170788, 2018.
- [2]. Hu Yipeng, Modat Marc, Gibson Eli, Li Wenqi, Ghavami Nooshin, Bonmati Ester, Wang Guotai, Bandula Steven, Moore Caroline M, Emberton Mark, et al., "Weakly-supervised convolutional neural networks for multimodal image registration," Medical Image Analysis, vol. 49, pp. 1–13, 2018.
- [3]. Hu Xiaojun, Kang Miao, Huang Weilin, Scott Matthew R, Wiest Roland, and Reyes Mauricio, "Dual-stream pyramid registration network," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 382–390.
- [4]. Balakrishnan Guha, Zhao Amy, Sabuncu Mert R, Guttag John, and Dalca Adrian V, "An unsupervised learning model for deformable medical image registration," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9252–9260.
- [5]. Xu Zhe, Luo Jie, Yan Jiangpeng, Li Xiu, and Jayender Jagadeesan, "F3rnet: Full-resolution residual registration network for multimodal image registration," arXiv preprint arXiv:2009.07151, 2020.
- [6]. Zhao Shengyu, Lau Tingfung, Luo Ji, Chang Eric I-Chao, and Xu Yan, "Unsupervised 3d end-to-end medical image registration with volume tweening network," IEEE Journal of Biomedical and Health Informatics, 2019.
- [7]. Wells William M III, Viola Paul, Atsumi Hideki, Nakajima Shin, and Kikinis Ron, "Multi-modal volume registration by maximization of mutual information," Medical Image Analysis, vol. 1, no. 1, pp. 35–51, 1996.
- [8]. Heinrich Mattias P, Jenkinson Mark, Bhushan Manav, Matin Tahreema, Gleeson Fergus V, Brady Michael, and Schnabel Julia A, "Mind: Modality independent neighbourhood descriptor for multi-modal deformable registration," Medical Image Analysis, vol. 16, no. 7, pp. 1423–1435, 2012.
- [9]. Goodfellow Ian, Pouget-Abadie Jean, Mirza Mehdi, Xu Bing, Warde-Farley David, Ozair Sherjil, Courville Aaron, and Bengio Yoshua, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
- [10]. Fan Jingfan, Cao Xiaohuan, Wang Qian, Yap Pew-Thian, and Shen Dinggang, "Adversarial learning for mono- or multi-modal registration," Medical Image Analysis, vol. 58, pp. 101545, 2019.
- [11]. Qin Chen, Shi Bibo, Liao Rui, Mansi Tommaso, Rueckert Daniel, and Kamen Ali, "Unsupervised deformable registration for multi-modal images via disentangled representations," in International Conference on Information Processing in Medical Imaging. Springer, 2019, pp. 249–261.
- [12]. Xu Zhe, Luo Jie, Yan Jiangpeng, Pulya Ritvik, Li Xiu, Wells William III, and Jagadeesan Jayender, "Adversarial uni- and multi-modal stream networks for multimodal image registration," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2020, pp. 222–232.
- [13]. Jaderberg Max, Simonyan Karen, Zisserman Andrew, et al., "Spatial transformer networks," in Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.
- [14]. Wei Dongming, Ahmad Sahar, Huo Jiayu, Peng Wen, Ge Yunhao, Xue Zhong, Yap Pew-Thian, Li Wentao, Shen Dinggang, and Wang Qian, "Synthesis and inpainting-based mr-ct registration for image-guided thermal ablation of liver tumors," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 512–520.
- [15]. Avants Brian B, Epstein Charles L, Grossman Murray, and Gee James C, "Symmetric diffeomorphic image registration with cross-correlation: Evaluating automated labeling of elderly and neurodegenerative brain," Medical Image Analysis, vol. 12, no. 1, pp. 26–41, 2008.
