PLOS ONE. 2022 Nov 28;17(11):e0276196. doi: 10.1371/journal.pone.0276196

Is image-to-image translation the panacea for multimodal image registration? A comparative study

Jiahao Lu 1,2, Johan Öfverstedt 1, Joakim Lindblad 1,*, Nataša Sladoje 1
Editor: Dzung Pham
PMCID: PMC9704666  PMID: 36441754

Abstract

Despite recent advances in the field of biomedical image processing, propelled by the deep learning revolution, multimodal image registration, due to its several challenges, is still often performed manually by specialists. The recent success of image-to-image (I2I) translation in computer vision applications and its growing use in biomedical areas provide a tempting possibility of transforming the multimodal registration problem into a, potentially easier, monomodal one. We conduct an empirical study of the applicability of modern I2I translation methods for the task of rigid registration of multimodal biomedical and medical 2D and 3D images. We compare the performance of four Generative Adversarial Network (GAN)-based I2I translation methods and one contrastive representation learning method, subsequently combined with two representative monomodal registration methods, to judge the effectiveness of modality translation for multimodal image registration. We evaluate these method combinations on four publicly available multimodal (2D and 3D) datasets and compare with the performance of registration achieved by several well-known approaches acting directly on multimodal image data. Our results suggest that, although I2I translation may be helpful when the modalities to register are clearly correlated, registration of modalities which express distinctly different properties of the sample is not well handled by the I2I translation approach. The evaluated representation learning method, which aims to find abstract image-like representations of the information shared between the modalities, manages better, and so does the Mutual Information maximisation approach, acting directly on the original multimodal images. We share our complete experimental setup as open-source (https://github.com/MIDA-group/MultiRegEval), including method implementations, evaluation code, and all datasets, for further reproducing and benchmarking.

1 Introduction

We are witnessing a growing popularity of approaches which combine information from different imaging modalities [1] to maximise extracted information about an object of interest, particularly within life sciences. Different imaging techniques offer complementary information about structure, function, dynamics, and molecular composition of a sample. To efficiently utilise and fuse such heterogeneous information, acquired images have to be spatially aligned, a process known as image registration.

Numerous methods have been proposed for automatic monomodal (intramodal) registration [2–4], typically based on comparison of intensity patterns in images (intensity-based approaches), or on finding correspondence between characteristic details in the images (feature-based methods). However, multimodal (intermodal) image registration is much more challenging. In cases when image appearances differ significantly between modalities, e.g., due to different underlying physical principles of imaging, assessing image similarity and finding correspondences across modalities are, in general, very difficult tasks.

Multimodal registration is often required and performed in medical and satellite imagery, where valuable information is in many situations acquired using several different sensors [5]. Due to the inherent discrepancy in intensity values in images acquired by different modalities, it is difficult to define generally applicable methods. Existing approaches are, therefore, often specific to a particular application (e.g., relying on anatomical properties of specific organs), and typically restricted to only a few combinations of modalities. Applications in biomedical image analysis bring in further challenges: (i) there exists a much greater variety of modalities (e.g., tens of different types of microscopies), (ii) specimens exhibit much wider variety, (iii) imaged objects (e.g., tissue) often lack distinctive structures (as opposed to, e.g., organs) which could be used for establishing correspondences, (iv) acquired data often consist of very large images (easily reaching TB-size), necessitating fast methods with low memory requirements and also limiting the feasibility of high-quality manual annotation. Direct application of conventional methods is most often not sufficient, even in relatively simple scenarios. As a consequence, biomedical image registration is nowadays still often performed manually by specialists [6].

Deep learning-based methods have, since their renaissance, revolutionised the field of computer vision. Significant results with this type of methods have also been reached in the field of medical image registration, as summarised in recent surveys [6–9]. The main considered approaches include learning of similarity measures, direct prediction of the transformation, and image-to-image (I2I) translation approaches. Methods based on direct prediction of the transformation have so far mostly provided an improvement of the run-time, while exhibiting registration performance that is (at best) on par with traditional methods [10, 11]. A related, but different recent method for multimodal rigid registration [12] uses a large neural network as a feature extractor, training a network to recover a transformation from a single image. This method, however, fails in case a canonical space does not exist, which holds for a majority of the imaging scenarios considered in this study. Generative Adversarial Network (GAN)-based methods have shown impressive performance in I2I translation tasks, primarily on natural images [13–17]. They offer the possibility of converting multimodal registration into a, presumably less challenging, monomodal problem which can be addressed by a wide range of monomodal registration methods; this idea is illustrated in Fig 1. This concept has shown promise in many multimodal registration tasks, even before the GAN-based methods became a standard tool [18–20]. GAN-based approaches are receiving increasing attention in medical image registration [7, 8, 21], and the progression to bioimage analysis seems to be a natural step. A recent survey [22] overviews several usage scenarios of GANs in the field of biomedical image analysis, primarily targeting data generation aiming to overcome challenges of annotation in digital pathology. However, quantitative results related to image registration, and particularly, in multimodal settings on multiple datasets, are still lacking. The potential of GANs in these scenarios is still to be explored, as pointed out in a recent overview of the application of deep learning in bioimage analysis [23].

Fig 1. Method overview.

Fig 1

Direct multimodal registration of distinctly different modalities is a challenging task (middle path). In this study, we evaluate if, instead, performing I2I translation of one modality to another may lead to a simpler monomodal registration problem (either of the two peripheral paths).

To fill this gap, and to facilitate knowledge transfer from the domain of multimodal natural image registration, we present an empirical study of the applicability of modern I2I translation methods for the task of multimodal medical and biomedical image registration. We compare the performance of four GAN-based methods, combined subsequently with two representative monomodal registration methods, to judge the effectiveness of modality translation on rigid registration of medical and biomedical images. We evaluate these method combinations on four publicly available multimodal datasets. To explore the effects of different types of data and their complexity, w.r.t. the task of registration, we include one relatively easy non-biomedical dataset, one Cytological dataset of medium difficulty, and one more challenging Histological dataset, all in 2D. Furthermore, we observe a 3D medical Radiological dataset containing T1- and T2-weighted MR images of brains. To establish reference performance, we include one classic intensity-based multimodal registration method, Mutual Information (MI) maximisation [24, 25], as well as two popular more recent multimodal approaches, Modality Independent Neighbourhood Descriptor (MIND) [26] and maximisation of similarity of Normalised Gradient Fields (NGF) [27]. On the Histological dataset, we also compare with a recent task-specific method [28] developed particularly for that type of data. In addition, we evaluate one recently proposed deep learning-based method for multimodal registration which uses a different approach to transform multimodal registration into a monomodal case [29]. We apply and evaluate this method on 3D (medical) data, which has not been done previously. We share our complete experimental setup as open-source, including method implementations, evaluation code, and all datasets to facilitate further reproducing and benchmarking. The framework can easily be extended by including additional methods, and can support further development and evaluation of approaches for registration of 2D and 3D medical and biomedical images.

2 Background and related work

Image registration methodologies can roughly be classified into intensity-based and feature-based methods, however with no strict border between the two; several modern approaches estimate and align more or less dense feature fields, aiming to capture good properties of both intensity- and feature-based methods. Learning-based approaches may, to different extents, be utilised for directly solving the overall registration task (end-to-end), or for addressing different sub-tasks.

Intensity-based approaches typically formulate image registration as an optimisation problem, for which an image similarity measure (or inversely stated, a distance measure) is used to assess the quality of the alignment. Common similarity measures include negative Sum of Squared Differences (SSD) and Cross-Correlation (CC) for monomodal registration; Mutual Information (MI) [24] and Normalised Mutual Information [30] for both monomodal and multimodal registration; as well as measures which incorporate relevant additional information (e.g., gradient information or spatial displacement) extracted from the images [31, 32].

Several approaches for learning a suitable similarity measure from data have been proposed. A method to implicitly learn a multimodal similarity measure for the relatively easy multimodal scenario of registering T1 and T2-weighted Magnetic Resonance (MR) images is presented in [33], whereas [34] suggests training a Deep Convolutional Neural Network (DCNN) to predict correspondence of Computed Tomography and MR image patches and then use the classification score as a similarity measure.

Intensity-based approaches, which rely on iterative local optimisation of an image similarity measure, are often relatively slow [8], and the existence of many local optima in the search space reduces applicability when the displacement between images to be registered is large [32]. Several learning-based approaches have been proposed to speed up the optimisation task; representative examples include VoxelMorph [35], where an encoder-decoder network is used to directly predict a deformation field which maximises an image similarity measure, and Quicksilver [33], where a deep encoder-decoder network is used for patch-wise prediction of the momentum-parameterisation of a Large Deformation Diffeomorphic Metric Mapping model.

Feature-based methods detect salient structures in the images and find correspondence between them, to guide the registration. Popular methods for feature point detection and description include SIFT [36] and ORB [37]. Learning may be utilised to suitably adapt feature-based registration to data. Pioneering unsupervised approaches for learning features for image registration are presented in [38] (for MR data) and [39] (for natural scenes).

Compared to the iterative optimisation process of intensity-based methods, the feature-based approaches can be much faster and also more robust to variations in brightness and contrast, making this type of methods popular for computer vision and video tracking tasks. However, in a multimodal setting, images typically convey complementary information; features that are present in one modality may be absent in another. Additionally, feature-based methods struggle with the general lack of salient structures in biomedical datasets. These shortcomings reduce the applicability of feature-based registration methods in multimodal biomedical scenarios.

Modality Independent Neighbourhood Descriptor (MIND) [26] combines feature- and intensity-based notions and extracts a dense set of multi-dimensional descriptors of each of the images of different modalities, utilising the concept of self-similarity. A standard similarity-based (monomodal) registration framework is then used with a standard monomodal similarity measure to establish correspondence between the generated representations.

Common representation of the images acquired by different modalities can also be learned; successful approaches include contrastive representation learning [29] and representation disentanglement [21]. Monomodal registration methods can then be applied to the resulting representations, as a suitable “common ground” between the modalities.

I2I translation (also known as image style transfer) refers to mapping images from a source domain to a target domain, while preserving important content properties of the source image. I2I translation can be used to translate images from one modality into images appearing as if acquired in another modality. This makes it possible to transform the multimodal registration problem into a, possibly easier, monomodal one, as illustrated in Fig 1. Prior to the tide of deep learning, I2I translation, also referred to as image synthesis, had been applied tentatively to multimodal registration problems [18, 40]. Despite promising early results, the full capability of this approach was limited by the available synthesis algorithms [19, 20]. I2I translation methods based on GANs [41] have, in recent years, shown impressive results [13–17].

I2I translation has found broad usage in biomedical areas [22], including virtual staining [42], stain style transfer [43], immunofluorescence marker inference [44], medical diagnosis [45, 46], and surgical training [47]. Usage for 3D medical images is reported for 3D brain tumour segmentation [48], and MRI modality generation [49] (which is conducted on the 2D slices of 3D volumes). In the recent CrossMoDA 2021 challenge, which involved domain adaptation of different MR modalities for tumour segmentation, methods based on I2I translation were the top performers [50]. A recent review [51], however, concludes that there is still a lack of 3D I2I translation methods and datasets suitable for their training. Even though studies utilising I2I translation in multimodal registration exist, proposing to either directly align the translated images [52–54] or indirectly incorporate the dissimilarity of modalities as a loss term in GANs [55, 56], these works are mostly application-dependent due to the high variability of GANs’ output. The full potential of I2I translation methods, and in particular their application in biomedical multimodal image registration, is yet to be explored, as also pointed out by recent overviews [57, 58].

Unlike registration of medical images, e.g., brain [21, 59], cardiac [60], or whole-body [61], where methods can rely on the outline of shapes (organs), biomedical images rarely convey shape outlines, making automatic registration much more difficult. One recent example where the biological structural content has been used to align microscopy images is presented in [28]; this highly specialised method for registration of Second Harmonic Generation (SHG) and Brightfield (BF) images relies on a priori knowledge of the sample properties and the segmented image content.

3 Considered methods

We select four state-of-the-art GAN-based I2I translation methods and one representation learning method designed specifically for the task of multimodal image registration. We subsequently combine these five methods with two monomodal registration approaches (one feature-based and one intensity-based). We compare the performance of these combinations with four selected reference registration methods, applicable directly to multimodal data.

3.1 Reference methods

To provide reference performance, we include three general multimodal image registration methods, as well as one special-purpose approach in our evaluation.

3.1.1 Maximisation of Mutual Information

The Mutual Information between images A and B is defined as

I(A, B) = \sum_{a \in \mathcal{A}} \sum_{b \in \mathcal{B}} p_{AB}(a, b) \log \frac{p_{AB}(a, b)}{p_A(a)\, p_B(b)}    (1)

where \mathcal{A} and \mathcal{B} denote the ranges of images A and B, respectively, p_A(a) denotes the marginal probability of value a occurring in image A, and p_AB(a, b) denotes the joint probability of values a and b co-occurring in overlapping points of A and B. The probability mass functions p_A, p_B, and p_AB can be estimated empirically as the normalised frequencies of image intensities and pairwise image intensities.

Intuitively, MI measures the information that is (pixelwise) shared between the images, and is high when the images A and B are well aligned. Maximisation of MI is widely used and has shown good performance in a number of multimodal registration tasks.
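
For illustration, the following minimal Python sketch (not part of the released evaluation code; the actual experiments use SimpleElastix, see Sec. 5.2.1) estimates MI according to Eq (1) from a 32-bin joint histogram of two greyscale images:

```python
import numpy as np

def mutual_information(img_a, img_b, bins=32):
    """Estimate I(A, B) of Eq (1) from the joint histogram of two
    equally-sized greyscale images."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    p_ab = joint / joint.sum()                 # joint probability p_AB(a, b)
    p_a = p_ab.sum(axis=1, keepdims=True)      # marginal p_A(a)
    p_b = p_ab.sum(axis=0, keepdims=True)      # marginal p_B(b)
    nonzero = p_ab > 0                         # avoid log(0)
    return float(np.sum(p_ab[nonzero] *
                        np.log(p_ab[nonzero] / (p_a @ p_b)[nonzero])))
```

In a registration setting, this value is maximised over the transformation parameters by the chosen optimiser.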

3.1.2 Maximisation of Similarity of Normalised Gradient Fields [27]

Registration based on Normalised Gradient Fields (NGF) relies on the assumption that parts of images are in correspondence (i.e., similar) if the orientation of changes (gradients) of the intensities at the corresponding locations in the two images is the same. Furthermore, in multimodal scenarios, the sign of the orientation is rarely of interest, since a single part of a specimen may have e.g. inverted intensities in two modalities. NGF-based registration is performed through maximisation of the squared dot-product of the normalised gradients of overlapping pixels. This objective can be optimised with gradient-based methods, is computed very efficiently, and has been shown to perform well on medical images.
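
A minimal sketch of the NGF similarity for two 2D images, assuming a small regularisation constant eps for flat regions (the AIRLab implementation used in Sec. 5.2.2 differs in details):

```python
import numpy as np

def ngf_similarity(img_a, img_b, eps=1e-2):
    """Sum over pixels of the squared dot product of normalised gradients."""
    ga = np.stack(np.gradient(img_a.astype(float)))      # shape (2, H, W)
    gb = np.stack(np.gradient(img_b.astype(float)))
    na = ga / np.sqrt((ga ** 2).sum(axis=0) + eps ** 2)  # normalised gradient field of A
    nb = gb / np.sqrt((gb ** 2).sum(axis=0) + eps ** 2)  # normalised gradient field of B
    return float((((na * nb).sum(axis=0)) ** 2).sum())
```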

3.1.3 MIND [26]

Modality Independent Neighbourhood Descriptors (MIND) are multi-spectral structural similarity representations based on self-similarity of small patches in a single image, computed in local neighbourhoods around the centre of each patch. The main assumption behind MIND is that the degree of self-similarity is similar across modalities, implying that the descriptors are similar as well, enabling the usage of optimisation methods suitable mainly for monomodal scenarios. The number of dimensions of the descriptors is equal to the number of displacements considered for each local patch; this number is selected by the user. The sizes of the local patches are determined by the standard deviation of the Gaussian kernel, σ, used to assign weights to the voxels of each patch.
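
A simplified 2D sketch of the descriptor computation, loosely following the above description (a 4-neighbourhood of unit displacements and Gaussian-weighted patch SSD); the evaluated implementation (Sec. 5.2.3) may differ in normalisation details:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mind_descriptor(img, sigma=0.5,
                    displacements=((0, 1), (0, -1), (1, 0), (-1, 0))):
    """One descriptor channel per displacement: Gaussian-weighted patch SSD
    between the image and a shifted copy, mapped to a similarity via exp."""
    img = np.asarray(img, dtype=float)
    ssd = np.stack([gaussian_filter((img - np.roll(img, d, axis=(0, 1))) ** 2,
                                    sigma) for d in displacements], axis=-1)
    variance = ssd.mean(axis=-1, keepdims=True) + 1e-8   # local variance estimate
    desc = np.exp(-ssd / variance)
    return desc / desc.max(axis=-1, keepdims=True)       # per-pixel normalisation
```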

3.1.4 CurveAlign

CurveAlign is the first automatic registration method designed specifically to register BF and SHG images of whole tissue micro-array (TMA) cores [28]. The registration is based on the segmentation of the collagen structures in BF images, which are then utilised to guide the alignment with the corresponding structures in the SHG images, by maximising MI between the modalities. We include this method in the evaluation performed on the considered Histological dataset (Sec. 4.3).

3.1.5 VoxelMorph

VoxelMorph is a recently proposed learning-based framework for deformable registration [35], which provides fast end-to-end transformation prediction. In our considered rigid multimodal registration scenario, its output deformation field is replaced by a transformation matrix and its smoothing term used to penalise local spatial variations is set to 0. MI is used as the loss function. However, VoxelMorph consistently under-performed in our rigid registration task (a similar observation is made in [62]) and we decided not to include the related results, for clarity.

3.2 Modality translation methods

We introduce four I2I-translation methods and one representation learning approach, which are the focus of this study. The methods are selected considering their popularity in the community, generalisability to different applications, performance in general, diversity in terms of supervision, scalability to more modalities, and ease of usage.

3.2.1 pix2pix [13]

pix2pix provides a general-purpose solution for I2I translation utilising conditional GANs (cGANs) [63], which use aligned image pairs during training. pix2pix consists of a generator G and a discriminator D which are trained in an adversarial manner: the generator G is trained to generate fake images that cannot be distinguished from real ones by D, and D is trained to best separate the fake and the real images. The objective function is:

G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G),    (2)

where the cGAN loss L_cGAN is to be minimised by G and maximised by D during training, and the pixel-wise regression loss L_L1 aims to make the output images closer to the ground truth. Based on the assumption that structures should be roughly aligned between the input and output images, the architecture of the generator G is designed in a “U-Net” style [64].
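
A minimal PyTorch sketch of the generator side of Eq (2), assuming the discriminator outputs logits for the generated (A, G(A)) pairs; the weight λ = 100 is the commonly used default and may differ from the evaluated configuration:

```python
import torch
import torch.nn.functional as F

def pix2pix_generator_loss(d_fake_logits, fake_b, real_b, lam=100.0):
    """L_cGAN (generator side) + lambda * L_L1 of Eq (2)."""
    adv = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))   # try to fool D
    l1 = F.l1_loss(fake_b, real_b)                        # stay close to ground truth B
    return adv + lam * l1
```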

pix2pix is included in our evaluation not only because of its pioneering role in the field, but also because it is regarded as a strong baseline framework [65, 66].

3.2.2 CycleGAN [14]

Using a cycle-consistency constraint has been shown to be an effective approach to overcome the need for aligned image pairs during training [67]. The Cycle-consistent adversarial network (CycleGAN) is a well-known representative of this approach [68, 69]. A modified version has demonstrated satisfactory performance on the biomedical task of robust tissue segmentation [70] through stain transformation. Further modifications have also shown success in other biomedical applications [46, 47]. The option of unsupervised training of CycleGANs makes them very attractive in biomedical applications.

The main idea behind CycleGANs is to enforce the image translation to be “cycle-consistent”, i.e., if an image is translated from domain X to Y and then translated back from Y to X, the output should be the same as the original image. To achieve this property, another generator F: Y → X is introduced to complement G: X → Y, such that G and F are inverse mappings of each other. The loss function combines two kinds of terms: the adversarial loss L_GAN, which encourages the mapped domain to be close enough to the target [41], and the cycle-consistency loss L_cyc, which enforces the invertibility of the mapping.
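
A minimal PyTorch sketch of the cycle-consistency term (the backward generator is named F_yx here to avoid clashing with the functional module); the full CycleGAN objective adds the adversarial terms for both directions:

```python
import torch.nn.functional as F

def cycle_consistency_loss(real_x, real_y, G_xy, F_yx):
    """L_cyc: X -> Y -> X and Y -> X -> Y should reproduce the inputs."""
    return (F.l1_loss(F_yx(G_xy(real_x)), real_x) +
            F.l1_loss(G_xy(F_yx(real_y)), real_y))
```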

3.2.3 DRIT++ [17]

Aiming for more diverse output from a single input, DRIT++ is a recently proposed GAN of higher complexity which performs unsupervised I2I translation via disentangled representations. More specifically, DRIT++ learns to encode the input image into a domain-invariant content representation and a domain-specific attribute representation in latent spaces. The domain-invariant content representation captures the information that the two domains have in common, while the domain-specific attribute representation captures the typical features of each domain. The output images are then synthesised by combining the disentangled representations, where feature-wise transformation provides an option to enable larger shape variation.

For an I2I translation task across domains X and Y, the framework of DRIT++ comprises a content encoder E^c, an attribute encoder E^a, a generator G and a domain discriminator D for each domain, as well as a content discriminator D^c. To achieve the intended disentanglement of information, the weights between the last layer of E^c_X and E^c_Y and the first layer of G_X and G_Y are shared, following the assumption that the two images share some information in a common latent space [71]. The content encoders {E^c_X, E^c_Y} are trained to make the encoded contents of the two domains indistinguishable to the adversarial content discriminator D^c. DRIT++ also uses a cross-cycle consistency loss, a domain adversarial loss, a self-reconstruction loss, and a latent regression loss to encourage better quality of the generated images.

DRIT++ is included in this study because of its explicit extraction of shared information from the domains into a common latent space; although the method is designed for output diversity, this property is well aligned with what is required for image registration and may therefore be expected to lead to good performance.

3.2.4 StarGANv2 [15]

StarGANv2 was recently proposed to not only generate diverse output images, but also to address scalability to multiple domains. StarGANv2 comprises four modules: a style encoder E_y to extract the style code s = E_y(x_ref) from a reference image x_ref, or alternatively, a mapping network F_y to generate a style code s = F_y(z) from an arbitrarily sampled vector z in the latent space; a generator G to translate an input image x and a domain-specific style s into an output image G(x, s); and a multi-task discriminator D with multiple output branches D_y to classify whether an output image is real or fake for each domain y. In addition to the diversity loss L_ds, which enforces G to discover meaningful style features in the image space, it also uses a cycle-consistency loss L_cyc to preserve the content of the input image, a style reconstruction loss L_sty, and an adversarial loss L_adv to ensure the output quality.

StarGANv2 is of specific interest due to the reported remarkably good visual quality of the generated images compared to the baseline models [17, 72], as well as the framework’s increasing popularity in the community. Its ability to inject domain-specific style into a given input image could potentially simplify the registration problem when more than two modalities are considered.

3.2.5 CoMIR [29]

CoMIR is a recently proposed representation learning method that produces (approximately) rotation equivariant image representations (typically of the same size as the input images), using an objective function based on noise-contrastive estimation (InfoNCE) [73].

For an arbitrary pair of images (x1, x2) in the dataset D, the InfoNCE loss is given by

\mathcal{L}_{opt}(\mathcal{D}) = -\,\mathbb{E}_{(x^1, x^2) \sim \mathcal{D}} \left[ \log \frac{\frac{p(x^1, x^2)}{p(x^1)\, p(x^2)}}{\frac{p(x^1, x^2)}{p(x^1)\, p(x^2)} + \sum_{x_i \in \mathcal{D} \setminus \{x\}} \frac{p(x_i^1, x_i^2)}{p(x_i^1)\, p(x_i^2)}} \right].    (3)

The unknown ratio p(x^1, x^2) / (p(x^1) p(x^2)) is approximated with the exponential e^{h(y_i^1, y_i^2)/\tau}, where y^1 = f_{\theta_1}(x^1) and y^2 = f_{\theta_2}(x^2) are the respective modality-translated images, and where h(y^1, y^2) is a positive, symmetric ‘critic’ with a global maximum for y^1 = y^2 (e.g., negative mean squared difference (MSD), or cosine similarity) and τ is a scaling parameter. To equip the model with rotational equivariance, a loss term that implicitly enforces equivariance on C_4 (the finite, cyclic symmetry group of multiples of 90° rotations) is used:

h\left( T_1^{-1}\!\left( f_{\theta_1}(T_1(x_i^1)) \right),\; T_2^{-1}\!\left( f_{\theta_2}(T_2(x_i^2)) \right) \right),    (4)

where T_1, T_2 ∈ C_4 are randomly sampled during training.

Given a batch of image pairs, the images of each modality are transformed using DCNNs (f_{θ_1} and f_{θ_2}, one model per modality without weight-sharing) to a learned embedding space. The objective function serves to train the models to generate similar mappings for the paired images, while generating dissimilar mappings compared to all the other images in the batch. CoMIR requires pairs of aligned images for its training procedure, similarly to pix2pix. One advantage of the approach taken by CoMIR is that the requirement is not to learn to reproduce the appearance of either modality, but rather to learn a representation of the content (signal) present in both modalities, mapping corresponding image pairs into images with similar structure and appearance.
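
A minimal batch-wise PyTorch sketch of the contrastive objective, here with a cosine-similarity critic for brevity (the evaluated CoMIR models use MSD as the critic and add the C_4 equivariance term of Eq (4)):

```python
import torch
import torch.nn.functional as F

def infonce_loss(y1, y2, tau=0.5):
    """y1, y2: (N, C, H, W) representations of the two modalities; paired
    images share the batch index, all other pairs in the batch are negatives."""
    y1 = F.normalize(y1.flatten(1), dim=1)
    y2 = F.normalize(y2.flatten(1), dim=1)
    logits = (y1 @ y2.t()) / tau                          # critic values for all pairs
    targets = torch.arange(y1.size(0), device=y1.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)
```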

Combined with a monomodal registration framework which is applied to the learned representations, CoMIR has been shown to exhibit good performance in the registration of (2D) multimodal images [29]. It is included in this study as a reference method based on a modality translation approach alternative to I2I translation, and its application is extended to 3D (medical) data.

3.3 Monomodal registration

We select two monomodal registration methods which have exhibited excellent performance in a range of applications: (i) SIFT—a feature-based approach which has for a long time dominated the field, demonstrating versatility and robustness in a large number of studies, and (ii) a recent method based on minimisation of a distance between images (α-AMD), which has been shown to be a top choice for medical and biomedical data. The selection of these generally applicable and well-performing monomodal registration methods, to be combined with the modality translation approaches, enables us to focus particularly on the performance of the I2I translation methods in registration.

3.3.1 SIFT-based registration

Scale-invariant feature transform (SIFT) [36] is a 2D feature detector and descriptor which is invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. It has found wide usage in a range of image matching and object recognition applications. Due to its popularity and generally good performance, SIFT is included in this study and evaluated on the 2D registration task (in combination with a suitable feature point matching and transformation estimation) on the original and generated images and their representations. The primary interest is in its performance in the (generated) monomodal scenarios.

3.3.2 Registration based on α-AMD

A monomodal image registration method based on iterative gradient descent optimisation of α-AMD has shown very good and robust performance [32, 74]. α-AMD is a symmetric distance measure between images which combines intensity and spatial information. Its properties enable a larger convergence region around the global optimum, compared to commonly used similarity measures that are based on statistics or intensity differences of overlapping points. Registration based on minimisation of α-AMD has been successfully applied in a variety of monomodal scenarios [29, 32, 75], which motivates our decision to include it in this study for the registration of the generated monomodal image pairs.

4 Data

We use three publicly available 2D datasets and one 3D dataset in this study, to illustrate different applications (remote sensing, cytological, histological, and radiological imaging), combining different imaging modalities (Near-Infrared and RGB, Quantitative Phase Imaging and Fluorescence Microscopy, Second Harmonic Generation Microscopy and Bright Field Microscopy, and finally T1- and T2-weighted Magnetic Resonance Imaging) and introducing different levels of difficulty of image registration.

4.1 Zurich data

The Zurich dataset comprises 20 QuickBird-acquired images (with side lengths ranging from 622 to 1830 px) of the city of Zurich [76, 77]. Each image is composed of 4 channels, Near-Infrared (NIR), and three colour channels (R,G,B). Each channel is globally re-scaled to the range [0, 255]. The NIR channel is extracted as Modality A (Fig 2A), and the R-G-B channels are extracted jointly as Modality B (Fig 2B).

Fig 2. Examples of pairs of original (2D) images acquired by different modalities considered in this study (contrast enhanced for visualisation).

Fig 2

Zurich (remote sensing) dataset: (A) modality A: NIR, (B) modality B: RGB. Cytological data: (C) modality A: Fluorescence microscopy, (D) modality B: QPI. Histological data: (E) modality A: SHG, (F) modality B: BF.

4.2 Cytological data

The Cytological dataset is composed of correlative time-lapse Quantitative Phase Images (QPI) and Fluorescence Images of prostatic cell lines acquired by the multimodal holographic microscope Q-PHASE (TESCAN, Brno, Czech Republic) [78, 79]. Three cell lines (DU-145, PNT1A, LNCaP) are exposed to cell death-inducing compounds (staurosporine, doxorubicin, and black phosphorus) and captured at 7 different fields of view. The data consist of time-lapse stacks of several hundred 600 × 600 px frames. Pixel values represent cell dry mass density in pg/μm² in the QPI images, and the amount of caspase-3/7 product accumulation in the fluorescence images (visualised using a FITC 488 nm filter) [80]. For each time-lapse stack, the intensity values are globally re-scaled to the range [0, 255]. We visually inspected the image frames and confirmed that the signals of most cells are visible in the last 60 frames in each stack. We sparsely and evenly sample these frames such that the selected images are different enough from each other. Frames with indices 0, 15, 30, 45, 60 backwards from the last are extracted. We use the fluorescence images indicating the caspase-3/7 level as Modality A (Fig 2C) and corresponding QPI images as Modality B (Fig 2D).

4.3 Histological data

The Histological dataset comprises 206 aligned Second Harmonic Generation (SHG) and Bright-Field (BF) tissue micro-array (TMA) image pairs of size 2048 × 2048 px [81]. The tissue sections are from patients with breast cancer, pancreatic cancer, and kidney cancer. BF images are acquired using an Aperio CS2 Digital Pathology Scanner (Leica Biosystems, Wetzlar, Germany) at 40× magnification. SHG images are acquired using a custom-built integrated SHG/BF imaging system described in [28]. Each SHG-BF image pair is registered by aligning manually marked landmark pairs. SHG and BF images are referred to as Modality A (Fig 2E) and B (Fig 2F) of the Histological dataset, respectively.

4.4 Radiological data

The Radiological dataset contains 3D magnetic resonance (MR) images of healthy brains, and is derived from the publicly available Retrospective Image Registration Evaluation (RIRE) dataset [82]. The RIRE dataset includes CT, MR, and PET images of 18 patients in total [83]. We have focused on the provided pairs of T1- and T2-weighted MR images, due to non-available ground truth alignments of other combinations of modalities. After inspecting the alignments, we discarded 6 poorly aligned pairs. The Radiological dataset considered here therefore includes 12 T1- and T2-weighted volume pairs, acquired using a Siemens SP 1.5 Tesla scanner. The volumes consist of 20–52 slices of 256 × 256 px in x- and y-directions. Voxel sizes are 0.78–1.25 mm in x and y, and 3–4 mm in z-directions. Further details about image acquisition and subsequent processing, performed to prepare the data for evaluation of registration, can be found in [82]. In our study, the intensity values are re-scaled within each volume to the range [0, 255]. The T1 modality is referred to as Modality A (Fig 3A), and the T2 modality is referred to as Modality B (Fig 3B).

Fig 3. Example of pair of (slices of) original 3D radiological images from the RIRE dataset acquired by different modalities considered in this study.

Fig 3

(A) Modality A: T1 MR image, (B) modality B: T2 MR image.

4.5 Ground truth alignment of the image pairs

Alignment of multimodal image pairs is, in general, a very difficult task for a human as well as for an automated system, which makes it challenging to create the ground truth data required for benchmarking and for training registration algorithms. Hybrid imaging, i.e., simultaneous acquisition of images of a sample (or a scene) by more than one modality, may provide such aligned pairs directly. Hybrid imaging is, however, available only to a limited extent. The Zurich, Cytological, and Radiological datasets are acquired by hybrid imaging, while aligned pairs of the Histological images are, as mentioned, generated semi-automatically.

5 Experiments

5.1 Implementation of learning-based modality transfer methods

Aiming for a fair comparison, we use the same data augmentation for all the learning-based methods. For each 2D dataset, 5120 square non-rotated patches of size 362 × 362 px are randomly cropped from the training set. All models are trained with on-the-fly augmentation on top of the sampled patches. The augmentation includes, in a sequence: random horizontal flip (p = 0.5); random rotation (p = 1, θ ∈ [−180, 180°)) using either (p = 0.33) nearest neighbour, linear, or cubic interpolation; random Gaussian blur (p = 0.5) with a standard deviation σ ∈ [0, 2.0]; and centre-crop (p = 1) to size 256 × 256 px (thus, no padding required).

Apart from the shared augmentation, we make as few changes as possible compared to the default settings of the respective official released codes. After limited exploration of the parameter values close to the default ones, we have applied the following minor changes to the original implementations: (i) mini-batch sizes are (for reasons of processing speed) maximised under the limitation of 12 GB GPU memory, i.e., 8 for CoMIR, 64 for pix2pix and DRIT++, 4 for CycleGAN and StarGANv2; (ii) DRIT++ is set to use feature-wise transformation in the I2I translation of the Cytological, Histological and Radiological data, to enable the needed shape variation; this is not used on the Zurich dataset, where the shapes present in the images vary comparatively little. (iii) For CoMIR, MSD is selected as the critic function. Its default scaling parameter τ = 0.5 is used for all 2D datasets, while τ = 0.1 is used for the 3D Radiological data.

Training is terminated after a fixed number of iterations following the default settings of the models’ official implementations, i.e., 200 epochs for pix2pix and CycleGAN, 1200 epochs for DRIT++, 100000 iterations for StarGANv2, and 12800 iterations for CoMIR. All models had converged when training was terminated.

During inference, to fit the network architectures of pix2pix, CycleGAN and StarGANv2, input images are padded to the smallest multiple of 256 pixels using “reflect” mode to reduce artefacts; e.g., the 834 × 834 px input images of the Histological data are padded to 1024 × 1024 px. The padded areas of the images resulting from I2I translation are cropped off before further processing. For DRIT++, to better fit the image registration context, a modified version of the guided translation is used; instead of using the disentangled attribute representation of a random image from the other modality, the generator uses the disentangled attribute representation of the image that the current image is to be registered with. Similarly for StarGANv2, the style code of the reference image guiding the translation is taken from the corresponding image in the target modality. Due to the restriction imposed by StarGANv2’s architecture that the reference image can only be of size 256 × 256 px, only the central area of each reference image is used to extract the style code, to avoid a resolution mismatch with training.

For the 3D Radiological data, all the modality transfer methods are trained with, and inference is performed on, axial slices along the z-axis. Each volume slice is resampled (B-spline interpolation) to physical space with 1 mm2 square pixels (since the original volumes have different resolution). For training, each resampled slice is padded to size 362 × 362 px with the edge values. All models are trained with the same on-the-fly augmentation setting as for the 2D datasets, except for CoMIR where the final centre-crop (p = 1) was reduced to size 128 × 128 px so that the internal structures of the brains are better weighted against the skull outlines during training. After inference, the modality-transferred slices of the same volume are stacked back together to form a modality-transferred volume.

5.2 Implementation of image registration methods

Four of the six used registration methods—those based on MI, NGF, SIFT and α-AMD—expect greyscale data. RGB images (Modality B of the Zurich and Histological datasets) are therefore converted to greyscale, using L = 0.299R + 0.587G + 0.114B, before registration. The Euler angle rotation model (angles represented as radians) is used for all the registration methods. The parameters of each method are selected based on the values suggested in the original publications, suitably adjusted to account for differences in image sizes, general contrast levels, as well as level of detail of the image contents, relying on previous experience with the involved methods. For MI, NGF, MIND, and α-AMD we employ a multi-start approach for the 2D registrations; three rotations, 0 and ±0.4 radians, are used as starting points, and the transformation reaching the lowest final distance is selected as the final registration output. For the 3D registrations, single-start is used throughout.
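
The greyscale conversion and the multi-start strategy can be sketched as follows (register_fn is a hypothetical wrapper around any of the iterative methods, assumed to return the estimated transform and its final distance value):

```python
import numpy as np

def to_grey(rgb):
    """ITU-R BT.601 luma: L = 0.299 R + 0.587 G + 0.114 B."""
    return rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114

def multi_start(register_fn, fixed, moving, start_angles=(0.0, 0.4, -0.4)):
    """Run the registration from several initial rotations (radians) and keep
    the transform that reaches the lowest final distance."""
    results = [register_fn(fixed, moving, init_rotation=a) for a in start_angles]
    return min(results, key=lambda r: r[1])[0]
```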

5.2.1 Iterative registration by Mutual Information maximisation

MI-based registration is performed using SimpleElastix, an industry-standard image registration library [84]. Adaptive stochastic gradient descent (ASGD) [85] is used as the optimiser and the maximum number of iterations is set to 1024. To avoid local minima, a multi-resolution strategy with a 4-level Gaussian scale space is used for the Zurich, Cytological and Radiological data, while 6 levels are used for the Histological data, considering the larger image size. In all experiments, 32 histogram bins and bilinear/trilinear interpolation are used.
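
A minimal sketch of such a configuration in SimpleElastix (file names are placeholders; the parameter values in the released code may differ slightly):

```python
import SimpleITK as sitk  # SimpleITK built with Elastix (SimpleElastix)

fixed = sitk.ReadImage("fixed.png", sitk.sitkFloat32)
moving = sitk.ReadImage("moving.png", sitk.sitkFloat32)

pmap = sitk.GetDefaultParameterMap("rigid")
pmap["Metric"] = ["AdvancedMattesMutualInformation"]
pmap["NumberOfHistogramBins"] = ["32"]
pmap["Optimizer"] = ["AdaptiveStochasticGradientDescent"]
pmap["MaximumNumberOfIterations"] = ["1024"]
pmap["NumberOfResolutions"] = ["4"]   # 6 for the larger Histological images

elastix = sitk.ElastixImageFilter()
elastix.SetFixedImage(fixed)
elastix.SetMovingImage(moving)
elastix.SetParameterMap(pmap)
elastix.Execute()
registered = elastix.GetResultImage()
```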

5.2.2 Iterative registration by maximisation of similarity of Normalised Gradient Fields

For optimisation of similarity of NGF we use the Autograd Image Registration Laboratory (AIRLab) framework [86], which supports a large number of objective functions (including similarity of NGF) and optimisation methods, multi-channel images, and multi-start optimisation. We use the ADAM optimiser and 4 pyramid levels with downsampling factors (8, 4, 2, 1) and smoothing with σ = (15, 9, 5, 1); the iteration counts are (2000, 1000, 500, 200) and the step size is 0.01. Bilinear/trilinear interpolation is used.

5.2.3 Iterative registration by MIND

We evaluate optimisation of MIND both using the MSD objective function (as is commonly done) and, in a novel combination of techniques, using the α-AMD measure, where each channel is treated independently and the results are aggregated by summation. The standard deviation of the weighting Gaussian kernel is set to σ = 0.5. Local 4-neighbourhoods are used for 2D data, and local 6-neighbourhoods for 3D data. For the MSD optimisation, the AIRLab framework is used with the same settings as for NGF concerning the number of pyramid levels, smoothing, iteration counts and starts. Bilinear/trilinear interpolation is used.

5.2.4 CurveAlign

The used implementation of CurveAlign follows the default settings in its V4.0 Beta version, except for restricting its default affine transformation to rigid.

5.2.5 SIFT feature detection and matching

The SIFT feature detector is implemented using OpenCV [87]. The maximal number of retained feature points is limited to 500. Brute-force matching with cross-check is used to produce the best matches with the minimal number of outliers when there are enough matches. Other parameters remain at their default settings. The rigid transformation between the two sets of matched coordinates is estimated using the RANdom SAmple Consensus (RANSAC) algorithm [88] via the implementation in scikit-image [89]. A data point is considered an inlier if its residual w.r.t. the estimated model is smaller than 2 px. For each image pair, the random sample selection procedure is iterated 100 times, and the final model is estimated using all inliers of the best model resulting from the performed iterations.
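
A minimal sketch of this pipeline with OpenCV and scikit-image, assuming 8-bit greyscale input images:

```python
import cv2
import numpy as np
from skimage.measure import ransac
from skimage.transform import EuclideanTransform

def sift_rigid_registration(img_a, img_b):
    """SIFT detection/description, brute-force matching with cross-check, and
    RANSAC estimation of a rigid (Euclidean) transform (2 px inlier threshold)."""
    sift = cv2.SIFT_create(nfeatures=500)
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matches = cv2.BFMatcher(crossCheck=True).match(des_a, des_b)
    src = np.array([kp_a[m.queryIdx].pt for m in matches])
    dst = np.array([kp_b[m.trainIdx].pt for m in matches])
    _, inliers = ransac((src, dst), EuclideanTransform, min_samples=2,
                        residual_threshold=2, max_trials=100)
    final = EuclideanTransform()
    final.estimate(src[inliers], dst[inliers])  # refit on all inliers of the best model
    return final
```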

5.2.6 α-AMD

α-AMD is implemented based on the Py-Alpha-AMD Registration Framework (https://github.com/MIDA-group/py_alpha_amd_release). The following modifications are made compared to the settings in the example script: (i) Three resolution levels are used in a pyramid (coarse-to-fine) registration strategy for the 2D datasets, and four resolution levels for the 3D Radiological dataset, with sub-sampling factors (4, 2, 1) for 2D data and (8, 4, 2, 1) for 3D data; Gaussian blur σ is set to (12.0, 5.0, 1.0) for the 2D datasets and (5.0, 1.0, 1.0, 0.0) for the 3D Radiological data. The numbers of iterations per level are set to (900, 300, 60) for the Zurich and Cytological data, to (3000, 1000, 200) for the larger Histological data, and to (3000, 3000, 1000, 20) for the 3D Radiological data; (ii) Stochastic gradient descent (SGD) with momentum α = 0.9 and a sampling fraction of 0.01 is used. For the 2D datasets, the step size is set to 2 for the first two resolution levels and decreases linearly from 2 to 0.2 over the last level; for the 3D dataset, it is set to 1 for the first three resolution levels and decreases linearly from 1 to 0.1 over the last level.

5.3 Evaluation

Our main objective is to evaluate the success of registration of multimodal images, utilising images translated from one modality to another as an intermediate result (to which monomodal registration is applied). We find it relevant to also evaluate the performance of the I2I translation methods on the observed datasets, and explore the effect of the quality of modality translation on the success of subsequent registration.

5.3.1 Evaluation sets

The Zurich dataset is divided into three sub-groups, enabling 3-folded cross-validation. The three groups are formed of the images with IDs: {7, 9, 20, 3, 15, 18}, {10, 1, 13, 4, 11, 6, 16}, {14, 8, 17, 5, 19, 12, 2}. Since the images vary in size, each image is subdivided into the maximal number of equal-sized non-overlapping regions such that each region can contain exactly one 300 × 300 px image patch. One 300 × 300 px image patch is then extracted from the centre of each region. This grouping and patch extraction results in 72 test samples per evaluation fold.

The Cytological data contain images from three different cell lines; we use all images from one cell line as one fold in 3-folded cross-validation. Each image in the dataset is subdivided from 600 × 600 px into 2 × 2 patches of size 300 × 300 px, so that there are 420 test samples in each evaluation fold.

For the Histological data, to avoid too easy registration relying on the circular border of the TMA cores, the evaluation images are created by cutting 834 × 834 px patches from the centres of the original 134 TMA image pairs. This dataset has a defined split into training and test sets which we adhere to in our evaluation.

The evaluation set created from each of the three observed 2D image datasets consists of images that have undergone uniformly distributed rigid transformations of increasing displacement. Each image patch is randomly rotated by an angle θ ∈ [−20, 20] degrees (with bi-linear interpolation), followed by translations in the x and y directions by tx and ty pixels respectively, where tx and ty are randomly sampled within [−28, 28] for the Zurich and Cytological data, and within [−80, 80] for the Histological data. To minimise border artefacts, the transformed patches are, for the Zurich and Cytological data, padded using “reflect” mode, or, for the Histological data, cropped at the appropriate position directly from the original larger images.
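
A sketch of how one such random rigid displacement could be sampled and applied with scikit-image (rotation here is about the image origin; the actual evaluation data are generated by the scripts released with the benchmark):

```python
import numpy as np
from skimage.transform import EuclideanTransform, warp

rng = np.random.default_rng(0)

def random_rigid_2d(max_angle_deg=20, max_trans_px=28):
    """Uniformly sampled rotation and translation, as for the Zurich and
    Cytological data (use max_trans_px=80 for the Histological data)."""
    theta = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    tx, ty = rng.uniform(-max_trans_px, max_trans_px, size=2)
    return EuclideanTransform(rotation=theta, translation=(tx, ty))

# displaced = warp(patch, random_rigid_2d().inverse, order=1, mode="reflect")
```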

The 3D Radiological dataset is divided into three sub-groups, enabling 3-folded cross-validation. The groups are formed of the patients with IDs: {109, 106, 003, 006}, {108, 105, 007, 001}, {107, 102, 005, 009}. Since the Radiological dataset is non-isotropic (and also of varying resolution), we resample it using B-spline interpolation to 1 mm3 cubic voxels before performing registration, taking explicit care to not resample twice; displaced volumes are transformed and resampled in one step. Reference sub-volumes of size 210 × 210 × 70 voxels are cropped directly from centres of the (non-displaced) resampled volumes. Similarly as for the aforementioned 2D datasets, random (uniformly-distributed) transformations are composed of rotations θx, θy ∈ [−4, 4] degrees around the x- and y-axes, rotation θz ∈ [−20, 20] degrees around the z-axis, translations tx, ty ∈ [−19.6, 19.6] voxels in x and y directions and translation tz ∈ [−6.5, 6.5] voxels in z direction. 40 rigid transformations of increasing sizes of displacement are applied to each volume. Transformed sub-volumes, of size 210 × 210 × 70 voxels, are cropped from centres of the transformed and resampled volumes.

5.3.2 Metrics

Modality translation. Fréchet Inception Distance (FID) [90], a measure of similarity between two sets of images, has been widely used as an objective metric to assess the quality of GAN-generated images and has shown a high correlation with the subjective human judgement [15, 17, 58]. Lower FID (being a distance measure) indicates higher similarity of the images, corresponding to higher quality of the image translation. We use a PyTorch implementation of FID [91] to evaluate the quality of images generated by the considered modality translation methods.
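
Assuming the generated and the real images of a modality are stored in two directories, the FID computation with the pytorch-fid package [91] reduces to a call like the following (directory names are placeholders and the exact function signature may vary between package versions):

```python
from pytorch_fid.fid_score import calculate_fid_given_paths

# Distance between generated Modality A images and real Modality A images.
fid = calculate_fid_given_paths(["generated_A/", "real_A/"],
                                batch_size=50, device="cuda", dims=2048)
print(f"FID: {fid:.1f}")
```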

Registration. We evaluate the performance of the considered registration approaches in terms of their success in recovering rigid transformations of varying size (amount of displacement) applied to one of the images in each of the aligned pairs. To quantify the success of registration, we define the spatial distance D(I1, I2) between two image patches I1 and I2 as

D(I_1, I_2) = \frac{1}{n} \sum_{i=1}^{n} \left\| C_i^2 - C_i^1 \right\|_2,    (5)

where C_i^1 and C_i^2 denote the positions of the n corner points of I_1 and I_2, respectively (n = 4 for the 2D datasets and n = 8 for the 3D dataset), when mapped into a common coordinate system.

The initial displacement d_Init of a synthetic rigid transformation is the distance between a reference patch I_Ref and its corresponding initially transformed patch I_Init: d_Init = D(I_Ref, I_Init). The absolute registration error ϵ is the (residual) distance between the reference patch I_Ref and the transformed patch after registration I_Reg: ϵ = D(I_Ref, I_Reg). The relative registration error δ is the absolute error expressed as a percentage of the patch width/height: δ = (ϵ/w) × 100%, with w = 300 px for the Zurich and Cytological data, w = 834 px for the Histological data, and w = (210 + 210 + 70)/3 voxels for the Radiological data. A registration is considered successful when the relative registration error δ < 2%. The success rate λ is the ratio of successful cases to the total number of cases.
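
A minimal sketch of the error and success-rate computation of Eq (5), given corner-point coordinates mapped into a common coordinate system:

```python
import numpy as np

def registration_error(corners_ref, corners_reg):
    """Eq (5): mean Euclidean distance between corresponding corner points
    (n = 4 in 2D, n = 8 in 3D)."""
    return float(np.mean(np.linalg.norm(np.asarray(corners_reg) -
                                        np.asarray(corners_ref), axis=1)))

def success_rate(abs_errors, width, threshold_percent=2.0):
    """Fraction of samples whose relative error delta is below the threshold."""
    delta = 100.0 * np.asarray(abs_errors) / width
    return float(np.mean(delta < threshold_percent))
```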

5.4 A framework for evaluation of multimodal registration methods

We share our complete experimental setup, including method implementations, evaluation code, and all datasets in order to facilitate further reproducing and benchmarking (https://github.com/MIDA-group/MultiRegEval).

Our created dataset collection, Datasets for Evaluation of Multimodal Image Registration, is released on Zenodo with open access [92]. In total, it contains 864 image pairs created from the Zurich dataset, 5040 image pairs created from the Cytological dataset, 536 image pairs created from the Histological dataset, and metadata with scripts to create the 480 volume pairs from the Radiological dataset. Each image/volume pair consists of a reference patch I_Ref and its corresponding initially transformed patch I_Init, in two modalities, along with the ground truth transformation parameters to recover the transformation. Scripts to compute the registration performance, to plot the overall results, and to generate more evaluation data with different settings are also included.

The framework can easily be extended by including additional methods, and can support further development and evaluation of approaches for registration of 2D and 3D medical and biomedical images. The provided datasets are large enough to enable training of deep learning based approaches, both those which require aligned image pairs and those which do not. Considering the observed lack of suitable benchmarks particularly designed for biomedical and medical applications and, at the same time, the increasing number of newly proposed registration methods, an option to perform comparative analysis under standardised conditions is highly beneficial and is expected to contribute to promotion of generally applicable and robust solutions.

6 Results

6.1 Performance of the modality translation methods

To summarise the modality translation performance, we present in Table 1 the computed FID values measured on the four datasets, for modality translations performed by each of the four I2I translation methods in two directions (A to B, and B to A). The values in a row method_A indicate the distance between the distribution of images in the method-generated Modality A and that of the real Modality A.

Table 1. Success of modality translation methods expressed in terms of Fréchet Inception Distance (FID).

Smaller is better. Standard deviations are taken over the 3 folds for Zurich, Cytological and Radiological data. cyc, drit, p2p, star, comir denote the methods CycleGAN, DRIT++, pix2pix, StarGANv2, and CoMIR, respectively. Suffix _A (resp. _B) denotes generated Modality A (resp. B). The best result achieved by an I2I translation method on each of the datasets is bolded. FID values between the initial considered multimodal image datasets (B2A), as well as between the training and testing splits within each modality (A2A and B2B) for each dataset, are included as references. FID values between the generated CoMIR representations are not directly comparable to those of the I2I translations, since the method generates a different (artificial) modality. Comparison with the values for the original multimodal data (B2A) confirms considerable reduction of FID.

Method \ Dataset   Zurich Data    Cytological Data   Histological Data   Radiological Data
cyc_A              232.4±69.7     35.1±11.9          433.4               110.5±4.6
cyc_B              91.5±27.0      65.4±16.9          156.1               105.0±7.5
drit_A             182.9±3.3      63.1±17.4          125.3               244.8±16.0
drit_B             144.1±9.1      192.3±50.4         123.7               233.9±23.5
p2p_A              93.7±4.6       61.8±18.5          116.8               251.3±10.6
p2p_B              94.4±3.3       169.3±6.9          153.0               206.3±6.1
star_A             165.5±10.6     99.1±53.5          174.2               108.8±5.7
star_B             135.6±21.5     140.6±24.4         142.7               110.7±3.9
comir              17.7±8.0       61.5±11.1          91.0                89.9±20.7
B2A                155.3±15.4     145.0±17.6         341.0               134.4±8.7
A2A                113.7±0.9      48.8±27.1          99.1                76.9±2.2
B2B                104.3±3.7      82.8±20.4          102.6               86.4±4.0

We observe that the FID values reflect: (i) different levels of asymmetry of the I2I translation methods in treating translations in one or the other direction (e.g., cyc compared to p2p on the Zurich dataset); (ii) rather varying stability of the results across the different methods and datasets, as indicated by the large spread of standard deviations.

FID is also computed for CoMIR, indicating the difference between the distributions of the representations generated from Modality A and B. Although not directly comparable with the other values (computed on a rather different type of representation), these FID values, increasing from Zurich to the Histological dataset, are still informative in reflecting an increase in dataset difficulty.

The FID values in the row B2A indicate the difference between the original acquired multimodal images, for each of the four datasets. They can be seen as a reference value of the dissimilarity (FID) of the original images that, ultimately, are to be registered. We note that in some cases FID becomes higher after modality translation, compared to the initial B2A value.

Finally, the FID values computed within each of the modalities (between training and test splits), A2A and B2B, indicate the general scale of FID values for the respective modality. In most cases FID values for modalities A and B are rather similar (indicating that observed asymmetries are due to the I2I translations) except for the Cytological dataset, where FID values for modality A are considerably lower than for modality B.

To complement these quantitative results, we show in Figs 4–7 a few examples of modality-translated images, by the evaluated methods, for each of the observed datasets. The examples illustrate some of the above-commented behaviours of the different methods.

Fig 4. Modality-translated image samples of the Zurich dataset by different evaluated methods (contrast-enhanced for visualisation).

Fig 4

Each row shows the results on one random image from each fold. Images in Columns 2–6 are generated from the images in (the corresponding row of) Column 1. Top block: Translations generated from Modality B. Bottom block: Translations generated from Modality A. The arrows indicate what to compare for visual inspection of the level of achieved similarity (pointing from generated images to the corresponding target of the learning process).

Fig 7. Modality-translated image samples of the Radiological dataset by different evaluated methods.


Each row shows the results on one random slice from each fold. Images in Columns 2–6 are generated from the images in the corresponding row of Column 1. Top block: Translations generated from Modality B. Bottom block: Translations generated from Modality A. The arrows indicate what to compare for visual inspection of the level of achieved similarity (pointing from generated images to the corresponding target of the learning process).

Fig 4 shows examples from the Zurich dataset, with a clear correlation between the observed modalities. All the observed I2I translation methods reach reasonable to very good results. pix2pix generates the most realistic images, in both directions; it not only preserves the structural information but also maps the local intensities as desired. This is also reflected in the low FID values in Table 1. Other examples point out some issues: CycleGAN is successful in translating Modality A to B, but not the other way around. DRIT++ preserves structures, but is less successful regarding intensity mapping, whereas StarGANv2 fails to preserve geometry (straight lines) and generates a number of (colourful) noise pixels.

As can be seen in Figs 5–7, all the I2I translation methods exhibit more or less degraded performance when applied to the biomedical and medical datasets. Results on the Cytological dataset (Fig 5) indicate that the performance depends strongly on the “direction” of translation (as indicated by FID as well); while all the methods give reasonably good output when translating from Modality B to A, the quality when translating from Modality A to B tends to be much lower. This is particularly visible for pix2pix.

Fig 5. Modality-translated image samples of the Cytological dataset by different evaluated methods (contrast-enhanced for visualisation).


Each row shows the results on one random image from each fold. Images in Columns 2–6 are generated from the images in (the corresponding row of) Column 1. Top block: Translations generated from Modality B. Bottom block: Translations generated from Modality A. The arrows indicate what to compare for visual inspection of the level of achieved similarity (pointing from generated images to the corresponding target of the learning process).

The Histological dataset (Fig 6), with distinctly different imaging modalities, exposes further issues and differences in the performance of the observed methods. Translated images exhibit non-realistic intensities and structures as well as “invented” details, particularly in translations from Modality A to B, where pix2pix fails completely. However, when instead translating from Modality B to A, pix2pix surprisingly captures the most structure, which is consistent with the lowest FID value among the I2I translation methods on the Histological data.

Fig 6. Modality-translated image samples of the Histological dataset by different evaluated methods (contrast-enhanced for visualisation).


Each row shows the results on one random image from each fold. Images in Columns 2–6 are generated from the images in (the corresponding row of) Column 1. Top block: Translations generated from Modality B. Bottom block: Translations generated from Modality A. The arrows indicate what to compare for visual inspection of the level of achieved similarity (pointing from generated images to the corresponding target of the learning process).

On the 3D Radiological dataset (Fig 7), CycleGAN and StarGANv2 generate the most visually realistic output slices. This is also in accord with their low FID values in Table 1. pix2pix, although exhibiting unnatural artefacts and distortions, best captures the relative intensity of the eyeballs. CoMIR generates representations with a good balance of the extracted dominant structures (corresponding to the skull) and textural details (corresponding to soft tissue), while (similar to pix2pix) occasionally expressing artefacts in the surrounding background area.

We conclude that the visual assessment of the generated images most often supports the quantitative evaluation by FID, and clearly shows that none of the observed I2I translation methods exhibits stable performance or consistently outperforms the others. Even though we cannot exclude the possibility that individually exploring the hyperparameter space and designing data augmentation schemes for each I2I translation method could lead to slightly improved performance, we believe that such changes would not affect our main conclusion.

The performance of CoMIR, the representation-learning (non-I2I) modality translation method, is visually evaluated by assessing the similarity of the corresponding learned representations. We observe relatively high similarity for all four datasets, decreasing slightly as the modalities become more dissimilar, from the Zurich to the Histological dataset, in agreement with the FID values. Interestingly, despite the relatively low intermodal diversity of the Radiological dataset (expressed by its low B2A FID in Table 1), the modality translation methods are comparably poor at reducing it further; also here the best result is reached by CoMIR.

6.2 Registration performance

Plots showing the registration success rate λ for increasing initial displacement dInit, for the different evaluated combinations of methods, are presented in Figs 8 and 9. The reference methods (MI, NGF, MIND, and CurveAlign) are included in both the left and right columns of Fig 8 to facilitate easier comparison. We evaluated MIND both in combination with MSD and with α-AMD minimisation and concluded that the novel combination of MIND and α-AMD consistently performs better than the standard combination of MIND and MSD. For the sake of visual clarity we therefore omit the latter from the graphs. As already commented, we also exclude results obtained by VoxelMorph [35] due to its consistently poor performance, which we attribute to the constraint of rigid registration (as also observed by [62]). In Table 2, we summarise the aggregated performance over all the considered displacements. Here, we include the results obtained by MIND optimised both by MSD and by α-AMD.

Fig 8. Success rate of the observed registration approaches.


(A) SIFT on Zurich data, (B) α-AMD on Zurich data, (C) SIFT on Cytological data, (D) α-AMD on Cytological data, (E) SIFT on Histological data, (F) α-AMD on Histological data. x-axis: initial displacement dInit between moving and fixed images, discretised into 10 equally sized bins (marked by vertical dotted lines). y-axis: success rate λ within each bin (averaged over 3 folds for Zurich and Cytological data). In the legend, cyc, drit, p2p, star and comir denote CycleGAN, DRIT++, pix2pix, StarGANv2, and CoMIR methods respectively. Suffix _A (resp. _B) denotes that generated Modality A (resp. B) is used for the (monomodal) registration. B2A denotes registration of the original multimodal images, without using any modality translation. MI, MIND, NGF and CA represent using MI maximisation, MIND, NGF and CurveAlign for registration, respectively.

Fig 9. Success rate of the observed 3D registration approaches on Radiological data.


x-axis: initial displacement dInit between moving and fixed images, discretised into 10 equally sized bins (marked by vertical dotted lines). y-axis: success rate λ within each bin (averaged over 3 folds). In the legend, cyc, drit, p2p, star and comir denote CycleGAN, DRIT++, pix2pix, StarGANv2, and CoMIR methods respectively. Suffix _A (resp. _B) denotes that generated Modality A (resp. B) is used for the (monomodal) registration. B2A denotes registration of the original multimodal images, without using any modality translation. MI, MIND and NGF represent using MI maximisation, MIND and maximisation of the similarity of NGF for registration, respectively.
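As an illustration of how the per-bin success rates plotted in Figs 8 and 9 can be assembled, the sketch below bins registration trials by their initial displacement and computes the fraction of successes in each bin. The array names, the success criterion (a simple error threshold) and the synthetic numbers are placeholders for this example; the actual success criterion and evaluation code are part of the shared open-source framework.

```python
import numpy as np

def success_rate_per_bin(d_init, final_error, success_threshold, n_bins=10):
    """Success rate per initial-displacement bin.

    d_init:       per-pair initial displacement dInit (e.g. in pixels/voxels)
    final_error:  per-pair residual error after registration
    A pair counts as successfully registered when its residual error
    is below success_threshold.
    """
    d_init = np.asarray(d_init, dtype=float)
    success = np.asarray(final_error, dtype=float) < success_threshold

    # Equally sized bins over the range of initial displacements
    edges = np.linspace(d_init.min(), d_init.max(), n_bins + 1)
    bin_idx = np.clip(np.digitize(d_init, edges) - 1, 0, n_bins - 1)

    rates = np.full(n_bins, np.nan)
    for i in range(n_bins):
        in_bin = bin_idx == i
        if in_bin.any():
            rates[i] = success[in_bin].mean()
    return edges, rates

# Toy usage with synthetic numbers (not data from the study):
rng = np.random.default_rng(1)
d = rng.uniform(0, 100, size=500)    # initial displacements
err = rng.uniform(0, 50, size=500)   # residual errors after registration
edges, rates = success_rate_per_bin(d, err, success_threshold=2.0)
print(np.round(rates, 2))
```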

Table 2. Overall registration success rate (in percent) for the evaluated methods on four datasets.

Larger is better. The success rate λ is aggregated over all transformation levels for each dataset. Standard deviations are taken over the 3 folds for the Zurich, Cytological and Radiological data. cyc, drit, p2p, star, comir, MIND(α-AMD), MIND(MSD), NGF, MI and CA denote the methods CycleGAN, DRIT++, pix2pix, StarGANv2, CoMIR, MIND with α-AMD-based registration, MIND with MSD-based registration, NGF, MI maximisation and CurveAlign, respectively. _A (resp. _B) denotes using generated Modality A (resp. B) for registration. B2A refers to the multimodal registration performance on the acquired images without modality translation. MI, MIND and NGF provide reference performance of well-established conventional multimodal registration methods. For each dataset, the best I2I-based approach, as well as the overall best performing (multimodal) approach, are bolded.

Method | Zurich (α-AMD) | Zurich (SIFT) | Cytological (α-AMD) | Cytological (SIFT) | Histological (α-AMD) | Histological (SIFT) | Radiological (α-AMD)
Registration with modality translation (I2I-based approaches)
cyc_A | 4.9±2.1 | 66.4±18.8 | 71.1±5.8 | 24.4±6.2 | 0 | 0 | 6.2±3.9
cyc_B | 65.0±8.4 | 83.2±3.1 | 19.2±2.8 | 17.6±2.5 | 13.8 | 0 | 0.6±0.0
drit_A | 34.8±5.4 | 38.0±7.9 | 61.6±16.2 | 21.6±3.6 | 1.7 | 0 | 0.0±0.0
drit_B | 18.1±3.1 | 35.4±3.5 | 21.0±9.0 | 4.6±1.3 | 4.7 | 0 | 0.0±0.0
p2p_A | 80.2±3.9 | 98.3±0.5 | 57.9±7.4 | 8.6±1.2 | 28.4 | 0 | 26.2±5.3
p2p_B | 61.5±4.7 | 85.0±5.0 | 0.1±0.1 | 3.8±2.0 | 0.4 | 0 | 24.2±5.7
star_A | 64.0±7.5 | 6.5±2.7 | 57.4±13.0 | 10.9±2.2 | 2.6 | 0 | 0.6±0.9
star_B | 41.1±3.6 | 5.9±0.5 | 17.8±4.9 | 5.8±0.6 | 19.6 | 0 | 19.2±12.5
comir | 91.8±7.7 | 100.0±0.0 | 68.0±14.0 | 72.5±7.1 | 81.3 | 59.3 | 95.8±5.0
Baseline without modality translation
B2A | 12.8±3.5 | 72.5±4.8 | 21.9±10.5 | 20.8±2.0 | 0 | 0 | 3.1±4.4
Reference direct multimodal registration methods (one value per dataset)
MIND(α-AMD) | 93.2±2.8 | 33.8±5.2 | 29.1 | 91.0±3.6
MIND(MSD) | 41.4±1.6 | 30.6±0.5 | 9.5 | 15.0±0.5
NGF | 47.3±7.4 | 30.7±5.2 | 2.6 | 38.3±0.8
MI | 80.9±3.5 | 98.1±0.8 | 56.5 | 100.0±0.0
CA | - | - | 3.7 | -

A number of observations can be made from these results. Reference performance is established by iterative MI maximisation (black dashed line, MI). We observe that, for smaller initial displacements, MI maximisation delivers outstanding performance on all four datasets. However, it is also apparent that the performance of MI (optimised by adaptive stochastic gradient descent) decreases rapidly as the initial displacement increases on the structure-rich Zurich and Histological datasets. This is not surprising, since MI is known to have a small region of attraction. On the other hand, its performance on the other two datasets is less impaired by the increasing initial displacement.
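To make the MI-based reference concrete, the snippet below sketches rigid registration by maximisation of Mattes mutual information using SimpleITK's generic registration framework. Note that this is not the setup used in our experiments (which used adaptive stochastic gradient descent, cf. [84, 85]); it only illustrates the principle, and the file names and parameter values are placeholders.

```python
import SimpleITK as sitk

# Illustrative rigid registration by maximisation of Mattes mutual information.
# File names and parameter values are placeholders.
fixed = sitk.ReadImage("fixed_modality_A.nii.gz", sitk.sitkFloat32)
moving = sitk.ReadImage("moving_modality_B.nii.gz", sitk.sitkFloat32)

# Initialise a rigid (Euler) transform from the image geometry
initial = sitk.CenteredTransformInitializer(
    fixed, moving, sitk.Euler3DTransform(),
    sitk.CenteredTransformInitializerFilter.GEOMETRY)

reg = sitk.ImageRegistrationMethod()
reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=32)
reg.SetMetricSamplingStrategy(reg.RANDOM)
reg.SetMetricSamplingPercentage(0.1)
reg.SetInterpolator(sitk.sitkLinear)
reg.SetOptimizerAsGradientDescent(learningRate=1.0, numberOfIterations=200)
reg.SetOptimizerScalesFromPhysicalShift()
reg.SetInitialTransform(initial, inPlace=False)

rigid_transform = reg.Execute(fixed, moving)

# Resample the moving image into the space of the fixed image
registered = sitk.Resample(moving, fixed, rigid_transform, sitk.sitkLinear, 0.0)
sitk.WriteImage(registered, "moving_registered.nii.gz")
```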

The yellow-green dotted curve B2A indicates the performance of monomodal registration approaches (SIFT and α-AMD, in the respective subplots), applied directly to multimodal images. As expected, these methods underperform MI maximisation in all cases except for SIFT with relatively large transformations on the Zurich dataset.

MIND combined with α-AMD minimisation shows better performance than MI on the structure-rich Zurich data, while falling behind on the other datasets. NGF performs comparably to MIND (combined with α-AMD) on the Cytological data, while less favourably on the other datasets. MI consistently outperforms NGF.

The main focus of this study is to analyse the positioning of the remaining curves (with markers) presented in Figs 8 and 9, in comparison with the reference methods. First, we observe that, for each of the four datasets, there are a number of solid-line curves above the yellow-green dotted curve B2A, for all observed displacements. This demonstrates that some I2I translation methods indeed can help to approach a “monomodal case”, and make the registration task easier when using the observed (monomodal) registration methods, SIFT and α-AMD.

We also note that, similarly to the modality translation, the performance of the different registration methods shows significant data dependency. On the structure-rich Zurich data, SIFT performs rather well on modality-translated images (and even on the original multimodal pairs). One exception is the StarGANv2-generated Zurich data, star_A and star_B in Fig 8A, where SIFT fails completely. On the biomedical data, where features are less salient, α-AMD, being an intensity-based method, tends to come out ahead. Fig 8E particularly indicates very poor performance of SIFT-based registration on the Histological data.
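For reference, the sketch below outlines a typical SIFT-based rigid registration pipeline of the kind evaluated here: keypoint detection and matching with OpenCV [87], followed by robust transform estimation with RANSAC [88]. It is an illustration rather than the exact implementation used in our framework; note that cv2.estimateAffinePartial2D fits a similarity transform, a mild relaxation of the strictly rigid model, and all file names are placeholders.

```python
import cv2
import numpy as np

def sift_ransac_transform(fixed_gray, moving_gray, ratio=0.75):
    """Estimate a 2x3 transform mapping the moving image onto the fixed image
    from SIFT keypoint matches, made robust with RANSAC."""
    sift = cv2.SIFT_create()
    kp_f, des_f = sift.detectAndCompute(fixed_gray, None)
    kp_m, des_m = sift.detectAndCompute(moving_gray, None)
    if des_f is None or des_m is None:
        return None

    # Lowe's ratio test on 2-nearest-neighbour matches
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(des_m, des_f, k=2)
            if m.distance < ratio * n.distance]
    if len(good) < 3:
        return None

    src = np.float32([kp_m[m.queryIdx].pt for m in good])
    dst = np.float32([kp_f[m.trainIdx].pt for m in good])
    transform, _ = cv2.estimateAffinePartial2D(
        src, dst, method=cv2.RANSAC, ransacReprojThreshold=3.0)
    return transform

# Placeholder usage:
# fixed = cv2.imread("translated_fixed.png", cv2.IMREAD_GRAYSCALE)
# moving = cv2.imread("moving.png", cv2.IMREAD_GRAYSCALE)
# T = sift_ransac_transform(fixed, moving)
# warped = cv2.warpAffine(moving, T, (fixed.shape[1], fixed.shape[0]))
```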

In general, the I2I translation methods followed by either SIFT (on 2D data) or α-AMD (on 2D and 3D data) registration appear far less sensitive to the initial displacement than MI; in particular, several of them outperform MI on the Zurich dataset for larger displacements.

What stands out in both Table 2 and Figs 8 and 9 is CoMIR, the representation learning-based modality translation method, which consistently outperforms the I2I translation approaches. In combination with SIFT, only CoMIR succeeds on the Histological data (Fig 8E), while at the same time clearly outperforming the baseline for medium and large displacements. (Note that MI, MIND and NGF are not combined with SIFT, but are included in all the plots as reference methods.) On the Histological dataset, CoMIR combined with α-AMD is the clear winner, while on the Cytological data MI takes the top position, followed by CoMIR combined with either SIFT or α-AMD, and CycleGAN combined with α-AMD. On the Zurich dataset, CoMIR again performs best when combined with SIFT, and similarly well as MIND when combined with α-AMD. On the 3D Radiological dataset, CoMIR (combined with α-AMD) outperforms all I2I translation methods and demonstrates robustness to the initial displacement, confirming it as a reliable approach to multimodal registration in 3D as well. While performing on par with MIND for smaller displacements, it better handles the task of recovering large transformations. However, MI maximisation is the top performing approach for this dataset.

Fig 8E and 8F also include the performance of the CurveAlign method, specially designed for registration of the BF and SHG images in our Histological dataset. We observe that this method shows only limited success, and only for relatively small displacements, falling behind MI, MIND and CoMIR, as well as several I2I translation methods combined with α-AMD.

6.3 Correlation between modality translation and registration

Intuitively, the more successful the modality translation is, the more accurate the subsequent registration will be. Our results largely confirm this: the observed performance of most combinations of methods is consistent with the appearance of the image samples in Figs 4–7. To visualise this consistency, we plot, in Fig 10, the relation between the overall success rate λ, for each of the observed combinations of modality translation and registration, and the average quality of the used modality translation, as quantified by FID, for the four observed datasets. In Fig 10A, 10B and 10D, the error bars show the standard deviations over the three folds of the Zurich, Cytological and Radiological datasets.

Fig 10. Relation between average FID reached by a modality translation method and the success rate λ of the subsequent registration.


(A) On Zurich data, (B) on Cytological data, (C) on Histological data, (D) on Radiological data. In the legend, cyc, drit, p2p, star and comir denote the methods CycleGAN, DRIT++, pix2pix, StarGANv2 and CoMIR, respectively. Suffix _A (resp. _B) denotes that generated Modality A (resp. B) is used in (monomodal) registration. The marker style indicates whether α-AMD (aAMD) or SIFT (SIFT) is used for the registration. The error-bars correspond to standard deviation computed over 3 folds for Zurich, Cytological and Radiological data.

The results show a clear common trend: lower FID (higher quality of the generated image) generally correlates with a higher registration success rate. The large error bars, in particular for cyc_A in Fig 10A and for drit_B and star_A in Fig 10B, indicate that the corresponding GAN-based methods are highly unstable in modality translation. In spite of the relatively large error bar of comir in Fig 10D, possibly caused by the artefacts visible in Fig 7, CoMIR still reaches the lowest average FID value and the highest registration success rate (among the modality translation based approaches) on the Radiological dataset. Also here we observe that the difference in performance between the two directions of modality translation may be rather large. Whereas the two modalities in the Cytological dataset visually differ less than those in the Zurich and Histological datasets, our results (summarised in Fig 10B) clearly show that all four observed I2I translation methods are much more successful in generating Modality A than Modality B. This difference further propagates to registration, which is evidently more successful if based on generated Modality A, for both SIFT and α-AMD.
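The trend visualised in Fig 10 can also be summarised numerically, e.g., by a rank correlation between per-method FID and registration success rate. The sketch below shows one way to do this with SciPy; the arrays contain made-up placeholder numbers, not the values reported in Tables 1 and 2.

```python
import numpy as np
from scipy import stats

# Hypothetical per-method values: average FID of the modality translation and
# the overall success rate (%) of the subsequent monomodal registration.
# Placeholder numbers for illustration only, not the values reported above.
fid = np.array([230.0, 92.0, 183.0, 144.0, 94.0, 94.0, 166.0, 136.0, 18.0])
success_rate = np.array([5.0, 65.0, 35.0, 18.0, 80.0, 62.0, 64.0, 41.0, 92.0])

# Spearman rank correlation: a negative rho indicates that lower FID
# (better translation) tends to accompany a higher registration success rate.
rho, p_value = stats.spearmanr(fid, success_rate)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```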

6.4 Overview of the results

The presented results allow for the following summary:

  • The observed I2I translation methods, followed by monomodal registration approaches, prove applicable in less challenging multimodal registration scenarios, such as on the remote sensing (Zurich) dataset, where the modalities display a relatively high degree of similarity in structure and appearance. However, the methods exhibit high instability and data dependence. The four observed I2I translation methods, with different properties and of different complexity, show varying performance on the observed datasets, including asymmetry w.r.t. the direction of translation, without any of the methods standing out as a generally preferred choice.

  • The unstable performance of the I2I translation methods could be attributed to the images’ geometry not being well preserved in the translated output. This problem is observed to be alleviated by utilising alignment information during training (as in pix2pix) or by contrastive training (as in CoMIR).

  • Among the observed modality translation methods, pix2pix and CoMIR require aligned image pairs during training. These methods also show superior performance, in particular on the highly structured Zurich dataset, compared to the remaining three.

  • FID, as a measure of the performance of I2I translation methods, proves to be a reasonably reliable predictor of the success of subsequent monomodal registration; I2I translation methods which reach a lower FID, measured between the originally acquired and the modality-translated images, can be expected to provide a better basis for successful registration.

  • The traditional registration approach based on MI maximisation exhibits very good performance, outperforming all the observed I2I-based methods for sufficiently small displacements, and excels on the Cytological and Radiological data, where there are more distinct objects to be registered. The combination of modality translation and monomodal registration, on the other hand, shows better stability w.r.t. the size of the displacement on the more structure-rich Zurich and Histological datasets.

  • Several I2I translation methods followed by α-AMD outperform CurveAlign, a highly specialised method developed for registration of the images in our Histological dataset.

  • The novel combination of MIND and minimisation of α-AMD as an objective function outperformed the original combination of MIND and minimisation of MSD in all observed scenarios.

  • CoMIR, a representation-learning method developed for multimodal registration, here extended also to 3D data, exhibits the overall best performance among the modality translation approaches, showing stability w.r.t. the data, the size of the displacement, and the choice of the subsequent monomodal registration method.

7 Conclusion

In this study, we investigate whether and to what extent I2I translation methods may facilitate multimodal biomedical image registration. We focus on 2D and 3D rigid transformations, a task which is challenging enough while remaining highly relevant, in particular for biomedical multimodal image data. We have selected four popular and widely used I2I translation methods with diverse properties and complexity. We believe that this selection gives a good insight into the potential of modality translation as a general approach to aid multimodal registration. Openly available multimodal biomedical datasets suitable for the evaluation of registration methods are very scarce. We use three 2D datasets (two of them published in connection with this study) and one 3D dataset, of varying complexity and with different combinations of modalities, introducing a range of challenges relevant for performance evaluation.

From our experiments, we observe that I2I translation methods appear less successful in the context of multimodal registration than in some other relevant biomedical use-case scenarios (such as virtual staining or image segmentation). However, a representation-learning approach (CoMIR), which maps the modalities to their established (learned) “common ground” instead of mapping one of them all the way to the other, proves to be a highly promising approach. We expect that its demonstrated successful debut on 3D medical data will inspire further development and applications. We also note that maximisation of Mutual Information still provides good multimodal registration performance in many situations under our experimental settings. However, other studies have shown that the effectiveness of such traditional multimodal registration methods degrades in real-world situations [21, 61].

This comparative study adds to the understanding of multimodal biomedical image registration methods from an empirical perspective. Last but not least, it establishes an open-source quantitative evaluation framework for multimodal biomedical registration based on publicly available datasets, which can be easily utilised for benchmarking, whereby we also hope to contribute to the openness and reproducibility of future scientific research.

Supporting information

S1 Appendix. List of abbreviations (and method names) used in the paper.

(PDF)

Acknowledgments

We thank Kevin Eliceiri (University of Wisconsin-Madison) and his team for kindly providing the data used to create the Histological dataset. We thank Jaromír Gumulec (Masaryk University) and his collaborators for kindly providing the Cytological dataset. We thank Michele Volpi and his collaborators for kindly providing the Zurich Summer Dataset. The RIRE dataset (images and the standard transformations) was provided as part of the project “Retrospective Image Registration Evaluation”, National Institutes of Health, Project Number 8R01EB002124-03, PI J. Michael Fitzpatrick, Vanderbilt University, Nashville, TN.

Data Availability

All used datasets are openly shared on Zenodo: https://zenodo.org/record/5557568 (DOI: 10.5281/zenodo.5557568).

Funding Statement

The authors are financially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) AI-Math initiative (NS, JL), Vinnova, Sweden’s Innovation Agency (projects 2017-02447 and 2020-03611) (JL, NS, JO), and the Swedish Research Council (project 2017-04385) (JL). This article was facilitated by the COST Action COMULIS (CA17121), supported by COST (European Cooperation in Science and Technology) (NS, NL). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Walter A, Paul-Gilloteaux P, Plochberger B, Sefc L, Verkade P, Mannheim JG, et al. Correlated Multimodal Imaging in Life Sciences: Expanding the Biomedical Horizon. Front Phys. 2020;8:47. doi: 10.3389/fphy.2020.00047 [DOI] [Google Scholar]
  • 2. Brown LG. A Survey of Image Registration Techniques. ACM Comput Surv. 1992;24(4):325–376. doi: 10.1145/146370.146374 [DOI] [Google Scholar]
  • 3. Zitová B, Flusser J. Image Registration Methods: A Survey. Image and Vision Computing. 2003;21(11):977–1000. doi: 10.1016/S0262-8856(03)00137-9 [DOI] [Google Scholar]
  • 4. Viergever MA, Maintz JBA, Klein S, Murphy K, Staring M, Pluim JPW. A Survey of Medical Image Registration—under Review. Medical Image Analysis. 2016;33:140–144. doi: 10.1016/j.media.2016.06.030 [DOI] [PubMed] [Google Scholar]
  • 5. Hill DL, Batchelor PG, Holden M, Hawkes DJ. Medical Image Registration. Phys Med Biol. 2001;46(3):R1–45. doi: 10.1088/0031-9155/46/3/201 [DOI] [PubMed] [Google Scholar]
  • 6. Haskins G, Kruger U, Yan P. Deep Learning in Medical Image Registration: A Survey. Machine Vision and Applications. 2020;31(1):8. doi: 10.1007/s00138-020-01060-x [DOI] [Google Scholar]
  • 7. Fu Y, Lei Y, Wang T, Curran WJ, Liu T, Yang X. Deep Learning in Medical Image Registration: A Review. Phys Med Biol. 2020;65(20):20TR01. doi: 10.1088/1361-6560/ab843e [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Boveiri HR, Khayami R, Javidan R, Mehdizadeh A. Medical Image Registration Using Deep Neural Networks: A Comprehensive Review. Computers & Electrical Engineering. 2020;87:106767. doi: 10.1016/j.compeleceng.2020.106767 [DOI] [Google Scholar]
  • 9. Chen X, Diaz-Pinto A, Ravikumar N, Frangi A. Deep learning in medical image registration. Progress in Biomedical Engineering. 2021;3(1):012003. doi: 10.1088/2516-1091/abd37c [DOI] [Google Scholar]
  • 10.Sloan JM, Goatman KA, Siebert JP. Learning Rigid Image Registration—Utilizing Convolutional Neural Networks for Medical Image Registration:. In: Proceedings of the 11th International Joint Conference on Biomedical Engineering Systems and Technologies. Funchal, Madeira, Portugal: SCITEPRESS—Science and Technology Publications; 2018. p. 89–99.
  • 11. Siebert H, Hansen L, Heinrich MP. Learning a Metric for Multimodal Medical Image Registration without Supervision Based on Cycle Constraints. Sensors. 2022;22(3):1107. doi: 10.3390/s22031107 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Islam KT, Wijewickrema S, O’Leary S. A deep learning based framework for the registration of three dimensional multi-modal medical images of the head. Scientific Reports. 2021;11(1):1–13. doi: 10.1038/s41598-021-81044-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Isola P, Zhu JY, Zhou T, Efros AA. Image-To-Image Translation With Conditional Adversarial Networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. p. 1125–1134.
  • 14.Zhu JY, Park T, Isola P, Efros AA. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In: 2017 IEEE International Conference on Computer Vision (ICCV). Venice: IEEE; 2017. p. 2242–2251.
  • 15.Choi Y, Uh Y, Yoo J, Ha JW. StarGAN v2: Diverse Image Synthesis for Multiple Domains. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE; 2020. p. 8185–8194.
  • 16.Lee HY, Tseng HY, Huang JB, Singh M, Yang MH. Diverse Image-to-Image Translation via Disentangled Representations. In: Computer Vision—ECCV 2018. Cham: Springer International Publishing; 2018. p. 36–52.
  • 17. Lee HY, Tseng HY, Mao Q, Huang JB, Lu YD, Singh M, et al. DRIT++: Diverse Image-to-Image Translation via Disentangled Representations. Int J Comput Vis. 2020. doi: 10.1007/s11263-019-01284-z [DOI] [Google Scholar]
  • 18.Iglesias JE, Konukoglu E, Zikic D, Glocker B, Van Leemput K, Fischl B. Is Synthesizing MRI Contrast Useful for Inter-modality Analysis? In: Medical Image Computing and Computer-Assisted Intervention—MICCAI 2013. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer; 2013. p. 631–638. [DOI] [PMC free article] [PubMed]
  • 19.Bogovic JA, Hanslovsky P, Wong A, Saalfeld S. Robust Registration of Calcium Images by Learned Contrast Synthesis. In: 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI); 2016. p. 1123–1126.
  • 20. Chen M, Carass A, Jog A, Lee J, Roy S, Prince JL. Cross Contrast Multi-Channel Image Registration Using Image Synthesis for MR Brain Images. Medical Image Analysis. 2017;36:2–14. doi: 10.1016/j.media.2016.10.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Qin C, Shi B, Liao R, Mansi T, Rueckert D, Kamen A. Unsupervised Deformable Registration for Multi-Modal Images via Disentangled Representations. In: Information Processing in Medical Imaging. Lecture Notes in Computer Science. Springer International Publishing; 2019. p. 249–261.
  • 22. Tschuchnig ME, Oostingh GJ, Gadermayr M. Generative Adversarial Networks in Digital Pathology: A Survey on Trends and Future Potential. Patterns. 2020;1(6):100089. doi: 10.1016/j.patter.2020.100089 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Meijering E. A Bird’s-Eye View of Deep Learning in Bioimage Analysis. Comput Struct Biotechnol J. 2020;18:2312–2325. doi: 10.1016/j.csbj.2020.08.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Wells WM III, Viola P, Atsumi H, Nakajima S, Kikinis R. Multi-Modal Volume Registration by Maximization of Mutual Information. Medical image analysis. 1996;1(1):35–51. doi: 10.1016/S1361-8415(01)80004-9 [DOI] [PubMed] [Google Scholar]
  • 25. Maes F, Collignon A, Vandermeulen D, Marchal G, Suetens P. Multimodality Image Registration by Maximization of Mutual Information. IEEE transactions on Medical Imaging. 1997;16(2):187–198. doi: 10.1109/42.563664 [DOI] [PubMed] [Google Scholar]
  • 26. Heinrich MP, Jenkinson M, Bhushan M, Matin T, Gleeson FV, Brady SM, et al. MIND: Modality Independent Neighbourhood Descriptor for Multi-Modal Deformable Registration. Med Image Anal. 2012;16(7):1423–1435. doi: 10.1016/j.media.2012.05.008 [DOI] [PubMed] [Google Scholar]
  • 27.Haber E, Modersitzki J. Intensity gradient based registration and fusion of multi-modal images. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2006. p. 726–733. [DOI] [PubMed]
  • 28. Keikhosravi A, Li B, Liu Y, Eliceiri KW. Intensity-Based Registration of Bright-Field and Second-Harmonic Generation Images of Histopathology Tissue Sections. Biomed Opt Express. 2020;11(1):160. doi: 10.1364/BOE.11.000160 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Pielawski N, Wetzer E, Öfverstedt J, Lu J, Wählby C, Lindblad J, et al. CoMIR: Contrastive Multimodal Image Representation for Registration. In: Advances in Neural Information Processing Systems. vol. 33. Curran Associates, Inc.; 2020. p. 18433–18444.
  • 30. Studholme C, Hill DL, Hawkes DJ. An Overlap Invariant Entropy Measure of 3D Medical Image Alignment. Pattern recognition. 1999;32(1):71–86. doi: 10.1016/S0031-3203(98)00091-0 [DOI] [Google Scholar]
  • 31.Pluim JP, Maintz JA, Viergever MA. Image Registration by Maximization of Combined Mutual Information and Gradient Information. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2000. p. 452–461. [DOI] [PubMed]
  • 32. Öfverstedt J, Lindblad J, Sladoje N. Fast and Robust Symmetric Image Registration Based on Distances Combining Intensity and Spatial Information. IEEE Trans on Image Process. 2019;28(7):3584–3597. doi: 10.1109/TIP.2019.2899947 [DOI] [PubMed] [Google Scholar]
  • 33. Yang X, Kwitt R, Styner M, Niethammer M. Quicksilver: Fast Predictive Image Registration—A Deep Learning Approach. NeuroImage. 2017;158:378–396. doi: 10.1016/j.neuroimage.2017.07.008 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Cheng X, Zhang L, Zheng Y. Deep Similarity Learning for Multimodal Medical Images. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization. 2018;6(3):248–252. doi: 10.1080/21681163.2015.1135299 [DOI] [Google Scholar]
  • 35. Balakrishnan G, Zhao A, Sabuncu MR, Guttag J, Dalca AV. VoxelMorph: A Learning Framework for Deformable Medical Image Registration. IEEE Transactions on Medical Imaging. 2019;38(8):1788–1800. doi: 10.1109/TMI.2019.2897538 [DOI] [PubMed] [Google Scholar]
  • 36.Lowe DG. Object Recognition from Local Scale-Invariant Features. In: Proceedings of the Seventh IEEE International Conference on Computer Vision. vol. 2; 1999. p. 1150–1157 vol.2.
  • 37.Rublee E, Rabaud V, Konolige K, Bradski G. ORB: An Efficient Alternative to SIFT or SURF. In: 2011 International Conference on Computer Vision. Barcelona, Spain: IEEE; 2011. p. 2564–2571.
  • 38.Wu G, Kim M, Wang Q, Gao Y, Liao S, Shen D. Unsupervised Deep Feature Learning for Deformable Registration of MR Brain Images. In: Medical Image Computing and Computer-Assisted Intervention—MICCAI 2013. Lecture Notes in Computer Science. Springer Berlin Heidelberg; 2013. p. 649–656. [DOI] [PMC free article] [PubMed]
  • 39.Dosovitskiy A, Springenberg JT, Riedmiller M, Brox T. Discriminative Unsupervised Feature Learning with Convolutional Neural Networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems—Volume 1. NIPS’14. Cambridge, MA, USA: MIT Press; 2014. p. 766–774.
  • 40. Cao X, Yang J, Gao Y, Wang Q, Shen D. Region-Adaptive Deformable Registration of CT/MRI Pelvic Images via Learning-Based Image Synthesis. IEEE Transactions on Image Processing. 2018;27(7):3500–3512. doi: 10.1109/TIP.2018.2820424 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative Adversarial Nets. In: Proceedings of the 27th International Conference on Neural Information Processing Systems—Volume 2. NIPS’14. Cambridge, MA, USA: MIT Press; 2014. p. 2672–2680.
  • 42. Rivenson Y, Wang H, Wei Z, de Haan K, Zhang Y, Wu Y, et al. Virtual Histological Staining of Unlabelled Tissue-Autofluorescence Images via Deep Learning. Nat Biomed Eng. 2019;3(6):466–477. doi: 10.1038/s41551-019-0362-y [DOI] [PubMed] [Google Scholar]
  • 43.de Bel T, Hermsen M, Kers J, van der Laak J, Litjens G. Stain-Transforming Cycle-Consistent Generative Adversarial Networks for Improved Segmentation of Renal Histopathology. In: Proceedings of the 2nd International Conference on Medical Imaging with Deep Learning. vol. 102 of Proceedings of Machine Learning Research. London, United Kingdom: PMLR; 2019. p. 151–163.
  • 44.Burlingame EA, Margolin AA, Gray JW, Chang YH. SHIFT: Speedy Histopathological-to-Immunofluorescent Translation of Whole Slide Images Using Conditional Generative Adversarial Networks. In: Medical Imaging 2018: Digital Pathology. vol. 10581. International Society for Optics and Photonics; 2018. p. 1058105. [DOI] [PMC free article] [PubMed]
  • 45.Siddiquee MMR, Zhou Z, Tajbakhsh N, Feng R, Gotway M, Bengio Y, et al. Learning Fixed Points in Generative Adversarial Networks: From Image-to-Image Translation to Disease Detection and Localization. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE; 2019. p. 191–200. [DOI] [PMC free article] [PubMed]
  • 46.Armanious K, Jiang C, Abdulatif S, Kustner T, Gatidis S, Yang B. Unsupervised Medical Image Translation Using Cycle-MedGAN. In: 2019 27th European Signal Processing Conference (EUSIPCO). A Coruna, Spain: IEEE; 2019. p. 1–5.
  • 47.Engelhardt S, De Simone R, Full PM, Karck M, Wolf I. Improving Surgical Training Phantoms by Hyperrealism: Deep Unpaired Image-to-Image Translation from Real Surgeries. In: Medical Image Computing and Computer Assisted Intervention—MICCAI 2018. vol. 11070. Cham: Springer International Publishing; 2018. p. 747–755.
  • 48. Cirillo MD, Abramian D, Eklund A. Vox2Vox: 3D-GAN for Brain Tumour Segmentation. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. vol. 12658. Cham: Springer International Publishing; 2021. p. 274–284. [Google Scholar]
  • 49. Yang Q, Li N, Zhao Z, Fan X, Chang EIC, Xu Y. MRI Cross-Modality Image-to-Image Translation. Sci Rep. 2020;10(1):3753. doi: 10.1038/s41598-020-60520-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Dorent R, Kujawa A, Ivory M, Bakas S, Rieke N, Joutard S, et al. CrossMoDA 2021 challenge: Benchmark of Cross-Modality Domain Adaptation techniques for Vestibular Schwnannoma and Cochlea Segmentation. arXiv preprint arXiv:220102831. 2022. [DOI] [PMC free article] [PubMed]
  • 51. Alotaibi A. Deep Generative Adversarial Networks for Image-to-Image Translation: A Review. Symmetry. 2020;12(10):1705. doi: 10.3390/sym12101705 [DOI] [Google Scholar]
  • 52. Cao X, Yang J, Gao Y, Guo Y, Wu G, Shen D. Dual-Core Steered Non-Rigid Registration for Multi-Modal Images via Bi-Directional Image Synthesis. Medical image analysis. 2017;41:18–31. doi: 10.1016/j.media.2017.05.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Arar M, Ginger Y, Danon D, Bermano AH, Cohen-Or D. Unsupervised Multi-Modal Image Registration via Geometry Preserving Image-to-Image Translation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE; 2020. p. 13407–13416.
  • 54. Casamitjana A, Mancini M, Iglesias JE. Synth-by-Reg (SbR): Contrastive Learning for Synthesis-Based Registration of Paired Images. In: Simulation and Synthesis in Medical Imaging. vol. 12965. Cham: Springer International Publishing; 2021. p. 44–54. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Mahapatra D, Antony B, Sedai S, Garnavi R. Deformable Medical Image Registration Using Generative Adversarial Networks. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). Washington, DC: IEEE; 2018. p. 1449–1453.
  • 56.Sheikhjafari A, Noga M, Punithakumar K, Ray N. Unsupervised Deformable Image Registration with Fully Connected Generative Neural Network. In: International Conference on Medical Imaging with Deep Learning (MIDL). Amsterdam, The Netherlands; 2018.
  • 57. Kaji S, Kida S. Overview of Image-to-Image Translation by Use of Deep Neural Networks: Denoising, Super-Resolution, Modality Conversion, and Reconstruction in Medical Imaging. Radiol Phys Technol. 2019;12(3):235–248. doi: 10.1007/s12194-019-00520-y [DOI] [PubMed] [Google Scholar]
  • 58.Pang Y, Lin J, Qin T, Chen Z. Image-to-Image Translation: Methods and Applications. arXiv:210108629 [cs]. 2021.
  • 59.Dalca AV, Balakrishnan G, Guttag J, Sabuncu MR. Unsupervised Learning for Fast Probabilistic Diffeomorphic Registration. In: Medical Image Computing and Computer Assisted Intervention—MICCAI 2018. Lecture Notes in Computer Science. Cham: Springer International Publishing; 2018. p. 729–738.
  • 60. de Vos BD, Berendsen FF, Viergever MA, Sokooti H, Staring M, Išgum I. A Deep Learning Framework for Unsupervised Affine and Deformable Image Registration. Medical Image Analysis. 2019;52:128–143. doi: 10.1016/j.media.2018.11.010 [DOI] [PubMed] [Google Scholar]
  • 61.Jiang H, Yu H, Zhou X, Kang H, Wang Z, Hara T, et al. Learning 3D Non-Rigid Deformation Based on an Unsupervised Deep Learning for PET/CT Image Registration. In: Medical Imaging 2019: Biomedical Applications in Molecular, Structural, and Functional Imaging. San Diego, United States: SPIE; 2019. p. 67.
  • 62.Bastiaansen WA, Rousian M, Steegers-Theunissen RP, Niessen WJ, Koning A, Klein S. Towards segmentation and spatial alignment of the human embryonic brain using deep learning for atlas-based registration. In: International Workshop on Biomedical Image Registration. Springer; 2020. p. 34–43.
  • 63.Mirza M, Osindero S. Conditional Generative Adversarial Nets. arXiv preprint arXiv:14111784. 2014.
  • 64.Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015. Lecture Notes in Computer Science. Springer International Publishing; 2015. p. 234–241.
  • 65.Wang TC, Liu MY, Zhu JY, Tao A, Kautz J, Catanzaro B. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE; 2018. p. 8798–8807.
  • 66.Zhang P, Zhang B, Chen D, Yuan L, Wen F. Cross-Domain Correspondence Learning for Exemplar-Based Image Translation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE; 2020. p. 5142–5152.
  • 67.Taigman Y, Polyak A, Wolf L. Unsupervised Cross-Domain Image Generation. arXiv:161102200 [cs]. 2016.
  • 68.Kim T, Cha M, Kim H, Lee JK, Kim J. Learning to Discover Cross-Domain Relations with Generative Adversarial Networks. In: International Conference on Machine Learning. PMLR; 2017. p. 1857–1865.
  • 69.Yi Z, Zhang H, Tan P, Gong M. DualGAN: Unsupervised Dual Learning for Image-to-Image Translation. In: 2017 IEEE International Conference on Computer Vision (ICCV). Venice: IEEE; 2017. p. 2868–2876.
  • 70. de Bel T, Bokhorst JM, van der Laak J, Litjens G. Residual Cyclegan for Robust Domain Transformation of Histopathological Tissue Slides. Medical Image Analysis. 2021;70:102004. doi: 10.1016/j.media.2021.102004 [DOI] [PubMed] [Google Scholar]
  • 71.Liu MY, Breuel T, Kautz J. Unsupervised Image-to-Image Translation Networks. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17. Long Beach, California, USA: Curran Associates Inc.; 2017. p. 700–708.
  • 72.Mao Q, Lee HY, Tseng HY, Ma S, Yang MH. Mode Seeking Generative Adversarial Networks for Diverse Image Synthesis. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE; 2019. p. 1429–1437.
  • 73.Hjelm RD, Fedorov A, Lavoie-Marchildon S, Grewal K, Bachman P, Trischler A, et al. Learning Deep Representations by Mutual Information Estimation and Maximization. In: International Conference on Learning Representations; 2018.
  • 74. Lindblad J, Sladoje N. Linear Time Distances Between Fuzzy Sets With Applications to Pattern Matching and Classification. IEEE Trans on Image Process. 2014;23(1):126–136. doi: 10.1109/TIP.2013.2286904 [DOI] [PubMed] [Google Scholar]
  • 75. Solorzano L, Pereira C, Martins D, Almeida R, Carneiro F, Almeida GM, et al. Towards Automatic Protein Co-Expression Quantification in Immunohistochemical TMA Slides. IEEE J Biomed Health Inform. 2021;25(2):393–402. doi: 10.1109/JBHI.2020.3008821 [DOI] [PubMed] [Google Scholar]
  • 76.Volpi M, Ferrari V. Semantic Segmentation of Urban Scenes by Learning Local Class Interactions. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); 2015. p. 1–9.
  • 77.Volpi M, Ferrari V. Zurich Summer Dataset; 2022. Available from: 10.5281/zenodo.5914759. [DOI]
  • 78.Vicar T, Raudenska M, Gumulec J, Masarik M, Balvan J. Quantitative Phase Microscopy Timelapse Dataset of PNT1A, DU-145 and LNCaP Cells with Annotated Caspase 3,7-Dependent and Independent Cell Death; 2019. Available from: 10.5281/zenodo.2601562. [DOI]
  • 79.Vicar T, Raudenska M, Gumulec J, Masarik M, Balvan J. Fluorescence Microscopy Timelapse Dataset of PNT1A, DU-145 and LNCaP Cells with Annotated Caspase 3,7-Dependent and Independent Cell Death; 2021. Available from: 10.5281/zenodo.4531900. [DOI]
  • 80. Vicar T, Raudenska M, Gumulec J, Balvan J. The Quantitative-Phase Dynamics of Apoptosis and Lytic Cell Death. Scientific Reports. 2020;10(1):1–12. doi: 10.1038/s41598-020-58474-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Eliceiri K, Li B, Keikhosravi A. Multimodal Biomedical Dataset for Evaluating Registration Methods (Full-Size TMA Cores); 2021. Available from: 10.5281/zenodo.4550300. [DOI]
  • 82. West J, Fitzpatrick JM, Wang MY, Dawant BM, Maurer CR Jr, Kessler RM, et al. Comparison and evaluation of retrospective intermodality brain image registration techniques. Journal of computer assisted tomography. 1997;21(4):554–568. doi: 10.1097/00004728-199707000-00007 [DOI] [PubMed] [Google Scholar]
  • 83.Fitzpatrick JM. The RIRE Dataset;. Available from: https://www.insight-journal.org/rire/download_data.php.
  • 84.Marstal K, Berendsen F, Staring M, Klein S. SimpleElastix: A User-Friendly, Multi-Lingual Library for Medical Image Registration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; 2016. p. 134–142.
  • 85. Klein S, Pluim JPW, Staring M, Viergever MA. Adaptive Stochastic Gradient Descent Optimisation for Image Registration. Int J Comput Vis. 2008;81(3):227. doi: 10.1007/s11263-008-0168-y [DOI] [Google Scholar]
  • 86.Sandkühler R, Jud C, Andermatt S, Cattin PC. Airlab: Autograd image registration laboratory. arXiv preprint arXiv:180609907. 2018.
  • 87. Bradski G. The OpenCV Library. Dr Dobb’s Journal of Software Tools. 2000. [Google Scholar]
  • 88. Fischler MA, Bolles RC. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Commun ACM. 1981;24(6):381–395. doi: 10.1145/358669.358692 [DOI] [Google Scholar]
  • 89. van der Walt S, Schönberger JL, Nunez-Iglesias J, Boulogne F, Warner JD, Yager N, et al. Scikit-Image: Image Processing in Python. PeerJ. 2014;2:e453. doi: 10.7717/peerj.453 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90.Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In: Advances in Neural Information Processing Systems. vol. 30; 2017. p. 6626–6637.
  • 91.Seitzer M. pytorch-fid: FID Score for PyTorch; 2020. https://github.com/mseitzer/pytorch-fid.
  • 92.Lu J, Öfverstedt J, Lindblad J, Sladoje N. Datasets for Evaluation of Multimodal Image Registration; 2021. Available from: 10.5281/zenodo.5557568. [DOI]

Decision Letter 0

Dzung Pham

8 Jul 2022

PONE-D-22-07182
Is image-to-image translation the panacea for multimodal image registration? A comparative study
PLOS ONE

Dear Dr. Lindblad,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. In particular, both reviewers raised the need to cite additional references, as well as concerns about how some of the evaluation experiments were conducted.

Please submit your revised manuscript by Aug 20 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Dzung Pham

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following in the Acknowledgments Section of your manuscript: 

The authors are financially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) AI-Math initiative, VINNOVA (MedTech4Health project 2017-02447) and the Swedish Research Council (project 2017-04385).

However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. 

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: 

The authors are financially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) AI-Math initiative, VINNOVA (MedTech4Health, project 2017-02447) and the Swedish Research Council (project 2017-04385). 

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

3. We note that Figures 2 and 4 in your submission contain satellite images which may be copyrighted. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For these reasons, we cannot publish previously copyrighted maps or satellite images created using proprietary data, such as Google software (Google Maps, Street View, and Earth). For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

a. You may seek permission from the original copyright holder of Figures 2 and 4 to publish the content specifically under the CC BY 4.0 license.  

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

The following resources for replacing copyrighted map figures may be helpful:

USGS National Map Viewer (public domain): http://viewer.nationalmap.gov/viewer/

The Gateway to Astronaut Photography of Earth (public domain): http://eol.jsc.nasa.gov/sseop/clickmap/

Maps at the CIA (public domain): https://www.cia.gov/library/publications/the-world-factbook/index.html and https://www.cia.gov/library/publications/cia-maps-publications/index.html

NASA Earth Observatory (public domain): http://earthobservatory.nasa.gov/

Landsat: http://landsat.visibleearth.nasa.gov/

USGS EROS (Earth Resources Observatory and Science (EROS) Center) (public domain): http://eros.usgs.gov/#

Natural Earth (public domain): http://www.naturalearthdata.com/


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors present an empirical study evaluating the efficacy of image translation (I2I) techniques for the purpose of multimodal registration. The study evaluates 4 recent deep learning based I2I techniques which are used with mono-modal registrations (SIFT+RANSAC and a-AMD), and compared against four standard multi-modal methods as benchmarks. The methods were evaluated over 4 distinct biomedical imaging datasets, using Frechet Inception Distance (FID) to evaluate the image translation, and displacement accuracy to evaluate the registration.

Overall, I am impressed by this work. The authors were thorough in their analysis and explored a broad sample of techniques. The paper was also well written and reads very clearly. However, I have several concerns/questions I hope the authors can address:

1.) The introduction/background seems to be missing a discussion of image translation-based registration techniques that pre-dates the advent of deep learning in the field. Back then this was often referred to as image synthesis. Here are a few examples of these techniques, although many more were proposed in the literature:

Bogovic, J.A., Hanslovsky, P., Wong, A. and Saalfeld, S., 2016, April. Robust registration of calcium images by learned contrast synthesis. In 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI) (pp. 1123-1126). IEEE.

Chen, M., Carass, A., Jog, A., Lee, J., Roy, S. and Prince, J.L., 2017. Cross contrast multi-channel image registration using image synthesis for MR brain images. Medical image analysis, 36, pp.2-14.

Cao, X., Yang, J., Gao, Y., Wang, Q. and Shen, D., 2018. Region-adaptive deformable registration of CT/MRI pelvic images via learning-based image synthesis. IEEE Transactions on Image Processing, 27(7), pp.3500-3512.

2.) One topic that is rarely discussed in this area, which I hope the authors can touch upon, is the stochastic nature of deep learning based I2I techniques. That is, given the exact same training dataset, it is difficult to rebuild a model to provide the exact same image translation. While this problem isn’t unique to I2I models, it seems to have less of an impact on segmentation or classification networks, where the model and results seem to converge quickly. In our experience, this issue gets amplified when further using the I2I model results as inputs for downstream registration, which seems to be very sensitive to this variance. Have the authors explored this limitation of these techniques? And how much impact do they think it has on their results? Are 3 folds sufficient to cover this variance?

3.) I am somewhat confused by the structuring of the comparisons reported in the experimental results. It seems the logical evaluation here would be to use each registration method with/without the image translation results as inputs, and swapping multimodal cost functions (MIND, MI, etc.) for monomodal ones (SSD, NCC, etc.). That way we can see the direct improvement that translating the modality has on the registration. However, this was only done for a-AMD and SIFT, and not fully for SIFT, since it wasn’t presented with a multimodal equivalent. The other methods were compared on just the raw images. This makes it difficult to tell if the losses/gains between the methods were due to the image translation, or differences between the registration techniques themselves. For example, we know that Elastix is heavily tuned to perform well on MR images, which would explain the 100% registration accuracy on the MR dataset. However, that high performance is not necessarily because it used MI as a metric; there’s a good chance there are other heuristics built into the registration model that greatly help with the alignment. A more informative test would be to switch Elastix to using SSD instead of MI, use the I2I results as inputs, and see what happens. (And likewise for the other methods.)

4.) The authors mention that in Table 1, we can see some asymmetry depending on the translation direction. How do we know this difference is due to asymmetric performance of the image translation, and not a property of the different modalities? I.e., is it possible the FID measure is just higher/lower in general for one modality over the other? Or is it normalized somehow to each modality?

5) In the conclusion, the authors make some strong statements about the efficacy of I2I based registration approaches relative to standard approaches (such as MI) based on their results. However, I think it is important to note that in this work, only intra-subject rigid alignment using simulated displacements was considered. This is a fairly limited registration task, and is not a great representation of the majority of real-world registration problems. Other studies have shown that once we start working with more complex transformations, cross subject/specimen data, or temporal shifts in the image, the effectiveness of traditional multi-modal measures such as MI starts falling dramatically, which is the driving motivation for many of the techniques cited by the authors.

Reviewer #2: This work tackles the well-known problem of multimodal registration, which lacks proper definitions for similarity functions between aligned images. Recently, due to the advent of many image-to-image translation methods, several methods propose to cast this multimodal registration problem as a monomodal problem, where several similarity functions have proved successful, by means of synthesising source domain images following the appearance of the target domain images.

The authors provide a nice trade-off between some of the standard metrics designed for multimodal registration and some methods using the intermediate I2I step with a monomodal similarity metric. Moreover, they add their recently proposed method CoMIR, which in my opinion could also be thought of as an I2I translation, but to a latent domain between the source and target domains.

The methods, datasets and experiments are well presented in the manuscript, even though sometimes it may be confusing due to the wealth of results and comparisons. I highlight the fact that the methods have been tested in 4 different medical imaging datasets.

However, I have some major comments and some minor comments that can be found in the attached PDF.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Adrià Casamitjana

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: reviewer-comments.pdf

PLoS One. 2022 Nov 28;17(11):e0276196. doi: 10.1371/journal.pone.0276196.r002

Author response to Decision Letter 0


31 Aug 2022

Dear Dr Dzung Pham,

Thank you for giving serious consideration to our submission concerning image-to-image translation in the context of multimodal image registration. We are grateful to the associate editor and the two expert reviewers who have read and evaluated the merits of our manuscript. Following their recommendations, we have modified our study by:

adding new reference results in Table 1;

adding discussions and citations of several related works;

modifying the presentation of our results.

Furthermore, we have thoroughly revised and updated the text, to further improve its informative content as well as its clarity, incorporating constructive and highly appreciated suggestions given by the reviewers. For clarity, all important changes are highlighted in yellow in the manuscript.

Please find enclosed our point-by-point responses to each of the comments raised by the reviewers and the editor.

Sincerely,

Jiahao Lu, Johan Öfverstedt, Joakim Lindblad, Nataša Sladoje

To Journal Requirements

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

Authors’ response:

We have modified our style according to the provided guidance. Please let us know if we missed anything, and we will do our best to make further modifications to meet the required style.

2. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“The authors are financially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) AI-Math initiative, VINNOVA (MedTech4Health project 2017-02447) and the Swedish Research Council (project 2017-04385).”

However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

The authors are financially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) AI-Math initiative, VINNOVA (MedTech4Health, project 2017-02447) and the Swedish Research Council (project 2017-04385).

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

Authors’ response:

We have removed the funding information from our Acknowledgements Section. We are satisfied with the current Funding Statement and do not need to update it.

3. We note that Figures 2 and 4 in your submission contain satellite images which may be copyrighted. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For these reasons, we cannot publish previously copyrighted maps or satellite images created using proprietary data, such as Google software (Google Maps, Street View, and Earth). For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

Authors’ response:

We have replaced the images included in Figures 2 and 4 with the new source published under CC BY 4.0 (https://doi.org/10.5281/zenodo.5914758), and updated our citation accordingly.

To Reviewer #1

The authors present an empirical study evaluating the efficacy of image translation (I2I) techniques for the purpose of multimodal registration. The study evaluates 4 recent deep learning based I2I techniques which are used with mono-modal registration methods (SIFT+RANSAC and a-AMD), and compared against four standard multi-modal methods as benchmarks. The methods were evaluated over 4 distinct biomedical imaging datasets, using Frechet Inception Distance (FID) to evaluate the image translation, and displacement accuracy to evaluate the registration.

Overall, I am impressed by this work. The authors were thorough in their analysis and explored a broad sample of techniques. The paper was also well written and reads very clearly. However, I have several concerns/questions I hope the authors can address:

Authors’ response:

We thank the reviewer for recognising the contribution of our study and its relevance, as well as for the constructive comments and suggestions. We have no doubts that, by addressing them, we further increased the informative content and the overall quality of the manuscript.

1.) The introduction/background seems to be missing a discussion of image translation-based registration techniques that pre-date the advent of deep learning in the field. Back then this was often referred to as image synthesis. Here are a few examples of these techniques, although many more were proposed in the literature:

Bogovic, J.A., Hanslovsky, P., Wong, A. and Saalfeld, S., 2016, April. Robust registration of calcium images by learned contrast synthesis. In 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI) (pp. 1123-1126). IEEE.

Chen, M., Carass, A., Jog, A., Lee, J., Roy, S. and Prince, J.L., 2017. Cross contrast multi-channel image registration using image synthesis for MR brain images. Medical image analysis, 36, pp.2-14.

Cao, X., Yang, J., Gao, Y., Wang, Q. and Shen, D., 2018. Region-adaptive deformable registration of CT/MRI pelvic images via learning-based image synthesis. IEEE Transactions on Image Processing, 27(7), pp.3500-3512.

Authors’ response:

We find the reviewer’s suggestion very relevant. Therefore, we added discussions in both Section 1 (Introduction) and Section 2 (Background and related work), with all the mentioned works cited.

2.) One topic that is rarely discussed in this area, which I hope the authors can touch upon, is the stochastic nature of deep learning based I2I techniques. That is, given the exact same training dataset, it is difficult to rebuild a model to provide the exact same image translation. While this problem isn’t unique to I2I models, it seems to have less of an impact on segmentation or classification networks, where the model and results seem to converge quickly. In our experience, this issue gets amplified when further using the I2I model results as inputs for downstream registration, which seems to be very sensitive to this variance. Have the authors explored this limitation of these techniques? And how much impact do they think it has on their results? Are 3 folds sufficient to cover this variance?

Authors’ response:

We appreciate the reviewer’s interest in this topic. We find it relevant, but somewhat out of the scope of our study; we see this as a part of the analysis and evaluation of each particular I2I method. To keep the focus of the study, we prefer to not further elaborate on the topic in the paper, but instead leave it to any interested reader to find further details about the used methods in the original publications, where the findings of their authors are reported. Our observations related to the reviewer’s question are summarised below:

We agree with the reviewer that this topic is not widely discussed, but there are existing works touching upon this issue. In DRIT++[1] (one of the I2I-methods evaluated in our study), the authors’ reported variance values in their Table 1 are not so large, for any of their 4 evaluated metrics, as to expect a big impact on the downstream registration performance. Another cited work[2], which also uses I2I methods prior to deformable multimodal registration, reported variance values of 6 metrics on 2 datasets. It can be observed from their Tables 1 and 2 that the variance values of GAN-based I2I methods are close to, or smaller than, the ones of MIND and Elastix.

In our experiments, we also do not observe a large variance when training the I2I methods with exactly the same data. Do note that our variance values are not computed from multiple random runs given identical training data; the point of our 3-fold validation is to expose the variance when training with different data samples, which we find more relevant for real usage. What we observe and report on is, indeed as the reviewer mentioned, that the GAN-based I2I methods exhibit instability and strong data dependence.

References:

[1] H.-Y. Lee et al., “DRIT++: Diverse Image-to-Image Translation via Disentangled Representations,” Int J Comput Vis, Feb. 2020.

[2] C. Qin et al., “Unsupervised Deformable Registration for Multi-modal Images via Disentangled Representations,” in Information Processing in Medical Imaging, 2019, pp. 249–261.

3.) I am somewhat confused by the structuring of the comparisons reported in the experimental results. It seems the logical evaluation here would be to use each registration method with/without the image translation results as inputs, and swapping multimodal cost functions (MIND, MI, etc.) for monomodal ones (SSD, NCC, etc.). That way we can see the direct improvement that translating the modality has on the registration. However, this was only done for a-AMD and SIFT, and not fully for SIFT, since it wasn’t presented with a multimodal equivalent. The other methods were compared on just the raw images. This makes it difficult to tell if the losses/gains between the methods were due to the image translation, or differences between the registration techniques themselves. For example, we know that Elastix is heavily tuned to perform well on MR images, which would explain the 100% registration accuracy on the MR dataset. However, that high performance is not necessarily because it used MI as a metric; there’s a good chance there are other heuristics built into the registration model that greatly help with the alignment. A more informative test would be to switch Elastix to using SSD instead of MI, use the I2I results as inputs, and see what happens. (And likewise for the other methods.)

Authors’ response:

We thank the reviewer for pointing out these potential sources of ambiguity. Our evaluation strategy indeed does not exactly follow the structure envisioned by the reviewer, but in our opinion it still offers a possibility to attribute changes in performance to the I2I methods, separating them from the role of the particular registration approaches used. One important reason for not following the evaluation structure suggested by the reviewer is that a simple adjustment of the corresponding settings for monomodal and multimodal cases (such as changing the metric from MI in a multimodal case to SSD in a monomodal case) is not possible for the majority of the used registration frameworks. The registration framework based on α-AMD is tailored for a particular distance measure (α-AMD), and does not support the direct use of other similarity/distance measures; it is shown in previous studies [1,2] that this particular distance measure has clear advantages over several alternatives (including both SSD and MI) in the monomodal registration context. Furthermore, MIND cannot be seen as a multimodal cost function (as formulated in the reviewer’s question); rather, the framework involves the generation of image representations (similarly to CoMIR, but without learning) prior to applying a selected metric on these representations to perform registration; in that sense, a discussion on the usage of multimodal and monomodal metrics, in the suggested way, does not appear appropriate. MIND can, however, be used in combination with different distance metrics, and we here evaluate both α-AMD and (the default) SSD/MSD for this task, where the former clearly outperforms the latter.

Based on the above, we have designed the evaluation to present the performance of a range of I2I-style methods in combination with (subsequently applied) different registration methods, as well as including several different types of baselines, both monomodal and multimodal; we believe that by doing so we managed to highlight, and support, the relevant and important results of the study.

Let us also add that we agree with the reviewer that Elastix might be well optimised for a particular type of data (medical/brain images), but as can be observed from our results, the good performance of MI-based registration holds also for the Cytological and partly for the Histological data (where MI still clearly outperforms all of the I2I-based paths, but at the same time stays far behind the best performing CoMIR combined with α-AMD). Therefore we feel confident in our conclusion that MI as such remains a good option for general multimodal registration.
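For readers less familiar with the MI criterion referred to above, the following minimal NumPy sketch shows one way of estimating MI between two overlapping images from their joint intensity histogram. It is an illustration only (the function name is ours) and not the MI implementation used by any of the registration frameworks evaluated in the study; an MI-maximising registration then searches the transform parameters for the pose that maximises this score.

```python
import numpy as np

def mutual_information(img_a: np.ndarray, img_b: np.ndarray, bins: int = 32) -> float:
    """Histogram-based estimate of the mutual information between two aligned images."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()              # joint intensity distribution
    px = pxy.sum(axis=1, keepdims=True)    # marginal of img_a
    py = pxy.sum(axis=0, keepdims=True)    # marginal of img_b
    nz = pxy > 0                           # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px * py)[nz])))
```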

To respond to the reviewer’s question and the interesting initiated discussion, we below, as well as in the paper, try to further clarify our evaluation strategy:

We set B2A as the baseline, which uses α-AMD and SIFT, respectively, to register the raw multimodal images, without intermediate image translation or representations; a minimal sketch of the SIFT-based path is given right after this list. (Neither of the two registration frameworks differentiates the settings for registration of multimodal vs monomodal data, and both are primarily designed with monomodal data in focus.) The direct effect of modality translation can be observed by comparing the I2I methods’ performance with this baseline.

We include a comparison with registration frameworks based on MI maximisation, MIND (with two different similarity metrics), and NGF similarity maximisation, in order to provide, for each dataset, reference performance of generally well-performing conventional multimodal registration methods, each in a form that can be used out-of-the-box by end users.

We have somewhat re-formulated our presentation of results in Table 2, aiming for a clearer comparison of the performance with/without modality translation.
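To make the feature-based path referred to above more concrete, the following is a minimal OpenCV sketch of SIFT matching followed by RANSAC-based transform estimation (the helper name is ours). It illustrates the general approach only and is not the exact implementation used in our experiments, which is available in the shared repository; note that estimateAffinePartial2D fits a similarity transform (rotation, translation, uniform scale), which contains the rigid case.

```python
import cv2
import numpy as np

def sift_ransac_rigid(fixed: np.ndarray, moving: np.ndarray):
    """Estimate a transform aligning `moving` to `fixed` from matched SIFT keypoints."""
    sift = cv2.SIFT_create()
    kp_f, des_f = sift.detectAndCompute(fixed, None)    # fixed-image keypoints/descriptors
    kp_m, des_m = sift.detectAndCompute(moving, None)   # moving-image keypoints/descriptors

    # 2-NN matching with Lowe's ratio test to keep only distinctive matches
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des_m, des_f, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    src = np.float32([kp_m[m.queryIdx].pt for m in good])
    dst = np.float32([kp_f[m.trainIdx].pt for m in good])

    # RANSAC rejects mismatched keypoints while fitting the transform
    M, inliers = cv2.estimateAffinePartial2D(
        src, dst, method=cv2.RANSAC, ransacReprojThreshold=3.0)
    return M, inliers

# Usage:
#   M, _ = sift_ransac_rigid(img_fixed, img_moving)
#   warped = cv2.warpAffine(img_moving, M, img_fixed.shape[1::-1])
```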

References:

[1] J. Öfverstedt, J. Lindblad, and N. Sladoje. Fast and Robust Symmetric Image Registration Based on Distances Combining Intensity and Spatial Information. IEEE Transactions on Image Processing, Vol. 28, No. 7, pp. 3584-3597, 2019.

[2] J. Öfverstedt, J. Lindblad, and N. Sladoje. INSPIRE: Intensity and Spatial Information-Based Deformable Image Registration. arXiv preprint: arXiv:2012.07208

4.) The authors mention that in Table 1, we can see some asymmetry depending on the translation direction. How do we know this difference is due to asymmetric performance of the image translation, and not a property of the different modalities? I.e., is it possible the FID measure is just higher/lower in general for one modality over the other? Or is it normalized somehow to each modality?

Authors’ response:

We agree with the reviewer: the concern that the FID may simply be higher/lower in general for one modality than for the other is relevant and worth further investigation. Therefore we have evaluated the FID between the training and testing splits within each modality for each dataset to establish references, and we report the obtained results in Table 1 (in the two added rows at the bottom). In most cases, the computed FIDs do not show any large differences between the modalities within each of the datasets. An exception is the Cytological dataset, where the FID is generally larger for modality B (82.8 ±20.4) than for modality A (48.8 ±27.1); this is most likely attributed to partial intensity value overflows occurring during the data acquisition, as we observed during visual inspection. We removed our related (not entirely well-founded) observation (ii) after Table 1.
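For concreteness, a within-modality FID reference of this kind can be computed along the lines of the sketch below. It assumes the torchmetrics implementation of FID and uses a helper name of our choosing; the evaluation code in our shared repository may differ in details.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def fid_reference(train_imgs: torch.Tensor, test_imgs: torch.Tensor) -> float:
    """FID between the training and test splits of a single modality.

    Both inputs are uint8 tensors of shape (N, 3, H, W), as expected by the
    default torchmetrics configuration.
    """
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(train_imgs, real=True)    # training split acts as the reference distribution
    fid.update(test_imgs, real=False)    # test split is compared against it
    return float(fid.compute())

# Example call pattern (random data only to show the expected shapes):
#   a = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)
#   b = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)
#   print(fid_reference(a, b))
```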

5) In the conclusion, the authors make some strong statements about the efficacy of I2I based registration approaches relative to standard approaches (such as MI) based on their results. However, I think it is important to note that in this work, only intra-subject rigid alignment using simulated displacements was considered. This is a fairly limited registration task, and is not a great representation of the majority of real-world registration problems. Other studies have shown that once we start working with more complex transformations, cross subject/specimen data, or temporal shifts in the image, the effectiveness of traditional multi-modal measures such as MI starts falling dramatically, which is the driving motivation for many of the techniques cited by the authors.

Authors’ response:

We appreciate the reviewer’s opinion and find it very relevant. Hence we have modified our Conclusion section accordingly and mentioned the limitation of MI.

To Reviewer #2

This work tackles the well-known problem of multimodal registration, which lacks proper definitions for similarity functions between aligned images. Recently, due to the advent of many image-to-image translation methods, several methods propose to cast this multimodal registration problem as a monomodal problem, where several similarity functions have proved successful, by means of synthesising source domain images following the appearance of the target domain images.

The authors provide a nice trade-off between some of the standard metrics designed for multimodal registration and some methods using the intermediate I2I step with a monomodal similarity metric. Moreover, they add their recently proposed method CoMIR, which in my opinion could also be thought of as an I2I translation, but to a latent domain between the source and target domains.

The methods, datasets and experiments are well presented in the manuscript, even though sometimes it may be confusing due to the wealth of results and comparisons.

However, I have some major comments and some minor comments, both detailed below on a section-by-section basis. Before that, I wanted to highlight my main concern (repeated below). In my opinion, the experiment design may need a bit more discussion or refactoring. I think that some readers may eventually say that the comparison is not entirely fair unless clarified by the authors. The I2I methods have been trained using different training data (not datasets, but different inputs, for example, aligned vs unaligned images), or hyperparameters (batch size, number of iterations). I acknowledge that all parameters cannot be set equally, but for a fair comparison one needs to decide which ones to keep and which not, and justify this choice. In the case of this article, I think that the same training data (all methods see the same aligned pairs in training) and batch size need to be kept constant for all methods. In contrast, the number of training iterations/epochs may be different based on the convergence of the methods (rather than a fixed number of epochs as it is now; in such a case, I’d say it’s better to fix the same number of epochs for all methods).

Such decisions, independent of which ones, need to be properly justified. The authors mention all these experimental decisions, but the reasons behind them are not convincing to me.

Authors’ response:

We thank the reviewer for recognising the contribution of our study and its relevance, as well as for the constructive comments and suggestions. We have no doubts that, by addressing them, we further increased the informative content and the overall quality of the manuscript. We reply to the reviewer’s concerns in a point-by-point manner in the following paragraphs.

Background section:

I think that this paper (https://doi.org/10.1007/978-3-642-40811-3_79) asked a similar question about 10 years ago and could be mentioned in the introduction, as a very similar type of work. It could also be interesting to compare the conclusions from that article with the present ones in the conclusions.

Authors’ response:

We find the reviewer’s suggestion very relevant. Therefore, we added discussions in both Section 1 (Introduction) and Section 2 (Background and related work), with this paper and several other related works cited. The experiments in the mentioned article were conducted on already affinely-aligned images where only small local deformations exist, thus making a direct comparison of conclusions challenging. Their MRI T1-PD exemplar-based synthesis shows good visual quality, presumably due to the rather correlated modalities. This indirectly supports our observation that the I2I-based approach exhibits rather high data dependence.

The distinction between intensity- and feature-based and hybrid registration methods seems appropriate for this work, since they are mainly focused on the multimodal scenario. They discuss several intensity-based and a few feature-based approaches. In the last paragraphs, they highlight several works that use I2I for registration of medical images. For completeness, I think that this part lacks:

In which category of the classification introduced by the authors do the I2I methods fall? (intensity, feature, hybrid).

Why do we need I2I? (justification)

What are the problems when training I2I translation methods? How are these addressed in the literature? (e.g., in [49] they use invariance between registration and translation)

Authors’ response:

We reply to the reviewer’s questions/suggestions as follows:

I2I methods do not actually fall into any category of registration methods, because I2I translation is used as a “pre-processing” step to convert the multimodal problem to a pseudo-monomodal one, and the classification into the mentioned categories should then be based on the subsequent registration approach. In our study, α-AMD and SIFT are used as representatives of intensity- and feature-based registration methods respectively, while both can be used in conjunction with the different I2I methods.

Following the reviewer’s suggestion above, we extended the justification of introducing I2I translation to tackle the multimodal registration problem in Sec. 2, noting that with the recent advent of GAN-based I2I methods, the potential of this approach, which was previously limited by the synthesis methods, is yet to be explored.

The main problem of using I2I methods in registration is caused by the unstable, data-dependent, and varying outputs of GANs; therefore the previous works (and suggested solutions) are mostly application dependent. We have added a statement on this. Since we are conducting a high-level comparative study of several GAN-based I2I methods, we prefer to avoid delving deeper into the analysis of more method- and application-specific details. An interested reader is referred to the cited original papers.

Moreover, I think there are two concepts probably misused (happy to discuss with the authors so that they can convince me):

“Monomodal registration frameworks”. I don’t think a registration framework is monomodal or multimodal per se. It is the metric optimised that is more appropriate for monomodal or multimodal scenarios.

As an example, NiftyReg can use the SSD, LNCC or NMI metrics depending on the problem.

The authors define the MIND optimisation framework as a “similarity-based (monomodal) registration framework”, where they use the SSD between MIND representations of both moving and reference images, even though it can be used to align both monomodal and multimodal images.

In the same line, I’m not sure that “monomodal registration is easier”; there are just better-defined losses (intensity-based, voxel-level) than in the multimodal case, for example SSD or local NCC.

Authors’ response:

We thank the reviewer for bringing up these relevant topics:

We agree with the reviewer’s point. To reduce confusion, we have modified our wording and phrasing accordingly to avoid statements such as “monomodal registration frameworks”.

We consider the loss (similarity/distance used etc.) to be an integral part of the registration method, and thus the monomodal registration problem is typically easier because there (as also pointed out by the reviewer) exist better-defined losses. In addition, note that we only state that this is in general the case, since other factors also play a role in the overall difficulty.

Considered methods:

I think the second paragraph should be in the results section (less confusing for the reader). You actually mention it later as well.

“We also evaluated VoxelMorph [32] using MI as a loss function for the task of rigid multimodal registration. However, it consistently under-performed in our rigid registration task (similar is also observed by [57]) and we exclude the related results for clarity.”

Moreover, you could in a few words say which VoxelMorph configuration you tested: SVF, deformation fields, rigid transform parameters (if such an option exists)?

Authors’ response:

We appreciate and agree with the reviewer's comment. Thus we have removed this paragraph and added Sec. 3.1.5 to briefly introduce VoxelMorph and our performed modification.

Experiments:

Fair comparison of I2I methods: the same data augmentation is used (great!) but:

The data used is different! (depending on whether each method needs aligned or unaligned images).

The batch sizes are also different.

The number of iterations is also different and hard-coded (it could be justified by convergence for this particular problem, but not by the default number of epochs, which is probably optimised for other tasks).

My thinking is that one could optimise each method and it may be right (e.g., run until convergence), but for a comparison to be fair some training parameters must be fixed and equal for all methods (e.g., the most restrictive method sets the parameter). The criteria should be clear and justified.

My special concern is with the use of aligned and non-aligned training data, especially because training with aligned images seems more appropriate for the task at hand (which is registration, not I2I). Also, the methods (at least for DRIT++ this is made explicit, but I think it’s also the case for CycleGAN and StarGANv2) are trained with unpaired images but tested with paired images. And then in the results section, CoMIR and Pix2Pix are the best performing ones.

The feature-wise transformation loss of DRIT++ should be briefly summarised in the methods section (3.2.3), as you mention a slight modification and it seems important for understanding the method comparison.

Authors’ response:

We appreciate the reviewer’s raised concerns, which indeed help us to improve the presentation of our study. We have made the following modifications to address each point above:

The considered I2I methods are selected based on their popularity in the community, and generally observed good performance in different applications, while ensuring diversity of the selection in terms of supervision, scalability to more modalities, and ease of usage. We have tried our best to make a fair comparison of such a diverse collection and adjusted the I2I methods where appropriate. Here we hope that some possible misunderstandings can be clarified:

The data used for all methods are actually exactly the same; the difference is how each method utilises the provided data. More specifically, “supervised” or “unsupervised” (in terms of utilisation of the aligned image pairs) is an intrinsic property of the I2I translation methods. Being unsupervised I2I translation methods, DRIT++, StarGANv2, and CycleGAN do not benefit from aligned pairs during training (these methods do not use that information even though they are presented with it). On the other hand, pix2pix and CoMIR need to utilise the alignment information for training and cannot be trained without. The qualitative advantage of the “unsupervised” I2I approaches is reasonable, while, as observed in our results, the quantitative benefits of the “supervised” ones are clearly shown.

We can now understand why confusion regarding the used training data occurred and we have therefore removed the somewhat misleading sentence in Sec. 5.1; the way the training data is handled is not introduced by our implementation but intrinsically exists within the different methods. We clarify the supervised/unsupervised nature of the different I2I methods in Sec. 3.2 when introducing them.

Batch size has been shown not to be a critical factor for performance [1]. We try to stay close to the default settings of the I2I methods while also optimising for our GPU resources. On the other hand, although keeping the batch size constant makes sense when evaluating different configurations of a specific method, different methods benefit differently from large batch sizes, so using the same batch size for all methods is not necessarily fair. Our setting, limiting the GPU memory consumption for each method instead, reflects a more practical usage parameter and therefore may be more informative to the reader (who typically is, in a similar way, limited by GPU memory).

We agree with the reviewer on this point. The convergence was indeed checked and confirmed for each training. We have modified our description in the manuscript. Yet we do not opt to set the parameters by the most restrictive method, but to give each method a fair chance to reach its best on this specific task.

We appreciate the reviewer’s suggestion and have modified our Sec. 3.2.3 accordingly.

References:

[1] S. Raschka, “No, We Don’t Have to Choose Batch Sizes As Powers Of 2,” Dr. Sebastian Raschka. [Online]. Available: https://sebastianraschka.com/blog/2022/batch-size-2.html. [Accessed: 21-Jul-2022].

Results:

Figure 8 needs indexing: A,B, etc…

Authors’ response:

We thank the reviewer for pointing this out. We find that using sub-captions provides improved clarity over indexing; however, this is not well handled in the PLOS review layout. We are keeping this technical issue in mind and will make sure it is appropriately handled in the camera-ready version.

Conclusions:

One of the problems with I2I methods is that the topology of the generated image may not be preserved. Then, recent works such as https://doi.org/10.1007/978-3-030-87592-3_5 (citation 49, and even the authors’ method CoMIR, could be thought of as being along this line) propose different options to ameliorate this issue. I think that, at least, this needs to be discussed somewhere in the paper (maybe linked to the previous comment about “problems with I2I translation methods”?). Maybe not in the conclusions but in sections 6.2 or 6.4.

Authors’ response:

We appreciate the reviewer’s suggestion and have added a paragraph of discussion in Sec. 6.4. We find that this mentioned recent work is indeed relevant and we have added a citation to it in our Sec. 2 when discussing related works using I2I for registration.

Minor comments:

1. In the introduction (lines 52-55) the authors seem to say that multimodal image registration using I2I and GAN-based approaches is mostly nonexistent. I wouldn’t say so, and even the authors discuss many approaches in the following section.

Authors’ response:

We appreciate the reviewer’s suggestion and have modified our phrasing accordingly, to emphasise the uniqueness of our work in terms of quantitative evaluation across multiple datasets.

2. In Fig.1, I would not say that the registration from generated modality B to acquired modality B is simpler (same for modality A), but that it is monomodal. Since it is a monomodal registration, you could use standard pixel-based similarity metrics that are well defined in such scenarios. In the case of registering acquired modalities A and B, the registration is not harder but multimodal; in this case there are no well-defined objective functions to optimise. The most widely used, as you say throughout the text, is mutual information, but its performance is far from that in monomodal scenarios. Moreover, what you depict here is more like a CycleGAN-based approach. I think it would be clearer to the reader just to show a forward transformation (or clarify it in the text and caption).

Authors’ response:

We appreciate the comment, but we disagree. Firstly, we do not claim that the monomodal approach is necessarily “simpler”. We use question marks to introduce the scientific question that our study is trying to answer, and it is clearly stated in the caption as possibly simpler. As mentioned above, we consider the loss an integral part of the registration problem; thus, monomodal registration is typically easier because better-defined losses exist. In addition, please note that there is no cycle in our figure. The image shows three optional paths of registration (as stated in the caption - “either of the two peripheral paths”). The modality translation can intrinsically be conducted in both directions, as we did in our experiments. Which direction to translate may depend on the information contained in the modalities, where asymmetry often exists. We think excluding one path would risk losing the information we intended to convey.

3. Good to have published the methods open-source in Github. I think it would be useful to have the link in the text (and not only in the abstract). Either in the introduction when you mention it, or in the Methods, section.

Authors’ response:

We appreciate the reviewer’s suggestion and have added a link in the first paragraph of Sec. 5.4, since we share not only the implementation of methods but also the whole experimental setup and evaluation protocol described there.

4. Please, uniform StarGANv2 and StarGAN-v2 naming convention.

Authors’ response:

We thank the reviewer for pointing this out. All occurrences have been unified to StarGANv2.

5. I think that the MSD acronym is defined after its first use.

Authors’ response:

We thank the reviewer for pointing this out. The second (repeated) definition has been removed.

Attachment

Submitted filename: Response_to_Reviewers.pdf

Decision Letter 1

Dzung Pham

2 Oct 2022

Is image-to-image translation the panacea for multimodal image registration? A comparative study

PONE-D-22-07182R1

Dear Dr. Lindblad,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Dzung Pham

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewer 2 suggested that a sentence from Section 5.1 that was removed in the revision be put back. I leave it up to the authors whether they would like to add this sentence back in proofs.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have adequately addressed each of my major concerns regarding their work. I have no additional comments.

Reviewer #2: Dear authors,

Thanks for carefully addressing all concerns from both reviewers and the editors. I think the manuscript has improved in readability and completeness. I mostly agree with all of the responses, and thanks for justifying your experimental decisions. I just want to add something related to the paired vs. unpaired discussion we began.

1.- Regarding your paragraph in the letter of response:

“Being unsupervised I2I translation methods, DRIT++, StarGANv2, and CycleGAN do not benefit from aligned pairs during training (these methods do not use that information even though they are presented with it). On the other hand, pix2pix and CoMIR need to utilise the alignment information for training and cannot be trained without. The qualitative advantage of the “unsupervised” I2I approaches is reasonable, while, as observed in our results, the quantitative benefits of the “supervised” ones are clearly shown.”

I completely agree that some methods (e.g., pix2pix and CoMIR) need paired training data by design and that others, such as CycleGAN, DRIT++ and StarGANv2, do not. However, the optimization of the latter seems easier when they are fed with paired images -- even though I don’t know if that is the case in practice (maybe they reach local minima that do not generalize well to unseen images).

In any case I would keep the sentence where you explain how you sample the images from both modalities depending on the method (Sec. 5.1). Apologies if my comment was confusing in the first place.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Adrià Casamitjana Díaz

**********

Acceptance letter

Dzung Pham

15 Nov 2022

PONE-D-22-07182R1

Is image-to-image translation the panacea for multimodal image registration? A comparative study

Dear Dr. Lindblad:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr Dzung Pham

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. List of abbreviations (and method names) used in the paper.

    (PDF)

    Attachment

    Submitted filename: reviewer-comments.pdf

    Attachment

    Submitted filename: Response_to_Reviewers.pdf

    Data Availability Statement

    All used datasets are openly shared on Zenodo: https://zenodo.org/record/5557568 (DOI: 10.5281/zenodo.5557568).

