Author manuscript; available in PMC: 2024 Apr 1.
Published in final edited form as: Med Image Anal. 2023 Jan 31;85:102762. doi: 10.1016/j.media.2023.102762

Table 3.

A summarized review of Transformer-based models for medical image detection (upper panel) and registration (lower panel). "N.A." denotes not applicable for the intermediate blocks or decoder module. "N" denotes that the number of model parameters was not reported or is not applicable. "t" denotes the temporal dimension.

Detection

| Reference | Architecture | 2D/3D | Pre-training | #Param | Detection Task | Modality | Dataset | ViT as Enc/Inter/Dec | Highlights |
|---|---|---|---|---|---|---|---|---|---|
| COTR (Shen et al., 2021a) | Conv-Transformer Hybrid | 2D | No | N | Polyp Detection | Colonoscopy | CVC-ClinicDB (Bernal et al., 2015), ETIS-LARIB (Silva et al., 2014), CVC-ColonDB (Bernal et al., 2012) | Yes/No/Yes | Convolution layers embedded between the Transformer encoder and decoder to preserve feature structure. |
| (Mathai et al., 2022) | Conventional Transformer | 2D | No | N | Lymph Node Detection | MRI | Abdominal MRI | Yes/N.A./Yes | DETR applied to T2 MRI. |
| (Jiang et al., 2021) | Conv-Transformer Hybrid | 2D | No | N | Dental Caries Detection | RGB | Dental Caries Digital Image | No/Yes/No | Augments YOLO by applying a Transformer to the features extracted from the CNN encoder. |
| TR-Net (Ma et al., 2021a) | Conv-Transformer Hybrid | 3D | No | N | Stenosis Detection | CTA | Coronary CT Angiography | No/Yes/N.A. | CNN applied to image patches, followed by a Transformer to learn patch-wise dependencies. |
| DATR (Zhu et al., 2022) | Conv-Transformer Hybrid | 2D | Pre-trained Swin | N | Landmark Detection | X-ray | Head (Wang et al., 2016), Hand (Payer et al., 2019), and Chest (Zhu et al., 2021) | Yes/N.A./No | A learnable diagonal matrix integrated into the Swin Transformer enables learning domain-specific features across domains. |
| SATr (Li et al., 2022b) | Conv-Transformer Hybrid | 2D | No | N | Lesion Detection | CT | DeepLesion (Yan et al., 2018) | No/Yes/No | Introduces a slice-attention Transformer into commonly used CNN backbones to capture inter- and intra-slice dependencies. |
| (Tian et al., 2022) | Conv-Transformer Hybrid | 2D+t | Pre-trained CNN | N | Polyp Detection | Colonoscopy | Hyper-Kvasir (Borgli et al., 2020), LD-PolypVideo (Ma et al., 2021b) | No/Yes/N.A. | A weakly supervised framework with a hybrid CNN-Transformer model for polyp detection. |
| SCT (Windsor et al., 2022) | Conv-Transformer Hybrid | 2D | Pre-trained CNN | N | Spinal Cancer Detection | MRI | Whole Spine MRI | No/Yes/N.A. | A Transformer that considers contextual information across the spinal column and all accessible MRI sequences to detect spinal cancer. |
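Several of the detection rows above share one hybrid pattern: a CNN extracts a local feature map, which is flattened into tokens and mixed globally by self-attention. Below is a minimal NumPy sketch of that token-mixing step only; the random matrices stand in for learned Q/K/V projections, and none of this reproduces any cited model's actual implementation.

```python
import numpy as np

def self_attention(tokens, rng):
    """Single-head scaled dot-product self-attention over feature tokens.
    tokens: (n_tokens, dim) array, e.g. a flattened CNN feature map."""
    n, d = tokens.shape
    # Random projections stand in for learned Q/K/V weight matrices.
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(d)                 # (n, n) pairwise affinities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over keys
    return attn @ v                               # (n, d) context-mixed tokens

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 32))   # toy CNN feature map (H, W, C)
tokens = feat.reshape(-1, 32)            # 64 spatial tokens of dim 32
out = self_attention(tokens, rng)
print(out.shape)  # (64, 32)
```

Each output token is a weighted mixture of all 64 input tokens, which is what lets these hybrids model dependencies beyond the CNN's receptive field (e.g. inter-slice context in SATr, or cross-sequence context in SCT).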
Registration

| Reference | Architecture | 2D/3D | Pre-training | #Param | Registration Task | Modality | Dataset | ViT as Enc/Inter/Dec | Highlights |
|---|---|---|---|---|---|---|---|---|---|
| ViT-V-Net (Chen et al., 2021c) | Conv-Transformer Hybrid | 3D | No | 110.6M | Inter-patient | MRI | Brain MRI | Yes/No/No | ViT applied to the CNN-extracted features in the encoder. |
| TransMorph (Chen et al., 2022b) | Conv-Transformer Hybrid | 3D | No | 46.8M | Inter-patient, Atlas-to-patient, Phantom-to-patient | MRI, CT, XCAT | IXI*, OASIS (Marcus et al., 2007), Abdominal and Pelvic CT (Segars et al., 2013) | Yes/No/No | Swin Transformer used as the encoder to extract features from the concatenated input image pair. |
| DTN (Zhang et al., 2021c) | Conv-Transformer Hybrid | 3D | No | N | Inter-patient | MRI | OASIS (Marcus et al., 2007) | No/Yes/No | Separate Transformers capture inter- and intra-image dependencies within the image pair. |
| PC-SwinMorph (Liu et al., 2022a) | Conv-Transformer Hybrid | 3D | No | N | Inter-patient | MRI | CANDI (Kennedy et al., 2012), LPBA-40 (Shattuck et al., 2008) | No/No/Hybrid | Patch-based image registration; a Swin Transformer stitches the patch-wise deformation fields. |
| Swin-VoxelMorph (Zhu and Lu, 2022) | Conv-Transformer Hybrid | 3D | No | N | Patient-to-atlas | MRI | ADNI (Mueller et al., 2005), PPMI (Marek et al., 2011) | Yes/N.A./Yes | Swin-Transformer-based encoder and decoder network for inverse-consistent registration. |
| XMorpher (Shi et al., 2022) | Conv-like Transformers | 3D | No | N | Inter-patient | CT | MM-WHS 2017 (Zhuang and Shen, 2016), ASOCA (Gharleghi et al., 2022) | Yes/N.A./Yes | Two Swin-like Transformers process the fixed and moving images, with cross-attention blocks facilitating communication between them. |
| C2FViT (Mok and Chung, 2022) | Conv-like Transformers | 3D | No | N | Template-matching, Patient-to-atlas | MRI | OASIS (Marcus et al., 2007), LPBA-40 (Shattuck et al., 2008) | Yes/N.A./N.A. | A multi-resolution Vision Transformer tackles the affine registration problem for brain MRI. |
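Whatever the encoder, the deformable models in this panel ultimately predict a dense displacement field that warps the moving image toward the fixed one. A minimal NumPy sketch of applying such a field in 2D with nearest-neighbor resampling follows; it is a toy stand-in for the (tri)linear spatial-transformer warping these 3D models actually use, and `warp_nearest` is an illustrative name, not an API from any cited work.

```python
import numpy as np

def warp_nearest(moving, flow):
    """Warp a 2D moving image with a dense displacement field, the typical
    output of a deformable registration network.
    moving: (H, W) image; flow: (H, W, 2) per-pixel displacements (dy, dx)."""
    h, w = moving.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sample the moving image at the displaced coordinates, clamped to borders.
    src_y = np.clip(np.rint(ys + flow[..., 0]), 0, h - 1).astype(int)
    src_x = np.clip(np.rint(xs + flow[..., 1]), 0, w - 1).astype(int)
    return moving[src_y, src_x]

moving = np.arange(16.0).reshape(4, 4)
flow = np.zeros((4, 4, 2))
flow[..., 1] = 1.0                   # sample one pixel to the right everywhere
warped = warp_nearest(moving, flow)
print(warped[0])  # [1. 2. 3. 3.] — last column clamped at the border
```

Training then minimizes a similarity loss between `warped` and the fixed image plus a smoothness penalty on `flow`; the Transformer components above differ mainly in how they encode the image pair before this field is decoded.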