Table 3.
A summarized review of Transformer-based models for medical image detection (upper panel) and registration (lower panel). "N.A." denotes not applicable (the model has no intermediate block or decoder module). "N" denotes that the number of model parameters is not reported or not applicable. "t" denotes the temporal dimension.
Detection

Reference | Architecture | 2D/3D | Pre-training | #Param | Detection Task | Modality | Dataset | ViT as Enc/Inter/Dec | Highlights
---|---|---|---|---|---|---|---|---|---
COTR (Shen et al., 2021a) | Conv-Transformer Hybrid | 2D | No | N | Polyp Detection | Colonoscopy | CVC-ClinicDB (Bernal et al., 2015), ETIS-LARIB (Silva et al., 2014), CVC-ColonDB (Bernal et al., 2012) | Yes/No/Yes | Convolution layers are embedded between the Transformer encoder and decoder to preserve feature structure.
(Mathai et al., 2022) | Conventional Transformer | 2D | No | N | Lymph Node Detection | MRI | Abdominal MRI | Yes/N.A./Yes | DETR applied to T2-weighted abdominal MRI.
(Jiang et al., 2021) | Conv-Transformer Hybrid | 2D | No | N | Dental Caries Detection | RGB | Dental Caries Digital Image | No/Yes/No | Augments YOLO by applying a Transformer to the features extracted by the CNN encoder.
TR-Net (Ma et al., 2021a) | Conv-Transformer Hybrid | 3D | No | N | Stenosis Detection | CTA | Coronary CT Angiography | No/Yes/N.A. | A CNN is applied to image patches, followed by a Transformer that learns patch-wise dependencies.
DATR (Zhu et al., 2022) | Conv-Transformer Hybrid | 2D | Pre-trained Swin | N | Landmark Detection | X-ray | Head (Wang et al., 2016), Hand (Payer et al., 2019), and Chest (Zhu et al., 2021) | Yes/N.A./No | A learnable diagonal matrix integrated into the Swin Transformer enables learning domain-specific features across domains.
SATr (Li et al., 2022b) | Conv-Transformer Hybrid | 2D | No | N | Lesion Detection | CT | DeepLesion (Yan et al., 2018) | No/Yes/No | Adds a slice-attention Transformer to commonly used CNN backbones to capture inter- and intra-slice dependencies.
(Tian et al., 2022) | Conv-Transformer Hybrid | 2D+t | Pre-trained CNN | N | Polyp Detection | Colonoscopy | Hyper-Kvasir (Borgli et al., 2020), LD-PolypVideo (Ma et al., 2021b) | No/Yes/N.A. | A weakly supervised framework with a hybrid CNN-Transformer model is developed for polyp detection.
SCT (Windsor et al., 2022) | Conv-Transformer Hybrid | 2D | Pre-trained CNN | N | Spinal Cancer Detection | MRI | Whole Spine MRI | No/Yes/N.A. | A Transformer that incorporates context from across the spinal column and all available MRI sequences is used to detect spinal cancer.
Registration

Reference | Architecture | 2D/3D | Pre-training | #Param | Registration Task | Modality | Dataset | ViT as Enc/Inter/Dec | Highlights
---|---|---|---|---|---|---|---|---|---
ViT-V-Net (Chen et al., 2021c) | Conv-Transformer Hybrid | 3D | No | 110.6M | Inter-patient | MRI | Brain MRI | Yes/No/No | ViT is applied to the CNN-extracted features in the encoder.
TransMorph (Chen et al., 2022b) | Conv-Transformer Hybrid | 3D | No | 46.8M | Inter-patient, Atlas-to-patient, Phantom-to-patient | MRI, CT, XCAT | IXI*, OASIS (Marcus et al., 2007), Abdominal and Pelvic CT, XCAT (Segars et al., 2013) | Yes/No/No | A Swin Transformer encoder extracts features from the concatenated input image pair.
DTN (Zhang et al., 2021c) | Conv-Transformer Hybrid | 3D | No | N | Inter-patient | MRI | OASIS (Marcus et al., 2007) | No/Yes/No | Separate Transformers capture inter- and intra-image dependencies within the image pair.
PC-SwinMorph (Liu et al., 2022a) | Conv-Transformer Hybrid | 3D | No | N | Inter-patient | MRI | CANDI (Kennedy et al., 2012), LPBA-40 (Shattuck et al., 2008) | No/No/Hybrid | Patch-based image registration; a Swin Transformer stitches the patch-wise deformation fields.
Swin-VoxelMorph (Zhu and Lu, 2022) | Conv-Transformer Hybrid | 3D | No | N | Patient-to-atlas | MRI | ADNI (Mueller et al., 2005), PPMI (Marek et al., 2011) | Yes/N.A./Yes | A Swin-Transformer-based encoder-decoder network for inverse-consistent registration.
XMorpher (Shi et al., 2022) | Conv-like Transformers | 3D | No | N | Inter-patient | CT | MM-WHS 2017 (Zhuang and Shen, 2016), ASOCA (Gharleghi et al., 2022) | Yes/N.A./Yes | Two Swin-like Transformers process the fixed and moving images, with cross-attention blocks enabling communication between them.
C2FViT (Mok and Chung, 2022) | Conv-like Transformers | 3D | No | N | Template-matching, Patient-to-atlas | MRI | OASIS (Marcus et al., 2007), LPBA-40 (Shattuck et al., 2008) | Yes/N.A./N.A. | A multi-resolution Vision Transformer tackles the affine registration problem with brain MRI.
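Most Conv-Transformer Hybrid entries in both panels share the same basic pattern: a CNN produces a feature map, the spatial grid is flattened into a sequence of tokens, and a Transformer models dependencies among those tokens (e.g., ViT-V-Net's encoder or SATr's slice attention). A minimal NumPy sketch of that token-mixing step is given below; the random array standing in for a CNN feature map and the weight matrices `Wq`, `Wk`, `Wv` are illustrative assumptions, not parameters from any of the listed models.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # Single-head scaled dot-product attention.
    # tokens: (n_tokens, d) -- one token per spatial location.
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # pairwise token affinities
    return softmax(scores, axis=-1) @ v       # weighted mix of all tokens

rng = np.random.default_rng(0)
d = 32                                    # channel dim of the feature map
feat = rng.standard_normal((8, 8, d))     # stand-in for a CNN feature map
tokens = feat.reshape(-1, d)              # flatten 8x8 grid into 64 tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

out = self_attention(tokens, Wq, Wk, Wv)
print(out.shape)  # (64, 32): each output token attends to every location
```

The point of the hybrid design is visible in the shapes: convolution mixes only a local neighborhood, whereas the 64x64 attention matrix lets every spatial location contribute to every output token, which is what the table's Highlights describe as capturing long-range or inter-slice/inter-image dependencies.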