Table 3.
A summarized review of Transformer-based models for medical image detection (upper panel) and registration (lower panel). "N.A." denotes not applicable (the model has no intermediate block or decoder module). "N" denotes that the number of model parameters is not reported or not applicable. "t" denotes the temporal dimension.
Detection

Reference | Architecture | 2D/3D | Pre-training | #Param | Detection Task | Modality | Dataset | ViT as Enc/Inter/Dec | Highlights
---|---|---|---|---|---|---|---|---|---
COTR (Shen et al., 2021a) | Conv-Transformer Hybrid | 2D | No | N | Polyp Detection | Colonoscopy | CVC-ClinicDB (Bernal et al., 2015), ETIS-LARIB (Silva et al., 2014), CVC-ColonDB (Bernal et al., 2012) | Yes/No/Yes | Convolution layers are embedded between the Transformer encoder and decoder to preserve feature structure.
(Mathai et al., 2022) | Conventional Transformer | 2D | No | N | Lymph Node Detection | MRI | Abdominal MRI | Yes/N.A./Yes | DETR applied to T2-weighted abdominal MRI.
(Jiang et al., 2021) | Conv-Transformer Hybrid | 2D | No | N | Dental Caries Detection | RGB | Dental Caries Digital Image | No/Yes/No | Augments YOLO by applying a Transformer to the features extracted by the CNN encoder.
TR-Net (Ma et al., 2021a) | Conv-Transformer Hybrid | 3D | No | N | Stenosis Detection | CTA | Coronary CT Angiography | No/Yes/N.A. | A CNN is applied to image patches, followed by a Transformer that learns patch-wise dependencies.
DATR (Zhu et al., 2022) | Conv-Transformer Hybrid | 2D | Pre-trained Swin | N | Landmark Detection | X-ray | Head (Wang et al., 2016), Hand (Payer et al., 2019), and Chest (Zhu et al., 2021) | Yes/N.A./No | A learnable diagonal matrix integrated into the Swin Transformer enables learning domain-specific features across domains.
SATr (Li et al., 2022b) | Conv-Transformer Hybrid | 2D | No | N | Lesion Detection | CT | DeepLesion (Yan et al., 2018) | No/Yes/No | Adds a slice-attention Transformer to commonly used CNN backbones to capture inter- and intra-slice dependencies.
(Tian et al., 2022) | Conv-Transformer Hybrid | 2D+t | Pre-trained CNN | N | Polyp Detection | Colonoscopy | Hyper-Kvasir (Borgli et al., 2020), LD-PolypVideo (Ma et al., 2021b) | No/Yes/N.A. | A weakly supervised framework with a hybrid CNN-Transformer model is developed for polyp detection.
SCT (Windsor et al., 2022) | Conv-Transformer Hybrid | 2D | Pre-trained CNN | N | Spinal Cancer Detection | MRI | Whole Spine MRI | No/Yes/N.A. | A Transformer that incorporates context from across the spinal column and all available MRI sequences is used to detect spinal cancer.
Registration

Reference | Architecture | 2D/3D | Pre-training | #Param | Registration Task | Modality | Dataset | ViT as Enc/Inter/Dec | Highlights
---|---|---|---|---|---|---|---|---|---
ViT-V-Net (Chen et al., 2021c) | Conv-Transformer Hybrid | 3D | No | 110.6M | Inter-patient | MRI | Brain MRI | Yes/No/No | ViT is applied to the CNN-extracted features in the encoder.
TransMorph (Chen et al., 2022b) | Conv-Transformer Hybrid | 3D | No | 46.8M | Inter-patient, Atlas-to-patient, Phantom-to-patient | MRI, CT, XCAT | IXI*, OASIS (Marcus et al., 2007), Abdominal and Pelvic CT, XCAT (Segars et al., 2013) | Yes/No/No | A Swin Transformer encoder extracts features from the concatenated input image pair.
DTN (Zhang et al., 2021c) | Conv-Transformer Hybrid | 3D | No | N | Inter-patient | MRI | OASIS (Marcus et al., 2007) | No/Yes/No | Separate Transformers capture inter- and intra-image dependencies within the image pair.
PC-SwinMorph (Liu et al., 2022a) | Conv-Transformer Hybrid | 3D | No | N | Inter-patient | MRI | CANDI (Kennedy et al., 2012), LPBA-40 (Shattuck et al., 2008) | No/No/Hybrid | Patch-based image registration; a Swin Transformer stitches the patch-wise deformation fields.
Swin-VoxelMorph (Zhu and Lu, 2022) | Conv-Transformer Hybrid | 3D | No | N | Patient-to-atlas | MRI | ADNI (Mueller et al., 2005), PPMI (Marek et al., 2011) | Yes/N.A./Yes | A Swin-Transformer-based encoder-decoder network for inverse-consistent registration.
XMorpher (Shi et al., 2022) | Conv-like Transformers | 3D | No | N | Inter-patient | CT | MM-WHS 2017 (Zhuang and Shen, 2016), ASOCA (Gharleghi et al., 2022) | Yes/N.A./Yes | Two Swin-like Transformers process the fixed and moving images, with cross-attention blocks enabling communication between them.
C2FViT (Mok and Chung, 2022) | Conv-like Transformers | 3D | No | N | Template-matching, Patient-to-atlas | MRI | OASIS (Marcus et al., 2007), LPBA-40 (Shattuck et al., 2008) | Yes/N.A./N.A. | A multi-resolution Vision Transformer tackles the affine registration problem with brain MRI.
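Most Conv-Transformer Hybrid entries in both panels share the same basic pattern: a CNN produces a feature map, the spatial grid is flattened into a sequence of tokens, and a Transformer models dependencies among those tokens (e.g., ViT-V-Net's encoder or SATr's slice attention). A minimal NumPy sketch of that token-mixing step is given below; the random array standing in for a CNN feature map and the weight matrices `Wq`, `Wk`, `Wv` are illustrative assumptions, not parameters from any of the listed models.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # Single-head scaled dot-product attention.
    # tokens: (n_tokens, d) -- one token per spatial location.
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # pairwise token affinities
    return softmax(scores, axis=-1) @ v       # weighted mix of all tokens

rng = np.random.default_rng(0)
d = 32                                    # channel dim of the feature map
feat = rng.standard_normal((8, 8, d))     # stand-in for a CNN feature map
tokens = feat.reshape(-1, d)              # flatten 8x8 grid into 64 tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

out = self_attention(tokens, Wq, Wk, Wv)
print(out.shape)  # (64, 32): each output token attends to every location
```

The point of the hybrid design is visible in the shapes: convolution mixes only a local neighborhood, whereas the 64x64 attention matrix lets every spatial location contribute to every output token, which is what the table's Highlights describe as capturing long-range or inter-slice/inter-image dependencies.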