Author manuscript; available in PMC: 2022 Jun 9.

Published in final edited form as: Mach Learn Med Imaging. 2021 Sep 21;12966:692–702. doi: 10.1007/978-3-030-87589-3_71

Table 1:

ViT underperforms the CNN on image-level PE classification. For both architectures (ViT-B_32 and ViT-B_16), random initialization gives the lowest AUC. Increasing the image size and reducing the patch size both yield more patches per image, which in effect enlarges the training set and improves performance. Finally, as with CNNs, initializing ViTs with ImageNet21k weights provides a substantial gain (for example, from 0.8385 to 0.9065 AUC for ViT-B_16 at image size 512), indicating the usefulness of transfer learning.

PE AUC with vision transformer (ViT)
Model	Image Size	Patch Size	Initialization	Val AUC
SeXception	576	NA	ImageNet	0.9634
ViT-B_32	512	32	Random	0.8212
ViT-B_32	224	32	ImageNet21k	0.8456
ViT-B_32	512	32	ImageNet21k	0.8847
ViT-B_16	512	16	Random	0.8385
ViT-B_16	224	16	ImageNet21k	0.8826
ViT-B_16	512	16	ImageNet21k	0.9065
ViT-B_16	576	16	ImageNet21k	0.9179
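
The caption's point about image size and patch size can be made concrete by counting patches per image. The following is a minimal sketch, assuming square images split into non-overlapping square patches, as in the standard ViT formulation; the helper name `num_patches` is hypothetical and is not taken from the paper's code.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image.

    Assumes image_size is an exact multiple of patch_size, which holds
    for every ViT configuration listed in Table 1.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# (model, image size, patch size) for the ViT rows of Table 1
configs = [
    ("ViT-B_32", 224, 32),
    ("ViT-B_32", 512, 32),
    ("ViT-B_16", 224, 16),
    ("ViT-B_16", 512, 16),
    ("ViT-B_16", 576, 16),
]

for model, image_size, patch_size in configs:
    n = num_patches(image_size, patch_size)
    print(f"{model}\t{image_size}\t{patch_size}\t{n} patches")
```

Under these assumptions the patch count rises from 49 (ViT-B_32 at 224) to 1296 (ViT-B_16 at 576), the same ordering in which the ImageNet21k-initialized rows improve in Val AUC.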