
Table 1:

ViT performs worse than the CNN baseline (SeXception) for image-level PE classification. For both variants (ViT-B_32 and ViT-B_16), random initialization yields the worst performance. Increasing the image size and reducing the patch size both increase the number of patch tokens the transformer processes per image, and both improve performance (a sketch of this arithmetic follows the table). Finally, as with CNNs, initializing ViTs with ImageNet-21k pre-trained weights provides a significant performance gain, underscoring the usefulness of transfer learning.

PE AUC with vision transformer (ViT):

| Model      | Image Size (px) | Patch Size (px) | Initialization | Validation AUC |
|------------|-----------------|-----------------|----------------|----------------|
| SeXception | 576             | N/A             | ImageNet       | 0.9634         |
| ViT-B_32   | 512             | 32              | Random         | 0.8212         |
| ViT-B_32   | 224             | 32              | ImageNet-21k   | 0.8456         |
| ViT-B_32   | 512             | 32              | ImageNet-21k   | 0.8847         |
| ViT-B_16   | 512             | 16              | Random         | 0.8385         |
| ViT-B_16   | 224             | 16              | ImageNet-21k   | 0.8826         |
| ViT-B_16   | 512             | 16              | ImageNet-21k   | 0.9065         |
| ViT-B_16   | 576             | 16              | ImageNet-21k   | 0.9179         |
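To make the patch arithmetic behind these trends concrete, the sketch below computes the token count for each configuration in Table 1 and shows one way to start from ImageNet-21k pre-trained weights. It is a minimal illustration, not the authors' code; the timm checkpoint name and the img_size keyword are assumptions about that library's API.

```python
import timm  # assumed dependency; any ViT implementation with resizable
             # position embeddings would serve the same purpose

def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """A ViT splits an image into (size/patch)^2 non-overlapping patches,
    each of which becomes one input token (plus a class token)."""
    assert image_size % patch_size == 0, "image size must be divisible by patch size"
    return (image_size // patch_size) ** 2

# Token counts for the configurations listed in Table 1:
for img, ps in [(224, 32), (512, 32), (224, 16), (512, 16), (576, 16)]:
    print(f"{img}px image, {ps}px patches: {num_patch_tokens(img, ps)} tokens")
# 224/32 -> 49, 512/32 -> 256, 224/16 -> 196, 512/16 -> 1024, 576/16 -> 1296

# One way to load ImageNet-21k weights (the checkpoint name is an assumption
# about timm's catalogue; timm interpolates the position embeddings when
# img_size differs from the pre-training resolution):
model = timm.create_model(
    "vit_base_patch16_224.augreg_in21k",
    pretrained=True,
    img_size=576,    # best-performing resolution in Table 1
    num_classes=1,   # single logit for image-level PE classification
)
```

At 576x576 with 16-pixel patches, ViT-B_16 processes 1296 tokens per image, over 26x the 49 tokens that ViT-B_32 sees at 224x224, which is consistent with the AUC gap between those two rows.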