Fig. 2.

Illustration of the six stages of the pre-trained ViT: an input layer, four successive transformer stages, and a classifier head. The re-trained layers are enclosed in black dashed lines. Note that the patch merging layer, enclosed in the red dashed line, is re-trained only in the first transformer stage.