Fig. 4.
Encoder architecture. The encoder uses ResNet50 with ImageNet pre-trained weights as its backbone. The backbone is built from two types of residual block: in ResBlock1 the input and output sizes differ, whereas in ResBlock2 they are the same. A final average pooling layer produces a 2048-dimensional vector that is passed to the projection MLP.
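
To make the shape bookkeeping in the caption concrete, the following is a minimal sketch that traces tensor shapes through a ResNet50-style encoder. It rests on several assumptions not stated in the caption: the standard ResNet50 layout (four stages of 3, 4, 6, and 3 blocks with 256 to 2048 output channels), a 224 x 224 input, and a mapping in which the first block of each stage plays the role of ResBlock1 and the remaining blocks play the role of ResBlock2. The names `STAGES`, `_down`, and `trace_encoder` are hypothetical and used for illustration only.

```python
# Sketch only: shape trace for a ResNet50-style encoder.
# Assumed (not taken from the figure): standard ResNet50 stage layout,
# and that the first block of each stage corresponds to ResBlock1.

STAGES = [  # (number of blocks, output channels, stride of first block)
    (3, 256, 1),
    (4, 512, 2),
    (6, 1024, 2),
    (3, 2048, 2),
]


def _down(n, stride):
    """Spatial size after a padded strided layer (ceiling division)."""
    return (n + stride - 1) // stride


def trace_encoder(height=224, width=224):
    """Return a list of (layer name, shape) pairs; shape is (C, H, W)."""
    trace = []
    # Stem: 7x7 convolution with stride 2, then 3x3 max pooling with stride 2.
    c = 64
    h = _down(_down(height, 2), 2)
    w = _down(_down(width, 2), 2)
    trace.append(("stem", (c, h, w)))
    for stage, (n_blocks, out_c, stride) in enumerate(STAGES, start=1):
        for block in range(n_blocks):
            if block == 0:
                # ResBlock1: channel count (and possibly resolution) changes,
                # so input and output sizes differ.
                kind = "ResBlock1"
                c, h, w = out_c, _down(h, stride), _down(w, stride)
            else:
                # ResBlock2: input and output sizes are identical.
                kind = "ResBlock2"
            trace.append((f"stage{stage}.{kind}", (c, h, w)))
    # Global average pooling collapses H and W, leaving one value per channel.
    trace.append(("avgpool", (c,)))
    return trace


if __name__ == "__main__":
    for name, shape in trace_encoder():
        print(f"{name:20s} {shape}")
```

Under these assumptions the last entry of the trace is the 2048-dimensional vector that the caption describes as the input to the projection MLP.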
