| CNNs | Convolutional Neural Networks |
| ViTs | Vision Transformers |
| COCO | Common Objects in Context |
| mIoU | Mean Intersection over Union |
| Acc | Accuracy |
| SIFT | Scale-Invariant Feature Transform |
| SVMs | Support Vector Machines |
| ASIFT | Affine Scale-Invariant Feature Transform |
| SURF | Speeded-Up Robust Features |
| ORB | Oriented FAST and Rotated BRIEF |
| FAST | Features from Accelerated Segment Test |
| BRIEF | Binary Robust Independent Elementary Features |
| HOG | Histogram of Oriented Gradients |
| SPM | Spatial Pyramid Matching |
| DIAL | Domain Invariant Adversarial Learning |
| DAT | Domain-wise Adversarial Training |
| DCT | Discrete Cosine Transform |
| PyramidAT | Pyramid Adversarial Training |
| SP-ViT | Spatial Prior-enhanced Vision Transformers |
| STNs | Spatial Transformer Networks |
| RViT | Rotation Invariant Vision Transformer |
| AMR | Artificial Mental Rotation |
| SPP | Spatial Pyramid Pooling |
| RiT | Rotation Invariance Transformer |
| ILSVRC2012 | ImageNet Large Scale Visual Recognition Challenge 2012 |
| FPNs | Feature Pyramid Networks |