Comparison of different hierarchical architectures for classification models
After patch embedding, the feature map has size h × w × c, where h, w, and c are its height, width, and number of channels. A patch merging operation sits between every two consecutive stages; it usually merges neighboring patches and doubles the number of channels. The feature-map resolutions differ among the single-stage, two-stage, and pyramid architectures and are expressed relative to H and W, the height and width of the input image.
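The shape bookkeeping described above can be sketched in a few lines. This is only an illustration under stated assumptions: the caption does not give the merge window, so a 2 × 2 merge (halving height and width) is assumed, and the function name `stage_shapes`, the 56 × 56 × 96 starting size, and the four-stage layout are hypothetical choices rather than values from the source.

```python
def stage_shapes(h, w, c, num_stages, merge=2):
    """Return the (height, width, channels) of the feature map at each stage.

    Assumption: patch merging between consecutive stages combines a
    merge x merge window of patches, so height and width shrink by
    `merge` while the channel count doubles.
    """
    shapes = [(h, w, c)]
    for _ in range(num_stages - 1):
        h, w, c = h // merge, w // merge, c * 2
        shapes.append((h, w, c))
    return shapes


# Hypothetical pyramid: a 56 x 56 x 96 map after patch embedding, four stages.
for stage, shape in enumerate(stage_shapes(56, 56, 96, num_stages=4), start=1):
    print(f"stage {stage}: {shape}")
```

With `num_stages=1` the function returns only the post-embedding shape, which corresponds to a single-stage architecture where no patch merging takes place.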