Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations

Author manuscript; available in PMC: 2024 Dec 5.

Published in final edited form as: Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit. 2024 Sep 16;2024:26140–26150. doi: 10.1109/cvpr52733.2024.02470

Figure 2. Layer-by-layer GradCAM analysis of the CLIP-ResNet50. Each row starts with the original image on the left, followed by four GradCAM visualizations corresponding to the four successive layers of the ResNet-50, with layer depth increasing from left to right.