2023 Nov 15;9(11):248. doi: 10.3390/jimaging9110248

Figure 1.

The cross-attention vs. the association maps. The first row consists of text images. The second and third rows consist of the cross-attention and association maps, respectively, that associate each predicted character with image regions. The last row consists of text transcriptions. The cross-attention map is obtained from a Transformer decoder, while the association map is obtained from a ViT-CTC model. Best viewed in color.