
Fig. 2

The cross-attention module takes the upsampled feature maps as the query and the projected word-level embeddings from the language model as both the key and the value. It computes pixel-level attention maps, which are then used to weight the decoded feature maps.
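For illustration, a minimal PyTorch sketch of such a cross-attention module is given below. It is not the authors' implementation; the module name, projection layers, dimension names, and the sigmoid gating used to weight the decoder features are assumptions made for this example.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Sketch: image features attend to word-level text embeddings."""
    def __init__(self, feat_dim: int, text_dim: int, attn_dim: int = 256):
        super().__init__()
        self.q_proj = nn.Conv2d(feat_dim, attn_dim, kernel_size=1)  # queries from feature maps
        self.k_proj = nn.Linear(text_dim, attn_dim)                 # keys from word embeddings
        self.v_proj = nn.Linear(text_dim, attn_dim)                 # values from word embeddings
        self.out_proj = nn.Conv2d(attn_dim, feat_dim, kernel_size=1)
        self.scale = attn_dim ** -0.5

    def forward(self, feats: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) upsampled decoder feature maps
        # words: (B, L, D) word-level embeddings from the language model
        B, C, H, W = feats.shape
        q = self.q_proj(feats).flatten(2).transpose(1, 2)  # (B, H*W, attn_dim)
        k = self.k_proj(words)                             # (B, L, attn_dim)
        v = self.v_proj(words)                             # (B, L, attn_dim)

        # Pixel-level attention: each spatial location attends over the word tokens.
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, H*W, L)
        ctx = (attn @ v).transpose(1, 2).reshape(B, -1, H, W)             # back to spatial layout

        # Weight the decoder feature maps with the text-conditioned context (assumed gating).
        return feats * torch.sigmoid(self.out_proj(ctx))
```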