Author manuscript; available in PMC: 2022 Oct 17.
Published in final edited form as: Med Image Comput Comput Assist Interv. 2021 Sep 21;12902:273–283. doi: 10.1007/978-3-030-87196-3_26

Fig. 2.

Local MI Maximization. First, we randomly select a sentence from the text and encode it into a sentence-level feature. The corresponding image is encoded into an M×M×D feature block. We estimate the MI between each local image feature and the sentence feature. Note that MI estimation also requires shuffled image-text data, which is not illustrated in this diagram. We then select the local image feature with the highest MI and update the image encoder, the text encoder, and the MI discriminator so that the local MI between that image feature and the sentence feature is maximized.
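The selection step described in the caption can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: it assumes an InfoNCE-style per-sample lower bound as the MI estimator, a bilinear discriminator with a hypothetical weight matrix `W`, and the other sentences in the batch as the shuffled pairs. The function names `local_infonce_scores` and `select_max_region` are made up for this example, and the gradient update of the encoders and discriminator is omitted.

```python
import numpy as np

def local_infonce_scores(img_feats, sent_feats, W):
    """Per-image, per-region InfoNCE-style MI terms (assumed estimator).

    img_feats:  (B, M, M, D) local image features
    sent_feats: (B, D) sentence features, row b paired with image b
    W:          (D, D) bilinear discriminator weights (hypothetical)
    Returns an array of shape (B, M*M).
    """
    B, M, _, D = img_feats.shape
    local = img_feats.reshape(B, M * M, D)
    # scores[b, n, c] = local[b, n] @ W @ sent_feats[c]
    scores = np.einsum("bnd,de,ce->bnc", local, W, sent_feats)
    # Numerically stable log-mean-exp over all candidate sentences c;
    # candidates with c != b play the role of the shuffled pairs.
    mx = scores.max(axis=2, keepdims=True)
    log_mean_exp = mx[..., 0] + np.log(np.exp(scores - mx).mean(axis=2))
    # Score of the true pairing (c == b) for every region, shape (B, M*M).
    pos = scores[np.arange(B), :, np.arange(B)]
    return pos - log_mean_exp

def select_max_region(mi, M):
    """Pick, for each image, the region with the highest estimated MI."""
    flat = mi.argmax(axis=1)
    rows, cols = np.unravel_index(flat, (M, M))
    return rows, cols, mi.max(axis=1)

# Usage with random stand-ins for the encoder outputs.
rng = np.random.default_rng(0)
B, M, D = 8, 4, 16
img_feats = rng.normal(size=(B, M, M, D))
sent_feats = rng.normal(size=(B, D))
W = rng.normal(size=(D, D)) / np.sqrt(D)

mi = local_infonce_scores(img_feats, sent_feats, W)
rows, cols, best = select_max_region(mi, M)
loss = -best.mean()  # minimizing this maximizes the selected local MI terms
```

In a training loop, `loss` would be differentiated with respect to the image encoder, the text encoder, and the discriminator parameters using an autodiff framework; only the selected region of each image contributes to the objective.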