2021 Jun 4;15(6):e12417. doi: 10.1111/lnc3.12417

FIGURE 2.

Datasets of natural images: The task of answering a question about an image has been promoted by the release of datasets pairing each image with a question about it, such as VQA v1.0 (Antol et al., 2015). By balancing answers across pairs of similar images, later datasets have pushed models to build finer‐grained representations (see VQA v2.0; Goyal et al., 2017). The release of densely annotated datasets, such as Visual Genome (Krishna et al., 2017), made it possible to tackle the challenge of building multimodal representations of relations between objects. This paved the way for resources such as GQA (Hudson & Manning, 2019), which include compositional questions involving such relations.