An effective spatial relational reasoning networks for visual question answering

Xiang Shen; Dezhi Han; Chongqing Chen; Gaofeng Luo; Zhongdai Wu

doi:10.1371/journal.pone.0277693

2022 Nov 28;17(11):e0277693. doi: 10.1371/journal.pone.0277693


Editor: Sriparna Saha⁴

PMCID: PMC9704574 PMID: 36441742

Abstract

Visual Question Answering (VQA) is the task of answering natural-language questions about the content of images, and it has attracted wide attention from researchers. Existing research on VQA models mainly focuses on attention mechanisms and multi-modal fusion. It attends only to the visual semantic features of the image during image modeling and ignores the importance of modeling the spatial relationships between visual objects. To address these problems, we propose an effective spatial relational reasoning network that combines visual object semantic reasoning with spatial relationship reasoning to realize fine-grained multi-modal reasoning and fusion. In the semantic reasoning module, a sparse attention encoder is designed to capture contextual information effectively. In the spatial relationship reasoning module, a graph neural network attention mechanism models the spatial relationships of visual objects, which helps the model answer complex spatial reasoning questions correctly. Finally, a practical compact self-attention (CSA) mechanism is designed to reduce the redundancy of the linear transformations in self-attention and the number of model parameters, and to improve the model’s overall performance. Quantitative and qualitative experiments are conducted on the VQA 2.0 and GQA benchmark datasets. The experimental results demonstrate that the proposed method performs favorably against state-of-the-art approaches. Our best single model has an overall accuracy of 71.18% on the VQA 2.0 dataset and 57.59% on the GQA dataset.

1. Introduction

Visual Question Answering (VQA) [1] is an emerging field at the intersection of Computer Vision (CV) and Natural Language Processing (NLP) in Artificial Intelligence (AI). Given an image and a related natural-language question supplied by the user, the model must predict the correct answer, which is a very challenging task. It requires the model to reason about and understand images and text simultaneously. The goal is to give machines the ability to understand vision and language as humans do. In recent years, various multi-modal analysis tasks have emerged that break the boundaries between vision and language, e.g., image captioning [2], visual descriptions [3], cross-modal information retrieval [4–6], and visual question answering [1]. Compared with the other tasks, VQA requires the model to fully understand the input image and answer questions in natural language. VQA is also applied in real-life scenarios, such as assisting the blind and early childhood education, so it has a wide range of practical applications. Given these challenges and this significance, VQA has received increasing research attention at the intersection of computer vision and natural language processing.

In recent years, researchers have explored multi-modal learning and visual reasoning over text and image features. The most advanced VQA methods [7–11] mainly focus on learning a multi-modal joint representation of images and questions. Specifically, early VQA models use a Convolutional Neural Network (CNN) to extract global features of an image and a Bag-of-words (BOW) model to extract text features of the question. After the global image features are obtained from the visual feature extractor, multi-modal fusion is used to learn a joint representation, which represents the alignment between each region and the question. This joint representation is then fed into an answer prediction module to produce an answer. However, using global image features as the visual input of the model may introduce noisy information. In addition, the joint embedding learning method only maps the question and the image and lacks a reasoning process, which lowers the accuracy of the model’s answers. Researchers later introduced an attention mechanism, which uses image features and the distribution of questions to answers to alleviate noisy information and gives the model simple reasoning capabilities. For example, Yang et al. [11] proposed a stacked attention network, which focuses attention hierarchically and locates the relevant image region iteratively. Lu et al. [8] proposed a hierarchical co-attention model; learning the co-attention of vision and text simultaneously is more conducive to a fine-grained representation of images and questions and thus to predicting answers more accurately. Yu et al. [12] used the complementarity between visual attention and semantic attention to propose a novel multi-level attention network that enhances the fine-grained analysis of image understanding. Anderson et al. [13] were the first to propose detecting salient objects in images and then using a top-down attention mechanism to learn object-level attention weights. Kim et al. [14] proposed a bilinear attention network and discussed high-order multi-modal fusion strategies to better combine text information with visual information. In addition, researchers proposed the co-attention modules BAN [14], DCN [15], DFAF [16], and MCAN [17], which can be stacked to capture high-level information across the visual and language domains and to obtain fine-grained multi-modal interaction features. These models have achieved state-of-the-art performance on benchmark datasets. However, these co-attention modules are still insufficient to model the complex inference features required for VQA tasks.

The VQA models introduced above mainly focus on attention mechanisms and modal fusion strategies. The co-attention mechanism can only realize simple implicit relational reasoning. However, most VQA tasks require understanding of, and explicit reasoning over, the image supplied by the user. Visual-spatial relationship reasoning plays a significant role in modern VQA tasks. The models described above do not consider the spatial position relationships of visual objects; instead, they integrate the position features of an object into its visual features, which leaves the model without relational reasoning ability. Modeling the spatial relationships of visual objects is very difficult because the objects can be located anywhere in the image, have different scales, and belong to different categories, and different images contain different numbers of visual objects. For example, to answer the question “What is the animal in the picture?” in Fig 1(a), the VQA model only needs to detect the “giraffe” object in the image and does not need to understand the entire image’s content. For the question “What is under the cat?” in Fig 1(b), the model must first locate the two objects, cat and laptop, then model the spatial relationship between them, and fully understand the spatial concept of “under” before giving the correct answer. We believe that combining the deep co-attention mechanism with the spatial relational reasoning mechanism can further improve the performance of the VQA model.

Inspired by related works [17–21], we designed an effective spatial relational reasoning network (SRRN) for visual question answering based on a deep co-attention mechanism. A novel sparse attention mechanism is introduced into the question encoder, which explicitly selects the most relevant question features and strengthens the selection of image features in question-guided attention. The sparse attention encoder avoids introducing irrelevant information into the model and enhances the interaction between modalities. In addition, we designed an efficient visual object spatial relationship reasoning network based on a graph neural network, which captures relationships between objects beyond static object and region detection. In this way, object spatial relationship modeling is realized so that the model can obtain fine-grained image features and visual object concepts, which can be used to answer complex spatial relationship reasoning questions. Considering the complexity and efficiency of model computation, we propose a compact self-attention (CSA) mechanism based on the existing model, which reduces the redundancy in the linear transformations, improves computational efficiency, and improves the model’s accuracy. The quantitative and qualitative experimental results on the VQA 2.0 [22] and GQA [23] datasets show that the SRRN model is an effective object spatial relationship reasoning network. In summary, our contributions are as follows:

A novel spatial relationship reasoning network model is proposed that effectively models the spatial position relationships and attribute relationships of visual objects. The SRRN model can simultaneously model visual object semantic features and spatial position relationship features, which is vital for predicting the correct answer in VQA tasks.
A sparse attention encoder is designed to avoid introducing irrelevant information into the model when modeling question feature relationships, while also improving modal interaction. The sparse attention encoder also dynamically captures the visual object features most relevant to each question.
A compact self-attention (CSA) optimization algorithm reduces the redundancy in the linear transformations of self-attention. This method improves the computational efficiency of the model and improves its accuracy over the initial model.
The SRRN model was tested on the benchmark datasets VQA 2.0 and GQA and obtained satisfactory accuracy, and ablation experiments analyze which configuration of the SRRN model performs best. Visual examples further reveal the interpretability of the model.

2. Related works

2.1 Visual question answering

The VQA task has received increasing attention from researchers in the past few years. Traditional VQA methods map the image and question features into a common high-dimensional space and send them to a classifier. However, a global feature representation introduces noise and degrades the model. Many researchers have therefore proposed attention mechanisms for the VQA task to improve model performance. For example, [7–10, 24, 25] use CNN-based networks to explore various image attention mechanisms that localize question-related regions. Related works [8, 9, 26] propose question-guided image attention and image-guided question attention and fuse the extracted image and question features through an encoder and a decoder. Multimodal fusion is crucial in VQA models. Traditional multimodal fusion methods map image features and question features to joint space embeddings, and recent studies [14, 27–30] have explored more complex multimodal fusion methods. With the advent of pre-trained models, multi-task learning in vision and language can effectively promote alignment between different modalities so that the models show good performance. In early visual question answering methods, the visual and language modules were pre-trained independently, failing to capture the connection between the visual dataset and the language dataset. In recent years, researchers have applied Transformer-based models to vision-language tasks; representative works include ViLBERT [31], VLBERT [32], LXMERT [33], and VisualBERT [34]. These models achieve great results: pre-training the base model on large-scale visual and question datasets and transferring it to downstream tasks can significantly improve performance. However, in research on network models for VQA and downstream tasks, training the network end to end [7–9, 12, 16, 17, 20] can better capture the modal information of images and texts and can effectively improve model performance.

Explicit modeling of object interactions represented by graphs has attracted more and more attention. Object-relational reasoning ability is a core component of modern AI. At present, reasoning is mainly divided into implicit reasoning and explicit reasoning. In [35], a graph-based method is proposed that combines questions and abstract images with graph neural networks, showing the great potential of graph neural networks in the VQA task. [36, 37] use simple graph neural networks to implicitly infer the object relationships between image regions. In recent years, explicit reasoning methods for modeling images have also been widely used in many tasks [37–39]. Many researchers use graph neural networks to address counting problems in VQA tasks. [40] proposed a graph network that computes outer products of feature attention weights; this method only uses feature changes to improve the baseline model’s counting ability. [41] proposed an iterative method based on object similarity to improve the counting ability and interpretability of the model. Both methods aim to eliminate duplicate object detections and are evaluated on counting. Wang et al. proposed VQA-Machine [42], which clarified the semantic relationship between objects. Although these models can significantly improve VQA performance, they cannot simultaneously model the spatial relationships of visual objects and their semantic representation. The test results of the SRRN model on the VQA 2.0 and GQA datasets show that our model significantly improves counting ability and can model the spatial relationships between visual objects and their visual representation simultaneously.

2.2 Attention mechanisms

The basic idea of the attention mechanism in VQA models is to attend to the specific visual regions of the image and the specific words of the text that are relevant to the input question, providing more useful information for answering it. Bahdanau et al. [43] first used the attention mechanism to improve neural machine translation (NMT). After that, the attention mechanism was widely used in VQA models and became an essential part of the VQA task. At present, attention mechanisms for VQA fall mainly into three categories. The first category focuses on image regions under question guidance. Most of these methods use top-down image attention and express question features and image features as vectors to convey the concept of visual objects in an image region. The attended visual representation is generated by calculating the average value of all the visual features of interest. Yang et al. [11] proposed Stacked Attention Networks (SANs), which use the obtained context vectors to attend repeatedly to regions in the image and obtain more accurate image features. Zhu et al. [44] combined a CNN with Long Short-term Memory (LSTM) to generate an attention map for each output word. DAN [24], a differential attention network, uses one or more supporting and opposing exemplars to obtain a differential attention region, making it closer to human attention. The second category is the co-attention mechanism. It considers both visual attention and textual attention (which words are more important in the question), using a hierarchical co-attention model. The image representation can guide the attention over the question, and the question representation can guide the attention over the image. Nam et al. [9] proposed a dual attention network that collects necessary information through multi-step processing of specific areas in the image and keywords in the question. However, because these attention models learn only coarse multimodal interactions, it is difficult for them to infer the correlation between images and questions. Yu et al. [17] proposed the Deep Modular Co-Attention Networks (MCAN) model, which overcomes this shortcoming by modeling dense attention within each modality (that is, the relationships between words in the text and between regions in the image) at the same time. The third category is the object detection attention mechanism. Anderson et al. [13] used the object detection network Faster R-CNN to achieve bottom-up attention, segmented the image into specific objects for screening, and selected the top K proposals in the picture as visual features. Lu et al. [45] proposed combining free-form attention and detection attention to better expand the breadth of detection categories.

2.3 Visual relational reasoning

With the development of deep learning and machine learning, researchers are also exploring the relationships between visual objects in the VQA task. Early works [46–48] used object relationships (e.g., position and size [48]) as a post-processing step for object detection that re-scores the detected objects. In addition, some previous works [49, 50] explored the spatial relationship between visual objects to help the model better understand the positional relationships of objects in the image. Recently, visual relational reasoning has been introduced to VQA tasks, where it helps the model answer questions that require logical understanding, and it has received extensive attention from researchers. For example, visual object spatial relationship reasoning helps image captioning [51, 52] and improves image search [53, 54] and object localization. Recent visual object-relational reasoning [55, 56] focuses more on semantic relational reasoning than on object-spatial relational reasoning. Some neural networks are also used for visual relationship prediction tasks [57, 58]. Most existing research on visual object-relational reasoning focuses on implicit relations and does not use explicit semantic or spatial relations to construct graphs. Instead, it models object interactions implicitly through attention modules or through fully connected graphs over high-level image inputs [19, 59]. For example, in [60, 61], the bilinear fusion method MuRel cell is introduced to model object relationships. Yu et al. [61] designed a visual relationship reasoning module that reasons about pairwise and intra-group visual relationships between visual objects to enhance visual representation at the relationship level.

3. Methodology

The VQA task is defined as follows: given a question q about an image I, the goal is to predict an answer $\hat{a} \in A$ that matches the ground-truth answer. As is common in the VQA literature, it is formulated as a classification problem:

\begin{matrix} \hat{a} = \underset{a \in A}{\arg\max} \, p_{θ}(a \mid I, q) \end{matrix}

(1)

where p_θ denotes the trained model with parameters θ.

In this section, we describe the SRRN model in detail. The overall flowchart of the SRRN model is shown in Fig 2. It is mainly composed of question and image feature extraction, a visual object semantic reasoning module, a spatial relation reasoning module, and a modal fusion and answer prediction module. We first describe how to extract the image and question features, then explain the visual object spatial relationship reasoning module and the semantic reasoning module. Finally, the modal fusion and answer prediction module is described. In the visual object semantic reasoning module, we introduce the sparse attention mechanism encoder and the compact self-attention (CSA) optimization strategy in the co-attention module.

3.1 Question and image representation

Following the MCAN [17] model, we make all questions the same length. We first trim each input question to a maximum of S words by simply discarding the extra words of any question longer than S words. Each word is transformed into a 300-D vector using GloVe word embeddings [62] pre-trained on a large-scale corpus. The question is thus converted into a sequence of word embeddings {e₁, e₂, ⋯, e_S}, which is then passed through a bi-directional GRU (Bi-GRU) to output the word representations as follows:

\begin{matrix} \vec{q}_{n} = \mathrm{Bi\text{-}GRU}(\vec{q}_{n - 1}, e_{n}) \end{matrix}

(2)

\begin{matrix} \overset{\leftarrow}{q}_{n} = \mathrm{Bi\text{-}GRU}(\overset{\leftarrow}{q}_{n + 1}, e_{n}) \end{matrix}

(3)

where $\vec{q}_{n}$ is the output value of the forward hidden layer, and $\overset{\leftarrow}{q}_{n}$ is the output value of the backward hidden layer. Each question can be represented by a matrix $Q_{q} = \{q_{1}, q_{2}, \dots, q_{S}\} \in R^{d_{q} \times S}$, where $q_{n} = [\vec{q}_{n}, \overset{\leftarrow}{q}_{n}]$, and [⋅, ⋅] denotes concatenation.

Inspired by bottom-up attention [13], we use a pre-trained Faster R-CNN with ResNet-101 [63] to extract object detection boxes from the input image and to identify a sequence of visual objects $V_{I} = \{v_{i}\}_{i = 1}^{K}$. Each visual object is associated with a visual representation vector $v_{i} \in R^{d_{v}}$ and a bounding-box feature vector $g_{i} \in R^{d_{g}}$. We set K = 36, d_v = 2048, and d_g = 4 in the experiments. Each g_i = [x, y, w, h] is a 4-dimensional spatial coordinate, where (x, y) is the upper left corner of the bounding box, and (h, w) are the height and width of the box. K salient objects are obtained by selecting candidate boxes, so the image output feature is $V_{I} \in R^{K \times 2048}$. The number of detected objects K was chosen based on comparative experiments and the available computing resources; K = 36 gives better performance and higher computational efficiency. In the experiments, we apply a linear transformation to V_I to make the image feature dimension consistent with the question feature dimension. Therefore, the final image feature is $V_{I} \in R^{K \times 512}$.

3.2 Spatial reasoning of visual objects

The visual object spatial relationship reasoning proposed in this paper dynamically captures the relationships between objects in an image. For VQA tasks, different question types involve different visual object relationships. The spatial relationship features of visual objects are mainly composed of appearance features and geometric features. The appearance feature is the attended image feature output by the co-attention module, and the geometric feature is the 4-dimensional bounding box g_i of the visual object. Given K visual objects for self-attention learning, the generated hidden relationship features $\{v_{i}\}_{i = 1}^{K}$ represent the relationship between each target object and its adjacent objects. The graph attention formula for the spatial relationship of each object is as follows:

\begin{matrix} v_{i} = σ \left( \sum_{j \in N_{i}} α_{i j} \cdot W v_{j}^{'} \right) \end{matrix}

(4)

The definition of the attention coefficient α_ij in Eq (4) differs across VQA tasks, as do the projection matrix $W \in R^{d_{h} \times (d_{q} + d_{v})}$ and the neighborhood N_i of the target object. σ(⋅) is the activation function.

Since the reasoning graph between visual objects is fully connected, each N_i includes the object itself and all other visual objects in the image. Inspired by [19], we designed the attention weight α_ij to depend on both the visual weight $α_{i j}^{v}$ and the bounding-box (geometric) weight $α_{i j}^{g}$. The specific equation is as follows:

\begin{matrix} α_{i j} = \frac{α_{i j}^{g} \cdot exp (α_{i j}^{v})}{\sum_{j = 1}^{K} α_{i j}^{g} \cdot exp (α_{i j}^{v})} \end{matrix}

(5)

where $α_{i j}^{v}$ represents the visual similarity between object i and object j and is calculated with the scaled dot product [64]. The specific equation is as follows:

\begin{matrix} α_{i j}^{v} = (U v_{i}^{'})^{T} \cdot V v_{j}^{'} \end{matrix}

(6)

where $U, V \in R^{d_{h} \times (d_{q} + d_{v})}$ are projection matrices, and $α_{i j}^{g}$ measures the relative geometric position between object i and object j. The specific equation is as follows:

\begin{matrix} α_{i j}^{g} = \max \{ 0, w \cdot f_{g}(g_{i}, g_{j}) \} \end{matrix}

(7)

where f_g(⋅, ⋅) first calculates a 4-dimensional relative geometric feature $(\log(\frac{x_{i} - x_{j}}{w_{i}}), \log(\frac{y_{i} - y_{j}}{h_{i}}), \log(\frac{w_{j}}{w_{i}}), \log(\frac{h_{j}}{h_{i}}))$ and then embeds it into a d_h-dimensional feature by computing cosine and sine functions of different wavelengths. The vector $w \in R^{d_{h}}$ converts the d_h-dimensional feature into a scalar weight. The model clips the scalar weight at 0 so that only objects with certain geometric relationships contribute. g_i and g_j are the 4-dimensional bounding-box coordinates of objects i and j, from which the relative geometric position relationship of the visual objects is computed.
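The geometric weight of Eq (7) and the normalization of Eq (5) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the embedding size (8 values per geometric feature), the base wavelength of 1000, and the use of absolute differences floored at a small `eps` (so that the logarithm stays defined) are all choices made here for the example.

```python
import math

def relative_geometry(gi, gj, eps=1e-3):
    """4-d relative geometry between two boxes given as (x, y, w, h).

    Assumption: absolute differences floored at eps keep the logarithm
    defined; the text itself writes log((x_i - x_j) / w_i).
    """
    xi, yi, wi, hi = gi
    xj, yj, wj, hj = gj
    return [
        math.log(max(abs(xi - xj), eps) / wi),
        math.log(max(abs(yi - yj), eps) / hi),
        math.log(wj / wi),
        math.log(hj / hi),
    ]

def sinusoidal_embedding(feats, dim_per_feat=8, wavelength=1000.0):
    """Embed each scalar with sines and cosines of different wavelengths."""
    half = dim_per_feat // 2
    out = []
    for f in feats:
        for k in range(half):
            freq = 1.0 / (wavelength ** (k / half))
            out.append(math.sin(f * freq))
            out.append(math.cos(f * freq))
    return out

def geometric_weight(gi, gj, w):
    """Eq (7): alpha^g_ij = max(0, w . f_g(g_i, g_j)); w is a learned vector."""
    emb = sinusoidal_embedding(relative_geometry(gi, gj))
    return max(0.0, sum(wk * ek for wk, ek in zip(w, emb)))

def combine_weights(alpha_g_row, alpha_v_row):
    """Eq (5): normalise alpha^g_ij * exp(alpha^v_ij) over the neighbours j."""
    m = max(alpha_v_row)  # subtract the max for numerical stability
    scores = [g * math.exp(v - m) for g, v in zip(alpha_g_row, alpha_v_row)]
    total = sum(scores)
    if total == 0.0:
        return [0.0] * len(scores)  # every neighbour was clipped to zero
    return [s / total for s in scores]
```

Because the geometric weight multiplies the exponentiated visual score, a neighbour whose geometric weight is clipped to 0 receives exactly zero attention regardless of its visual similarity.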

In addition, to strengthen the spatial reasoning relationship features of visual objects, we extend the above graph attention mechanism with multi-head attention. Using M independent attention heads and concatenating their output features, the following features are obtained:

\begin{matrix} v_{i}^{*} = \Vert_{m = 1}^{M} \, σ \left( \sum_{j \in N_{i}} α_{i j}^{m} \cdot W^{m} v_{j}^{'} \right) \end{matrix}

(8)

Finally, $v_{i}^{*}$ is added to the original visual features v_i, as the final reasoning feature of spatial object relation.
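The multi-head aggregation of Eq (8) and the final residual addition can be illustrated with a short sketch. It rests on assumptions that the text does not state: σ is taken to be ReLU, the concatenated head outputs are assumed to have the same dimension as v_i so that the addition is defined, and all function names are hypothetical.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * t for w, t in zip(row, v)) for row in W]

def graph_attention_head(alpha_row, W, feats):
    """One head of Eq (8): sigma(sum_j alpha_ij * W v_j'), sigma assumed ReLU."""
    acc = [0.0] * len(W)
    for a, v in zip(alpha_row, feats):
        proj = matvec(W, v)
        acc = [s + a * p for s, p in zip(acc, proj)]
    return [max(0.0, t) for t in acc]

def multi_head_relation(v_i, alpha_rows, Ws, feats):
    """Concatenate the M head outputs and add them to the original feature v_i.

    alpha_rows[m] holds the attention weights of object i in head m and
    Ws[m] is that head's projection matrix.
    """
    concat = []
    for alpha_row, W in zip(alpha_rows, Ws):
        concat.extend(graph_attention_head(alpha_row, W, feats))
    if len(concat) != len(v_i):
        raise ValueError("concatenated heads must match the size of v_i")
    return [a + b for a, b in zip(v_i, concat)]
```

For example, two heads that each project to one dimension produce a 2-dimensional relation feature, which is then added element-wise to a 2-dimensional v_i.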

3.3 Semantic reasoning of visual objects

The visual semantic reasoning module is formed by stacking encoders and decoders, similar to the MCAN module. Unlike MCAN, we use a sparse attention mechanism in the encoder, which helps to capture text context information better. At the same time, the sparse attention encoder prevents irrelevant information from being introduced into the model, which enhances the robustness of the model and the semantic reasoning ability over visual object features. Experiments verify that the sparse attention encoder is effective in retaining important features and removing noise.

3.3.1 Sparse mechanism encoder

We first introduce the sparse attention mechanism encoder, as shown in Fig 3. The sparse attention mechanism encoder is a simplified transformer model that includes a multi-head dot-product attention layer [64] and several fully connected layers. Self-attention in the traditional transformer encoder can model long-range dependencies in the question. However, when question features are modeled with self-attention, context-irrelevant information is also introduced into the model and distracts its attention weights. To solve this problem, we use a sparse attention mechanism encoder as the question encoder, which improves the concentration of attention on the global context by explicitly selecting the most relevant segments. This helps the question features guide attention to the important regions of the image.

Fig 3 shows the structure of the sparse attention mechanism encoder. The question features are mapped to query, key, and value vectors through linear transformations, and the similarity between queries and keys determines the attention weight between question words. For convenience, we denote the inputs of the sparse dot-product attention as the query Q_E[l_Q, d], key K_E[l_K, d], and value V_E[l_V, d]. First, the dot product between the queries and the keys yields a similarity matrix, which is then divided by $\sqrt{d}$ to obtain the attention score matrix. The specific equation is as follows:

\begin{matrix} W = s (Q_{E}, K_{E}) = \frac{Q_{E} K_{E}^{T}}{\sqrt{d}} \end{matrix}

(9)

We assume that the higher the score in the W matrix, the higher the correlation between word features. In practice, the attention function is computed simultaneously for a set of c queries packed into a matrix $Q_{E} \in R^{c \times d}$; similarly, t keys are packed into the matrix $K_{E} \in R^{t \times d}$. The resulting $W \in R^{c \times t}$ is the weight matrix shown in Fig 4.

The principle of the sparse attention mechanism is to eliminate the words with low weights learned by the initial scaled dot-product self-attention, so that the remaining words guide attention to the essential image regions related to the question. We assume that the higher the score, the higher the correlation, and a sparse attention masking operation M_ij is performed on W to select the δ most important elements. Specifically, for each row i of W we select the δ largest elements and record their positions (i, j), where δ is a hyperparameter. Let a_i denote the δ-th largest value in row i. If the value of the j-th component of row i is not smaller than a_i, the position (i, j) is recorded. We concatenate the thresholds of all rows to form a vector A = [a₁, a₂, …, a_c]. The mask function M_ij is defined as follows:

\begin{matrix} M_{i j} = \begin{cases} W_{i j}, & \text{if } W_{i j} \geq a_{i} \\ - \infty, & \text{if } W_{i j} < a_{i} \end{cases} \end{matrix}

(10)

\begin{matrix} \bar{M} = \mathrm{softmax}(M_{i j}) \end{matrix}

(11)

where $\bar{M}$ refers to the normalized scores. Since scores smaller than the δ-th largest score of a row are assigned −∞ by the masking function M_ij, their normalized scores, i.e., their probabilities, are approximately 0. The output representation of self-attention after selection can be calculated as:

\begin{matrix} F = \bar{M} V \end{matrix}

(12)

F is the expected value under the sparse distribution, so the sparse attention mechanism yields more concentrated attention. This attention mechanism can be extended to context attention, which is similar to ordinary self-attention but differs in that Q_E is a decoding state rather than a linear transformation of the original context. To further improve the semantic representation of visual objects, inspired by multi-head attention [64], each head uses the sparse dot-product attention function to calculate the weights of the question words in the input encoder. Unlike the ordinary multi-head attention mechanism, we keep only the weights of important question features and discard those below the threshold. In this way, we obtain the most critical weight information of the question. The calculation is defined as follows:

\begin{matrix} \mathrm{MHSAtt}(Q_{E}, K_{E}, V_{E}) = \mathrm{Concat}(\mathrm{head}_{1}, \dots, \mathrm{head}_{h}) W^{O} \end{matrix}

(13)

\begin{matrix} \mathrm{head}_{i} = \mathrm{SAtt}(Q_{E} W_{i}^{Q}, K_{E} W_{i}^{K}, V_{E} W_{i}^{V}, δ) \end{matrix}

(14)

where $W_{i}^{Q}$, $W_{i}^{K}$, and $W_{i}^{V} \in R^{d \times d_{h}}$ are the projection matrices of the i-th head, and $W^{O} \in R^{h d_{h} \times d}$ is the learned weight matrix. SAtt(⋅) represents dot-product self-attention using the sparse attention mechanism.
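A single head of the sparse attention in Eqs (9) to (12) can be sketched in plain Python as below. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, matrices are lists of rows, and δ is assumed to be at least 1.

```python
import math

def softmax(row):
    """Numerically stable softmax; masked entries (-inf) become exactly 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_attention(Q, K, V, delta):
    """Top-delta sparse scaled dot-product attention for one head.

    Q is c x d, K is t x d, V is t x d_v (lists of rows). For every query
    row, only the delta largest scores survive the mask of Eq (10).
    Returns the output F and the normalised weight matrix M-bar.
    """
    d = len(Q[0])
    outputs, weights = [], []
    for q in Q:
        # Eq (9): scaled dot-product scores of this query against all keys
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        # a_i: the delta-th largest score in this row
        a_i = sorted(scores, reverse=True)[min(delta, len(scores)) - 1]
        # Eq (10): keep scores >= a_i, mask the rest with -inf
        masked = [s if s >= a_i else -math.inf for s in scores]
        probs = softmax(masked)  # Eq (11)
        weights.append(probs)
        # Eq (12): weighted sum of the value rows
        outputs.append([sum(p * v[c] for p, v in zip(probs, V))
                        for c in range(len(V[0]))])
    return outputs, weights
```

Setting δ equal to the number of keys recovers ordinary dense attention, which makes the sparse variant easy to compare against a dense baseline. Ties at the threshold are all kept, so a row can retain more than δ entries when scores are equal.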

3.3.2 Co-attention modular

The co-attention module in the SRRN model is based on scaled dot-product attention, which is a mapping over three matrices that hold the embedded query, key, and value vectors; by convention these are named $Q, K, V \in R^{L \times d}$, where L is the sequence length of the input tokens and d is the hidden dimension. The scaled dot-product attention is then defined as:

\begin{matrix} \mathrm{Att}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V \end{matrix}

(15)

Inspired by the work in [21], we re-examined the linear transformations in self-attention. Because Q and K are linear transformations of the same input, the weight matrices W_Q and W_K are entangled during gradient backpropagation, which is a basic redundancy of the conventional self-attention mechanism. To reduce this redundancy, we propose to share the weights of K and V in the self-attention computation. We set W_K = W_V and found experimentally that this compact self-attention mechanism optimizes well. The modified equation is as follows:

\begin{matrix} \mathrm{Att}(Q, K, K) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) K \end{matrix}

(16)
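To make the weight sharing in Eq 16 concrete, the sketch below computes Att(Q, K, K) for one head, reusing the key projection as the value projection. It is an illustration under stated assumptions rather than the released code: the helper names (`softmax`, `matmul`, `compact_attention`) are hypothetical, and plain Python lists stand in for tensors.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def compact_attention(X, W_q, W_k):
    """Att(Q, K, K): the key projection is reused as the value (W_K = W_V)."""
    Q = matmul(X, W_q)
    K = matmul(X, W_k)  # shared projection, K also plays the role of V
    d = len(Q[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, K)

# Toy example: two tokens with identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
attended = compact_attention(identity, identity, identity)
```

Because the value projection is dropped, each head in this sketch carries two projection matrices instead of three, which is where the reduction in parameters comes from.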

Fig 5 shows the structure of the multi-modal co-attention module, which mainly consists of a stack of encoders and decoders. The encoder uses a sparse self-attention (SSA) mechanism to capture important question features, which then guide attention to the relevant image regions. The question feature Q_q is input into the SSA unit, and the essential question features are learned through self-attention. Specifically, the SSA unit consists of two sub-layers (see Fig 4). $Q_{q} = \{q_{1}, q_{2}, \dots, q_{S}\} \in R^{d_{q} \times S}$ is used as the question feature input to compute the relationship between each pair of words <q_i, q_j>, and the sparse question feature is then passed to the fully connected layer to obtain the weight of each word. The output matrix F₁ of the sparse encoder can be expressed as:

\begin{matrix} F_{1} = \mathrm{MHSAtt}(Q_{q}, K_{q}, K_{q}) \end{matrix}

(17)

\begin{matrix} = \mathrm{Concat}(\mathrm{head}_{1}, \dots, \mathrm{head}_{h}) W^{O} \end{matrix}

(18)

\begin{matrix} \mathrm{head}_{i} = \mathrm{SAtt}(Q_{q} W_{i}^{Q_{q}}, K_{q} W_{i}^{K_{q}}, K_{q} W_{i}^{K_{q}}, \delta) \end{matrix}

(19)

where $Q_{q} W_{i}^{Q_{q}}$ and $K_{q} W_{i}^{K_{q}}$ are the sparse question feature matrices of the i-th head, and δ is the question feature selection parameter. The feed-forward layer further transforms the question features, and the final features are as follows:

\begin{matrix} \mathrm{FFN}(F_{1}) = \max(0, F_{1} W_{1} + b_{1}) W_{2} + b_{2} \end{matrix}

(20)

where W_i and b_i denote the weight matrices and bias vectors, respectively.
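The feed-forward transformation in Eq 20 is two affine maps with a ReLU in between. A small worked sketch follows; the function name `ffn` and the toy weights are illustrative assumptions, not values from the paper.

```python
def ffn(F, W1, b1, W2, b2):
    """FFN(F) = max(0, F W1 + b1) W2 + b2, applied to each row of F."""
    def affine(rows, W, b):
        # For every row: dot product with each column of W, plus that column's bias.
        return [[sum(x * w for x, w in zip(row, col)) + bias
                 for col, bias in zip(zip(*W), b)]
                for row in rows]
    hidden = [[max(0.0, h) for h in row] for row in affine(F, W1, b1)]
    return affine(hidden, W2, b2)

# Toy example: 2-d input, 3-d hidden layer, 1-d output.
F = [[1.0, -2.0]]
W1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 0.0]]
b1 = [0.0, 0.0, 0.5]
W2 = [[1.0], [1.0], [1.0]]
b2 = [0.1]
result = ffn(F, W1, b1, W2, b2)
```

Here the pre-activation is [1, −2, −0.5], the ReLU keeps only the first component, and the second affine map adds the bias 0.1.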

The decoder has three sub-layers and is mainly composed of an SA unit and an SGA unit. Specifically, the SGA unit in the first layer takes the sparse question features $Q_{q} = \{q_{1}, q_{2}, \dots, q_{S}\} \in R^{d_{q} \times S}$ and the image features V_v = {v₁, v₂, ⋯, v_k} ∈ R^{k×2048} as input, and the sparse question features guide attention over the image features. The second layer performs self-attention over the image and question features and passes its output to the fully connected layer in the third sub-layer, which outputs the final question and image features <q_i, v_j>. In the experiments, we also tried adding a sparse image self-attention unit before the first layer, but the results were not satisfactory. In addition, we use a linear transformation to keep the dimension of the question features consistent with that of the image features. The specific equations are as follows:

\begin{matrix} F_{2} = \mathrm{MHSAtt}(Q_{q}, K_{v}, K_{v}) \end{matrix}

(21)

\begin{matrix} = \mathrm{Concat}(\mathrm{head}_{1}, \dots, \mathrm{head}_{h}) W^{O} \end{matrix}

(22)

\begin{matrix} \mathrm{head}_{i} = \mathrm{Att}(Q_{q} W_{i}^{Q_{q}}, K_{v} W_{i}^{K_{v}}, K_{v} W_{i}^{K_{v}}) \end{matrix}

(23)

The attended question and image feature F₂ is then passed through the feed-forward layer:

\begin{matrix} \mathrm{FFN}(F_{2}) = \max(0, F_{2} W_{1} + b_{1}) W_{2} + b_{2} \end{matrix}

(24)

where W₁ ∈ R^{512×2048} and W₂ ∈ R^{2048×512} are projection matrices, and b₁ ∈ R^{2048} and b₂ ∈ R^{512} are the corresponding bias vectors.

The image feature V_v and question feature Q_q described above are taken as input. By cascading L MCA layers in depth (denoted MCA⁽¹⁾, …, MCA^(L)), a deep co-attention model is formed that passes the input features from layer to layer and performs deep co-learning. The input features of MCA^(l) are $Q_{q}^{(l - 1)}$ and $V_{v}^{(l - 1)}$, and its output features are passed recursively to MCA^(l+1) as input. The specific equation is as follows:

\begin{matrix} [Q_{q}^{(l)}, V_{v}^{(l)}] = \mathrm{MCA}^{(l)}([Q_{q}^{(l - 1)}, V_{v}^{(l - 1)}]) \end{matrix}

(25)

We set the input features of MCA⁽¹⁾ to $Q_{q}^{(0)} = Q_{q}$ and $V_{v}^{(0)} = V_{v}$, respectively.
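The layer-by-layer recursion can be sketched as a simple loop in which the outputs of one co-attention layer become the inputs of the next. The layers below are toy stand-ins (hypothetical callables on numbers) that only illustrate the data flow, not the attention computation itself.

```python
def deep_cascade(layers, q, v):
    """Apply the co-attention layers in order, feeding each output pair forward."""
    for layer in layers:
        q, v = layer(q, v)
    return q, v

# Toy stand-ins for six co-attention layers (L = 6): each one increments the
# "question feature" and doubles the "image feature" so the flow is easy to follow.
toy_layers = [lambda q, v: (q + 1, v * 2) for _ in range(6)]
q_out, v_out = deep_cascade(toy_layers, 0, 1)
```

Starting from the initial pair (0, 1), the six toy layers produce (6, 64), which shows that every layer consumes exactly the pair produced by its predecessor.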

3.4 Modal fusion and answer predictions

The encoder and decoder output the question feature Q_q and the image feature V_v. After the sparse attention encoder and co-attention learning, these features contain the most significant feature values, which provide the weight information of the words and image regions. Similarly, the spatial relationship feature of visual objects obtained by graph reasoning is denoted as $V_{r}^{*}$. We then design an attentional reduction model with a two-layer MLP (FC(512) − ReLU − Dropout(0.1) − FC(1)) to obtain the attended question feature $\bar{Q}_{q}$ and image feature $\bar{V}_{v}$. Specifically, the image features are passed through the MLP and then the softmax function to compute the attention weights, and the region features are multiplied by these weights and summed to obtain the final image feature. The equations are as follows:

\begin{matrix} \lambda = \mathrm{softmax}(\mathrm{MLP}(V_{v}^{(L)} + V_{r}^{*})) \end{matrix}

(26)

\begin{matrix} {\bar{V}}_{v} = \sum_{j = 1}^{m} \lambda_{j} (V_{j}^{(L)} + V_{r}^{*}) \end{matrix}

(27)

where λ = [λ₁, λ₂, ⋯, λ_m] ∈ R^m are the learned attention weights and L is the number of stacked MCA layers. The softmax function yields the attention weights for the image and the question and normalizes them over all regions. The image and question features are then weighted by these attention weights, and the weighted sums are used as the final visual feature and question feature. A linear multi-modal fusion function combines the final question feature $\bar{Q}_{q}$ and image feature $\bar{V}_{v}$, and the fused feature is expressed as Eq 28:

\begin{matrix} f = \mathrm{LayerNorm}(W_{v}^{T} {\bar{V}}_{v} + W_{q}^{T} {\bar{Q}}_{q}) \end{matrix}

(28)

where f is the fused feature of the question and the image, and W_v and W_q are linear projection matrices. The fused feature f is then passed through the ReLU non-linear activation function, and the sigmoid function is used to classify the answer. During training, we use the binary cross-entropy (BCE) function as the loss function.
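The attentional reduction in Eqs 26 and 27 collapses the m region features into a single vector. The sketch below illustrates the idea under two assumptions that the text does not settle: the relation feature is given per region with the same shape as the image features, and the two-layer MLP is replaced by an arbitrary scoring function `score_fn`. All names are hypothetical.

```python
import math

def attentional_reduction(V, V_r, score_fn):
    """Reduce m region features to one vector with softmax attention weights.

    V, V_r   : lists of m feature vectors of equal dimension (assumed shapes).
    score_fn : stand-in for the two-layer MLP; maps a vector to a scalar score.
    """
    combined = [[a + b for a, b in zip(v, r)] for v, r in zip(V, V_r)]
    scores = [score_fn(x) for x in combined]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    lam = [e / total for e in exps]  # attention weights, they sum to 1
    dim = len(combined[0])
    reduced = [sum(w * x[j] for w, x in zip(lam, combined)) for j in range(dim)]
    return lam, reduced

# Toy example: two regions that receive equal scores.
V = [[1.0, 0.0], [0.0, 1.0]]
V_r = [[0.0, 0.0], [0.0, 0.0]]
weights, pooled = attentional_reduction(V, V_r, score_fn=sum)
```

With equal scores the two regions share the weight evenly, so the pooled feature is the average of the two combined region vectors.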

\begin{matrix} s = \mathrm{sigmoid}(W_{0}\, \mathrm{ReLU}(W_{f} f)) \end{matrix}

(29)

where s contains the scores of the candidate answers, and W_0 and W_f are linear projection matrices. The candidate answer with the highest score is selected as the prediction. Finally, we use BCE as the loss function to train an N-way answer classifier.

\begin{matrix} \mathcal{L} = - \sum_{i = 1}^{N} \left[ \gamma_{i} \log (s_{i}) + (1 - \gamma_{i}) \log (1 - s_{i}) \right] \end{matrix}

(30)

where N is the size of the candidate answer set, s_i is the score predicted by the model for the i-th candidate answer, and γ_i is the soft score of that answer provided in the dataset.
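A minimal sketch of a binary cross-entropy loss over N candidate answers with soft target scores is given below. It is written as the negative log-likelihood that is minimized in practice; that sign convention, the clamping constant `eps`, and the function name `bce_loss` are assumptions for illustration.

```python
import math

def bce_loss(scores, soft_targets, eps=1e-12):
    """Binary cross-entropy summed over candidate answers.

    scores       : predicted sigmoid scores s_i in (0, 1).
    soft_targets : soft answer scores gamma_i in [0, 1].
    """
    total = 0.0
    for s, g in zip(scores, soft_targets):
        s = min(max(s, eps), 1.0 - eps)  # clamp so that log() stays finite
        total += g * math.log(s) + (1.0 - g) * math.log(1.0 - s)
    return -total

# A confident correct score gives a lower loss than a confident wrong one.
good = bce_loss([0.9], [1.0])
bad = bce_loss([0.1], [1.0])
```

Because the targets are soft scores rather than one-hot labels, several candidate answers can contribute to the loss for the same question.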

4. Experiments

All experiments in this paper are run on Ubuntu 18.04 with an NVIDIA TITAN V 12GB GPU, the PyTorch deep learning framework, and CUDA 10.0. Section 4.1 describes the VQA 2.0 dataset [22] and the more recently introduced GQA dataset [23], which we use to evaluate the proposed model. Section 4.2 describes the details of the experimental setup. Section 4.3 discusses the ablation experiments and reports their results and parameter settings. Section 4.4 compares the proposed SRRN model with state-of-the-art results on the two datasets. Finally, we use several successful and failed examples to explain the model visually.

4.1 Dataset

Unlike models that rely on pre-training datasets, models trained end-to-end on VQA-specific datasets are more likely to capture and extract image and text features that help the downstream classification task. However, the datasets used by pre-trained models come from various corpora, and the features learned from different corpora generalize well. Using pre-training datasets can also effectively prevent the biases and language priors of a dataset from interfering with model performance. For a fair comparison of the experimental results, this paper uses the VQA 2.0 and GQA datasets to train the model.

VQA 2.0: The SRRN model is trained, validated, and tested on the VQA 2.0 [22] dataset, which is based on Microsoft COCO images and is currently the most commonly used large-scale dataset for evaluating visual question answering models. It reduces the extent to which a model can exploit dataset bias by balancing the answers to each question. The VQA 2.0 dataset contains 1.1M questions posed by humans and consists of three parts: a training set, a validation set, and a test set. Each valid sample is a triple (image, question, answer). The training set contains 82,783 images and 443,757 corresponding question-answer groups, the validation set contains 40,504 images and 214,354 question-answer groups, and the test set contains 81,434 images and 447,793 question-answer groups. According to the answer category, questions are divided into three types: Yes/No, Number, and Other. We report results on test-dev and test-standard from the VQA evaluation server.

GQA: It consists of 22M questions generated from 113K images. Compared with VQA 2.0, more questions in the GQA dataset require multi-step reasoning, and the answer distribution is balanced. About 94% of the questions require multi-step reasoning, and 51% query the relationships between objects. In addition to the standard accuracy measure, the authors of GQA designed several new measures, including consistency, plausibility, validity, and distribution. For consistency, validity, and plausibility, a higher score is better, whereas for distribution a lower score is better. The model is trained on the balanced training split and the balanced validation split and then evaluated on the test split on the evaluation server.

4.2 Details of the experimental setup

We implement our model with the PyTorch library on a machine with 4 NVIDIA TITAN V 12GB GPUs. We set the hidden dimension of the scaled dot-product attention to d = 512. The number of heads h in the multi-head attention is 8, and the output feature dimension of each head is d/h = 64. Following the suggestion in [28], the number of layers L of the encoder and decoder is set to 6, and the structure of the feed-forward layer is FC(4d) − ReLU − dropout(0.1) − FC(d). The multilayer perceptron used to compute the attended features has the structure FC(d) − ReLU − dropout(0.1) − FC(1), where ReLU is the activation function and dropout is used to prevent overfitting. The number of visual reasoning features is set to N_i = 16. The dimension of the fused feature f is 1024. We use AdamW [65] (β₁ = 0.9, β₂ = 0.999) to train the SRRN model with a batch size of 64 and BCE as the loss function. The warm-up learning rate is min(2.5t × 10⁻⁵, 1 × 10⁻⁴), where t is the current epoch number starting from 1. To prevent gradient explosion, gradient clipping with a threshold of 0.25 is used; to stabilize the output and prevent overfitting, weight normalization and dropout are applied to each linear mapping.
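The warm-up rule described above, read as min(2.5t × 10⁻⁵, 1 × 10⁻⁴) with t the 1-based epoch index, can be sketched as a small helper. The function name `warmup_lr` and its keyword arguments are hypothetical, and the sketch covers only the warm-up rule stated in the text.

```python
def warmup_lr(epoch, step=2.5e-5, cap=1e-4):
    """Learning rate for a 1-based epoch index: min(step * epoch, cap)."""
    if epoch < 1:
        raise ValueError("epoch numbering starts at 1")
    return min(step * epoch, cap)

# Learning rates for the first six epochs.
schedule = [warmup_lr(t) for t in range(1, 7)]
```

The rate grows linearly with the epoch index and stops growing once it reaches the cap, which under this reading happens at the fourth epoch.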

4.3 Ablation study

This section discusses how the optimal parameters are chosen and demonstrates the validity and interpretability of the model. We designed different SRRN variants, trained them on the train+val+vg splits of the VQA 2.0 dataset, and evaluated them on the test set. Visual Genome (vg) is a dataset and knowledge base that connects structured image concepts to language. In Section 4.3.1, we explore the effects of visual object spatial reasoning, the sparse encoder, and the compact self-attention mechanism. In Section 4.3.2, we vary the parameters to verify the effectiveness of the sparse attention encoder.

4.3.1 SRRN variants

Table 1 shows the performance of different variants of the SRRN model. For brevity, “SRRN-r” denotes the object spatial relation reasoning module in Table 1. The input and output dimensions of this module are the same, so the module can be stacked, and we use stacked modules to explore the performance of the model. “SRRN-r1” and “SRRN-r2” stand for stacking one and two layers of the object spatial reasoning module, respectively. Although stacking two layers gives higher “Number” and “Other” accuracy than stacking one layer, the additional layer reduces computational efficiency and increases the number of trainable parameters. To reduce the number of parameters and improve computational efficiency, one stacked layer is used in the subsequent experiments. “SRRN-s” and “SRRN-c” indicate that the model uses the sparse encoder and the compact self-attention mechanism, respectively. The experiments in Table 1 show that adding the sparse attention encoder to the reasoning module improves the model, because the sparse attention encoder better captures the contextual information of the question and focuses on the important image regions when the question guides attention over the image, which improves the semantic reasoning about visual objects. The proposed model must process both the semantic features of the visual objects and the spatial relationship reasoning features. We therefore optimize the model on top of the sparse attention encoder and the reasoning module with a compact self-attention method, which the experimental results show to be effective.
“SRRN-c” denotes applying the compact self-attention optimization directly to the MCAN model. With this variant, the “Number” accuracy improves only slightly, whereas the “Other” accuracy improves clearly. Our model reaches 55.02% on the “Number” type, which indicates that the object spatial relationship reasoning module performs very well on counting questions. We finally adopted the “SRRN-r-s-c” model. The comparison shows that our model, which combines visual semantic features with spatial object relationship features, performs well on the VQA task while remaining a simple and effective visual question answering reasoning model.

Table 1. Performance of different SRRN variant models on VQA 2.0 dataset.

Model	Yes/No	Number	Other	All
MCAN [17]	86.82	53.26	60.72	70.63
SRRN-s	87.05	53.21	60.72	70.71
SRRN-c	87.14	53.56	61.21	71.02
SRRN-r1	86.86	54.17	60.43	70.71
SRRN-r2	86.78	54.57	60.57	70.69
SRRN-r1-s	86.79	54.20	60.51	70.78
SRRN-r2-s	86.84	54.82	60.54	70.73
SRRN-r1-s-c	86.97	55.02	60.80	70.92

We also discuss the training accuracy and convergence speed of the model on the GQA and VQA datasets. Fig 6(a) shows the “SRRN-r-s-c” model trained on the GQA dataset using train + vg, with the accuracy of each epoch measured on the val set. Training on the GQA dataset achieves the best result in the 9th epoch; as training continues, the accuracy gradually decreases and then levels off. The accuracy we compare in Table 5 uses the model from the 9th epoch. Fig 6(b) shows the loss of the “SRRN-r-s-c” model trained on the VQA 2.0 and GQA datasets. The model converges faster on the GQA dataset than on the VQA 2.0 dataset. Because most questions in GQA are reasoning questions, this illustrates the strong spatial relational reasoning performance of our model.

4.3.2 SRRN parameter ablation

In this section, we discuss the selection of the hyperparameter used in the sparse attention encoder. As shown in Fig 7, we verified the influence of different values of δ on the model’s performance. For this ablation, we trained the “SRRN-r-s-c” model from Table 1 on the train + val + vg splits of the VQA 2.0 dataset. Fig 7 shows that the accuracy of “Other” and “All” is best when δ = 3, and the accuracy of “Number” is best when δ = 8. Considering the four indicators together, we set the sparse attention hyperparameter to δ = 8.

Fig 7. (a) Accuracy of “Yes/No” based on different parameters. (b) Accuracy of “Number” based on different parameters. (c) Accuracy of “Other” based on different parameters. (d) Accuracy of “All” based on different parameters.

Since the object semantic reasoning module of the SRRN model stacks L Transformer encoder and decoder layers, we vary the number of layers to verify the performance of the model. The results are shown in Table 2. For this comparison, we train on the train split and evaluate on the val split. Considering computational time and cost, we set L = 6 for training the SRRN model.

Table 2. The performance of different layers of encoder and decoder.

L	Yes/No	Number	Other	All
2	83.88	49.86	57.60	66.46
4	84.68	50.54	58.11	67.10
6	84.72	50.34	57.93	67.02
8	84.52	50.84	57.85	66.95

4.3.3 Number of model parameters comparison

Table 3 compares the parameter counts and accuracies of the SRRN model and VQA pre-trained models. The SRRN model is trained end-to-end on the VQA 2.0 dataset and obtains good results. Experiments show that the spatial object relation reasoning module effectively improves the model. The number of parameters and the complexity of a model have always been a major concern for VQA tasks. Most existing VQA pre-trained models can be fine-tuned on the VQA 2.0 dataset to achieve good results. To illustrate the performance and complexity of the SRRN model more clearly, we compare it with classical pre-trained models. Unified VLP [66], ViLBERT [31], VisualBERT [34], and VLBERT [32] are pre-trained models that use an encoder and decoder. DFAF-BERT [16] and MLI-BERT [67] are end-to-end models that use BERT as a pre-trained model. As shown in Table 3, the “SRRN-r1-s-c” and “SRRN-r2-s” models have 58.19M and 58.98M parameters, respectively, whereas the VQA pre-trained models with reported parameter counts are all larger. Besides, the SRRN model needs only 1 TITAN GPU to complete training.

Table 3. Comparison of pre-trained model parameters and SRRN model on the VQA 2.0 dataset.

Model	Parameters	Test-dev	Test-std
Unified VLP [66]	-	70.50	70.70
ViLBERT [31]	218.9M	70.55	70.92
VisualBERT [34]	85.05M	70.80	70.92
VLBERT [32]	134.8M	71.16	-
DFAF-BERT [16]	173.2M	70.59	70.81
MLI-BERT [67]	120.0M	71.19	71.27
SRRN-r2-s	58.98M	70.73	-
SRRN-r1-s-c	58.19M	70.92	71.18

4.4 Comparison with state-of-the-arts

In Table 4, we compare our SRRN model with state-of-the-art models on the VQA 2.0 dataset. All results are obtained with a single model. The rows of Table 4 are divided into three blocks. The first block summarizes several models that do not use Faster-RCNN features. In the second block, a pre-trained Faster-RCNN is used to detect salient objects and GloVe is used to encode word vectors. Our results are in the last block and use the same pre-trained Faster-RCNN and GloVe. Our model outperforms the existing advanced methods on most indicators, especially on “Number” counting, which highlights the importance of SRRN reasoning in VQA tasks.

Table 4. Performance comparison results on VQA 2.0 dataset.

Model	Test-dev Yes/No	Test-dev Num	Test-dev Other	Test-dev All	Test-std All
Language only [22]	-	-	-	-	44.26
LSTM+CNN [22]	-	-	-	-	54.22
MCB reported in [22]	-	-	-	-	62.27
DCN [15]	83.50	46.60	56.72	66.60	67.00
Bottom-up [10]	81.82	44.21	56.05	65.32	65.67
Bottom-up+MFH [17]	84.27	49.56	59.89	68.76	-
MFH [68]	85.31	49.56	59.89	68.76	-
BAN [14]	85.42	50.93	60.26	69.52	-
BAN-Counter [14]	85.42	54.04	60.52	70.04	70.35
VRR [37]	83.31	45.51	58.41	67.20	67.34
DFAF [16]	86.09	53.32	60.49	70.22	70.34
MuRel [17]	84.77	49.84	57.85	68.03	68.41
ReGAT [20]	86.08	54.42	60.33	70.27	70.58
MCAN [17]	86.82	53.26	60.72	70.63	70.90
ViLBERT [31]	-	-	-	70.55	70.92
VisualBERT [34]	-	-	-	70.80	71.00
SRRN-r-s-c(Ours)	86.97	55.02	60.80	70.92	71.18

Among these state-of-the-art models, Bottom-up [10] and Bottom-up+MFH [17] combine regional visual features with question-guided visual attention, an approach motivated by the biological basis of attention. BAN [14] is a bilinear attention network that models bilinear interactions between the input modalities to make full use of the question and image features. BAN-Counter [14] combines BAN with Counter [14], a neural network component that enables robust counting from object proposals and further improves counting accuracy. MuRel [17] and ReGAT [20] build deep reasoning networks and graph reasoning networks on the relationships between objects and achieve impressive results. Both ViLBERT [31] and VisualBERT [34] extend the BERT architecture to multimodal tasks. The ViLBERT model is pre-trained on the automatically collected large-scale Conceptual Captions dataset through two proxy tasks and then transferred to multiple established vision-and-language tasks, such as visual question answering and visual commonsense reasoning. Building on the MCAN and ReGAT models, we designed a deep co-attention network for visual object spatial relationship reasoning. The spatial object reasoning features constructed by the graph reasoning network are combined with the semantic features of the visual objects constructed by the deep co-attention module. According to the experimental results, our model outperforms the existing state-of-the-art visual question answering reasoning models.

Table 4 compares the proposed SRRN model with current state-of-the-art models on the VQA 2.0 dataset. Our model is superior to previous models with or without visual relational reasoning. The MCAN model is the champion model of the 2019 VQA Challenge, and the SRRN model achieves higher accuracy than MCAN. Specifically, compared with the MCAN model, the accuracy of all four types improves (Yes/No by 0.15, Number by 1.76, Other by 0.08, and All by 0.29 and 0.28 percentage points on test-dev and test-std, respectively). Notably, the “Number” accuracy is 0.98 and 0.60 percentage points higher than that of BAN-Counter [14] and ReGAT [20], respectively. This shows the effectiveness of the proposed spatial relationship reasoning based on graph neural networks.
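The improvements quoted above can be re-derived from the rows of Table 4. The short check below repeats that arithmetic; the numbers are copied from the table, while the dictionary and variable names are illustrative.

```python
# Accuracy values copied from the MCAN and SRRN-r-s-c rows of Table 4.
mcan = {"Yes/No": 86.82, "Number": 53.26, "Other": 60.72,
        "All (test-dev)": 70.63, "All (test-std)": 70.90}
srrn = {"Yes/No": 86.97, "Number": 55.02, "Other": 60.80,
        "All (test-dev)": 70.92, "All (test-std)": 71.18}

# Gain of SRRN over MCAN in percentage points, rounded to two decimals.
gains = {name: round(srrn[name] - mcan[name], 2) for name in mcan}

# "Number" accuracy of SRRN relative to the two counting-oriented baselines.
number_gain = {
    "BAN-Counter": round(srrn["Number"] - 54.04, 2),
    "ReGAT": round(srrn["Number"] - 54.42, 2),
}

for name, delta in gains.items():
    print(f"{name}: +{delta:.2f}")
```

The largest gain is on the “Number” type, which is the category the spatial relationship reasoning module is meant to help.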

Table 5 compares SRRN with state-of-the-art models on the GQA dataset. The first block shows human performance, which can be considered the upper bound of the VQA task. CNN+LSTM [23] uses a linear combination of image and question features to predict the answer. The other models use Faster-RCNN to extract image features. MAC is a milestone model on the CLEVR dataset that decomposes a task into a series of successive reasoning steps. SceneGCN [69] and LGCN [71] use graph neural networks to model the visual object regions and complete the visual question answering task by jointly inferring the semantic and attribute relationships of visual objects. DMFNet [70] uses a multi-graph reasoning and fusion layer with pre-trained semantic relationship embeddings to reason about complex spatial and semantic relationships between visual objects. According to Table 5, the SRRN model achieves higher accuracy than the most advanced reasoning models. Compared with DMFNet [70], Accuracy, Open, Binary, and Consistency increase by 0.54, 0.33, 1.77, and 0.34 percentage points, respectively, but Validity and Plausibility do not reach the best level. We conjecture that adding a graph neural network to the base model is effective for reasoning about spatial position relationships but does not sufficiently address validity and plausibility, which test whether the answer is within the scope of the question.

Table 5. Performance comparison results on GQA dataset.

Model	Accuracy	Open	Binary	Validity	Plausibility	Consistency
Human [23]	89.30	87.40	91.20	98.90	97.20	98.40
CNN+LSTM [23]	46.55	31.80	63.26	96.02	84.25	74.57
Bottom-up [23]	49.74	34.83	66.64	96.18	84.57	78.71
MAC [23]	54.06	38.91	71.23	96.16	84.48	81.59
SceneGCN [69]	54.56	40.63	70.33	95.90	84.23	83.49
BAN [14]	56.19	41.13	73.31	96.77	85.58	84.64
DMFNet [70]	57.05	41.86	73.98	97.62	84.87	86.98
LGCN [71]	56.10	-	-	-	-	-
SRRN-r-s-c(ours)	57.59	42.19	75.75	97.11	85.41	87.32

4.5 Visualization

In Fig 8, we illustrate the effect of our model through visualization results on the VQA 2.0 and GQA datasets. The first group of visualizations shows the model on the VQA 2.0 dataset. For example, the question for the third image in the first column asks for the number of elephants. The VRR and MCAN models count incorrectly, while our model counts correctly. When counting elephants, if the visual objects are not modeled with spatial reasoning, an occluded elephant is easily superimposed on another visual object, which causes a counting error. Because our model integrates the object spatial relationship reasoning module, it can count effectively among complex visual objects. The experiments also show that our model reaches the best level among the compared models on the “Number” indicator. In the first image of the second row, our model answers correctly: because it combines semantic reasoning with spatial object relationship reasoning, it integrates object features better and answers more accurately. However, in the last example in the second row, the answers are wrong because the question requires reasoning and external knowledge. The model lacks an in-depth understanding of such questions and the necessary external knowledge, so it is difficult for it to answer them correctly.

In the GQA visualization examples, most questions require the model to understand and reason. For example, in the third picture in the first row, the model needs to understand the concepts of the two visual objects, the dog and the skateboard, and then infer the relationship between them to get the correct answer. In the first picture in the second row, a spatial relationship reasoning question, the LGCN model answers incorrectly. The model must first find the positional relationship between the table and the plate and then, through the positional and semantic relationships, understand the concept of the “banana” in the plate to answer correctly. In the last example in the second row, our model also answers incorrectly. This question requires a deep understanding of both semantics and space: the model has to model the spatial position relationships of the objects and understand their semantic features, and it cannot yet correctly capture some complex object semantics. Therefore, it is difficult for the model to give a correct answer.

5. Conclusion

The VQA task requires the model to deeply understand both the spatial position relationship features and the semantic features of visual objects. Existing methods generally focus on visual representations or on modeling complex multimodal interactions. This paper investigates the importance of visual object spatial relational features for answering complex reasoning questions correctly. In addition, we propose a sparse self-attention encoder, which effectively captures contextual information while encoding the question and avoids introducing irrelevant information in the modeling process. Finally, we use the compact self-attention (CSA) method to optimize the model, which improves the accuracy and computational efficiency over the initial model. Experiments on the benchmark datasets VQA 2.0 and GQA demonstrate the effectiveness and interpretability of the SRRN model. We hope this work contributes to further research on spatial object relationship modeling for VQA tasks.

The SRRN model improves accuracy on every type of indicator in the VQA task. The experimental results show that adding the spatial object relationship reasoning module significantly improves the “Number” type indicator. The SRRN model focuses solely on the spatial position relationships of objects among the many kinds of visual relationships. However, there are many other interaction relationships between objects, such as behavioral relationships that represent action interactions. Follow-up work will explore more interaction relationships between objects and apply the relational reasoning method proposed in this paper to other visual relationships.

Data Availability

The data underlying the results presented in the study are available from the VQA dataset website (https://visualqa.org/vqa_v2_teaser.html). The code of our models is available at https://github.com/shenxiang-vqa/SRRN.

Funding Statement

This research is supported by the National Natural Science Foundation of China (Grant No. 61873160) https://www.nsfc.gov.cn/. This research is also supported by Scientific Research Fund of Hunan Provincial Education Department (Grant No. 21A0470) http://kxjsc.gov.hnedu.cn/. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1.Antol, Stanislaw, Agrawal, Aishwarya, Lu, Jiasen, et al. Vqa: Visual question answering[C]. International Conference on Computer Vision. 2015: 2425–2433.
2. Malinowski M, Fritz M. A multi-world approach to question answering about real-world scenes based on uncertain input[J]. Advances in neural information processing systems. 2014: 1682–1690. [Google Scholar]
3.Xu K, Ba J, Kiros R, et al. Show, attend and tell: Neural image caption generation with visual attention[C]. International conference on machine learning, 2015: 2048–2057.
4.Xu K, Ba J, Kiros R, et al. Modeling text with graph convolutional network for cross-modal information retrieval[C]. Pacific Rim Conference on Multimedia, 2018: 223–234.
5.Chen H, Ding G, Lin Z, et al. Cross-modal image-text retrieval with semantic consistency[C]. Proceedings of the 27th ACM International Conference on Multimedia.2019: 1749–1757.
6.Zhang Z, Lin Z, Zhao Z, et al. Cross-modal interaction networks for query-based moment retrieval in videos[C]. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019: 655–664.
7.Lee KH, Xi C, Gang H, et al. Stacked Cross Attention for Image-Text Matching[J]. Proceedings of the European Conference on Computer Vision, 2018:201–216.
8. Lu J, Yang J, Batra D, et al. Hierarchical question-image co-attention for visual question answering[J]. Advances in neural information processing systems, 2016, 29: 289–297.
9. Shen X, Han D, Chang CC, et al. Dual Self-Guided Attention with Sparse Question Networks for Visual Question Answering[J]. IEICE TRANSACTIONS on Information and Systems, 2022, 105(4): 785–796. doi: 10.1587/transinf.2021EDP7189
10.Teney D, Anderson P, He X, et al. Tips and tricks for visual question answering: Learnings from the 2017 challenge[C]. Proceedings of the IEEE conference on computer vision and pattern recognition, 2018: 4223–4232.
11.Yang Z, He X, Gao J, et al. Stacked Attention Networks for Image Question Answering[J]. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 21–29.
12.Yu D, Fu J, Tao M, et al. Multi-level Attention Networks for Visual Question Answering[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 4709–4717.
13.Anderson P, He X, Buehler C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]. Proceedings of the IEEE conference on computer vision and pattern recognition,2018: 6077–6086.
14.Kim JH, Jun J, Zhang BT. Bilinear Attention Networks[J]. arXiv preprint arXiv:1805.07932, 2018.
15.Nguyen DK, Okatani T. Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6087–6096.
16.Gao P, Jiang Z, You H, et al. Dynamic fusion with intra-and inter-modality attention flow for visual question answering[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 6639–6648.
17.Yu Z, Yu J, Cui Y, et al. Deep Modular Co-Attention Networks for Visual Question Answering[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 6281–6290.
18.Zhao G, Lin J, Zhang Z, et al. Explicit sparse transformer: Concentrated attention through explicit selection[J]. arXiv preprint arXiv:1912.11637, 2019.
19.Hu H, Gu J, Zhang Z, et al. Relation networks for object detection[C]. Proceedings of the IEEE conference on computer vision and pattern recognition, 2018:3588–3597.
20.Li L, Gan Z, Cheng Y, et al. Relation-aware graph attention network for visual question answering[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019:10313–10322.
21.Meng L. Armour: Generalizable Compact Self-Attention for Vision Transformers[J]. arXiv preprint arXiv:2108.01778, 2021.
22.Goyal Y, Khot T, Summers-Stay D, et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:6904–6913.
23.Hudson DA, Manning CD. Gqa: A new dataset for real-world visual reasoning and compositional question answering[C]. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,2019:6700–6709.
24.Patro B, Namboodiri VP. Differential attention for visual question answering[C]. Proceedings of the IEEE conference on computer vision and pattern recognition, 2018: 7680–7688.
25.Zhu C, Zhao Y, Huang S, et al. Structured attentions for visual question answering[C]. Proceedings of the IEEE International Conference on Computer Vision, 2017:1291–1300.
26.Fan H, Zhou J. Stacked latent attention for multimodal reasoning[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018:1072–1080.
27.Yu Z, Yu J, Fan J, et al. Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering[C]. Proceedings of the IEEE international conference on computer vision. 2017:1821–1830.
28.Kim JH, On KW, Lim W, et al. Hadamard Product for Low-rank Bilinear Pooling[J]. arXiv preprint arXiv:1610.04325, 2016.
29.Ben-Younes H, Cadene R, Cord M, et al. MUTAN: Multimodal Tucker Fusion for Visual Question Answering[J]. Proceedings of the IEEE conference on computer vision and pattern recognition, 2017:1–9.
30. Yu Z, Yu J, Xiang C, et al. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering[J]. IEEE transactions on neural networks and learning systems, 2018, 29(12):5947–5959. doi: 10.1109/TNNLS.2018.2817340
31. Lu J, Batra D, Parikh D, Lee S. Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 2019, 32.
32.Su W, Zhu X, Cao Y, et al. Vl-bert: Pre-training of generic visual-linguistic representations[J]. arXiv preprint arXiv:1908.08530, 2019.
33.Tan H, Bansal M. Lxmert: Learning cross-modality encoder representations from transformers[J]. arXiv preprint arXiv:1908.07490, 2019.
34.Li LH, Yatskar M, Yin D, et al. Visualbert: A simple and performant baseline for vision and language[J]. arXiv preprint arXiv:1908.03557, 2019.
35.Teney D, Liu L, Van Den Hengel A. Graph-structured representations for visual question answering[C]. Proceedings of the IEEE conference on computer vision and pattern recognition, 2017:1–9.
36.Santoro A, Raposo D, Barrett DG, et al. A simple neural network module for relational reasoning[J]. arXiv preprint arXiv:1706.01427, 2017.
37. Zhang W, Yu J, Hu H, et al. Multimodal feature fusion by relational reasoning and attention for visual question answering[J]. Information Fusion, 2020, 55:116–126. doi: 10.1016/j.inffus.2019.08.009
38.Chen S, Jin Q, Wang P, et al. Say as you wish: Fine-grained control of image caption generation with abstract scene graphs[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020:9962–9971.
39.Yang X, Tang K, Zhang H, et al. Auto-encoding scene graphs for image captioning[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019:10685–10694.
40.Zhang Y, Hare J, Prügel-Bennett A. Learning to count objects in natural images for visual question answering[J]. arXiv preprint arXiv:1802.05766, 2018.
41.Trott A, Xiong C, Socher R. Interpretable counting for visual question answering[J]. arXiv preprint arXiv:1712.08697, 2017.
42.Wang P, Wu Q, Shen C, et al. The vqa-machine: Learning how to use existing vision algorithms to answer new questions[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,2017:1173–1182.
43.Bahdanau D, Cho K, Bengio Y. Neural Machine Translation by Jointly Learning to Align and Translate[J]. arXiv preprint arXiv:1409.0473, 2014.
44.Zhu Y, Groth O, Bernstein M, et al. Visual7w: Grounded question answering in images[C]. Proceedings of the IEEE conference on computer vision and pattern recognition, 2016:4995–5004.
45.Lu P, Li H, Zhang W, et al. Co-attending free-form regions and detections with multi-modal multiplicative feature embedding for visual question answering[C]. Proceedings of the AAAI Conference on Artificial Intelligence. 2018, 32(1).
46. Choi MJ, Torralba A, Willsky AS. A tree-based context model for object recognition[J]. IEEE transactions on pattern analysis and machine intelligence, 2011, 34(2): 240–252. doi: 10.1109/TPAMI.2011.119
47. Felzenszwalb PF, Girshick RB, Mcallester D, et al. Object detection with discriminatively trained part-based models[J]. IEEE transactions on pattern analysis and machine intelligence, 2009,32(9):1627–1645. doi: 10.1109/TPAMI.2009.167
48.Divvala SK, Hoiem D, Hays JH, et al. An empirical study of context in object detection[C]. 2009 IEEE Conference on computer vision and Pattern Recognition,2009: 1271–1278.
49.Galleguillos C, Rabinovich A, Belongie S. Object categorization using co-occurrence, location and appearance[C]. 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008:1–8.
50. Gould S, Rodgers J, Cohen D, et al. Multi-class segmentation with relative location prior[J]. International journal of computer vision, 2008, 80(3): 300–316. doi: 10.1007/s11263-008-0140-x
51.Yao T, Pan Y, Li Y, et al. Exploring visual relationship for image captioning[C]. Proceedings of the European conference on computer vision (ECCV),2018: 684–699.
52.Fang H, Gupta S, Iandola F, et al. From captions to visual concepts and back[C]. Proceedings of the IEEE conference on computer vision and pattern recognition,2015: 1473–1482.
53.Johnson J, Krishna R, Stark M, et al. Image retrieval using scene graphs[C]. Proceedings of the IEEE conference on computer vision and pattern recognition,2015: 3668–3678.
54.Schuster S, Krishna R, Chang A, et al. Generating semantically precise scene graphs from textual descriptions for improved image retrieval[C]. Proceedings of the fourth workshop on vision and language,2015: 70–80.
55.Farhadi A, Sadeghi A. Recognition using visual phrases[C]. Computer Vision and Pattern Recognition (CVPR),2011.
56.Ramanathan V, Li C, Deng J, et al. Learning semantic relationships for better action retrieval in images[C]. Proceedings of the IEEE conference on computer vision and pattern recognition, 2015:1100–1109.
57.Lu C, Krishna R, Bernstein M, et al. Visual relationship detection with language priors[C]. European conference on computer vision, 2016:852–869.
58.Zhang H, Kyaw Z, Chang S-F, et al. Visual translation embedding network for visual relation detection[C]. Proceedings of the IEEE conference on computer vision and pattern recognition,2017: 5532–5540.
59.Yang Z, Yu J, Yang C, et al. Multi-modal learning with prior visual relation reasoning[J]. arXiv preprint arXiv:1812.09681, 2018, 3(7).
60.Cadene R, Ben-Younes H, Cord M, et al. Murel: Multimodal relational reasoning for visual question answering[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,2019: 1989–1998.
61. Yu J, Zhang W, Lu Y, et al. Reasoning on the relation: Enhancing visual representation for visual question answering and cross-modal retrieval[J]. IEEE Transactions on Multimedia, 2020, 22(12): 3196–3209. doi: 10.1109/TMM.2020.2972830
62.Pennington J, Socher R, Manning CD. Glove: Global vectors for word representation[C]. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP),2014: 1532–1543.
63. Ren S, He K, Girshick R, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks[J]. IEEE transactions on pattern analysis and machine intelligence, 2016, 39(6): 1137–1149. doi: 10.1109/TPAMI.2016.2577031
64. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in neural information processing systems. 2017: 5998–6008.
65.Ren S, He K, Girshick R, et al. Multimodal encoder-decoder attention networks for visual question answering[J]. IEEE Access, 2020, 8: 35662–35671.
66.Zhou L, Palangi H, Zhang L, Hu H, Corso J, Gao J. Unified vision-language pre-training for image captioning and vqa. Proceedings of the AAAI Conference on Artificial Intelligence, 2020,34(7):13041–13049.
67.Gao P, You H, Zhang Z, Wang X, Li H. Multi-modality latent interaction network for visual question answering. Proceedings of the IEEE/CVF international conference on computer vision, 2019:5825–5835.
68.Zhou B, Tian Y, Sukhbaatar S, et al. Simple Baseline for Visual Question Answering[J]. arXiv preprint arXiv:1512.02167, 2015.
69.Yang Z, Qin Z, Yu J, et al. Scene graph reasoning with prior visual relationship for visual question answering[J]. arXiv preprint arXiv:1812.09681, 2018.
70. Zhang W, Yu J, Zhao W, et al. DMRFNet: Deep Multimodal Reasoning and Fusion for Visual Question Answering and explanation generation[J]. Information Fusion, 2021, 72: 70–79. doi: 10.1016/j.inffus.2021.02.006
71.Hu R, Rohrbach A, Darrell T, et al. Language-conditioned graph networks for relational reasoning[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision,2019: 10294–10303.

PLoS One. doi: 10.1371/journal.pone.0277693.r001

Decision Letter 0

Sriparna Saha

26 Apr 2022

PONE-D-21-39541

An effective spatial relational reasoning networks for visual question answering

PLOS ONE

Dear Dr. Han,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

==============================

ACADEMIC EDITOR: Please revise the paper based on the comments of the reviewers.

==============================

Please submit your revised manuscript by Jun 10 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Sriparna Saha, PhD

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“This research is supported by the National Natural Science Foundation of China under Grant 61873160, Grant 61672338, and the Natural Science Foundation of Shanghai under Grant 21ZR1426500.

This research is also supported by the Hunan Provincial Natural Science Foundation (Grant No. 2020JJ4557).”

We note that you have provided additional information within the Acknowledgements Section that is not currently declared in your Funding Statement. Please note that funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“XS and DH are supported by the National Natural Science Foundation of China under Grant 61873160, Grant 61672338, and the Natural Science Foundation of Shanghai under Grant 21ZR1426500. https://isisn.nsfc.gov.cn/egrantindex/funcindex/prjsearch-list.

GL is supported by the Hunan Provincial Natural Science Foundation (Grant No. 2020JJ4557). http://kjt.hunan.gov.cn/kjt/zxgz/zkjj/xmgljcx/202006/t20200605_12266401.html”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

4. Thank you for stating the following financial disclosure:

“GL is supported by the Hunan Provincial Natural Science Foundation (Grant No. 2020JJ4557). http://kjt.hunan.gov.cn/kjt/zxgz/zkjj/xmgljcx/202006/t20200605_12266401.html”

Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

5. We note that Figure 1 in your submission contains copyrighted images. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

a. You may seek permission from the original copyright holder of Figure(s) [#] to publish the content specifically under the CC BY 4.0 license.

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

6. We note that Figures 2 and 8 include an image of a [patient / participant / in the study].

As per the PLOS ONE policy (http://journals.plos.org/plosone/s/submission-guidelines#loc-human-subjects-research) on papers that include identifying, or potentially identifying, information, the individual(s) or parent(s)/guardian(s) must be informed of the terms of the PLOS open-access (CC-BY) license and provide specific permission for publication of these details under the terms of this license. Please download the Consent Form for Publication in a PLOS Journal (http://journals.plos.org/plosone/s/file?id=8ce6/plos-consent-form-english.pdf). The signed consent form should not be submitted with the manuscript, but should be securely filed in the individual's case notes. Please amend the methods section and ethics statement of the manuscript to explicitly state that the patient/participant has provided consent for publication: “The individual in this manuscript has given written informed consent (as outlined in PLOS consent form) to publish these case details”.

If you are unable to obtain consent from the subject of the photograph, you will need to remove the figure and any other textual identifying information or case descriptions for this individual.

Additional Editor Comments (if provided):

The paper is based on the traditional approach to VQA with 71% accuracy. The recent approaches (pre-trained Transformer) have achieved performance of ~81% accuracy. The authors need to discuss those approaches and how (in terms of computational complexity, interoperability, low resource scenarios, etc. ) and when the proposed method will be useful.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This work introduces a spatial and semantic reasoning network to learn the spatial position relationship and object attribute relationship of visual objects in VQA tasks. The authors show that their approach outperforms the existing attention-based and fusion-based methods on VQA 2.0 and GQA. However, there is no discussion of ongoing pre-trained Transformer-based approaches. The pre-trained networks have pushed the boundaries of VQA performance. A detailed discussion of pre-trained networks and of why one should use the proposed method (whose overall accuracy is roughly 10% lower than that of pre-trained approaches) is required for the reader to understand the usefulness of the approach.

Strengths:

The results show their approach improves the performance on the VQA 2.0 and GQA datasets compared to the non-pretrained network.

A detailed experiment and analysis are provided.

Weakness:

The paper proposed exciting ideas, but they can be presented in a much better way.

Related work on pre-trained network-based approaches and comparisons are missing.

Questions:

With n words in the question, the index in Eq. (3) must be consistent; it cannot be n+1. Please rewrite the equation.

It is written (line 243-245) “Therefore, we use the question-adaptive attention mechanism to extract the semantic information of the question when designing the visual object spatial relationship graph reasoning network.” However, the question information is never incorporated to generate the reasoning feature v* of spatial object relation. Please clarify.

Function M in Eq. 10 is independent of the δ. Please re-write it. Authors should be consistent with the notation. The variable Q denotes multiple things (question and key) in the paper.

Eq. 11 and 12 are the same, and the description around them is hard to follow. If the matrix W has c rows (ref. Line 309), then how can the vector t have t rows (again, please use another notation), as written in “We connect the thresholds of each row to form a vector t = [t1, t2, . . . , tt ]”?

The function SAtt(;) is not defined. Please define and refer to the appropriate equation/section.

Can you provide the appropriate reference to your statement: “Since Q and K vary linearly from the same input, the weight matrices Wk and WQ are entangled in the gradient backpropagation, a basic redundancy in the conventional self-attention mechanism”

What do the authors mean by this statement:

“This paper proves that SRRN is indeed a virtual visual object spatial relationship network.”

The contributions statement needs to be re-written. It is hard for the reader to understand.

What does this sentence refer to: “The ablation experiment analyzes the hypernatremia effect of the SSRN model.”

What is vg in Section 4.3?

In Table 1, the performance of SRRN-c is better than that of SRRN-s. Is the proposed sparse attention mechanism not better than the traditional one?

Each model variant can be better presented in the itemized format.

Typos/grammar:

“We input the word embedding sequence of size n × 300 to get the question, where n × 300 is the number of words contained in the question.”

I think n × 300 should be changed to n.

“Inattention will lead to the failure of relevant information extraction.”

Duplicate sentence line 348-351

Reviewer #2: Good

1. The research background is clearly articulated.

2. Distinct progress is shown in the paper.

To be improved

1. Results and Dataset are not available in the specified paths. (https://eval.ai/web/challenges/challengepage/830/submission)

2. The architecture & training of Graph Neural Network can be elaborated.

Minor correction

1. Lines 348~351 are repeated.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Nov 28;17(11):e0277693. doi: 10.1371/journal.pone.0277693.r002

Author response to Decision Letter 0

26 May 2022

Dear Editor,

Thank you very much for taking the time to review this manuscript. We really appreciate all your generous comments and suggestions, which have been very helpful in making this revision much clearer and more compelling. We have followed the comments and suggestions and revised the manuscript.

Thanks again for your patience and guidance!

Best regards,

Xiang Shen, Dezhi Han, Chongqing Chen, Gaofeng Luo, Zhongdai Wu

Response to Editor

Additional Editor Comments (if provided):

Response editor:

First of all, thank you very much for your comment. Pre-training is widely used in vision and language tasks and has achieved surprising results across them. Undoubtedly, pre-training can effectively facilitate alignment between different modalities, and reasonable fine-tuning on downstream tasks can achieve great results. Pre-trained models have reached about 81% accuracy on the VQA task.

Response:

Although pre-trained models achieve good results on specific VQA tasks, many researchers still focus on developing end-to-end models. Public VQA benchmark datasets include VQA 1.0, VQA 2.0, GQA, VQA-E, CLEVR, and others, and researchers usually train on these benchmarks to verify the effectiveness of a proposed model. We adopt end-to-end development because it lets us design and change the network structure flexibly without having to account for how dataset bias or other factors affect the model. Hardware is a further consideration: pre-training a model for a specific VQA task requires large amounts of data and better hardware resources. Our current hardware resources are insufficient to support pre-training, which also takes a long time, and the resulting model must still be fine-tuned on the VQA task to achieve good results. If we instead took an existing pre-trained model and fine-tuned it directly on a specific VQA task, the result could still suffer from dataset bias and the differences between tasks, and without an excellent fine-tuning strategy the final result could be unsatisfactory. Finally, because we changed the network structure, we cannot rigorously demonstrate the effectiveness of our model without training it on a specific dataset.

Of course, we are also considering using pre-trained models in future work. Once our end-to-end model is shown to work well, applying pre-training followed by fine-tuning should further advance the VQA task and yield better results. Recent work also includes pre-trained models whose network structures are similar to the one in this paper, such as ViLT-BERT, VisualBERT, and ViLBERT; the accuracies of ViBERT, VisualBERT, and ViLBERT are 70.34%, 70.80%, and 70.55%, respectively. The SRRN model outperforms these models.

Based on your comments, to allow readers to understand our research work better, we have added pre-trained models and explained them in the related work section and model comparison. Related pre-trained models have been discussed in the related work section (Section 2.1: lines 134-159) and the model comparison section (Section 4.4: lines 611-615), with modifications marked in blue.

Response to Reviewer #1

Dear Reviewer#1:

Thank you for reviewing our paper entitled “An effective spatial relational reasoning networks for visual question answering” (ID: PONE-D-21-39541). We would also like to thank you for your good comments and suggestions on it. We have studied each of your comments carefully and made corresponding revisions at your suggestion, which are highlighted in blue in the paper. All of your questions are answered one by one.

Response：

First of all, thank you very much for your comment. As you said, “The pre-trained networks have pushed the boundaries of VQA performance.” In the past few years, the emergence of pre-trained models has brought uni-modal fields such as computer vision and natural language processing into a new era. Substantial work has shown that they benefit downstream uni-modal tasks and avoid training a new model from scratch. Pre-training is an important technique in both computer vision and natural language processing, and its models can be fine-tuned for different downstream tasks. Some representative methods, including ViLBERT, VLBERT, LXMERT, UNITER, OSCAR, VisualBERT, etc., use the bidirectional encoder representations from Transformers (BERT) structure and have achieved effective results in VQA tasks.

Why do we consider using an end-to-end network structure?

First, to demonstrate the SRRN model's effectiveness, it is convenient to use a traditional end-to-end model for comparison with the baseline model MCAN. The end-to-end model can be trained and tested on specific datasets (VQA 2.0 and GQA).

Second, for a specific VQA task, we train on a benchmark dataset for that task, which illustrates the rationality and effectiveness of the model more directly. Most of the vision and language approaches that use pre-training today include multiple tasks, such as masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. Fine-tuning for downstream tasks requires meeting the demands of multiple tasks, and enabling parameter sharing is also a challenge.

Finally, although pre-trained models can achieve outstanding results on some tasks, there are certain disadvantages to using pre-trained models. For example, the pre-training model is large, the parameters are many, the flexibility of the model structure is poor, it is difficult to change the network structure, the calculation amount is large, and the application scenarios are limited.

Based on your comments, to allow readers to understand our research work better, we have added pre-trained models and explained them in the related work section and model comparison. Related pre-trained models have been discussed in the related work section (Section 2.1: lines 134-159) and the model comparison section (Section 4.4: lines 611-615), with modifications marked in blue.

Strengths:

The results show their approach improves the performance on the VQA 2.0 and GQA datasets compared to the non-pretrained network.

A detailed experiment and analysis are provided.

Weakness:

The paper proposed exciting ideas, but they can be presented in a much better way.

Related work on pre-trained network-based approaches and comparisons are missing.

Response: First of all, thank you very much for your affirmation of our research work and for pointing out the problems in our manuscript. In response to these issues, we have carefully revised and improved the manuscript according to your comments, which have helped us present our research work to readers in a better way.

Questions:

Q1. With n words in the question, Eq. (3) must obey the index. It can not be n+1. Please re-write the equation.

Response Q1: Thank you very much for pointing out the problems in our manuscript. Based on your comments, we re-checked the manuscript. To give readers a better understanding of the formulas in the manuscript, we use S to represent the maximum number of words so as not to conflict with n in Equation 3. Modifications in the manuscript are marked in blue in lines 264 to 269. Modified equations and symbols are marked in yellow.
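To illustrate the distinction between the actual word count n and the maximum length S, the following minimal sketch pads or truncates a tokenized question to S positions. The value S = 14, the whitespace tokenizer, and the `<pad>` token are assumptions made for illustration and are not taken from the manuscript.

```python
from typing import List

PAD = "<pad>"  # hypothetical padding token, not from the manuscript


def pad_question(question: str, max_len: int = 14) -> List[str]:
    """Tokenize on whitespace, then truncate or pad to exactly max_len tokens."""
    tokens = question.lower().rstrip("?").split()
    tokens = tokens[:max_len]
    return tokens + [PAD] * (max_len - len(tokens))


tokens = pad_question("What is to the left of the dog?")
n = sum(1 for t in tokens if t != PAD)  # actual number of words in the question
S = len(tokens)                         # fixed maximum length after padding
```

With this convention, indices over question words run up to n, while tensor shapes are defined by S, so the two symbols never collide.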

Q2. It is written (line 243-245) “Therefore, we use the question-adaptive attention mechanism to extract the semantic information of the question when designing the visual object spatial relationship graph reasoning network.” However, the question-information is never incorporated to generate the reasoning feature v* of spatial object relation. Please clarify.

Response Q2: First of all, thank you very much for your comments. Section 3.2 mainly describes the visual object spatial relation inference module. What we originally meant to express is that combining the visual object spatial relationship module with the visual semantic reasoning module yields image features with both spatial relationships and object semantic attributes. The question information is not used in the spatial object reasoning module. Only in the semantic relation reasoning module is semantic information used to guide attention to the vital information in the image. The sparse attention mechanism in the visual object semantic reasoning module can adaptively obtain critical visual semantic features according to the question information. To avoid misunderstanding and ambiguity, following your suggestion, we have removed the description of visual object semantic reasoning from Section 3.2 (lines 292-294).

Q3. Function M in Eq. 10 is independent of the δ. Please re-write it. Authors should be consistent with the notation. The variable Q denotes multiple things (question and key) in the paper.

Response Q3: First of all, thank you very much for your comment. To allow readers to better understand the meaning of Equation (10), we have reworked the equation in the manuscript. Specifically, we rewrote the formula in Eq. 10, deleted δ, and redefined the vector-matrix A. Modifications in the manuscript are marked in yellow in lines 369 to 372.

As you noted, Q appears several times in the manuscript because Q, K, and V are used in both the encoder and decoder. To allow readers to understand clearly what Q means in each section, we use different subscripts to indicate different meanings. In the revised manuscript, Q used in the encoder is changed to QE, and V and K are represented in the same way. Q, K, and V also appear in Section 3.3.2. To illustrate the compact self-attention principle, we still use the original dot-product self-attention formulation there because the compact self-attention (CSA) mechanism is also used in the encoder and decoder. Modifications in the manuscript are marked in yellow in lines 352-350 and lines 380-387. In addition, the question feature vector is represented by Qq.

Q4. Eq. 11 and 12 are the same, and the description around them are hard to follow. If the matrix W has c row (ref. Line 309), then how come the vector t has t (again, please use another notation) row as written in “We connect the thresholds of each row to form a vector t = [t1, t2, . . . , tt ]”?

Response Q4: First of all, thank you very much for your comments. We are very sorry for the inconvenience caused to your review by the repeated formulas, which resulted from our carelessness in typesetting. Based on your comments, we have revised the repeated Equation (12) and added its definition and interpretation. To make the formulas in the manuscript concise and easy to understand, we have also rewritten the formulas that shared the same symbols and now denote the vector-matrix originally represented by t with A. We have also modified Fig 4 (lines 369-370).
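The dimension question raised in Q4 can be made concrete with a small sketch of row-wise thresholding in sparse attention. It assumes, as one common choice, that each row's threshold is its k-th largest score and that entries below the threshold are masked before the softmax; the manuscript's exact rule is not reproduced here, and the function names are hypothetical.

```python
import math
from typing import List


def row_thresholds(scores: List[List[float]], k: int) -> List[float]:
    """Return one threshold per row: the row's k-th largest score (assumed rule, k >= 1)."""
    return [sorted(row, reverse=True)[min(k, len(row)) - 1] for row in scores]


def sparse_softmax(scores: List[List[float]], k: int) -> List[List[float]]:
    """Row-wise softmax over entries at or above the row threshold; others get weight 0."""
    result = []
    for row, t in zip(scores, row_thresholds(scores, k)):
        masked = [s if s >= t else -math.inf for s in row]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # math.exp(-inf) is 0.0
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


scores = [[2.0, 1.0, 0.1], [0.5, 0.2, 3.0]]
thresholds = row_thresholds(scores, k=2)  # one entry per row of the score matrix
weights = sparse_softmax(scores, k=2)
```

Under this reading, a score matrix with c rows always yields a threshold vector with c entries, which is the consistency the reviewer asks for.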

Q5. The function SAtt(;) is not defined. Please define and refer to the appropriate equation/section. Can you provide the appropriate reference for your statement: “Since Q and K vary linearly from the same input, the weight matrices Wk and WQ are entangled in the gradient backpropagation, a basic redundancy in the conventional self-attention mechanism”?

Response Q5: Thank you very much for your comment. Based on your comment, we have redefined the formula SAtt(;) in line 389. In addition, our proposed compact self-attention (CSA) mechanism is inspired by reference [21], and the reference cited in line 398 of the manuscript explains the CSA mechanism. We found this simple but very effective method for the VQA task through many experiments. We therefore use it in this paper, which may also inspire readers to use the compact self-attention mechanism to improve model accuracy in similar VQA tasks.
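For context on the statement quoted in Q5, the redundancy can be seen directly from the standard scaled dot-product formulation. The derivation below is generic and uses assumed symbols (input X, projection matrices W_Q, W_K, W_V, key dimension d_k); it is not the manuscript's CSA equation.

```latex
\begin{aligned}
Q &= X W_Q, \qquad K = X W_K, \qquad V = X W_V, \\
\mathrm{Att}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \\
Q K^{\top} &= X \left(W_Q W_K^{\top}\right) X^{\top}.
\end{aligned}
```

Only the product of W_Q and the transpose of W_K enters the attention scores, so the two matrices cannot be identified separately from the scores alone. This is one way to read the redundancy that a compact formulation aims to remove.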

Q6. What do the authors mean by this statement: line 628

“This paper proves that SRRN is indeed a virtual visual object spatial relationship network.”

Response Q6: Thank you very much for pointing out that some sentences in our manuscript are difficult for readers to understand. Following your comment, we rechecked the manuscript and rewrote the Conclusion section. The rewritten parts of the Conclusion are marked in blue.

Q7. The contributions statement needs to be re-written. It is hard for the reader to understand. What does this sentence refer to: “The ablation experiment analyzes the hypernatremia effect of the SSRN model.”

Response Q7: Thank you very much for your comments. So that readers can better read and understand our research work, we have rewritten the contributions of this paper. Modifications in the manuscript are marked in blue in lines 104 to 118.

Q8. What is vg in Section 4.3?

Response Q8: Visual Genome (VG) is a dataset and knowledge base, an ongoing effort to connect structured image concepts to language. We have explained the VG dataset in lines 528 and 529 of the manuscript. Three modes can be used during the experiment: --SPLIT={'train', 'train+val', 'train+val+vg'} selects which training datasets are combined. The default training split is 'train+val+vg'. Setting --SPLIT='train' will trigger the evaluation script to compute the validation score automatically after every epoch. The download address of the VG dataset: https://pan.baidu.com/s/1QCOtSxJGQA01DnhUg7FFtQ#list/path=%2F
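The split option described in this response can be read as a '+'-joined list of dataset parts. The sketch below is a hypothetical parser illustrating that reading; the function names and the rule that per-epoch validation runs only for 'train' are inferred from the description above and are not the authors' code.

```python
from typing import List

ALLOWED_SPLITS = ("train", "train+val", "train+val+vg")
DEFAULT_SPLIT = "train+val+vg"


def parse_split(split: str = DEFAULT_SPLIT) -> List[str]:
    """Validate the split string and return the dataset parts to combine."""
    if split not in ALLOWED_SPLITS:
        raise ValueError(f"unsupported split {split!r}; expected one of {ALLOWED_SPLITS}")
    return split.split("+")


def evaluates_each_epoch(split: str) -> bool:
    """Per the description, only 'train' leaves the validation set free for scoring."""
    return parse_split(split) == ["train"]
```

For example, `parse_split()` yields the three parts train, val, and vg, while `evaluates_each_epoch("train")` is true because the validation set is not consumed by training.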

Q9. In Table 1, the performance of SRRN-c is better than SRRN-s. Is the proposed sparse attention mechanism not better than the traditional one?

Response Q9: First of all, thank you very much for your comments. In Table 1, "SRRN-c" is the experimental result when we use only the compact self-attention mechanism, without the sparse self-attention encoder and the spatial position relationship reasoning module. The experimental results show that although the compact self-attention mechanism is a simple idea, it improves the model's performance well. Our baseline model is MCAN, and Table 1 shows that using a sparse attention mechanism in the encoder (SRRN-s) performs better than the MCAN model, indicating that our proposed sparse attention mechanism is effective.

Additionally, the main goal of our research is to explore the use of suitable parameters in sparse self-attention combined with the visual object spatial relational reasoning module. As shown in Table 1, when only the compact self-attention mechanism is used, the change in the "Number" indicator is not apparent, although the overall effect is good. However, our proposed visual object spatial relationship reasoning aims to study the spatial position relationships between objects, so an improvement in the "Number" indicator is needed to demonstrate the method's effectiveness. Therefore, we combine the sparse attention mechanism and the visual object spatial relation inference module, and the experimental results demonstrate the effectiveness of the two methods. Finally, the model can be further optimized if the compact self-attention mechanism is added. The Number indicator of "SRRN-r1-s-c" in Table 1 has reached 55.02%, which surpasses some models that focus on counting (Number), such as ReGAT [20] and BAN-Counter [15].

Q10. Each model variant can be better presented in the itemized format.

Response Q10: Thank you very much for your comments. Based on them, we have rearranged and broken down the results in Table 1.

Q11. “We input the word embedding sequence of size n × 300 to get the question, where n × 300 is the number of words contained in the question.” I think n × 300 should be changed to n.

Response Q11: First of all, thank you very much for checking the manuscript so carefully for us. We are sorry for the errors in the manuscript caused by our mistakes. Following your suggestion, and so that readers can clearly understand the steps of our research, we have reworked the content, marked in yellow (lines 264-269 marked in blue).

Q11. “Inattention will lead to the failure of relevant information extraction.” Duplicate sentence line 348-351.

Response Q11: Thank you very much for pointing out the problems in our manuscript. Following your hints, we rechecked the manuscript; repeated sentences have now been removed, and ambiguous sentences have been rewritten. The specific revisions have been marked in blue (lines 342-343) in the manuscript. The repeated sentences on lines 348-351 have been corrected.

Response to Reviewer #2

Reviewer #2: Good

1. The research background is clearly articulated.

2. Distinct progress is shown at the paper.

Dear Reviewer#2:

First of all, thank you very much for your affirmation and encouragement of our research work, and thank you for reviewing our paper entitled “An effective spatial relational reasoning networks for visual question answering” (ID: PONE-D-21-39541). We are very sorry for the inconvenience caused to your review by some defects in our article. We re-checked the manuscript and ensured that all data and formulas are correct. Again, thank you very much for your affirmation of our work.

Q1. Results and Dataset are not available in the specified paths. (https://eval.ai/web/challenges/challengepage/830/submission)

Response Q1: Thank you very much for your comments. To ensure the accuracy of our experimental data, we have uploaded our experimental code and some test models to the Baidu network disk. The VQA 2.0 dataset evaluation website: https://eval.ai/auth/login Username: shenxiang Password: 123456 VQA 2.0 dataset website: https://pan.baidu.com/s/1C7jIWgM3hFPv-YXJexItgw#list/path=%2F

The data underlying the results presented in the study are available from (include the name of the third party https://visualqa.org/vqa_v2_teaser.html).

Q2. The architecture & training of Graph Neural Network can be elaborated.

Response Q2: Thank you very much for your comment. In this paper, we use the principles of graph neural networks in our spatial object-relational reasoning network. Specifically, we treat the different visual objects in the image as the nodes of the graph given to the network as input, and the relationships between different objects as its edges. Section 3.2 of the manuscript explains how to calculate the relation weights between different objects and finally obtain visual features that carry spatial relation positions.
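The node-and-edge description above can be sketched as a tiny attention step over detected objects. This is an illustrative sketch only: the scoring rule (feature similarity minus bounding-box centre distance) and all names are hypothetical stand-ins, and the actual relation weights are those defined in Section 3.2 of the manuscript.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def center(box: Box) -> Tuple[float, float]:
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def relation_weights(features: List[List[float]], boxes: List[Box]) -> List[List[float]]:
    """Softmax-normalized edge weights between every pair of object nodes."""
    weights = []
    for i, fi in enumerate(features):
        ci = center(boxes[i])
        scores = []
        for j, fj in enumerate(features):
            cj = center(boxes[j])
            similarity = sum(a * b for a, b in zip(fi, fj))
            distance = math.hypot(ci[0] - cj[0], ci[1] - cj[1])
            scores.append(similarity - distance)  # hypothetical scoring rule
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def aggregate(features: List[List[float]], weights: List[List[float]]) -> List[List[float]]:
    """Update each node as the edge-weighted sum of all node features."""
    dim = len(features[0])
    return [[sum(w[j] * features[j][d] for j in range(len(features))) for d in range(dim)]
            for w in weights]


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]           # one feature vector per object
boxes = [(0.0, 0.0, 1.0, 1.0), (0.5, 0.0, 1.5, 1.0), (5.0, 5.0, 6.0, 6.0)]
edge_weights = relation_weights(features, boxes)
updated = aggregate(features, edge_weights)
```

In this toy setup a nearby object receives a larger edge weight than a distant one, so the updated node features carry information about spatial neighbours.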

Q3. Lines 348~351 are repeated.

Response Q3: Thank you very much for your careful examination of our manuscript. As you requested, we have removed the duplicate sentences.

Attachment

Submitted filename: Response to Reviewers.docx


PLoS One. doi: 10.1371/journal.pone.0277693.r003

Decision Letter 1

Sriparna Saha

12 Sep 2022

PONE-D-21-39541R1

An effective spatial relational reasoning networks for visual question answering

PLOS ONE

Dear Dr. Han,

Please submit your revised manuscript by Oct 27 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Sriparna Saha, PhD

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments (if provided):

One of the reviewers has suggested some minor changes for the paper. The authors are requested to incorporate these changes in the revised version of the paper.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: N/A

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: I read the authors' response to the question about pre-trained models. The computational resource issue is a valid concern for academia. But again, many of the models do not need many resources (GPUs) or much time just to be fine-tuned on a particular dataset. Does it mean the authors' proposed model is less complex and has fewer parameters than the existing pre-trained models?

In this case, it will be good to compare parameters and time taken to train (fine-tune) the model (if computational resources allow) between the proposed model and pre-trained models such as ViLBERT, VisualBERT, etc.

The dataset bias issue with the pre-trained model also needs to be discussed in the main paper.

Line 158 "....many researchers are still based on the end-to-end training method..." spelling mistakes and appropriate citations are missing.

The remaining concerns have been addressed. However, proofreading is required, and the uses of notation and equations need to be checked carefully.

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

PLoS One. 2022 Nov 28;17(11):e0277693. doi: 10.1371/journal.pone.0277693.r004

Author response to Decision Letter 1

29 Sep 2022

Manuscript ID: PONE-D-21-39541R1

Paper Title: An effective spatial relational reasoning networks for visual question answering

Dear Editor,

Thank you for your letter and for the reviewers’ comments concerning our manuscript entitled “An effective spatial relational reasoning networks for visual question answering” (PONE-D-21-39541R1). These comments are all valuable and very helpful for revising and improving our paper, as well as the important guiding significance to our research. We have carefully studied the comments point-by-point and revised the paper accordingly.

Thanks again for your patience and guidance!

Best regards,

Xiang Shen, Dezhi Han, Chongqing Chen, Gaofeng Luo, Zhongdai Wu

Journal Requirements Response:

Response: First of all, thank you very much for your comments. According to the journal requirements, we rechecked the references cited in this article. The references we cite are not retracted papers. In addition, according to the requirements of reviewer #1, we re-added the pre-training model related to this paper and compared the parameter quantity and accuracy of the SRRN model proposed in this paper. For the convenience of reviewing, the newly added references are marked in yellow.

Reviewer's Responses to Questions

Comments to the Author

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

Response: Thanks again to the editors and reviewers for their comments and suggestions on our paper. We carefully thought about and responded to all the questions raised, and marked them with different colors in the manuscript for easy review.

________________________________________

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

Response: I am very grateful for your affirmation of our work.

________________________________________

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: N/A

Response: Thank you very much for reviewing and commenting on our paper again. To allow readers to better understand the method we propose, we have further refined the article based on the editor and reviewer comments and marked the changes in the manuscript.

________________________________________

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: No

Response: Thank you very much for your review and comments on our article again. To allow readers to better understand and reproduce our proposed method and to conduct in-depth research, the code will be available at https://github.com/shenxiang-vqa/SRRN.

________________________________________

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

Response：I am very grateful for your affirmation of our work.

________________________________________

6. Review Comments to the Author

Response: First of all, thank you very much for reviewing our paper again and giving us suggestions. Based on your suggestions and comments, we can better improve the article. As you said, most of the existing multi-modal pre-training models only need to be fine-tuned on specific task datasets to get good results, and pre-training models are gradually becoming the mainstream method for multi-modal tasks.

Based on your suggestion, we investigated pre-trained models for visual question answering. Pre-trained models have their advantages. However, some pre-trained models have more parameters than our proposed SRRN model without outperforming end-to-end models on visual question answering tasks. As shown in Table 1, we compare existing classical VQA pre-trained models with the SRRN model; the SRRN model has fewer parameters than the listed pre-trained models and accuracy that is comparable to or better than most of them. The comparison shows that our end-to-end model not only has far fewer parameters than the pre-trained models but also requires only one TitanX GPU for training. This indicates that our proposed model compares favorably with some existing pre-trained models in parameter count and complexity. (Tabular data refer to Table 1 in “Multi-stage Pre-training over Simplified Multimodal Pre-training Models”.)

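| Model | Parameters | Test-dev (%) | Test-std (%) |
| --- | --- | --- | --- |
| Unified VLP | - | 70.50 | 70.70 |
| ViLBERT | 218.9M | 70.55 | 70.92 |
| VisualBERT | 85.05M | 70.80 | 71.00 |
| VL-BERT | 134.8M | 71.16 | - |
| DFAF-BERT | 173.2M | 70.59 | 70.81 |
| MLI-BERT | 120.0M | 71.19 | 71.27 |
| SRRN-r2-s | 58.98M | 70.73 | - |
| SRRN-r1-s | 58.19M | 70.92 | 71.18 |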
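To make the parameter comparison concrete, the short sketch below computes how many times larger each pre-trained model is than SRRN-r1-s, using only the parameter counts listed above. The variable and function names are illustrative.

```python
# Parameter counts in millions, copied from the comparison above.
PRETRAINED_PARAMS_M = {
    "ViLBERT": 218.9,
    "VisualBERT": 85.05,
    "VL-BERT": 134.8,
    "DFAF-BERT": 173.2,
    "MLI-BERT": 120.0,
}
SRRN_R1_S_PARAMS_M = 58.19


def size_ratio(model: str) -> float:
    """Return how many times larger the given model is than SRRN-r1-s."""
    return PRETRAINED_PARAMS_M[model] / SRRN_R1_S_PARAMS_M


ratios = {name: round(size_ratio(name), 2) for name in PRETRAINED_PARAMS_M}
smallest_pretrained = min(PRETRAINED_PARAMS_M, key=PRETRAINED_PARAMS_M.get)
```

Every ratio is above 1, so even the smallest listed pre-trained model has more parameters than SRRN-r1-s.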

According to your comment, we added the model parameters and model accuracy to the ablation experiment part of the paper (Section 4.3.3) for comparison with some existing VQA pre-trained models. Although the proposed model adds the spatial object relation inference module, this does not increase the parameter count or complexity of the model, and good results can be obtained by training under the same experimental conditions. The parameters trained by the two models are shown in the figure below, where the left side represents the parameter quantity for training the "SRRN-r1-s" model, and the right side represents the parameter quantity for training the "SRRN-r2" model. The content of Section 4.3.3 that we have added to the manuscript is marked in light blue.

The dataset bias issue with the pre-trained model also needs to be discussed in the main paper.

Response: First of all, thank you very much for your question. Based on your suggestion, we have discussed the issue of dataset bias for pre-trained models in the paper. The pre-training datasets come from various corpora, and the features learned from different corpora generalize well and can be better used for different multimodal tasks after fine-tuning. We use an end-to-end model for visual question answering tasks trained on specific datasets (such as VQA 2.0 and GQA), which makes it easier to capture and extract text and image features and increases the robustness of the model. This discussion has been added in Section 4.1, lines 443 to 450 of the manuscript.

Line 158 "....many researchers are still based on the end-to-end training method..." spelling mistakes and appropriate citations are missing.

Response: First of all, we are very sorry for the spelling mistakes in the manuscript caused by our carelessness, and thank you very much for pointing out these problems. Based on your suggestion, we have carefully checked and corrected the errors in the manuscript. The re-edited content is in lines 129 to 132 of the manuscript.

The remaining concerns have been addressed. However, proofreading is required, and the uses of notation and equations need to be checked carefully.

Reviewer #2: (No Response)

Response: Thank you again for your careful review and comments on our paper. Based on your suggestions, we have carefully checked the manuscript again to ensure its correctness.

________________________________________

Attachment

Submitted filename: Response to Reviewers.docx


PLoS One. doi: 10.1371/journal.pone.0277693.r005

Decision Letter 2

Sriparna Saha

2 Nov 2022

An effective spatial relational reasoning networks for visual question answering

PONE-D-21-39541R2

Dear Dr. Han,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Sriparna Saha, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

**********

6. Review Comments to the Author

Reviewer #1: All the comments have been addressed by the authors. However, the GitHub code is incomplete; many files/packages are missing, e.g., the core package, run.py, etc. Please make the code repository runnable.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

**********

PLoS One. doi: 10.1371/journal.pone.0277693.r006

Acceptance letter

Sriparna Saha

16 Nov 2022

PONE-D-21-39541R2

An effective spatial relational reasoning networks for visual question answering

Dear Dr. Han:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Sriparna Saha

Academic Editor

PLOS ONE

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Attachment

Submitted filename: Response to Reviewers.docx (26.1KB, docx)

Attachment

Submitted filename: Response to Reviewers.docx (229.4KB, docx)

Data Availability Statement

