SGAT: Shuffle and graph attention based Siamese networks for visual tracking

Jun Wang; Limin Zhang; Wenshuang Zhang; Yuanyun Wang; Chengzhi Deng

doi:10.1371/journal.pone.0277064

PLoS One. 2022 Nov 23;17(11):e0277064. doi: 10.1371/journal.pone.0277064

Jun Wang ^1,², Limin Zhang ^1,², Wenshuang Zhang ^1,², Yuanyun Wang ^1,^2,^*, Chengzhi Deng ^1,²

Editor: Sathishkumar V E³

PMCID: PMC9683595 PMID: 36417392

Abstract

Siamese-based trackers, which treat tracking as similarity learning between the target template and search regions, have achieved excellent performance and attracted extensive attention. However, most Siamese-based trackers do not effectively exploit the correlations between spatial and channel-wise information to represent targets. Meanwhile, cross-correlation is a linear matching method that neglects structured and part-level information. In this paper, we propose a novel tracking algorithm with a dedicated feature extraction model for target templates and search region images. Based on convolutional neural networks and shuffle attention, the tracking algorithm computes the similarity between the template and a search region through graph attention matching. The proposed tracking algorithm exploits the correlations between spatial and channel-wise information to highlight the target region. Moreover, graph matching greatly alleviates the influence of appearance variations such as partial occlusion. Extensive experiments on multiple challenging benchmarks demonstrate that the proposed tracking algorithm achieves excellent tracking performance compared with other state-of-the-art methods.

1 Introduction

Visual tracking [1–3] is a fundamental research topic in computer vision. It aims to estimate the target state in subsequent frames given the initial state in the first frame. It is widely used in various applications, such as video surveillance [4], human-computer interaction [5] and augmented reality [6]. Recently, convolutional neural networks (CNNs) have been successfully used in visual tracking. Deep trackers [7, 8] achieve robust tracking performance and real-time tracking speed. However, due to complicated appearance variations, visual tracking is still a challenging task.

In recent years, Siamese networks have been widely used in visual tracking. Siamese-based trackers regard tracking as measuring the similarity between the template patch and a search patch. SiamFC [1] applies a Siamese network to visual tracking for the first time and achieves excellent tracking performance. In particular, it strikes a good balance between tracking accuracy and real-time tracking speed. Based on SiamFC, DSiam [9] learns target appearance variations to improve adaptation capabilities. SiamPCF [10] proposes a novel anchor-free visual tracking algorithm, which uses points instead of a bounding box to describe a target and achieves excellent tracking performance. FSNet [11] develops a feature selection convolutional neural network to reduce the computational complexity and accelerate offline training.

Inspired by the region proposal network (RPN) [12], Li et al. [13] treat visual tracking as two subtasks, classification and regression. Later works [14, 15] improve the tracking accuracy of SiamRPN and reduce redundant parameters. These RPN-based trackers use anchors to obtain candidate boxes. However, they are sensitive to the size, number and aspect ratios of the anchor boxes, and the tuning of hyper-parameters is time-consuming. To solve these problems, anchor-free trackers such as SiamFC++ [16] and SiamCAR [17] have recently been proposed. These anchor-free trackers discard anchors and proposals, and greatly reduce the time spent on tuning hyper-parameters.

It is worth mentioning that both anchor-based and anchor-free trackers use a powerful CNN such as AlexNet [18], ResNet [19] or GoogLeNet [20] to extract features in a Siamese network. However, most Siamese-based trackers use the features of the last convolution layer or of cascaded multiple layers as the target representations of the template and the search region, which does not effectively use the structured and part-level information. Motivated by these considerations, an attention mechanism is applied in the CNN to improve the feature representation, so that it focuses on essential appearance features while suppressing distractors.

Another core component of Siamese-based trackers is the similarity learning. SiamFC introduces the Siamese network as the feature extractor and adopts the cross-correlation operator to compute the similarity between the target template and a search region. In SiamRPN++, a depth-wise correlation is used to reduce the number of model parameters and make the off-line training more stable. However, both the cross-correlation and the depth-wise cross-correlation take the template features as a whole for linear matching on the search regions, so that adjacent sliding windows produce similar responses. Guo et al. [21] propose a graph attention module (GAM) to realize part-to-part matching between the template and a search region.

Different from the previous work [22], we design a novel feature extraction network based on GoogLeNet to exploit the correlations between spatial and channel-wise information. Additionally, in order to alleviate the influence of appearance variations, we use a different similarity computation to obtain more accurate score maps. Inspired by the above-mentioned works, we propose a novel tracking algorithm based on a shuffle attention mechanism and graph matching in a Siamese network. The shuffle attention mechanism in the backbone network reconstructs the basic features extracted by the CNN, and makes the feature representation focus on the regions of interest through spatial and channel-wise transformations. Different from cross-correlation based similarity learning, the part-to-part graph attention matching further improves the tracking robustness in complex scenes, such as those with occlusion.

The contribution of this paper can be summarized as follows:

We propose an end-to-end deep model based on a CNN and a shuffle attention unit to enhance the capacity of feature representations. The model effectively exploits the correlations between spatial and channel-wise information without extra overhead.
We develop a novel tracking framework based on a Siamese network, consisting of the designed deep model, graph attention matching and a prediction head. Compared with traditional cross-correlation based trackers, the proposed tracking algorithm exploits structured and part-level information, which greatly alleviates the influence of appearance variations such as fast motion and partial occlusion.
Extensive experimental results demonstrate that the proposed tracker has excellent performance on multiple benchmarks including OTB-100 [23], GOT-10k [24], UAV123 [25] and LaSOT [26], and outperforms many SOTA trackers. At the same time, the proposed tracking algorithm meets the real-time requirement with an average speed of 60 FPS.

The rest of this paper is arranged as follows. In Section 2, we review the related works. The details of the proposed method are described in Section 3. The experimental results on four benchmarks are presented followed by the ablation study and qualitative evaluations in Section 4. At last, we draw a conclusion in Section 5.

2 Related works

In this section, we mainly review some representative works and techniques that are closely related to the proposed algorithm including Siamese-based tracking algorithms and some attention mechanisms.

2.1 Siamese-based visual tracking

In recent years, trackers based on Siamese networks have attracted increasing attention for their leading performance [27–29]. Tao et al. [30] propose to learn a similarity function to locate a target. However, because the candidate sampling is not efficient enough, SINT [30] cannot reach real-time tracking speed (about 2 FPS). In [1], a fully convolutional Siamese network framework is proposed to compute the similarity between the target template and a search region in an embedding space, which achieves beyond real-time tracking. Many later works take SiamFC as a baseline to improve tracking accuracy. Guo et al. [9] propose a dynamic Siamese network that can effectively learn the appearance variations of a target and suppress the background information through a fast transformation learning model.

Fan et al. [31] propose a dual-margin model for accurate and robust visual tracking, which formulates the target state prediction problem with an intra-object margin and an inter-object margin. Li et al. [32] propose a thermal infrared tracker based on a hierarchical spatially-aware twin network that regards the infrared tracking problem as a similarity verification task.

Recently, Li et al. [13] propose to connect a region proposal extraction subnetwork to the Siamese network framework, which distinguishes the target from the surrounding background with a classification branch and estimates the bounding box with a regression branch. SiamRPN [13] achieves promising tracking accuracy and beyond real-time speed. Many outstanding works such as DaSiamRPN [33] and C-RPN [34] use SiamRPN as a baseline, and improve the tracking accuracy by designing different network models.

However, most of the Siamese network trackers that connect RPN subnetworks are based on anchors, and are therefore sensitive to the size, number and aspect ratios of the anchor boxes. Xu et al. [16] analyze the prior tracking methods and propose a set of practical target state estimation criteria, which effectively solves the wrong matching problem caused by anchor objects. Guo et al. [17] propose a new fully convolutional Siamese network framework to solve the end-to-end tracking problem in a per-pixel manner; the framework is both proposal-free and anchor-free. Zhang et al. [35] propose a novel object-aware anchor-free network. Compared with anchor-based trackers, it can regress to the target region over a large spatial range. Under a fully convolutional network, Chen et al. [36] propose the Siamese Box Adaptive Network for object tracking, and regard the tracking task as a parallel classification and regression problem.

2.2 Attention mechanism

Recent works [37, 38] have made many contributions to improving the performance of convolutional neural networks, and these efforts can be roughly divided into two categories. The first category aims to improve the performance of CNNs in the spatial domain. Representative works mainly include dilated convolution [39, 40] and deformable convolution [41, 42]. These works mainly expand the receptive field of the network by using a predefined gap, or focus on the target region adaptively through a series of row and column transformations. The disadvantage of these methods is that they do not make use of the cues between channels.

The second category combines spatial transformation and channel attention to enhance the performance of CNNs. Representative works include SENet [38] and CBAM [43]. These methods usually use additional subnetworks to redistribute the weights of the feature map. They highlight the target region and suppress the background interference, and achieve excellent feature representation performance. However, these methods usually introduce additional parameters and overhead.

By combining a feature extraction network with an attention mechanism to represent the template and search region patches, Siamese-based algorithms have achieved significant performance improvement. Zhang et al. [19] propose a residual attentional Siamese network (RASNet) to adaptively reconstruct a model. He et al. [2] build a twofold Siamese network (SA-Siam) to represent the appearance and semantic information, and introduce channel attention to enhance the capture of semantic information. Hua et al. [44] propose a lightweight UAV algorithm based on an attention mechanism and strategy gradient to improve overall tracking accuracy and robustness. Different from these works, we reconstruct the basic features by combining spatial and channel attention. Finally, a nonlinear graph attention matching is used instead of cross-correlation to capture spatial semantic information.

3 Method

In this section, we introduce the SGAT algorithm in detail. As shown in Fig 1, it includes three main components: 1) a feature extraction network with shared weights, which is used for deep feature extraction of the target template and search regions; 2) the shuffle attention unit (SA Unit), which reconstructs the basis features to focus on the target region and suppress the background interference through spatial and channel-wise transformations; 3) graph attention matching (GM), which computes the similarity between the target template and a search region, and is combined with classification and regression branches to locate the target position in the current frame.

3.1 Siamese-based object tracking

Trackers based on Siamese networks have become more popular due to their leading tracking performance [17, 45]. These trackers regard the tracking task as a similarity matching problem between the target template and a search region, and greatly improve the accuracy and robustness.

In Siamese-based trackers, the initial frame is first preprocessed to obtain the template image x, and the search images z are cropped from the subsequent frames. Then, these image patches are fed into a convolutional neural network with shared weights for feature extraction. Next, the similarity between the template features and the search region features is measured through cross-correlation to obtain the score map. Finally, the target center position offset and scale change are calculated according to the values of the score map. The score map is calculated as follows:

\begin{matrix} f (z, x) = φ (z) * φ (x) + b I, \end{matrix}

(1)

where φ represents a convolution embedding function, * represents the cross-correlation calculation layer, and $b I$ means a signal which takes value $b \in R$ in every location.
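To make Eq (1) concrete, the following minimal Python sketch computes a score map by sliding a template over a search map and taking the inner product at each location. It is an illustration only: it assumes single-channel 2-D features stored as nested lists, whereas the embedded features φ(x) and φ(z) are multi-channel, and the function name `cross_correlation` and the toy inputs are hypothetical rather than part of the SGAT implementation.

```python
def cross_correlation(search, template, b=0.0):
    """Slide `template` over `search` (both 2-D nested lists) and return
    the score map of inner products plus the bias b (valid positions only)."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    scores = []
    for top in range(sh - th + 1):
        row = []
        for left in range(sw - tw + 1):
            # Inner product between the template and the current window.
            acc = sum(
                template[m][n] * search[top + m][left + n]
                for m in range(th)
                for n in range(tw)
            )
            row.append(acc + b)
        scores.append(row)
    return scores


# Toy input: the 2x2 template also occurs in the lower-right corner of the search map.
template = [[1.0, 2.0], [3.0, 4.0]]
search = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 2.0],
    [0.0, 3.0, 4.0],
]
score_map = cross_correlation(search, template)
print(score_map)  # [[4.0, 11.0], [14.0, 30.0]]
```

In this toy case the largest response (30) appears at the lower-right position, where the window equals the template. Every entry is a plain inner product with the whole template, so the template is matched as one rigid unit.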

Although cross-correlation is a simple and efficient method, it is essentially a vector inner product. This is a linear matching process that does not use important semantic information, which degrades the localization accuracy of these trackers in complex scenarios, such as occlusion and motion blur. Later works [17, 46] are devoted to alleviating the negative effects of cross-correlation. Chen et al. [45] propose a Transformer-like structural feature fusion network to replace the cross-correlation process and achieve outstanding performance.

In order to alleviate the negative effects of cross-correlation, we learn a similarity measure via graph attention matching and shuffle attention in a Siamese network. Firstly, the basic features of the template and the search region are extracted by a convolutional neural network. Next, the basic features are divided into different groups along the channel dimension, and each group is reconstructed after the channel-wise and spatial-wise transformations. Finally, we compute the similarity between the reconstructed features of the target template and the search region. In particular, the graph attention matching method takes full advantage of the structured and part-level information.

The proposed SGAT algorithm then locates the target position in the current frame with the classification and regression prediction heads. The cross-entropy loss and the IoU loss are used to update the model parameters by back-propagation in the classification and regression branches, respectively.

3.2 Shuffle attention module

Because attention mechanisms improve the ability of CNNs to represent a target, they have attracted extensive attention and have been successfully applied to tracking tasks. However, existing attention mechanisms often do not fully exploit the feature dependencies of the spatial and channel dimensions, and they add overhead to the network model. Therefore, we propose an end-to-end deep model for feature extraction that combines convolutional neural networks and a shuffle attention unit. As shown in Fig 2, the designed deep model effectively exploits the correlations between spatial and channel-wise information to highlight the target region without extra overhead.

Feature grouping. It is assumed that there is a basis feature $r \in R^{C \times H \times W}$ obtained by convolutional neural network, where C, H and W represent the channel number, height and width, respectively.

In the shuffle attention unit, a basic feature r is divided into D groups along the channel dimension, denoted as $r_{k} \in R^{C / D \times H \times W}$ , k ∈ {1, 2, ⋯, D}, where r_k represents the k-th sub-feature. In this way, the feature r is divided into multiple sub-features r_k, and the weight coefficient of each sub-feature is then learned through off-line training. Meanwhile, each sub-feature is internally divided into two branches, a and b, r_k = [r_ka, r_kb], which use the spatial and channel-wise information correlations to learn the weight coefficients and reduce the redundancy of local features.
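The grouping step can be sketched as follows. The sketch assumes that a feature is stored as a list of C channel maps and that each group is split into two equally sized branches. The equal split and the name `group_and_split` are assumptions made for illustration, since the text only states that two branches are used.

```python
def group_and_split(feature, num_groups):
    """Split a feature (list of C channel maps) into D groups along the
    channel axis, then split every group r_k into branches (r_ka, r_kb)."""
    channels = len(feature)
    if channels % (2 * num_groups) != 0:
        raise ValueError("C must be divisible by 2 * D for an even split")
    per_group = channels // num_groups   # C / D channels per group
    half = per_group // 2                # assumed equal branch sizes
    groups = []
    for k in range(num_groups):
        r_k = feature[k * per_group:(k + 1) * per_group]
        groups.append((r_k[:half], r_k[half:]))
    return groups


# Toy feature with C = 8 channels of size 1x1; channel c holds the value c.
feature = [[[float(c)]] for c in range(8)]
groups = group_and_split(feature, num_groups=2)
print(len(groups), len(groups[0][0]), len(groups[0][1]))  # 2 2 2
```

With C = 8 and D = 2, each group holds four channels and each branch holds two, so the two transformations operate on disjoint channel subsets.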

Channel-wise transformation. The channel transformation focuses on ‘what’ is important in an input image. The typical channel attention is the SE block, which can effectively capture the correlation between channels. However, SE blocks usually increase the number of parameters of the model, which does not accord with the principle of lightweight design in tracking tasks. To generate channel weights efficiently, the spatial dimension of an input feature map is usually compressed, and average pooling is adopted to integrate the spatial information. Based on this prior, we adopt a novel channel transformation method that resizes the channel-wise block through global average pooling. The channel-wise block is obtained as follows:

\begin{matrix} r_{k b}^{'} = \frac{1}{H \times W} \sum_{m = 1}^{H} \sum_{n = 1}^{W} r_{k b} (m, n), \end{matrix}

(2)

where H and W indicate the height and width of the feature map, and r_kb(m, n) is the value of sub-feature r_kb at spatial location (m, n).

In addition, the resized channel block is adaptively scaled and shifted, and the final output of the channel attention is obtained as follows:

\begin{matrix} r_{k b}^{''} = σ (w_{1} r_{k b}^{'} + b) \cdot r_{k b}, \end{matrix}

(3)

where w₁ and b are parameters for feature scaling and shift, respectively, σ is a sigmoid activation function.
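Eqs (2) and (3) can be illustrated with the short sketch below. It assumes that w₁ and b hold one scalar per channel, which is an assumption of this sketch rather than a detail stated above, and the function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(r_kb, w1, b):
    """Gate every channel of branch r_kb (list of HxW maps) by a scalar
    derived from its global average, following Eqs (2) and (3)."""
    out = []
    for channel, w_c, b_c in zip(r_kb, w1, b):
        h, w = len(channel), len(channel[0])
        pooled = sum(sum(row) for row in channel) / (h * w)  # Eq (2)
        gate = sigmoid(w_c * pooled + b_c)                    # sigma(w1 r' + b)
        out.append([[gate * v for v in row] for row in channel])  # Eq (3)
    return out


# One 2x2 channel; with w1 = 0 and b = 0 the gate is sigmoid(0) = 0.5.
r_kb = [[[1.0, 3.0], [5.0, 7.0]]]
gated = channel_attention(r_kb, w1=[0.0], b=[0.0])
print(gated)  # [[[0.5, 1.5], [2.5, 3.5]]]
```

The gate is a single number per channel, so this branch re-weights whole channels and leaves the spatial layout inside each channel unchanged.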

Spatial-wise transformation. As a supplement to the channel-wise transformation, the spatial transformation aims to locate ‘where’ the important regions are. To carry out the spatial transformation effectively, max-pooling and average-pooling are usually applied to the input feature along the channel dimension. In this paper, the specific implementation steps are as follows. Firstly, group normalization (GN) is used to preprocess the spatial features. Then, a linear transformation and an activation function are combined to enhance the feature representation and suppress the interference of the background region. The transformed spatial features are as follows:

\begin{matrix} r_{k a}^{''} = σ (w_{2} G N (r_{k a}) + b) \cdot r_{k a}, \end{matrix}

(4)

where w₂ and b are parameters for feature scaling and shift, respectively, GN means a group normalization.
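A minimal sketch of Eq (4) is given below. It assumes that GN treats the whole branch as a single normalization group and that w₂ and b hold one scalar per channel. Both choices, like the function names, are assumptions made for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def group_norm(branch, eps=1e-5):
    """Normalise all values of `branch` (list of HxW maps) as one group."""
    values = [v for ch in branch for row in ch for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var + eps)
    return [[[(v - mean) / std for v in row] for row in ch] for ch in branch]


def spatial_attention(r_ka, w2, b):
    """Gate every location of r_ka by sigma(w2 * GN(r_ka) + b), as in Eq (4)."""
    normed = group_norm(r_ka)
    out = []
    for ch, ch_n, w_c, b_c in zip(r_ka, normed, w2, b):
        out.append([
            [v * sigmoid(w_c * n + b_c) for v, n in zip(row, row_n)]
            for row, row_n in zip(ch, ch_n)
        ])
    return out


# One 1x2 channel: the above-average location receives the larger gate.
r_ka = [[[1.0, 3.0]]]
gated = spatial_attention(r_ka, w2=[1.0], b=[0.0])
print(gated[0][0][0] / 1.0 < 0.5 < gated[0][0][1] / 3.0)  # True
```

Unlike the channel branch, the gate here differs per spatial location, so locations with strong normalized responses are emphasized relative to the rest of the map.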

Feature reconstruction. After the spatial transformation, the sub-features of each group include both spatial and channel-wise context information. Then, the sub-features are reorganized in the shuffle attention unit along the channel dimension, i.e., $r = [r_{1}^{''}, r_{2}^{''}, \dots, r_{D}^{''}]$ , with $r_{k}^{''} = [r_{k a}^{''}, r_{k b}^{''}]$ . A concatenation function is used for the sub-feature reconstruction.

3.3 Similarity measuring

In the past, Siamese-based trackers usually used cross-correlation for similarity matching [1, 12], which matches the template as a whole against the search region. However, this is a linear matching process that does not take advantage of nonlinear semantic information. In addition, the template is usually represented by a rectangular box as a unit, which introduces background noise into the template representation. These factors lead to the performance bottleneck of Siamese-based trackers. To take full advantage of the structured and part-level information, we learn a similarity measure based on graph attention matching instead of cross-correlation. The target template and search region features are decomposed into multiple grids, and the similarity is computed between different template and search region grids, which greatly alleviates the challenge of target pose variations. After obtaining the reconstructed features of the template and search region from the deep end-to-end model, we treat each 1 × 1 × C grid of the feature map as a node. For node i in the template and node j in the search region, the correlation score is:

\begin{matrix} e_{i, j} = f (g^{i} (φ (x)), g^{j} (φ (z))), \end{matrix}

(5)

where g^i and g^j are the reconstructed feature vectors of node i and node j.

In order to improve the information propagation between different nodes, we adopt the softmax function to normalize e_i,j as follows:

\begin{matrix} α_{i, j} = \frac{\exp (e_{i, j})}{\sum_{β \in η_{t}} \exp (e_{β, j})}, \end{matrix}

(6)

where η_t is the node set that includes all template nodes.

Therefore, we obtain the discriminative feature representation as follows:

\begin{matrix} {\hat{g}}^{j} (φ (z)) = F ((\sum_{i \in η_{t}} α_{i, j} W_{v} g^{i} (φ (x))) ∥ (W_{v} g^{j} (φ (z)))), \end{matrix}

(7)

where W_v is a linear matrix transformation, F(⋅) is the ReLU activation function.

The more similar the local features of a search region location are to the target template, the more likely that location is to be considered foreground. Therefore, we choose the inner product to measure the similarity, which is suitable for representing this relationship. The score function f in Eq (5) is computed as:

\begin{matrix} f (g^{i} (φ (x)), g^{j} (φ (z))) = {(w_{s} g^{j} (φ (z)))}^{T} (w_{x} g^{i} (φ (x))), \end{matrix}

(8)

where w_s and w_x are the linear transformation matrices, g represents the corresponding feature vector node of the template and the search region, and (⋅)^T means the matrix transpose.
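The matching procedure of Eqs (5)–(8) can be sketched in plain Python as follows. This is a minimal illustration and not the authors' implementation. It assumes that nodes are stored as flat lists of floats, that w_s projects search nodes and w_x projects template nodes, and that the attention weights are normalized over the template node set η_t for each search node. The helper names are hypothetical.

```python
import math


def matvec(mat, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def graph_attention_match(template_nodes, search_nodes, w_x, w_s, w_v):
    """For each search node j, aggregate all template nodes i with softmax
    attention and concatenate the result with the transformed node itself."""
    t_proj = [matvec(w_x, g) for g in template_nodes]
    t_val = [matvec(w_v, g) for g in template_nodes]
    fused = []
    for g_j in search_nodes:
        s_proj = matvec(w_s, g_j)
        e = [dot(t, s_proj) for t in t_proj]            # scores e_{i,j}, Eq (8)
        e_max = max(e)                                    # for numerical stability
        exp_e = [math.exp(v - e_max) for v in e]
        total = sum(exp_e)
        alpha = [v / total for v in exp_e]                # softmax, Eq (6)
        agg = [sum(a * t[d] for a, t in zip(alpha, t_val))
               for d in range(len(t_val[0]))]
        concat = agg + matvec(w_v, g_j)                   # "||" in Eq (7)
        fused.append([max(0.0, v) for v in concat])       # F is ReLU
    return fused


# Two template nodes and one search node that resembles the first template node.
eye = [[1.0, 0.0], [0.0, 1.0]]
template_nodes = [[1.0, 0.0], [0.0, 1.0]]
search_nodes = [[10.0, 0.0]]
fused = graph_attention_match(template_nodes, search_nodes, eye, eye, eye)
print(len(fused[0]))  # 4: aggregated template part followed by the search part
```

Because the attention weights are computed per node pair, a search node can still collect evidence from the template parts that remain visible when other parts are occluded, which is the motivation for part-to-part matching given above.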

4 Experimental results

4.1 Implementation details

The proposed SGAT is implemented in Python using PyTorch on one NVIDIA Quadro P4000 GPU, an Intel Xeon E5-2600 v4 CPU (2.00GHz) and 32GB RAM. Due to the limited hardware resources in the laboratory, we reduce the batch size to 24. We train the deep model on the training splits of the GOT-10k and COCO datasets, with images cut to 511*12 size through preprocessing. The sizes of the search region patches and the template patch are 287*287 and 127*127, respectively. The backbone parameters are initialized with weights pretrained on ImageNet.

The SGAT algorithm uses GoogLeNet as the backbone network. Compared with traditional feature extraction networks, GoogLeNet can extract richer features and uses max-pooling to reduce the parameter redundancy of the upper layers.

Evaluation metric. We adopt the one-pass evaluation (OPE) metrics of precision and success rate to evaluate the performance of the trackers. The precision is evaluated by the center location error (CLE) between the predicted location and the ground-truth location. The precision plot shows the percentage of frames whose CLE is within a specified threshold. The success rate is based on the intersection over union (IoU) between the predicted bounding box and the ground truth: when the IoU exceeds a certain threshold, the target is considered to be tracked accurately in that frame, and the success plot shows the percentage of such frames as the threshold varies.
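The two quantities behind these plots can be sketched as follows. The sketch assumes boxes in (x, y, width, height) format and leaves the thresholds as parameters, since no specific values are stated above. The function names are hypothetical and this is not the official benchmark toolkit.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_location_error(box_a, box_b):
    """Euclidean distance between the two box centres, in pixels."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2) - (bx + bw / 2),
                      (ay + ah / 2) - (by + bh / 2))


def precision(preds, gts, pixel_threshold):
    """Fraction of frames whose CLE is within the pixel threshold."""
    hits = sum(center_location_error(p, g) <= pixel_threshold
               for p, g in zip(preds, gts))
    return hits / len(preds)


def success(preds, gts, iou_threshold):
    """Fraction of frames whose IoU exceeds the overlap threshold."""
    hits = sum(iou(p, g) > iou_threshold for p, g in zip(preds, gts))
    return hits / len(preds)


# Two frames: a perfect prediction and one that has drifted away.
gts = [(0, 0, 10, 10), (0, 0, 10, 10)]
preds = [(0, 0, 10, 10), (30, 40, 10, 10)]
print(precision(preds, gts, 20), success(preds, gts, 0.5))  # 0.5 0.5
```

Sweeping `pixel_threshold` or `iou_threshold` over a range of values and plotting the returned fractions yields the precision and success plots, respectively.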

4.2 Ablation study

To verify the effectiveness of the core components of the designed network framework, we choose OTB-100 benchmark to verify different schemes. In shuffle attention module, we divide the basis features into multiple sub-features along the channel dimensions. The shuffle unit reconstructs each sub-feature by spatial and channel-wise transformations. Finally, the sub-features are combined by using the dependence along channel dimensions. We analyze the advantages and disadvantages of different types of backbone networks, and prove the performance gain brought by different types of backbone networks through experiments in Table 1. Meanwhile, we compare with the traditional similarity computation methods, and verify the effectiveness of graph attention matching method.

Table 1. Ablation experiments on OTB-100 benchmark.

GM denotes graph matching, Xcorr denotes cross correlation, respectively, and SA means shuffle attention unit.

Dataset	Backbone	Embedding Type	SA	Success rate	Precision
OTB-100	GoogLeNet	GM		0.671	0.855
	GoogLeNet	GM	✓	0.688	0.886
	GoogLeNet	Xcorr	✓	0.627	0.821
	ResNet	Xcorr	✓	0.621	0.805

Backbone architecture. Among object tracking algorithms based on Siamese networks, backbone networks fall mainly into two kinds: 1) Shallow feature extraction networks, such as AlexNet and VGGNet. The advantages of these networks are that they converge easily and have a small number of parameters. The disadvantage is that the generalization ability of the model is insufficient; 2) Deep feature extraction networks, such as ResNet-152. Deep networks can improve the ability of the model to fit and generalize to the data. However, the disadvantage is that there are too many parameters and the model is not lightweight enough.

In view of the above analysis, the feature extraction network we selected is GoogLeNet (Inception v3). Compared with traditional backbone networks, such as AlexNet and ResNet, it provides multi-scale feature fusion to increase the scale adaptability of the network, and uses some tricks to reduce the feature redundancy of each layer. We train the model for a total of 20 epochs to reach convergence.

Correlation comparison. Existing tracking algorithms based on Siamese networks use cross-correlation to compute the similarity between the template patch and a search region patch, and have achieved great performance improvement. However, cross-correlation is a linear matching process that neglects the structured and part-level information, which may be the bottleneck of Siamese-based trackers.

Therefore, instead of cross correlation, we use a novel graph matching to compute the similarity between the template patch and the search region patch. In this way, the similarity is calculated from part to part, which exploits the structured and part-level information. Meanwhile, we use the above shuffle attention mechanism to redistribute the weights of the features extracted from the backbone network, and highlight the target region of interest.

4.3 Comparison with state-of-the-art trackers

We compare the proposed SGAT with other state-of-the-art trackers, including Ocean [35], SiamFC++ [16], SiamRPN++ [14], SiamBAN [36], SiamCAR [17], ATOM [47], SPM [48] and CLNet [49], on OTB-100, GOT-10k, LaSOT and UAV123. Next, we analyze the tracking performance on these benchmarks.

OTB-100. The OTB-100 benchmark consists of 98 challenging video sequences, of which the Jogging and Skating2 sequences each have two different initial tracking objects. Every video sequence is annotated with one or more of 11 attributes: illumination variation (IV), occlusion (OCC), deformation (DEF), out-of-view (OV), low resolution (LR), out-of-plane rotation (OPR), in-plane rotation (IPR), fast motion (FM), background clutter (BC), motion blur (MB) and scale variation (SV). As shown in Fig 3, since similarity learning based on graph matching effectively exploits the structured information, the SGAT algorithm achieves the best results in overall success rate and on the different attributes. In Fig 4, the proposed SGAT achieves an appropriate compromise between real-time speed and success rate. The results demonstrate the effectiveness of the appearance model designed by combining a CNN and the shuffle attention mechanism.

Fig 4. The x-axis represents the tracking speed on a base-10 logarithmic scale and the y-axis represents the success rate.

GOT-10k. GOT-10k is a recently released large-scale tracking benchmark that includes a total of 10,000 video sequences, 563 object classes and 87 motion forms (e.g. running, swimming, skiing, crawling, cycling), with 180 challenging video sequences in the test set. In particular, all tracking results must be evaluated on the specified server, which makes the comparison between algorithms fairer. It is worth mentioning that, compared with other benchmarks, GOT-10k restricts which training sets may be used for training. As shown in Table 2, we compare with other state-of-the-art trackers in terms of average overlap (AO) and success rates (SR) at thresholds 0.5 and 0.75. SR0.5 is the rate of successfully tracked frames whose overlap exceeds 0.5, while SR0.75 is the rate of successfully tracked frames whose overlap exceeds 0.75. SGAT achieves the best SR0.5 (70.1) and ties with SiamFC++ for the best AO (59.5), while its SR0.75 (46.6) is slightly below that of Ocean-offline and SiamFC++ (47.3). In Fig 5, we present the AO compared with the SOTA trackers. Benefiting from the part-to-part similarity computation between the target template and the search region, the designed algorithm achieves the best performance in terms of AO and SR0.5.
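The three GOT-10k measures can be illustrated with a short sketch that takes a list of per-frame overlaps as input. It follows the definitions given in the text and is an assumption-laden illustration. The official server evaluation may aggregate frames and sequences differently, and the function name and toy overlaps are hypothetical.

```python
def got10k_style_metrics(overlaps):
    """Return (AO, SR0.5, SR0.75) for a list of per-frame IoU values."""
    n = len(overlaps)
    ao = sum(overlaps) / n                       # average overlap
    sr50 = sum(o > 0.5 for o in overlaps) / n    # frames with overlap > 0.5
    sr75 = sum(o > 0.75 for o in overlaps) / n   # frames with overlap > 0.75
    return ao, sr50, sr75


# Four toy frames, not taken from any benchmark result.
ao, sr50, sr75 = got10k_style_metrics([0.9, 0.6, 0.4, 0.8])
print(round(ao, 3), sr50, sr75)  # 0.675 0.75 0.5
```

SR0.75 is the strictest of the three measures, because a frame only counts when the predicted box overlaps the ground truth very tightly.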

Table 2. Comparison with state-of-the-art trackers on GOT-10k benchmark.

AO, success rates(SR)0.5 and success rates(SR)0.75 represent the average overlap and the success rate at the threshold of 0.5 and 0.75.

Trackers	AO	SR0.5	SR0.75
MDNet [8]	29.9	30.3	9.9
ECO [50]	31.6	30.9	11.1
CCOT [51]	32.5	32.8	10.7
SiamFC [1]	34.8	35.3	9.8
THOR [52]	44.7	53.8	20.4
SiamRPN-R18 [13]	48.3	58.1	27.0
SPM [48]	51.3	59.3	35.9
SiamRPN++ [14]	51.7	61.5	32.9
ATOM [47]	55.6	63.4	40.2
DiMP-18 [53]	57.9	67.2	44.6
SiamCAR [17]	57.9	67.7	43.7
Ocean-offline [35]	59.2	69.5	47.3
SiamFC++ [16]	59.5	69.5	47.3
SGAT (Ours)	59.5	70.1	46.6

Fig 5. The suffix A in SiamFC++_A indicates that the backbone network is AlexNet.

LaSOT. LaSOT is another large-scale single object tracking dataset, which includes 1,400 video sequences, 280 of which form the test set, with an average of more than 2,500 frames per video. LaSOT is well suited to further evaluating the robustness of trackers, because long-term tracking exposes the degradation of the model and involves challenging factors such as occlusion and out-of-view targets. As shown in Figs 6 and 7, the proposed SGAT achieves the best performance in terms of success rate, precision and normalized precision. In Table 3, compared with twelve state-of-the-art trackers, the SGAT algorithm achieves the best success rate (52.1) and normalized precision (60.9), and matches SiamBAN for the best precision (52.1).

Fig 7. SiamFC_pp and SiamRPN_pp represent the trackers SiamFC++ and SiamRPN++, respectively.

Table 3. A performance comparison with other competitive methods on the test split of LaSOT, where Suc., Pre. and Norm.Pre. represent the success rate, precision, normalized precision, respectively.

Trackers	Suc.	Pre.	Norm.Pre.
StructSiam [54]	33.5	33.3	41.8
SiamFC [1]	33.6	33.9	42.0
VITAL [55]	39.0	36.0	48.4
MDNet [8]	39.7	37.3	46.0
SPLT [56]	42.6	39.6	49.4
C-RPN [34]	45.5	44.3	54.2
SiamMask [57]	46.7	46.9	55.2
SiamFC++ [16]	50.0	47.4	57.1
SiamRPN++ [14]	49.5	49.3	57.0
CLNet [49]	49.9	49.4	57.4
ATOM [47]	49.9	49.7	57.0
SiamBAN [36]	51.4	52.1	59.8
SGAT (Ours)	52.1	52.1	60.9

UAV123. UAV123 is a benchmark dataset designed for UAV tracker evaluation, including 123 challenging video sequences with an average of 935 frames per sequence. Due to the characteristics of UAVs, the main challenges of the test set are occlusion and small targets, and most images have low resolution. In Table 4, the developed SGAT reaches a success rate of 0.616 and a precision rate of 0.807, which outperforms the other mainstream tracking algorithms listed.

Table 4. Comparison with other competitive methods on the test split of UAV123 in terms of success rate and precision rate.

	Ours	SiamRPN++	DaSiamRPN	UPDT	SiamRPN	ECO	SiamFC
Success rate	0.616	0.610	0.569	0.547	0.557	0.524	0.485
Precision rate	0.807	0.803	0.781	0.780	0.768	0.741	0.693

Qualitative evaluation. In Fig 8, we show the results compared with other trackers on four challenging sequences. At the 110th frame of the sequence blurbody, because the shuffle attention model reconstructs the basis features, the proposed SGAT regresses the bounding box of the target accurately, in contrast to SiamFC++ and SiamDW. In the box and coke sequences, the target is partially occluded, accompanied by fast motion and illumination change. Compared with the tracking results of SiamFC++ and SiamDW, the SGAT algorithm can accurately regress the target bounding box. The same situation occurs in the sequence woman: at the 488th frame, the woman is occluded by cars and trees for a short time during the movement, and the SGAT can effectively locate the target again after this short occlusion. The proposed SGAT achieves the best result on this sequence by combining the deep model and a similarity learning based on graph attention matching. Each sub-feature is reconstructed by spatial and channel transformations to highlight the target region of interest and suppress background information. In addition, a similarity learning based on graph matching can alleviate the influence of appearance variation.

Limitations. As shown in Fig 9, in complex tracking environments, trackers may suffer from tracking drift and tracking failure. In some extreme scenarios, the SGAT cannot complete the target tracking task well. For example, after the 106th frame in the sequence soccer and the 143rd frame in the sequence bird1, when there are many similar distractors and long-term occlusion in the scene, the SGAT loses the target, which leads to tracking failure. This is also an urgent problem faced by existing trackers. In future work, we will focus on the following two aspects: 1) modeling the target using spatial-temporal context information to ensure that the target can be located when occlusion occurs; 2) adding a learnable memory unit to alleviate the problem that the target often disappears in long-term tracking.

5 Conclusion

In this paper, we propose a simple and efficient visual tracking algorithm based on Siamese networks, which achieves a better balance between accuracy and real-time speed. In the designed deep model, the features are divided into multiple sub-features along the channel dimension and processed in parallel, taking full advantage of the feature dependence in the spatial and channel dimensions. The sub-features are then aggregated to exploit the correlation between spatial and channel-wise information. In addition, we compute the similarity between the template and a search region with a novel graph attention matching method to obtain the score map, which effectively exploits the target structure and part-level information. Extensive experiments demonstrate that the proposed tracker achieves excellent tracking performance on multiple benchmarks and outperforms many state-of-the-art trackers.
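
The channel grouping summarized above can be illustrated with a minimal sketch. It only shows how a list of channels may be divided into groups along the channel dimension and then interleaved so that information is mixed across groups. The function names `split_channels` and `channel_shuffle` are hypothetical, and the interleaving step is an assumption based on the general shuffle-attention idea; it is not taken from the released SGAT implementation, and the per-group attention itself is omitted.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_channels(channels: Sequence[T], groups: int) -> List[List[T]]:
    """Divide a channel list into `groups` contiguous sub-features."""
    if groups <= 0 or len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    size = len(channels) // groups
    return [list(channels[g * size:(g + 1) * size]) for g in range(groups)]


def channel_shuffle(channels: Sequence[T], groups: int) -> List[T]:
    """Interleave channels across groups.

    Equivalent to reshaping to (groups, size), transposing to
    (size, groups), and flattening again.
    """
    sub_features = split_channels(channels, groups)
    size = len(channels) // groups
    return [sub_features[g][k] for k in range(size) for g in range(groups)]


if __name__ == "__main__":
    channel_ids = list(range(6))            # six illustrative channel indices
    print(split_channels(channel_ids, 2))   # [[0, 1, 2], [3, 4, 5]]
    print(channel_shuffle(channel_ids, 2))  # [0, 3, 1, 4, 2, 5]
```

In a real network each element would be a feature map rather than an integer, and each group would pass through its own attention branch before the groups are aggregated.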

Acknowledgments

The authors thank the dataset providers for providing the datasets.

Data Availability

(1) OTB2015: Wu Y, Lim J, Yang MH. Object Tracking Benchmark. In: IEEE Transactions on Pattern Analysis and Machine Intelligence; 2015. p. 1834–1848. DOI: 10.1109/TPAMI.2014.2388226 URL: https://ieeexplore.ieee.org/document/7001050
(2) GOT-10k: Huang L, Zhao X, Huang K. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. In: IEEE Transactions on Pattern Analysis and Machine Intelligence; 2019. p. 1562–1577. DOI: 10.1109/TPAMI.2019.2957464 URL: https://ieeexplore.ieee.org/abstract/document/8922619
(3) LaSOT: Fan H, Ling H, Lin L, Yang F. LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking. In: IEEE Conference on Computer Vision and Pattern Recognition; 2019. p. 5374–5383. DOI: 10.1109/cvpr.2019.00552 URL: https://doi.org/10.48550/arXiv.1809.07845
(4) UAV123: Mueller M, Smith N, Ghanem B. A benchmark and simulator for uav tracking. In: European conference on computer vision, Springer; 2016. p. 445–461. DOI: 10.1007/978-3-319-46448-0_27 URL: http://dx.doi.org/10.1007/978-3-319-46448-0_27

Funding Statement

Yuanyun Wang, Wenshuang Zhang, Limin Zhang are funded by the Jiangxi Science and Technology Research Project of Education within the Department of China (No: GJJ190955), and the National Natural Science Foundation of China (No: 61861032) for the study design, the experiments and the paper publishing. Jun Wang is funded by the National Natural Science Foundation of China (No: 61865012) for the study and the publication.

References

1. Bertinetto L, Valmadre J, Henriques JF, Vedaldi A, Torr P. Fully-convolutional siamese networks for object tracking. European conference on computer vision. 2016;850–865.
2. He A, Luo C, Tian X, Zeng W. A twofold siamese network for real-time object tracking. IEEE conference on computer vision and pattern recognition. 2018;4834–4843.
3. Liu Q, Yuan D, Fan N, et al. Learning dual-level deep representation for thermal infrared tracking. IEEE Transactions on Multimedia. 2022;1–8. doi: 10.1109/TMI.2022.3215547
4. Siebel NT, Maybank S. The advisor visual surveillance system. ECCV 2004 workshop applications of computer vision. 2004;1–8.
5. Bonin-Font F, Ortiz A, Oliver G. Visual navigation for mobile robots: A survey. Journal of intelligent and robotic systems. 2008;263–296. doi: 10.1007/s10846-008-9235-4
6. Wei J, Ye G, Mullen T, Grundmann M, Ahmadyan A, Hou T. Instant motion tracking and its applications to augmented reality. arXiv preprint arXiv:1907.06796. 2019;1–8.
7. Zhang Z, Peng H. Deeper and wider siamese networks for real-time visual tracking. IEEE conference on computer vision and pattern recognition. 2019;4591–4600.
8. Nam H, Han B. Learning Multi-domain Convolutional Neural Networks for Visual Tracking. IEEE conference on computer vision and pattern recognition. 2016;4293–4302.
9. Qing G, Wei F, et al. Learning dynamic siamese network for visual object tracking. IEEE international conference on computer vision. 2017;1763–1771.
10. Zeng Y, Zeng B, Yin X, Chen G. SiamPCF: siamese point regression with coarse-fine classification network for visual tracking. Applied Intelligence. 2022;4973–4986. doi: 10.1007/s10489-021-02651-5
11. Cui Z, Lu N. Feature selection accelerated convolutional neural networks for visual tracking. Applied Intelligence. 2021;8230–8244. doi: 10.1007/s10489-021-02234-4
12. Ren S, He K, Girshick R, Sun J. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems. 2015;1–8.
13. Li B, Yan J, Wu W, Zhu Z, Hu X. High performance visual tracking with siamese region proposal network. IEEE conference on computer vision and pattern recognition. 2018;8971–8980.
14. Li B, Wu W, Wang Q, Zhang F, Xing J, Yan J. Siamrpn++: Evolution of siamese visual tracking with very deep networks. IEEE Conference on Computer Vision and Pattern Recognition. 2019;4282–4291.
15. Wang J, Meng C, Deng C, Wang Y. Learning attention modules for visual tracking. Signal, Image and Video Processing. 2022;1–8. doi: 10.1007/s11760-022-02233-z
16. Xu Y, Wang Z, Li Z, Yuan Y, Yu G. Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. AAAI Conference on Artificial Intelligence. 2020;12549–12556.
17. Guo D, Wang J, Cui Y, Wang Z, Chen S. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. IEEE conference on computer vision and pattern recognition. 2020;6269–6277.
18. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Communications of the ACM. 2017;84–90. doi: 10.1145/3065386
19. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. IEEE conference on computer vision and pattern recognition. 2016;770–778.
20. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. IEEE conference on computer vision and pattern recognition. 2016;2818–2826.
21. Guo D, Shao Y, Cui Y, Wang Z, Zhang L, Shen C. Graph attention tracking. IEEE conference on computer vision and pattern recognition. 2021;9543–9552.
22. Yuanyun W, Wenshuang Z, Limin Z, Jun W. Siamese network with a depthwise over-parameterized convolutional layer for visual tracking. PLOS ONE. 2022;1–21.
23. Wu Y, Lim J, Yang MH. Object Tracking Benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2015;1834–1848. doi: 10.1109/TPAMI.2014.2388226
24. Huang L, Zhao X, Huang K. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2019;1562–1577.
25. Mueller M, Smith N, Ghanem B. A benchmark and simulator for uav tracking. European conference on computer vision. 2016;445–461.
26. Fan H, Lin L, Yang F, Chu P, Deng G, Yu S, et al. Lasot: A high-quality benchmark for large-scale single object tracking. IEEE conference on computer vision and pattern recognition. 2019;5374–5383.
27. Hui L, et al. 3D Siamese Transformer Network for Single Object Tracking on Point Clouds. arXiv preprint arXiv:2207.11995. 2022.
28. Tang F, Ling Q. Ranking-Based Siamese Visual Tracking. IEEE Conference on Computer Vision and Pattern Recognition. 2022;8741–8750.
29. Liu Q, Li X, He Z, et al. Learning deep multi-level similarity for thermal infrared object tracking. IEEE Transactions on Multimedia. 2020;2114–2126.
30. Tao R, Gavves E, Smeulders AWM. Siamese instance search for tracking. IEEE conference on computer vision and pattern recognition. 2016;1420–1429.
31. Fan N, Li X, Zhou Z, Liu Q, He Z. Learning dual-margin model for visual tracking. Neural Networks. 2021;344–354. doi: 10.1016/j.neunet.2021.04.004
32. Li X, Liu Q, Fan N, He Z, Wang H. Hierarchical Spatial-aware Siamese Network for Thermal Infrared Object Tracking. Knowledge-Based Systems. 2019;71–81. doi: 10.1016/j.knosys.2018.12.011
33. Zhu Z, Wang Q, Li B, Wu W, Yan J, Hu W. Distractor-aware siamese networks for visual object tracking. European conference on computer vision. 2018;101–117.
34. Fan H, Ling H. Siamese cascaded region proposal networks for real-time visual tracking. IEEE conference on computer vision and pattern recognition. 2019;7952–7961.
35. Zhang Z, Peng H, Fu J, Li B, Hu W. Ocean: Object-aware anchor-free tracking. European Conference on Computer Vision. 2020;771–787.
36. Chen Z, Zhong B, Li G, Zhang S, Ji R. Siamese box adaptive network for visual tracking. IEEE conference on computer vision and pattern recognition. 2020;6668–6677.
37. Li L, Wang K, Li S, Feng X, Zhang L. Lst-net: Learning a convolutional neural network with a learnable sparse transform. European Conference on Computer Vision. 2020;562–579.
38. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. IEEE conference on computer vision and pattern recognition. 2017;4700–4708.
39. Yu F, Koltun V. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. 2015.
40. Chen L, Papandreou G, Kokkinos I, Murphy K, Yuille AL. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence. 2017;834–848.
41. Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y. Deformable convolutional networks. IEEE international conference on computer vision. 2017;764–773.
42. Zhu X, Hu H, Lin S, Dai J. Deformable convnets v2: More deformable, better results. IEEE conference on computer vision and pattern recognition. 2019;9308–9316.
43. Woo S, Park J, Lee J, Kweon I. Cbam: Convolutional block attention module. European conference on computer vision. 2018;3–19.
44. Hua X, Wang X, Rui T, Shao F, Wang D. Light-weight UAV object tracking network based on strategy gradient and attention mechanism. Knowledge-Based Systems. 2021;1–8.
45. Chen X, Yan B, Zhu J, Wang D, Yang X, Lu H. Transformer tracking. IEEE Conference on Computer Vision and Pattern Recognition. 2021;8126–8135.
46. Wang N, Zhou W, Wang J, Li H. Transformer meets tracker: Exploiting temporal context for robust visual tracking. IEEE Conference on Computer Vision and Pattern Recognition. 2021;1571–1580.
47. Danelljan M, Bhat G, Khan F, Felsberg M. Atom: Accurate tracking by overlap maximization. IEEE Conference on Computer Vision and Pattern Recognition. 2019;4660–4669.
48. Wang G, Luo C, Xiong Z, Zeng W. Spm-tracker: Series-parallel matching for real-time visual object tracking. IEEE conference on computer vision and pattern recognition. 2019;3643–3652.
49. Dong X, Shen J, Shao L, Porikli F. CLNet: A compact latent network for fast adjusting Siamese trackers. European Conference on Computer Vision. 2020;378–395.
50. Danelljan M, Bhat G, Shahbaz KF, Felsberg M. Eco: Efficient convolution operators for tracking. IEEE conference on computer vision and pattern recognition. 2017;6638–6646.
51. Danelljan M, Robinson A, Shahbaz KF, Felsberg M. Beyond correlation filters: Learning continuous convolution operators for visual tracking. European conference on computer vision. 2016;472–488.
52. Sauer A, Aljalbout E, Haddadin S. Tracking Holistic Object Representations. arXiv preprint arXiv:1907.12920. 2019.
53. Bhat G, Danelljan M, Gool LV, Timofte R. Learning discriminative model prediction for tracking. IEEE international conference on computer vision. 2019;6182–6191.
54. Zhang Y, Wang L, Qi J, Wang D, Feng M, Lu H. Structured siamese network for real-time visual tracking. European conference on computer vision. 2018;351–366.
55. Song Y, Ma C, Wu X, Gong L, Bao L, Zuo W, et al. Vital: Visual tracking via adversarial learning. IEEE conference on computer vision and pattern recognition. 2018;8990–8999.
56. Yan B, Zhao H, Wang D, Lu H, Yang X. Skimming-Perusal Tracking: A Framework for Real-Time and Robust Long-Term Tracking. IEEE International Conference on Computer Vision. 2019;2385–2393.
57. Wang Q, Zhang L, Bertinetto L, Hu W, Torr PHS. Fast online object tracking and segmentation: A unifying approach. IEEE conference on Computer Vision and Pattern Recognition. 2019;1328–1338.

PLoS One. doi: 10.1371/journal.pone.0277064.r001

Decision Letter 0

Sathishkumar V E

14 Sep 2022

PONE-D-22-21523

SGAT: Shuffle and Graph Attention based Siamese Networks for Visual Tracking

PLOS ONE

Dear Dr. Wang,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Oct 29 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Sathishkumar V E

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: In this paper, the authors propose a tracking algorithm for visual tracking using a shuffle and graph matching attention mechanism. Correlations between the spatial and channel-wise information are explored to highlight the target region. The results are compared on benchmark datasets. The paper is well written. The interpretation and description of the experimental results are also explained clearly. However, this manuscript has some weak points and should be further improved before being considered for publication. Some of my observations are:

1. Abstract is very general. It has to be elaborated with characteristics of the results obtained.

2. No need to give specification for cl and reg in figure 1. It has to be explained through literature.

3. In table 1 also expansion for GM, SA are not needed. Expansions need to be given during first refereed place.

4. In table 1, what Success represents? What about accuracy?

5. Figure 3, more explanation is needed like the reason for SGAT algorithm’s best performance compared to all other methods

6. Figure 4 is not clear

7. Why the authors choose the threshold values 0.5 and 0.75. Justification is required

8. The need for evaluation based on AO is required

9. What is the need for representing success rate as SR? Uniformity in representing evaluation measures are not followed.

10. Table 3 need to be elaborated. It is mentioned that UAV123 is used for evaluation. Why other state-of-the-art datasets are not used for evaluation?

11. In some places, success is represented as AUC. Need to maintain uniformity.

12. Justification is required for fig 9.

13. Recent references need to be included.

Reviewer #2: In this paper, the authors propose a shuffle attention based Siamese tracker. The idea makes sense and the paper is easy to follow. Extensive experimental results demonstrate that the proposed method achieves good performance. However, several problems and questions in the paper should be addressed.

(1) In the abstract, the authors should introduce the core idea of the proposed method and point out the advantages of the method.

(2) The motivation is not clear in the introduction. What problem does this paper solve?

(3) Why is shuffle attention better than the original channel and spatial attentions in tracking?

(4) Several Siamese trackers and attention methods should be discussed to enrich the related work, such as Learning dual-margin model for visual tracking, Learning Deep Multi-Level Similarity for Thermal Infrared Object Tracking, Hierarchical spatial-aware siamese network for thermal infrared object tracking, and Learning dual-level deep representation for thermal infrared tracking.

(5) What is GM in Fig 1? An introduction to it is missing.

(6) How are the groups divided in the shuffle attention module? The authors should explain the reason and conduct an ablation study.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Nov 23;17(11):e0277064. doi: 10.1371/journal.pone.0277064.r002

Author response to Decision Letter 0

3 Oct 2022

Paper No.: PONE-D-22-21523

Title: SGAT: Shuffle and Graph Attention based Siamese Networks for Visual Tracking

Authors: Jun Wang, Limin Zhang, Wenshuang Zhang, Yuanyun Wang*, Chengzhi Deng

Dear Editor:

We would like to thank you and the reviewers for your great efforts in helping us to improve the quality of the paper. After carefully considering the reviewers’ comments and suggestions, we have significantly revised the paper with more details and descriptions.

A detailed summary of the revisions and some specific comments/responses are given in the following.

In short, we feel that we have addressed all crucial concerns of the reviewers. However, if you have any questions or further requirements, please do not hesitate to contact us.

Best regards,

Yuanyun Wang

October 3, 2022

A Response and Summary of the Revisions: No.: PONE-D-22-21523

Authors’ Responses:

*****************

Associate Editor

*****************

(1) Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

RESPONSE: Thanks. We have carefully read the guidelines and style templates, and have revised the manuscript throughout according to the templates.

(2) Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

RESPONSE: Thanks for pointing this out. We will provide the appropriate code in three months.

*****************

Reviewer #1

*****************

(1) Abstract is very general. It has to be elaborated with characteristics of the results obtained.

RESPONSE: Thanks for the comments. We have re-organized the abstract in the revised manuscript with more details according to the results obtained.

…In this paper, we propose a novel tracking algorithm for feature extraction of target templates and search region images through convolutional neural networks and shuffle attention, and computes the similarity between the template and a search region through a graph attention matching. …

…Extensive experiments demonstrate that the proposed tracking algorithm achieves excellent tracking results on multiple challenging benchmarks. Compared with other state-of-the-art methods, the proposed tracking algorithm achieves excellent tracking performance. …

(2) No need to give specification for cl and reg in figure 1. It has to be explained through literature.

RESPONSE: Thanks for the valuable suggestion. We removed the cls and reg branches from Fig 1 in the revised manuscript. We have explained both the classification and regression branches in Section 2, i.e., related works. In addition, this prediction head is commonly used by mainstream trackers, including SiamRPN [12] and SiamCAR [16].

SiamRPN has the classification (cl) and regression (reg) branches. The classification branch distinguishes the target from the surrounding background, and the regression branch refines the target location.

(3) In table 1 also expansion for GM, SA are not needed. Expansions need to be given during first refereed place.

RESPONSE: Thanks for the comments. We now give the expansions of GM and SA where they are first referred to.

In Table 1, we demonstrate the effectiveness of GM by ablation experiments. GM and SA are explained on Page 4, where they are first referred to; for more details, please see Section 3 and Section 4.2. In addition, we kept GM fixed and validated the effectiveness of the SA module.

…2) the shuffle attention mechanism model (SA Unit), which reconstructs the basis features to focus on the target region and suppress the background interference through the spatial and channel-wise transformation; 3) graph attention matching (GM), which computes the similarity between the target template and a search region, and joints classification and regression branches to locate the target position in the current frame. …
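
The graph attention matching (GM) step described in the excerpt above can be sketched in a simplified form. In this sketch every spatial position of the template and of the search region is treated as a graph node, each search node attends to all template nodes with softmax-normalized dot-product scores, and the template features are aggregated with those weights. This is only an illustration under stated assumptions: the function names are hypothetical, and the learned linear transformations and the fusion with the search feature used in graph attention tracking [21] are omitted, so it should not be read as the authors' implementation.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equally sized feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def graph_attention_match(template_nodes: List[Vector],
                          search_nodes: List[Vector]) -> List[Vector]:
    """For each search node, aggregate template nodes by attention weight."""
    aggregated = []
    for s in search_nodes:
        # Part-level similarity between this search node and every template node.
        weights = softmax([dot(s, t) for t in template_nodes])
        dim = len(s)
        aggregated.append([
            sum(w * t[d] for w, t in zip(weights, template_nodes))
            for d in range(dim)
        ])
    return aggregated


if __name__ == "__main__":
    template = [[1.0, 0.0], [0.0, 1.0]]   # two template parts
    search = [[10.0, 0.0], [0.0, 10.0]]   # two search positions
    for row in graph_attention_match(template, search):
        print([round(v, 3) for v in row])
```

Because each search node builds its own weighted view of the template parts, a partially occluded target can still match the template parts that remain visible, which is the intuition behind the part-level matching described in the response.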

(4) In table 1, what Success represents? What about accuracy?

RESPONSE: Thanks for pointing this out. In the evaluation metrics, we explain the relevant definitions of success and accuracy. Here, precision and accuracy denote the same concept. For more details, please see Section 4.1.

…The precision is evaluated by the center location error (CLE) between the predicted location and the ground truth location. The precision plots are drawn in according to the frame percentages of CLE under the specified thresholds. Besides, the success rate is defined as the intersection over union (IoU) between the predicted bounding boxes and the ground truth. Meanwhile, when the IoU exceeds a certain threshold, it is considered to track the target accurately, and the success plot is drawn by the frame percentage. …
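
The two metrics defined in the excerpt above can be made concrete with a short sketch. It computes the center location error (CLE) and the IoU for axis-aligned boxes, and then the fraction of frames that satisfy a threshold. The `(x, y, width, height)` box format, the function names, and the example thresholds (20 pixels for precision, 0.5 for success) are assumptions chosen for illustration; the manuscript's own evaluation code is not shown in this letter.

```python
import math
from typing import List, Tuple

# Assumed box format: (x, y, width, height) with (x, y) the top-left corner.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def center_error(a: Box, b: Box) -> float:
    """Euclidean distance between the two box centers (the CLE)."""
    return math.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                      (a[1] + a[3] / 2) - (b[1] + b[3] / 2))


def precision_at(preds: List[Box], gts: List[Box], pixel_threshold: float) -> float:
    """Fraction of frames whose CLE is within `pixel_threshold` pixels."""
    hits = sum(center_error(p, g) <= pixel_threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def success_at(preds: List[Box], gts: List[Box], iou_threshold: float) -> float:
    """Fraction of frames whose IoU exceeds `iou_threshold`."""
    hits = sum(iou(p, g) > iou_threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    ground_truth = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
    predictions = [(0.0, 0.0, 10.0, 10.0), (30.0, 0.0, 10.0, 10.0)]
    print(precision_at(predictions, ground_truth, 20.0))  # 0.5
    print(success_at(predictions, ground_truth, 0.5))     # 0.5
```

Sweeping `pixel_threshold` or `iou_threshold` over a range of values and plotting the returned fractions gives the precision plot and the success plot, respectively.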

(5) Figure 3, more explanation is needed like the reason for SGAT algorithm’s best performance compared to all other methods.

RESPONSE: Thanks for the suggestions. In the revised manuscript, we have added some sentences analyzing the superior performance shown in Fig 3. For more details, please see Section 4.3.

…since similarity learning based on graph matching effectively exploits the structured information, the SGAT algorithm achieves the best results in the success rate and different attributes. …

(6) Figure 4 is not clear.

RESPONSE: Thanks for pointing this out. We have redrawn the figure and added some corresponding descriptions for Fig 4. For more details, please see Fig 4 and Section 4.3.

…Here, the x-axis represents the 10th power of the tracking speed and the y-axis represents the success rate. For example, when x is taken as 2, the tracking speed is 200 frames per second. …

(7) Why the authors choose the threshold values 0.5 and 0.75. Justification is

required.

RESPONSE: Thanks. In the paper associated with the GOT-10k dataset, the authors

explicitly state the use of thresholds of 0.5 and 0.75, and subsequent mainstream

trackers use these thresholds for fair comparisons. The requirement that all trackers

use the same training and testing sets provided by the dataset ensures a fair

comparison of all trackers. The GOT-10k training and testing sets are non-overlapping.

After the tracking results are uploaded to the official GOT-10k website, the website

automatically analyzes them. The assessment metrics provided include

average overlap (AO) and success rate (SR). SR0.5 indicates the rate of successful

tracking frames with an overlap of more than 0.5, while SR0.75 indicates the rate of

successful tracking frames with an overlap of more than 0.75. For more details, please

see Section 4.3 on Page 9.
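
As an illustration of how AO, SR0.5 and SR0.75 relate to per-frame overlaps, the sketch below computes them from a list of IoU values. The function name `overlap_metrics` and the sample values are hypothetical; the official GOT-10k server may aggregate differently (for example per sequence before averaging), so this only shows the definitions stated above.

```python
import numpy as np

def overlap_metrics(ious):
    """AO and success rates at 0.5 / 0.75 from per-frame IoU values."""
    ious = np.asarray(ious, dtype=float)
    return {
        "AO": float(ious.mean()),               # average overlap over all frames
        "SR0.5": float((ious > 0.5).mean()),    # frames with overlap above 0.5
        "SR0.75": float((ious > 0.75).mean()),  # frames with overlap above 0.75
    }

# Made-up per-frame overlaps, for illustration only.
metrics = overlap_metrics([0.9, 0.8, 0.6, 0.3])
```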

(8) The need for evaluation based on AO is required.

RESPONSE: Thanks for pointing this out.

We performed AO-based evaluation of the trackers on the GOT-10k dataset in Table

2 in Section 4.3. Please see the corresponding explanation and analysis in Section 4.3.

AO represents the average overlap between all estimated bounding boxes and

ground-truth boxes. By using AO as an evaluation index, we can further evaluate the

tracking performance of our tracker. Meanwhile, we used the success rate for evaluation

on the OTB2015, LaSOT and UAV123 datasets, which is an IoU-based metric.

(9) What is the need for representing success rate as SR? Uniformity in representing

evaluation measures are not followed.

RESPONSE: Thanks for the suggestion. We have replaced all occurrences of SR with success rate

throughout the revised manuscript.

(10) Table 3 need to be elaborated. It is mentioned that UAV123 is used for

evaluation. Why other state-of-the-art datasets are not used for evaluation?

RESPONSE: Thanks for pointing this out.

In the UAV123 test set, the main challenge factors are occlusion and small targets, and most

images have low resolution. To the best of our knowledge,

state-of-the-art trackers are usually compared on the UAV123 dataset. Additionally, we

also evaluate our tracker on OTB-100, GOT-10k and LaSOT. Extensive experimental

results demonstrate that the proposed tracker has excellent performance on multiple

benchmarks including OTB-100, GOT-10k, UAV123 and LaSOT, and outperforms

many SOTA trackers. For more details, please see Section 4.3.

(11) In some places, success is represented as AUC. Need to maintain uniformity.

RESPONSE: Thanks for the suggestion. We revised the manuscript to use the term

success rate uniformly.

(12) Justification is required for fig 9.

RESPONSE: Thanks for suggestions.

Figure 9 shows the inability of our tracker to perform accurate localization in some

complex scenarios, and we discuss it in more detail in Limitations.

... As shown in Fig 9, in complex tracking environments, trackers may suffer from tracking

drift and tracking failure. In some extreme scenarios, the SGAT cannot complete the

target tracking task well. For example, after the 106th frame in the sequence soccer

and the 143rd frame in the sequence bird1, when there are many similar distractors

and long-term occlusion in the scene, the SGAT loses the target, which

leads to tracking failure; this is also an urgent problem faced by existing trackers.

Next, we will focus on the following two aspects: 1) modeling the target using

spatial-temporal context information to ensure that the target can be located when

occlusion occurs; 2) adding a learnable memory unit to alleviate the problem that the

target often disappears in long-term tracking. …

(13) Recent references need to be included.

RESPONSE: Thanks for comments. We added and discussed some key references in

the revised manuscript, for example, Ref. [25], [26], [55], [56].

…In recent years, trackers based on Siamese networks have attracted increasing attention for

their leading performance [25-28]. …

…Fan et al. [55] propose a dual-margin model for accurate and robust visual tracking,

which formulates the target state prediction problem as a dual-margin model

including an intra-object margin and an inter-object margin. Li et al. [56] propose a

thermal infrared tracker based on a hierarchical spatially-aware twin network that

regards the infrared tracking problem as a similarity verification task. …

[25] Hui, Le, et al. 3D Siamese Transformer Network for Single Object Tracking on

Point Clouds. arXiv preprint arXiv:2207.11995. 2022.

[26] Tang F, Ling Q. Ranking-Based Siamese Visual Tracking. IEEE Conference

on Computer Vision and Pattern Recognition. 2022: 8741-8750.

[55] Fan N, Li X, Zhou Z, Liu Q, He Z. Learning dual-margin model for visual

tracking. Neural Networks. 2021: 344-354.

[56] Li X, Liu Q, Fan N, He Z, Wang H. Hierarchical Spatial-aware Siamese Network

for Thermal Infrared Object Tracking. Knowledge-Based Systems. 2019: 71-81.

*****************

Reviewer #2

*****************

(1) In the abstract, the authors should introduce the core idea of the proposed method

and point out the advantages of the method.

RESPONSE: Thanks for the comments.

We added more details in the abstract, including the core ideas of the paper and the

advantages of the proposed approach.

...In this paper, we propose a novel tracking algorithm for feature extraction of target

templates and search region images. Based on convolutional neural networks and

shuffle attention, the tracking algorithm computes the similarity between the template

and a search region through a graph attention matching. The proposed tracking

algorithm exploits the correlations between the spatial and channel-wise information

to highlight the target region. Moreover, the graph matching can greatly alleviate the

influences of appearance variations such as partial occlusions. …

(2) The motivation is not clearly in the introduction. What's the problem of this paper

solved?

RESPONSE: Thanks for pointing this out. We further explain the motivation for

designing our tracking algorithm and describe the problem solved.

a. Most Siamese-based trackers use the features of the last convolution layer or

cascaded multi-layers as the target representations of the template and the search

region, which do not effectively use the structured and part-level information. To

address this problem, we propose to combine the advantages of CNN and shuffle

attention for feature representation of target templates and search region images.

b. Both the cross-correlation and depth-wise cross-correlation take the template features as

a whole for linear matching on the search regions, so that the adjacent sliding

windows produce similar responses. To solve this problem, we introduce a graph

matching approach for similarity learning to mine more structured information.

Our tracking algorithm effectively exploits the structured and part-level information,

which greatly alleviates the influences

of appearance variations such as fast motion and partial occlusions.

For more details, please see the Section 1.

(3) Why the shuffle attention is better than the original channel and spatial attentions

in tracking?

RESPONSE: Thanks.

The original spatial and channel attention modules, e.g., CBAM, do not take full advantage of the

correlation between the spatial and channel dimensions, which makes them less efficient.

Shuffle attention, which divides the features into blocks along the channel dimension, is a

lighter and more efficient way of integrating spatial and channel attention.

(4) There are several Siamese trackers and attentions should be discussed to enrich the

related work, such as Learning dual-margin model for visual tracking, Learning Deep

Multi-Level Similarity for Thermal Infrared Object Tracking, Hierarchical

spatial-aware siamese network for thermal infrared object tracking, and learning

dual-level deep representation for thermal infrared tracking.

RESPONSE: Thanks. We have updated the literature and further enriched the related

work as suggested.

…In recent years, trackers based on Siamese networks have attracted increasing attention for

their leading performance [25-28]. …

…Fan et al. [55] propose a dual-margin model for accurate and robust visual tracking,

which formulates the target state prediction problem as a dual-margin model

including an intra-object margin and an inter-object margin. Li et al. [56] propose a

thermal infrared tracker based on a hierarchical spatially-aware twin network that

regards the infrared tracking problem as a similarity verification task. …

[27] Fan N, Li X, Zhou Z, et al. Learning dual-margin model for visual tracking.

Neural Networks. 2021: 344-354.

[28] Liu Q, Li X, He Z, et al. Learning deep multi-level similarity for thermal infrared

object tracking. IEEE Transactions on Multimedia. 2020: 2114-2126.

[55] Fan N, Li X, Zhou Z, Liu Q, He Z. Learning dual-margin model for visual

tracking. Neural Networks. 2021: 344-354.

[56] Li X, Liu Q, Fan N, He Z, Wang H. Hierarchical Spatial-aware Siamese Network

for Thermal Infrared Object Tracking. Knowledge-Based Systems. 2019: 71-81.

(5) What's the GM in Fig 1. There are missing the introduction about this.

RESPONSE: Thanks for pointing this out. We describe the details of the GM in

Section 3.3.

…we learn a graph attention matching (GM) based similarity measure instead of

cross-correlation. We decompose the target template and search region features into

multiple grids and then compute the similarity between the template and search

region grids, which greatly alleviates the challenge of pose variations of the target. …

... we regard each 1 × 1 × C grid of the feature map as a node. For node i on the template

and node j in the search region, the correlation scores are. …
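
The part-to-part matching described above can be illustrated with a small NumPy sketch. The scoring formula is elided in the quoted text, so everything specific here is an assumption made for illustration: the inner product between linearly projected node features, the softmax over template nodes, the concatenation with the search features, and the names `graph_attention_match`, `w_t` and `w_s` are not taken from the manuscript.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention_match(template, search, w_t, w_s):
    """Match every search node against every template node.

    template: (Ht, Wt, C) feature map, search: (Hs, Ws, C) feature map.
    w_t, w_s: (C, D) projection matrices (hypothetical learnable weights).
    """
    c = template.shape[-1]
    t_nodes = template.reshape(-1, c)   # each 1 x 1 x C grid cell is a node
    s_nodes = search.reshape(-1, c)
    # Assumed score: inner product of projected features, shape (Ns, Nt).
    scores = (s_nodes @ w_s) @ (t_nodes @ w_t).T
    # Normalize over template nodes so each search node gets attention weights.
    attn = softmax(scores, axis=1)
    # Aggregate template information for each search node.
    aggregated = attn @ t_nodes
    fused = np.concatenate([aggregated, s_nodes], axis=1)
    return fused.reshape(search.shape[0], search.shape[1], 2 * c), attn
```

Because every search node attends to all template nodes separately, a partly occluded target can still respond through its visible parts, which is the intuition behind replacing the whole-template cross-correlation.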

(6) How to divide the group in the shuffle attention module? The authors should

explain the reason and conduct an ablation study.

RESPONSE: Thanks for the comments. We have added some sentences to describe

how to divide the groups in the shuffle attention module.

…As shown in Figure 2, the designed deep model effectively exploits the correlations

between the spatial and channel-wise information to highlight the target region without extra

overhead. …

…Channel transformation focuses on 'what' is important in an input image. A

typical channel attention module is the SE block, which can effectively capture the correlation

between channels. However, SE blocks usually increase the number of parameters of

the model, which does not accord with the principle of lightweight design in tracking

tasks. To generate channel weights efficiently, the spatial dimension of an input

feature map is usually compressed, and average pooling is adopted to integrate spatial

information. Based on prior information, we adopt a novel channel transformation

method that resizes the channel-wise block through global average pooling. The

channel-wise block is obtained as follows. …

…As a supplement to channel-wise transformation, spatial transformation aims to

locate 'where' the important regions are. To effectively carry out spatial transformation,

max-pooling and average-pooling are usually used to process the input features along

the channel dimension. In this paper, the specific implementation steps are as follows:

firstly, group normalization (GN) is used to preprocess the spatial features. Then,

linear transformation and activation function are combined to enhance the ability of

feature representation and suppress the interference of background region. The

transformed spatial features are as follows. …

…In the shuffle attention module, we divide the basic features into multiple sub-features

along the channel dimension. The shuffle unit reconstructs each sub-feature by

spatial and channel-wise transformations. Finally, the sub-features are combined by

using the dependence along the channel dimension. …For more details, please see

Sections 3.2 and 4.2.
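
The grouping, the two transformations and the final shuffle can be sketched as below. This is a rough NumPy illustration under stated assumptions, not the authors' implementation: it follows the commonly used shuffle-attention layout in which each group is halved into a channel branch and a spatial branch, the scale and shift parameters (`cw`, `cb`, `sw`, `sb`) are shared by all groups, group normalization is approximated by per-channel normalization, and the final shuffle uses two groups. All names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_shuffle(x, groups):
    """Interleave channels across groups; x has shape (C, H, W)."""
    c, h, w = x.shape
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

def shuffle_attention(x, groups, cw, cb, sw, sb, eps=1e-5):
    """x: (C, H, W) with C divisible by 2 * groups.

    cw, cb, sw, sb: arrays of shape (C // (2 * groups), 1, 1) holding the
    per-channel scale and shift of the channel and spatial branches.
    """
    half = x.shape[0] // (2 * groups)
    outputs = []
    for sub in np.split(x, groups, axis=0):      # sub-features along channels
        x_c, x_s = sub[:half], sub[half:]
        # Channel branch: global average pooling, scale/shift, sigmoid gate.
        pooled = x_c.mean(axis=(1, 2), keepdims=True)
        x_c = x_c * sigmoid(cw * pooled + cb)
        # Spatial branch: normalization (stand-in for GN), scale/shift, sigmoid gate.
        mu = x_s.mean(axis=(1, 2), keepdims=True)
        var = x_s.var(axis=(1, 2), keepdims=True)
        normed = (x_s - mu) / np.sqrt(var + eps)
        x_s = x_s * sigmoid(sw * normed + sb)
        outputs.append(np.concatenate([x_c, x_s], axis=0))
    # Recombine all sub-features and let information flow across groups.
    return channel_shuffle(np.concatenate(outputs, axis=0), 2)
```

The branches add only `4 * C / (2 * groups)` parameters in this sketch, which is consistent with the lightweight design goal mentioned above; the choice of `groups` trades the size of each sub-feature against the number of independent attention units.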

Attachment

Submitted filename: AuthorsReply_Revised.pdf

Click here for additional data file.^{(209KB, pdf)}

PLoS One. doi: 10.1371/journal.pone.0277064.r003

Decision Letter 1

Sathishkumar V E

12 Oct 2022

PONE-D-22-21523R1

SGAT: Shuffle and Graph Attention based Siamese Networks for Visual Tracking

PLOS ONE

Dear Dr. Wang,

Please include a reference to your own previously published work, https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0273690 and clarify the motivations for this manuscript, in light of this previous work.

Please submit your revised manuscript by Nov 25 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Hanna Landenmark

Staff Editor, PLOS ONE

on behalf of

Sathishkumar V E

Academic Editor, PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments (if provided):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: Authors addressed all the comments specified by the reviewer. So, the paper can be accepted at this stage

Reviewer #2: The response solves my doubts.

References [27] and [55] are identical, and a related reference, 'Learning dual-level deep representation for thermal infrared tracking', is missing.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

PLoS One. 2022 Nov 23;17(11):e0277064. doi: 10.1371/journal.pone.0277064.r004

Author response to Decision Letter 1

13 Oct 2022

Paper No.: PONE-D-22-21523R1

Title: SGAT: Shuffle and Graph Attention based Siamese Networks for Visual

Tracking

Authors: Jun Wang, Limin Zhang, Wenshuang Zhang, Yuanyun Wang, Chengzhi

Deng

Dear Editor:

We would like to thank the reviewers and you for your great efforts in helping us to

improve the quality of the paper. After carefully considering the reviewers’ comments

and suggestions, we have revised the paper with some details and references.

A detailed summary of the revisions and some specific comments/responses are given

in the following.

In short, we feel that we have addressed all crucial concerns of the reviewers.

However, if you have any questions or further requirements, please do not hesitate to contact us.

Best regards,

Yuanyun Wang

October 13, 2022

Response and Summary of the Revisions: No.: PONE-D-22-21523R1

*****************

Associate Editor

*****************

(1) Please include a reference to your own previously published work,

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0273690 and clarify

the motivations for this manuscript, in light of this previous work.

RESPONSE: Thanks. We included the previous work [22] in the References.

[22] Yuanyun W, Wenshuang Z, Limin Z, Jun W. Siamese network with a depthwise

over-parameterized convolutional layer for visual tracking. PLOS ONE. 2022;1-21.

We also added some sentences to clarify the motivation of this manuscript and the

difference between this and the previous work.

Different from the previous work [22], we design a novel feature extraction network

based on GoogLeNet to exploit correlations of the spatial and channel-wise

information. Additionally, in order to alleviate the influences of appearance variations,

we use a different similarity computation to obtain more accurate score maps. Inspired

by the above-mentioned works, in this paper, we propose a novel tracking algorithm

based on the shuffle attention mechanism and graph matching in a Siamese network. The

shuffle attention mechanism in the backbone network reconstructs the basic features

extracted from the CNN, and makes the feature representation focus on the regions of

interest through spatial and channel-wise transformations. Different from the

cross-correlation based similarity learning, the part-to-part graph attention matching

further improves the tracking robustness in complex scenes, such as occlusion.

*****************

Reviewer #1

*****************

(1) Authors addressed all the comments specified by the reviewer. So, the paper can

be accepted at this stage.

RESPONSE: Thanks for your comment.

*****************

Reviewer #2

*****************

(1) References [27] and [55] are identical, and a related reference,

'Learning dual-level deep representation for thermal infrared tracking', is missing.

RESPONSE: Thanks for pointing this out. We removed the duplicate reference and have

included the key reference [3] in the Introduction, Page 1. Some details are as follows:

[3] Liu Q, Yuan D, Fan N, et al. Learning dual-level deep representation for thermal

infrared tracking. IEEE Transactions on Multimedia. 2022;1-8.

Visual tracking [1-3] is a fundamental research topic in computer vision. It aims to

estimate target states in subsequent frames given the initial state in the first frame.

It is widely used in various applications, such as video surveillance [4],

human-computer interaction [5], augmented reality [6], and so on. Recently,

Convolutional Neural Networks (CNNs) have been successfully used in visual tracking. Deep

trackers [7,8] achieve robust tracking performance and real-time tracking speed.

However, due to complicated appearance variations, visual tracking is still a

challenging task.

Attachment

Submitted filename: AuthorsReply_Revised.pdf

Click here for additional data file.^{(121.9KB, pdf)}

PLoS One. doi: 10.1371/journal.pone.0277064.r005

Decision Letter 2

Sathishkumar V E

19 Oct 2022

SGAT: Shuffle and Graph Attention based Siamese Networks for Visual Tracking

PONE-D-22-21523R2

Dear Dr. Wang,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Sathishkumar V E

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

**********

6. Review Comments to the Author

Reviewer #1: All the comments have been addressed properly by the authors. The paper can be accepted at this stage.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

**********

PLoS One. doi: 10.1371/journal.pone.0277064.r006

Acceptance letter

Sathishkumar V E

25 Oct 2022

PONE-D-22-21523R2

SGAT: Shuffle and Graph Attention based Siamese Networks for Visual Tracking

Dear Dr. Wang:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Sathishkumar V E

Academic Editor

PLOS ONE

Data Availability Statement

[pone.0277064.ref001] 1.Bertinetto, L, Valmadre J, Henriques JF, Vedaldi A, Torr P. Fully-convolutional siamese networks for object tracking. European conference on computer vision. 2016;850–865.

[pone.0277064.ref002] 2. He A, Luo C, Tian X, Zeng W. A twofold siamese network for real-time object tracking. IEEE conference on computer vision and pattern recognition. 2018;4834–4843. [Google Scholar]

[pone.0277064.ref003] 3. Liu Q, Yuan D, Fan N, et al. Learning dual-level deep representation for thermal infrared tracking. IEEE Transactions on Multimedia. 2022;1–8. doi: 10.1109/TMI.2022.3215547 [DOI] [Google Scholar]

[pone.0277064.ref004] 4.Siebel N T, Maybank S. The advisor visual surveillance system. ECCV 2004 workshop applications of computer vision. 2004;1–8.

[pone.0277064.ref005] 5. Bonin-Font F, Ortiz A, Oliver G. Visual navigation for mobile robots: A survey. Journal of intelligent and robotic systems. 2008;263–296. doi: 10.1007/s10846-008-9235-4 [DOI] [Google Scholar]

[pone.0277064.ref006] 6.Wei J, Ye G, Mullen T, Grundmann M, Ahmadyan A, Hou T. Instant motion tracking and its applications to augmented reality. arXiv preprint arXiv:1907.06796. 2019;1–8.

[pone.0277064.ref007] 7.Zhang Z, Peng H. Deeper and wider siamese networks for real-time visual tracking. IEEE conference on computer vision and pattern recognition. 2019;4591–4600.

[pone.0277064.ref008] 8.Nam H, Han B. Learning Multi-domain Convolutional Neural Networks for Visual Tracking. IEEE conference on computer vision and pattern recognition. 2016;4293–4302.

[pone.0277064.ref009] 9.Qing G, Wei F, et al. Learning dynamic siamese network for visual object tracking. IEEE international conference on computer vision. 2017;1763–1771.

[pone.0277064.ref010] 10. Zeng Y, Zeng B, Yin X, Chen G. SiamPCF: siamese point regression with coarse-fine classification network for visual tracking. Applied Intelligence. 2022;4973–4986. doi: 10.1007/s10489-021-02651-5 [DOI] [Google Scholar]

[pone.0277064.ref011] 11. Cui Z, Lu N. Feature selection accelerated convolutional neural networks for visual tracking. Applied Intelligence. 2021;8230–8244. doi: 10.1007/s10489-021-02234-4 [DOI] [Google Scholar]

[pone.0277064.ref012] 12. Ren S, He K, Girshick R, Sun J. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems. 2015;1–8. [DOI] [PubMed] [Google Scholar]

[pone.0277064.ref013] 13.Li B, Yan J, Wu W, Zhu Z, Hu X. High performance visual tracking with siamese region proposal network. IEEE conference on computer vision and pattern recognition. 2018;8971–8980.

[pone.0277064.ref014] 14.Li B, Wu W, Wang Q, Zhang F, Xing J, Yan J. Siamrpn++: Evolution of siamese visual tracking with very deep networks. IEEE Conference on Computer Vision and Pattern Recognition. 2019;4282–4291.

[pone.0277064.ref015] 15. Wang J, Meng C, Deng C, Wang Y. Learning attention modules for visual tracking. Signal, Image and Video Processing. 2022;1–8. doi: 10.1007/s11760-022-02233-z [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0277064.ref016] 16.Xu Y, Wang Z, Li Z, Yuan Y, Yu G. Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. AAAI Conference on Artificial Intelligence. 2020;12549–12556.

[pone.0277064.ref017] 17.Guo D, Wang J, Cui Y, Wang Z, Chen S. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. IEEE conference on computer vision and pattern recognition. 2020;6269–6277.

[pone.0277064.ref018] 18. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Communications of the ACM. 2017;84–90. doi: 10.1145/3065386 [DOI] [Google Scholar]

[pone.0277064.ref019] 19.He, K, Zhang, X, Ren S and Sun, J. Deep residual learning for image recognition. IEEE conference on computer vision and pattern recognition. 2016;770–778.

[pone.0277064.ref020] 20.Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. IEEE conference on computer vision and pattern recognition. 2016;2818–2826.

[pone.0277064.ref021] 21.Guo D, Shao Y, Cui Y, Wang Z, Zhang L, Shen C. Graph attention tracking. IEEE conference on computer vision and pattern recognition. 2021;9543–9552.

[pone.0277064.ref022] 22. Yuanyun W, Wenshuang Z, Limin Z, Jun W. Siamese network with a depthwise over-parameterized convolutional layer for visual tracking. PLOS ONE. 2022;1–21. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0277064.ref023] 23. Wu Y, Lim J, Yang MH. Object Tracking Benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2015;1834–1848. doi: 10.1109/TPAMI.2014.2388226 [DOI] [PubMed] [Google Scholar]

[pone.0277064.ref024] 24. Huang L, Zhao X, Huang K. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2019;1562–1577. [DOI] [PubMed] [Google Scholar]

[pone.0277064.ref025] 25.Mueller M, Smith N, Ghanem B. A benchmark and simulator for uav tracking. European conference on computer vision. 2016;445–461.

[pone.0277064.ref026] 26.Fan H, Lin L, Yang F, Chu P, Deng G, Yu S, et al. Lasot: A high-quality benchmark for large-scale single object tracking. IEEE conference on computer vision and pattern recognition. 2019;5374–5383.

[pone.0277064.ref027] 27.Hui, Le, et al. 3D Siamese Transformer Network for Single Object Tracking on Point Clouds. arXiv preprint arXiv:2207.11995. 2022.

[pone.0277064.ref028] 28.Tang F, Ling Q. Ranking-Based Siamese Visual Tracking. IEEE Conference on Computer Vision and Pattern Recognition. 2022;8741–8750.

[pone.0277064.ref029] 29. Liu Q, Li X, He Z, et al. Learning deep multi-level similarity for thermal infrared object tracking. IEEE Transactions on Multimedia. 2020;2114–2126. [Google Scholar]

[pone.0277064.ref030] 30.Tao R, Gavves E, Smeulders A W M. Siamese instance search for tracking. IEEE conference on computer vision and pattern recognition. 2016;1420–1429.

[pone.0277064.ref031] 31. Fan N, Li X, Zhou Z, Liu Q, He Z. Learning dual-margin model for visual tracking. Neural Networks. 2021;344–354. doi: 10.1016/j.neunet.2021.04.004 [DOI] [PubMed] [Google Scholar]

[pone.0277064.ref032] 32. Li X, Liu Q, Fan N, He Z, Wang H. Hierarchical Spatial-aware Siamese Network for Thermal Infrared Object Tracking. Knowledge-Based Systems. 2019;71–81. doi: 10.1016/j.knosys.2018.12.011 [DOI] [Google Scholar]

[pone.0277064.ref033] 33.Zhu Z, Wang Q, Li B, Wu W, Yan J, Hu W. Distractor-aware siamese networks for visual object tracking. European conference on computer vision. 2018;101–117.

[pone.0277064.ref034] 34.Fan H, Ling H. Siamese cascaded region proposal networks for real-time visual tracking. IEEE conference on computer vision and pattern recognition. 2019;7952–7961.

[pone.0277064.ref035] 35.Zhang Z, Peng H, Fu J, Li B, Hu W. Ocean: Object-aware anchor-free tracking. European Conference on Computer Vision. 2020;771–787.

[pone.0277064.ref036] 36.Chen Z, Zhong B, Li G, Zhang S, Ji R. Siamese box adaptive network for visual tracking. IEEE conference on computer vision and pattern recognition. 2020;6668–6677.

[pone.0277064.ref037] 37.Li L, Wang K, Li S, Feng X, Zhang L. Lst-net: Learning a convolutional neural network with a learnable sparse transform. European Conference on Computer Vision. 2020;562–579.

[pone.0277064.ref038] 38.Huang G, Liu Z, Van Der Maaten L, Weinberger K Q. Densely connected convolutional networks. IEEE conference on computer vision and pattern recognition. 2017;4700–4708.

[pone.0277064.ref039] 39.Yu F, Koltun V. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.

[pone.0277064.ref040] 40. Chen L, Papandreou G, Kokkinos I, Murphy K, Yuille AL. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence. 2017;834–848. [DOI] [PubMed] [Google Scholar]

[pone.0277064.ref041] 41.Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y. Deformable convolutional networks. IEEE international conference on computer vision. 2017;764–773.

42. Zhu X, Hu H, Lin S, Dai J. Deformable convnets v2: More deformable, better results. IEEE conference on computer vision and pattern recognition. 2019;9308–9316.

43. Woo S, Park J, Lee J, Kweon I. Cbam: Convolutional block attention module. European conference on computer vision. 2018;3–19.

44. Hua X, Wang X, Rui T, Shao F, Wang D. Light-weight UAV object tracking network based on strategy gradient and attention mechanism. Knowledge-Based Systems. 2021;1–8.

45. Chen X, Yan B, Zhu J, Wang D, Yang X, Lu H. Transformer tracking. IEEE Conference on Computer Vision and Pattern Recognition. 2021;8126–8135.

46. Wang N, Zhou W, Wang J, Li H. Transformer meets tracker: Exploiting temporal context for robust visual tracking. IEEE Conference on Computer Vision and Pattern Recognition. 2021;1571–1580.

47. Danelljan M, Bhat G, Khan F, Felsberg M. Atom: Accurate tracking by overlap maximization. IEEE Conference on Computer Vision and Pattern Recognition. 2019;4660–4669.

48. Wang G, Luo C, Xiong Z, Zeng W. Spm-tracker: Series-parallel matching for real-time visual object tracking. IEEE conference on computer vision and pattern recognition. 2019;3643–3652.

49. Dong X, Shen J, Shao L, Porikli F. CLNet: A compact latent network for fast adjusting Siamese trackers. European Conference on Computer Vision. 2020;378–395.

50. Danelljan M, Bhat G, Shahbaz Khan F, Felsberg M. Eco: Efficient convolution operators for tracking. IEEE conference on computer vision and pattern recognition. 2017;6638–6646.

51. Danelljan M, Robinson A, Shahbaz Khan F, Felsberg M. Beyond correlation filters: Learning continuous convolution operators for visual tracking. European conference on computer vision. 2016;472–488.

52. Sauer A, Aljalbout E, Haddadin S. Tracking Holistic Object Representations. arXiv preprint arXiv:1907.12920, 2019.

53. Bhat G, Danelljan M, Van Gool L, Timofte R. Learning discriminative model prediction for tracking. IEEE international conference on computer vision. 2019;6182–6191.

54. Zhang Y, Wang L, Qi J, Wang D, Feng M, Lu H. Structured siamese network for real-time visual tracking. European conference on computer vision. 2018;351–366.

55. Song Y, Ma C, Wu X, Gong L, Bao L, Zuo W, et al. Vital: Visual tracking via adversarial learning. IEEE conference on computer vision and pattern recognition. 2018;8990–8999.

56. Yan B, Zhao H, Wang D, Lu H, Yang X. Skimming-Perusal Tracking: A Framework for Real-Time and Robust Long-Term Tracking. IEEE International Conference on Computer Vision. 2019;2385–2393.

57. Wang Q, Zhang L, Bertinetto L, Hu W, Torr PHS. Fast online object tracking and segmentation: A unifying approach. IEEE conference on Computer Vision and Pattern Recognition. 2019;1328–1338.
