Texture and semantic integrated small objects detection in foggy scenes

Zhengyun Fang; Hongbin Wang; Shilin Li; Yi Hu; Xingbo Han

doi:10.1371/journal.pone.0270356

. 2022 Aug 18;17(8):e0270356. doi: 10.1371/journal.pone.0270356

Texture and semantic integrated small objects detection in foggy scenes

Zhengyun Fang ^1,^#, Hongbin Wang ^2,^3,^*,^#, Shilin Li ^4,^‡, Yi Hu ^2,^3,^‡, Xingbo Han ^2,^3,^‡

Editor: Gulistan Raja⁵

PMCID: PMC9387851 PMID: 35980969

Abstract

In recent years, small objects detection has received extensive attention from scholars for its important value in application. Some effective methods for small objects detection have been proposed. However, the data collected in real scenes are often foggy images, so the models trained with these methods are difficult to extract discriminative object features from such images. In addition, the existing small objects detection algorithms ignore the texture information and high-level semantic information of tiny objects, which limits the improvement of detection performance. Aiming at the above problems, this paper proposes a texture and semantic integrated small objects detection in foggy scenes. The algorithm focuses on extracting discriminative features unaffected by the environment, and obtaining texture information and high-level semantic information of small objects. Specifically, considering the adverse impact of foggy images on recognition performance, a knowledge guidance module is designed, and the discriminative features extracted from clear images by the model are used to guide the network to learn foggy images. Second, the features of high-resolution images and low-resolution images are extracted, and the adversarial learning method is adopted to train the model to give the network the ability to obtain the texture information of tiny objects from low-resolution images. Finally, an attention mechanism is constructed between feature maps of the same scale and different scales to further enrich the high-level semantic information of small objects. A large number of experiments have been conducted on data sets such as “Cityscape to Foggy” and “CoCo”. The mean prediction accuracy (mAP) has reached 46.2% on “Cityscape to Fogg”, and 33.3% on “CoCo”, which fully proves the effectiveness and superiority of the proposed method.

Introduction

Object detection is one of the most important tasks in machine vision. It aims to find the object in images by computer and determine its category and position. With the advent of big data technology and deep learning in recent years, great breakthroughs have been made in object detection tasks [1–6]. However, the data collected in real scenes often contains small-sized objects. This kind of object contains very little information which may be lost after multi-layer convolution, so its discriminative features are difficult to extract, resulting in low model detection performance.

The existing small objects detection algorithms can effectively address the above problems. Such methods can be roughly divided into three categories: small objects detection methods based on multi-scale feature extraction [7–11], high-resolution feature-assisted small objects detection methods [12–16] and small objects detection methods guided by content information [17–21]. These methods focus on how to effectively improve the representational capability of the model to extract the discriminative features of small objects. Nevertheless, due to the influence of weather conditions, the data collected in real world are often foggy images. The quality of such images are often impared, which will cause the model’s characterization ability to decline during small objects detection. This makes it difficult for the above methods to extract the discriminative object features. In addition, the existing small objects detection algorithms ignore the texture information and high-level semantic information of small objects, which partly limits the detection performance. As shown in Fig 1 below, the street view images collected in a foggy scene are low in sharpness and contrast, so the detection model trained directly on such images does not have the ability to extract the discriminant features of the object. It can also be seen from Fig 1 that it is difficult to obtain the texture information of the road indication sign in the low-resolution foggy images. Moreover, the road signs account for a small proportion in the image, which causes the model to lose the high-level semantic information of small objects during the continuous convolution.

Based on the above discussion, It is difficult to extract discriminative object features from the model trained by the existing methods. In addition, the existing small objects detection algorithms ignore the texture information and high-level semantic information of small objects, which limits the improvement of detection performance. So we proposed a knowledge-guided and information-rich model for small objects detection in foggy scenes. The model consists of three modules: knowledge guidance, texture information acquisition, and semantic information enrichment. Among them, the knowledge guidance module focuses on alleviating the adverse effects of foggy environment on detection performance; the texture information acquisition module is responsible for enriching the texture information of small objects in the feature map; the semantic information enrichment module obtains the high-level semantic information of small objects by constructing the attention mechanism between multi-scale feature maps.

To sum up, the contribution of this paper and the advantages of the proposed method are as follows:

In view of the difficulties caused by foggy scenes to the detection task, a knowledge guidance module is designed. It is used to extract the features of the clear image to assist the network to learn the foggy image, so that this model can better extract the discriminative object features.
Our designed a texture information acquisition module, which can enable the network to acquire the texture information of small objects from low-resolution images, and improve the model’s representational ability to a certain extent.
In order to further complete the information of small objects, a semantic information enrichment module is proposed. In this module, a multi-scale feature attention mechanism is constructed to obtain the rich high-level semantic information of small objects. The validity and superiority of the proposed method are fully verified by experiments on two public data sets, “Cityscape to Foggy” and “CoCo”.

Related works

At present, CNN networks are generally used as feature extractors in object detection networks. In order to increase the receptive field, the CNN network will continuously shrink the feature map, which makes the feature information extracted from small objects originally containing very little information less or disappear. It is a huge challenge for detecting small objects in images. To address this problem, some researchers have proposed small objects detection based on multi-scale feature extraction, small objects detection assisted by high-resolution features, and small objects detection guided by content information.

Small objects detection based on multi-scale feature extraction

Most of the existing object detection approaches are for general objects detection. Due to the loss of location information, the detection effect of this kind of method in small objects task has poor performance. In order to obtain the semantic information and position information of the object at the same time, small objects detection methods based on multi-scale feature extraction are proposed. This kind of methods extract features of different-scale images and fuses them to obtain rich object information. For example, Lin et al. [7] designed a top-down feature pyramid network, and merged the high-level semantic information and low-level location information obtained by the network, which significantly improves the performance of small objects detection. Given that deep-level features are difficult to retain the spatial and semantic information of small objects, Liu et al. [8] proposed an image pyramid guidance network and embedded it in each stage of the backbone network, greatly alleviating the information imbalance. In addition to the imbalance between semantic information and spatial information, the imbalance in training process is also a major contributing factor to the low performance of object detection. Thus, Pang et al. [9] proposed IoU sampling, balanced feature pyramid, and balanced L1 loss methods to respectively solve the sample imbalance, feature imbalance and objective function imbalance during the object detection and training.

Some researchers explored how to improve model efficiency while ensuring accuracy. For example, Tan et al. [10] proposed EfficientDet for object detection task. They designed a weighted bi-directional feature pyramid network to integrate multi-scale features more efficiently and quickly, and proposed a hybrid scaling method to uniformly scale the width, depth and resolution of the network. Yet recently, it has been found that the feature pyramid networks can cause serious aliasing effects in the process of fusing feature maps. To solve this problem, Luo et al. [11] proposed an object detection method with enhanced channel information, which introduces a channel attention module to eliminate aliasing. In addition, they also proposed a subpixel jump fusion method, which effectively reduces the information loss in the process of channel reduction. This method only performs channel attention operations on each feature layer separately, and cannot address the problem of high-level semantic information loss of small objects. Considering that small objects have insufficient information in the original image, some researchers applied super-resolution to input image processing, and proposed a small objects detection method assisted by high-resolution features.

Small objects detection assisted by high-resolution features

High-resolution features often contain rich texture information, which can play a positive role in improving the performance of small objects detection tasks. Therefore, some scholars put forward small objects detection methods that extract high-resolution features, and learn low-resolution images through its auxiliary model. Among them, the generative adversarial method can effectively align the features of different levels [12, 13]. Specifically, Li et al. [12] built a generator to map the feature of a small objects into a high-resolution feature. To ensure the accuracy of the mapping, a discriminator was designed to distinguish between high-resolution and low-resolution features. In this way, the network has the ability to extract high-resolution features from low-resolution images. Considering the category scores and position confidence involved in the object detection task, Bai et al. [13] introduced the idea of multi-task discrimination to design a multi-task discriminator. This discriminator can determine the categories and coordinates of the mapping features while distinguishing high and low resolutions. To further improve the super-resolution quality of small objects, Noh et al. [14] used high-resolution object features as a supervisory signal, and designed a generator to generate fine-grained high-resolution features.

For remote sensing images, Ji et al. [15] proposed an end-to-end method to integrate the generation and detection tasks to achieve vehicle detection. The generating networks are often prone to model collapse. To overcome this problem, Shermeyer et al. [16] proposed to use deep convolutional neural network (VDSR) or random forest super-resolution framework to perform super resolution processing on images of different levels and then train the object detection model. However, this approach is time-consuming as it needs to pre-train the super resolution model. The above problems can be addressed by an end-to-end training method. Such a method directly uses the high-resolution features as the supervision information and gives the model the ability to extract high-resolution image features from low-resolution ones by adversarial means.

Small objects detection guided by content information

Content information guidance is a very popular research topic in object detection tasks in recent years. Excellent content information guided small objects detection algorithms have been proposed one after another. For example, Chen et al. [17] proposed two kinds of information at the image level and the relationship level between the objects, and designed a spatial memory network to store this information, and used this information to assist the model for detection in the iterative process. Early objects detection algorithms failed to consider the local and global information of the image at the same time. In response to this problem, Zhu et al. [18] developed the global structure-local area network to extract the global features and local features of the image simultaneously for detection. In this process, the local features are obtained by the RoI pooling of position sensitivity. Once the model can learn the relationship between the objects, the detection effect can be enhanced to a certain extent. Therefore, Hu et al. [19] constructed an object relationship module which introduces the attention mechanism to describe the relationship between different objects.

Compared with general detection tasks, face detection is more challenging. This is mainly because of the small size, blurred quality and partial occlusion of the face in the image. To capture the contextual information in the image, Tang et al. [20] designed a new context anchor to extract high-level context features, and used a low-level feature pyramid network to integrate low-level information with them. To alleviate the mismatch between feature map resolution and receptive field size, Cao et al. [21] proposed a context information extraction module and an attention guidance module. The former is responsible for exploring and acquiring a large amount of contextual information, while the latter is to alleviate redundant context relationships and adaptively capture the dependencies between objects. Nevertheless, these methods are mostly designed for general object detection, and their detection performance on small objects is unsatisfactory. In addition, images acquired in real world are often impacted by weather conditions. For example, foggy images will result in poor image quality, thereby affecting the representational ability of the model.

In order to solve the above problems, this paper proposes a texture and semantic integrated small objects detection in foggy scenes. The model contains three modules: knowledge guidance, texture information acquisition, and semantic information enrichment. The three modules equip the model for small objects detection in foggy scenes.

A texture and semantic integrated small objects detection model in foggy scenes

Overview

The texture and semantic integrated small objects detection algorithm proposed in this paper consists of three modules. They are knowledge guidance, texture information acquisition, and semantic information enrichment, as shown in Fig 2. In the knowledge guidance module, the discriminative features extracted by the network from the clear image are used to guide the model to learn the foggy image. Different from low-resolution images, high-resolution ones often contain texture information of small objects. If the network can also extract features containing texture information from low-resolution images, it can improve the model’s representational ability to a certain extent. To this end, a texture information acquisition module is designed. In this module, the features of high-resolution and low-resolution images are extracted separately. Next, adversarial learning is conducted between the two to align the features of low-resolution images with high-resolution features, so that the network can obtain texture information of small objects from low-resolution images.

In order to make the model trained on a clear image have a good effect in detecting foggy images, a knowledge guidance module is added to enable the feature extractor E to extract discriminative object features. In order to enrich the small objects information on the feature map, the texture information acquisition module is added, and the training feature extractor E can extract the feature map with rich texture information; for the three different scales feature map, because the small-scale and mesoscale features pass through more layers of convolution network, resulting in the loss of small objects semantic information. So the semantic information enrichment module is added to obtain the high-level semantic information of small objects. In this way, although the shallow features are convoluted continuously, the rich semantic information of small objects can still be retained. These three models will be introduced in detail below.

Knowledge guidance module

In practice, many images are collected in foggy scenes. Such images have low contrast and sharpness, so it is difficult for the network to extract discriminative features of the object. To overcome this problem, a knowledge guidance module is proposed. In this module, the discriminative features extracted by the network from the clear image are used to guide the model to learn the foggy image. As shown in Fig 3, it is assumed that the low-resolution image in a foggy scene is $X = {x_{n}}_{n = 1}^{N}$ and the clear low-resolution image is $X_{c} = {x_{n}^{c}}_{n = 1}^{N}$ , where N is the number of images. First, the feature extractor E is applied to extract the features F and F^c of foggy and clear low-resolution images. After that, F and F^c are fed into the Yolo layer to generate features of three scales, respectively denoted by F₁, F₂, F₃ and $F_{1}^{c}$ , $F_{2}^{c}$ , $F_{3}^{c}$ . The dimensions of each scale feature map are respectively $C_{1}^{*} H_{1}^{*} W_{1}$ , $C_{2}^{*} H_{2}^{*} W_{2}$ and $C_{3}^{*} H_{3}^{*} W_{3}$ . It should be pointed out that H_k = W_k(k = 1, 2, 3). Subsequently, B candidate boxes are selected from each pixel of the feature map. For low-resolution images in a foggy scene, objects in the candidate frame are classified to ensure the discriminability of features. The category loss and category confidence loss are as follows:

\begin{matrix} L_{c l s} (E) = - \sum_{i = 0}^{S^{2}} \sum_{j = 0}^{B} I_{i, j}^{o b j} \sum_{c \in c l a s s} p_{i} (c) l o g (\hat{p_{i}} (c)) \end{matrix}

(1)

\begin{matrix} L_{c o n f} (E) = \sum_{i = 0}^{S^{2}} \sum_{j = 0}^{B} I_{i, j}^{o b j} {(c_{i} - \hat{c_{i}})}^{2} + \sum_{i = 0}^{S^{2}} \sum_{j = 0}^{B} I_{i, j}^{n o o b j} {(c_{i} - \hat{c_{i}})}^{2} \end{matrix}

(2)

Here, S is the dimension of the feature map, i.e., H₁,H₂ and H₃. $I_{i, j}^{o b j}$ is the indicator function, which means that the object is in the candidate box j at the position i on the feature map. p_i(c) is the real category label, and $\hat{p_{i}} (c)$ is the predicted result. c_i and $\hat{c_{i}}$ are the confidence of the real and predicted categories, respectively.

In addition to the above two losses, the detection task also needs to determine the coordinates of the object in the picture. The corresponding loss is:

\begin{matrix} L_{c o o r d} (E) = \sum_{i = 0}^{S^{2}} \sum_{j = 0}^{B} I_{i, j}^{o b j} (2 - w_{i} \times h_{i}) [{(x_{i} - \hat{x_{i}})}^{2} + {(y_{i} - \hat{y_{i}})}^{2} + {(h_{i} - \hat{h_{i}})}^{2} + {(w_{i} - \hat{w_{i}})}^{2}] \end{matrix}

(3)

Where x_i,y_i,w_i and h_i respectively represent the center point position and scale in the real box, $\hat{x_{i}}$ , $\hat{y_{i}}$ , $\hat{w_{i}}$ and $\hat{h_{i}}$ are the the center point position and scale in the prediction box.

Therefore, for foggy low-resolution images, the detection loss is:

\begin{matrix} L_{f} (E) = \sum_{F_{1}, F_{2}, F_{3}} L_{c l s} + L_{conf} + L_{coord} \end{matrix}

(4)

Similarly, to ensure the discriminability of clear image features, loss constraint is performed on $F_{1}^{c}$ , $F_{2}^{c}$ and $F_{3}^{c}$ . There is:

\begin{matrix} L_{c} (E) = \sum_{F_{1}^{c}, F_{2}^{c}, F_{3}^{c}} L_{c l s} + L_{conf} + L_{coord} \end{matrix}

(5)

The features extracted by the network from clear images tend to be more discriminative. In order to give the model the ability to extract discriminative features from foggy images, a knowledge guidance mechanism is designed. In this mechanism, the strategy of knowledge distillation is adopted to promote the self-learning of the feature encoder E. There is:

\begin{matrix} L_{g} (E) = K L (F, F^{c}) \end{matrix}

(6)

Where KL is the KL divergence, which can be expressed as:

\begin{matrix} K L (F, F^{c}) = \sum_{r = 1}^{R} [p (F_{r}) log p (F_{r}) - p (F_{r}) log p (F_{r}^{c})] \end{matrix}

(7)

p(•) means to convert the feature map to a vector and use softmax function to process it, and R is the number of pixels in the feature map.

In order to make feature extractor E extracts deep features have stronger position representation; we designed a semantic position information fusion attention module, as shown in Fig 3. In this module, the third layer feature F_L extracted by the feature extractor is convoluted by 3 * 3 to generate the feature $F_{L}^{'}$ , which is consistent with the size of the ninth layer feature. Because the third layer features pass through few convolution layers and have rich location information, the spatial attention operation is carried out on $F_{L}^{'}$ to generate image feature $F_{L A}^{'}$ , so that the network can pay attention to the interest region in the third layer image feature. Because the ninth layer features undergo multi-layer convolution, the semantic information of the feature map is very rich. Channel attention operation is performed on F_H to generate the feature map F_HA, so that the network can pay attention to the interest semantic information. In order to enrich the position representation of F_HA, $F_{L A}^{'}$ and F_HA are spliced and fused, and then through 1 * 1 convolution, F_HA obtains the position information in $F_{L A}^{'}$ . The Eq (8) is as follows:

\begin{matrix} F_{f} = C O N V (C a t (F_{L}^{'} ⊙ F_{L A}^{'}, F_{H} ⊙ F_{H A})) \end{matrix}

(8)

Where, ⊙ represents element level multiplication, Cat represents splicing in channel dimension, CONV represents 1 * 1 convolution. Through this model processing, the feature extractor E can extract deep features with rich location information from the foggy image.

Texture information acquisition module

As discussed earlier, high-resolution images often contain the texture information of the object. If the network can also obtain such information in low-resolution images, the model’s characterization ability can be further improved. To achieve this goal, a texture information acquisition module is designed. This module contains two feature extractors E and E_h, and one discriminator D, as shown in Fig 4. Among them, the function of E is consistent with that in the knowledge guidance module, and it is used to extract the features of low-resolution foggy images; E_h is used to extract the features of high-resolution images in foggy scenes; the discriminator judges whether the extracted features are from high-resolution images or low-resolution images. For example, given a high-resolution image $X_{h} = {x_{i}^{h}}_{i = 1}^{N}$ , the feature extractor E_h is applied to extract the feature map F^h, and then adversarial learning is adopted to constrain the feature extractor E and the discriminator D.

\begin{matrix} L_{w} (E, D) = E_{x_{h} - p_{data (x_{ℏ})}} [log D (F^{h})] + E_{x - p_{data(x)}} [log (1 - D (E (x)))] \end{matrix}

(9)

\begin{matrix} L_{w e} (E, D) = E_{x_{h} \sim p_{data (X_{h})}} [log (1 - D (F^{h}))] + E_{x \sim p_{data (x)}} [log (D (E (x))] \end{matrix}

(10)

Here,E(x) is the feature extracted by feature extractor E from image x.

After the above operations, the feature extractor now has the ability to extract high-resolution features, enriching the texture information of the small objects.

Semantic information enrichment module

Feature maps of different scales contain different types of object information, and continuous convolution operations will cause the loss of high-level semantic information of small objects. To overcome this problem, this paper designs a semantic information enrichment module to obtain the high-level semantic information of small objects. As shown in Fig 5, the feature extractor E extracts the features F₁, F₂ and F₃ of low-resolution foggy images. In order to retain the key information of the object in the feature map, a self-attention mechanism is constructed on the feature map of each scale. Specifically, the feature maps of each scale are processed by resizing to change their dimensions to C₁*(H₁ W₁),C₂*(H₂ W₂) and C₃*(H₃ W₃) Subsequently, the feature maps are processed by transposition, multiplication, and normalization to get the attention matrices:

\begin{matrix} A_{1} = softmax (F_{1}^{T} F_{1}) \end{matrix}

(11)

\begin{matrix} A_{2} = softmax (F_{22}^{' T} F_{22}^{'}) \end{matrix}

(12)

Next, by transposing and multiplying the feature map of each scale with the corresponding attention matrix, we get:

\begin{matrix} F_{2}^{'} = F_{2} A_{1}^{T} \end{matrix}

(13)

\begin{matrix} F_{3}^{'} = F_{3} A_{2}^{T} \end{matrix}

(14)

$F_{1}^{'}$ , $F_{2}^{'}$ and $F_{3}^{'}$ are the feature maps after self-attention operation, and such feature maps contain the key information of the object on feature maps of each scale.

Unlike $F_{1}^{'}$ , the scales of $F_{2}^{'}$ and $F_{3}^{'}$ are small owing to continuous convolution operations, so the high-level semantic information of small objects is often lost. To retain this information, a feature fusion mechanism is constructed on feature maps of different scales. To ensure that $F_{2}^{'}$ contains the information of small objects in $F_{1}^{'}$ , the features of the two scales are fused to get:

\begin{matrix} A_{12} = F_{1} \oplus F_{2}^{'} \end{matrix}

(15)

Where A₁₂ is the feature after fusion of F₁ and $F_{2}^{'}$ , and its dimension is C₁*H₁*W₁. This operation introduces the key object information in $F_{1}^{'}$ to $F_{2}^{'}$ . On this basis, the fused features are subjected to up-sampling and convolution to get:

\begin{matrix} F_{22}^{'} = Conv (Upsample (A_{12})) \end{matrix}

(16)

Where the up-sampling multiple is 2, and conv is 1*1 convolution. By this way, the small objects information of feature map $F_{1}^{'}$ is introduced into $F_{22}^{'}$ . It should be pointed out that $F_{22}^{'}$ and F₂ have the same dimension of C₂*H₂*W₂. Similarly, to make $F_{3}^{'}$ contain the information in $F_{22}^{'}$ , the above operations are performed on $F_{22}^{'}$ . The first is to calculate the self-attention map of $F_{22}^{'}$ and then to obtain the channel attention matrix $A_{23} = F_{22}^{'} \oplus F_{3}^{'}$ . Subsequently, the fused feature A₂₃ are subjected to up-sampling and convolution to get $F_{33}^{'} = Conv (Upsample (A_{23}))$ . This feature map has the same dimension of $F_{3}^{'}$ .

After the above operations, the deep feature maps $F_{22}^{'}$ and $F_{33}^{'}$ can contain the small objects information in the feature map $F_{1}^{'}$ . Then category loss, confidence loss and coordinate position loss are performed on the obtained feature maps $F_{1}^{'}$ , $F_{22}^{'}$ and $F_{33}^{'}$ to get:

\begin{matrix} L_{s} (E) = \sum_{F_{1}^{'}, F_{22}, F_{33}^{'}} L_{c l s} + L_{conf} + L_{coord} \end{matrix}

(17)

Total loss

In summary, the total loss of the small objects detection algorithm proposed in this paper is:

\begin{matrix} L_{total} (E, E_{c}, E_{h}, D) = L_{f} (E) + L_{c} (E_{c}) + λ_{1} L_{g} (E, E_{c}) + λ_{2} L_{w} (E, D) + L_{w e} (E, D) \end{matrix}

(18)

Where λ₁ and λ₂ are hyper-parameters, which are used to weigh the proportions of the loss items L_g(E, E_c) and L_w(E, D) in the total loss.

Algorithm 1 the proposed algorithm.

Definition: two feature encoders E_h and E; one discriminator D

Input:

Labeled high-resolution clear image $X_{h} = {x_{i}^{h}}_{i = 1}^{N}$ , labeled low-resolution clear image $X_{c} = {x_{n}^{c}}_{n = 1}^{N}$ , label $P = {p_{n}^{c}}_{n = 1}^{N_{c}}$ , and unlabeled low-resolution foggy image $X = {x_{n}}_{n = 1}^{N}$

Output:

optimal model E*

Image preprocessing:

Scale the input image of any size to 608*608*3

Optimize:

for each k ∈ [1, n] do:

(a) Input image $X_{h} = {x_{i}^{h}}_{i = 1}^{N}$ into E_h feature $F_{i}^{h}$ and input images $X = {x_{n}}_{n = 1}^{N}$ and $X_{c} = {x_{n}^{c}}_{n = 1}^{N}$ into E to get features F_i and $F_{i}^{c}$ ;

(b) Input features $F_{i}^{h}$ and $F_{i}^{c}$ into discriminator D and then train E_h and D with formulas (9) and (10);

(d) Input feature F_i into Yolo layer and train E with formula (4);

end for

Experiments

Data set and evaluation indicators

In order to verify the effectiveness and superiority of the proposed algorithm, a large number of experiments were carried out on two available large-scale data sets, i.e. “Cityscape to Foggy” [22] and “CoCo” [23], and they were compared with some of the most advanced methods. “Cityscape to Foggy” is mainly used for semantic segmentation tasks. Cityscape dataset had 3457 images, including 2965 images in the training set and 492 in the test set; FoggyCityscape dataset had 3457 images, of which 2965 were in the training set and 492 in the test set. Therefore, eight categories of data-“Bus”, “Bicycle”, “Car”, “Motorcycle”, “Person”, “Rider”, “Train” and “Truck” were singled out from the “Cityscape to Foggy” dataset, and their labels were converted into formats suitable for object detection tasks. By contrast, the “CoCo” data set had 123287 fog-free images, of which 118287 in the training set and 5000 were in the test set. To verify that the algorithm proposed in this paper can effectively detect small objects in foggy scenes, the foggy algorithm [24] was applied to fog the COCO data set, and 123287 foggy images (118287 in the training set and 5000 in the test set) were obtained. All experiments in this paper used average prediction accuracy (mAP) as the evaluation index, and the formula is shown in Eq (19).

\begin{matrix} m A P = \frac{1}{C} \sum_{c} \sum_{k = 1}^{N} max_{\tilde{k} \geq k} P (\tilde{k}) Δ r (k) \end{matrix}

(19)

Where C is the total number of categories, N the sample size in one category, ${max}_{\tilde{k} \geq k} P (\tilde{k})$ is the accuracy of the maximum confidence greater than the confidence threshold, Δr(k) is the change in the recall rate when the sample size changes from k − 1 to k. On the coco data set, the following evaluation indicators are derived based on mAP: mAP₅₀ represents the average prediction accuracy of 0.5 for the intersection ratio of the target’s prediction frame and its real frame; mAP₇₅ represents the average prediction accuracy of 0.75 for the intersection ratio of the target’s prediction frame and its real frame; mAP_S means the average prediction accuracy of the target area less than 32*32 pixels; mAP_M means the average prediction accuracy of the target area greater than 32*32 but less than 96*96 pixels; mAP_L means the average prediction accuracy of the target area larger than 96*96 pixels.

Experimental details

The experiment was performed on a GTX 3090Ti graphics card. The feature encoders in this paper all used the Darknet-53 network as the feature extractors. E,E_c and E_h did not share parameters with each other. The discriminator D consists of three convolution-pooling-normalization blocks and one fully connected layer. During the experiment, the images were preprocessed by random flip, random fill, and random crop, and the size of the image was uniformly scaled to 608608. The model was trained for 100 generations, the batch size was 2, and the initial learning rate was 0.007. The learning rate adjustment algorithm adopted the cosine annealing algorithm.

In the test, only the feature encoder E was used to extract the features of low-resolution images in foggy scenes for prediction.

Method comparison

In this section, the proposed algorithm is compared with some of the most advanced methods [25–32] on “Cityscape to Foggy” and “CoCo” data sets to show its advantages. On “Cityscape to Foggy” the detection performance of eight categories of “Bus”, “Bicycle”, “Car”, “Motorcycle“, “Person”, “Rider”, “Train” and “Truck” are compared. The results are listed in Table 1.

Table 1. Experimental results of different methods on “Cityscape to Foggy” data set.

Methods	mAP
CPM-R-CNN [25]	25.8
Libra R-CNN [26]	31.9
PPA [27]	34.0
Vfnet [28]	31.2
Paa [29]	26.5
Gfl [30]	30.7
Sabl [31]	23.8
CST-DA [32]	32.1
Ours	46.2

Open in a new tab

It can be seen from Table 1 that compared with other object detection algorithms, the algorithm proposed in this paper performs the best under foggy conditions. The proposed algorithm is also compared with the state-of-the-art methods on the foggy “CoCo” data set, as shown in Table 2. It can be seen that the algorithm proposed has the best performance.

Table 2. Experimental results of different methods on the foggy “CoCo” data set.

Methods	mAP	mAP ₅₀	mAP ₇₅	mAP _S	mAP _M	mAP _L
Libra R-CNN [26]	30.5	38.6	32.0	14.0	35.7	42.1
PPA [27]	32.7	41.9	36.2	14.0	35.8	48.2
Vfnet [28]	32.9	48.3	35.2	14.3	36.0	48.0
Paa [29]	32.6	48.1	34.7	11.9	35.4	51.1
Gfl [30]	28.3	43.3	29.9	10.6	30.4	44.1
Sabl [31]	29.8	45.7	31.4	10.5	32.6	46.3
Ours	33.3	49.9	35.9	14.4	38.2	47.7

Open in a new tab

Ablation experiment

In order to prove the effectiveness of each module in the proposed algorithm, a series of ablation experiments were carried out. In this process, the model with all three modules removed was used as the benchmark, that is, only the detection loss constraint model for low-resolution images in foggy scenes was used. As shown in Table 3, the model using only benchmark training achieved an average prediction accuracy of 39.4% on “Cityscape to Foggy”.

Table 3. Ablation experimental results of the model on the “Cityscape to Foggy” dataset.

-	Cityscape to Foggy
Methods	mAP
Benchmark	39.4
Benchmark + knowledge guidance	44.1
Benchmark + semantic information enrichment	39.6
Benchmark + texture information acquisition	40.9
Benchmark + semantic information enrichment + texture information acquisition	42.8
Benchmark + knowledge guidance + texture information acquisition	44.7
Benchmark + knowledge guidance + semantic information enrichment	44.3
Benchmark + knowledge guidance + texture information acquisition + semantic information enrichment(ours)	46.2

Open in a new tab

Knowledge guidance module: It can be seen from Table 3 that when the knowledge guidance module is added, the performance is greatly improved compared with the benchmark, by a mean predication accuracy of 4.7% on “Cityscape to Foggy”. It indicates that the proposed knowledge guidance module can effectively alleviate the adverse effects of foggy environment on recognition performance.

Texture information acquisition module: When the texture information acquisition module is added to the model, the detection performance is further improved. The mean prediction accuracy on the “Cityscape to Foggy” dataset increases from 39.4% to 40.9%. This is mainly because the proposed texture information acquisition module enriches the texture information in the features and improves the representation ability of the model.

Semantic information enrichment module: The loss of high-level semantic information is partly to blame for the limited detection performance. Thus, when the semantic information enrichment module is added to the model, the detection performance is improved. The mean prediction accuracy on “Cityscape to Foggy” dataset increases from 39.4% to 39.6%. It is therefore can be concluded that the proposed semantic information enrichment module plays a positive role in enhancing detection performance.

It can be seen from Table 3 that after considering the three modules Knowledge guidance module, Texture information acquisition module and Semantic information enrichment module at the same time, the model achieves a better performance. The mean prediction accuracy on the “Cityscape to Foggy” dataset reached 46.2%.

Parameter analysis

In the algorithm proposed in this paper, λ₁ and λ₂ are hyper-parameters, which are used to weigh the proportions of loss terms L_g(E, E_c) and L_w(E, D) in the total loss respectively. To demonstrate that the parameter value is the optimal, the parameter analysis experiment was carried out, and the experimental results are shown in Fig 6 below. It should be noted that when one parameter is analyzed, the other parameter is fixed.

The impact of λ₁ on model performance: Parameter λ₁ is used to control the relative weight of the loss term L_g(E, E_c) in the total loss. As shown in Fig 6(a), when λ₁ ranges within [0.6,1.0], mAP is on the rise; it reaches the maximum when λ₁ = 1.0 and starts to decline thereafter. This indicates that the optimal value of λ₁ is 1.0.

The impact of λ₂ on model performance: Parameter λ₂ is used to control the relative weight of L_w(E, D). It can be observed from Fig 6(b) that as λ₂ rises from 0.6 to 0.8, the detection performance shows a negative growth, and varies greatly when λ₂ ranges from 0.6 to 1.6. At λ₂ = 1.4, mAP reaches its maximum. So the optimal value of λ₂ is 1.4.

Visual analysis

In order to better prove the effectiveness of this method, this paper uses yolov5 method and this method to carry out visual display on the cityscape to foggy dataset, as shown in Fig 7.

As can be seen from Fig 7, this method can not only well detect targets in foggy scenes, but also detect small targets in foggy scenes. It can be proved that this method not only solves the problem of domain migration between different data sets, but also can effectively extract the information of small targets in shallow features and improve the ability of model to detect small targets.

Conclusions

In this paper, we study the small objects detection task in foggy scene. It is difficult to extract discriminative object features from the model trained by the existing methods. In addition, the existing small objects detection algorithms ignore the texture information and high-level semantic information of small objects, which limits the improvement of detection performance. Aiming at the above problems, this paper proposes a Texture and semantic integrated small objects detection algorithm in foggy scenes. Specifically, the discriminant features extracted by the model from clear images are used to guide the model to learn, which alleviates the negative impact of foggy images on the detection performance. Given that the existing algorithms are difficult to obtain the texture information and high-level semantic information of small objects, the adversarial learning strategy is adopted to give the network the ability to obtain the texture information of small objects from low-resolution images. At the same time, a multi-scale feature map attention mechanism is constructed to further enrich the high-level semantic information of small objects. The effectiveness and superiority of the method proposed in this paper have been fully verified by a series of experiments.

Supporting information

S1 Dataset

(DOCX)

Click here for additional data file.^{(16.8KB, docx)}

Data Availability

A large number of experiments were carried out on two available large-scale data sets, i.e. "Cityscape to Foggy"(Hahner. M, Dai. D, Sakaridis. C, Zaech. J and Gool. L. V. Semantic Understanding of Foggy Scenes with Purely Synthetic Data[C]. IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 2019:3675-3681.) and "CoCo"(Lin. T, Maire. M, Belongie. S, Hays. J, Perona. P, Ramanan. D, Dollar. P and Zitnick. C. Microsoft COCO: Common Objects in Context[C]. European Conference on Computer Vision (ECCV), Zurich, Switzerland, 2014: 740-755.). The datasets can be downloaded from the below links: Cityscape to Foggy Datasets: https://people.ee.ethz.ch/~csakarid/SFSU_synthetic/ CoCo Datasets https://cocodataset.org/#home.

Funding Statement

The author(s) received no specific funding for this work.

References

1.Chenchen Zhu, Yihui He, Marios Savvides. Feature Selective Anchor-Free Module for Single-Shot Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, 2019. p.840–849.
2.Redmon. J, Divvala. S, Girshick. R and Farhadi. A. You Only Look Once: Unified, Real-Time Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, 2016. p.779–788.
3.Liu. W, Anguelov. D, Erhan. D, Szegedy. C, Reed S, Fu. C, et al. SSD: Single Shot MultiBox Detector. In: European Conference On Computer Vision (ECCV), Amsterdam, The Netherlands, 2016: 21–37.
4. Ren S, He K, Girshick R and Sun J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6). p.1137–1149. doi: 10.1109/TPAMI.2016.2577031 [DOI] [PubMed] [Google Scholar]
5.He. K, Gkioxari. G, Dollar. P and Girshick. R. Mask R-CNN.? In: IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017. p.2980–2988.
6.Redmon. J and Farhadi. A. YOLO9000: Better, Faster, Stronger. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, USA, 2017. p.6517–6525.
7.Lin. T, Dollar. P, Girshick. R, He. K, Hariharan. B and Belongie. S. Feature Pyramid Networks for Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, USA, 2017. p.936–944.
8.Liu. Z, Gao. G, Sun. L and Fang. L. IPG-Net: Image Pyramid Guidance Network for Small Object Detection.? In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, USA, 2020. p.4422–4430.
9.Pang. J, Chen. K, Shi. J, Feng. H, Ouyang. W and Lin. D. IPG-Net: Image Pyramid Guidance Network for Small Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, USA, 2020. p.4422–4430.
10.Tan. M, Pang. R and Le. Q. EfficientDet: Scalable and Efficient Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, 2020. p.10778–10787.
11.Luo. Y, Cao. X, Zhang. J, Cao. X, Guo. J, Shen. H, et al. CE-FPN: Enhancing Channel Information for Object Detection. arXiv preprint arXiv:2103.10643,2021.
12.Li. J, Liang. X, Wei. Y, Xu. T, Feng. J and Yan. S. Perceptual Generative Adversarial Networks for Small Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, USA, 2017. p.1951–1959.
13.Bai. Y, Zhang. Y, Ding. M and Ghanem. B. SOD-MTGAN: Small Object Detection via Multi-Task Generative Adversarial Network. In: European Conference on Computer Vision (ECCV), Munich, Germany, 2018. p.210–226.
14.Noh. J, Bae. W, Lee. Seo. J and Kim. G. Better to Follow, Follow to Be Better: Towards Precise Supervision of Feature Super-Resolution for Small Object Detection. In: IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 2019. p.9724–9733.
15. Ji H, Gao Z, Mei T and Ramesh B. Vehicle Detection in Remote Sensing Images Leveraging on Simultaneous Super-Resolution. IEEE Geoscience and Remote Sensing Letters, 2020, 17(4):676–680. doi: 10.1109/LGRS.2019.2930308 [DOI] [Google Scholar]
16.Shermeyer. J and Etten. A. The Effects of Super-Resolution on Object Detection Performance in Satellite Imagery. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, USA, 2019. p.1432–1441.
17.Chen. X and Gupta. A. Spatial Memory for Context Reasoning in Object Detection. In: IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017. p.4106–4116.
18.Zhu. Y, Zhao. C, Wang. J, Zhao. X, Wu. Y and Lu. H. CoupleNet: Coupling Global Structure with Local Parts for Object Detection. In: IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017. p.4146–4154.
19.Hu. H, Gu. J, Zhang. Z, Dai. J and Wei. Y. Relation Networks for Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, USA, 2018. p.3588–3597.
20.Tang. X, Du. D, He. Z and Liu. J. PyramidBox: A Context-assisted Single Shot Face Detector. In: European Conference on Computer Vision (ECCV), Munich, Germany, 2018. p.797–813.
21.Cao. J, Chen. Q, Guo. J and Shi. R. Attention-guided Context Feature Pyramid Network for Object Detection. arXiv preprint arXiv:2005.11475, 2020.
22.Hahner. M, Dai. D, Sakaridis. C, Zaech. J and Gool. L. V. Semantic Understanding of Foggy Scenes with Purely Synthetic Data. IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 2019. p.3675–3681.
23.Lin. T, Maire. M, Belongie. S, Hays. J, Perona. P, Ramanan. D, et al. Microsoft COCO: Common Objects in Context. In: European Conference on Computer Vision (ECCV), Zurich, Switzerland, 2014. p.740–755.
24. He K, Sun J and Tang X. Single Image Haze Removal Using Dark Channel Prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(12):2341–2353. doi: 10.1109/TPAMI.2010.168 [DOI] [PubMed] [Google Scholar]
25.Bin Zhu, Qing Song, Lu Yang, Zhihui Wang, Chun Liu and Mengjie Hu. R-CNN: Calibrating Point-Guided Misalignment in Object Detection. In: WACV, 2021.
26.Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang and Dahua Lin. Libra R-CNN: Towards Balanced Learning for Object Detection. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. p.821–830.
27.Kim, Kang and Lee, Hee Seok. Probabilistic Anchor Assignment with IoU Prediction for Object Detection. In: ECCV 2020. Lecture Notes in Computer Science, 2020, 12370: 355–371.
28.Haoyang Zhang, Ying Wang, Fears Dayoub and Niko Sunderhauf. VarifocalNet: An IoU-aware Dense Object Detector. In: CVPR, 2021.
29.Kang Kim1, Hee Seok Lee. Probabilistic Anchor Assignment with IoU Prediction for Object Detection. In: ECCV.2020.
30.Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, et al. Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection. In: NeurIPS, 2020.
31.Jiaqi Wang, Wenwei Zhang, Yuhang Cao, Kai Chen, J iangmiao Pang, Tao Gong, et al. Side-Aware Boundary Localization for More Precise Object Detection. In: ECCV 2020. Lecture Notes in Computer Science, 2020, 12349:403–419.
32.Zhao G, Li G, Xu R, et al. Collaborative Training between Region Proposal Localization and Classification for Domain Adaptive Object Detection. In: ECCV 2020. Lecture Notes in Computer Science, 2020, 12363:86–102.

PLoS One. doi: 10.1371/journal.pone.0270356.r001

Decision Letter 0

Gulistan Raja

3 Nov 2021

PONE-D-21-32064Knowledge-guided and information-rich small target detection in foggy scenesPLOS ONE

Dear Dr. Wang,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. As you will infer from below that there was a disagreement between the reviewers regarding enthusiasm for this work. Reviewer 3 was of the view that manuscript partly describes a technically sound piece of work and recommended reject. On the other hand, reviewer 1 and 2 had made certain observations to improve your work and recommended minor revisions. After considering comments of reviewers, the editors decision is "major revision". Please incorporate all comments raised by reviewers.

Please submit your revised manuscript by Dec 18 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Gulistan Raja

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please ensure that you refer to Figure 6 in your text as, if accepted, production will need this reference to link the reader to the figure.

3. We note that Figures 1 and 2 in your submission contain [map/satellite] images which may be copyrighted. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For these reasons, we cannot publish previously copyrighted maps or satellite images created using proprietary data, such as Google software (Google Maps, Street View, and Earth). For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

a. You may seek permission from the original copyright holder of Figures 1 and 2 to publish the content specifically under the CC BY 4.0 license.

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

The following resources for replacing copyrighted map figures may be helpful:

USGS National Map Viewer (public domain): http://viewer.nationalmap.gov/viewer/

The Gateway to Astronaut Photography of Earth (public domain): http://eol.jsc.nasa.gov/sseop/clickmap/

Maps at the CIA (public domain): https://www.cia.gov/library/publications/the-world-factbook/index.html and https://www.cia.gov/library/publications/cia-maps-publications/index.html

NASA Earth Observatory (public domain): http://earthobservatory.nasa.gov/

Landsat: http://landsat.visibleearth.nasa.gov/

USGS EROS (Earth Resources Observatory and Science (EROS) Center) (public domain): http://eros.usgs.gov/#

Natural Earth (public domain): http://www.naturalearthdata.com/

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript titled “Knowledge-guided and information-rich small target detection in foggy scenes” combines texture information and semantic information to detect small target in the foggy images. The work is interesting and the result support the conclusion well. However there are some flaws necessary to be polished to improve the quality of this manuscript. The novelty of the research is not outstandingly presented; the English language should be improved either.

Reviewer #2: In general, this paper is well written, but I have to point out that there are still some grammar and word mistake.

1.The author should polish the language of the paper.

2.It is suggested that the paper should be revised in strict accordance with the required format. For example: the position marked above the author's name and affiliations; each paragraph is indented and so on.

3.The recommended font must meet the requirements for the template.

4.The main motivation and contribution should be clearly highlighted in the introduction section.

5.The conclusion of the article is too simple, it should be described more fully to illustrate the feasibility and advantages of the method in this article.

6.It is recommended that the requirements of references and templates be strictly unified.

7.The naming format of the diagram is checked against the template.

8.As shown in Figure 2, the description of the algorithm should not be in the form of screenshots. It is recommended to replace it with Table.

9.The arrow flow of each diagram is recommended to be described in colloquial language for easy reading.

10.It is recommended to specify the color of the arrow and the meaning of the dotted and solid lines in each diagram.

11.The content and conclusion of the study are too simple, so I suggest to enrich the content of you work.

In order to make this article more fluent and easier to understand, I suggest the authors consider about those above eleven problems.

Reviewer #3: (1) Consider improving and adding a more suitable knowledge distillation model structure for application. The part of the ablation experiment is just the split and combination of each module, and more adequate experiments can be considered.

(2) Two feature extractors E are used to obtain Eh. Is it redundant? Can it be replaced by a general feature extractor, with two different resolutions as input, so that E and D confront each other;

(3) The algorithm flow is not clear enough, the algorithm format is wrong, it should not be a screenshot;

(4) English translation has grammatical errors or has not been investigated, such as target detection .

(5) There is a lack of detailed process description below the paper frame diagram.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: Review Report.docx

Click here for additional data file.^{(14KB, docx)}

Attachment

Submitted filename: renamed_39106.pdf

Click here for additional data file.^{(74.4KB, pdf)}

PLoS One. 2022 Aug 18;17(8):e0270356. doi: 10.1371/journal.pone.0270356.r002

Author response to Decision Letter 0

23 Mar 2022

Thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments.

Journal Requirements:

Case1: Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming.

Response: We revised this manuscript accordance with the PLOS ONE's style requirements.

Case2: Please ensure that you refer to Figure 6 in your text as, if accepted, production will need this reference to link the reader to the figure.

Response: We revised Figure 6 according to the algorithm’ style requirements, such as algorithm 1.

Case3: We note that Figures 1 and 2 in your submission contain [map/satellite] images which may be copyrighted.

Response: We revised them. Such as Figures 1 and 2. The pictures in Figure 1 and Figure 2 are taken by us, and there is no copyright problem.

Reviewer #1:

Case1: The manuscript titled “Knowledge-guided and information-rich small target detection in foggy scenes” combines texture information and semantic information to detect small target in the foggy images. The work is interesting and the result support the conclusion well. However there are some flaws necessary to be polished to improve the quality of this manuscript. The novelty of the research is not outstandingly presented; the English language should be improved either.

Response: The quality of this manuscript has been improved, the existing English language problems have been improved, and the novelty of the research is also improved.

Reviewer #2:

Case1: The author should polish the language of the paper.

Response: The existing English language problems have been improved.

Case2: It is suggested that the paper should be revised in strict accordance with the required format. For example: the position marked above the author's name and affiliations; each paragraph is indented and so on.

Response: We revised this manuscript accordance with the required format.

Case3: The recommended font must meet the requirements for the template.

Response: I rearranged the typesetting according to the PLOS ONE's Latex template.

Case4: The main motivation and contribution should be clearly highlighted in the introduction section.

Response: We revised this manuscript in the introduction section.

Case5: The conclusion of the article is too simple, it should be described more fully to illustrate the feasibility and advantages of the method in this article.

Response: We supplemented the conclusion of the article.

Case6: It is recommended that the requirements of references and templates be strictly unified.

Response: We revised it according to the requirements of references and templates.

Case7: The naming format of the diagram is checked against the template.

Response: We revised it.

Case8: As shown in Figure 2, the description of the algorithm should not be in the form of screenshots. It is recommended to replace it with Table.

Response: We revised it according to the algorithm’ style requirements, such as algorithm 1.

Case9: The arrow flow of each diagram is recommended to be described in colloquial language for easy reading.

Response: Thank you for your suggestions. In the revision manuscript, we have added arrow examples to facilitate readers' reading.

Case10: It is recommended to specify the color of the arrow and the meaning of the dotted and solid lines in each diagram.

Response: Thank you for your suggestions. In the revision manuscript, we have added arrows of different colors and the meaning of points.

Case11: The content and conclusion of the study are too simple, so I suggest to enrich the content of you work.

Response: Thank you very much for your suggestions. We added the semantic location information fusion attention module to make the deep features extracted by the feature extractor have stronger location representation, so as to assist the semantic information enhancement module and enrich the location information of large-scale features.

Reviewer #3:

Case1: Consider improving and adding a more suitable knowledge distillation model structure for application. The part of the ablation experiment is just the split and combination of each module, and more adequate experiments can be considered.

Response: Thank you for your suggestions. In the revision manuscript, we modified the model and supplemented the ablation experiments combined with different modules.

Case2: Two feature extractors E are used to obtain Eh. Is it redundant? Can it be replaced by a general feature extractor, with two different resolutions as input, so that E and D confront each other;

Response: Thank you for your question. Our idea is to enable feature extractor E to propose high-resolution features for low-resolution images. If the weight is shared, it is difficult to extract good high-resolution features for high-resolution images, so it is difficult to guide feature extractor E to extract high-resolution features for low-resolution images. Therefore, we set up two feature extractors Eh and E to ensure better quality of extracted high-resolution features.

Case3: The algorithm flow is not clear enough, the algorithm format is wrong, it should not be a screenshot;

Response: We revised it according to the algorithm’ style requirements, such as algorithm 1.

Case4: English translation has grammatical errors or has not been investigated, such as target detection.

Response: We have modified the English translation grammatical errors and the improper English expressions in this paper.

Case5: There is a lack of detailed process description below the paper frame diagram.

Response: Thank you very much for your suggestions. In the revision manuscript, we added a detailed process description under the framework chart of the paper.

Reviewer #4:

Point 1: In the title, the short-phrase ‘Knowledge-guided and information-rich small target detection in foggy scenes 'could be rewritten making it be more understandable. Here knowledge and information are abstracted from the image because of fog or are used in a method/process?

Response: We modified the title as “Texture and semantic integrated small objects detection in foggy scenes”.

Point 2: In the abstract, the conclusion of ‘the existing small target detection algorithms ignore the texture information’ is not very precise. U can correct it with ‘some of the existing small target detection algorithms’.

Response: There is something wrong with the content description. It should be “the existing small target detection algorithms ignore the texture information of tiny objects”. We modified it.

Point 3: The abstract should be rewritten carefully; then I could find how your experiment arranged logically.

Response: We have improved the abstract.

Point 4: Self-attention mechanism is applied in semantic information enhancement, and it is a very important part in this manuscript so I think it should be added in keywords.

Response: “Self-attention mechanism” is a keyword, because the PLOS ONE's template does not reflect the keywords, it is not reflected in this manuscript.

Point 5: In lines 134-136, I recommend to add citations supporting the statement of self-attention. Because it is not the conclusion of the article itself, so it should be expressed clearly.

Response: There is a corresponding introduction in the cited reference [19]( Hu. H, Gu. J, Zhang. Z, Dai. J and Wei. Y. Relation Networks for Object Detection[C]. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, USA, 2018:3588-3597.). Therefore, we do not specifically cite reference.

Point 6: According to the foggy images, have u used any method to defog these images? If u have had some experiments, please show the results in the manuscript. If not, u can defogging these images, and they can be input to ur proposed algorithm. If u don’t think it necessary, please tell me why.

Response: In the revision manuscript, we supplemented the ablation experiments combined with different modules in experiments section.

Attachment

Submitted filename: Response to reviewers.docx

Click here for additional data file.^{(20.9KB, docx)}

PLoS One. doi: 10.1371/journal.pone.0270356.r003

Decision Letter 1

Gulistan Raja

9 May 2022

PONE-D-21-32064R1Texture and semantic integrated small objects detection in foggy scenesPLOS ONE

Dear Dr. Wang,

The revised manuscript had been reviewed by 2 reviewers who reviewed the original version. Reviewer 2 was satisfied with the revisions made by authors and recommended accept. Reviewer 3 was also satisfied with revisions, however he had made some minor suggestions to further improve work and recommended minor revision. After consideration of comments of reviewers, my decision is "minor revision". Please incorporate the comments made by Reviewer 3.

Please submit your revised manuscript by Jun 23 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Gulistan Raja

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: (No Response)

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #2: (No Response)

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: (No Response)

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #2: (No Response)

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #2: (No Response)

Reviewer #3: Yes

**********

6. Review Comments to the Author

Reviewer #2: (No Response)

Reviewer #3: （1） The picture is not together with the corresponding title. The title of the table is in the wrong position and exceeds the limit of the body. The image resolution is too low and blurred to see the specific description of each module.

（2） In Figure 6, it is not recommended to use the line chart for the influence results of the super parameters, because the between each two parameters are irrational numbers, which are not easy to test, and do not necessarily increase or decrease linearly. If the line chart is used, it is obvious that the influence of the super parameters on map becomes a linear relationship, and the appearance is strange. It is recommended to use a histogram to show the impact of the tested parameters on map.

（3） There is no reference to others' methods in the method column of all tables.

（4） There are many data tables and module structure diagrams in the article, but there is no detection effect diagram of the experimental data set. In this paper, the small target detection in fog scene has achieved higher accuracy. It is suggested to give the target detection effect diagram of the model on the experimental data set and the target detection effect diagram of the baseline model for comparison.

（5） Table 2 shows the experimental results of different methods on the "coco" data set, in which there are AP, ap50, APS and other measurement indicators, which are not explained and explained in the article, which will cause misunderstanding.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

Reviewer #3: Yes: Xing Wei

PLoS One. 2022 Aug 18;17(8):e0270356. doi: 10.1371/journal.pone.0270356.r004

Author response to Decision Letter 1

7 Jun 2022

Thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments.

Reviewer #3:

Case1: The picture is not together with the corresponding title. The title of the table is in the wrong position and exceeds the limit of the body. The image resolution is too low and blurred to see the specific description of each module.

Response: When we submitted the manuscript of the paper, the editor made the following requirements, so it was modified.

Please note that PLOS does not allow figures to be included within the body of the manuscript. Before we can proceed, please remove the figures from the body of the manuscript, and upload them as separate figure files in accordance with our submission guidelines.

Case2: In Figure 6, it is not recommended to use the line chart for the influence results of the super parameters, because the between each two parameters are irrational numbers, which are not easy to test, and do not necessarily increase or decrease linearly. If the line chart is used, it is obvious that the influence of the super parameters on map becomes a linear relationship, and the appearance is strange. It is recommended to use a histogram to show the impact of the tested parameters on map.

Response: We change the line chart in Figure 6 into a histogram.

Case3: There is no reference to others' methods in the method column of all tables.

Response: We cited references for each method in Tables 1 and 2.

Case4: There are many data tables and module structure diagrams in the article, but there is no detection effect diagram of the experimental data set. In this paper, the small target detection in fog scene has achieved higher accuracy. It is suggested to give the target detection effect diagram of the model on the experimental data set and the target detection effect diagram of the baseline model for comparison.

Response: In the experiments section, we added visual analysis and visual comparison between yolov5 method and our method. It more intuitively proved the effectiveness of our method.

Case5: Table 2 shows the experimental results of different methods on the "coco" data set, in which there are AP, ap50, APS and other measurement indicators, which are not explained and explained in the article, which will cause misunderstanding.

Response: The meanings of AP and mAP are the same. In order not to cause ambiguity, AP has been modified to mAP in the paper, and then the evaluation index meanings of mAP50, mAP75, mAPS, mAPM and mAPL on the COCO data set are explained in the experimental evaluation index part.

PLoS One. doi: 10.1371/journal.pone.0270356.r005

Decision Letter 2

Gulistan Raja

9 Jun 2022

Texture and semantic integrated small objects detection in foggy scenes

PONE-D-21-32064R2

Dear Dr. Wang,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Gulistan Raja

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

PLoS One. doi: 10.1371/journal.pone.0270356.r006

Acceptance letter

Gulistan Raja

10 Aug 2022

PONE-D-21-32064R2

Texture and semantic integrated small objects detection in foggy scenes

Dear Dr. Wang:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Gulistan Raja

Academic Editor

PLOS ONE

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Dataset

(DOCX)

Click here for additional data file.^{(16.8KB, docx)}

Attachment

Submitted filename: Review Report.docx

Click here for additional data file.^{(14KB, docx)}

Attachment

Submitted filename: renamed_39106.pdf

Click here for additional data file.^{(74.4KB, pdf)}

Attachment

Submitted filename: Response to reviewers.docx

Click here for additional data file.^{(20.9KB, docx)}

Data Availability Statement

[pone.0270356.ref001] 1.Chenchen Zhu, Yihui He, Marios Savvides. Feature Selective Anchor-Free Module for Single-Shot Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, USA, 2019. p.840–849.

[pone.0270356.ref002] 2.Redmon. J, Divvala. S, Girshick. R and Farhadi. A. You Only Look Once: Unified, Real-Time Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, 2016. p.779–788.

[pone.0270356.ref003] 3.Liu. W, Anguelov. D, Erhan. D, Szegedy. C, Reed S, Fu. C, et al. SSD: Single Shot MultiBox Detector. In: European Conference On Computer Vision (ECCV), Amsterdam, The Netherlands, 2016: 21–37.

[pone.0270356.ref004] 4. Ren S, He K, Girshick R and Sun J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6). p.1137–1149. doi: 10.1109/TPAMI.2016.2577031 [DOI] [PubMed] [Google Scholar]

[pone.0270356.ref005] 5.He. K, Gkioxari. G, Dollar. P and Girshick. R. Mask R-CNN.? In: IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017. p.2980–2988.

[pone.0270356.ref006] 6.Redmon. J and Farhadi. A. YOLO9000: Better, Faster, Stronger. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, USA, 2017. p.6517–6525.

[pone.0270356.ref007] 7.Lin. T, Dollar. P, Girshick. R, He. K, Hariharan. B and Belongie. S. Feature Pyramid Networks for Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, USA, 2017. p.936–944.

[pone.0270356.ref008] 8.Liu. Z, Gao. G, Sun. L and Fang. L. IPG-Net: Image Pyramid Guidance Network for Small Object Detection.? In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, USA, 2020. p.4422–4430.

[pone.0270356.ref009] 9.Pang. J, Chen. K, Shi. J, Feng. H, Ouyang. W and Lin. D. IPG-Net: Image Pyramid Guidance Network for Small Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, USA, 2020. p.4422–4430.

[pone.0270356.ref010] 10.Tan. M, Pang. R and Le. Q. EfficientDet: Scalable and Efficient Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, USA, 2020. p.10778–10787.

[pone.0270356.ref011] 11.Luo. Y, Cao. X, Zhang. J, Cao. X, Guo. J, Shen. H, et al. CE-FPN: Enhancing Channel Information for Object Detection. arXiv preprint arXiv:2103.10643,2021.

[pone.0270356.ref012] 12.Li. J, Liang. X, Wei. Y, Xu. T, Feng. J and Yan. S. Perceptual Generative Adversarial Networks for Small Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, USA, 2017. p.1951–1959.

[pone.0270356.ref013] 13.Bai. Y, Zhang. Y, Ding. M and Ghanem. B. SOD-MTGAN: Small Object Detection via Multi-Task Generative Adversarial Network. In: European Conference on Computer Vision (ECCV), Munich, Germany, 2018. p.210–226.

[pone.0270356.ref014] 14.Noh. J, Bae. W, Lee. Seo. J and Kim. G. Better to Follow, Follow to Be Better: Towards Precise Supervision of Feature Super-Resolution for Small Object Detection. In: IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 2019. p.9724–9733.

[pone.0270356.ref015] 15. Ji H, Gao Z, Mei T and Ramesh B. Vehicle Detection in Remote Sensing Images Leveraging on Simultaneous Super-Resolution. IEEE Geoscience and Remote Sensing Letters, 2020, 17(4):676–680. doi: 10.1109/LGRS.2019.2930308 [DOI] [Google Scholar]

[pone.0270356.ref016] 16.Shermeyer. J and Etten. A. The Effects of Super-Resolution on Object Detection Performance in Satellite Imagery. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, USA, 2019. p.1432–1441.

[pone.0270356.ref017] 17.Chen. X and Gupta. A. Spatial Memory for Context Reasoning in Object Detection. In: IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017. p.4106–4116.

[pone.0270356.ref018] 18.Zhu. Y, Zhao. C, Wang. J, Zhao. X, Wu. Y and Lu. H. CoupleNet: Coupling Global Structure with Local Parts for Object Detection. In: IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017. p.4146–4154.

[pone.0270356.ref019] 19.Hu. H, Gu. J, Zhang. Z, Dai. J and Wei. Y. Relation Networks for Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, USA, 2018. p.3588–3597.

[pone.0270356.ref020] 20.Tang. X, Du. D, He. Z and Liu. J. PyramidBox: A Context-assisted Single Shot Face Detector. In: European Conference on Computer Vision (ECCV), Munich, Germany, 2018. p.797–813.

[pone.0270356.ref021] 21.Cao. J, Chen. Q, Guo. J and Shi. R. Attention-guided Context Feature Pyramid Network for Object Detection. arXiv preprint arXiv:2005.11475, 2020.

[pone.0270356.ref022] 22.Hahner. M, Dai. D, Sakaridis. C, Zaech. J and Gool. L. V. Semantic Understanding of Foggy Scenes with Purely Synthetic Data. IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 2019. p.3675–3681.

[pone.0270356.ref023] 23.Lin. T, Maire. M, Belongie. S, Hays. J, Perona. P, Ramanan. D, et al. Microsoft COCO: Common Objects in Context. In: European Conference on Computer Vision (ECCV), Zurich, Switzerland, 2014. p.740–755.

[pone.0270356.ref024] 24. He K, Sun J and Tang X. Single Image Haze Removal Using Dark Channel Prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(12):2341–2353. doi: 10.1109/TPAMI.2010.168 [DOI] [PubMed] [Google Scholar]

[pone.0270356.ref025] 25.Bin Zhu, Qing Song, Lu Yang, Zhihui Wang, Chun Liu and Mengjie Hu. R-CNN: Calibrating Point-Guided Misalignment in Object Detection. In: WACV, 2021.

[pone.0270356.ref026] 26.Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang and Dahua Lin. Libra R-CNN: Towards Balanced Learning for Object Detection. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. p.821–830.

[pone.0270356.ref027] 27.Kim, Kang and Lee, Hee Seok. Probabilistic Anchor Assignment with IoU Prediction for Object Detection. In: ECCV 2020. Lecture Notes in Computer Science, 2020, 12370: 355–371.

[pone.0270356.ref028] 28.Haoyang Zhang, Ying Wang, Fears Dayoub and Niko Sunderhauf. VarifocalNet: An IoU-aware Dense Object Detector. In: CVPR, 2021.

[pone.0270356.ref029] 29.Kang Kim1, Hee Seok Lee. Probabilistic Anchor Assignment with IoU Prediction for Object Detection. In: ECCV.2020.

[pone.0270356.ref030] 30.Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, et al. Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection. In: NeurIPS, 2020.

[pone.0270356.ref031] 31.Jiaqi Wang, Wenwei Zhang, Yuhang Cao, Kai Chen, J iangmiao Pang, Tao Gong, et al. Side-Aware Boundary Localization for More Precise Object Detection. In: ECCV 2020. Lecture Notes in Computer Science, 2020, 12349:403–419.

[pone.0270356.ref032] 32.Zhao G, Li G, Xu R, et al. Collaborative Training between Region Proposal Localization and Classification for Domain Adaptive Object Detection. In: ECCV 2020. Lecture Notes in Computer Science, 2020, 12363:86–102.

PERMALINK

Texture and semantic integrated small objects detection in foggy scenes

Zhengyun Fang

Hongbin Wang

Shilin Li

Yi Hu

Xingbo Han

Roles

Abstract

Introduction

Fig 1. Low-resolution image (left) in a foggy scene and clear high-resolution image (right).

Related works

Small objects detection based on multi-scale feature extraction

Small objects detection assisted by high-resolution features

Small objects detection guided by content information

A texture and semantic integrated small objects detection model in foggy scenes

Overview

Fig 2. The texture and semantic integrated small objects detection algorithm in foggy scenes.

Knowledge guidance module

Fig 3. Knowledge guidance module.

Texture information acquisition module

Fig 4. Texture information acquisition module.

Semantic information enrichment module

Fig 5. Semantic information rich module.

Total loss

Experiments

Data set and evaluation indicators

Experimental details

Method comparison

Table 1. Experimental results of different methods on “Cityscape to Foggy” data set.

Table 2. Experimental results of different methods on the foggy “CoCo” data set.

Ablation experiment

Table 3. Ablation experimental results of the model on the “Cityscape to Foggy” dataset.

Parameter analysis

Fig 6.

Visual analysis

Fig 7. Visualization results on the Cityscape to Foggy dataset.

Conclusions

Supporting information

Data Availability

Funding Statement

References

Decision Letter 0

Gulistan Raja

Roles

Author response to Decision Letter 0

Decision Letter 1

Gulistan Raja

Roles

Author response to Decision Letter 1

Decision Letter 2

Gulistan Raja

Roles

Acceptance letter

Gulistan Raja

Roles

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases