Abstract
Recent progress in deep neural networks has been driven by architectural advances that extract image features more accurately. In this study, we focus on attention mechanisms that are useful in deep neural networks and propose a simple but strong feature-extraction architecture, W-Net. The W-Net architecture has two mutually independent paths and is designed with the following advantages. (1) The two independent paths capture more contextual information at different scales in different ways. (2) The two paths produce different feature maps; for upsampling we use bilinear interpolation, which reduces feature-map distortion, and the differently processed maps are then integrated. (3) Feature maps are processed at the bottleneck, where a hierarchical attention module is constructed by reclassifying the features after a channel attention module and a spatial attention module, so that feature maps are processed more efficiently and accurately. In our experiments, we also evaluate on iSAID, a large-scale high-spatial-resolution remote sensing image dataset, with further comparisons that demonstrate the generality of our method for remote sensing image segmentation.
Introduction
Traditional deep learning-based networks learn features directly from data by using a small area surrounding each pixel as the CNN input for training and prediction, which is inefficient and inaccurate [1, 2]. The fully convolutional network (FCN) is the first pixel-level semantic segmentation method: it outputs pixel-level classification results directly from arbitrarily sized inputs by upsampling layers within the network, which significantly improves semantic segmentation results [3].
In the design of neural networks, researchers have focused on leveraging existing network structures, such as VGG, ResNet, ResNeXt, and DenseNet, to extract deep features. These architectures were originally designed for image classification, and the features they extract are semantic [4–7]. Networks such as U-Net, GoogLeNet, SegNet, PSPNet, and DeepLab have since been proposed to achieve more prominent results in semantic segmentation tasks through model improvements or module innovations [8–12].
Compared with other natural images, remote sensing images have a larger image scale and more small target objects; the foreground and background are imbalanced, and the background information is more complex and variable. Moreover, traditional semantic segmentation algorithms have poor applicability, low segmentation accuracy, and blurred target boundaries. Designing a new semantic segmentation network for remote sensing images that obtains better segmentation results is therefore a very challenging problem. We investigate a novel network architecture, called W-Net, which addresses these issues. We observe that the FCN can take input images of any size, that image spatial information is preserved in its end-to-end production of segmented feature maps, and that the class of each pixel can be recovered from these abstract features; U-Net uses feature concatenation, which retains more features in each dimension and can reuse shallow features with high spatial resolution. We therefore propose a dual-path convolution module based on the FCN and U-Net. After passing through the two independent paths, the image is sampled at different sizes, and the multi-scale feature maps are fused so that the network has both a global receptive field and detailed information about small target objects. An attention mechanism is introduced at the model bottleneck: instead of processing all features uniformly, a spatial attention module and a channel attention module are combined to select the fused feature maps in a targeted manner. Attention is focused on specific parts of the image, key information is extracted while irrelevant information is ignored, the network does not become too deep, and the performance of the whole model is improved.
Contribution
Our main contributions are three-fold.
(1) A new dual path network structure, W-Net, is proposed for remote sensing image segmentation.
(2) The design and effectiveness of the proposed dual-path convolution module and focus on prominent areas module in W-Net are verified by adequate ablation studies.
(3) Experiments are conducted on the largest remote sensing dataset iSAID to verify the generalization of W-Net.
Related works
Following continuous innovation in deep learning technology, deep learning methods have been widely used in image processing fields such as object detection and image segmentation [13, 14]. Much research has shown that an advanced neural network architecture is one of the most challenging and effective ways to enhance image segmentation performance. Since the FCN, the first truly end-to-end deep learning model for semantic segmentation, extended convolutional networks to dense pixel-level prediction, high-performance network structures have shone in semantic segmentation tasks.
General semantic segmentation
The most intuitive way to enhance the performance of a network is to deepen its structure by stacking more layers, thereby allowing the network to extract image features more accurately. The designs of VGG and ResNet follow exactly this approach. Compared to AlexNet, VGG has more than twice the depth [15]. ResNet goes deeper still, up to eight times deeper than VGG, and introduces the first residual network structure, which uses an identity mapping y = x whose output equals its input; this alleviates the vanishing-gradient problem caused by increasing the depth of the network. DenseNet likewise connects different feature maps: it guarantees the transmission of information, makes the network narrower and less parameterized, and connects all layers together. GoogLeNet is also very deep but introduces the inception module [16–18]: rather than only increasing model depth, it widens the model, using filters of different sizes in each convolutional block to highlight different features within the same layer and allow the network to achieve higher performance.
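As a minimal illustration of the identity-mapping idea described above, the sketch below (in PyTorch, for illustration only; it is not the exact block of ResNet or of any network evaluated here) adds the input x directly to the output of a small convolutional stack:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = F(x) + x, with an identity shortcut."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The identity shortcut y = x lets gradients bypass the convolutional
        # stack, which mitigates vanishing gradients in very deep networks.
        return self.relu(self.body(x) + x)
```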
As network depth continues to increase, further structural improvements are needed to improve the accuracy of the feature maps. In the FCN, upsampling is performed with deconvolutional layers, and the coarseness of upsampling is mitigated by skip connections; its fully convolutional structure is still used in the most advanced segmentation models. Coarse segmentation maps are produced by the deeper convolutional layers, and the FCN introduces more skip connections to improve the result. However, the FCN only copies encoder features, whereas SegNet transfers the max-pooling indices to the decoder, improving the resolution of the segmented images; this makes SegNet more memory-efficient than the FCN. To enlarge the receptive field without reducing spatial resolution, DeepLab v1 uses dilated (atrous) convolution: it introduces atrous spatial pyramid pooling, substitutes the fully connected layers in VGG with convolutional layers, and combines multi-scale features with a fully connected conditional random field (CRF). Global scene context is also important because it provides clues about the category distribution, and the pyramid pooling module of PSPNet uses large-kernel pooling layers to capture this information: the feature map is pooled at several scales, each pooled map is filtered with convolution kernels, and the outputs are upsampled back to the input size, making the contextual information of the features more explicit. DPN was the first dual-path model; it studies the advantages and limitations of ResNeXt and DenseNet and proposes a dual-path architecture to enrich path design. The structure of DPN is not overly complex, as it merges two networks through an inception-style structure, essentially a model-integration approach, and is particularly effective for image classification [19]. U2-Net is a recently proposed network built on U-Net that follows the encoder–decoder structure; its new ReSidual U-blocks (RSU) module has achieved remarkable results in segmenting object foregrounds [20].
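The two mechanisms mentioned above, dilated (atrous) convolution and pyramid pooling, can be sketched as follows. This is an illustrative PyTorch toy, not the DeepLab or PSPNet implementation; the bin sizes and the channel split are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Toy pyramid pooling: pool at several grid sizes, project with 1x1 convs,
    upsample back, and concatenate with the original feature map."""
    def __init__(self, in_ch: int, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                          # b x b context grid
                nn.Conv2d(in_ch, in_ch // len(bins), 1, bias=False),
                nn.ReLU(inplace=True),
            )
            for b in bins
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        pooled = [
            F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
            for stage in self.stages
        ]
        return torch.cat([x] + pooled, dim=1)  # in_ch + 4 * (in_ch // 4) channels

# A dilated (atrous) 3x3 convolution enlarges the receptive field while keeping
# the spatial size: padding equal to the dilation preserves the feature map.
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
```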
Remote sensing image segmentation
Semantic segmentation is widely used for remote sensing images, primarily by applying general semantic segmentation methods to specific tasks such as road detection, crop yield estimation, and land-change monitoring, and many improved techniques are also available [21–24]. Recently, FarSeg proposed a relation-based and optimization-based foreground modeling approach to address the false-alarm and foreground-background imbalance problems [25]. However, the performance of small-object segmentation still suffers from the lack of edge information needed to distinguish adjacent objects. PFNet builds on the PFN framework by inserting PFM modules between feature-pyramid levels; the constructed propagation network addresses the uneven foreground-background distribution of aerial images and the numerous small objects present at high resolution [26].
Research objectives
For object classification and localization, traditional models use only the feature maps from the last layer of the feature-extraction network, resulting in a large downsampling rate. This is not suitable for remote sensing images with complex foreground information and a wide variety of small target objects. We propose multi-scale feature fusion through a dual path composed of U-Net and FCN, together with a mechanism for focusing attention on salient regions. Our method effectively suppresses irrelevant background regions, alleviates both the poor foreground perception of small target objects in remote sensing images and the ineffective propagation of semantic features at object edges, and provides better model performance and results.
Proposed method
To extract high-resolution image features more effectively and make the model focus more on foreground features, we designed the W-Net network structure shown in Fig 1. Our semantic segmentation network consists of two modules: a convolutional module composed of dual paths for extracting image features, and a focus on prominent areas module used to improve the accuracy of the feature maps after image classification. Section 3.1 introduces the two-path convolution module, and Section 3.2 introduces the design of the network structure.
Fig 1. The W-Net network structure consists of two modules: Two-path convolution module and focus on prominent areas module.
Two-path convolution module
Fully convolutional network (FCN)
The FCN replaces the fully connected layers of a traditional CNN with convolutional layers and introduces skip connections to the convolutional layers to refine the coarse upsampled results. Without fully connected layers, an FCN can accommodate inputs of arbitrary size, but the output of such a network is a heat map rather than a category. At the same time, to compensate for the reduced image size caused by convolution and pooling, the original size is recovered by upsampling; however, the results obtained are not sufficiently fine and are not sensitive enough to detail.
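A minimal sketch of this idea is given below (illustrative PyTorch code; the class name TinyFCNHead and the channel counts are assumptions, and this is not the HRNet-based FCN path used later): a 1 × 1 convolution plays the role of the former fully connected layer and produces a per-class heat map, which bilinear upsampling restores to the input resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCNHead(nn.Module):
    """Sketch of the FCN idea: score features with a 1x1 conv, then upsample."""
    def __init__(self, in_ch: int, num_classes: int):
        super().__init__()
        # The 1x1 convolution replaces the former fully connected layer,
        # so inputs of any spatial size are accepted.
        self.score = nn.Conv2d(in_ch, num_classes, kernel_size=1)

    def forward(self, feats, out_size):
        heat = self.score(feats)                    # coarse per-class heat map
        return F.interpolate(heat, size=out_size,   # recover the input size
                             mode="bilinear", align_corners=False)

# Example: a 1/8-resolution feature map scored and upsampled to 512x512.
head = TinyFCNHead(in_ch=256, num_classes=4)
logits = head(torch.randn(1, 256, 64, 64), out_size=(512, 512))
```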
U-Net
U-Net is a variant of the FCN: its successive downsampling increases the receptive field, so the network attends to global features while its skip connections also let it attend to local features such as texture. However, while extracting features with each downsampling step, some edge features are inevitably lost. Upsampling can, of course, produce larger feature maps, but the edges of those feature maps lack information, and upsampling cannot retrieve the lost features.
Two-path convolution module
We propose a two-path convolutional block, as shown in Fig 1. We draw on the FCN's treatment of global and local information: its skip structure exploits a feature hierarchy that combines deep, coarse semantic information with shallow, fine-grained representational information to produce accurate and fine segmentations. It outputs feature information as a heat map; although the result is not sufficiently fine, it effectively preserves the edge information of the feature map. Compared with the FCN, U-Net performs a subtle multi-scale feature integration and ensures that feature positions are not shifted. U-Net performs exceptionally well on medical images and obtains pixel-level fine segmentation, but during repeated upsampling the edge information of complex images cannot be classified effectively. Compared with medical images, remote sensing images contain a larger number of objects and more complex foreground information. We therefore study the advantages and disadvantages of the two models, fine-tune the U-Net network structure, and propose a deeper U-Net structure, DUNet, as shown in Fig 2. The feature maps obtained from the two networks after feature extraction through two independent paths are concatenated and then classified, which more effectively solves the problem of inaccurate extraction of edge information from high-resolution image feature maps.
Fig 2. U-Net has poor ability to handle remote sensing images with excessive contextual information.
DUNet obtains better segmentation results by deepening the network structure.
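To make the notion of a deeper U-Net concrete, the following is a hypothetical sketch in PyTorch. The number of stages, the channel widths, and the name MiniDeepUNet are illustrative assumptions; the sketch is only an encoder–decoder with one more downsampling stage than a typical four-stage U-Net and does not reproduce the exact DUNet layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
    )

class MiniDeepUNet(nn.Module):
    """Hypothetical deeper U-Net: `depth` encoder stages with skip connections."""
    def __init__(self, in_ch=3, base=32, depth=5):
        super().__init__()
        chs = [base * 2 ** i for i in range(depth)]           # e.g. 32, 64, ..., 512
        self.encoders = nn.ModuleList()
        prev = in_ch
        for c in chs:
            self.encoders.append(conv_block(prev, c))
            prev = c
        self.decoders = nn.ModuleList(
            conv_block(chs[i] + chs[i - 1], chs[i - 1]) for i in range(depth - 1, 0, -1)
        )

    def forward(self, x):
        skips = []
        for i, enc in enumerate(self.encoders):
            x = enc(x)
            if i < len(self.encoders) - 1:
                skips.append(x)                # keep high-resolution features
                x = F.max_pool2d(x, 2)         # downsample for the next stage
        for dec, skip in zip(self.decoders, reversed(skips)):
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            x = dec(torch.cat([x, skip], dim=1))   # fuse decoder and skip features
        return x                                    # full-resolution feature map
```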
The structure of the two-path convolution module is shown in Fig 3. In this module, features are extracted from the image independently along the two paths; the two resulting feature maps are then concatenated and classified, which more effectively solves the problem of inaccurate extraction of edge information in the feature maps of high-resolution images.
Fig 3. The two-path convolution module contains two independent paths of FCN and DUNet for feature extraction of the image, and the fused image retains more feature information.
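The data flow of the module can be sketched as follows (illustrative PyTorch code: both paths are stand-in convolution stacks rather than the actual HRNet-based FCN and DUNet, and the fusion and classification layer is a simplified assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoPathSegmenter(nn.Module):
    """Sketch of the two-path idea: two independent feature extractors,
    bilinear upsampling to a common size, concatenation, then a 1x1
    classification (cls) layer over the fused features."""
    def __init__(self, num_classes: int, in_ch: int = 3, feat_ch: int = 64):
        super().__init__()
        def path(stride):
            return nn.Sequential(
                nn.Conv2d(in_ch, feat_ch, 3, stride=stride, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
                nn.ReLU(inplace=True),
            )
        self.path_a = path(stride=4)   # coarser path (FCN-like global context)
        self.path_b = path(stride=2)   # finer path (DUNet-like detail)
        self.cls = nn.Conv2d(2 * feat_ch, num_classes, kernel_size=1)

    def forward(self, x):
        size = x.shape[-2:]
        fa = F.interpolate(self.path_a(x), size=size, mode="bilinear", align_corners=False)
        fb = F.interpolate(self.path_b(x), size=size, mode="bilinear", align_corners=False)
        fused = torch.cat([fa, fb], dim=1)   # fuse the two multi-scale feature maps
        return self.cls(fused)               # per-pixel class scores

logits = TwoPathSegmenter(num_classes=4)(torch.randn(1, 3, 512, 512))
print(logits.shape)  # torch.Size([1, 4, 512, 512])
```

The points mirrored from the module are the two independent paths, bilinear upsampling to a common resolution before fusion, and a final 1 × 1 classification (cls) layer over the concatenated features.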
Focus on prominent areas module
HSR remote sensing images contain many object categories with large variations in object scale, as well as problems of misalignment and overlap. It is therefore particularly important to focus attention on regions with significant features. The dual-path convolution module extracts feature maps containing more effective boundary information, and at this point we consider the effect of attention mechanisms on deep neural networks in general. The central purpose of an attention mechanism is to make the network concentrate on what needs more attention; there are many implementations, but attention is the core of each [27]. Attention mechanisms can be divided into channel attention, spatial attention, and combinations of the two. Our study suggests that the model can be enhanced by adding an attention mechanism combining channel and spatial attention at the bottleneck of the network structure. We therefore propose the focus on prominent areas module, which contains the attention mechanism and an image classification operation. This module yields an attention-focused information range and effectively addresses the problem of focusing on the foreground features of remote sensing images.
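A BAM-style attention block of the kind used here can be sketched as follows; this is our reading of [46] in PyTorch, and the exact branch layout inside W-Net may differ. The arguments r and d correspond to the reduction ratio and dilation value studied in the ablation experiments below.

```python
import torch
import torch.nn as nn

class BottleneckAttention(nn.Module):
    """BAM-style sketch: a channel branch (reduction ratio r) and a spatial
    branch (dilated convs, dilation d) produce a joint attention map."""
    def __init__(self, channels: int, r: int = 4, d: int = 4):
        super().__init__()
        self.channel = nn.Sequential(            # channel attention branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // r, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1),
        )
        self.spatial = nn.Sequential(            # spatial attention branch
            nn.Conv2d(channels, channels // r, 1),
            nn.Conv2d(channels // r, channels // r, 3, padding=d, dilation=d),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels // r, 3, padding=d, dilation=d),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, 1, 1),
        )

    def forward(self, x):
        # Broadcast-add the [B,C,1,1] channel map and the [B,1,H,W] spatial map.
        att = torch.sigmoid(self.channel(x) + self.spatial(x))
        return x * (1 + att)                     # residual attention refinement
```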
Experiments
Experimental setting
Datasets
Training dataset. The training data were collected by the Jilin Institute of High-Resolution Remote Sensing, and the images were obtained from the Landsat-7 platform. The original image sizes range from approximately 800 * 800 pixels to 4000 * 13000 pixels, for a total of 31 HSR remote sensing images. After data augmentation, the dataset was divided into predefined training and validation sets totaling 8665 images covering three classes.
Evaluation datasets. The iSAID dataset consists of 2806 HSR remote sensing images [28]. These images were acquired at multiple resolutions and from multiple sensors and platforms. The original image sizes range from approximately 800 * 800 pixels to 4000 * 13000 pixels. The iSAID dataset contains 15 classes with a total of 655451 target instances; the number of instances in a single image can reach 8000, with an average of 239. iSAID provides 1411 training images, 458 validation images, and 937 test images. It is, to date, the largest HSR remote sensing image segmentation dataset.
Evaluation metrics
Mean intersection over union (mIoU). Following common evaluation practice, we use the mean intersection over union as the main evaluation metric for object segmentation, computed as follows:
$$\mathrm{mIoU} = \frac{1}{K}\sum_{k=1}^{K}\frac{\lvert P_k \cap G_k \rvert}{\lvert P_k \cup G_k \rvert} \qquad (1)$$
where P_k denotes the set of pixels predicted as category k, G_k denotes the set of ground-truth pixels of category k, and K is the total number of categories. For each category, the formula takes the intersection of the predicted and true pixel sets divided by their union, and then averages over all categories [29–31].
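Eq (1) can be computed directly from label maps. The sketch below (illustrative NumPy code; the function name mean_iou is ours) skips classes that are absent from both the prediction and the ground truth:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """mIoU as in Eq. (1): per-class |P_k ∩ G_k| / |P_k ∪ G_k|, averaged."""
    ious = []
    for k in range(num_classes):
        p, g = pred == k, gt == k
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

# Toy example with K = 3 classes on a 2x3 label map.
pred = np.array([[0, 1, 1], [2, 2, 0]])
gt   = np.array([[0, 1, 2], [2, 2, 0]])
print(mean_iou(pred, gt, num_classes=3))  # 0.722...
```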
Pixel accuracy (PA). Pixel-based accuracy is the most basic and simplest evaluation metric. PA is the proportion of correctly predicted pixels among all pixels and is computed as follows:
$$\mathrm{PA} = \frac{\sum_{i=1}^{K} p_{ii}}{\sum_{i=1}^{K}\sum_{j=1}^{K} p_{ij}} \qquad (2)$$
where i denotes the true class, j denotes the predicted class, K denotes the total number of categories, and p_ij denotes the number of pixels of class i predicted as class j.
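Eq (2) reduces to counting correctly predicted pixels, since the diagonal entries p_ii of the confusion matrix are exactly the pixels whose prediction matches the ground truth (illustrative NumPy sketch):

```python
import numpy as np

def pixel_accuracy(pred: np.ndarray, gt: np.ndarray) -> float:
    """PA as in Eq. (2): correctly predicted pixels over the total pixel count."""
    return float((pred == gt).sum() / gt.size)

print(pixel_accuracy(np.array([0, 1, 1, 2]), np.array([0, 1, 2, 2])))  # 0.75
```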
Implementation details
The data were augmented during training. We scaled and cropped each image to 512 * 512. The backbone used for the FCN path in the dual-path convolution module of W-Net was HRNet_W18, pre-trained on ImageNet; DUNet does not use an existing backbone and is trained from scratch. We used the SGD optimizer with the weight decay set to 0.0005 and the momentum set to 0.9. In all experiments, the models were trained with a "poly" learning rate strategy for 40k iterations, with base_lr = 0.01 and power = 0.9. To ensure that the output feature map matches the input image size, we used bilinear interpolation for all upsampling. Our network was implemented in PaddlePaddle 2.0.2 and trained on a server with a 4-core CPU and a Tesla V100 GPU, with a batch size of 8 over the 40k iterations [32].
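For reference, the optimizer and "poly" schedule described above can be written as follows. This is a PyTorch stand-in for the PaddlePaddle setup actually used; the placeholder model and variable names are illustrative.

```python
import torch

model = torch.nn.Conv2d(3, 4, kernel_size=1)      # placeholder for W-Net
base_lr, power, max_iter = 0.01, 0.9, 40_000
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                            momentum=0.9, weight_decay=0.0005)
# "poly" policy: lr = base_lr * (1 - iter / max_iter) ** power, decaying to 0.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1 - it / max_iter) ** power)
# During training, call scheduler.step() once per iteration after optimizer.step().
```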
Comparison with general methods
To assess the effectiveness of W-Net more comprehensively, Table 1 compares W-Net with several CNN-based approaches, from classical to state-of-the-art models, including Attention U-Net, BiSeNet, DANet, DeepLab v3, DeepLab v3+, Fast-SCNN, FCN, GCNet, Gated-SCNN, HarDNet, OCRNet, PSPNet, U2-Net, U2-Net+, U-Net, and U-Net++ [33–44]. The quantitative results in Table 1 indicate that W-Net outperforms the other methods in high-resolution scenarios. Compared with these 16 classical and advanced image segmentation methods, our model improves the mIoU by 1.39% to 7.86%. The experimental results on this dataset show that the method is more effective than existing general semantic segmentation methods and produces more accurate segmentation results. Fig 4 plots accuracy against precision and shows that W-Net achieves an effective improvement in both, which benefits from its efficient and precise module design.
Table 1. Accuracy (PA), mIoU, and per-category IoU (%) of each model for object segmentation on the training set, obtained from quantitative experiments.
The best results in the experiments are indicated by the values in bold in each column.
| Method | PA(%) | mIoU(%) | Background | Rice | Corn | Forest |
|---|---|---|---|---|---|---|
| Attention U-Net | 83.66 | 61.03 | 74.83 | 35.20 | 48.77 | 85.31 |
| BiSeNet | 81.28 | 58.23 | 71.26 | 35.98 | 42.03 | 83.65 |
| DANet | 80.78 | 57.17 | 71.09 | 33.67 | 41.14 | 82.79 |
| DeepLab v3 | 80.41 | 55.38 | 70.94 | 28.76 | 39.97 | 81.83 |
| DeepLab v3+ | 81.61 | 58.25 | 72.77 | 34.88 | 42.60 | 82.74 |
| Fast-SCNN | 81.69 | 57.96 | 71.60 | 32.59 | 43.80 | 83.85 |
| FCN | 82.63 | 60.21 | 73.08 | 38.78 | 43.40 | 85.58 |
| GCNet | 80.6 | 56.38 | 70.91 | 32.53 | 41.03 | 81.06 |
| Gated-SCNN | 82.01 | 58.70 | 73.89 | 30.99 | 48.00 | 81.89 |
| HarDNet | 82.60 | 59.39 | 72.66 | 33.94 | 46.06 | 84.89 |
| OCRNet | 82.46 | 59.84 | 72.83 | 37.19 | 46.30 | 83.05 |
| PSPNet | 80.72 | 56.96 | 71.05 | 31.04 | 40.47 | 82.19 |
| U-Net++ | 82.15 | 58.56 | 73.29 | 32.28 | 44.33 | 84.36 |
| U-Net | 82.77 | 59.96 | 74.31 | 33.62 | 48.90 | 83.00 |
| U2-Net+ | 82.68 | 59.47 | 73.56 | 33.21 | 45.23 | 85.87 |
| U2-Net | 83.58 | 61.85 | 74.32 | 39.05 | 48.41 | 85.63 |
| Ours | 84.52 | 63.24 | 75.71 | 38.94 | 52.38 | 85.92 |
Fig 4. Accuracy (PA) vs. precision (mIoU) on the training set.
Ablation studies
To verify the validity of W-Net, we conducted sufficient ablation experiments on the network modules and their important parameters.
Two-path convolution module
FCN path. We first consider the backbone network used by the FCN path. To verify backbone segmentation accuracy on remote sensing images, we performed ablation studies with different backbones. The HRNet backbone is semantically richer and spatially more precise [45]; it maintains high-resolution feature representations throughout, allowing high-resolution remote sensing images to retain feature information better. Therefore, as presented in Table 2, we conducted experiments with HRNet variants. The results show that HRNet_W18 reached 60.21% mIoU during training. Although HRNet_W32 reaches 59.69% mIoU, the model size and computation increase noticeably as the number of channels grows, so increasing the channel count provides no advantage when choosing the FCN backbone. We also tested the lightweight HRNet_W18_Small_V1 and HRNet_W18_Small_V2 models and found reduced accuracy. We therefore chose HRNet_W18 as the most effective FCN backbone for deep high-resolution representation learning.
Table 2. Accuracy (PA), precision (mIoU), and consistency (Kappa) obtained by testing using different backbones in the fully convolutional neural network (FCN).
| Method | backbone | mIoU(%) | PA(%) | Kappa |
|---|---|---|---|---|
| FCN | HRNet_W18 | 60.21 | 82.63 | 0.7296 |
| | HRNet_W32 | 59.69 | 82.22 | 0.7250 |
| | HRNet_W44 | 58.40 | 81.97 | 0.7199 |
| | HRNet_W48 | 50.48 | 75.21 | 0.6255 |
| | HRNet_W60 | 58.90 | 81.97 | 0.7196 |
| | HRNet_W64 | 59.55 | 81.95 | 0.7206 |
| | HRNet_W18_Small_V1 | 56.73 | 80.58 | 0.7007 |
| | HRNet_W18_Small_V2 | 58.82 | 82.16 | 0.7232 |
DUNet path. As listed in Table 3, the mIoU of the U2-Net network is 61.85%. U2-Net builds on U-Net with a new module called ReSidual U-blocks (RSU), which segments object foregrounds well and exceeds the segmentation results of U-Net. The mIoU of the U-Net network is 59.96%, and Fig 4 shows that U-Net ranks near the top in terms of accuracy and precision. DUNet, which we propose to handle the more complex composition of information elements in remote sensing images, clearly outperforms U2-Net in our tests of the modified structure. This confirms that our model effectively addresses the aforementioned problems and that the U-Net-based structure is effectively enhanced.
Table 3. Data comparison of the proposed DUNet network modules in terms of mIoU, PA, and Kappa.
| Method | mIoU(%) | PA(%) | Kappa |
|---|---|---|---|
| U-Net++ | 58.56 | 82.15 | 0.7228 |
| U-Net | 59.96 | 82.77 | 0.7325 |
| U2-Net+ | 59.47 | 82.68 | 0.7312 |
| U2-Net | 61.85 | 83.58 | 0.7455 |
| DUNet | 62.40 | 84.07 | 0.7528 |
Two-path convolution module. Tables 2 and 3 clearly indicate that, under the same parameters, the FCN with the HRNet_W18 backbone and the newly proposed DUNet achieve the highest mIoU values on their respective paths. To further verify the concept of the dual-path convolution module, we combined the two methods into a dual path: an input image passes through the two paths to produce two feature maps, which are then integrated. To interpret the image content more precisely, an image classification (cls) step must then be applied so that meaningful predictions can be extracted from the fused features. The dual path therefore uses the FCN together with DUNet and applies cls. The mIoU results in Table 4 verify that cls is very useful after feature-map integration. We therefore conclude that our newly designed dual-path convolution module outperforms the FCN and U-Net on this image segmentation task.
Table 4. Comparison of FCN, U-Net, and DUNet combinations for forming the dual path, verifying that image classification (cls) must be applied for the image segmentation task.
The values in bold font indicate the best type of dual path convolution module.
| Method | mIoU(%) | PA(%) |
|---|---|---|
| FCN | 60.21 | 82.63 |
| U-Net | 59.96 | 82.77 |
| FCN+U-Net | 58.34 | 81.97 |
| FCN+U-Net+cls | 62.39 | 84.04 |
| FCN+DUNet | 59.92 | 82.46 |
| FCN+DUNet+cls | 63.13 | 84.29 |
Focus on prominent areas module
We conducted ablation experiments to verify the roles of BAM's dilation value (d) and reduction ratio (r) in the focus on prominent areas module; the results are presented in Table 5. In the BAM ablation, the dilation value d does not affect the parameter count, and the error was minimal for d = 4 [46]. We therefore fixed d = 4 and studied the effect of the reduction ratio r on the two-path convolution module. With an mIoU of 63.24%, r = 4 is the optimal setting for the focus on prominent areas module. In summary, the BAM hyperparameter r of the focus on prominent areas module in W-Net is set to 4.
Table 5. Experimental results of the optimal solution for the reduction ratio.
| r | 2 | 4 | 8 |
|---|---|---|---|
| mIoU(%) | 62.50 | 63.24 | 63.09 |
Experimental parameter optimization
The learning rate in gradient descent controls the update step of each iteration of the algorithm: a rate set too large is prone to oscillation, while one set too small converges too slowly. We therefore tuned the important hyperparameter base_lr of the poly learning strategy and ran experiments with different base_lr values, displayed as a line graph in Fig 5. As base_lr increases, the performance continuously improves: with base_lr = 0.01, the optimized W-Net improves the mIoU by 1.1% over the baseline, reaching 63.24%. However, at base_lr = 0.02 the performance decreases. We therefore set the hyperparameter base_lr to 0.01 in the model.
Fig 5. mIoU (%) for W-Net optimization using different base_lr values.
Qualitative comparison
To understand the performance of our model more intuitively, we conducted comprehensive experiments on the larger-scale HSR remote sensing image dataset iSAID. We compared W-Net with U-Net and its related methods U-Net++ and U2-Net, as well as methods ranging from classical to state-of-the-art, including Fast-SCNN, DNL-Net, FCN, FastFCN, DeepLab v3, DeepLab v3+, PSPNet, DDRNet, GINet, HRNet, and FarSeg [47–51]. The quantitative results in Table 6 show that the segmentation results of W-Net on HSR remote sensing images are significantly better than those of the other methods, with an mIoU of 63.68%. Table 7 compares the per-category IoU of each method; W-Net achieves the best result in 11 of the 15 categories.
Table 6. Comparison of experimental mIoU for semantic segmentation of the HSR remote sensing dataset iSAID using state-of-the-art methods.
| Method | backbone | mIoU(%) |
|---|---|---|
| Fast-SCNN | - | 47.63 |
| U-Net++ | - | 48.58 |
| U-Net | - | 49.25 |
| DNL-Net | ResNet50 | 49.44 |
| FCN | HRNet_W18 | 50.96 |
| FastFCN | ResNet50 | 55.10 |
| U2-Net | - | 56.61 |
| DeepLab v3 | ResNet50 | 59.61 |
| DeepLab v3+ | ResNet50 | 60.21 |
| PSPNet | ResNet50 | 60.29 |
| DDRNet | - | 61.35 |
| GINet | ResNet50 | 61.64 |
| HRNet | HRNet_W18 | 62.10 |
| FarSeg | ResNet50 | 63.54 |
| W-Net | HRNet_W18 | 63.68 |
Table 7. The bold values in each column indicate the IoU best results for each type of object segmentation on the iSAID dataset.
Each abbreviation is explained as follows: SV (small vehicle), BD (baseball diamond), HC (helicopter), SP (swimming pool), TC (tennis court), LV (large vehicle), SC (storage tank), GTF (ground track field), SBF (soccer-ball field), BC (basketball court), and RA (roundabout).
| Method | mIoU(%) | SV | BD | TC | Ship | LV | SC | SP | Bridge | GTF | HC | Plane | BC | Harbor | SBF | RA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fast-SCNN | 47.63 | 45.78 | 43.16 | 53.08 | 76.73 | 26.81 | 25.11 | 8.28 | 51.88 | 26.36 | 1.37 | 37.42 | 17.68 | 45.75 | 66.66 | 34.90 |
| U-Net++ | 48.58 | 43.12 | 37.31 | 52.99 | 69.24 | 27.35 | 31.16 | 5.75 | 49.81 | 24.76 | 0 | 40.95 | 29.67 | 47.37 | 66.75 | 43.66 |
| U-Net | 49.25 | 48.66 | 46.84 | 43.86 | 81.71 | 35.85 | 20.99 | 10.07 | 53.78 | 25.92 | 0 | 38.34 | 0 | 54.41 | 71.63 | 44.39 |
| DNL-Net | 49.44 | 51.11 | 51.55 | 41.31 | 81.95 | 32.07 | 19.98 | 20.60 | 51.37 | 39.78 | 0 | 38.63 | 0.22 | 38.93 | 76.24 | 49.28 |
| FCN | 50.96 | 50.86 | 50.51 | 56.13 | 80.12 | 30.42 | 30.80 | 13.77 | 53.96 | 31.33 | 0.49 | 37.27 | 24.00 | 53.99 | 69.78 | 43.99 |
| FastFCN | 55.10 | 53.91 | 53.02 | 48.07 | 85.70 | 41.07 | 27.89 | 22.99 | 57.29 | 41.26 | 0 | 40.79 | 22.96 | 43.24 | 76.86 | 49.31 |
| U2-Net | 56.61 | 52.17 | 50.91 | 63.83 | 79.15 | 35.34 | 37.62 | 16.44 | 52.24 | 32.47 | 1.99 | 44.63 | 28.20 | 59.37 | 67.75 | 39.69 |
| Deeplabv3 | 59.61 | 55.05 | 51.02 | 59.87 | 77.92 | 33.43 | 34.88 | 17.63 | 53.84 | 33.59 | 2.05 | 47.23 | 24.35 | 60.56 | 70.61 | 48.17 |
| Deeplabv3+ | 60.21 | 52.67 | 55.67 | 64.10 | 83.02 | 38.08 | 39.60 | 24.35 | 52.60 | 31.82 | 4.55 | 34.69 | 33.04 | 59.07 | 70.94 | 46.45 |
| PSPNet | 60.29 | 60.82 | 56.60 | 68.94 | 86.93 | 44.64 | 42.51 | 27.46 | 58.25 | 40.13 | 6.38 | 44.74 | 41.05 | 64.44 | 77.56 | 52.55 |
| DDRNet | 61.35 | 61.95 | 58.13 | 66.19 | 88.09 | 47.80 | 42.78 | 26.28 | 59.31 | 39.69 | 14.73 | 46.79 | 32.77 | 65.19 | 78.09 | 54.41 |
| GINet | 61.64 | 60.31 | 58.04 | 67.28 | 86.97 | 50.37 | 39.37 | 32.35 | 59.00 | 41.84 | 3.28 | 49.67 | 52.64 | 62.97 | 78.16 | 51.91 |
| HRNet | 62.10 | 58.82 | 62.45 | 69.56 | 86.49 | 46.83 | 43.15 | 30.73 | 57.23 | 39.52 | 13.39 | 44.85 | 45.88 | 64.79 | 76.70 | 51.15 |
| FarSeg | 63.54 | 63.71 | 58.77 | 70.43 | 87.84 | 49.62 | 44.98 | 23.23 | 60.66 | 41.13 | 4.23 | 43.78 | 38.57 | 67.51 | 78.72 | 52.15 |
| W-Net | 63.68 | 64.77 | 59.74 | 72.11 | 88.91 | 55.9 | 42.09 | 29.50 | 61.54 | 43.18 | 18.62 | 50.62 | 44.41 | 67.75 | 79.97 | 56.68 |
Fig 6 shows qualitative results for W-Net, U2-Net, FCN, and U-Net. The first row illustrates segmentation when objects are occluded by the shadows of other objects: W-Net produces accurate results in the presence of shadowed occluders, U2-Net segments the objects less accurately owing to the shadows, and FCN and U-Net show larger errors. Rows 2 and 3 contain images with both large and small objects. Our model can simply and effectively distinguish different forms of the same class of objects, although the edges of small objects are blurred; U2-Net and U-Net clearly produce inaccurate segmentation edges for small objects and split single objects into two labels, and the FCN fails to segment the objects. Row 4 shows the ability of our model to segment objects against cluttered backgrounds and complex foregrounds. In conclusion, our model effectively addresses these image segmentation challenges by performing multi-scale contextual feature extraction on HSR remote sensing images, producing highly accurate segmentation results.
Fig 6. Qualitative experiments based on the use of the W-Net network on the iSAID test set were compared with three other methods.
(a) iSAID test set image, (b) W-Net network object segmentation map, (c) U2-Net network object segmentation map, (d) FCN network object segmentation map, (e) U-Net network object segmentation map. To facilitate visualization of the mapping relationships of the objects of the experimental results, we resize the images. Legend: Scene 1 (tennis court), Scene 2 (airplane), Scene 3 (small vehicle, large vehicle), Scene 4 (small vehicle, port, ship).
Conclusion
In this paper, we treat the extraction of edge information from feature maps and the handling of complex backgrounds, challenges that general semantic segmentation methods ignore, as the key problems of HSR remote sensing image segmentation. To overcome these problems, we propose the W-Net network structure. It extracts feature maps at multiple scales through a dual-path convolution structure and further refines the boundary information of the feature maps by adding a focus on prominent areas module. Comprehensive experimental results indicate that W-Net is efficient for HSR remote sensing image segmentation and exceeds general methods in terms of mIoU.
Although our model solves the above problems, its size affects speed. To strike a better balance between speed and accuracy, we will continue to investigate different advanced techniques and model architectures in the future.
Data Availability
Two datasets are used in this paper. The iSAID dataset is available at https://captain-whu.github.io/iSAID/index.html and can be downloaded and used directly as a public dataset. The other dataset is private research data from the Jilin Institute of High-Resolution Remote Sensing, available at: https://github.com/wseven77/W-Net.
Funding Statement
This work was supported by the key science and technology research and development project of Jilin Province, "Scenic Area Service and Control Integration Platform" architecture and key technology research (No. 20180201086SF); funder Liu Guangjie played an important role in the study design, decision to publish, and preparation of the manuscript. It was also supported by the Joint Fund of the Science & Technology Department of Liaoning Province and the State Key Laboratory of Robotics, China (No. 2020-KF-22-08); funder Yu Zhang had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. It was further supported by a Jilin Provincial Education Department Project (No. JJKH20220840KJ); funder Qingliang Li had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, et al. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J Big Data. 2021;8(1):53. doi: 10.1186/s40537-021-00444-8
- 2. Li Z, Liu F, Yang W, Peng S, Zhou J. A survey of convolutional neural networks: analysis, applications, and prospects. IEEE Trans Neural Netw Learn Syst. 2022 Dec;33(12):6999–7019. doi: 10.1109/TNNLS.2021.3084827
- 3. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. IEEE Trans Pattern Anal Mach Intell. 2017 Apr;39(4):640–651. doi: 10.1109/TPAMI.2016.2572683
- 4. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015:1-9.
- 5. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:770-778.
- 6. Xie S, Girshick R, Dollár P, Tu Z, He K. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018:1492-1500.
- 7. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:2261-2269.
- 8. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells W, Frangi A, eds. Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015. Cham: Springer International Publishing; 2015:234–241.
- 9. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015:1-9.
- 10. Badrinarayanan V, Kendall A, Cipolla R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017;39(12):2481–2495. doi: 10.1109/TPAMI.2016.2644615
- 11. Zhao H, Shi J, Qi X, Wang X, Jia J. Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:2881-2890.
- 12. Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL. Semantic image segmentation with deep convolutional nets and fully connected CRFs. Computer Science. 2014(4):357–361.
- 13. Guan Y, Aamir M, Hu Z, Dayo ZA, Rahman Z, Abro WA, et al. An object detection framework based on deep features and high-quality object locations. Traitement du Signal. 2021;38(3):719–730. doi: 10.18280/ts.380319
- 14. Guan Y, Aamir M, Hu Z, Abro WA, Rahman Z, Dayo ZA, et al. A region based efficient network for accurate object detection. Traitement du Signal. 2021;38(2). doi: 10.18280/ts.380228
- 15. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. 2012;25(2):1097–1105.
- 16. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning (ICML). 2015:448-456.
- 17. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016:2818-2826.
- 18. Szegedy C, Ioffe S, Vanhoucke V, Alemi AA. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI). 2017:4278-4284.
- 19. Chen Y, Li J, Xiao H, Jin X, Yan S, Feng J. Dual path networks. In: Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS). 2017:4467-4475.
- 20. Qin X, Zhang Z, Huang C, Dehghan M, Zaiane OR, Jagersand M. U2-Net: Going deeper with nested U-structure for salient object detection. In: Proceedings of the European Conference on Computer Vision (ECCV). 2020.
- 21. Liang J, Homayounfar N, Ma W-C, Wang S, Urtasun R. Convolutional recurrent network for road boundary extraction. IEEE Trans Intell Transp Syst. 2017;18(11):2965–2975.
- 22. Batra A, Singh S, Pang G, Basu S, Jawahar C, Paluri M. Improved road connectivity by joint learning of orientation and segmentation. IEEE Trans Intell Transp Syst. 2019;20(1):157–167.
- 23. Zhang C, Sargent I, Pan X, Li H, Gardiner A, Hare J, et al. Joint deep learning for land cover and land use classification. Remote Sensing of Environment. 2019;221:173–187. doi: 10.1016/j.rse.2018.11.014
- 24. Huang Z, Wang X, Huang L, Huang C, Wei Y, Liu W. CCNet: Criss-cross attention for semantic segmentation. IEEE Trans Pattern Anal Mach Intell. 2020.
- 25. Zheng Z, Zhong Y, Wang J, Ma A. Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
- 26. Li X, He H, Li X, Li D, Cheng G, Shi J, et al. PointFlow: Flowing semantics through points for aerial image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
- 27. Brauwers G, Frasincar F. A general survey on attention mechanisms in deep learning. IEEE Transactions on Knowledge and Data Engineering. 2022.
- 28. Waqas Zamir S, Arora A, Gupta A, Khan S, Sun G, Shahbaz Khan F, et al. iSAID: A large-scale dataset for instance segmentation in aerial images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2019.
- 29. Hao S, Zhou Y, Guo Y. A brief survey on semantic segmentation with deep learning. Neurocomputing. 2020:302–321. doi: 10.1016/j.neucom.2019.11.118
- 30. Chen Z, Zhu H. Visual quality evaluation for semantic segmentation: subjective assessment database and objective assessment measure. IEEE Trans Image Process. 2019 Dec;28(12):5785–5796. doi: 10.1109/TIP.2019.2922072
- 31. Chen Y, Ming D, Zhao L, Lv B, Zhou K, Qing Y. Review on high spatial resolution remote sensing image segmentation evaluation. Photogrammetric Engineering & Remote Sensing. 2018;84(10):629–646. doi: 10.14358/PERS.84.10.629
- 32. Ma Y, Yu D, Wu T, Wang H. PaddlePaddle: An open source deep learning platform derived from industry practice. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2021;18(3):1073–1085.
- 33. Oktay O, Schlemper J, Folgoc LL, Lee M, Heinrich M, Misawa K, et al. Attention U-Net: Learning where to look for the pancreas. 2018.
- 34. Yu C, Gao C, Wang J, Yu G, Shen C, Sang N. BiSeNet v2: Bilateral network with guided aggregation for real-time semantic segmentation. International Journal of Computer Vision. 2021;129(11):3051–3068. doi: 10.1007/s11263-021-01515-2
- 35. Fu J, Liu J, Tian H, Li Y, Bao Y, Fang Z, et al. Dual attention network for scene segmentation. IEEE Trans Neural Netw Learn Syst. 2021 Jun;32(6):2547–2560.
- 36. Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell. 2018 Apr;40(4):834–848. doi: 10.1109/TPAMI.2017.2699184
- 37. Chen L-C, Papandreou G, Schroff F, Adam H. Rethinking atrous convolution for semantic image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2018;40(9):2096–2109.
- 38. Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018:801-818.
- 39. Poudel RP, Liwicki S, Cipolla R. Fast-SCNN: Fast semantic segmentation network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2019.
- 40. Cao Y, Xu J, Lin S, Wei F, Hu H. GCNet: Non-local networks meet squeeze-excitation networks and beyond. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019:1971-1980.
- 41. Takikawa T, Acuna D, Jampani V, Fidler S. Gated-SCNN: Gated shape CNNs for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019:5229-5238.
- 42. Chao P, Kao C-Y, Ruan Y-S, Huang C-H, Lin Y-L. HarDNet: A low memory traffic network. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019:3552–3561.
- 43. Yuan Y, Chen X, Chen X, Wang J. Segmentation transformer: Object-contextual representations for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021:4781-4790.
- 44. Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J. UNet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans Med Imaging. 2020 Jun;39(6):1856–1867. doi: 10.1109/TMI.2019.2959609
- 45. Wang J, Sun K, Cheng T, Jiang B, Deng C, Zhao Y, et al. Deep high-resolution representation learning for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2021 Oct;43(10):3349–3364. doi: 10.1109/TPAMI.2020.2983686
- 46. Park J, Woo S, Lee J-Y, Kweon IS. BAM: Bottleneck attention module. In: Proceedings of the European Conference on Computer Vision. 2018:797-812.
- 47. Yin M, Yao Z, Cao Y, Li X, Zhang Z, Lin S, et al. Disentangled non-local neural networks. In: Proceedings of the European Conference on Computer Vision (ECCV). 2020:191-207.
- 48. Hong Y, Pan H, Sun W, Jia Y. Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2021.
- 49. Wu H, Zhang J, Huang K, Liang K, Yu Y. FastFCN: Rethinking dilated convolution in the backbone for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2019.
- 50. Wu T, Lu Y, Zhu Y, Zhang C, Wu M, Ma Z, et al. GINet: Graph interaction network for scene parsing. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII. 2020:34-51.
- 51. Wang W, Zhou T, Yu F, Dai J, Konukoglu E, Van Gool L. Exploring cross-image pixel contrast for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021:7303-7313.