Abstract
In the field of the Internet of Things, image acquisition equipment is very important equipment that generates a large amount of invalid data during real-time monitoring. By analyzing the data collected directly from the terminal with edge computing, we can remove invalid frames and improve the accuracy of system detection. The SSD algorithm is relatively lightweight and has a fast detection speed. However, the SSD algorithm does not take full advantage of both the shallow and the deep information of the data. Therefore, a multiscale feature fusion attention mechanism structure based on the SSD algorithm is proposed in this paper, which combines multiscale feature fusion and an attention mechanism. The adjacent feature layers of each detection layer are fused to improve the expressive ability of the feature information. Then, an attention mechanism is added to increase the attention paid to the feature map channels. The experimental results show that the detection accuracy of the optimized model is improved and the reliability of edge computing is enhanced.
Subject terms: Engineering, Mathematics and computing
Introduction
The application scenarios of image acquisition equipment are increasing owing to the booming development of the Internet of Things and major breakthroughs in computer-vision-related technologies. On account of the increase in equipment nodes, the pressure of data transmission increases sharply. The data flow generated by image acquisition equipment may be very large, so it is necessary to use edge computing1–3 to preprocess images. With successive breakthroughs in deep learning technologies4–8 and the rapid development of the world economy, object detection algorithms9–11 have made significant progress in various fields12–15. In particular, target detection is one of the basic research topics in the fields of public transportation16, national defense and the military. It is widely used in aerospace17, robot navigation18, industrial detection19, pedestrian tracking20 and military applications21, and a high-performance target detection algorithm can promote the development of industry. The so-called target detection is to find the target in pictures or videos by analyzing the geometric characteristics of the target, judge the specific category of each target accurately, and provide the bounding box of each target. The CNN (Convolution Neural Network)22 has been proved to be an effective model for processing visual tasks. The convolutional layer can capture the image representation of hierarchical patterns and obtain feature layers with different receptive fields. Finding a more powerful representation is a significant topic in object detection research, so that the network can better capture the significant information in specific tasks23–26. In this way, the accuracy of edge equipment in screening image data is improved and its reliability is enhanced. In recent years, deep learning has developed rapidly, and more and more scholars have applied deep learning to the field of object detection.
There are two kinds of target detection algorithms based on deep learning27. One is the candidate-box-based object detection algorithm represented by RCNN28, Fast-RCNN27, Faster-RCNN29,30, etc. This kind of algorithm first uses Selective Search31, Edge Boxes32 or other algorithms to generate the candidate regions (region proposals)33 that may contain the targets to be detected, and then classifies and locates these candidate regions to achieve target detection. The other is the regression-based object detection algorithm represented by the SSD34–36 series and the YOLO37,38 series. The regression-based algorithms are faster than the candidate-box-based algorithms, whose main advantage is high accuracy. Object detection algorithms require not only high precision but also fast real-time performance. Although the candidate-box-based algorithms have high precision, the generation of candidate boxes consumes a lot of time and results in unsatisfactory speed. The regression-based algorithms do not need to generate candidate boxes and detect directly on the original image, so the speed is greatly improved, but the disadvantage is that the accuracy is not high enough. With algorithmic improvements, some regression-based target detection algorithms now achieve both high accuracy and fast detection speed, and their accuracy is even higher than that of comparable candidate-box-based algorithms39–41. In this paper, the SSD target detection algorithm is optimized: the extracted feature maps are divided into three categories according to their receptive fields, and different scale-fusion methods are applied to each category to achieve better utilization of the feature information and to enhance the robustness of the detection boxes.
Then, the channel attention mechanism is added to the fused feature layers to make the model pay more attention to specific channel information and to improve the network performance in the channel dimension. The validity of the proposed model is demonstrated by comparative experiments, whose results show that the algorithm proposed in this paper not only improves the detection accuracy but also maintains the detection speed42–46. In this paper, SEnet (squeeze-and-excitation networks)47 is used as the channel attention mechanism. The remainder of this paper is arranged as follows. In chapter two, we first review the SSD algorithm and some model evaluation criteria. In chapter three, we focus on the improved SSD model. After that, the experimental results and the corresponding analysis are given in the fourth chapter. Finally, the fifth chapter summarizes the work.
Background and related work
Many scholars at home and abroad are now interested in the field of object detection, especially small object detection, and they have done a lot of research work and achieved good results. For example, in order to improve the detection of small objects, an improved multiscale feature fusion method, the atrous spatial pyramid pooling-balanced-feature pyramid network, is proposed for object detection in reference48. In particular, atrous convolution operators with different dilation rates are applied to fully utilize context information, and skip connections are employed to achieve sufficient feature fusion. In reference49, the authors show how deep learning may be used to reliably extract higher-level features and then fuse multi-scale features to identify eddies, regardless of their structures and scales, and their experimental results show that high target detection accuracy can be obtained by their method.
Next, this chapter describes the basic idea of the traditional SSD algorithm and analyses its advantages and shortcomings in detail. Then, some model evaluation criteria are introduced.
SSD network structure
SSD is one of the typical representatives of one-stage target detection algorithms. The model structure is shown in Fig. 150. The input image is resized to a fixed size of 300 × 300 and fed into the network. A backbone feature extraction network based on VGG1651 is used to obtain feature layers of different scales, named the Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2 layers, whose sizes are (38 × 38 × 512), (19 × 19 × 1024), (10 × 10 × 512), (5 × 5 × 256), (3 × 3 × 256) and (1 × 1 × 256), respectively. Each grid point of a feature layer then creates several prior boxes with different aspect ratios; the numbers of prior boxes generated per grid point on the different feature layers are 4, 6, 6, 6, 4 and 4, respectively. In terms of receptive field size, the Conv4_3 and Conv7 layers are large in size, have small receptive fields and strong geometrical information expression ability, and are used to detect smaller targets; the Conv10_2 and Conv11_2 layers are small in size, have large receptive fields and strong semantic information expression ability, and are suitable for detecting large targets. The geometrical and semantic information of Conv8_2 and Conv9_2 lies between those of the shallower and deeper layers, and these layers are used to detect medium targets. Finally, all redundant prior boxes are removed by non-maximum suppression (NMS) to generate the final detection boxes.
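As a quick check on these layer sizes, the total number of prior boxes for a 300 × 300 input can be computed directly; the sketch below uses only the feature map sizes and per-cell box counts listed above:

```python
# Feature map side lengths and prior boxes per grid cell for SSD300,
# as listed above: (side length, boxes per cell).
layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

# Each cell of an s x s feature map generates k prior boxes.
total_priors = sum(s * s * k for s, k in layers)
print(total_priors)  # 8732 prior boxes for a 300 x 300 input
```

All 8732 boxes are scored and regressed in a single forward pass, which is why the generation of the boxes themselves costs almost nothing compared with the candidate-box-based algorithms.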
Figure 1.
SSD structure chart.
The SSD algorithm uses high-level feature information with large receptive fields to predict large objects and low-level feature information with small receptive fields to predict small objects. This brings a problem: the low-level features used to predict small targets lack high-level semantic information, while the deep feature maps lose too much information and have insufficient resolution after being downsampled multiple times, so the SSD algorithm has weak detection performance for small targets.
Loss function
The loss function of the SSD algorithm contains two parts: the location loss ($L_{loc}$) and the confidence loss ($L_{conf}$). There are many prior boxes but relatively few objects to be detected in an image, so most prior boxes cannot be matched to any real box, which produces a large number of negative samples. The algorithm therefore performs hard negative mining to adjust and control the ratio of positive and negative samples, which reduces the influence of excessive negative samples and improves the optimization speed and the stability of training. The algorithm's loss function is defined as Eq. (1):
$$L(x, c, l, g) = \frac{1}{N}\Big(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\Big) \tag{1}$$
Here $L$ is the total loss function and $N$ is the number of default boxes matched to ground-truth boxes; the parameter $\alpha$ adjusts the ratio between the location loss and the confidence loss; $c$ is the category confidence; $l$ represents the positional information of the predictive boxes; $g$ represents the positional information of the ground-truth boxes; the value of the input $x$ depends on the IoU (intersection over union)52 threshold between the prior box and the real box. When the IoU between prior box $i$ and real box $j$ is greater than the threshold, $x_{ij}^{p} = 1$, indicating that prior box $i$ is matched with real box $j$ of category $p$; otherwise $x_{ij}^{p} = 0$. The location loss adopts the Smooth L1 loss, and the function is defined as Eq. (2):
$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\big(l_i^m - \hat{g}_j^m\big) \tag{2}$$
The $\mathrm{smooth}_{L1}$ function is defined as Eq. (3):
$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{3}$$
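Equation (3) translates directly into code; the sketch below evaluates it in both the quadratic region (small errors) and the linear region (large errors):

```python
def smooth_l1(x):
    """Smooth L1 loss of Eq. (3): quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

# Quadratic region: small regression errors are damped smoothly.
print(smooth_l1(0.5))   # 0.125
# Linear region: large errors grow linearly, limiting outlier influence.
print(smooth_l1(2.0))   # 1.5
```

The linear tail is what makes Smooth L1 less sensitive to outlier boxes than a pure squared loss.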
Here $cx$ and $cy$ represent the offsets of the box center along the $x$ and $y$ directions, and $w$ and $h$ represent the width and height of the boxes; $l_i^m$ denotes the predictive box that is a positive sample, and $Pos$ denotes the positive sample collection. Because the predictive box is encoded, $\hat{g}$ is obtained by applying the same encoding operation to the real box $g$ with respect to the default box $d$. The encoding process is defined as follows:
$$\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}} \tag{4}$$
$$\hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}} \tag{5}$$
$$\hat{g}_j^{w} = \log\left(\frac{g_j^{w}}{d_i^{w}}\right) \tag{6}$$
$$\hat{g}_j^{h} = \log\left(\frac{g_j^{h}}{d_i^{h}}\right) \tag{7}$$
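The encoding of Eqs. (4)–(7) can be sketched as follows; representing boxes as (cx, cy, w, h) tuples is an illustrative convention here, not necessarily the paper's implementation:

```python
import math

def encode(gt, default):
    """Encode a ground-truth box against a default box, Eqs. (4)-(7).
    Both boxes are (cx, cy, w, h) tuples in the same coordinate system."""
    gcx, gcy, gw, gh = gt
    dcx, dcy, dw, dh = default
    return ((gcx - dcx) / dw,    # Eq. (4): x-offset, scaled by default width
            (gcy - dcy) / dh,    # Eq. (5): y-offset, scaled by default height
            math.log(gw / dw),   # Eq. (6): log width ratio
            math.log(gh / dh))   # Eq. (7): log height ratio

# A ground-truth box identical to its default box encodes to all zeros,
# so the regression target of a perfectly placed prior is zero.
print(encode((0.5, 0.5, 0.2, 0.2), (0.5, 0.5, 0.2, 0.2)))  # (0.0, 0.0, 0.0, 0.0)
```

Encoding offsets relative to the default box keeps the regression targets in a small, well-scaled range regardless of the absolute box size.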
The confidence loss adopts the softmax loss, which is defined as Eq. (8):
$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\big(\hat{c}_i^{p}\big) - \sum_{i \in Neg} \log\big(\hat{c}_i^{0}\big), \quad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})} \tag{8}$$
Here $Neg$ represents the negative sample collection; $\hat{c}_i^{0}$ is the probability that a negative sample is correctly classified as background, and $\hat{c}_i^{p}$, calculated through the softmax function, is the probability that a positive sample is correctly classified as its non-background category $p$.
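The confidence loss of Eq. (8) and the hard negative mining step described earlier can be sketched together; the 3:1 negative-to-positive ratio follows the standard SSD setup, and the logit values below are purely illustrative:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def conf_loss(logits, label):
    """Per-box cross-entropy term of Eq. (8); label 0 is background."""
    return -math.log(softmax(logits)[label])

def mine_hard_negatives(neg_losses, num_pos, ratio=3):
    """Hard negative mining: keep only the highest-loss negatives,
    at most ratio * num_pos of them (SSD uses a 3:1 ratio)."""
    return sorted(neg_losses, reverse=True)[:ratio * num_pos]

# Toy example: one positive box and five background (negative) boxes.
neg = [conf_loss(l, 0) for l in ([0.0, 2.0], [1.0, 0.0], [3.0, 0.0],
                                 [0.5, 0.5], [2.0, 1.0])]
kept = mine_hard_negatives(neg, num_pos=1)
print(len(kept))  # 3 negatives kept for 1 positive
```

Discarding the easy negatives keeps the loss dominated by the boxes the network still misclassifies, which is what stabilizes training when positives are scarce.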
Model evaluation criteria
Some model evaluation criteria are introduced here; the following criteria are used to measure the detection effect of the model. The common terms are shown in Table 1.
Table 1.
Common terms for object detection evaluation criteria.
| True positive (TP) | Number of positive samples which are classified correctly |
| True negative (TN) | Number of negative samples which are classified correctly |
| False positive (FP) | Number of negative samples which are incorrectly classified as positive |
| False negative (FN) | Number of positive samples which are incorrectly classified as negative |
1. Accuracy: The accuracy is one of the common evaluation criteria of object detection model. The mathematical meaning is to divide the number of correctly classified samples by the number of all samples. The higher the accuracy, the better the detection effect of the model. And the function is as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{9}$$
2. Precision: The precision is calculated from the test results and indicates the proportion of true positive samples among all samples predicted as positive. It is denoted as Eq. (10):
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{10}$$
3. Recall: The recall rate is calculated from the real sample set and indicates the proportion of all real positive samples that are correctly recognized. It is denoted as Eq. (11):
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{11}$$
4. AP (average precision): In general, precision and recall are contradictory criteria, so AP is proposed to better measure the performance of the model. After drawing the smoothed PR curve (precision-recall curve), the final AP value is calculated as follows:
$$AP = \int_{0}^{1} P(R)\, dR \tag{12}$$
5. mAP (mean average precision): AP means the average precision for a single category, while mAP means the average of AP for multiple categories. The value range of mAP is 0–1, and the higher the value of mAP, the better the detection performance. This criterion is the most important one in the evaluation criteria of object detection algorithm. It is denoted as follows:
$$mAP = \frac{1}{N_{cls}} \sum_{i=1}^{N_{cls}} AP_i \tag{13}$$
6. FPS (frames per second): An object detection algorithm requires both high precision and fast detection speed; the ultimate goal is to find a model that is accurate and efficient. The mathematical meaning of FPS is the number of pictures that the model can detect per second.
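The criteria above can be sketched in a few lines; the trapezoidal integration of the PR curve below is one simple way to approximate Eq. (12), not necessarily the interpolation used by the PASCAL VOC evaluation toolkit:

```python
def precision_recall(tp, fp, fn):
    """Precision (Eq. 10) and recall (Eq. 11) from detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(points):
    """Approximate Eq. (12), the area under the PR curve, by the
    trapezoidal rule; points are (recall, precision), sorted by recall."""
    ap = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        ap += (r1 - r0) * (p0 + p1) / 2.0
    return ap

# 8 true positives, 2 false positives, 2 missed objects.
p, r = precision_recall(tp=8, fp=2, fn=2)
print(p, r)  # 0.8 0.8
# Precision typically falls as recall rises; AP summarizes the trade-off.
print(average_precision([(0.0, 1.0), (0.5, 0.8), (1.0, 0.6)]))  # approximately 0.8
```

mAP is then simply the mean of these per-class AP values, as in Eq. (13).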
Improved algorithm based on SSD
In this chapter, we optimize the SSD algorithm and introduce the optimization steps in detail. There are two main steps. The first is adopting different feature fusion methods for feature layers of different scales to improve the utilization rate of the feature maps. The second is adopting the channel attention mechanism to optimize the model.
Multi-scale feature fusion
Based on the basic structure of SSD, a multi-scale feature fusion attention mechanism (MFA) is proposed to improve the model's utilization of the extracted features. Different fusion mechanisms are adopted for feature layers of different sizes. The Conv4_3 layer, used for the detection of small targets, is fused with Conv7 and Conv8_2; the fusion method can be seen in Fig. 3a. The fusion method of Conv7, which is fused with Conv8_2 and Conv9_2, is shown in Fig. 3b. Fusing the features of relatively deeper layers strengthens the semantic information of the shallow feature layers and increases the accuracy of small target detection. Here, we select one channel of the corresponding feature layer for visualization, as shown in Fig. 2a. In this paper, we name this method multi-scale feature fusion attention for small objects. The Conv8_2 layer, used to detect medium targets, is fused with Conv7 and Conv9_2, as shown in Fig. 3c, and the fusion method of Conv9_2, which is fused with Conv8_2 and Conv10_2, can be seen in Fig. 3d; both make full use of the information from adjacent extracted features to improve the ability of information expression. The feature fusion operation used to detect medium-sized targets is called multi-scale feature fusion attention for medium objects, and the visualization results are shown in Fig. 2b. Finally, the Conv10_2 layer, used for detecting large-scale objects, is fused with Conv8_2 and Conv9_2, as shown in Fig. 3e, and the fusion method of Conv11_2, fused with Conv9_2 and Conv10_2, can be seen in Fig. 3f. As the deep feature layers go through multiple convolutions and downsampling, the receptive field becomes larger but a lot of feature information is lost, which affects the detection accuracy, especially for smaller objects. Such influence can be reduced through the fusion of relatively shallow features, and this operation is named multi-scale feature fusion attention for large objects.
The specific fusion steps are visualized in Fig. 2c. In the fusion step, we change the size of the feature maps by upsampling or by convolution with a stride of 2, and adopt a convolution with a 1 × 1 kernel to change the number of channels. The persons and animals in Fig. 2 are from reference53.
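The resizing step of the fusion can be illustrated with a minimal sketch; nearest-neighbour resizing and element-wise addition stand in here for the learned upsampling, stride-2 convolution and 1 × 1 channel-matching convolution used in the paper:

```python
def nearest_resize(feature, out_size):
    """Nearest-neighbour resize of a square 2-D feature map (list of lists)."""
    in_size = len(feature)
    return [[feature[i * in_size // out_size][j * in_size // out_size]
             for j in range(out_size)] for i in range(out_size)]

def fuse(a, b):
    """Element-wise addition of two equally sized 2-D maps."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Illustrative: bring a 10x10 map (e.g. one Conv8_2 channel) up to 19x19
# so it can be fused with a 19x19 map (e.g. one Conv7 channel).
small = [[1.0] * 10 for _ in range(10)]
large = [[2.0] * 19 for _ in range(19)]
fused = fuse(nearest_resize(small, 19), large)
print(len(fused), len(fused[0]), fused[0][0])  # 19 19 3.0
```

In the actual network the resize and channel matching are learned layers, so the fusion weights adapt during training rather than being fixed as here.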
Figure 3.
Fusion methods for different feature layers.
Figure 2.
Visualization for MFA.
In this paper, different fusion methods are adopted for feature layers of different depths, which greatly improves the utilization rate of the feature information. To reduce overfitting of the model, the following random data augmentations were performed on the original data to improve the diversity of the input: (1) zoom: randomly scale the image to a certain size; (2) flip: randomly flip the picture horizontally; (3) color jitter: transform the image from the RGB color space to the HSV color space and fine-tune the image's hue (H), saturation (S), and value (V). The test results of the different fusion methods on the PASCAL VOC 2007 dataset are shown in Table 2, where the reported mAP is mAP50.
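The zoom and flip augmentations can be sketched as below (the HSV jitter is omitted for brevity); the scale range and flip probability are illustrative assumptions, not the paper's exact settings:

```python
import random

def random_flip(img, p=0.5):
    """Randomly flip an image (list of pixel rows) left-right."""
    return [row[::-1] for row in img] if random.random() < p else img

def random_zoom(img, min_scale=0.8):
    """Randomly crop a square sub-region and treat it as a zoomed view.
    A real pipeline would resize the crop back to the input resolution."""
    n = len(img)
    size = random.randint(int(n * min_scale), n)
    top = random.randint(0, n - size)
    left = random.randint(0, n - size)
    return [row[left:left + size] for row in img[top:top + size]]

random.seed(0)
img = [[(i, j) for j in range(10)] for i in range(10)]
crop = random_zoom(random_flip(img))
print(len(crop))  # a square crop between 8 and 10 pixels per side
```

Because the augmentations are random, every epoch sees slightly different views of the same training images, which is what improves input diversity.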
Table 2.
Performance comparison between different methods.
| | SSD | SSD + MFA (small) | SSD + MFA (medium) | SSD + MFA (large) | SSD + MFA |
|---|---|---|---|---|---|
| mAP50 (%) | 87.30 | 90.05 | 89.78 | 89.56 | 90.57 |
| FPS | 29.31 | 26.75 | 27.62 | 29.06 | 26.11 |
Table 2 indicates that the real-time detection FPS of the different fusion methods is slightly lower than that of the conventional SSD algorithm, while the mAP50 of each fused variant is higher than that of the conventional SSD algorithm. The mAP50 of the SSD algorithm with MFA is 90.57%, an increase of 3.27% compared with the conventional SSD algorithm. The average detection speed of the SSD algorithm with MFA on the experimental platform is 26.11 frames/second, 3.2 frames/second lower than the conventional SSD algorithm.
Feature channel attention mechanism
Squeeze-and-excitation networks (SEnet) were proposed by Hu Jie and his team, and the network won the image classification task of the ImageNet 2017 challenge by a large margin. The SEnet network structure is shown in Fig. 4.
Figure 4.

Senet structure chart.
SEnet adjusts the attention between feature channels to improve the model's feature extraction. By learning to automatically acquire the importance of each feature channel, more attention is paid to the model's effective channels according to this importance, while ineffective or inefficient channels are suppressed. SEnet consists of two important parts, squeeze and excitation. The squeeze operation compresses each two-dimensional feature map into a real number through global average pooling over the spatial dimensions, and this real number has a global receptive field. Next, weights are learned and generated for each channel, which is called excitation. The increase in SEnet network parameters mainly comes from two fully connected layers, and the first fully connected layer reduces the number of parameters through the compression ratio r (r = 16). Therefore, the detection speed of the proposed algorithm is only slightly reduced. In this study, SEnet was added to each feature layer after the different fusion operations, and the framework of the proposed SSD algorithm is shown in Fig. 5.
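The squeeze, excitation and channel-reweighting steps can be sketched as follows; the tiny weight matrices are illustrative stand-ins for the two learned fully connected layers (with the reduction ratio collapsing two channels to one hidden unit):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_block(features, w1, w2):
    """Squeeze-and-excitation over a list of C feature maps (2-D lists).
    w1: C x (C//r) weights, w2: (C//r) x C weights (biases omitted)."""
    # Squeeze: global average pooling per channel -> one scalar per channel.
    z = [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
         for fmap in features]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives per-channel weights.
    h = [max(0.0, sum(z[c] * w1[c][k] for c in range(len(z))))
         for k in range(len(w1[0]))]
    s = [sigmoid(sum(h[k] * w2[k][c] for k in range(len(h))))
         for c in range(len(z))]
    # Scale: reweight each channel by its learned importance.
    return [[[v * s[c] for v in row] for row in features[c]]
            for c in range(len(features))]

# Toy example: 2 channels of 2x2 maps, reduced to 1 hidden unit.
feats = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
out = se_block(feats, w1=[[1.0], [1.0]], w2=[[0.0, 10.0]])
print(out[0][0][0], round(out[1][0][0], 3))  # 0.5 2.0
```

With these weights the second channel receives a gate near 1 while the first is halved, which is exactly the suppress-or-emphasize behaviour described above.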
Figure 5.
Improved SSD structure diagram.
Experimental results and analysis
Experimental equipment and data
The experimental equipment configuration in this paper is as follows:
Intel(R) Core (TM) i5-9300HF CPU @ 2.40 GHz, 16 G memory;
GPU: NVIDIA GeForce GTX 1660 Ti, 16 G memory;
Operating System: Windows10;
Program running environment: python 3.7.7, tensorflow2.2.0, CUDA10.1;
Dataset: PASCAL VOC 2007 (http://host.robots.ox.ac.uk/pascal/VOC/voc2007/)53, and the experiment data table can be seen in Table 3.
Table 3.
Experiment data table of PASCAL VOC 2007.
| | Train set | | Test set | |
|---|---|---|---|---|
| | Images | Objects | Images | Objects |
| Aeroplane | 402 | 547 | 40 | 44 |
| Bicycle | 434 | 613 | 48 | 77 |
| Bird | 536 | 833 | 76 | 112 |
| Boat | 313 | 492 | 40 | 61 |
| Bottle | 412 | 888 | 44 | 86 |
| Bus | 327 | 400 | 33 | 42 |
| Car | 1314 | 2244 | 120 | 207 |
| Cat | 588 | 652 | 71 | 78 |
| Chair | 774 | 1412 | 88 | 142 |
| Cow | 240 | 450 | 28 | 53 |
| Diningtable | 360 | 391 | 30 | 30 |
| Dog | 729 | 874 | 110 | 125 |
| Horse | 498 | 619 | 63 | 91 |
| Motorbike | 425 | 614 | 42 | 50 |
| Person | 3632 | 8300 | 383 | 918 |
| Pottedplant | 427 | 908 | 42 | 86 |
| Sheep | 175 | 453 | 18 | 46 |
| Sofa | 397 | 431 | 55 | 56 |
| Train | 462 | 514 | 58 | 65 |
| Tvmonitor | 443 | 581 | 42 | 51 |
Analysis of experimental results
In this paper, four common target detection algorithms, namely SSD, YOLOv3, YOLOv4 and Faster-RCNN, are used to compare performance with the improved SSD algorithm, and Table 4 shows the experimental results.
Table 4.
The performance comparison of the different object detection algorithms.
| | mAP50 (%) | mAP (%) | mAP75 (%) | FPS |
|---|---|---|---|---|
| SSD | 87.30 | 49.17 | 35.19 | 29.31 |
| YOLOv3 | 79.69 | 47.89 | 44.63 | 17.76 |
| YOLOv4 | 88.57 | 51.81 | 49.99 | 15.45 |
| Faster-RCNN | 79.28 | 53.06 | 54.18 | 1.56 |
| SSD + MFA | 90.57 | 64.96 | 69.00 | 26.11 |
Experiments are carried out using the PASCAL VOC 2007 dataset, and the detection performance indexes include mAP and FPS. mAP is the average of the AP values over all classes, and FPS is the detection speed. mAP50 refers to the average precision when the IoU threshold between the real box and the prior box is 0.5, and mAP75 refers to the average precision when the IoU threshold is 0.75. mAP is the average of the AP values computed at five IoU thresholds between 0.5 and 0.75. The experimental data show that the mAP of the improved SSD algorithm (SSD + MFA) is the highest under every IoU threshold. With mAP50 as the evaluation standard, the improved SSD algorithm is 2.00% better than the second-ranked YOLOv4 algorithm. With mAP75 as the evaluation standard, it is 14.82% higher than the second-ranked Faster-RCNN algorithm. With mAP as the evaluation standard, it is 11.90% higher than the second-ranked Faster-RCNN algorithm. The improved SSD algorithm ranks second in average detection speed; it is only 3.2 frames/second slower than the first-ranked SSD algorithm, and 8.35 frames/second faster than the third-ranked YOLOv3 algorithm. The comprehensive comparison shows that the improved SSD algorithm has the best overall performance.
Figure 6 shows a comparison of the precision-recall curves of the different algorithms in different categories, where SSD stands for the conventional SSD algorithm, SSD + MFA for the improved SSD algorithm, and YOLOv3, YOLOv4 and Faster-RCNN for the corresponding algorithms. As seen from the figure, for the classes 'person' and 'motorbike' there is little difference among the precision-recall curves of the detection algorithms, but the improved SSD algorithm still performs best. The precision-recall curves of the improved SSD algorithm for the classes 'chair' and 'diningtable' increase significantly, with AP values of 88% and 92%, respectively; the detection accuracy is clearly better than that of the other algorithms. For the classes 'cow' and 'sofa', the improved SSD, SSD and YOLOv4 algorithms show little difference in detection accuracy but are significantly better than the YOLOv3 and Faster-RCNN algorithms. For the class 'bottle', the YOLOv4 algorithm has the highest precision and the improved SSD algorithm ranks second. For the class 'pottedplant', the detection accuracy of the improved SSD and YOLOv4 algorithms is clearly better than that of the other algorithms. To sum up, the detection performance of the improved SSD algorithm is improved for targets of different sizes.
Figure 6.
Comparison of the precision-recall curves of the five algorithms in different categories.
Compared with the other four algorithms, the improved SSD algorithm reduces erroneous detection boxes and improves detection accuracy. The comparison of detection results between the different detection algorithms and the improved SSD algorithm is shown in Fig. 7. Three detection results each are selected for large, medium and small objects, among which the red bounding boxes are erroneous detection boxes, the blue bounding boxes are the real boxes, and the green bounding boxes are the prediction boxes corresponding to the real boxes. The results show that the mAP of the improved SSD algorithm is improved for objects of different scales. In addition, erroneous detection boxes are significantly reduced, and the IoU between the prediction boxes of the improved SSD algorithm and the corresponding real boxes is also improved. The persons and other objects in Fig. 7 are from reference53.
Figure 7.
Detection results of different algorithms on PASCAL VOC dataset.
Conclusion
In this paper, an improved SSD algorithm (SSD + MFA) is proposed, which adopts different fusion methods for feature layers extracted at different scales and uses the channel attention mechanism to reallocate the channel weights of the fused feature maps. The mAP50 on the PASCAL VOC 2007 dataset reaches 90.57%, which is 3.27% higher than the conventional SSD algorithm and 2.00% higher than the YOLOv4 algorithm. The improved SSD algorithm can effectively reduce the erroneous detection rate, and the mAP for targets of different sizes is improved to some extent, which significantly improves the precision of edge equipment in screening images.
Ethics approval and consent to participate
The research is approved by the College of Computer Science, and the Intelligent Information Perception and Processing Technology Hunan Province Key Laboratory, Hunan University of Technology, Zhuzhou, Hunan, China, 412007. The research includes some information and images. All subjects involved in the information and images, or their guardians, agreed to the publication of identifying information/images in an online open-access publication.
Acknowledgements
First of all, I would like to thank Mr. Zeng Zhigao, who gave me guiding opinions and recommendations on the research direction of my paper. He also gave me careful advice on the difficulties and doubts I encountered in the process of writing the paper and put forward many helpful suggestions for improvement. At the same time, I would like to thank all my teachers and classmates for helping each other and making progress together during the study. We had a perfect and unforgettable time together.
Author contributions
Conceptualization, L.D. and Z.Z.; methodology, Z.Z.; validation, W.Z., L.D., Z.Z. and Q.L.; formal analysis, W.Z. and Z.Z.; investigation, L.D. and C.M.; resources, W.Z. and Y.Z.; data curation, Q.L.; writing—original draft preparation, Z.Z. and L.D.; writing—review and editing, Z.Z. and L.D.; visualization, Z.Z.; supervision, Q.L.; project administration, Z.Z.; funding acquisition, Y.Z. and Z.Z.
Funding
This work is partially supported by the Major Project for New Generation of AI (Grant No. 2018AAA0100400), the Scientific Research Fund of Hunan Provincial Education Department, China (Grant Nos. 21A0350, 21C0439, 19A133), and the Natural Science Foundation of Hunan (Grant Nos. 2022JJ50051, 2020JJ6089, 2020JJ6088).
Data availability
The datasets used or analysed during the current study are available from the corresponding author on reasonable request.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Qiang Liu, Email: liuqiang@hut.edu.cn.
Zhigao Zeng, Email: zzgzzg99@163.com.
References
- 1.Huang J, Lv B, Yuan Wu, Chen Y, Shen X. Dynamic admission control and resource allocation for mobile edge computing enabled small cell network. IEEE Trans. Veh. Technol. 2022;71(2):1964–1973. doi: 10.1109/TVT.2021.3133696. [DOI] [Google Scholar]
- 2.Chen Y, Liu Z, Zhang Y, Yuan Wu, Chen X, Zhao L. Deep reinforcement learning-based dynamic resource management for mobile edge computing in industrial internet of things. IEEE Trans. Ind. Inf. 2021;17(7):4925–4934. doi: 10.1109/TII.2020.3028963. [DOI] [Google Scholar]
- 3.Chen Y, Zhao F, Lu Y, Chen X. Dynamic task offloading for mobile edge computing with hybrid energy supply. Tsinghua Sci. Technol. 2021 doi: 10.26599/TST.2021.9010050. [DOI] [Google Scholar]
- 4.Lecunz Y, Bengio Y, Hinton C. Deep learning. Nature. 2015;521:436–444. doi: 10.1038/nature14539. [DOI] [PubMed] [Google Scholar]
- 5.Deng S, Zhao H, Fang W, et al. Edge intelligence: The confluence of edge computing and artificial intelligence. IEEE Internet Things J. 2020;7(8):7457–7469. doi: 10.1109/JIOT.2020.2984887. [DOI] [Google Scholar]
- 6.Khan A, Sohail A, Zahoora U, et al. A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 2020;53:5455–5516. doi: 10.1007/s10462-020-09825-6. [DOI] [Google Scholar]
- 7.Chen Y, Wei Gu, Li K. Dynamic task offloading for internet of things in mobile edge computing via deep reinforcement learning. Int. J. Commun. Syst. 2020 doi: 10.1002/dac.5154. [DOI] [Google Scholar]
- 8.Rong Gu, Chen Y, Liu S, Dai H, Chen G, Zhang K, Che Y, Huang Y. Liquid: Intelligent resource estimation and network-efficient scheduling for deep learning jobs on distributed GPU clusters. IEEE Trans. Parallel Distrib. Syst. 2021 doi: 10.1109/TPDS.2021.3138825. [DOI] [Google Scholar]
- 9.Dai H, Yun Xu, Chen G, Dou W, Tian C, Xiaobing Wu, He T. ROSE: Robustly safe charging for wireless power transfer. IEEE Trans. Mob. Comput. 2022;21(6):2180–2197. doi: 10.1109/TMC.2020.3032591. [DOI] [Google Scholar]
- 10.Dai H, Wang X, Lin X, Rong Gu, Liu Y, Dou W, Chen G. Placing wireless chargers with limited mobility. IEEE Trans. Mob. Comput. 2021 doi: 10.1109/TMC.2021.3136967. [DOI] [Google Scholar]
- 11.Zhao Z, Zheng P, Xu S, et al. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019;30(11):3212–3232. doi: 10.1109/TNNLS.2018.2876865. [DOI] [PubMed] [Google Scholar]
- 12.Li, B. 3D fully convolutional network for vehicle detection in point cloud. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, 1513–1518 (2017).
- 13.Zhou, Y. & Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4490–4499 (2018).
- 14.Krizhevsky, A., Sutskever, I. & Hinton, G. E., ImageNet classification with deep convolutional neural networks. NIPS, 1097–1105 (2012).
- 15.Anton, S. D. D., Sinha, S. & Diter Schotten, H. Anomaly-based intrusion detection in industrial data with SVM and random forests. In 2019 International Conference on Software, Telecommunications and Computer Networks, 1–6 (2019).
- 16.Wei, Y., et al., Street object detection/tracking for AI city traffic analysis. In 2017 IEEE SmartWorld, Ubiquitous Intelligence, 1–5 (2017).
- 17.Saputera, Y. P. et al. Radar software development for the surveillance of Indonesian aerospace sovereignty. In 2018 International Conference on Electrical Engineering and Computer Science, 189–194 (2018).
- 18.Khan A, Rinner B, Cavallaro A. Cooperative robots to observe moving targets: Review. IEEE Trans. Cybern. 2018;48(1):187–198. doi: 10.1109/TCYB.2016.2628161. [DOI] [PubMed] [Google Scholar]
- 19.Weiming H, Tieniu T, Liang W, et al. A survey on visual surveillance of object motion and behaviors. IEEE Trans. Syst. Man Cybern. Part C. 2004;34(3):334–352. doi: 10.1109/TSMCC.2004.829274. [DOI] [Google Scholar]
- 20.Chia-Jung P, Hsiao-Rong T, Yu-Ming L, et al. Pedestrian detection and tracking at crossroads. Pattern Recogn. 2004;37(5):1025–1034. doi: 10.1016/j.patcog.2003.10.005. [DOI] [Google Scholar]
- 21.Zhang M, Li H, Xia G, et al. Research on the application of deep learning target detection of engineering vehicles in the patrol and inspection for military optical cable lines by UAV. Int. Symp. Comput. Intell. Des. 2018;01:97–101. [Google Scholar]
- 22.Li J, Peng H, Cao Y, Dou Y, Zhang H, Yu PS, He L. Higher-order attribute-enhancing heterogeneous graph neural networks. IEEE Trans. Knowl. Data Eng. 2021 doi: 10.1109/TKDE.2021.3074654. [DOI] [Google Scholar]
- 23.Saini, A. & Biswas, M. Object detection in underwater image by detecting edges using adaptive thresholding. In International Conference on Trends in Electronics and Informatics, 628–632 (2019).
- 24.Raghunandan, A., Mohana, P. R. & Aradhya, H. V. R. Object detection algorithms for video surveillance applications. In International Conference on Communication and Signal Processing, 0563–0568 (2018).
- 25.Mane, S. & Mangale, S. Moving object detection and tracking using convolutional neural networks. In International Conference on Intelligent Computing and Control Systems, 1809–1813 (2018).
- 26.Ciresan, D., Meier, U. & Schmidhuber, J. Multi-column deep neural networks for image classification. In Proceedings of Computer Vision and Pattern Recognition, 3642–3649 (2012).
- 27.Girshick, R. Fast R-CNN. In Proceedings of IEEE International Conference on Computer Vision, 1440–1448 (2015).
- 28.Girshick, R., et al. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 580–587 (2014).
- 29.Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017;39(6):1137–1149. doi: 10.1109/TPAMI.2016.2577031. [DOI] [PubMed] [Google Scholar]
- 30.Li, Y. T. & Guo, J. I. A VGG-16 based faster RCNN model for PCB error inspection in industrial AOI applications. In IEEE International Conference on Consumer Electronics-Taiwan, 1–2 (2018).
- 31.Uijlings JRR, et al. Selective search for object recognition. Int. J. Comput. Vis. 2013;104:9. doi: 10.1007/s11263-013-0620-5. [DOI] [Google Scholar]
- 32.Zitnick CL, Dollar P. Edge boxes: Locating object proposals from edges. Comput. Vis. ECCV. 2014;2014(8693):391–405. [Google Scholar]
- 33.Hosang J, Benenson R, Dollar P, et al. What makes for effective detection proposals? IEEE Trans. Pattern Anal. Mach. Intell. 2016;38(4):814–830. doi: 10.1109/TPAMI.2015.2465908. [DOI] [PubMed] [Google Scholar]
- 34.Liu, W., et al. SSD: Single shot MultiBox detector. In ECCV, 21–37 (2016).
- 35.Zhou, S. & Qiu, J. RSSD: Object detection via attention regions in SSD detector. In 2019 2nd International Conference on Safety Produce Informatization, 266–269 (2019).
- 36.Araki, R. et al. MT-DSSD: Deconvolutional single shot detector using multi task learning for object detection, segmentation, and grasping detection. In 2020 IEEE International Conference on Robotics and Automation, 10487–10493 (2020).
- 37.Redmon, J., et al. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 779–788 (2016).
- 38.Ren, P., Fang, W. & Djahel, S. A novel YOLO-Based real-time people counting approach. In International Smart Cities Conference, 1–2 (2017).
- 39.Huang J, Tong Z, Feng Z. Geographical POI recommendation for internet of things: A federated learning approach using matrix factorization. Int. J. Commun. Syst. 2022 doi: 10.1002/dac.5161. [DOI] [Google Scholar]
- 40.Huang J, Zhang C, Zhang J. A multi-queue approach of energy efficient task scheduling for sensor hubs. Chin. J. Electron. 2020;29(2):242–247. doi: 10.1049/cje.2020.02.001. [DOI] [Google Scholar]
- 41.Qi L, Lin W, Zhang X, Dou W, Xu X, Chen J. A correlation graph based approach for personalized and compatible web APIs recommendation in mobile APP development. IEEE Trans. Knowl. Data Eng. 2022 doi: 10.1109/TKDE.2022.3168611. [DOI] [Google Scholar]
- 42.Redmon, J. & Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7263–7271 (2016).
- 43.Xu X, Jiang Q, Zhang P, Cao X, Khosravi MR, Alex LT, Qi L, Dou W. Game theory for distributed IoV task offloading with fuzzy neural network in edge computing. IEEE Trans. Fuzzy Syst. 2022 doi: 10.1109/TFUZZ.2022.3158000. [DOI] [Google Scholar]
- 44.Qi L, Hu C, Zhang X, Khosravi MR, Sharma S, Pang S, Wang T. Privacy-aware data fusion and prediction with spatial-temporal context for smart city industrial environment. IEEE Trans. Ind. Inf. 2021;17(6):4159–4167. doi: 10.1109/TII.2020.3012157. [DOI] [Google Scholar]
- 45.Xu X, Fang Z, Zhang J, He Q, Yu D, Qi L, Dou W. Edge content caching with deep spatiotemporal residual network for IoV in smart city. ACM Trans. Sens. Netw. 2021;17(3):1–33, Article 29. doi: 10.1145/3447032. [DOI] [Google Scholar]
- 46.Rong, G., et al. Fluid: Dataset abstraction and elastic acceleration for cloud-native deep learning training jobs. In The 38th IEEE International Conference on Data Engineering, 2183–2196 (2022).
- 47.Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7132–7141 (2018).
- 48.Zeng N, Wu P, Wang Z, et al. A small-sized object detection oriented multi-scale feature fusion approach with application to defect detection. IEEE Trans. Instrum. Meas. 2022;71:1–14. [Google Scholar]
- 49.Du Y, Song W, He Q, et al. Deep learning with multi-scale feature fusion in remote sensing for automatic oceanic eddy detection. Inf. Fusion. 2019;49:89–99. doi: 10.1016/j.inffus.2018.09.006. [DOI] [Google Scholar]
- 50.Hosang, J., Benenson, R. & Schiele, B. A convnet for non-maximum suppression. In German Conference on Pattern Recognition, 192–204 (Springer, 2016).
- 51.Jin, M. et al. Glove detection system based on VGG-16 network. In International Symposium on Computational Intelligence and Design, 172–175 (2020).
- 52.Yu, J., et al. Unitbox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia, 516–520 (2016).
- 53.Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J. & Zisserman, A. The PASCAL visual object classes challenge 2007 (VOC2007) results (2007).
Data Availability Statement
The datasets used or analysed during the current study are available from the corresponding author on reasonable request.