Abstract
Detecting people carrying firearms in outdoor or indoor scenes helps to identify (and potentially prevent) dangerous situations. Nevertheless, the automatic detection of these weapons can be greatly affected by the scene conditions. In real scenes, firearms are commonly seen from different perspectives, and they may have different real and apparent sizes. Moreover, the images containing these targets are usually cluttered, and firearms can appear partially occluded. It is also common for the images to be affected by several types of distortions such as impulse noise, image darkening or blurring. All these variabilities could significantly degrade the accuracy of firearm detection. Current deep detection networks offer good classification accuracy with high efficiency, even under constrained computational resources. However, the influence of the practical conditions under which the objects are to be detected has not been sufficiently analyzed. Our article describes an experimental study on how a set of selected image distortions quantitatively degrades the detection performance on test images when the detection networks have only been trained with images that do not present these alterations. The analyzed test image distortions include impulse noise, blurring (or defocus), image darkening, image shrinking and occlusions. In order to quantify the impact of each individual distortion on the firearm detection problem, we have used a standard YOLOv5 network. Our experimental results show that the addition of increasing impulse (salt-and-pepper) noise is by far the distortion that most degrades the performance of the detection network.
Keywords: Artificial image distortions, Firearm detection, Deep learning, YOLO, Object detection metrics
Introduction
The weapon detection problem consists in locating and classifying, in different environments, the potential threats caused by the presence of firearms, explosives, knives and other dangerous objects (Paulter, 2015; Xu, 2021). This task can use a variety of sensing methods such as X-ray scanning, metal detection and, especially, images/videos (e.g., visible or thermal), which in many cases come from Closed Circuit Television (CCTV) systems, in order to detect and identify those threats. Therefore, weapon detection technology helps to prevent violent incidents and ensures the safety of individuals, especially in urban environments (Debnath & Bhowmik, 2021). For example, in places like stadiums, government buildings, airports or schools, where large numbers of people gather, the risk of a violent incident is higher. This technology also helps to discourage individuals from attempting to bring weapons into these places. Overall, weapon detection is an important tool for maintaining safety in high-risk environments.
A significant part of image-based firearm detection technology is represented by video surveillance CCTV cameras (Gelana & Yadav, 2019). These cameras are being installed in many public outdoor places to provide security. In these environments, the image capture conditions are not controlled at all, and images may present distortions such as impulse noise due to poor illumination conditions, blurring due to camera defocus or object motion, occlusions due to the complexity of the scenes, and object shrinking due to long distances between camera and objects, among others. All these distortions hinder, to varying degrees, the corresponding object detection tasks. In this context, the present study, applied to firearm detection, aims to quantify the influence of different types of image distortions on the task of detecting these objects accurately. The analyzed image degradations correspond to common distortions appearing in realistic firearm images, namely occlusions, impulse noise, blurring, darkening and shrinking, which at certain levels could significantly degrade the detection performance. To quantify the specific impact of these image conditions, we produced altered test images from the original ones by adding the corresponding distortion at different levels or “strengths” (e.g., impulse noise at 2%). Note that the training images do not include the aforementioned distortions.
The considered distortions are intended to simulate common realistic conditions that hinder firearm detection in scenes. In particular, testing image detection models under these distortions is crucial for ensuring their robustness, reliability, and generalization ability in real-world applications. One use of the present study could be as an assessment method that helps to interpret the correctness and accuracy of firearm detections in test images, after analyzing each distortion type (and its “strength”) contained in a given image. An advantage of using artificial transformations on images is that we can analyze the influence of the image distortion degree, which is a much more complex task when dealing with naturally distorted images (i.e., “in-capture” distortions).
We conduct a set of comprehensive experiments to quantitatively analyze the influence of the considered individual image distortion factors on the results of the firearm detection task itself. For this purpose, we have chosen a well-established one-stage object detection system, YOLOv5 (Jocher, 2020). This detector is lightweight and effective, offers good inference times, and is capable of achieving very good results for object detection tasks. Although new versions of YOLO detectors have recently appeared (e.g., YOLOv8 (Terven & Cordova-Esparza, 2023)), in our experiments we have used YOLOv5 since it is easier to train and it is also a suitable choice when a solution needs to be deployed on devices without GPU support.
Related work
This section outlines some relevant works on firearm detection using deep learning, as well as some studies concerning perceived image distortions.
Firearm detection using deep learning
The problem of detecting weapons in images has been intensively researched in recent decades (Darker, Gale & Blechko, 2008; Bhatti et al., 2021), first through traditional machine learning methods and more recently using deep learning models, especially convolutional neural networks (Yadav, Gupta & Sharma, 2023).
Most research on weapon detection has targeted knives (Kmiec & Glowacz, 2011; Buckchash & Raman, 2017; Castillo et al., 2019; Glowacz, Kmieć & Dziech, 2015) and firearms (Tiwari & Verma, 2015; Olmos, Tabik & Herrera, 2018; Bhatti et al., 2021). Detection of knives and firearms in images has received growing attention from the scientific community due to its security implications (Debnath & Bhowmik, 2021). Some works focus on the detection of only one of these types of weapons: e.g., Moran, Conci & Sanchez (2022) only consider the detection of knives, and Olmos, Tabik & Herrera (2018) is solely devoted to handguns, while others consider both detections jointly (Grega et al., 2016). Weapon detection is generally a challenging task due to the variable size and shape of objects within the images, possible occlusions and/or cluttered backgrounds of scenes. A related problem is to detect paired bounding boxes that contain both the weapon and the human, for a more robust visual identification of gunmen in crowds (Basit et al., 2020; Mahmood et al., 2024).
Nowadays, deep learning has revolutionized the general object detection problem (Zou et al., 2019). In particular, deep learning firearm detection has also been recently investigated. Lai & Maples (2017) used CNNs for detecting and classifying weapons in images. These authors considered around 3,000 CCTV images with weapons as training data to cover every situation and possible orientation of these objects. Olmos, Tabik & Herrera (2018) proposed a CNN-based system for detecting handguns from CCTV videos with the goal of reducing false positives. These authors created their own dataset (called DaSCI), and achieved their best results with a Faster R-CNN model. Salido et al. (2021) compared the performance of three deep CNN models (Faster R-CNN, RetinaNet and YOLOv3) in the detection of guns in videos. The study focuses on how the inclusion of gun grip information influences the results of each detector; a consistent improvement is only achieved using YOLOv3. Khan et al. (2023) used U-Net networks for weapon segmentation in real time when people pass through a scanner system. For this purpose, the 2D segmentation network is reformulated and a Gaussian map is used to model the weapons in the feature map. Their dataset only provides handguns such as pistols or revolvers. Ruiz-Santaquiteria et al. (2023) recently proposed a combination of a human body pose classifier (OpenPose) with a deep network that processes images to extract relevant features for gun detection in video surveillance. After comparing the results obtained with different deep networks (ResNet-50, EfficientNet-B4, ConvNeXt-Base, Darknet53, DeiT and ViT) in combination with and without the pose features and filtering out false positives, the authors concluded that the best detection results were obtained with the Vision Transformer (ViT) model. The detection of firearm orientation can provide insights about the behavior and intentions of people carrying these weapons, which is critical for identifying potential threats. Iqbal et al. (2021) proposed a weakly supervised deep learning architecture to predict Oriented Bounding Boxes (OBB) without using OBB annotations during training.
Perceived image distortions
We have not found in the literature any experimental studies aimed at analyzing separately the influence of synthetic image distortions (i.e., by applying artificially computed blur, impulse noise and random occlusions) on specific object detection problems with the approach presented here. For a particular detection problem, our approach consists in using a set of training images without the considered distortions and, after that, using new distorted test images to quantify the impact that the distortions and their “degrees” have on the accuracy of the detection task.
The work by Dodge & Karam (2016) used a classification problem to understand how image quality affects different deep neural networks. These authors trained their networks with distorted images using different levels of quality distortions. Then, they carried out a study on the effect of compression, noise, blur and contrast.
There are also some studies that, using deep networks, analyze only the impact of other isolated image variabilities. For instance, the effect that illumination conditions (Sanchez et al., 2016) have on some specific object detection problems such as face recognition. Other authors consider the scale of the YOLOv5 model for blood cell detection (Rahaman et al., 2022) or its application for classification of caries lesions (Salahin et al., 2023). The work by Venkataramanan et al. (2022) analyzed the detection problem in authentically distorted images of roadways for quality assessment of detection algorithms.
Contribution and outline of this work
This article describes an experimental study that analyzes individually how each considered image distortion quantitatively affects the detection performance on test images when the detector network has not been trained with images presenting these variations. The main contribution of this work consists in quantifying, analyzing and comparing the effect of each considered individual image distortion (e.g., occlusion, impulse noise or blurring) on the detection performance. It should also be noted that the goal of this study is not to demonstrate the viability of the considered YOLO model itself for our firearm detection problem, but to quantify how it is affected by the aforementioned distortions.
The rest of the article is organized as follows: Section ‘Materials and Methods’ introduces the materials and methods used in this research on detection of firearms. Section ‘Results’ describes the experimental setup, and it also displays and analyzes the results achieved for the different experiments on the considered firearm detection problem. Section ‘Discussion’ discusses these results and points out the most relevant findings. Finally, in Section ‘Conclusions’ we summarize the conclusions of the present work.
Materials and Methods
This section first describes the object detection problem in images and the neural detector model used in the experiments. Then, an overview of the stages in the proposed solution is presented. The preprocessing and data augmentation applied to the original training images is later indicated. We continue with the value of hyperparameters of the YOLO detector employed, and we also provide some details on the training of the network. Finally, the dataset used for the experiments is described.
Object detection and the YOLO model
Object detection is a challenging task in Computer Vision that has received considerable attention in recent years, especially with the development of deep learning (Zou et al., 2019; Wang et al., 2021). It has many applications related to video surveillance, automated vehicle systems, robot vision or machine inspection, among many others. The problem consists in recognizing and localizing certain classes of objects present in static images or videos.
Recognizing (or classifying) involves identifying the categories of all object instances in a scene from a given set of classes, along with their confidence values. Localizing, in contrast, returns the coordinates of bounding boxes for each detected object in the image. Detection differs from instance segmentation, which identifies the object instance each pixel belongs to. Challenges in object detection include geometrical variations (e.g., scale changes, small object-to-image size ratios), partial occlusions, or varying illumination. Some images may exhibit multiple variabilities, such as small and partially occluded objects.
The You Only Look Once (YOLO) model, proposed by Redmon et al. (2016), is a state-of-the-art real-time object detection network. YOLO is a one-stage detector that uses features from the entire image to predict class probabilities and bounding box coordinates in a single pass. It formulates object detection as a regression problem, enhancing speed, accuracy, and generalization. YOLO splits an image into an N × N grid, where each cell predicts a fixed number of bounding boxes with their confidences; overlapping predictions are then filtered with a Non-Maxima Suppression (NMS) algorithm. The YOLO framework has evolved through iterations like YOLOv8, YOLO-NAS, and YOLO with Transformers. We focused on YOLOv5 for this study, developed by Ultralytics in 2020 using Python and PyTorch, offering versions from nano to extra large to suit various hardware requirements. We tested the YOLOv5s (small) and YOLOv5m (medium) configurations.
Figure 1 depicts the simplified network architecture of YOLOv5, comprising three main components: backbone, neck, and head. A 416 × 416 RGB image is processed through an input layer to the backbone, a modified CSP-Darknet53 CNN, which extracts hierarchical features at various scales using the Cross Stage Partial (CSP) strategy, enhancing inference speed by reducing parameters. The neck integrates output features from the backbone at different resolutions using modules like Spatial Pyramid Pooling-Fast (SPPF) and Path Aggregation Networks (PAN), connecting the backbone to the head. The anchor-based head classifies detected objects with three convolutional layers, predicting bounding box locations, confidence scores, and classes, and this information is displayed in the output image.
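As an illustration, a pretrained YOLOv5 model can be loaded and applied to an image through the public Ultralytics PyTorch Hub interface; the minimal sketch below assumes that interface and a hypothetical image path.

```python
import torch

# Load the pretrained YOLOv5m model from the Ultralytics repository (PyTorch Hub)
model = torch.hub.load('ultralytics/yolov5', 'yolov5m', pretrained=True)

# Run inference on a test image (hypothetical path); resizing and NMS are handled internally
results = model('test_images/scene_001.jpg')

# Each detection row: [xmin, ymin, xmax, ymax, confidence, class]
print(results.xyxy[0])
results.print()  # summary of detections per class
```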
It is worth mentioning that this YOLO model applies data augmentation to each training batch. The data loader performs three types of augmentations: scaling, color space adjustments and mosaic (i.e., a combination of four images into four tiles of random ratio).
Experimental setup
Figure 2 shows a UML diagram with the steps followed in the proposed experimental setup using the considered YOLO model. First, the original set of training images was preprocessed and augmented to increase both the size and the variability of the dataset. These two stages are described in detail in the next subsection. After that, the chosen YOLO model was trained, tested and evaluated through different experiments using standard metrics for the object detection problem.
Image preprocessing and data augmentation
The original set of training images was first preprocessed and then augmented to increase the quality and size of the original training dataset. The Roboflow tool (Dwyer & Hansen, 2022) was used for all preprocessing tasks. These tasks consisted in first applying a contrast stretching by means of an adaptive equalization (using the Auto-Adjust Contrast command) on the training images. Then, the resulting images were rescaled (using the Resize+Fit (Black Edges) command) from their original dimensions to the YOLOv5 input layer dimension (416 × 416). This rescaling keeps the aspect ratio of source images, and in some cases it creates a black padding image region.
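The preprocessing itself was carried out with Roboflow; the snippet below is only an approximate re-creation of the two steps (adaptive equalization followed by an aspect-preserving resize with black padding to 416 × 416) using OpenCV, so details such as the CLAHE clip limit are assumptions.

```python
import cv2
import numpy as np

def preprocess(img_bgr, target=416):
    # Contrast stretching via adaptive equalization (CLAHE on the luminance channel)
    lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))  # clip limit is an assumption
    img = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)

    # Resize keeping the aspect ratio, then pad with black edges to target x target
    h, w = img.shape[:2]
    scale = target / max(h, w)
    resized = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.zeros((target, target, 3), dtype=np.uint8)
    y0 = (target - resized.shape[0]) // 2
    x0 = (target - resized.shape[1]) // 2
    canvas[y0:y0 + resized.shape[0], x0:x0 + resized.shape[1]] = resized
    return canvas
```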
For the augmentation of the training set of images, we used the Augmentor software (Bloice, Stocker & Holzinger, 2017). This generated five new images for each original one in the dataset according to the following transformations:
1. 45° clockwise rotation (Rotate45 command) followed by horizontal mirroring (flip-left-right command);
2. vertical mirroring (flip-top-bottom command);
3. 25° clockwise rotation (Rotate25 command);
4. 90° clockwise rotation (Rotate90 command) followed by a translation of 40 and 20 pixels in the x and y axes, respectively (translation-xy(40, 20) command); and
5. vertical mirroring (flip-top-bottom command) followed by 45° clockwise rotation (Rotate45 command).
Figure 3 shows, from top to bottom and from left to right, the application of the five aforementioned augmentations for an original sample training image (upper left corner).
The choice of these preprocessing and augmentation transformations was made after multiple experiments.
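The five augmentations listed above were generated with the Augmentor package; as a rough, library-agnostic sketch they can be approximated with PIL as follows (the exact interpolation and fill behavior of the original tool may differ).

```python
from PIL import Image, ImageOps

def augment_five(img: Image.Image):
    """Return the five augmented variants of a training image (approximate re-creation)."""
    a1 = ImageOps.mirror(img.rotate(-45))              # 45° clockwise + horizontal mirroring
    a2 = ImageOps.flip(img)                            # vertical mirroring
    a3 = img.rotate(-25)                               # 25° clockwise
    a4 = img.rotate(-90, translate=(40, 20))           # 90° clockwise + (40, 20) translation
    a5 = ImageOps.flip(img).rotate(-45)                # vertical mirroring + 45° clockwise
    return [a1, a2, a3, a4, a5]
```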
Network parameterization and training details
Training YOLO models requires a large collection of input images together with their corresponding ground-truth boxes for each object instance contained in them. In our approach, we have used transfer learning: the pretrained weights of a YOLOv5 network trained on the Microsoft (MS) COCO dataset (Lin et al., 2014) were used to boost the training of the model with our images. The MS COCO dataset includes objects belonging to 80 different classes but, unfortunately, classes like ‘firearm’ or ‘handgun’ are not included. Nonetheless, the class ‘knife’, which has some resemblance to the objects being detected, does appear in MS COCO. In our problem we consider two classes of firearms: ‘handgun’ and ‘long gun’. Note that the class ‘handgun’ includes ‘pistol’ and ‘revolver’ object instances, while the class ‘long gun’ includes ‘machine gun’, ‘shotgun’ and ‘rifle’ instances.
Table 1 summarizes the main training parameter values used for the YOLO model in the experiments. These values were determined experimentally.
Table 1. Training hyperparameter values used for the YOLOv5 networks.
Network hyperparameter | Set value |
---|---|
Training epochs | 100 |
Batch size | 32 |
Optimizer | SGD |
Learning rate | lr0 (initial): (1, 1e−5, 1e−1); lrf (final): (1, 0.01, 1.0) |
Momentum | (0.3, 0.6, 0.98) |
Decay | (1, 0.0, 0.001) |
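As a sketch, training with these settings follows the standard Ultralytics YOLOv5 workflow: a small dataset description in YAML plus a call to the repository's train.py script. The file and folder names below are hypothetical, and the script is assumed to be run from a clone of the ultralytics/yolov5 repository.

```python
from pathlib import Path
import subprocess

# Dataset description in the YAML format expected by YOLOv5 (hypothetical paths)
Path("firearms.yaml").write_text(
    "train: datasets/firearms/train/images\n"
    "val: datasets/firearms/valid/images\n"
    "nc: 2\n"
    "names: ['handgun', 'long gun']\n"
)

# Transfer learning from the COCO-pretrained yolov5m.pt weights with the
# settings of Table 1 (416x416 inputs, 100 epochs, batch 32; SGD is the default optimizer)
subprocess.run([
    "python", "train.py",
    "--img", "416", "--batch", "32", "--epochs", "100",
    "--data", "firearms.yaml", "--weights", "yolov5m.pt",
], check=True)
```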
All ML algorithms and programs were coded in Python using the OpenCV Computer Vision library and the PyTorch framework for deep learning. These codes and information about the project can be downloaded from: https://github.com/patriciacs99/WeaponDetectionYOLOv5.
Regarding the computing infrastructure (operating system, hardware, etc.), all experiments were performed in the Google Colab environment on a standard Nvidia T4 Tensor GPU. The average test detection time per image, using the trained YOLOv5m (medium) model, was 0.5 ms.
Description of the used datasets
In order to carry out our experiments, it was necessary to build a dataset with images containing both classes of objects to be detected: ‘handguns’ and ‘long guns’. The images of ‘handguns’ were taken from the Weapon Detection dataset (Olmos, Tabik & Herrera, 2018) provided by the Data Science and Computational Intelligence (DaSCI) Institute, University of Granada (Spain). As the existing image labelling in this dataset does not match the format required by our YOLO model, the Roboflow tool was used to automatically transform the existing XML annotations into the new format.
We extended our dataset with images of the class ‘long gun’ taken from Google Images and from websites such as Depositphotos (2009), Shutterstock (2003) and Pixabay (2023). These images were labeled using Roboflow (Dwyer & Hansen, 2022) in order to complete our dataset, which is summarized in Table 2. Note that the number of images per class is nearly balanced and that, approximately, 70% of these images were used for training, 15% for validation and 15% for testing. Regarding the number of weapon instances per image, 91.8% of the images (5,441) contain only one firearm, while the remaining images (483) have two or three instances.
Table 2. Image dataset composition and its distribution.
Class | Images | Train | Validation | Test |
---|---|---|---|---|
handgun | 2,972 | 2,080 | 446 | 446 |
long gun | 2,953 | 2,052 | 452 | 446 |
Total | 5,924 | 4,132 | 898 | 892 |
Additionally, we included some experimentation using the Localization of Firearm Carriers (LFC) dataset (Mahmood et al., 2024). It contains 3,128 images, each depicting at least one human interacting with a firearm in various scenarios. This dataset supports the training and evaluation of machine learning models for firearm detection, aiding security and surveillance systems. It encompasses a range of complexities like varying crowd densities or partially concealed firearms.
Results
This section first describes the detection performance metrics (Padilla et al., 2021) used in the experiments. After that, we show and analyze the results achieved by different experiments, which consider different types of variabilities. The ’medium’ scale YOLOv5m model has been used as reference for this purpose.
Performance metrics
Model evaluation is the process of assessing how well a machine learning model performs on unseen data. To define basic accuracy measures over the detections, it is necessary to consider two threshold parameters: the network confidence threshold and the IoU (Intersection over Union) threshold. The network confidence measures the reliability of the network concerning the object class of each computed bounding box. IoU (also called Jaccard Index) measures how accurately an object is detected within a test image. It is computed as the overlapping area between a predicted detection and its corresponding ground truth, divided by the area of the union between the predicted detection and the ground truth. For multi-class detection problems, the mean IoU for an image is calculated by taking the IoU of each class and averaging them. This can be extended to all the images of the test dataset to obtain an average IoU value. A confidence threshold Confth is used to determine whether the network gives a positive answer relative to a detected object in the image. An IoU threshold IoUth is used to determine whether the overlap between a network detection and the ground truth is significant or not. In our framework, the values of these parameters were set to Confth = 0.5 and IoUth = 0.45, respectively.
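For reference, a straightforward implementation of the IoU between a predicted and a ground-truth box, both given in (xmin, ymin, xmax, ymax) format, is sketched below.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```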
In our context of firearm detection, we define the true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), in relation to the detections produced on the images, as follows. Let Conf(p) be the confidence returned by the network on the detection of a firearm p present in image i, and IoU(p) the intersection over union value for the same firearm; then p is counted as a TP, FP or TN, and the FN of image i are computed, according to the following conditions:
TP: Conf(p) ≥ Confth and IoU(p) ≥ IoUth (1)

FP: Conf(p) ≥ Confth and IoU(p) < IoUth (2)

TN: Conf(p) < Confth and IoU(p) < IoUth (3)

FN(i) = NP(i) − |TP(i)| (4)
where, in the last (FN) formula, NP(i) and |TP(i)| represent the number of firearms present in image i and the number of TP in that same image, respectively. We accumulate the numbers of TP, FP and FN detections for each image i, and also for the whole dataset, to present our test results for YOLOv5. For simplicity, we denote the accumulated values over the whole test dataset simply as TP, FP and FN.
Using these accumulated values, some metrics such as precision, recall, and F1-score can be computed as follows:
Precision = TP / (TP + FP) (5)

Recall = TP / (TP + FN) (6)

F1-score = 2 · (Precision · Recall) / (Precision + Recall) (7)
Precision represents the fraction of relevant instances among the retrieved instances (i.e., measure of quality), while recall represents the fraction of relevant instances that were retrieved (i.e., measure of quantity). F1-score is defined as the harmonic mean of precision and recall.
Another considered metric is the average precision (AP), that summarizes the Precision-Recall curve, computed as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight:
AP = Σ_n (Recall_n − Recall_{n−1}) · Precision_n (8)

where Precision_n and Recall_n are respectively the precision and recall values at the n-th threshold.
mAP = (1/N) Σ_{i=1}^{N} AP_i (9)
where N is the number of classes. In our problem, we have two classes, referred to as ‘handgun’ and ‘long gun’. In some of our experiments, we report the values of mAP50 (i.e., mAP calculated at an IoU threshold of 0.5) and mAP50-95 (i.e., the average mAP over different IoU thresholds, from 0.5 to 0.95, with a step of 0.05).
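The metrics of Eqs. (5)–(9) can be computed directly from the accumulated counts and from the precision–recall pairs of the detector; a minimal sketch is given below (the function names are ours, not part of any library).

```python
def precision_recall_f1(tp, fp, fn):
    """Eqs. (5)-(7) from accumulated TP, FP and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_precision(precisions, recalls):
    """Eq. (8): precisions weighted by the recall increment at each threshold.

    Assumes `recalls` is sorted in increasing order with matching `precisions`.
    """
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# Eq. (9): mAP is the mean of the per-class AP values
def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)
```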
Experimental results
Next, we describe each of the experiments and summarize the corresponding results. First, we present an initial experiment, used as baseline, where global detection results of firearms are presented (Experiment 1). Next, the following experiments using our dataset (Experiments 2 to 6) respectively correspond to each considered individual distortion applied to the test images, namely: impulse noise, occlusions, blurring, image darkening and image shrinking. Finally, the last experiment (Experiment 7) presents the achieved detection results using the LFC dataset.
Experiment 1: Global classification of firearms
This first experiment shows the detection results produced for each of the two classes in the problem (‘handgun’ and ‘long gun’), as well as globally (i.e., without distinguishing the classes). These results were produced using a “medium” scale YOLOv5 model (referred to as YOLOv5m). This version is characterized by the following reported features (Jocher, 2020): 21.2 million parameters, a model size of 41 MB (at FP16 half floating-point precision), 8.2 ms inference time per image (on an Nvidia V100 GPU) and 45.2 mAP on the original MS COCO dataset (Lin et al., 2014). All these values were reported for default image sizes of 640 × 640.
The model was parameterized and trained as explained in Section ‘Network Parameterization and Training Details’. The outcomes of this first experiment will be considered as the “baseline” solution, and they will be useful to characterize and compare how the different variabilities influence the detection results.
Table 3 illustrates the achieved baseline results using the ‘medium’ YOLOv5m model on the 892 considered test images (see Table 2).
Table 3. Detection results for classes ‘handguns’ and ‘long guns’, and also for all the firearms using the medium YOLOv5m model.
Class | Instances | Precision | Recall | F1-score | mAP50 | mAP50-95 |
---|---|---|---|---|---|---|
handgun | 522 | 0.917 | 0.741 | 0.820 | 0.842 | 0.577 |
long gun | 502 | 0.994 | 0.922 | 0.957 | 0.959 | 0.883 |
All | 1,024 | 0.955 | 0.832 | 0.890 | 0.900 | 0.730 |
First, we observe that for both classes (balanced in the number of test instances) the Precision is very high (above 90%), which means a very low number of FP in the detections. Although the Recall values are high for the two classes (above 74%), this metric is considerably higher for the ‘long gun’ class. This means a recall around 20% lower (i.e., proportionally more FN) for the ‘handgun’ class, which is probably caused by the smaller size and higher variability of these weapons within the test images. Consequently, due to the Recall differences between classes, the F1-score is about 14% higher for the ‘long gun’ class.
The high mAP at an IoU threshold of 0.5 (mAP50) indicates that the YOLOv5m model is very accurate for both classes (although it performs better for ‘long gun’), since mAP compares the ground-truth bounding boxes to the corresponding detected boxes and returns a score. For mAP50-95, the mAP was computed over a sequence of IoU thresholds between 0.5 and 0.95. The obtained metric is around 37% higher for ‘long guns’ than for ‘handguns’, meaning that the detection of ‘long gun’ objects is more robust and stable at inference time.
Now, in this same experiment we use a “small” scale YOLOv5 model (called YOLOv5s), in order to determine how the model scale influences the results. YOLOv5 provides five scaled versions: YOLOv5n (nano), YOLOv5s (small), YOLOv5m (medium), YOLOv5l (large), and YOLOv5x (extra large), where the width and depth of the convolution modules vary to suit specific applications and hardware requirements.
The YOLOv5s version is characterized by the following reported features (Jocher, 2020): 7.2 million parameters, model size of 14 MB (at FP16 half floating-point precision), 6.4 ms inference time per image (on the Nvidia V100 GPU) and 37.2 mAP based on the original MS COCO dataset, where all these values were achieved for default image sizes of 640 × 640.
Table 4 illustrates the achieved detection results using the ‘small’ YOLOv5s model.
Table 4. Detection results using the small YOLOv5s model.
Class | Instances | Precision | Recall | F1-score | mAP50 | mAP50-95 |
---|---|---|---|---|---|---|
handgun | 522 | 0.870 | 0.716 | 0.785 | 0.816 | 0.552 |
long gun | 502 | 0.973 | 0.928 | 0.949 | 0.961 | 0.839 |
All | 1,024 | 0.921 | 0.822 | 0.870 | 0.889 | 0.695 |
At the global level, as in the previous experiment, with YOLOv5s the class ‘long gun’ is better detected than ‘handgun’, presenting a 17% higher F1-score. As with YOLOv5m, the Precision values are very high for both classes (above 87%), with the Recall being 21% higher for ‘long gun’. Regarding the mAP metrics, both mAP50 and mAP50-95 are favourable to the class ‘long gun’, with differences of 15% and 34%, respectively.
Since YOLOv5s is nearly three times smaller than YOLOv5m in number of parameters and memory size, the detection results were expected to be much worse for YOLOv5s. Surprisingly, for all the considered detection metrics, the results are only slightly worse for YOLOv5s. Analyzing the F1-score for all test instances, we only observe a small difference between both models (0.89 for YOLOv5m vs. 0.87 for YOLOv5s). Concerning the computed mAP values for all test instances (i.e., without separating them into classes), the differences are similarly in favour of YOLOv5m: 1.1% better for mAP50 and 4.8% better for mAP50-95.
Therefore, from this experiment we conclude that the size of the analyzed models (YOLOv5s vs. YOLOv5m) has very little influence on the firearm detection results. Consequently, a “smaller” YOLO model such as YOLOv5s could also be used for the considered problem, since it yields similar results while using fewer computational resources.
We observe that the results of this first experiment show the same tendency (when comparing YOLOv5s with YOLOv5m) as the ones presented by Salahin et al. (2023) and by Rahaman et al. (2022) for caries lesions and blood cells detection, respectively.
Experiment 2: Influence of the impulse noise level
In this and the next two experiments, we analyze the effect of three types of synthetic image distortions: impulse noise (i.e., salt-and-pepper), occlusions and blur. These have been added to the test images only. The YOLOv5m model has been trained with firearm images which do not include such distortions, and we want to quantify how increasing impulse noise affects the detection performance of the network. We have also taken the results from Experiment 1 as a baseline reference for comparison purposes.
The Augmentor software (Bloice, Stocker & Holzinger, 2017) was used to produce the synthetic salt-and-pepper noise distortions on the test images for this experiment. First, we generated three subsets of test images with noise levels of 0.01, 0.02 and 0.05. Each of these subsets contains the same test images as in Experiment 1, now altered with the corresponding noise distortion. A salt-and-pepper noise of level x, where 0 ≤ x ≤ 1, means randomly setting 100⋅x percent of the pixels in the image to completely white or completely black. In some cases, the noise can also be added only as white (salt) or only as black (pepper) pixels.
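Although the Augmentor tool was used in the experiments, the corruption itself is simple to reproduce; a NumPy sketch for a uint8 (grey or colour) image could look as follows.

```python
import numpy as np

def salt_and_pepper(img, level, rng=None):
    """Set a fraction `level` (0.01, 0.02 or 0.05 in our tests) of pixels to pure white or black."""
    rng = rng if rng is not None else np.random.default_rng()
    noisy = img.copy()
    h, w = img.shape[:2]
    n = int(round(level * h * w))
    ys = rng.integers(0, h, size=n)
    xs = rng.integers(0, w, size=n)
    noisy[ys[: n // 2], xs[: n // 2]] = 255  # salt (white)
    noisy[ys[n // 2:], xs[n // 2:]] = 0      # pepper (black)
    return noisy
```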
Table 5 displays the quantitative detection results separated by impulse noise level and by firearm class. From this table, we observe similarly high precision results for both classes (slightly favourable to the ‘long gun’ class), which means a relatively low number of FP. However, for the recall metric the ‘long gun’ class is especially favored. In general, the difference in recall between classes increases and the recall value itself drops as the noise increases. In particular, as the impulse noise grows from 0.01 to 0.05, the relative recall difference between ‘long gun’ and ‘handgun’ increases from 42.9% to 80.2%. Over the same noise increase, recall decreases by 64.6% for the ‘long gun’ class and by 87.7% for the ‘handgun’ class. The observed reduction in recall for both classes with increasing noise is reflected in the corresponding drop of the F1-score values.
Table 5. Detection results for firearm classes and different noise levels.
Noise | Class | Instances | Precision | Recall | F1-score |
---|---|---|---|---|---|
Noiseless | handgun | 522 | 0.917 | 0.741 | 0.820 |
Noiseless | long gun | 502 | 0.994 | 0.922 | 0.957 |
0.01 | handgun | 522 | 0.891 | 0.408 | 0.560 |
0.01 | long gun | 502 | 0.932 | 0.715 | 0.810 |
0.02 | handgun | 522 | 0.902 | 0.213 | 0.345 |
0.02 | long gun | 502 | 0.908 | 0.570 | 0.700 |
0.05 | handgun | 522 | 0.867 | 0.050 | 0.095 |
0.05 | long gun | 502 | 0.888 | 0.253 | 0.394 |
Table 6 shows the detection results separated by noise level, this time without taking into account the firearm classes. In this case, we observe that the number of FP is very low for all noise levels (and this number decreases as the noise level increases). However, the number of FN (i.e., missed weapons) is high for all noise levels. It rises from 176 to 448 (a growth of nearly 155%) when going from noiseless images to the equivalent ones with the lowest (0.01) noise level. As the noise level rises further, the number of FN keeps increasing but at a lower rate, around 39% both from 0.01 to 0.02 and from 0.02 to 0.05. Consequently, the F1-score drops due to the noticeable decrease in recall with growing impulse noise. Regarding the mAP50 values (using IoUth = 0.5), we observe that the detection quality drops as the level of noise increases, but not as abruptly as for the F1-score.
Table 6. Instance noise detection results without separating into firearm classes.
Noise | Instances | TP | FP | FN | Precision | Recall | F1-score | mAP50 |
---|---|---|---|---|---|---|---|---|
Noiseless | 1,024 | 848 | 73 | 176 | 0.955 | 0.832 | 0.890 | 0.900 |
0.01 | 1,024 | 576 | 56 | 448 | 0.912 | 0.562 | 0.680 | 0.752 |
0.02 | 1,024 | 401 | 42 | 623 | 0.905 | 0.391 | 0.520 | 0.659 |
0.05 | 1,024 | 157 | 22 | 867 | 0.877 | 0.151 | 0.240 | 0.520 |
With this experiment, we show that for our weapon detection problem the number of object detections (either correct or erroneous) drops significantly as the level of noise rises. As the network was trained with noiseless images, we conclude that an increase in impulse noise can drastically affect its detection performance. This effect is also displayed in Fig. 4, where we illustrate some qualitative detection results for the same test image affected by the three considered synthetic levels of impulse noise. Note that for the 0.01 noise level the two handguns present in this image are correctly detected by YOLOv5m with high confidence values (0.7 and 0.9, respectively). When doubling the level of noise, the weapon previously detected with the lower confidence becomes undetected (i.e., a single FN result). The second weapon is still detected, but with a lower confidence (from 0.9 to 0.81). Finally, for the 0.05 noise level both handguns remain undetected (i.e., producing two FN results).
Experiment 3: Influence of the number and the size of occlusions
In this test, we have also included the results of Experiment 1 as a baseline reference for comparison purposes. The Roboflow tool (Dwyer & Hansen, 2022) has been used to produce synthetic occlusions on the test images for this experiment; in particular, the Cutout functionality that was introduced for YOLOv4 as a data augmentation technique. The Cutout operation randomly masks out a given number of square image regions by setting them to black. It provides two configuration settings: percent (i.e., size of each cutout region with respect to the global image) and count (i.e., number of cutouts per image). In this experiment, we have considered three occlusion variants: percent = 30% and count = 1 (referred to in our tables as 1 × 30%); percent = 15% and count = 2 (referred to as 2 × 15%); and percent = 10% and count = 3 (referred to as 3 × 10%). Note that for each of these configurations the total occluded area represents the same percentage of the original image (i.e., 30%).
Of course, there are also partial occlusions of firearms that occur when these are being held by their users. Nevertheless, their sizes are very difficult to estimate in images. Consequently, we aim to simulate the presence of these partial occlusions in a more objective way by using different occlusion sizes. We also place these occlusions at random positions within the original images, which is expected to affect the detection of the firearms contained in them.
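The distortions themselves were generated with Roboflow; a hedged re-implementation of the Cutout idea (a given number of black squares whose combined area is a fixed fraction of the image, placed at random positions) is sketched below.

```python
import numpy as np

def cutout(img, count=1, total_area=0.30, rng=None):
    """Black out `count` square patches whose combined area is `total_area` of the image."""
    rng = rng if rng is not None else np.random.default_rng()
    out = img.copy()
    h, w = img.shape[:2]
    side = int(np.sqrt(total_area / count * h * w))  # side length of each square patch
    for _ in range(count):
        y = int(rng.integers(0, max(1, h - side)))
        x = int(rng.integers(0, max(1, w - side)))
        out[y:y + side, x:x + side] = 0
    return out
```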
Table 7 shows the quantitative detection results separated by the considered Cutout occlusion types and by firearm class. From this table, we observe similarly high Precision results for both classes (slightly favourable to the ‘long gun’ class, as in the previous experiment), which means a relatively low number of FP. Unlike in the previous experiment, the Recall values are now also high for both classes (although better for the ‘long gun’ class). As a consequence, the combined F1-score follows the same tendency for both types of firearms. In general, the detection results for the two considered classes are much better than in the previous random noise experiment. Compared to non-occluded images, the largest reduction in the F1-score corresponds, for both firearm classes, to the 1 × 30% occlusion, and it is close to 20% for each class.
Table 7. Detection results for firearm classes and different occlusion degrees.
Degree | Class | Instances | Precision | Recall | F1-score |
---|---|---|---|---|---|
No occlusion | handgun | 522 | 0.917 | 0.741 | 0.820 |
No occlusion | long gun | 502 | 0.994 | 0.922 | 0.957 |
1 × 30% | handgun | 522 | 0.733 | 0.618 | 0.671 |
1 × 30% | long gun | 502 | 0.788 | 0.820 | 0.804 |
2 × 15% | handgun | 522 | 0.759 | 0.667 | 0.710 |
2 × 15% | long gun | 502 | 0.832 | 0.858 | 0.845 |
3 × 10% | handgun | 522 | 0.773 | 0.901 | 0.832 |
3 × 10% | long gun | 502 | 0.849 | 0.901 | 0.874 |
Next, Table 8 shows the detection results separated by occlusion configuration, this time without considering the firearm classes. In this case, we observe similar values of FP and FN for the three considered occlusion configurations. These values get slightly worse when a single 30% occlusion patch is present. Compared to the non-occlusion case, the most significant percentage increment corresponds to the number of FP, which rises by nearly 230% for the 1 × 30% case. The Precision and Recall metrics drop around 20% and 14%, respectively, when compared to the non-occlusion case. In the worst case (1 × 30%), the F1-score shows only a 17% reduction compared to the case without occlusions. Similar results were obtained for the mAP50 metric.
Table 8. Instance occlusion detection results without separating into firearm classes.
Degree | Instances | TP | FP | FN | Precision | Recall | F1-score | mAP50 |
---|---|---|---|---|---|---|---|---|
No occl. | 1,024 | 848 | 73 | 176 | 0.955 | 0.832 | 0.890 | 0.900 |
1 × 30% | 1,024 | 760 | 240 | 264 | 0.760 | 0.719 | 0.740 | 0.728 |
2 × 15% | 1,024 | 791 | 204 | 233 | 0.795 | 0.763 | 0.780 | 0.765 |
3 × 10% | 1,024 | 827 | 193 | 197 | 0.811 | 0.800 | 0.800 | 0.795 |
From this experiment, we conclude that the effect of random synthetic occlusions on object detection performance is much less severe, for the considered problem, than the addition of random salt-and-pepper noise. Moreover, among the three occlusion configurations tested, all of which hide the same total area of the image, the worst one in terms of detection results is 1 × 30%. This case removes a single larger portion of the image (and probably a larger portion of the target objects contained in it), which makes it more difficult for the YOLOv5m model to locate and classify the involved firearm(s), since it was trained with images that do not contain these occlusions.
Figure 5 illustrates some qualitative detection results for the same test image, affected by the three considered random synthetic occlusions: one occlusion covering 30% of the total image size, two occlusions of 15% each, and three occlusions of 10% each. Although the total occluded surface is the same in all these images, the effect on the quality of the corresponding detections (as well as on their network detection confidences) is worse in the case of one large occlusion than when using several smaller ones. Note that, as the Cutout function of Roboflow randomly chooses the position of the occlusions, in some images a large portion of the object is occluded (especially when the selected occlusion size is large), while in others the occluded section is smaller (especially when the selected occlusion size is small). Lastly, note that in some images the target object is not occluded at all.
Experiment 4: Influence of the Gaussian blur
In this experiment, we analyze the effect on firearm detection results when applying a synthetic Gaussian blur (or Gaussian smoothing) to the test images using different kernel sizes.

The blur-image online tool (Pinetools, 2022) was used to produce the synthetic Gaussian-blurred test images for this experiment. First, we generated three subsets of test images with kernel sizes of 3, 9 and 13 (i.e., gradually increasing blur levels). Each of these subsets contains the same test images of Experiment 1, now altered with the corresponding blur distortion.
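The same distortion can be reproduced with OpenCV as sketched below (the online tool was used for the actual experiments); when sigma is set to 0, it is derived automatically from the kernel size.

```python
import cv2

def gaussian_blur(img, k):
    """Apply Gaussian blur with a k x k kernel (k = 3, 9 or 13 in our tests)."""
    return cv2.GaussianBlur(img, (k, k), 0)  # sigma computed from the kernel size when 0
```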
Table 9 shows the quantitative detection results separated by blur level and by firearm class. From this table, we observe similarly high Precision results for both classes (slightly favourable to the ‘long gun’ class), which means a relatively low number of FP. However, in the case of the recall metric the differences are much larger in favour of the ‘long gun’ class (around 20%). The decrease in Recall for both classes as the blur increases is reflected in the corresponding drop of the F1-score values.
Table 9. Detection results for firearm classes and different levels of blur.
Kernel size | Class | Instances | Precision | Recall | F1-score |
---|---|---|---|---|---|
No blur | handgun | 522 | 0.917 | 0.741 | 0.820 |
No blur | long gun | 502 | 0.994 | 0.922 | 0.957 |
3 | handgun | 522 | 0.901 | 0.713 | 0.796 |
3 | long gun | 502 | 0.981 | 0.916 | 0.947 |
9 | handgun | 522 | 0.894 | 0.661 | 0.760 |
9 | long gun | 502 | 0.974 | 0.886 | 0.928 |
13 | handgun | 522 | 0.894 | 0.659 | 0.759 |
13 | long gun | 502 | 0.971 | 0.871 | 0.918 |
Next, Table 10 shows the detection results separated by blur level, this time without taking into account the firearm classes. In this case, we observe that the number of FP is low and fairly similar for all blur levels. The number of FN (i.e., missed firearms) is higher for all blur levels, but it only rises from 176 to 232 (an increase of nearly 32%) from images without blur to the corresponding ones with the highest level of blur analyzed (13). The Precision values remain very high (i.e., above 0.93 for all the levels of blur considered), while the decrease in Recall is not significant as the blur grows. Consequently, the F1-score is little affected by the blur increase. Regarding the mAP50 values with IoUth = 0.5, we observe a behavior similar to the one described for the F1-score metric.
Table 10. Instance blur detection results without separating into firearm classes.
Kernel size | Instances | TP | FP | FN | Precision | Recall | F1-score | mAP50 |
---|---|---|---|---|---|---|---|---|
No blur | 1,024 | 848 | 73 | 176 | 0.955 | 0.832 | 0.890 | 0.900 |
3 | 1,024 | 843 | 53 | 181 | 0.941 | 0.814 | 0.873 | 0.890 |
9 | 1,024 | 802 | 57 | 222 | 0.934 | 0.774 | 0.847 | 0.867 |
13 | 1,024 | 792 | 58 | 232 | 0.932 | 0.765 | 0.840 | 0.862 |
With this experiment, we show that the number of object detections (either correct or erroneous) is little affected by blur increase in the context of this particular problem. As the network was trained with images without blur, we conclude that this distortion, along with its successive increments, has little impact on the network detection performance.
This effect is also shown in Fig. 6, where we applied the three considered levels of blur to the very same test image. The ‘handgun’ in the test image remains detected, although the confidence of the detection decreases by 20% as the Gaussian kernel size varies from k = 3 to k = 13.
Experiment 5: Influence of image darkening
In this experiment, we analyze the effect on firearm detection results when applying different levels of darkening to the images. This transformation, which involves reducing the brightness or exposure of the images, simulates low ambient light conditions that are typical in surveillance scenarios or medical imaging. The darkening distortion has been implemented by applying a simple gamma correction to all image pixels using the formula:
I′_{i,j} = c · (I_{i,j})^γ (10)
where I_{i,j} and I′_{i,j} represent, respectively, the pixels (i, j) of the original and transformed images, c is a positive constant (in our case, c = 1), and γ is a positive exponent used to compensate the nonlinear response of display systems (like monitors) and the human visual system. The γ values can theoretically range from small positive values (0 < γ < 1), which brighten an image, to large values (γ > 1), which darken it. In our experimental setup, we tested three increasing γ values (1.5, 3 and 5) in order to determine how this parameter affects the firearm detection results.
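A direct implementation of Eq. (10) for 8-bit images, using a lookup table over normalized intensities for efficiency, is sketched below.

```python
import numpy as np

def darken(img, gamma, c=1.0):
    """Gamma correction of Eq. (10) applied to a uint8 image (gamma > 1 darkens it)."""
    # Normalize intensities to [0, 1], apply I' = c * I**gamma, and map back to [0, 255]
    lut = np.clip(c * (np.arange(256) / 255.0) ** gamma * 255.0, 0, 255).astype(np.uint8)
    return lut[img]
```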
Table 11 shows the quantitative detection results separated by the considered γ darkening levels and by firearm class. From this table, as in the previous experiments, we observe similarly high precision results for both classes (slightly favourable to the ‘long gun’ class), which means a relatively low number of FP. However, in the case of the recall metric the differences are much larger in favour of the ‘long gun’ class (around 21%) for all γ values, and there is also a significant drop for γ > 3. This decrease in recall for both classes as the darkening increases is reflected in the corresponding drop of the F1-score values.
Table 11. Detection results for firearm classes and different levels of darkening (gamma correction).
Gamma correction | Class | Instances | Precision | Recall | F1-score |
---|---|---|---|---|---|
No darkening | handgun | 522 | 0.917 | 0.741 | 0.820 |
No darkening | long gun | 502 | 0.994 | 0.922 | 0.957 |
1.5 | handgun | 522 | 0.912 | 0.716 | 0.802 |
1.5 | long gun | 502 | 0.987 | 0.910 | 0.947 |
3 | handgun | 522 | 0.887 | 0.630 | 0.737 |
3 | long gun | 502 | 0.976 | 0.807 | 0.883 |
5 | handgun | 522 | 0.841 | 0.506 | 0.632 |
5 | long gun | 502 | 0.959 | 0.604 | 0.741 |
Next, Table 12 shows the detection results separated by γ level but without considering the firearm classes. In this case, we observe that the number of FP is low and relatively similar for all darkening levels (being minimal for γ = 1.5). However, the number of FN (i.e., missed firearms) is considerably higher and increases with γ. The precision values remain very high (i.e., above 0.90 for all the levels of darkening considered), while the decrease in recall is significant, dropping to 0.555 for γ = 5. Consequently, the F1-score is more affected as the darkening increases (note that for γ = 5 the F1-score decreases by 23% when compared to the original non-darkened images). Regarding the mAP50 values with IoUth = 0.5, we observe a somewhat better behavior than the one exhibited by the F1-score metric (i.e., an 18% worsening).
Table 12. Instance darkening results without separating into firearm classes.
Gamma correction | Instances | TP | FP | FN | Precision | Recall | F1-score | mAP50 |
---|---|---|---|---|---|---|---|---|
No darkening | 1,024 | 848 | 73 | 176 | 0.955 | 0.832 | 0.890 | 0.900 |
1.5 | 1,024 | 833 | 44 | 191 | 0.950 | 0.813 | 0.876 | 0.889 |
3 | 1,024 | 736 | 55 | 288 | 0.931 | 0.719 | 0.811 | 0.837 |
5 | 1,024 | 568 | 63 | 456 | 0.900 | 0.555 | 0.687 | 0.739 |
With this experiment, we show that the number of object detections is noticeably affected as the darkening increases (around a 22% decrease in F1-score for both classes between no darkening and the highest γ tested). As the network was trained with images without darkening, we conclude that this distortion, along with its successive increments, has a mid-range impact on the network detection performance.
This effect is also shown in Fig. 7, where we applied the three increasing γ levels considered (γ = 1.5, 3 and 5) to darken the same test image. We can notice that the network misclassifies the type of firearm when the γ value doubles from 1.5 to 3. Note also that for γ equal to 5 the long gun present in the image is not detected.
Experiment 6: Influence of image shrinking
This experiment analyzes the effect on firearm detection when applying different shrinking factors to the images. This transformation (also known as downsampling) reduces the spatial image resolution by a given factor, simulating long distances between the camera and the target objects, which is relevant for practical applications like surveillance or industrial automation. In our approach, shrinking has been implemented by first reducing the image spatial resolution by a factor of 2 in rows and columns (i.e., the image keeps one quarter of its original pixels), and then upsampling the “reduced” image with different interpolation algorithms to recover its original spatial resolution. In particular, we apply the following pixel interpolation methods (Jakhetiya, Kumar & Tiwari, 2010): nearest neighbor (the simplest and fastest, assigning the value of the nearest pixel to new pixels, but resulting in blocky images), bilinear (which uses a 2 × 2 pixel neighborhood for smoother results but lacks fine detail), bicubic (which uses a 4 × 4 neighborhood and offers better smoothness and detail) and Lanczos (which employs larger neighborhoods and the sinc function, providing the highest quality although it is the most computationally intensive).
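The shrinking distortion can be reproduced with OpenCV as sketched below: downsample by a factor of 2 in each dimension and upsample back to the original size with the chosen interpolation method. The interpolation used for the downsampling step (INTER_AREA here) is an assumption, as the text only specifies the upsampling methods.

```python
import cv2

INTERPOLATIONS = {
    "nearest": cv2.INTER_NEAREST,
    "bilinear": cv2.INTER_LINEAR,
    "bicubic": cv2.INTER_CUBIC,
    "lanczos": cv2.INTER_LANCZOS4,
}

def shrink_and_restore(img, method="bilinear"):
    """Halve the resolution (keeping 1/4 of the pixels) and upsample back to the original size."""
    h, w = img.shape[:2]
    small = cv2.resize(img, (w // 2, h // 2), interpolation=cv2.INTER_AREA)
    return cv2.resize(small, (w, h), interpolation=INTERPOLATIONS[method])
```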
Table 13 shows the quantitative detection results for the considered interpolation methods and the firearm classes. From this table, we can observe similarly high precision results for both classes (slightly favourable to the ‘long gun’ class), regardless of the interpolation method used, which means a very low number of FP. However, regarding the recall metric we observe that, as expected, the worst results were achieved using nearest neighbor interpolation, which produces a worsening of around 44% for the ‘handgun’ class and 42% for the ‘long gun’ class. For the remaining interpolation methods, the results improve in line with their computational complexity. The effect on recall for both classes is reflected in the corresponding decrease of the F1-score values.
Table 13. Detection results for firearm classes and different levels of shrinking.
Interpolation | Class | Instances | Precision | Recall | F1-score |
---|---|---|---|---|---|
No shrinking | handgun | 522 | 0.917 | 0.741 | 0.820 |
No shrinking | long gun | 502 | 0.994 | 0.922 | 0.957 |
nearest | handgun | 522 | 0.893 | 0.418 | 0.569 |
nearest | long gun | 502 | 0.943 | 0.530 | 0.679 |
bilinear | handgun | 522 | 0.923 | 0.669 | 0.776 |
bilinear | long gun | 502 | 0.948 | 0.843 | 0.892 |
bicubic | handgun | 522 | 0.924 | 0.672 | 0.778 |
bicubic | long gun | 502 | 0.943 | 0.827 | 0.881 |
Lanczos | handgun | 522 | 0.918 | 0.642 | 0.756 |
Lanczos | long gun | 502 | 0.935 | 0.809 | 0.867 |
Next, Table 14 shows the detection results for the different interpolation algorithms used, this time without distinguishing between the two firearm classes. In this case, we observe low values of FP for the different types of interpolation. However, the total number of FN depends strongly on the interpolation algorithm, increasing by 42% for bilinear and reaching more than three times the no-shrinking value for nearest neighbor. Note that the increments produced by bicubic and Lanczos are quite similar to those of bilinear interpolation. Taking nearest neighbor interpolation as a reference, Precision only drops by 4% with respect to the no-shrinking case, while Recall drops roughly ten times more (43%). In the worst case (nearest neighbor), the F1-score and mAP50 metrics show respective reductions of 30% and 23% when compared to the case without shrinking.
Table 14. Instance shrinking results without separating into firearm classes.
Interpolation | Instances | TP | FP | FN | Precision | Recall | F1-score | mAP50 |
---|---|---|---|---|---|---|---|---|
No shrinking | 1,024 | 848 | 73 | 176 | 0.955 | 0.832 | 0.890 | 0.900 |
nearest | 1,024 | 485 | 43 | 539 | 0.918 | 0.474 | 0.625 | 0.707 |
bilinear | 1,024 | 774 | 53 | 250 | 0.936 | 0.756 | 0.836 | 0.859 |
bicubic | 1,024 | 768 | 55 | 256 | 0.933 | 0.750 | 0.832 | 0.854 |
Lanczos | 1,024 | 742 | 58 | 282 | 0.927 | 0.725 | 0.814 | 0.840 |
This effect is also shown in Fig. 8, where we applied (from left to right) the nearest neighbor, bilinear and bicubic interpolation algorithms as part of the image shrinking distortion. We can notice that the network misclassifies the weapon present in the image when using nearest neighbor interpolation. In this example, using both the bicubic and bilinear algorithms the network correctly classifies the type of firearm; however, with the bicubic algorithm the network detection confidence only increases by 6% when compared to the bilinear one (i.e., from 0.81 to 0.86).
Experiment 7: Image shrinking results on the new Localization of Firearm Carriers dataset
We performed additional experimentation using the Localization of Firearm Carriers (LFC) standard dataset (Mahmood et al., 2024). In particular, we chose the image shrinking transformation, since it realistically simulates long distances between camera and targets and is of special interest for surveillance applications. The LFC dataset consists of 3,128 images, each depicting at least one human interacting with a firearm in various scenarios. For our experiments, we only considered the ground-truth labels corresponding to the firearms (and not those corresponding to the carriers). We performed the same tests as in Experiment 6, now randomly choosing 100 test images from the LFC dataset (50 from the ‘handgun’ class and 50 from the ‘long gun’ class), which contained a total of 104 firearm instances.
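As an illustration of this test-set construction, the following sketch draws a balanced random subset of 50 images per class from hypothetical per-class folders of LFC images; the directory layout, file names and random seed are assumptions, not the original selection procedure.

```python
# Illustrative sketch: drawing a balanced random test subset (50 + 50 images)
# from the LFC dataset. Folder names and the seed are hypothetical.
import random
from pathlib import Path

random.seed(0)  # for reproducibility of the drawn subset

def sample_class(image_dir, n):
    files = sorted(Path(image_dir).glob("*.jpg"))
    return random.sample(files, n)

handgun_test = sample_class("LFC/handgun", 50)     # hypothetical folder names
long_gun_test = sample_class("LFC/long_gun", 50)
test_images = handgun_test + long_gun_test         # 100 images in total
```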
Tables 15 and 16 show the quantitative detection results for the considered interpolation methods, with and without separating into firearm classes, respectively. Note that the results achieved for the image shrinking distortion on the LFC dataset are worse than those achieved on our firearm dataset. This is mainly because the original LFC images present much darker backgrounds than the images with which the YOLO models were trained; after applying the shrinking transformation, the targets become much more difficult to detect (even for humans). Nevertheless, the relative behavior of the different interpolation algorithms is consistent between our dataset and LFC. This cross-dataset evaluation (i.e., training with one dataset and testing with a different one) supports the generalizability and robustness of the achieved results.
Table 15. Detection results for shrinking, when separating into firearm classes, on the LFC dataset.
Interpolation | Class | Instances | Precision | Recall | F1-score |
---|---|---|---|---|---|
No shrinking | handgun | 54 | 0.804 | 0.759 | 0.781 |
No shrinking | long gun | 50 | 1.000 | 0.380 | 0.551 |
nearest | handgun | 54 | 0.899 | 0.296 | 0.445 |
nearest | long gun | 50 | 0.000 | 0.000 | 0.000 |
bilinear | handgun | 54 | 0.863 | 0.667 | 0.752 |
bilinear | long gun | 50 | 0.978 | 0.400 | 0.568 |
bicubic | handgun | 54 | 0.878 | 0.667 | 0.758 |
bicubic | long gun | 50 | 1.000 | 0.500 | 0.667 |
Lanczos | handgun | 54 | 0.875 | 0.648 | 0.745 |
Lanczos | long gun | 50 | 1.000 | 0.420 | 0.592 |
Table 16. Detection results for shrinking, without separating into firearm classes, on the LFC dataset.
Interpolation | Instances | TP | FP | FN | Precision | Recall | F1-score | mAP50 |
---|---|---|---|---|---|---|---|---|
No shrinking | 104 | 59 | 6 | 45 | 0.902 | 0.570 | 0.699 | 0.765 |
nearest | 104 | 15 | 19 | 89 | 0.450 | 0.148 | 0.223 | 0.292 |
bilinear | 104 | 55 | 5 | 49 | 0.921 | 0.533 | 0.675 | 0.739 |
bicubic | 104 | 61 | 4 | 43 | 0.939 | 0.583 | 0.719 | 0.770 |
Lanczos | 104 | 56 | 4 | 48 | 0.938 | 0.534 | 0.681 | 0.742 |
Discussion
A primary objective of this study was to assess the viability of YOLO when applied to the firearm detection problem (i.e., localization and specific classification). As usual, this network was pretrained on a large dataset (MS COCO) and then adapted to our smaller firearm training database using transfer learning. To enrich our training dataset, we created five synthetic images from each original one, as explained in Section ‘Image Preprocessing and Data Augmentation’. The considered ‘medium’ YOLOv5m model has proven to be very effective for this problem, with average F1-score and mAP50 values of 0.89 and 0.90, respectively.
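As a usage illustration (not the original training or evaluation pipeline), a fine-tuned YOLOv5 checkpoint can be loaded and run on a test image through the Ultralytics PyTorch Hub interface; the checkpoint name, confidence threshold and image path below are hypothetical placeholders.

```python
# Illustrative inference with a fine-tuned YOLOv5 checkpoint via PyTorch Hub.
# 'best.pt' and the test image path are hypothetical placeholders.
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
model.conf = 0.25                      # confidence threshold for reported detections
results = model("test_firearm.jpg")    # runs detection on a single image
results.print()                        # prints class, confidence and box per detection
df = results.pandas().xyxy[0]          # detections as a pandas DataFrame
print(df[["name", "confidence"]])
```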
Due to the differences in detection models, datasets and analyzed variabilities, it was not possible to fairly compare our results with those presented by other authors. To compensate for this limitation, we included additional experimentation using the LFC dataset. Most images in our dataset are of reasonably good quality. In order to transfer our solution to real operating environments and determine its feasibility there, it is necessary to analyze the effect of the multiple variabilities that hinder the detection of the objects of interest (i.e., firearms). As pointed out, in this study we aimed to analyze the impact of each individual artificial distortion (i.e., impulse noise, occlusion, blur, darkening and shrinking) on the detection results. Experiments 2, 3, 4, 5, 6 and 7 showed that the factor that most degraded the detection performance was, by far, the incremental addition of impulse noise. Next, image shrinking, synthetic occlusions and image darkening produced an intermediate degradation of the detection performance. Increasing levels of Gaussian blur had the lowest impact among all considered distortions. To summarize this, Table 17 shows the worst average F1-score and mAP50 performance degradation (i.e., without considering the classes) for each type of artificial distortion considered in the specific experiments; a small computation sketch follows the table.
Table 17. Worst F1-score and mAP50 average performance degradation for the corresponding “worst” variability considered in each of the experiments.
Variability | Worst level | Experiment No. | F1-score Degradation (%) | mAP50 Degradation (%) |
---|---|---|---|---|
Impulse noise | 0.05 | 2 | 270.8 | 73.1 |
Shrinking | nearest | 6 | 29.8 | 21.4 |
Occlusion | 1 × 30% | 3 | 20.3 | 23.6 |
Darkening | 5 | 5 | 22.8 | 19.9 |
Blur | k = 13 | 4 | 5.6 | 4.2 |
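Assuming degradation is measured relative to the undistorted baseline, it can be computed as in the short sketch below, which reproduces the shrinking row of Table 17 from the aggregate values of Table 14; the exact convention used for the other rows is not restated here.

```python
# Relative performance degradation with respect to the no-distortion baseline.
def degradation(baseline, distorted):
    return 100.0 * (baseline - distorted) / baseline

# Shrinking (nearest neighbor), aggregate values from Table 14:
print(f"F1 degradation:    {degradation(0.890, 0.625):.1f}%")   # ~29.8%
print(f"mAP50 degradation: {degradation(0.900, 0.707):.1f}%")   # ~21.4%
```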
Although the YOLOv5 network was able to successfully detect most of the firearms in the test images, there were some limiting cases where weapons were not detected, wrongly detected or misclassified. Fig. 9 illustrates two examples of incorrect detections. In the left image, a ‘handgun’ was not detected at all (i.e., producing a FN), and in the right image a detected weapon was wrongly classified as ‘long gun’ instead of ‘handgun’. Note that such errors happen more frequently in the presence of a high degree of impulse noise in the images.
Conclusions
This work described an experimental study on the individual impact of selected artificial image distortions (namely impulse noise, occlusions, Gaussian blur, image darkening and image shrinking) on the firearm detection problem. For this purpose, we established a “reference” detector (the ‘medium’ scale YOLOv5 architecture, referred to as YOLOv5m) and two datasets that include annotated firearm images (classified respectively as ‘handguns’ and ‘long guns’) from different sources.
First, we showed that YOLOv5 is a highly effective architecture for the considered detection problem. Among the distortions analyzed, the one that most severely worsens detection performance is the incremental synthetic addition of salt-and-pepper impulse noise to the test images, given that the model was trained on noiseless images. Image shrinking, occlusions and image darkening have a moderate impact on the firearm detection performance, whereas the impact of added Gaussian blur is not significant.
As future work, we plan to perform new experiments that include other types of artificial distortions on the test images, for example those produced by out-of-focus and motion-blur conditions. Another interesting analysis is to quantify the detection impact of several combined variabilities (e.g., impulse noise with occlusions) by creating test images containing several types of distortions. Finally, we also aim to propose firearm-specific modifications to the YOLO architecture used in order to improve detection accuracy.
Funding Statement
This research was supported by State R+D+i Programme, Spanish Ministry of Science, Innovation and Universities, Grant no. PID2021-124064OB-I00. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Additional Information and Declarations
Competing Interests
The authors declare there are no competing interests.
Author Contributions
Patricia Corral-Sanz performed the experiments, performed the computation work, authored or reviewed drafts of the article, and approved the final draft.
Alvaro Barreiro-Garrido analyzed the data, authored or reviewed drafts of the article, and approved the final draft.
A. Belen Moreno conceived and designed the experiments, analyzed the data, prepared figures and/or tables, and approved the final draft.
Angel Sanchez conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.
Data Availability
The following information was supplied regarding data availability:
The code is available at Zenodo: Patricia Corral Sanz. (2024). alvarobarreiro/WeaponDetectionYOLOv5: august2024 (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.13227907.
References
- Basit et al. (2020).Basit A, Munir M, Ali M, Werghi N, Mahmood A. Localizing firearm carriers by identifying human-object pairs. Proceedings of the IEEE International Conference on Image Processing (ICIP); Piscataway. 2020. pp. 2031–2035. [Google Scholar]
- Bhatti et al. (2021).Bhatti M, Khan MG, Aslam M, Fiaz M. Weapon detection in real-time CCTV videos using deep learning. IEEE Access. 2021;9:34366–34382. doi: 10.1109/ACCESS.2021.3059170. [DOI] [Google Scholar]
- Bloice, Stocker & Holzinger (2017).Bloice M, Stocker C, Holzinger A. Augmentor: an image augmentation library for machine learning. Journal of Open Source Software. 2017;2:432. doi: 10.21105/joss.00432. [DOI] [Google Scholar]
- Buckchash & Raman (2017).Buckchash H, Raman B. A robust object detector: application to detection of visual knives. IEEE international conference on multimedia & expo workshops (ICMEW); 2017. pp. 633–638. [DOI] [Google Scholar]
- Castillo et al. (2019).Castillo A, Tabik S, Pérez F, Olmos R, Herrera F. Brightness guided preprocessing for automatic cold steel weapon detection in surveillance videos with deep learning. Neurocomputing. 2019;330:151–161. doi: 10.1016/j.neucom.2018.10.076. [DOI] [Google Scholar]
- Darker, Gale & Blechko (2008).Darker I, Gale A, Blechko A. CCTV as an automated sensor for firearms detection: Human-derived performance as a precursor to automatic recognition. Proc. SPIE. 2008;7112:71120V. doi: 10.1117/12.800264. [DOI] [Google Scholar]
- Debnath & Bhowmik (2021).Debnath R, Bhowmik M. A comprehensive survey on computer vision based concepts, methodologies, analysis and applications for automatic gun/knife detection. Journal of Visual Communication and Image Representation. 2021;78:103165. doi: 10.1016/j.jvcir.2021.103165. [DOI] [Google Scholar]
- Depositphotos (2009).Depositphotos Inc. Depositphotos. 2009. https://depositphotos.com
- Dodge & Karam (2016).Dodge S, Karam L. Understanding how image quality affects deep neural networks. Technical report arXiv:1604.04004v2; 2016.
- Dwyer & Hansen (2022).Dwyer B, Hansen J. Roboflow (Version 1.0) 2022. https://roboflow.com
- Gelana & Yadav (2019).Gelana F, Yadav A. Firearm detection from surveillance cameras using image processing and machine learning techniques. In: Tiwari S, editor. Smart innovations in communication and computational sciences. Springer; Singapore: 2019. [Google Scholar]
- Glowacz, Kmieć & Dziech (2015).Glowacz A, Kmieć M, Dziech A. Visual detection of knives in security applications using active appearance models. Multimedia Tools and Applications. 2015;74:56416–56429. [Google Scholar]
- Grega et al. (2016).Grega M, Matiolański A, Guzik P, Leszczuk N. Automated detection of firearms and knives in a cctv image. Sensors. 2016;16:47. doi: 10.3390/s16010047. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Iqbal et al. (2021).Iqbal J, Munir M, Mahmood A, Ali A, Ali M. Leveraging orientation for weakly supervised object detection with application to firearm localization. Neurocomputing. 2021;440:310–320. doi: 10.1016/j.neucom.2021.01.075. [DOI] [Google Scholar]
- Jakhetiya, Kumar & Tiwari (2010).Jakhetiya V, Kumar A, Tiwari A. A survey on image interpolation methods. Proceedings of the SPIE Second International Conference on Digital Image Processing, vol. 7546; 2010. p. 75461T. [Google Scholar]
- Jocher (2020).Jocher G. Ultralytics/yolov5: v3.1—bug fixes and performance improvements. 2020 doi: 10.5281/zenodo.4154370. [DOI]
- Khan et al. (2023).Khan NS, Ogura K, Cosatto E, Ariyoshi M. Real-time concealed weapon detection on 3D radar images for walk-through screening system. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); 2023. pp. 673–681. [Google Scholar]
- Kmiec & Glowacz (2011).Kmiec M, Glowacz A. An approach to robust visual knife detection. Machine Graphics and Vision. 2011;20:215–227. [Google Scholar]
- Lai & Maples (2017).Lai J, Maples S. Developing a real-time gun detection classifier. Course CS231n, Stanford University; 2017.
- Lin et al. (2014).Lin T-Y, Maire M, Belongie S, Bourdev L, Girshick R, Hays J, Perona P, Ramanan D, Zitnick C, Dollár P. Microsoft COCO: common objects in context. Technical report arXiv:1405.0312; 2014.
- Mahmood et al. (2024).Mahmood A, Basit A, Akhtar Munir M, Ali M. Detection and localization of firearm carriers in complex scenes for improved safety measures. IEEE Transactions on Computational Social Systems. 2024;11:3900–3910. doi: 10.1109/TCSS.2023.3312335. [DOI] [Google Scholar]
- Moran, Conci & Sanchez (2022).Moran M, Conci A, Sanchez A. In: Detection of knives in complex scenes. ICT Applications for Smart CitiesSappa AD, editor. Springer; Cham, Switzerland: 2022. pp. 57–77. [Google Scholar]
- Olmos, Tabik & Herrera (2018).Olmos R, Tabik S, Herrera F. Automatic handgun detection alarm in videos using deep learning. Neurocomputing. 2018;275:66–72. doi: 10.1016/j.neucom.2017.05.012. [DOI] [Google Scholar]
- Padilla et al. (2021).Padilla R, Passos W, Dias T, Netto S, da Silva E. A comparative analysis of object detection metrics with a companion open-source toolkit. Electronics. 2021;10:279. doi: 10.3390/electronics10030279. [DOI] [Google Scholar]
- Paulter (2015).Paulter N. Guide to the technologies of concealed weapon and contraband imaging and detection. US Department of Justice, NIJ Guide 602-00; Rockville, MD, USA: 2015. [Google Scholar]
- Pinetools (2022).Pinetools blur-image. 2022. https://pinetools.com/blur-image
- Pixabay (2023).Pixabay. 2023. https://pixabay.com
- Rahaman et al. (2022).Rahaman M, Ali M, Ahmed K, Bui F, Mahmud S. Performance analysis between YOLOv5s and YOLOv5m model to detect and count blood cells: deep learning approach. 2022: 2nd International Conference on Computing Advancements (ICCA ’22); 2022. pp. 316–322. [DOI] [Google Scholar]
- Redmon et al.(2016).Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. Proc. IEEE Intl. Conf. on Computer Vision and Pattern Recognition (CVPR); Piscataway. 2016. pp. 2031–2035. [Google Scholar]
- Ruiz-Santaquiteria et al. (2023).Ruiz-Santaquiteria J, Velasco-Mata A, Vallez N, Deniz O, Bueno G. Improving handgun detection through a combination of visual features and body pose-based data. Pattern Recognition. 2023;136:109252. doi: 10.1016/j.patcog.2022.109252. [DOI] [Google Scholar]
- Salahin et al. (2023).Salahin S, Ullaa M, Ahmed S, Mohammed N, Farook T, Dudley J. One stage methods of computer vision object detection to classify carious lesions from smartphone imaging. Oral. 2023;3:176–190. doi: 10.3390/oral3020016. [DOI] [Google Scholar]
- Salido et al. (2021).Salido J, Lomas V, Ruiz-Santaquiteria J, Deniz O. Automatic handgun detection with deep learning in video surveillance images. Applied Sciences. 2021;11:6085. doi: 10.3390/app11136085. [DOI] [Google Scholar]
- Sanchez et al. (2016).Sanchez A, Moreno A, Velez D, Velez J. Analyzing the influence of contrast in large-scale recognition of natural images. Integrated Computer-Aided Engineering. 2016;23:221–235. doi: 10.3233/ICA-160516. [DOI] [Google Scholar]
- Shutterstock (2003).Shutterstock Inc. Shutterstock. 2003. https://www.shutterstock.com
- Terven & Cordova-Esparza (2023).Terven J, Cordova-Esparza D. A comprehensive review of YOLO: from YOLOv1 and beyond. Technical report arXiv:2304.00501v4; 2023.
- Tiwari & Verma (2015).Tiwari R, Verma G. A computer vision based framework for visual gun detection using Harris interest point detector. Procedia Computer Science. 2015;54:703–712. doi: 10.1016/j.procs.2015.06.083. [DOI] [Google Scholar]
- Venkataramanan et al. (2022).Venkataramanan A, Facktor M, Gupta P, Bovik A. Assessing the impact of image quality on object-detection algorithms. Electronic Imaging. 2022;34(9):334-1–334-6. doi: 10.2352/EI.2022.34.9.IQSP-334. [DOI] [Google Scholar]
- Wang et al. (2021).Wang Z-Z, Xie K, Zhang X-Y, Chen H-Q, Wen C, He J-B. Small-object detection based on YOLO and dense block via image super-resolution. IEEE Access. 2021;9:56416–56429. doi: 10.1109/ACCESS.2021.3072211. [DOI] [Google Scholar]
- Xu (2021).Xu J. A deep learning approach to building an intelligent video surveillance system. Multimedia Tools and Applications. 2021;80:5495–5515. doi: 10.1007/s11042-020-09964-6. [DOI] [Google Scholar]
- Yadav, Gupta & Sharma (2023).Yadav P, Gupta N, Sharma P. A comprehensive study towards high-level approaches for weapon detection using classical machine learning and deep learning methods. Expert Systems with Applications. 2023;212:118698. doi: 10.1016/j.eswa.2022.118698. [DOI] [Google Scholar]
- Zou et al. (2019).Zou Z, Shi Z, Guo Y, Ye J. Object detection in 20 years: a survey. Technical report arXiv:1905.05055; 2019.