Autonomous face mask detection using single shot multibox detector, and ResNet-50 with identity retrieval through face matching using deep siamese neural network

S Vignesh Baalaji; S Sandhya; S A Sajidha; V M Nisha; M D Vimalapriya; Amit Kumar Tyagi

doi:10.1007/s12652-023-04624-7

2023 Jun 7:1–11. Online ahead of print.

PMCID: PMC10246526 PMID: 37360778

Abstract

The COVID-19 pandemic poses a global health challenge. The World Health Organization states that face masks are effective, especially in public areas. Real-time monitoring of face mask use is challenging and exhausting for humans. To reduce human effort and to provide an enforcement mechanism, we propose an autonomous system that uses computer vision to detect non-masked people and retrieve their identity. The proposed method fine-tunes the pre-trained ResNet-50 model with a new head layer to classify people as masked or non-masked. The classifier is trained using the Adam (adaptive moment estimation) optimization algorithm with a decaying learning rate and binary cross-entropy loss. Data augmentation and dropout regularization are employed to improve convergence. When the classifier is applied to video in real time, a Caffe face detector model based on the Single Shot MultiBox Detector extracts the face regions of interest from each frame, and the trained classifier is applied to these regions to detect non-masked people. The faces of these people are then captured and passed to a deep siamese neural network based on the VGG-Face model for face matching. The captured faces are compared with reference images from a database by extracting features and calculating the cosine distance. If the faces match, the details of that person are retrieved from the database and displayed on a web application. The trained classifier achieved 99.74% accuracy, and the identity retrieval model achieved 98.24% accuracy.

Keywords: COVID-19 pandemic, Mask detection, Deep learning, Computer vision, Face matching, Deep siamese neural networks

Introduction

In the new period of COVID-19, multidisciplinary efforts have been coordinated to hinder the pandemic’s spread. As mentioned by Rahmani and Mirmahaleh (2020) the coronavirus disease is a significant public health and economic problem in today's world due to the virus's adverse impact on people's lives, causing acute respiratory illnesses, fatalities, and financial crashes worldwide. Employing Artificial Intelligence is extremely crucial during the pandemic due to the accessibility of huge statistical data and specifically the lack of expertise in this domain. Deep Learning aids in the analysis and the prediction of a new wave of COVID-19. Furthermore, it could also be used for predicting the effects and occurrence of the new variants of the virus by providing meaningful insights that might help in restricting the spread of the virus.

The Centers for Disease Control and Prevention (2020) state that face masks are an essential tool to combat the COVID-19 pandemic and can greatly reduce the spread of the disease, especially when used universally in communities. Though the use of face masks has surged, especially in public areas, many people still do not wear them regularly, which creates a need for enforcement monitoring systems. Without protection, these people pose a threat to themselves, their loved ones, and the people who commute and share a common space with them. As many countries require commuters to wear face masks, detecting masked and unmasked faces becomes crucial for facial recognition applications that are extensively used for verification (Rana and Kisku 2021) and authentication (Vasanthi and Seetharaman 2020) of a person’s identity, as mentioned by Wu et al. (2020). In this research, we have implemented a deep transfer learning model that detects non-masked people in a video and retrieves their identity information autonomously through face matching with a deep siamese neural network.

The proposed system was developed to address one of the most challenging situations humankind has faced in this century. Though the use of masks has been encouraged continuously since the pandemic was declared, many people have not appreciated the gravity of the situation, and the number of infections is still surging. The model presented in this research can be combined with security cameras, as in Karaman et al. (2021), to deter the transmission of COVID-19 by identifying individuals who are not wearing face masks. This model can also be used in the post-COVID-19 era in environments where facial coverings are mandatory.

The key contribution of this research is a novel and efficient approach that uses a pretrained ResNet-50 model with a customized head to classify people with and without face masks in images and live video streams. With this fine-tuning approach, the training time of the model is reduced significantly and high performance is achieved with a small dataset, thereby addressing the data scarcity problem. In addition, a completely automated face matching mechanism based on a deep siamese neural network is employed to identify the non-masked individuals, considerably reducing human effort. The remainder of the manuscript is organized as follows: Sect. 2 presents a survey of related works, and Sect. 3 describes the dataset. The proposed model is explained in Sect. 4, followed by the performance analysis and conclusion in Sects. 5 and 6, respectively.

Related works

He et al. (2016) presented a residual learning approach that eases the training of deeper networks. An ensemble of these residual nets achieved a top-5 error of 3.57% on the ImageNet dataset, underlining the importance of deep representations for many visual recognition tasks. Ejaz et al. (2019) have presented an approach that uses Principal Component Analysis (PCA) and a Nearest Neighbor (NN) classifier distance to detect masked and unmasked faces, achieving an overall average accuracy of 83.5%. Jiang and Fan (2020) have suggested “RetinaFaceMask”, which adds a background attention module to focus on face mask detection. When trained and tested on the Public Face Mask dataset, it achieves 1.5% higher precision and 5.9% higher recall for mask detection than the baseline. Loey et al. (2020) have proposed a hybrid deep learning model with a ResNet-50 feature extractor and an ensemble machine learning classifier for classifying whether a given image contains a face mask, which secured an accuracy of 99.64% on a subset of the RMFD dataset. Loey et al. (2021) have proposed a method using YOLOv2 and ResNet-50 to detect face masks on a custom dataset of 1415 images, where it achieves 81% average precision. Nagrath et al. (2021) proposed SSDMNV2, which combines SSD and MobileNetV2 for face mask detection, uses a custom-made simulated face mask dataset, and has secured an accuracy of 92.64%.

The novelty of our research is that it proposes an efficient transfer learning model using the pretrained ResNet-50 model for the detection of face masks of all types, along with a completely autonomous deep neural network methodology for obtaining the identity of the people without face masks. The above-discussed methods can be further improved in terms of performance by achieving better accuracy. Additionally, these methods do not introduce an enforcement system that can be used for extracting information about the people without face masks. This can aid the governments or organizations to enforce a need to wear face masks which is crucial in preventing the transmission of COVID-19, especially in densely populated spaces.

Dataset description

In this method, we have used a custom dataset consisting of 3835 real-world images of people with and without face masks. These images were captured under different orientations and lighting conditions and include people wearing various types of face masks, such as medical and cloth face masks of different colors. The dataset is divided into two classes: with_mask, containing 1916 images, and without_mask, containing 1919 images. The pictures for our dataset were collected from the RMFD Dataset (Huang 2020) and the Kaggle Dataset (Larxel 2020), and are made publicly available for research purposes at https://drive.google.com/file/d/11MpxaGwqKbppNW2KVhMK9qLyVvDvi6tv/view?usp=sharing. A bar plot of the class distribution of our dataset is given in Fig. 1A. Sample images from the with_mask and without_mask classes are given in Fig. 1B and C, respectively.

Fig. 1 — Distribution and sample images from our custom dataset on which our classifier is trained and tested. (A) Distribution of the dataset based on class with 1916 images under ‘with_mask’ and 1919 images under ‘without_mask’ class, (B) with_mask class, (C) without_mask class

Proposed system

This section provides a detailed explanation of how our proposed method works. It can be broadly organized into two steps:

Predicting whether a person is wearing a mask from images and live video streams.
Retrieving the identity of a person not wearing a mask through face matching using a deep siamese neural network.

The prediction step can be further divided into a training phase and a detection phase. In the training phase, transfer learning is used to train a deep learning model with a new customized head layer that classifies faces as with or without a mask; in the detection phase, the trained model is applied to faces found in video frames. Finally, facial matching is performed on the faces that are classified as ‘no-mask’ to retrieve their identities. Figure 2 presents the architecture diagram of our methodology.

Training phase

The training phase deals with the actual training process of the face mask classifier through finetuning the pretrained ResNet-50 model. We have considered fine-tuning the pre-trained ResNet-50 architecture, as this transfer learning methodology (Wang et al. 2021) would help us reduce the training time and maximize the model's accuracy simultaneously. The rest of the section discusses the pre-processing techniques, data augmentation methodologies, learning rate decay schedule, the configuration of the new fully-connected head, and the training and compilation process, respectively. A detailed diagrammatic representation of the training phase is given in Fig. 3.

Fig. 3 — Architecture of training methodology

Data preprocessing

Preprocessing plays a vital role, as the trained model's effectiveness depends heavily on the dataset's quality. This phase prepares the dataset for use with the ResNet-50 model. The dataset is loaded, and the images are resized to 224 × 224 pixels and converted into arrays. Each array holds values in the range 0 to 255 per pixel, corresponding to the intensity of red, green, and blue in that pixel. The channels are reordered from RGB to BGR, and each color channel is “zero-centered” with respect to the ImageNet dataset, without scaling, and stored as a NumPy array of type float32. Finally, one-hot encoding is applied to the class labels so that they can be used as targets for training the ResNet-50 model.
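The channel reordering, zero-centering, and label encoding described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the mean values are the ImageNet BGR channel means commonly used with Caffe-style ResNet-50 preprocessing (an assumption, since the text does not list them), and the function names are hypothetical.

```python
# ImageNet per-channel means in BGR order (assumed; not stated in the paper).
IMAGENET_BGR_MEAN = (103.939, 116.779, 123.68)


def preprocess_pixel(rgb):
    """Convert one RGB pixel (values 0-255) to zero-centered BGR floats."""
    r, g, b = (float(c) for c in rgb)
    bgr = (b, g, r)  # reorder channels RGB -> BGR
    # Subtract the per-channel mean; there is no division, i.e. no scaling.
    return tuple(c - m for c, m in zip(bgr, IMAGENET_BGR_MEAN))


def preprocess_image(image):
    """Apply preprocess_pixel to a nested list where image[row][col] = (R, G, B)."""
    return [[preprocess_pixel(px) for px in row] for row in image]


def one_hot(label, classes=("with_mask", "without_mask")):
    """Encode a class name as a one-hot target vector."""
    return [1.0 if label == c else 0.0 for c in classes]


tiny_image = [[(255, 0, 0), (0, 0, 0)]]  # 1 x 2 image: one red, one black pixel
processed = preprocess_image(tiny_image)
target = one_hot("without_mask")
```

In practice the resizing to 224 × 224 and the array conversion would be handled by an image library before this step; they are omitted here to keep the sketch dependency-free.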

Data augmentation

Data augmentation includes a variety of techniques that modify the original data by applying random transformations. The primary goal of data augmentation is to enable the model to learn more complex features and to improve the model's generalizability. In our case, we apply in-place data augmentation during the training process, so that at each epoch the neural network encounters images it has not seen before. The data augmentation process in our method uses transformations such as rotation, zoom, width shift, height shift, shear, and horizontal flip.

Learning rate decay schedule

The training process follows a learning rate decay schedule that can be viewed as a two-step process, where it tries to arrive at some good weight values with a large learning rate and then tries finding optimal weight values with a smaller learning rate. The decay for our method is given in Eq. (1).

Decay = \frac{α_{initial}}{Epochs}

In our method, the learning rate is initialized as $1 \times 10^{-4}$ (0.0001) and the number of epochs is set to 20. The learning rate update formula is given in Eq. (2).

α = α_{initial} \times \frac{1}{1 + (Decay \times Iterations)}

where, $α_{initial}$ represents the initial learning rate, and iterations (the total number of steps per epoch) can be calculated using the formula in Eq. (3).

Iterations = \frac{Number of training examples}{Batch size}
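Eqs. (1) to (3) can be evaluated numerically as a quick check. The sketch below uses the initial learning rate, epoch count, and batch size stated in the text. Two things are assumptions: the training-set size of 3068 (80% of the 3835 images), and the iteration counter accumulating across epochs, which is how time-based decay is usually applied.

```python
import math

INITIAL_LR = 1e-4  # initial learning rate from the text
EPOCHS = 20        # number of epochs from the text
BATCH_SIZE = 32    # batch size from the training section
N_TRAIN = 3068     # assumption: 80% of the 3835 images


def decay_rate(initial_lr=INITIAL_LR, epochs=EPOCHS):
    """Eq. (1): decay = initial learning rate / epochs."""
    return initial_lr / epochs


def steps_per_epoch(n_train=N_TRAIN, batch_size=BATCH_SIZE):
    """Eq. (3), rounded up so that a final partial batch counts as a step."""
    return math.ceil(n_train / batch_size)


def learning_rate(iterations, initial_lr=INITIAL_LR, decay=None):
    """Eq. (2): learning rate after the given number of update steps."""
    if decay is None:
        decay = decay_rate(initial_lr)
    return initial_lr / (1.0 + decay * iterations)


for epoch in (1, 10, 20):
    steps = epoch * steps_per_epoch()
    print(f"epoch {epoch:2d}, {steps:4d} steps: lr = {learning_rate(steps):.6e}")
```

Under these assumptions the decay constant is very small, so the learning rate falls by only about one percent over the 20 epochs.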

Construction of new head layer for classification

To fine-tune the pre-trained ResNet-50 for our classification problem, a new fully-connected head is constructed. This head layer takes the extracted feature vector from the pre-trained ResNet-50 model as its input. The head layer consists of an Average Pooling layer to reduce image dimensions, with a pool size of 7 × 7, followed by a flatten layer that converts this two-dimensional feature matrix into a vector, to be passed on to a fully connected layer. Then a dense layer with ReLU activation function is constructed. The ReLU function is given in Eq. (4).

R (z) = \begin{cases} z, & z > 0 \\ 0, & z \leq 0 \end{cases}

where, R represents the Rectified Linear Unit (ReLU) Function, and z denotes the input vector.

Dropout regularization with a keep probability of 0.5 is introduced to regularize the neural network during the training process, thereby preventing overfitting. Then the final dense layer with the SoftMax activation function is appended. The SoftMax function is given in Eq. (5).

σ {(\vec{z})}_{i} = \frac{e^{z_{i}}}{\sum_{j = 1}^{K} e^{z_{j}}}

where $σ$ represents the SoftMax function, $\vec{z}$ denotes the input vector, $e^{z_{i}}$ is the exponential of the i-th element of the input vector, K denotes the number of classes in the multiclass classifier, and the sum in the denominator runs over the exponentials $e^{z_{j}}$ of all K elements, so that the outputs sum to 1.
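The building blocks of the head (average pooling, ReLU from Eq. (4), dropout, and SoftMax from Eq. (5)) can be written out in plain Python. This is an illustrative sketch, not the Keras layers the authors used. It assumes the ResNet-50 feature map is 7 × 7 spatially for a 224 × 224 input, so that 7 × 7 average pooling reduces each channel to one value, and it uses the common "inverted" form of dropout.

```python
import math
import random


def global_average_pool(feature_map):
    """Average each channel over all spatial positions.
    feature_map[row][col] is a list of channel values."""
    n = sum(len(row) for row in feature_map)
    channels = len(feature_map[0][0])
    return [sum(px[c] for row in feature_map for px in row) / n
            for c in range(channels)]


def relu(z):
    """Eq. (4), applied element-wise to a list of floats."""
    return [x if x > 0 else 0.0 for x in z]


def dropout(z, keep_prob=0.5, training=True, rng=random):
    """Inverted dropout: during training, keep each unit with probability
    keep_prob and rescale survivors by 1 / keep_prob; identity at inference."""
    if not training:
        return list(z)
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in z]


def softmax(z):
    """Eq. (5); subtracting max(z) avoids overflow without changing the result."""
    shift = max(z)
    exps = [math.exp(x - shift) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


# Two output scores (mask, no-mask) turned into class probabilities.
probabilities = softmax([2.0, -1.0])
```

The dense layers themselves are omitted because their sizes are not given in the text; they would sit between the pooling and ReLU steps and before the SoftMax.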

Training and compilation

ResNet-50 (He et al. 2016), pre-trained on the ImageNet dataset, is used for our classification task. For fine-tuning, the Keras implementation of ResNet-50 is loaded with its ImageNet weights but without its head, since that head was trained to classify the ImageNet classes. The newly constructed fully connected head is appended in the old head's place to fine-tune the ResNet-50 model for our classification problem. Only the weights of the new fully connected head are tuned; the other layers are frozen so that their weights are not updated, which avoids unnecessary training overhead. Our model employs the Adam (adaptive moment estimation) optimization algorithm (Kingma and Ba 2014) with a decaying learning rate and binary cross-entropy loss. A batch size of 32 is used for training. The Adam weight update rule is given in Eq. (6), and the binary cross-entropy loss function is given in Eq. (9).

θ_{t} = θ_{t - 1} - α \frac{{\hat{m}}_{t}}{\sqrt{{\hat{v}}_{t}} + ϵ}

where $α$ denotes the step size, $θ$ denotes the parameters, the formulas for ${\hat{m}}_{t}$ and ${\hat{v}}_{t}$ are given in Eqs. (7) and (8), respectively, and $ϵ$ is a small constant introduced for numerical stability.

{\hat{m}}_{t} = \frac{m_{t}}{1 - β_{1}^{t}}

{\hat{v}}_{t} = \frac{v_{t}}{1 - β_{2}^{t}}

where ${\hat{m}}_{t}$ and ${\hat{v}}_{t}$ denote the bias-corrected estimators for the first ($m_{t}$) and second ($v_{t}$) moments, respectively, and $β_{1}$ and $β_{2}$ represent the exponential decay rates for the moment estimates.
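A single Adam update following Eqs. (6) to (8) can be sketched for one scalar parameter. The recurrences for $m_{t}$ and $v_{t}$ are not written out in the text; the sketch uses the standard ones from Kingma and Ba (2014). The default values for $β_{1}$, $β_{2}$, and $ϵ$ are that paper's suggested defaults and are assumed here, not taken from this study.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (theta, m, v).
    t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad         # first-moment estimate m_t
    v = beta2 * v + (1.0 - beta2) * grad * grad  # second-moment estimate v_t
    m_hat = m / (1.0 - beta1 ** t)               # Eq. (7)
    v_hat = v / (1.0 - beta2 ** t)               # Eq. (8)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)  # Eq. (6)
    return theta, m, v


# Toy usage: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=0.01)
```

At the first step the bias correction makes the update magnitude approximately equal to the step size, regardless of the gradient's scale.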

J (θ) = - \frac{1}{m} \sum_{i = 1}^{m} \left[ y_{i} \log ({\hat{y}}_{i}) + (1 - y_{i}) \log (1 - {\hat{y}}_{i}) \right]

where J denotes the binary cross-entropy loss function on parameters $θ$, ${\hat{y}}_{i}$ represents the value predicted by the model, $y_{i}$ is the target value, and m is the number of samples over which the loss is averaged.
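The loss in Eq. (9) can be computed directly, as in the following minimal sketch. The clipping constant `eps` is an assumption added for numerical safety and is not part of the equation.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Eq. (9): mean binary cross-entropy over m samples.
    Predictions are clipped to [eps, 1 - eps] so that log(0) never occurs."""
    m = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / m


# Confident correct predictions give a lower loss than confident wrong ones.
good = binary_cross_entropy([1, 0], [0.9, 0.1])
bad = binary_cross_entropy([1, 0], [0.1, 0.9])
```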

Detection phase

This section describes the deployment of the trained and fine-tuned classifier. It first covers the extraction of face ROIs using the OpenCV DNN face detection model, which uses the Single Shot MultiBox Detector (SSD) with a ResNet-10 backbone network. SSD provides a better frame rate by using multiscale feature maps and is therefore preferred over other object detection algorithms for our purpose. An experiment conducted by Liu et al. (2016) shows that SSD with an input size of 300 × 300 outperforms YOLO, which takes a 448 × 448 input. The section then explains how the detection phase uses the trained classifier to make real-time detections. Figure 4 depicts the architecture of the detection phase.

Fig. 4 — Architecture of detection mechanism

Extraction of face ROIs using single shot multibox detector (SSD-ResNet-10)

Before the trained model is applied to a live video stream to classify faces as with or without masks, the faces in each frame of the video must be detected. This is done using a Caffe-based face detection model from the OpenCV Deep Neural Network module; its speed and ability to process millions of images quickly make the Caffe model well suited to fast-track development in research and industrial settings. The model locates faces in a live video stream: each frame is read, resized to 300 × 300 pixels, and converted into a blob, from which the face ROIs are extracted, giving the coordinates of the faces in each frame. Only face detections with a confidence score above 50% are considered for classification. The SSD-ResNet-10 model was trained on images from the internet, but the actual data source is not disclosed by OpenCV. The model is available in two versions, an “eight-bit quantized version” using TensorFlow and a “floating point sixteen version” of the original Caffe implementation; we use the latter.
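The post-processing of the detector output described above, thresholding on confidence and converting the result to pixel coordinates, can be sketched without the detector itself. The row layout below (image id, class id, confidence, then normalized corner coordinates) is the usual SSD detection-output format and is an assumption here, as is the function name.

```python
def extract_face_boxes(detections, frame_width, frame_height, min_confidence=0.5):
    """Keep detections above the confidence threshold and convert their
    normalized corners to integer pixel coordinates clipped to the frame.

    Each detection row is assumed to be
    (image_id, class_id, confidence, x1, y1, x2, y2) with corners in [0, 1].
    """
    boxes = []
    for _, _, confidence, x1, y1, x2, y2 in detections:
        if confidence <= min_confidence:
            continue  # weak detection, skipped
        left = max(0, int(x1 * frame_width))
        top = max(0, int(y1 * frame_height))
        right = min(frame_width - 1, int(x2 * frame_width))
        bottom = min(frame_height - 1, int(y2 * frame_height))
        if right > left and bottom > top:  # discard degenerate boxes
            boxes.append((left, top, right, bottom, confidence))
    return boxes


# Made-up detector output for a 640 x 480 frame.
rows = [
    (0, 1, 0.90, 0.25, 0.25, 0.50, 0.50),  # strong detection, kept
    (0, 1, 0.30, 0.00, 0.00, 1.00, 1.00),  # weak detection, dropped
    (0, 1, 0.80, -0.10, 0.00, 1.20, 0.50), # partly outside the frame, clipped
]
faces = extract_face_boxes(rows, 640, 480)
```

Clipping matters because the detector can return corners slightly outside the frame, and cropping with out-of-range indices would otherwise fail or produce an empty region.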

Classification using the trained model

After weak detections are filtered out by keeping only those with a confidence score above 50%, the coordinates of the bounding boxes for the faces in a frame are computed. The extracted face regions of interest (ROIs) are then preprocessed as described in Sect. 4.1 (Data preprocessing), and the trained model is applied to them to classify each person as "mask" or "no-mask." If a person is classified as "no-mask," their face is captured and stored for identity retrieval. Figure 5 shows the detection of masks by our system on frames from live and recorded video, along with the confidence level of each prediction. In Fig. 5A, the person from a live video is wearing a mask, and the system has correctly predicted that the person is wearing a mask. Figure 5B shows a person from the live video who is without a mask, and the system’s detection that he is not wearing a mask. Figure 5C shows multiple detections of face masks in a recorded video, even with strong occlusion.

Fig. 5 — Output screenshots of application of our mask detector model on live video and recorded video streams. (A) Case 1—live video mask on, (B) Case 2—live video mask off, (C) Case 3—recorded video mask on (https://www.videvo.net/)

Identity retrieval phase

The captured images of the people without a face mask are sent for performing face matching. The captured images are compared with the available set of reference images of the people using a deep siamese neural network approach. When a particular face is a match with the captured face, the corresponding identification details that are unique to the specific person are retrieved from the database. Figure 6 represents the architecture of the Identity Retrieval Phase implementing face matching technique with the use of deep siamese neural network, for retrieving the information about the person without a face mask.

Fig. 6 — Architecture of identity retrieval methodology

Deep siamese neural network approach for face matching

As obtaining the identity of people without face masks is crucial for deploying an enforcement mechanism, we use a deep siamese neural network approach in which the facial recognition network, excluding its top layer, takes two images as input and uses vector similarity to determine whether they belong to the same person. We use the VGG-Face model by Parkhi et al. (2015) with cosine similarity to retrieve the identity of people without face masks. The VGG-Face model, consisting of 22 layers and 37 deep units, has achieved 98.78% accuracy on the “Labeled Faces in the Wild” dataset, on which human beings reach 97.53% accuracy in face recognition. The cosine distance metric is computationally efficient, simple, and works well with sparse data compared with other distance metrics such as Euclidean distance. Hence, the combination of the VGG-Face model and cosine similarity is preferred for the facial matching methodology. The deep siamese network architecture (Chicco 2021; Ostad-Ali-Askari et al. 2017) consists of two identical VGG-Face models with exactly the same configuration. Passing the face image of a non-masked person and a reference image through these identical networks yields two feature vectors, to which the cosine distance function is applied to verify whether the images belong to the same person. If the similarity value is greater than or equal to the specified threshold, the two images are taken to belong to the same person; this constitutes the face matching model employed in the identity retrieval phase. The vectorized formulas for cosine similarity and cosine distance are given in Eqs. (10) and (11), respectively.

S_{cosine} (p, q) = \frac{(p^{T} . q)}{\sqrt{p^{T} . p} \sqrt{q^{T} . q}}

D_{cosine} (p, q) = 1 - S_{cosine} (p, q)

where p and q are the feature vectors of the two pictures considered for verification, and $S_{cosine} (p, q)$ and $D_{cosine} (p, q)$ represent the cosine similarity and cosine distance between p and q, respectively.
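Eqs. (10) and (11), together with the thresholded matching step, can be sketched as follows. The feature vectors would come from the VGG-Face network; here they are short made-up lists. The threshold value and the `best_match` helper are hypothetical, since the text does not state the threshold used. Requiring the distance to be at most a threshold is equivalent to requiring the similarity to be at least one minus that threshold.

```python
import math


def cosine_similarity(p, q):
    """Eq. (10). Both vectors are assumed to be non-zero."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def cosine_distance(p, q):
    """Eq. (11)."""
    return 1.0 - cosine_similarity(p, q)


def best_match(query, references, max_distance=0.4):
    """Return (reference_id, distance) for the closest reference embedding,
    or None if no reference lies within max_distance.
    max_distance is a hypothetical threshold, not a value from the paper."""
    best = None
    for ref_id, embedding in references.items():
        d = cosine_distance(query, embedding)
        if d <= max_distance and (best is None or d < best[1]):
            best = (ref_id, d)
    return best


# Made-up two-dimensional embeddings keyed by reference ID.
reference_db = {"REF001": [1.0, 0.0], "REF002": [0.0, 1.0]}
match = best_match([0.9, 0.1], reference_db)
```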

Information retrieval mechanism

To evaluate our identity retrieval methodology, a sample population of around 57 images of people was collected, and their details were stored using PostgreSQL. The database table consists of fields such as the reference ID, which serves as a unique ID assigned to a citizen, the name, and location details. The images of the citizens are named after their unique reference ID. Figure 7 shows the database table that stores the details of this population.

Fig. 7 — Sample records from citizens database table

The facial matching is performed using the above discussed deep siamese neural network approach, between the captured images of the non-masked people and the reference images. If the images of the people without mask match with any reference images, then with the help of the reference image ID, the details of the non-masked people are retrieved from the database. The retrieved list of identification details is displayed on the web application developed using Streamlit, providing an interactive platform for analysis. A graph is also displayed to visually represent the monthly count of the non-masked people. Figure 8A shows the screenshot of the Streamlit web application that lists the details of the people who are not wearing a mask for a given day, and a graphical representation of the count of the non-masked people in a month is displayed in Fig. 8B.
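The lookup step, from a matched reference image ID to the stored details, can be sketched as below. To keep the example self-contained, an in-memory SQLite database stands in for PostgreSQL; the table name, column names, and rows are hypothetical and only mirror the fields mentioned above.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE citizens ("
    "reference_id TEXT PRIMARY KEY, name TEXT, location TEXT)"
)
conn.executemany(
    "INSERT INTO citizens VALUES (?, ?, ?)",
    [  # made-up rows for illustration only
        ("REF001", "Person A", "City X"),
        ("REF002", "Person B", "City Y"),
    ],
)


def retrieve_identity(connection, reference_id):
    """Look up the stored details for a matched reference image ID.
    Returns a dict, or None if the ID is not in the table."""
    row = connection.execute(
        "SELECT reference_id, name, location FROM citizens "
        "WHERE reference_id = ?",
        (reference_id,),
    ).fetchone()
    if row is None:
        return None
    return {"reference_id": row[0], "name": row[1], "location": row[2]}


details = retrieve_identity(conn, "REF002")
```

Because the reference images are named after the reference ID, the ID returned by the face matcher can be passed to such a query directly; the parameterized query avoids building SQL from strings.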

Fig. 8 — Output screenshots of the identity retrieval methodology. (A) Streamlit output–tabular format (B) Streamlit output–graphical format

Results and discussion

In this section, the results of our method are presented and compared, using different performance metrics, with other related models built for face mask detection. Training plots and tabulated results are also provided to support the analysis.

Performance analysis

The standard evaluation metrics, namely accuracy, precision, recall, and F1 score, are used to evaluate the model's performance. The formulas for these metrics are given in Eqs. (12), (13), (14) and (15), respectively.

\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}

\text{Precision} = \frac{TP}{TP + FP}

\text{Recall} = \frac{TP}{TP + FN}

\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

where TP, TN constitute the number of True Positives and True Negatives respectively and FP, FN correspond to the number of False Positives and False Negatives respectively.
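Eqs. (12) to (15) translate directly into code. The counts in the usage example are made up for illustration and are not the study's confusion matrix; the function assumes the denominators are non-zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.
    Assumes tp + fp > 0 and tp + fn > 0."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. (12)
    precision = tp / (tp + fp)                          # Eq. (13)
    recall = tp / (tp + fn)                             # Eq. (14)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (15)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```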

The trained classification model has achieved an overall test accuracy of 99.74% when trained for twenty epochs with the dataset split ratio as 80% for training and 20% for testing. It has a precision of 1.00 for mask class and 0.99 for no_mask class, a recall of 0.99 for mask class, and 1.00 for no_mask class, along with a value of 0.99 in F1 score for both the classes. The values of precision, recall, and F1-score of our proposed system are tabulated in Table 1. Figure 9A and B give us the graphical plots of loss and accuracy of our system respectively.

Table 1. Precision, recall and F1-scores for Mask and No_Mask classification

Class	Precision	Recall	F1 score
Mask	1.00	0.99	0.99
No_Mask	0.99	1.00	0.99

From the graphical representation in Fig. 9A, we can infer a steady decrease in training loss and validation loss to a stable point, which indicates a good fit. We can also observe from Fig. 9B that the validation accuracy is higher than the training accuracy, which suggests that the model is not overfitting; this pattern is expected when dropout and data augmentation are applied only during training. The deep siamese neural network approach used in the verification task achieved an accuracy of 98.24% when tested on the sample population mentioned in Sect. 4.3.

Comparison with other standard models

A comparison of our proposed methodology with other related models for face mask detection is given in Table 2.

Table 2. Comparison of the proposed method with other standard models for face mask detection in terms of accuracy

Reference	Methodology	Performance
Ejaz et al. (2019)	Principal component analysis	Accuracy: 83.5%
Jiang and Fan (2020)	RetinaFaceMask + ResNet	Face detection: precision 91.9%, recall 96.3%; mask detection: precision 93.4%, recall 94.5%
Loey et al. (2020)	Hybrid deep transfer learning	Accuracy: 99.64%
Loey et al. (2021)	YOLOv2 with ResNet-50	Average precision: 81%
Nagrath et al. (2021)	SSD and MobileNetV2	Accuracy: 92.64%
Proposed method	SSD, ResNet-50, and deep siamese neural network	Classification accuracy: 99.74%; identity retrieval accuracy: 98.24%

We can infer from Table 2 that our proposed method surpasses the performance of the other related models. The models presented by Ejaz et al. (2019) and Loey et al. (2020) deal only with classification and do not provide a mechanism for detection on live videos. Our proposed approach can be applied to live videos and surpasses these models with a classification accuracy of 99.74%, an improvement of 16.24 percentage points over the model presented by Ejaz et al. (2019) and 0.10 percentage points over the model presented by Loey et al. (2020). The proposed method has also proven effective in cases of strong occlusion and is capable of detecting all kinds of masks, such as cloth masks of different colors and surgical masks. Compared with the model presented by Jiang and Fan (2020), our methodology achieves better precision and recall values: 100% and 99% for masked face detection, and 99% and 100% for unmasked face detection, respectively. The model proposed by Loey et al. (2021) is restricted to detecting medical face masks. Our model also secures better classification accuracy than the approach presented by Nagrath et al. (2021), an improvement of 7.10 percentage points. The major drawback of all the models compared in Table 2 is that they do not present a methodology for retrieving the identity of people without face masks, whereas the approach presented in this paper introduces a novel identity retrieval methodology, using a deep siamese neural network, which achieves an accuracy of 98.24%.

Conclusion

Our main objective is to curb the spread of COVID-19 by providing a computer vision based enforcement methodology that encourages people to wear masks in shared spaces. Such a methodology would also discourage repeated violations and would aid governments around the world in monitoring face mask use effectively. The detection of face masks in images and live video streams has been implemented and has achieved better results than other related models for face mask detection. We propose a novel and efficient method that fine-tunes the pre-trained ResNet-50 model with a new head layer to classify faces as wearing or not wearing a mask. The classifier is trained using the Adam (adaptive moment estimation) optimization algorithm with a decaying learning rate and binary cross-entropy loss. Data augmentation and dropout regularization are employed to improve convergence. When the classifier is applied to video in real time, a Caffe face detector model based on the Single Shot MultiBox Detector extracts the face regions of interest from each frame, and the trained classifier is applied to these regions to detect people who are not wearing a mask. Thus, an efficient deep transfer learning model that requires less training time and achieves high accuracy with less training data has been proposed to detect face masks of all types. Its accuracy of 99.74% is the highest among the compared methods, including recent ones such as the hybrid deep transfer learning with machine learning methods proposed by Loey et al. (2020) and YOLOv2 with ResNet-50 proposed by Loey et al. (2021).
Additionally, our research introduces a completely autonomous and novel methodology for retrieving the identity of people without a mask, in which the captured faces are compared with reference images from the database by extracting features and calculating the cosine distance. This methodology secured a testing accuracy of 98.24%, thereby reducing human effort significantly. As our current approach classifies people only as masked or non-masked, it can be extended by training the model further to identify the type of mask people are wearing, which would help in enforcing the use of the appropriate mask and further aid in curbing the spread.

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Contributor Information

S. Vignesh Baalaji, Email: vigneshbaalajis@gmail.com

S. Sandhya, Email: sandhyavsridhar@gmail.com

S. A. Sajidha, Email: sajidha.sa@vit.ac.in

V. M. Nisha, Email: nishavm@vit.ac.in

M. D. Vimalapriya, Email: prshvi17375@gmail.com

Amit Kumar Tyagi, Email: amitkrtyagi025@gmail.com

References

Centers for Disease Control and Prevention (2020) Scientific brief: community use of cloth masks to control the spread of SARS-CoV-2. Updated 10 November 2020
Chicco D (2021) Siamese neural networks: an overview. In: Artificial neural networks. Springer, pp 73–94
Ejaz MS, Islam MR, Sifatullah M, Sarker A (2019) Implementation of principal component analysis on masked and non-masked face recognition. In: 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), IEEE, pp 1–5
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
Huang B (2020) Real-world masked face dataset–RMFD. https://github.com/X-zhangyang/Real-World-Masked-Face-Dataset. Accessed 2 Aug 2021
Jiang M, Fan X (2020) RetinaMask: a face mask detector. arXiv preprint arXiv:2005.03950
Karaman O, Alhudhaif A, Polat K (2021) Development of smart camera systems based on artificial intelligence network for social distance detection to fight against COVID-19. Appl Soft Comput 110:107610. doi: 10.1016/j.asoc.2021.107610
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
Larxel (2020) Face mask detection dataset. https://www.kaggle.com/andrewmvd/face-mask-detection. Accessed 2 Aug 2021
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) SSD: single shot multibox detector. In: European conference on computer vision. Springer, Cham, pp 21–37
Loey M, Manogaran G, Taha MHN, Khalifa NEM (2020) A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the COVID-19 pandemic. Measurement 167:108288
Loey M, Manogaran G, Taha MHN, Khalifa NEM (2021) Fighting against COVID-19: a novel deep learning model based on YOLO-v2 with ResNet-50 for medical face mask detection. Sustain Cities Soc 65:102600. doi: 10.1016/j.scs.2020.102600
Nagrath P, Jain R, Madan A, Arora R, Kataria P, Hemanth J (2021) SSDMNV2: a real time DNN-based face mask detection system using single shot multibox detector and MobileNetV2. Sustain Cities Soc 66:102692. doi: 10.1016/j.scs.2020.102692
Ostad-Ali-Askari K, Shayannejad M, Ghorbanizadeh-Kharazi H (2017) Artificial neural network for modeling nitrate pollution of groundwater in marginal area of Zayandeh-rood River, Isfahan, Iran. KSCE J Civil Eng 21:134–140
Parkhi OM, Vedaldi A, Zisserman A (2015) Deep face recognition
Rahmani AM, Mirmahaleh SYH (2020) Coronavirus disease (COVID-19) prevention and treatment methods and effective parameters: a systematic literature review. Sustainable Cities and Society, p 102568
Rana S, Kisku DR (2021) Face recognition using siamese network. In: Proceedings of international conference on frontiers in computing and systems. Springer, pp 369–376
Vasanthi M, Seetharaman K (2020) Facial image recognition for biometric authentication systems using a combination of geometrical feature points and low-level visual features. J King Saud Univ Comput Inf Sci 34:4109–4121
Wang B, Zhao Y, Chen CP (2021) Hybrid transfer learning and broad learning system for wearing mask detection in the COVID-19 era. IEEE Trans Instrum Meas 70:1–12. doi: 10.1109/TIM.2021.3123218
Wu X, Sahoo D, Hoi SC (2020) Recent advances in deep learning for object detection. Neurocomputing 396:39–64. doi: 10.1016/j.neucom.2020.01.085
