Abstract
Hand gesture recognition is an approach to comprehending human body language and is applied in various fields such as human-computer interaction. However, several issues remain: edge blurring generated by complex backgrounds, rotation inaccuracy induced by fast movement, and delays caused by computational cost. Recently, the emergence of deep learning has ameliorated these issues: convolutional neural networks (CNN) have enhanced edge clarity, long short-term memory (LSTM) has improved rotation accuracy, and attention mechanisms have optimized response time. In this context, this review starts with the deep learning models, specifically CNN, LSTM, and attention mechanisms, which are compared and discussed in terms of their utilization rate, their contribution to improving accuracy or efficiency, and their role in the recognition stages, such as feature extraction. Furthermore, to evaluate the performance of these deep learning models, the evaluation metrics, datasets, and ablation studies are analyzed and discussed. The choice of evaluation metrics and datasets is critical, since different tasks require different evaluation parameters and the model learns more patterns and features from diverse data. Therefore, the evaluation metrics are categorized into accuracy and efficiency, and the datasets are analyzed from self-created to public datasets. The ablation studies are summarized in four aspects: similar underlying models, integrating specific models, pre-processing, and others. Finally, the existing research gaps and further research on accuracy, efficiency, application range, and environmental adaptation are discussed.
Keywords: Deep learning, Evaluation metric, Self-created datasets, Underlying models, Ablation study
Introduction
In the ever-evolving landscape of technology and human-computer interaction, hand gesture recognition (HGR) emerges as a transformative force. It bridges the divide between human communication and the digital world, impacting diverse domains such as augmented reality (AR), virtual reality (VR), and sign language interpretation.
HGR accomplishes communication between humans and machines by tracking hand gestures, recognizing their representation, and converting them into semantically meaningful commands. Based on the data sources and technologies used for capturing and analyzing gestures, HGR can be classified into vision-based HGR, sensor-based HGR, wearable-based HGR, etc. Among them, vision-based HGR is primarily concerned with capturing signals through the camera, calculating the positional information of the hand through models, evaluating the gesture posture, and then processing it into comprehensible information. Vision-based real-time HGR further emphasizes efficiency and immediate responsiveness in simultaneously capturing gestures and executing real-time data analysis, ensuring fast responses in dynamic interaction scenes. Given this, vision-based dynamic HGR has been further advanced, becoming a hot research topic in recent years in several application scenarios (Zhang, Wang & Lan, 2020), such as robotics (Nogales & Benalcázar, 2021), military (Naik et al., 2022), clinical operations (Lanza et al., 2023), risk warning (Wang et al., 2023b), agriculture (Moysiadis et al., 2022), remote collaboration (Tian et al., 2023), AR (Yusof et al., 2016), entertainment (Cui, Sunar & Su, 2024) and education (Su et al., 2022). In robot control, it provides robots with flexibility and precision; in the military, it supports real-time silent communication and command; in clinical operations, it enables surgeons to control surgical equipment in real time; in risk warning, it realizes instant safety responses through gesture signals; in agriculture, it directs drones in real time through gestures for crop monitoring or pesticide spraying; and in remote collaboration, AR, and education, it enhances the naturalness and interactivity of the user experience.
Two momentous metrics for vision-based real-time HGR are instant response and recognition accuracy. However, challenges persist in accuracy and efficiency, such as inaccuracy under rotation or fast movement, high computing costs in video processing, and long response times. Recently, the emergence of deep learning has enabled some improvement on these issues. Since 2018, deep learning models have been widely adopted by an increasing cohort of HGR researchers. Many researchers have successfully implemented deep learning models to enhance the performance of HGR, such as structured dynamic time warping (SDTW) (Tang et al., 2018), InteractionFusion (Zhang et al., 2019), and a two-branch fusion deformable network (Liu & Liu, 2023). These deep learning models are deeply involved in each key stage of HGR, playing an essential role in promoting system performance and enhancing the interaction experience. The key stages of HGR involve data acquisition, data pre-processing, feature extraction, hand segmentation, hand detection and tracking, and classification, as shown in Fig. 1. The sensor captures continuous hand gestures. These data are cleaned in a pre-processing stage to remove invalid frames. Meanwhile, the hand is extracted from the intricate background to obtain a segmented hand area. Simultaneously, the hand location is detected within the frame while its continuous location, rotation, and orientation are tracked. At the core of the process is feature extraction, such as skin color, skeleton, and spatiotemporal information. Finally, the identified hand gestures are classified.
Figure 1. Pipeline of hand gesture recognition.
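To make the flow in Fig. 1 concrete, the following Python sketch outlines one possible way to organize these stages as a modular pipeline; all component classes (Preprocessor, HandSegmenter, HandTracker, FeatureExtractor, GestureClassifier) are hypothetical placeholders rather than components from any reviewed method.

```python
# A minimal sketch of the pipeline in Fig. 1. Every component class named here
# (Preprocessor, HandSegmenter, HandTracker, FeatureExtractor, GestureClassifier)
# is a hypothetical placeholder, not a component from any reviewed study.

class HandGestureRecognizer:
    def __init__(self, preprocessor, segmenter, tracker, extractor, classifier):
        self.preprocessor = preprocessor  # removes invalid frames, cleans data
        self.segmenter = segmenter        # separates the hand from the background
        self.tracker = tracker            # detects and tracks hand location/rotation
        self.extractor = extractor        # skin color, skeleton, spatiotemporal features
        self.classifier = classifier      # maps features to a gesture label

    def recognize(self, frames):
        """Run one gesture sequence (a list of camera frames) through the pipeline."""
        frames = self.preprocessor.clean(frames)               # data pre-processing
        regions = [self.segmenter.segment(f) for f in frames]  # hand segmentation
        tracks = self.tracker.track(regions)                   # detection and tracking
        features = self.extractor.extract(tracks)              # feature extraction
        return self.classifier.predict(features)               # classification
```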
In this context, this review embarks on a comprehensive exploration of vision-based deep learning methods and evaluation techniques in real-time HGR, spanning from 2018 to 2024. Our analysis focuses on three core aspects: the models, model evaluation, and research gaps. Although many researchers have summarized the literature in this area regarding methods, applications, challenges, and so on, as listed in Table 1, they have not detailed the relationship between improvements and models, or the influence of the key stages of HGR on the models within these methods. Unlike previous reviews or surveys, this review summarizes and discusses these aspects, as the strengths and weaknesses of a model at each stage directly affect the final recognition. Furthermore, we elaborate on and analyze model performance across datasets to highlight the intrinsic relationship between model performance, the models themselves, and the datasets.
Table 1. The previous related reviews or surveys from 2018 to 2024.
Elements covered in the study | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Reference | Style | Count | Timeline | Acquisition device | Methods | Evaluation metric | Datasets | Application | Challenge | Future direction |
Xia et al. (2019) | Survey | N/A | N/A | ✓ | ✓ | – | – | – | – | ✓ |
Zulpukharkyzy Zholshiyeva et al. (2021) | Survey | 70 | 2012–2021 | – | ✓ | – | – | ✓ | – | – |
Kaur & Bansal (2022) | Survey | 19 | N/A | – | ✓ | ✓ | – | ✓ | ✓ | ✓ |
Al-Shamayleh et al. (2018) | Review | 100 | N/A | ✓ | ✓ | – | – | ✓ | ✓ | ✓ |
Yasen & Jusoh (2019) | Review | 111 | 2016–2018 | – | ✓ | – | – | ✓ | ✓ | – |
Vuletic et al. (2019) | Review | 148 | 1980–2018 | ✓ | ✓ | – | – | ✓ | – | ✓ |
Nogales & Benalcázar (2021) | Review | 40 | 2015–2019 | ✓ | ✓ | ✓ | ✓ | – | ✓ | – |
Chakraborty et al. (2018) | Review | N/A | N/A | ✓ | ✓ | – | – | – | ✓ | – |
Oudah, Al-Naji & Chahl (2020) | Review | N/A | N/A | – | ✓ | – | – | ✓ | ✓ | – |
Mohamed, Mustafa & Jomhari (2021) | Review | 98 | 2014–2020 | ✓ | ✓ | ✓ | ✓ | – | – | ✓ |
Sarma & Bhuyan (2021) | Review | N/A | N/A | – | ✓ | – | ✓ | – | ✓ | ✓ |
Sha et al. (2020) | Review | N/A | N/A | – | ✓ | ✓ | ✓ | – | – | ✓ |
Al Farid et al. (2022) | Review | 108 | 2012–2022 | – | ✓ | – | – | – | ✓ | ✓ |
Based on the 2012 Association for Computing Machinery (ACM) computing classification, the focus area of this review is highlighted in blue, as depicted in Fig. 2. The goal of this review is to examine previous research on vision-based real-time HGR using deep learning. The main contributions of this review are summarized as follows:
To summarize and discuss the last 7 years of real-time hand gesture recognition methods using deep learning, linking the deep learning models with their improvements and their roles in key stages.
To analyze and discuss the evaluation metrics and datasets for evaluating, training, and testing the model, especially the comparison between self-created datasets and public datasets, and the technical evaluation.
To analyze the relationship between ablation studies and the selection of underlying models.
To categorize the existing research gaps and suggest potential directions.
Figure 2. ACM computing classification system.
The blue color indicates the research direction under which the content of this review falls.
Related work
Since the application of real-time HGR has been promoted, research efforts in this field have been on a constant upward trajectory. To facilitate the work of subsequent researchers, many have summarized and analyzed the related literature. Some of these summaries are more comprehensive, covering static and dynamic HGR as well as vision-based and wearable-based HGR (Nogales & Benalcázar, 2021; Tang et al., 2018; Zhang et al., 2019; Liu & Liu, 2023). Others review or survey vision-based HGR specifically. These related articles are listed in Table 1.
Acquisition device
Some of these reviews and surveys summarize and generalize vision-based real-time HGR from diverse perspectives. The analyses and comparisons were carried out in the following aspects: analyzing single cameras, active techniques, and invasive techniques (Xiong et al., 2021); collecting with cameras, sensors, and wearable devices (Vuletic et al., 2019); comparing the differences between sensor-based devices and vision-based devices (Vuletic et al., 2019; Al-Shamayleh et al., 2018; Sarma & Bhuyan, 2021); summarizing the basic information and defects of 2D and 3D cameras (Chakraborty et al., 2018); and comparing sensor technologies (Xia et al., 2019). Among them, surface electromyography (sEMG) sensors (Yasen & Jusoh, 2019) and the leap motion controller were common acquisition tools (Nogales & Benalcázar, 2021; Mohamed, Mustafa & Jomhari, 2021).
Method
For methods or techniques, Al-Shamayleh et al. (2018) discovered that most studies focus on appearance-based HGR within vision-based HGR. Some reviews or surveys analyzed hand detection, segmentation, and classification technologies (Vuletic et al., 2019; Sarma & Bhuyan, 2021; Xia et al., 2019; Al Farid et al., 2022; Oudah, Al-Naji & Chahl, 2020; Zulpukharkyzy Zholshiyeva et al., 2021). Feature extraction is obtained automatically using deep learning models such as convolutional neural networks (CNN) or long short-term memory (LSTM) (Nogales & Benalcázar, 2021; Sha et al., 2020). Moreover, the classification models include k-nearest neighbors (KNN), dynamic time warping (DTW), support vector machine (SVM), artificial neural network (ANN), and LSTM (Nogales & Benalcázar, 2021). Yasen & Jusoh (2019) found that the ANN is a widely used classifier. Chakraborty et al. (2018) summarized the advantages and disadvantages of various classifiers.
Application
For applications, HGR is applied in various areas such as robot control (Vuletic et al., 2019; Oudah, Al-Naji & Chahl, 2020; Zulpukharkyzy Zholshiyeva et al., 2021), sign language recognition (Vuletic et al., 2019; Al-Shamayleh et al., 2018; Sarma & Bhuyan, 2021; Yasen & Jusoh, 2019; Oudah, Al-Naji & Chahl, 2020; Zulpukharkyzy Zholshiyeva et al., 2021), healthcare (Vuletic et al., 2019; Sarma & Bhuyan, 2021; Oudah, Al-Naji & Chahl, 2020; Zulpukharkyzy Zholshiyeva et al., 2021), and entertainment (Vuletic et al., 2019; Al-Shamayleh et al., 2018; Oudah, Al-Naji & Chahl, 2020), with sign language recognition being the most common among them (Al-Shamayleh et al., 2018; Yasen & Jusoh, 2019).
Evaluation metric
For evaluation metrics, some reviews or surveys analyzed many aspects, such as parameters or tools of accuracy (Kaur & Bansal, 2022), processing time (Nogales & Benalcázar, 2021), and recognition accuracy (Nogales & Benalcázar, 2021; Mohamed, Mustafa & Jomhari, 2021; Sha et al., 2020).
Datasets
The datasets are created with Kinect, Leap Motion, Intel RealSense, or an interactive gesture camera (Nogales & Benalcázar, 2021). The datasets are sorted into fingerspelling, isolated, and continuous gestures (Mohamed, Mustafa & Jomhari, 2021). Sarma & Bhuyan (2021) listed the content and links of some datasets. Sha et al. (2020) described the details of seven isolated gesture datasets.
Research gap and challenge
The majority of reviews and surveys identify potential research directions derived from analyzing research gaps or challenges.
The challenges of vision-based real-time HGR include system-related issues (Al-Shamayleh et al., 2018), complex backgrounds (Al-Shamayleh et al., 2018; Chakraborty et al., 2018; Yasen & Jusoh, 2019; Oudah, Al-Naji & Chahl, 2020; Kaur & Bansal, 2022), illumination variation (Chakraborty et al., 2018; Oudah, Al-Naji & Chahl, 2020; Kaur & Bansal, 2022), gesture-related challenges (Al-Shamayleh et al., 2018; Chakraborty et al., 2018; Kaur & Bansal, 2022), highly accurate and efficient recognition (Nogales & Benalcázar, 2021; Sarma & Bhuyan, 2021; Chakraborty et al., 2018; Kaur & Bansal, 2022), the reasonableness of the technology and its application (Vuletic et al., 2019), overfitting on the datasets (Yasen & Jusoh, 2019), and matching issues in datasets (Oudah, Al-Naji & Chahl, 2020).
Future research direction
For future scope, Al-Shamayleh et al. (2018) introduced potential research directions including syntactic interpretation, hybrid methods, smartphone sensors, normalization, and real-life systems. Kaur & Bansal (2022) proposed potential research directions including dynamic hand classification methods in videos, the number of classifications, reducing recognition complexity and time, hand gesture detection, tracking, and classification in complex backgrounds, real-time requirements, and user experience improvement (Xia et al., 2019).
The combination of hand gestures and speech is a research direction warranting attention in the future (Vuletic et al., 2019). Increasing the collective volume of datasets, feature integration, and reducing computational cost will be future directions (Mohamed, Mustafa & Jomhari, 2021). Gesture communication will instead involve multi-step interactions (Sarma & Bhuyan, 2021). Research activities may advance in large-scale datasets, temporal or spatial relations, and the gap between virtual and real scenes (Sha et al., 2020).
The preceding reviews and surveys have not analyzed the relationship between improvements and models, despite many of them organizing the applications of models. Meanwhile, they have not analyzed the functions of the models at each stage of HGR. In addition, most of them focus on data acquisition and public datasets, rarely summarizing self-created datasets.
This review, by contrast, summarizes and discusses the roles of models at each stage of the recognition process and their impact on overall performance. Among the highlights are studies from the last 7 years on vision-based real-time HGR deep learning models, including CNN, LSTM, attention mechanisms, and others.
In addition, this review organizes the evaluation metrics from both accuracy and efficiency perspectives to inform future researchers in the selection of evaluation parameters. Furthermore, analyzing the distribution of self-created datasets and public datasets provides a reference for choosing the datasets for training and testing the models.
Methodology
In this section, we build on the definition of Bjørnson & Dingsøyr (2008) and incorporate the review methodology detailed by Kitchenham et al. (2009). Following Lavallée, Robillard & Mirsalari (2013) and the PRISMA statement (Page et al., 2020), the research methodology for literature selection includes six steps: planning, research questions, search strategy, inclusion and exclusion criteria, quality assessment, and data extraction. PRISMA is widely acknowledged as a crucial framework for writing reviews; it provides comprehensive guidelines for researchers to complete reviews in a structured, methodical manner. Therefore, we follow this framework in this section. The selection process consists of two main stages, as shown in Fig. 3: searching for and analyzing previous literature.
Figure 3. The process of selecting studies consists of two parts, previous literature search and analysis.
The previous literature search contains three steps: searching the literature repositories, duplicate removal, and adding studies from other sources.
Research questions
The research question is essential to guide the overall review process. We define the following three research questions (RQ), aiming to organize the results relevant to the models, performance evaluation, and research gaps. These include the improvement and limitation of methods, the relationship between models and improvement, the models in key stages of HGR processing, the evaluation metrics and datasets, and the research gaps and future directions. All of them focus on methods for real-time HGR using deep learning (DL).
RQ1. What types of models and procedures are used in real-time HGR through deep learning?
RQ2. What are the performance metrics used to evaluate the HGR models?
RQ3. Which research gaps remain in real-time HGR using deep learning?
The findings for each RQ are analyzed and discussed in the RQ1, RQ2, and RQ3 sections. The overall findings are organized in the Synthesis of Findings section.
Search strategy
This section involves designing and implementing a structured literature search process to locate all relevant studies using keywords and databases. The search period for references is from January 2018 to May 2024. The literature repositories searched are IEEE Xplore, the ACM Digital Library, Scopus, Springer, Web of Science, and Science Direct. To ensure coverage of relevant literature, keywords and synonyms related to the RQs were selected, as demonstrated in Table 2. The number of studies found in each literature repository is listed in Table 3.
Table 2. The keywords for searching studies.
ID | Keywords |
---|---|
K1 | ‘Vision based’ and ‘Real Time’ and ‘Hand Gesture Recognition’ and ‘Deep Learning’ |
K2 | ‘Vision based’ and ‘Dynamic’ and ‘Hand Gesture Recognition’ and ‘Deep Learning’ |
K3 | ‘Vision based’ and ‘Real Time’ and ‘Hand Gesture Recognition’ and ‘RNN’ |
K4 | ‘Vision based’ and ‘Real Time’ and ‘Hand Gesture Recognition’ and ‘CNN’ |
K5 | ‘Vision based’ and ‘Real Time’ and ‘Hand Gesture Recognition’ and ‘LSTM’ |
K6 | ‘Vision based’ and ‘Real Time’ and ‘Hand Gesture Recognition’ and ‘Attention Mechanism’ |
Table 3. The number of the studies for each literature repository and keywords.
Literature repositories | Keywords | Total | |||||
---|---|---|---|---|---|---|---|
K1 | K2 | K3 | K4 | K5 | K6 | ||
IEEE Xplore | 59 | 29 | 2 | 32 | 6 | 29 | 157 |
ACM digital library | 131 | 112 | 35 | 105 | 58 | 7 | 448 |
Scopus | 64 | 62 | 3 | 48 | 9 | 2 | 188 |
Springer | 105 | 71 | 25 | 70 | 28 | 36 | 335 |
Web of science | 39 | 34 | 1 | 27 | 6 | 3 | 110 |
Science direct | 35 | 30 | 11 | 31 | 15 | 15 | 137 |
In Fig. 3, the initial search produced 1,375 studies, which were filtered for duplicates, leaving 1,021 studies. In addition, 32 studies were collected from Google Scholar, CVPR, ICCV, ECCV, and other repositories and conference proceedings to ensure sufficient relevant literature was covered. The first stage thus yielded 1,053 studies. In the second stage, 199 studies were extracted by the inclusion and exclusion criteria, and forty-seven studies were finally selected by the quality assessment.
Inclusion and exclusion criteria
To determine whether a primary study would help answer the RQs, to ensure the completeness and accuracy of the search strategy, and to ensure the selected articles cover all RQs, the inclusion and exclusion criteria in Table 4 were created.
Table 4. Inclusion and exclusion criteria.
Inclusion criteria | The model of real-time or dynamic Hand Gesture Recognition (HGR) |
3D-based or vision-based, and using deep learning |
Exclusion criteria | No indication that the model is real-time or dynamic HGR |
Not vision-based or not using deep learning |
Gesture recognition without human hands |
Quality assessment
The quality of the references was assessed by answering the following seven assessment questions, scored as “Yes” = 1, “Partly” = 0.5, and “No” = 0 (with the exact scoring per question shown in Table 5). The designed questions facilitate answering the RQs mentioned previously; the scores were then summed and ranked.
Were the research purposes of the study clear?
Was the structure of the HGR model shown?
Were the results of the experiments shown?
Were the contributions of the study clear?
Did the article mention future work?
Was the limitation explicitly mentioned?
Was the article published in an accredited source?
Table 5. Quality assessment criteria.
No. | Quality assessment questions | Criteria scores |
---|---|---|
QAC1 | Were the research purposes of the study clear? | “Yes” = 1/“No” = 0 |
QAC2 | Was the structure of the HGR model shown? | “Yes” = 1, “Partly” = 0.5, and “No” = 0. |
QAC3 | Were the results of the experiments shown? | “Yes” = 1, “Partly” = 0.5, and “No” = 0. |
QAC4 | Were the contributions of the study clear? | “Yes” = 1, “Partly” = 0.5, and “No” = 0. |
QAC5 | Did the article mention future work? | “Yes” = 1/“No” = 0 |
QAC6 | Was the limitation explicitly mentioned? | “Yes” = 1/“No” = 0 |
QAC7 | Was the article published in an accredited source? | Rank by IF/JCR Q1 = 2 or CVPR, ICCV, ECCV = 2, Rank by IF/JCR Q2 = 1.5, Rank by IF/JCR Q3 or Q4 = 1, no ranking = 0 |
Note:
*QAC, Quality Assessment Criteria.
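As a minimal illustration of the summing and ranking step described above, the sketch below aggregates per-study scores for QAC1–QAC7; the study identifiers and score values are hypothetical and are not taken from Table 6.

```python
# Hypothetical illustration of summing and ranking quality scores (QAC1-QAC7).
# Score values and study identifiers are made up, not taken from Table 6.
scores = {
    "Study A": [1, 1, 1, 1, 0, 0, 2],
    "Study B": [1, 0.5, 1, 1, 1, 0, 1.5],
    "Study C": [1, 1, 0.5, 1, 0, 1, 1],
}

totals = {study: sum(qac_scores) for study, qac_scores in scores.items()}
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for study, total in ranking:
    print(f"{study}: {total}")
```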
Data extraction
In the second stage of Fig. 3, the 199 studies were selected by title, abstract, and full article according to the inclusion and exclusion criteria. Moreover, the studies were assessed using the quality assessment criteria, with the scores presented in Table 6. Finally, forty-seven studies were analyzed in depth for the three RQs.
Table 6. Data extraction results.
Biased evaluation criteria
This section analyzes the following constraints on this review and the options for alleviating them.
First, the article search may be incomplete. How close the number of retrieved articles is to the total related literature in the databases determines the completeness of the search, which is directly associated with the selection of databases and the definition of search keywords or their synonyms. To address these issues, we designed the search strategy described in the earlier section.
Second, the inclusion and exclusion criteria may not align with the three RQs. To obviate this, formal inclusion and exclusion criteria were formulated in the previous section.
Third, the quality assessment may not be objective. To mitigate this, this review developed seven assessment questions to reduce subjectivity.
Lastly, data extraction may be inaccurate. To address this, data were extracted using a comprehensive approach based on the three RQs, the inclusion and exclusion criteria, and the quality assessment.
RQ1: what types of models and procedures are used in real-time HGR through deep learning?
In response to RQ1, this section provides a detailed statistical summary and in-depth analysis, concentrating on the underlying models in the methods, the HGR models specifically designed for key stages in real-time recognition, and an analysis of their strengths and weaknesses. These statistics not only reveal the advancement of real-time HGR technologies but also serve as essential references for researchers in selecting optimal models for specific application scenarios.
The statistics shown below were collected by reading the forty-seven articles. The data on methods, model composition, methods employed in key stages of HGR, and advantages and disadvantages were recorded in a spreadsheet, and the charts were generated with graphing software.
Underlying models in methods
Since each deep learning model has advantages and disadvantages, methods designed with different underlying models yield different research results. Hence, the choice of models is directly related to overall performance. Figure 4 and Table 7 demonstrate the trends and distribution of underlying models in the methods.
Figure 4. Trends in the use of CNN, LSTM, attention mechanism, multiple model combinations, and other deep learning models from 2018 to 2024.
Table 7. Underlying models in methods.
Ref. No. | Methods | CNN | LSTM | Attention mechanism | Others |
---|---|---|---|---|---|
R1 | Structured Dynamic Time Warping (SDTW) | – | ✓ | – | ✓ |
R2 | InteractionFusion | – | ✓ | – | ✓ |
R3 | Short-Term Sampling Neural Networks (STSNN) | ✓ | ✓ | – | – |
R4 | Selective Spatiotemporal Features Learning (SeST). | ✓ | ✓ | – | – |
R5 | Gesture-CNN (G-CNN) | ✓ | – | – | – |
R6 | A Two-branch Fusion Deformable Network with Gram Matching | ✓ | ✓ | – | ✓ |
R7 | Modified Deep Convolutional Neural Network-based Hybrid Arithmetic Hunger Games (MDCNN-HAHG) | ✓ | – | – | – |
R8 | A Transformer-based with a C3D; AutoEncoder (AE) on LSTM network | ✓ | ✓ | ✓ | – |
R9 | Multimodal Fusion Hierarchical Self-attention Network (MF-HAN) | – | – | ✓ | – |
R10 | HandSense | ✓ | – | – | – |
R11 | An Integrated Framework Based on the Covariance Matrix | ✓ | – | ✓ | – |
R12 | A Spatiotemporal Attention-based ResC3D model | ✓ | – | – | – |
R13 | A Lightweight Inflated 3D ConvNets (I3D) | ✓ | – | – | – |
R14 | ABC Tuned-CNN structure | ✓ | – | – | – |
R15 | Star RGB and Dynamic Gesture Classifier | ✓ | – | – | – |
R16 | A non-touch character writing system | ✓ | – | – | – |
R17 | The optimized Darknet CNN architecture | ✓ | – | ✓ | – |
R18 | A Two-branch Hand Gesture Recognition Approach (HGRA) | ✓ | – | ✓ | – |
R19 | AA-A2J and AA-3DA2J | ✓ | – | ✓ | – |
R20 | A Transformer-based Network | ✓ | – | ✓ | – |
R21 | The Encoded Motion Image (EMI) | – | ✓ | – | ✓ |
R22 | The hDNN-SLR Framework | – | – | – | ✓ |
R23 | 3D DenseNet-BiLSTM | ✓ | ✓ | – | – |
R24 | Deep Robust Hand Gesture Network (RGRNet) | ✓ | – | – | ✓ |
R25 | Fisher Bidirectional Long-Short Term Memory (F-BiLSTM) and Fisher Bidirectional Gated Recurrent Unit (F-BiGRU) | ✓ | ✓ | – | – |
R26 | Hough Transform (HT) and Artificial Neural Network (ANN) | ✓ | ✓ | ✓ | – |
R27 | Hybrid Bidirectional Unidirectional LSTM (HBU-LSTM) | ✓ | ✓ | – | – |
R28 | The Generative Adversarial networks (GAN) Model and the Mask R-CNN | ✓ | ✓ | ✓ | – |
R29 | The Optimum Deep Residual Network (RetinaNet-DSC) | ✓ | – | – | – |
R30 | A Lightweight 3D Inception-ResNet | ✓ | – | – | ✓ |
R31 | Resnet-101 and SSD with attention mechanism (TA-RSSD) and Temporal Attentional LSTM (TA-LSTM) | ✓ | ✓ | ✓ | ✓ |
R32 | The Deep Single-stage CNN Model (Hybrid-SSR) | ✓ | – | – | – |
R33 | The RPCNet Module | ✓ | – | ✓ | – |
R34 | TSM-ResNet50 | ✓ | ✓ | – | – |
R35 | Kernel Optimize Accumulation (KOA), Union Frame Difference (UDF), and CLSTM | – | ✓ | – | ✓ |
R36 | The Radial Basis Function (RBF) Neural Networks | – | – | – | ✓ |
R37 | A Novel Grassmann Manifold Based Framework | ✓ | – | – | – |
R38 | A Multi-level Feature LSTM with Conv1D, the Conv2D Pyramid, and the LSTM Block | – | ✓ | – | ✓ |
R39 | 3D-Ghost and Spatial Attention Inflated 3D ConvNet (3DGSAI) | ✓ | – | ✓ | – |
R40 | Machete | – | – | – | ✓ |
R41 | A Two-Stream Hybrid Model (CNN + BGRU model) | ✓ | ✓ | – | ✓ |
R42 | The Normalized 2D SDD Features and A Priori Knowledge | ✓ | – | – | – |
R43 | A Pixel-wise Semantic Segmentation (SegNet) Model with VGG16; SegNet network, point tracker and Kalman filter (SDT); a Deep Convolutional Neural Network (DCNN) | – | ✓ | ✓ | ✓ |
R44 | The Adaptive Region-based Active Contour (ARAC), the Principal Component Analysis (PCA), the Optimized Probabilistic Neural Network (PNN), the Opposition Strategic Velocity Updated Beetle Swarm Optimization (OSV-BSO) | ✓ | ✓ | – | – |
R45 | Convolutional Capsule Neural Network (CCNN) Model | ✓ | ✓ | – | – |
R46 | A Scale-Invariant Feature Transform (SIFT) | ✓ | – | ✓ | – |
R47 | Single Shot Detector (SSD), CNN, LSTM, Discriminative Hand-related features (SVD) | ✓ | ✓ | – | – |
Trends in underlying model utilization
As Fig. 4 demonstrates for the forty-seven analyzed articles, many researchers are incorporating deep learning models for real-time HGR, with the trend peaking in 2022.
Underlying models
Table 7 lists the underlying deep-learning models implemented in each method, with some methods having just one and others combining many models.
The HGR models for key stages in real-time HGR
The final performance of HGR is affected by the selection, alteration, or improvement of the models in the key stages of the HGR process. These critical stages include data pre-processing, feature extraction, hand detection, hand segmentation, hand classification, and hand tracking. As Fig. 5 shows, the models concentrating on improving feature extraction are the most numerous, with minimal differences in the numbers for the other stages, as follows.
Figure 5. Comparison of the models in each key stage; feature extraction accounts for the largest share.
In particular, pre-processing plays a critical role in bridging the gap between data collection and model training while enhancing the accuracy and robustness of deep learning models. Despite its significance, pre-processing is rarely applied in the analyzed literature, as shown in Table 8.
Table 8. Pre-processing technologies.
Ref. No. | Technologies |
---|---|
R3 | Zoom out images; Rotation and crop |
R5, R29 | Image resize and data labeling |
R6 | A time-synchronized preprocessing method based on the RGB morphology and radar spectrum differences |
R9 | Converted to tokens using a Pixel Encoder |
R12 | Hybrid median filter |
R15 | Star RGB |
R17 | Transfer learning |
R25 | Suppress noise (i.e., data smoothing) using Average Filter, Median Filter, and Butterworth Filter |
R28 | Generative adversarial networks (GAN) |
R35 | Compress input video stream |
R43 | Data augmentation: translation and noise injection |
R44 | Median filtering |
R46 | Luminosity method based on gray-scale conversion of the input image |
Improvements and limitations in the HGR models
Researchers designed models with DL to solve problems that still exist in real-time hand gesture recognition. Most of these models specialize in enhancing accuracy, efficiency, and robustness, addressing occlusion issues, and reducing resource consumption. Although some progress has been achieved, limitations remain in accuracy, efficiency, robustness, and application range. Figure 6 shows the contribution of each model's implementation to the overall performance enhancement for real-time HGR. A more detailed comparison is analyzed in the section ‘Comparative analysis of technologies in methods’.
Figure 6. Comparison of the relationship between the underlying models and the direction of improvement.
Most of the models contributed recognition accuracy improvements and efficiency advancements.
According to the analysis, these methods improve one or more aspects of HGR performance with a single model or a combination. As Fig. 6 shows, most researchers focus on enhancing accuracy, for which CNN is used more than the other models.
Comparative analysis of technologies in methods
CNN and model combinations are the most widespread methods, as shown in Fig. 4 and Table 7. This is because each model has its merits and demerits; combining the underlying models can achieve higher accuracy and efficiency in HGR. Furthermore, most studies tend to enhance accuracy, as illustrated in Fig. 6. Additionally, Fig. 5 reveals that the majority of model research focuses on improving feature extraction. Pre-processing technologies contribute to enhancing model accuracy, yet fewer studies utilize them, as shown in Fig. 5.
Underlying models in methods
Since CNN has a unique advantage in feature extraction that enhances accuracy, it is frequently used in methods, either by itself or combined with other models. The most prevalent variants are the 2D convolutional neural network (2DCNN) and the 3D convolutional neural network (3DCNN). The 2DCNN handles only single images, while the 3DCNN handles continuous frame sequences. The 2DCNN can efficiently process 2D static images, as well as video segmented frame by frame into a sequence of static images; however, most of the background information in video frames is redundant, leading to inefficiency. Therefore, 2DCNN-based models are constrained in modeling temporal relationships, whereas 3DCNN-based models better capture temporal features at a higher computational cost.
In addition, numerous models are designed based on influential CNN architectures to optimize performance, such as AlexNet (Krizhevsky, Sutskever & Hinton, 2017), VGG16 and VGG19 (Simonyan & Zisserman, 2014), GoogleNet (Szegedy et al., 2015), ResNet (He et al., 2016), C3D (Tran et al., 2015), ResC3D (Tran et al., 2017), and DenseNet (Huang et al., 2017).
Although CNN extracts the features of a single image more accurately, front-to-back temporal dependencies remain insufficiently captured, and 3DCNN only extracts short-term features. In contrast, LSTM is designed to associate previous information with the current task; compared with a standard RNN, it adds a forget gate that determines how much earlier information is retained at the current moment.
However, LSTM has a relatively complex model structure that is more time-consuming to train than CNN, and has a disadvantage in parallel processing. In contrast, the attention mechanism enables lightweight and parallel processing. It can filter out task-independent information while enhancing task-related information, but the results are not accurate enough.
Hence, some researchers integrate CNN, LSTM, attention mechanisms, or other deep learning models to obtain better performance. The advantages and disadvantages of these models are summarized in Fig. 7.
Figure 7. The advantages and disadvantages of CNN, LSTM, and attention mechanism and their integration in a real-time HGR pipeline.
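To illustrate such an integration, the sketch below shows one common way a 2D CNN feature extractor can be combined with an LSTM for clip-level gesture classification in PyTorch; the layer sizes and the number of gesture classes are illustrative assumptions and do not reproduce any specific reviewed architecture.

```python
import torch
import torch.nn as nn

class CNNLSTMGestureNet(nn.Module):
    """Illustrative hybrid: a 2D CNN extracts per-frame spatial features,
    and an LSTM models the temporal dependencies across the frames."""

    def __init__(self, num_classes=10, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(  # per-frame spatial feature extractor
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)  # temporal modeling
        self.head = nn.Linear(hidden_dim, num_classes)               # gesture classifier

    def forward(self, clips):
        # clips: (batch, time, channels, height, width)
        b, t, c, h, w = clips.shape
        feats = self.cnn(clips.view(b * t, c, h, w)).view(b, t, -1)  # per-frame features
        _, (h_n, _) = self.lstm(feats)                               # final hidden state
        return self.head(h_n[-1])                                    # class logits

# Example: a batch of 2 clips, each with 16 RGB frames of 64 x 64 pixels
logits = CNNLSTMGestureNet()(torch.randn(2, 16, 3, 64, 64))
```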
The HGR models for key stages in real-time HGR
The key-stage models are essentially designed to achieve better performance in real-time gesture recognition. Improving the feature extraction model is particularly significant, especially for spatiotemporal feature extraction, while the other stages receive similar attention, as shown in Fig. 5. The reason is that the precision of feature extraction directly influences not only segmentation, tracking, and classification but also the accuracy and responsiveness of real-time HGR.
Moreover, over the past seven years, pre-processing techniques have been employed to tackle specific challenges in image and video data. Pre-processing decreases the amount of unnecessary data in training and testing, thereby enhancing accuracy and reducing computational consumption.
Various pre-processing techniques are applied to address inaccuracy. Image resizing (R5, R29) standardizes input data, such as resizing images to 300 × 300 pixels, ensuring consistency across datasets and facilitating model training. The hands are labeled by LabelImg (R29) to obtain accurate annotations for supervised learning. Rotation, cropping, and zooming (R3) further augment the dataset by simulating different perspectives, reducing overfitting, and improving the model’s generalization.
Noise reduction is another critical aspect for enhancing accuracy, as noisy data can degrade model performance. Multiple filtering techniques are employed to suppress noise, including average, median, and Butterworth filters (R25), hybrid median filters (R12), and median filtering (R44). The luminosity method (R46) converts images to grayscale while retaining essential hand edges. Moreover, transfer learning (R17) leverages pre-trained models to improve efficiency and accuracy for tasks with limited data. Translation and noise injection (R43) and generative adversarial networks (GANs) (R28) expand data diversity to enhance model robustness.
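The following sketch illustrates, with OpenCV, a few of the pre-processing steps mentioned above (resizing to a fixed resolution, median filtering, grayscale conversion, and rotation-based augmentation); the 300 × 300 target size follows R5/R29, while the filter kernel size and rotation angle are arbitrary assumptions.

```python
import cv2
import numpy as np

def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    """Illustrative pre-processing: resize, denoise, and grayscale one BGR frame."""
    resized = cv2.resize(frame, (300, 300))            # standardize input size (as in R5, R29)
    denoised = cv2.medianBlur(resized, 5)              # median filtering (as in R12, R44); kernel size assumed
    return cv2.cvtColor(denoised, cv2.COLOR_BGR2GRAY)  # grayscale/luminosity conversion (as in R46)

def rotate_frame(frame: np.ndarray, angle: float = 15.0) -> np.ndarray:
    """Simple rotation-based augmentation (as in R3); the angle is an assumption."""
    h, w = frame.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(frame, matrix, (w, h))
```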
In particular, some pre-processing technologies for a special time-synchronized method (R6) leverage differences in RGB morphology and radar spectrum. Pixel tokenization via self-attention mechanisms (R9) is used to convert tokens to capture spatial dependencies. Additionally, compressing input video streams (R35) ensures lower computational cost.
Furthermore, over the past 7 years, researchers have begun to refine hand segmentation models with deep learning and to enhance this essential stage in alignment with application scenarios. This is because accurate segmentation is beneficial for precisely identifying hand gestures. In some cases, the key stage that is improved is linked to the application context; at the classification stage, for example, enhanced models are chiefly applied to sign language recognition and handwriting trajectory recognition.
Improvements and limitations in the HGR models
Researchers have endeavored to enhance accuracy and efficiency, yet these two aspects remain the primary issues to be addressed, as shown in Fig. 6. One reason is that a method's accuracy has often improved only in specific situations. The CNN-based hybrid model (R7) is accurate in complex backgrounds, yet less accurate for specific gestures. Conversely, EMI (R21) is precise for each gesture but degrades outdoors. Gesture-CNN (R5), I3D (R13), Star RGB (R15), and a two-stream hybrid model (R41) classify accurately, but suffer respectively from an increased error rate in real time, issues in more complex contexts, loss of hand details, and confusion between finger and hand. The non-touch character writing system (R16) is accurate, although its error rate increases over time. The normalized 2D SDD features with a priori knowledge (R42) are accurate, except for the similar hand gestures of classes ‘a’, ‘e’, ‘m’, ‘n’, ‘s’ and ‘t’.
The other reason is that models designed to improve accuracy incur extreme consumption costs, which in turn affect recognition speed, such as R3 (CNN and LSTM-based model), R18 (CNN, multi-scale fusion, weight multi-scale, multi-scale attention, and attention U-Net), R25 (F-BiLSTM and F-BiGRU), R29 (RetinaNet-DSC), and R33 (RPCNet). In contrast, current models that target efficiency rarely achieve comparably good accuracy, such as R1 (structured dynamic time warping), R30 (a lightweight 3D Inception-ResNet), and R17 (optimized Darknet CNN).
Many researchers have made efforts to combine various models to obtain a balance, but the outcome is unsatisfactory. An integrated model with the covariance matrix (R11) and the deep single-stage CNN model (R32) balanced the accuracy and efficiency, while they had deficiencies in multiple gestures and in predicting the level of detail, respectively. The hybrid model (R35) balanced computation cost and classification accuracy, yet was still limited on mobile devices.
In addition, although recent methods contribute remarkably to HGR, they still face the following problems in practical applications. Most methods perform excellently on datasets (R13, R21, R28), yet poorly in real-world scenarios, such as complex backgrounds, hydrological change, and dynamic outdoor backgrounds. Moreover, high computational cost is a major bottleneck of model fusion (R3, R13, R18, R25, R26, R29, R33, R47). Although model fusion specializes in accuracy improvement and robustness, it has limited applicability on real-time and resource-constrained devices. On the contrary, lightweight models (R13, R38) are more practical in these scenarios, although their environmental adaptability is weak. Beyond that, the existing methods, despite improvements in some special scenarios, remain to be further advanced, for example in hand-hand or hand-object interaction with occlusion (R2), online recognition (R4, R5), complex backgrounds (R7), lighting variations (R10, R16), underwater environments (R28), and astronauts interacting with robots (R31).
Current research on HGR has gained remarkable achievements in accuracy, efficiency, and robustness by integrating deep learning models such as CNN, LSTM, attention mechanisms, and multi-modality. However, it still faces challenges such as the balance between accuracy and efficiency, weak environmental adaptability, occlusion, and high consumption cost.
RQ2: what are the performance metrics used to evaluate the HGR models?
In response to RQ2, this section elaborates a statistical summary and critical analysis, focusing on the evaluation metrics, datasets, ablation studies, and technical assessments. Moreover, a comparative analysis and critical synthesis are elaborated in the discussion section. Together, these aspects provide valuable insights into performance evaluation and the factors influencing model performance, and they lay a crucial foundation for refining dataset design and selection, thereby enhancing the models' usability and research significance.
The following statistics were gathered by reading the forty-seven references. The numerical data recorded in the spreadsheet include the evaluation metrics, datasets, ablation studies, and model performance, and the charts were generated by specialized software.
Evaluation metrics
Figure 8 illustrates the distribution of evaluation metrics in studies. The following is a detailed description.
Figure 8. The analytical evaluation metrics are in two groups, accuracy and efficiency.
The x-axis presents the number of studies that use a given evaluation parameter.
Accuracy
Accuracy is commonly evaluated by recognition accuracy, F1-score, precision, recall, average precision (AP), etc., with almost all models using recognition accuracy to assess performance. The equations for these evaluation metrics are listed below:
Precision emphasizes the reliability of positive predictions and indicates the proportion of true positive predictions among all positive predictions:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$
Recall evaluates the model’s ability to identify all relevant instances, which is vital in minimizing false negatives. It reflects the percentage of true positive predictions among all actual positives:
$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$
The $F_\beta$-score is a metric that adjusts the balance between precision and recall using the parameter $\beta$:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}} \tag{3}$$

When $\beta = 1$, it becomes the F1-score, which gives equal weight to precision and recall. A larger $\beta$ emphasizes recall, while a smaller $\beta$ emphasizes precision.
Average precision (AP) presents the relationship between precision and recall across different thresholds. It calculates the mean of Precision values at varying levels of recall:
$$AP = \int_{0}^{1} p(r)\,dr \tag{4}$$

where $r$ is the recall value ranging from 0 to 1, and $p(r)$ is the precision as a function of recall $r$.
Mean average precision (mAP) evaluates multi-class object detection performance. It is the average of AP values across all classes:
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{5}$$

where $N$ is the total number of classes, $i$ is the index of a class (ranging from 1 to $N$), and $AP_i$ is the average precision for class $i$.
Mean squared error (MSE) is a common evaluation metric that measures the average of the squared differences between the actual values ($y_i$) and the predicted values ($\hat{y}_i$), where $n$ represents the total number of samples, $y_i$ is the actual (true) value for the $i$th observation, and $\hat{y}_i$ is the predicted value for the $i$th observation:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{6}$$
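As a worked illustration of Eqs. (1)–(3) and (6), the following sketch computes precision, recall, F1-score, and MSE from hypothetical predictions; the arrays are illustrative only.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])  # hypothetical ground-truth labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])  # hypothetical predicted labels

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives

precision = tp / (tp + fp)                                  # Eq. (1)
recall = tp / (tp + fn)                                     # Eq. (2)
f1 = 2 * precision * recall / (precision + recall)          # Eq. (3), beta = 1

y_hat = np.array([0.9, 0.1, 0.8, 0.4, 0.2, 0.7, 0.6, 0.1])  # hypothetical predicted scores
mse = np.mean((y_true - y_hat) ** 2)                        # Eq. (6)

print(precision, recall, f1, mse)  # 0.75 0.75 0.75 ...
```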
Efficiency
Many studies evaluate efficiency by computational performance, mean detection time, run time, delay, etc. Computational performance is the average per-frame processing time, and detection time is the time taken to detect hand gestures.
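A minimal sketch of how the average per-frame processing time might be measured is shown below; the recognize callable and the frame list are placeholders, not code from any reviewed method.

```python
import time

def average_frame_time_ms(recognize, frames):
    """Average per-frame processing time in milliseconds.

    `recognize` is a placeholder callable that processes a single frame."""
    start = time.perf_counter()
    for frame in frames:
        recognize(frame)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(frames)
```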
In brief, the majority of models concentrate on enhancing accuracy, as illustrated in Fig. 6. As a result, in Fig. 8, the evaluation metrics for accuracy appear significantly more often than those for efficiency. Among the accuracy metrics, recognition accuracy is the most frequent because it provides a global assessment of the final performance of the real-time HGR process, followed by the F1-score and its related precision and recall. In contrast, other models demonstrate accuracy by reporting a low error rate. For efficiency, most evaluation metrics relate to the time consumed per frame.
Datasets
The number and category of samples in the datasets are significant factors influencing model performance. Table 9 provides a summary of the datasets and their corresponding model performance. Figure 9 illustrates the distribution of self-created and public datasets, along with their relationship to model performance.
Table 9. Datasets and associated performance in literature.
Dataset name | Video or image | Number of samples | Acquisition method | Background | Number of classes | Applied area | Ref. no.: Performance |
---|---|---|---|---|---|---|---|
Continuous Letter Trajectory (CLT) (Self-created) | Video | 1,300 | 3D camera (Kinect2) | N/A | 26 | Handwriting trajectories and gesture recognition | R1: 88.8% |
R2 (Self-created) | Sequence | 34 sequences with about 20k frames | Two RealSense SR300 sensors to capture hand-object interaction | N/A | 34 | Hand-object interaction | R2: Mean standard deviation of DoFs is around 0.45 |
Jester | Video | 148,092 | Collected via a crowdsourcing platform | Various | 27 | Hand gesture recognition | R3: 95.73%, R13: 95.31%, R34: 95.55% |
Nvidia | Video | 1,532 | The SoftKinetic DS325 | Indoor | 25 | Hand gesture recognition | R3: 85.13% (only selected RGB frames from videos) |
Sheffield Kinect Gesture (SKIG) | Video | 1,080 RGB and 1,080 depth videos | Kinect sensor | Uniform | 10 | Gesture recognition | R4: 99.63%, R10: 93.1%, R11: 98.1%, R12: 100%, R23: 99.7%, R33: 99.70%, R37: 98.11%, R39: 97.87% (RGB) and 98.70% (Depth), R41: 99.6% |
ChaLearn LAP IsoGD | Video | 47,933 | RGB-D cameras | Various | 249 | Gesture recognition | R4: 60.27% (RGB) and 57.02% (Depth), R8: 60.2%, R12: 68.14%, R13: 65.13%, R15: 52.18%, R30: 71.37%, R33: 69.3% (RGB) and 69.65% (Depth), R47: 86.1% |
R6 (Self-created) | Image | 10,800 | A multimodal measurement platform composed of radar and camera | Various | 10 | Hand gesture recognition | R6: 94.58% |
R10 (Self-created) | Image | 4,200 | Kinect | Indoor | 28 | Hand gesture recognition | R10: 92.8% (static), 86.2% (fine-grained), 91.3% (coarse-grained hand gestures) |
EgoGesture | Video, Image | Video: 2,081, Image: 24,161 | Intel RealSense SR300 | Indoor/Outdoor | 83 | Egocentric gesture recognition | R4: 93.20% (RGB) and 93.35% (Depth), R20: 93.8%, R21: 90.63%, R33: 93.93% (RGB) and 94.14% (Depth) |
Indian Sign Language (ISL) (Self-created) | Image | 2,150 | A camera | Indoor | 43 | Indian sign language recognition | R5: 94.83% (alphabets), 99.96% (static words) |
The JochenTriesch’s | Image | 3,000 | JochenTriesch’s database | Uniform | 10 | Sign language recognition | R5: 100% |
DHG-14 | Video | 2,800 | Intel RealSense camera | Indoor | 14 | Hand gesture recognition | R7: 97.20%, R11: 87.18%, R38: 96.07%, R37: 88.4%, R41: 97.8% |
DHG-28 | Video | 2,800 | Intel RealSense camera | Indoor | 28 | Hand gesture recognition | R7: 96.3%, R11: 78.61%, R38: 94.4%, R41: 92.1% |
NU-DHGR | Sequence | 1,050 | Leap motion | Indoor | 10 | Hand gesture recognition | R11: 98.1% |
FHPA | Video | 1,050 | Depth camera | Indoor | 45 | Hand posture recognition | R7: 92.78% |
American Sign Language Lexicon Video (ASLVID) | Video | 3,000 | Native signers | Indoor | 50 | American Sign Language recognition | R8: 68.8%, R47: 93% |
RKS-PERSIANSIGN | Video | 10,000 | Camera | Indoor | 100 | Persian sign language recognition | R8: 74.6%, R47: 99.5% |
First-Person | Image | 105,459 RGB-D frames | Intel RealSense SR300 RGB-D camera | Complex | 45 | Hand posture recognition | R8: 67.2%, R47: 91% |
SHREC’17 track | Video | 2,800 | Intel RealSense camera | Indoor | 14 | 3D Hand gesture recognition | R9: 94.17% |
BSG2.0 (Self-created) | Video | 10,720 | DS325 | Indoor | 26 | Hand gesture recognition | R13: 99%, R30: 98.04% |
Sign language digits | Image | 2,062 | Webcam | Monochrome | 10 | Sign language recognition | R14: 98.40% |
Thomas Moeslund’s Gesture Recognition | Image | 2,060 | Webcam | Monochrome | 24 | Sign language recognition | R14: 98.09% |
Montalbano | Video | 13,206 | Kinect 360 sensor | Indoor | 20 | Gesture recognition | R15: 94.58% |
Gesture Commands for Robot Interaction (GRIT) | Video | 543 | RGB camera | Indoor | 9 | Gesture recognition | R15: 98% |
R16 (Self-created) | Image | 12,600 | Webcam | Monochrome | 7 | Hand gesture recognition | R16: 98.09% (Offline), 96.23% (real-time) 2.08 s |
R17 (Self-created) | Image | 800 | Camera | Indoor | 4 | Human-robot interaction | R17: 96.92% |
OUHANDS | Image | 3,000 | Intel RealSense F200 camera | Complex | 10 | Hand shape recognition | R18: 90.9%, R43: 97.49% (0.92 s per frame) |
HGR1 | Image | 899 | Camera | Indoor | 27 | Hand gesture recognition | R18: 83.8% |
NYU | Video | 72,757 | Kinect sensor | Indoor | 36 | Hand pose estimation | R19: Mean error (mm) 8.37 |
ICVL | Video | 330,000 | Intel RealSense camera | Indoor | 16 | Hand pose estimation | R19: Mean error (AA-A2J 8.37 mm, AA-3DA2J 8.37 mm); FPS (AA-A2J 151.06, AA-3DA2J 79.62) |
MSRA | Video | 76,500 | Kinect sensor | Indoor | 17 | Hand pose estimation | R19: Mean error (AA-3DA2J 6.39 mm, AA-A2J 6.30 mm) |
HANDS 2017 | Video | 957,032 | Multiple datasets | Various | 32 | Hand pose estimation | R19: Mean error (AA-3DA2J 8 mm, AA-A2J 8.08 mm) |
NVGesture | Video | 1,532 | Webcam | Indoor | 25 | Hand gesture recognition | R20: 83.2%, R23: 89.8%, R39: 73.03% (RGB) and 85.48% (depth) |
MSR Gesture 3D | Depth sequence | 336 | Kinect sensor | Indoor | 12 | Sign language recognition | R21: 99.24% |
Indian Isolated Word Sign (IIWS) (Self-created) | Video | 3,000 | DSLR camera | Indoor | 100 | Sign language recognition | R22: 98.75% |
Russian (Self-created) | Image | 377,850 | DSLR camera | Indoor | 1,100 | Sign language recognition | R22: 98.75% |
R23 (Self-created) | Video | 1,800 RGB and 1,800 depth videos | DS325 | Indoor | 6 | Hand detection and classification | R23: 92.06% |
UTAS7k (Self-created) | Image | 7,071 | Camera | Indoor | 5 | Hand gesture recognition | R24: 78.2% (clear), 77.1% (blur) |
Mobile Gesture Database (MGD) (Self-created) | Video | 5,547 | Inertial sensors | Various | 12 | Gesture recognition | R25: BiLSTM 98.04%, F-BiGRU 99.15% |
BUAA mobile gesture | Video | 1,120 | Smartphone | Indoor | 8 | Gesture recognition | R25: F-BiLSTM 99.06%, F-BiGRU 99.25% |
Smart watch gesture | Video | 3,200 | Smartwatch | N/A | 20 | Gesture recognition | R25: F-BiLSTM 95.65%, F-BiGRU 97.40% |
Cambridge hand gesture | Image sequences | 900 | Camera | Monochrome | 9 | Hand gesture recognition | R26: 96%, R41: 99.4% |
Sebastien marcel | Image sequences | 57 | Webcam | Monochrome | 4 | Hand gesture recognition | R26: 98% |
LeapGestureDB | Video | 6,600 | Leap Motion | Indoor | 11 | Hand gesture recognition | R27: 90% |
RIT (Rochester Institute of Technology) | Image | 9,600 | Leap Motion | Indoor | 12 | Gesture trajectory | R27: 90% |
R28 (Self-created) | Image | 12,106 | CADDY Underwater Gestures dataset | Various | 16 | Diver’s gesture recognition | R28: 85% |
NUSHP-II | Image | 2,000 | Digital camera | Indoor | 10 | Hand posture recognition | R29: 99.9%, R32: 99.3% |
Senz-3D | Video | 1,320 | Creative Senz3D camera | Indoor | 11 | Hand gesture recognition | R29: 99.99%, R32: 98.2% |
MITI-HD | Image | 7,500 | Webcam | Indoor | 10 | Hand gesture recognition | R29: AP 99.21%, AR 96.99%, F1-Score 98.10%, Prediction time 82 ms; R32: 99.6% |
SHRI-VID (Self-created) | Video, Image | Video 102, Image more than 6,000 | ASL, SRSSL, Egohands, VIVA Hand Detection datasets | Indoor | 3 | SHRI robot-astronaut interaction | R31: 91.6% |
AU-AIR | Video | 32,823 | Drone | Outdoor | 8 | Object detection | R31: 71.68% |
ImageNet-VID | Video | 5,354 | Collected from the web | Various | 30 | Object detection | R31: 64.7% |
R35 (Self-created) | Video | 137 videos, 41,100 frames | N/A | Indoor | 2 | Fingertip tracking | R35: 96.89% in 0.0267 s, 96.82% in less than 0.13 s |
R36 (Self-created) | Image | 4,320 | Microsoft Kinect | Indoor | 36 | Hand gesture recognition | Arabic numbers: 95.83% and 97.25%; English alphabets: 91.35% and 92.63% |
26-gesture | Position of dominant-hand forefinger | 321 | Leap Motion sensor | N/A | 14 | Gesture trajectory 3D points | R37: 99.3% |
VIVA | Video | 885 sequences | Microsoft Kinect | Vehicle (complex background) | 19 | Hand gesture recognition in vehicles | R39: RGB 81.50%, depth 80.15% |
HA (Self-created) | Image | 1,140 | Xbox One Kinect, HTC Vive, mouse data | Indoor | 17 Kinect, 11 HTC Vive, 10 mouse gestures | Gesture recognition (skeleton and mouse trail) | R40: 95% |
Northwestern University Hand Gesture (NWUHG) | Video | 1,050 videos | Motion divergence fields | Indoor | 10 | Dynamic hand gesture recognition | R41: 98.6% |
American Sign Language (ASL) Finger Spelling Dataset | Image | 48,000 | Kinect depth camera | Indoor | 26 | American Sign Language recognition | R42: 89.38% |
Near-infrared (NI) Gesture | Image | 2,000 | Leap Motion infrared camera | Black | 10 | Hand gesture recognition | R42: 99% |
The National University of Singapore (NUS) | Image | 2,750 | The charge coupled device (CCD) cameras | Indoor/outdoor | 40 | Hand postures | R42: 97.98% |
NITS S-net | Image | 5,000 | 1MP HD webcam | Complex | 95 gesture trajectories | Gesture recognition | R43: 99.58%, 0.95 s per frame |
Oxford hand | Image | 13,050 | Collected from various public datasets | Complex | N/A | Hand detection | R43: 87.01%, 1.15 s per frame |
EgoHands | Image | 4,800 | Head-mounted cameras | Complex | 4 | Hand detection | R43: 97.05%, 0.4 s per frame |
Dataset1 | Image | 25,300 | N/A | Monochrome | 36 | Sign language recognition | R44: 98.29% |
Dataset2 | Image | 57,000 | Collected from other dataset on Kaggle website | Monochrome | 27 | Sign language recognition | R44: 98.56% |
HG14 (HandGesture14) (Self-created) | Image | 14,000 | Camera | N/A | 14 | Hand gesture recognition | R45: 90% |
Fashion-MNIST | Image | 70,000 | Collected from Zalando’s website | N/A | 10 | Hand gesture recognition | R45: 93.88% |
CIFAR-10 | Image | 60,000 | Subsets of the 80 million tiny images dataset | Various | 10 | Hand gesture recognition | R45: 81.42% |
R46 (Self-created) | Image | About 27,495 | Microsoft Kinect sensor | Monochrome | 26 | American Sign Language recognition | R46: 97.4% |
Figure 9. Comparison of the recognition rate between self-created datasets and public datasets.
Researchers train and test models on public and self-created datasets, as illustrated in Table 9, with nineteen studies designing their own datasets. Moreover, some exclusively train and test the model on self-created datasets, some only on public datasets, and others on both self-created and public datasets.
In addition, as demonstrated in Fig. 9, within the same range of recognition rates, the percentage of high recognition rates for the self-created datasets is higher than for the public datasets.
In brief, the detailed information on the datasets and the associated model performance is summarized in Table 9; the seventy-one datasets in total are scattered and diverse. Many studies design their own datasets for training and testing models, and the benefit is that the models perform better on self-created datasets, as shown in Fig. 9. Some of them test their models not only on self-created datasets but also on public datasets to demonstrate superiority. Others select public datasets for training or testing, such as the SKIG dataset (Liu & Shao, 2013), the DHG-14/28 dataset (De Smedt, Wannous & Vandeborre, 2016), and the Jester dataset (Materzynska et al., 2019).
Technical evaluation
The underlying models of the methods are intimately linked with the evaluation metrics. Appropriate evaluation metrics ensure an objective assessment of the effects of method refinement. Meanwhile, the intention of the technical evaluation is also to confirm whether the modified model reaches its intended aim. Table 10 and Figs. 9 and 10 demonstrate the distribution of recognition rates, evaluation parameters, and performance.
Table 10. Summary of models and performance.
Ref. No. | Main underlying model | Improvement | Limitation | Performance | Application |
---|---|---|---|---|---|
R1 | LSTM, Others | Efficiency | The overlap problem, the mismatch problem | CLT (Self-created): F1-Score 88.8% | Handwriting trajectories and gesture recognition |
R2 | LSTM, Others | Occlusions between the hand and the object | Geometry ambiguities, severe segmentation errors, no two hands or multiple objects interaction | Self-created dataset: mean standard deviation of DoFs is around 0.45 | Hand-object interaction |
R3 | CNN, LSTM | Accuracy | Computing cost | Jester: 95.73%; Nvidia: 85.13% | HGR |
R4 | CNN, LSTM | Accuracy | Challenge in online gesture recognition | SKIG: 99.63%; ChaLearn LAP IsoGD: 60.27% RGB, 57.02% Depth; EgoGesture: 93.20% RGB, 93.35% Depth | HGR |
R5 | CNN | Classification accuracy | Error rate in real-time | ISL (Self-created): alphabets 94.83%, static words 99.96%; JochenTriesch’s: 100% | Sign language recognition |
R6 | CNN, LSTM, Others | Robustness; Generalization ability for classification tasks in diverse FOV scenes | Consumption cost | Self-created dataset: 94.58% | HGR |
R7 | CNN | Accuracy in a complex background | Low accuracy in a specific gesture | DHG-14: 97.20%, DHG-28: 96.3%; FHPA: 92.78% | 3D HGR |
R8 | CNN, LSTM, Attention Mechanism | Accuracy | Zero-shot learning | 74.6%, 67.2%, 68.8%, 60.2% on RKS-PERSIANSIGN, First-Person, ASLVID, ChaLearn LAP IsoGD respectively | Sign language recognition, hand posture recognition |
R9 | Attention mechanism | Accuracy, lower computational complexity | Occlusion | SHREC’17: 94.17% | 3D HGR |
R10 | CNN | Insensitivity to changes in light conditions | Trajectory differentiation of multiple hands in the same region | SKIG: 93.1%; Self-created: static 92.8%, fine-grained 86.2%, coarse-grained 91.3% | HGR |
R11 | CNN, Attention mechanism | No need for GPUs, accuracy, efficiency | No multi-hand gesture recognition | NU-DHGR: 98.10%; SKIG: 98.10%; DHG-14: fine 76.10%, coarse 93.44%, all 87.18%; DHG-28: 78.61% | 3D HGR |
R12 | CNN | Accuracy | Influence of gesture-irrelevant factors | ChaLearn LAP IsoGD: 68.14%; SKIG: 100% | HGR |
R13 | CNN | Robust classification, faster response, smaller storage | Issues in more complex context | ChaLearn LAP IsoGD: 65.13%; Jester: 95.3%; BSG2.0: 99% | HGR |
R14 | CNN | Accuracy | Robustness | Sign language digits: 98.40%; Thomas Moeslund’s: 98.09% | Sign language recognition |
R15 | CNN | Classification accuracy | Not suitable for moving camera; losing hand details | Montalbano: 94.58%, GRIT: 98%, ChaLearn LAP IsoGD: 52.18% | HGR |
R16 | CNN | Accuracy, Robustness | Increasing error rates over time | Self-created dataset: Offline 98.09%, Real-time 96.23%, Error 2.07% | HGR |
R17 | CNN, Attention mechanism | Efficiency | Only for a small number of classes | Self-created dataset: 96.92%, Detection time 0.1461 s | Human-robot interaction |
R18 | CNN, Attention mechanism | Accuracy | Computational cost for mobile devices | OUHANDS: 90.9%, HGR1: 83.8% | HGR |
R19 | CNN, Attention mechanism | Reducing error rate, improving runtime | Evaluated only on NYU, not others | NYU: Mean error AA-A2J 8.37 mm; FPS: AA-A2J 151.06, AA-3DA2J 79.62; ICVL: 6.30 mm; MSRA: 8.08 mm; HANDS 2017: 8.27 mm | Hand pose estimation |
R20 | CNN, Attention mechanism | Reducing computation costs, improving accuracy | Attention-based model uncertainty | NVGesture: 83.2%; EgoGesture: 93.8% | HGR |
R21 | CNN, LSTM, Attention, others | Recognition accuracy and precision | Degraded in outdoor environments | EgoGesture: 90.63%; MSR Gesture 3D: 99.24% | HGR, sign language recognition |
R22 | Others | Accuracy | No continuous sentences | Self-created dataset: 98.75% | Sign language recognition |
R23 | CNN, LSTM | Zero/negative delay, efficiency | Failing on embedded platforms | Self-created: 92.06%; NVGesture: 89.8%; SKIG: 99.7% | Hand detection and classification |
R24 | CNN, Others | Robustness | Accuracy | Self-created: 78.2% (clear), 77.1% (blur) | HGR |
R25 | CNN, LSTM | Accuracy | Computational time | MGD: BiLSTM 98.04%, F-BiGRU 99.15%; BUAA: 99.25%; SmartWatch: 97.40% | Mobile HGR |
R26 | CNN, LSTM, Attention | Robustness | Computational efficacy not quantified | Cambridge Hand Gesture: 96%; Sebastien Marcel: 98% | HGR |
R27 | CNN, LSTM | Efficiently classifying | More time-consuming | LeapGestureDB, RIT: 90% | HGR |
R28 | CNN, LSTM, Attention | Accuracy | Duplicate recognition, low mAP, poor segmentation in water | Self-created: 85% | Diver’s gesture recognition |
R29 | CNN | Precision | Computation time | NUSHP-II, Senz-3D, MITI-HD: 99.1–99.99% | HGR |
R30 | CNN, Others | Efficient feature extraction, low storage | Recognition accuracy | Self-created: 98.04%; IsoGD: 71.37%; 762 fps GPU, 37 fps CPU | HGR |
R31 | CNN, LSTM, Attention, Others | Detection accuracy and speed of small objects | Cannot distinguish left/right hand | SHRI-VID: 91.6%; ImageNetVID: 64.7%; AU-AIR: 71.68%, 23fps | Robot-astronauts interaction |
R32 | CNN | Time-efficient, high precision | Inaccurate at detail level | MITI, Senz-3D, NUSHP-IIs: 98.2–99.6% | HGR |
R33 | CNN, Attention mechanism | Accuracy | Delay, error detection in HCI | EgoGestures: RGB 93.93%, Depth 94.14%; SKIG: 99.70%; IsoGD: 69% | HGR |
R34 | CNN, LSTM | Accuracy, reduce misrecognition | Asymmetric gesture mapping | Jester: 95.55%, Time: 17.61 s | HGR |
R35 | LSTM, Others | Low computation, accuracy | Limited to Android/Apple | Self-created: 96.89% in 0.0267 s | Mobile HGR |
R36 | Others | Accuracy | Right-handers only tested | Self-created: Arabic numbers: 96%; English alphabets: 92% | 3D trajectory |
R37 | CNN | Accuracy | Robustness | SKIG: 98.11%; DHG2016: 88.4%; 26-gestures: 99.3% | HGR |
R38 | LSTM, Others | Robust, low cost | Spatio-temporal feature diversity not integrated | DHG-14/28: 96.07%, 94.4% | HGR |
R39 | CNN, Attention mechanism | Accuracy | Only unimodal RGB performs well | SKIG: RGB 97.87%, Depth 98.70%; VIVA: RGB 81.5%, Depth 80.15%; NVGesture: RGB 73.03%, Depth 85.48% | 3D HGR |
R40 | Others | Efficiency, precision | Not orientation invariant; endpoint issues | HA (Self-created): 95% | HGR |
R41 | CNN, LSTM, Others | Classification accuracy | Whole hand vs. finger confusion | NWUHG: 98.6%; SKIG: 99.6%; Cambridge: 99.4%; DHG14/28: 97.8%, 92.1% | HGR |
R42 | CNN | Accuracy | Lower accuracy on similar gestures | NUS II subset A: 97.98%; ASL: 89.38%; NI: 99% | Sign language recognition |
R43 | LSTM, Attention, Others | Reduce compute time, blurring | Noise in trajectory | Multiple datasets: all above 96–99% across bare hand detection and SDT | Trajectory |
R44 | CNN, LSTM | User-friendly, intuitive | Rotation, scale, translation problems | Dataset1: 98.29%, Runtime 1.227 s; Dataset2: 98.56%, Runtime 1.1136 s | HCI |
R45 | CNN, LSTM | Accuracy | No regularization | HG14: 90%; FashionMnist: 93.88%; Cifar-10: 81.42% | HGR |
R46 | CNN, Attention mechanism | Lighting condition robustness | Non-outdoor conditions | Self-created: 97.3%, Error 2.6%, Recognition time 0.013 s | HGR |
R47 | CNN, LSTM | Accuracy, efficiency | No complex signs included | RKS-PERSIAN SIGN: 99.5%; First-Person: 91%; ASVID: 93%; IsoGD: 86.1% | Sign language recognition |
Figure 10. Recognition rate distribution of key stages and underlying models.
In brief, within the high recognition rate range, feature extraction and classification account for the largest share, followed by CNN and model combinations, as shown in Fig. 10. From another perspective, many studies refine feature extraction to improve overall accuracy, which further reflects that CNN contributes the most to enhancing accuracy. Furthermore, methods combining CNN with multiple underlying models tend to select a more diverse set of evaluation parameters.
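To make the two metric families discussed in this review concrete, the sketch below (assuming Python with scikit-learn and a placeholder model exposing a hypothetical predict() method) shows how an accuracy-oriented score and an efficiency-oriented per-frame latency of the kind listed in Table 10 are typically computed; it is a minimal illustration rather than the pipeline of any reviewed study.

```python
# Minimal sketch: one accuracy metric and one efficiency metric per frame.
# `model` is a placeholder with a hypothetical predict() method.
import time
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def evaluate_hgr(model, frames, labels):
    preds, latencies = [], []
    for frame in frames:
        start = time.perf_counter()
        preds.append(model.predict(frame))            # hypothetical call
        latencies.append(time.perf_counter() - start)
    mean_latency = float(np.mean(latencies))
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_f1": f1_score(labels, preds, average="macro"),
        "seconds_per_frame": mean_latency,            # cf. "s per frame" in Table 10
        "fps": 1.0 / mean_latency,
    }
```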
Ablation study
In the deep learning field, an ablation study is a significant experimental method. It is employed to understand and evaluate the contribution of each component in a model to its overall performance. Among the forty-seven articles, nine chose ablation experiments, illustrated in Fig. 11. The details include:
Similar underlying models: R8, R12, R18, R21, R31, R39. Comparing the enhanced underlying model to similar ones demonstrates that the improved model generates the greatest contribution overall.
Integrating specific models: R2, R9, R12, R19, R20, R31. Evaluating the overall efficacy of integrating a certain underlying model.
Pre-processing: R8, R9, R12. Testing the effectiveness of pre-processing techniques such as single or multiple data inputs and data augmentation on overall performance.
Others: Sense (R6); quantity of sub-videos (R21); background and scale suppression, frame-to-frame temporal consistency (R31); adaptive parameter (R39)
Figure 11. Comparison of recognition rates with and without an ablation study.
Furthermore, the recognition rates with and without an ablation study are compared in Fig. 11 to illustrate the effect of conducting ablation experiments.
In brief, few researchers conducted ablation experiments. Most studies contrast their updated models with similar underlying models, and the majority involve multi-model fusion. In addition, analyzing these studies reveals that whether or not an ablation experiment was performed has little overall impact on recognition rate, as shown in Fig. 11.
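For readers unfamiliar with the procedure, the schematic sketch below shows the basic shape of an ablation run: the same pipeline is retrained with one component disabled at a time and compared against the full model. The build_model, train, and evaluate callables and the component names are hypothetical placeholders, not drawn from any reviewed study.

```python
# Schematic ablation loop: the contribution of each component is the drop
# in the chosen metric relative to the full model.
def ablation_study(build_model, train, evaluate, train_data, test_data,
                   components=("attention", "bilstm_branch", "augmentation")):
    results = {}
    full = train(build_model(disable=None), train_data)       # full-model baseline
    results["full_model"] = evaluate(full, test_data)
    for comp in components:                                    # remove one piece at a time
        variant = train(build_model(disable=comp), train_data)
        results[f"without_{comp}"] = evaluate(variant, test_data)
    return results
```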
Comparative evaluation across datasets
This section provides a comprehensive analysis of the impact of dataset attributes, such as size or number of classes, on model performance, and compares the performance of various techniques on different datasets.
Impact of datasets attributes on model performance
Seventy-one datasets were identified across the forty-seven studies. These datasets vary in size, ranging from small ones, such as Sebastien Marcel with 57 samples, to huge ones like ImageNet-VID with a million samples. Overfitting is a problem for small datasets, while large datasets require increased computational resources and training time.
Moreover, both the number of samples and the category diversity of a dataset can directly influence model performance. The accuracy of some models decreases as the number of samples in the dataset increases (R42): the model in R42 achieves 99%, 97.98%, and 89.38% accuracy on NI with 2,000 samples, NUS with 2,700 samples, and ASL with 48,000 samples, respectively. Meanwhile, the accuracy of some models improves when the number of categories is reduced (R4, R5, R12, R13, R30, R31, R33). Many models can achieve accuracy over 90% on small datasets (e.g., R4: 99.63% on SKIG, R12: 100% on SKIG, R15: 98% on GRIT). However, their performance drops to around 50–80% on large and diverse datasets like ChaLearn LAP IsoGD (e.g., R4: 60.27% (RGB) and 57.02% (depth), R12: 68.14%, R15: 52.18%) due to increased intra-class variability and inter-class similarity. Similarly, the performance of the model in R31 improves significantly as the number of categories decreases: it achieves 64.7%, 71.68%, and 91.6% accuracy on ImageNet-VID with 30 categories, AU-AIR with eight categories, and SHRI-VID with three categories, respectively. This is because models trained on datasets with numerous categories face elevated generalization demands, which weakens their accuracy.
Furthermore, datasets with skewed class distributions make it difficult to achieve balanced learning and reliable model evaluation. For example, Indian sign language (ISL) (self-created) contains only 2,150 images spread across 43 categories (R5). HGR1 contains only 899 images spread across 27 categories (R18).
In addition, some models are trained only on small datasets with few classes, so their generalization ability needs further testing. For instance, R26 (Hough transform and artificial neural network) achieves 96% accuracy on the Cambridge Hand Gesture dataset with 900 samples and 98% on Sebastien Marcel with 57 samples.
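One common way to mitigate such skewed class distributions during training is to oversample the rare gesture classes. The minimal sketch below, assuming PyTorch and a dataset object with a hypothetical labels attribute, builds a class-balanced data loader with a weighted sampler; it illustrates the general technique rather than the setup of any reviewed study.

```python
# Class-balanced loading for an imbalanced gesture dataset (PyTorch).
from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler

def balanced_loader(dataset, batch_size=32):
    counts = Counter(dataset.labels)                        # samples per class (hypothetical attribute)
    class_weight = {c: 1.0 / n for c, n in counts.items()}  # rare classes get larger weights
    sample_weights = [class_weight[y] for y in dataset.labels]
    sampler = WeightedRandomSampler(sample_weights,
                                    num_samples=len(sample_weights),
                                    replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```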
Technical evaluation across datasets
This section evaluates the models from two aspects. The first is comparing models on the same dataset, which fairly represents the effect of various techniques on model performance: with no dataset heterogeneity, the effects of algorithm design, strategy optimization, and technical architecture can be compared directly. The second is analyzing the generalization ability of models across different datasets, because a single dataset is specific and cannot reveal how well a model generalizes.
On the same dataset, most models show only minor differences in performance, but there are some exceptions. Comparing methods on ChaLearn LAP IsoGD, the model in R47, which uses CNN-based models with a mathematical technique, is about 20% more accurate than the others. Models with hybrid technologies in R8 and R47 are implemented on the same datasets, yet their accuracies differ by more than 20%. Moreover, the integrated framework based on the covariance matrix in R11 falls around 10% below other methods on DHG-14/28. In addition, models trained, validated, and tested on SKIG and EgoGesture perform excellently.
Evaluating the generalization ability of models involves two parts: cross-dataset evaluation and multi-class evaluation. Many models perform well across different datasets. RetinaNet-DSC in R29 and Hybrid-SSR in R32, both CNN-based models, achieve nearly 100% accuracy on three datasets each. The model in R47 uses CNN-based models with a mathematical technique and obtains high accuracy on three other datasets. Moreover, some models obtain high accuracy on multi-class datasets: the hDNN-SLR framework in R22 achieves 98.75% on the Russian dataset with 1,100 categories, indicating that it can handle complex data.
RQ3: Which research gaps remain in real-time HGR using deep learning?
Real-time HGR requires instant response and recognition accuracy. Although deep learning alleviates some accuracy and efficiency issues, challenges remain.
In this section, data were compiled by reading the introduction of each selected study and by drawing on the analysis from the previous two research questions. The statistics summarized in the spreadsheet were used to generate a pie chart.
Challenges of existing methods
Challenges persist for real-time HGR. From the limitations detailed in Table 10, the main issues center on accuracy in certain situations, such as mismatch, ambiguities, and occlusion, followed by limited application range, low efficiency, poor environmental adaptation, and high consumption cost. These limitations relate to research gaps that exist primarily in acquisition devices, environmental complexity, the effects of hand movements, and the shortcomings of the models themselves.
Accuracy
Unsatisfactory accuracy accounts for the highest percentage of all limitations and arises from sources such as:
Acquisition device: Camera quality impacts the performance of the HGR model (Dubey, 2023), and multi-modal data are somewhat difficult to obtain (Li et al., 2021a).
Feature extraction: Diverse handcrafted features are not combined into spatiotemporal features (Do et al., 2020). Mismatch and overlap problems commonly occur in trajectories, and recognition errors caused by the similarity of hand poses persist (Xiao et al., 2023). Hands move and rotate quickly in real-time systems, and occlusion and geometric ambiguities arise from the self-similar hand and fingers in hand-object interaction (Zhang et al., 2019).
Models: Some models struggle to differentiate hand movements performed in reversed directions (Wang et al., 2023a).
Application range
Although application is the ultimate goal of model design, many models are limited in their applications. For gesture recognition, some models are limited to or fail in online recognition (Tang et al., 2021), on embedded platforms (Lu et al., 2024), or on Android and Apple systems (Hou et al., 2023). In testing, some were evaluated without multi-hand gestures (Fang et al., 2019), with right-handed participants only (Liu et al., 2019), with a small number of classes (Tellaeche Iglesias et al., 2021), or with simple sign samples (Rastgoo, Kiani & Escalera, 2022). For trajectories, it is difficult to distinguish multiple hands in the same region (Zhang, Tian & Zhou, 2018).
Efficiency
Most inefficiency problems are generated by the design of the method or the model itself. For bidirectional long short-term memory (BiLSTM) and bidirectional gated recurrent unit (BiGRU) networks, the computational burden increases because additional neurons are needed to represent bidirectional memory (Li et al., 2018). The HBU-LSTM model, which combines a BiLSTM with a U-LSTM (Ameur, Khalifa & Bouhlel, 2020), and the optical flow algorithm (Zhang, Wang & Lan, 2020) are also very costly.
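The cost of bidirectional memory can be checked directly: a bidirectional recurrent layer holds roughly twice the parameters of a unidirectional layer with the same hidden size. The short sketch below uses PyTorch purely for illustration; the dimensions are arbitrary.

```python
# Parameter count of a unidirectional vs. a bidirectional LSTM layer.
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=256)
bilstm = nn.LSTM(input_size=128, hidden_size=256, bidirectional=True)
print(n_params(lstm), n_params(bilstm))   # the bidirectional layer is roughly twice as large
```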
Environment adaptation
The hand is recognized inaccurately in complex environments including changes in lighting conditions, partial occlusions, and color variations (Tellaeche Iglesias et al., 2021; Yadav et al., 2022; Dubey, 2023; Jiang et al., 2021; Jain, Karsh & Barbhuiya, 2022; Haroon et al., 2022; Jingchun, Su & Sunar, 2024).
Consumption cost
The size of the input data influences recognition performance: higher resolution enhances recognition performance but consumes more memory. Thus, models must be tested at the edge within memory limits while maintaining classification accuracy (Liu & Liu, 2023).
Datasets
The datasets are fragmented, as shown in Table 9, lacking standardization and weakening model generalization. Many existing categorical datasets have significant deficiencies in variety, quantity, and class balance, failing to supply the extensive and diverse data required by deep learning models. Hence, generalization ability is limited.
Imbalance between accuracy and efficiency
Many researchers have endeavored to balance accuracy and efficiency, yet the gains in efficiency are typically achieved only at the cost of some accuracy.
Future research direction
Based on the analysis of the research gaps, potential research directions are summarized into the following aspects (among others):
- Spatiotemporal feature extraction. In the context of real-time HGR, the continuity and precision of spatiotemporal features play a pivotal role in addressing challenges associated with hand movements. These include issues such as motion blur, occlusion, and overlapping gestures caused by rapid hand movement. Researchers can explore innovative techniques for spatiotemporal feature extraction to improve the robustness and accuracy of HGR systems.
Extending this exploration, efforts can focus on designing a multi-path intersection mechanism with a cross-scale attention fusion module: one path is dedicated to fast finger movements, another captures slow palm and wrist locations, and further paths predict hand postures by applying principles of dynamics. Meanwhile, cross-scale attention is applied to extract local micro-movements and global gesture dynamics.
- Model fusion for multi-modal data. Acknowledging that individual models may have inherent limitations, researchers can investigate the integration of multiple models to achieve complementary functionalities. This approach can significantly enhance system performance by leveraging the strengths of different models. Developing methodologies for efficient model fusion is therefore a promising avenue for advancing multi-modal technology.
Following this line of inquiry, researchers can channel their efforts into enhancing multi-modal fusion by incorporating dynamic sensor calibration techniques. Specifically, a cross-sensor alignment algorithm based on unsupervised learning could automatically calibrate visual, depth, and inertial data in real time, and a hierarchical fusion strategy in which low-level motion cues from inertial sensors refine high-level visual features could improve recognition precision for subtle or occluded gestures.
- A lightweight and efficient neural network. With the widespread adoption of mobile devices for various applications, ensuring a seamless user experience is paramount. Researchers should focus on developing lightweight and efficient neural network architectures for mobile devices. This endeavor aims to enable users to operate machines and interact with digital interfaces effortlessly, particularly through mobile terminals. Investigating efficient model architectures for mobile deployment can pave the way for user-centric HGR applications.
Subsequent initiatives can address developing ultra-efficient neural networks by integrating adaptive sparsity and dynamic pruning during both training and inference. Specifically, structured pruning can be fused with reinforcement learning to remove redundant neuron connections in real-time HGR without compromising accuracy. Further, a multi-scale feature compression module can maintain spatial resolution while minimizing computational cost on resource-constrained mobile devices.
Another idea is to develop energy-efficient gesture recognition systems through model quantization and asynchronous execution. Implementing post-training quantization with mixed-precision techniques reduces memory overhead while retaining performance, and an asynchronous execution framework can process low-priority tasks (e.g., background updates) in parallel so that critical gestures are identified first (a minimal quantization sketch is given after this list).
- Seamless integration with emerging technologies. The rapid evolution of technology unlocks potential opportunities for HGR applications in conjunction with emerging technologies, such as AR, VR, and Internet of Things (IoT). Researchers can explore the seamless integration of HGR with these technologies to create innovative and immersive user experiences. Consider the potential for cross-disciplinary collaborations and holistic system design to unlock the full potential of HGR in these domains.
Continuing along this line of study, researchers can focus their efforts on developing context-aware domain generalization techniques to handle environmental variations, such as complex backgrounds. Meta-learning can be introduced to train models to adapt to unseen scenarios by learning environment-specific parameters. Moreover, contrastive learning can be utilized to capture cross-environment invariant features, ensuring robust performance in indoor, outdoor, and dynamic scenes.
- Normalization and consolidation of datasets. Data quantity, diversity, and the numerical balance of each category are core issues to be addressed in datasets. However, existing HGR datasets are scattered, cluttered, and lack standardization. Expanding upon this research focus, researchers can standardize and integrate similar sign language recognition datasets while supplementing missing gesture categories. Simultaneously, gesture categories can be updated regularly as the corresponding languages evolve. Moreover, researchers can apply various acquisition devices from multiple viewpoints to capture diverse hand features. This can support deep learning models in learning rich features and improving performance.
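As a concrete illustration of the lightweight and efficient neural network direction above, the sketch below applies post-training dynamic quantization in PyTorch to a hypothetical trained recognizer (GestureNet and its input layout are placeholders); actual memory and latency gains depend on the architecture and the target hardware.

```python
# Post-training dynamic quantization sketch: 8-bit weights for Linear/LSTM layers.
import torch
import torch.nn as nn

class GestureNet(nn.Module):                       # stand-in for a trained recognizer
    def __init__(self, num_classes=14):
        super().__init__()
        self.lstm = nn.LSTM(input_size=63, hidden_size=128, batch_first=True)
        self.head = nn.Linear(128, num_classes)

    def forward(self, x):                          # x: (batch, time, 21 keypoints x 3)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])

model = GestureNet().eval()                        # assume weights are already trained
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear, nn.LSTM}, dtype=torch.qint8)
torch.save(quantized.state_dict(), "gesture_net_int8.pt")
```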
Synthesis of findings
Based on the statistical and categorical analysis of the aforementioned data, we summarized the main findings. As Fig. 12 presents, model performance depends on the interconnection of technical aspects, evaluation metrics, datasets, ablation studies, etc.
Figure 12. The main findings.
On the technical dimension, model improvement relies on technological architecture design and enhanced data pre-processing. CNN-based models contribute significant strengths in improving accuracy. Even though multi-model fusion has become a mainstream method, its effect is inconsistent. Much research focuses on enhancing feature extraction techniques, while pre-processing receives comparatively little attention. Nevertheless, data cleaning, denoising, and data augmentation are significant in refining model accuracy. Meanwhile, pre-processing is crucial for solving the imbalance problem in small-sample datasets, which can effectively enhance the stability and predictive ability of the model.
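As an illustration of such pre-processing, the sketch below assembles a typical augmentation pipeline with torchvision (shown here as an assumed toolkit; the parameter values are arbitrary), targeting the rotation, lighting, and scale variations discussed throughout this review.

```python
# Illustrative augmentation pipeline for gesture images.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                   # fast hand rotation
    transforms.ColorJitter(brightness=0.3, contrast=0.3),    # lighting changes
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),     # scale/translation variation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```

Note that horizontal flipping, a common augmentation elsewhere, may be inappropriate for gestures whose meaning depends on the left or right hand.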
Moreover, the datasets are dispersed, with seventy-one datasets involved across the forty-seven studies, and they differ unevenly in size and composition. Dispersed datasets also indicate the limited standardization of some research areas. Compared with the total number of samples, the number of categories carries considerable weight in model performance. Models ordinarily yield comparable performance on the same dataset; in contrast, the performance disparity between models is marked on large, multi-category datasets. Self-created datasets, in particular, are unique, built for specific studies, and the referential value of model performance on large-scale, multi-category self-created datasets is significantly higher than on small-scale ones. Models that perform exceptionally on several large multi-category datasets generally have superior generalization and robustness. Additionally, ablation studies are critical for researchers to understand the impact of model structure, especially in model fusion, yet few studies in the literature have conducted ablation experiments.
In addition, the original aim of designing HGR models is to facilitate widespread applications. However, the reality is that existing models have weak generalization abilities caused by dataset heterogeneity. Generalization ability refers to the performance of models on unseen data. Dispersed datasets, which imply increased disparities among samples, result in weak generalization because models struggle to capture uniform and sufficient features during training. Meanwhile, despite some studies emphasizing the robustness of their models, discrepancies exist between the reported results and the performance across datasets. Few models have been rigorously tested in real-world scenarios.
In brief, improving a model requires a multidimensional method, considering underlying models, module functionality, model structure, dataset attributes, evaluation metrics, application scenarios, etc.
Discussion
In this review, a comprehensive exploration of vision-based deep learning methods and evaluation techniques in the domain of real-time HGR from 2018 to 2024 has been undertaken. The investigation, structured around the model, model evaluation, and research gaps, has not only uncovered existing trends but has also illuminated the way forward.
The predominant inclination among current HGR models towards model fusion and self-created datasets has yielded valuable insights. Single models, each with its inherent strengths and limitations, have demonstrated the power of precision. In contrast, model fusion, while enhancing performance, has posed the challenges of model size and resource consumption. It’s a trade-off that must be navigated. Moreover, the paramount role of carefully chosen datasets in shaping model performance cannot be overstated, highlighting the need for versatile, high-quality data sources to propel the field forward.
Current datasets are disorganized with remarkable divergence in size and number of typologies. Simultaneously, insufficient data support is still lacking in some specific fields. This situation brings significant challenges to model training and evaluation. Optimization of structure and coverage of the dataset significantly contributes to improving the generalization of models, mitigating the effects of inadequate data in specific domains, and providing more reliable data support for subsequent research.
Evaluation methods and ablation studies are indispensable for quantifying holistic model capabilities. The composition of the evaluation metrics is particularly salient: their selection and analysis must be scientific, comprehensive, reasonable, and free of arbitrary preference. The ablation study is the cornerstone for researchers to examine model relevance and composition in depth, and it is especially well suited to model fusion. However, current HGR studies have rarely conducted such experiments.
HGR models have made significant strides, prompting researchers to explore the balance between high accuracy and computational efficiency. The powerful feature extraction capabilities of CNNs have excelled in accuracy, while LSTMs are widely used in continuous gesture recognition due to their advantages in capturing temporal features. However, their reliance on a large amount of training data also increases the demand for computational resources. Emerging models like Transformers and the attention mechanism are gradually transforming the field. Transformers with their self-attention are particularly effective in extracting essential features while maintaining efficiency in large-scale and diverse datasets. The current research trend is to focus on reducing computational costs while enhancing accuracy, driving the development of hybrid architectures that integrate CNNs, LSTMs, and attention mechanisms, etc.
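To illustrate what such a hybrid architecture typically looks like, the schematic PyTorch sketch below chains a small CNN for per-frame spatial features, an LSTM for temporal dynamics, and a simple attention layer for temporal pooling; it is a generic template under these assumptions, not any specific reviewed model.

```python
# Schematic CNN + LSTM + attention hybrid for clip-level gesture classification.
import torch
import torch.nn as nn

class HybridHGR(nn.Module):
    def __init__(self, num_classes, feat_dim=256, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                          # per-frame spatial features
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)                   # per-time-step attention score
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip):                               # clip: (batch, time, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        seq, _ = self.lstm(feats)                          # temporal modelling
        weights = torch.softmax(self.attn(seq), dim=1)     # attention over time
        pooled = (weights * seq).sum(dim=1)
        return self.head(pooled)
```

In practice, the convolutional backbone is usually a pretrained network, and the single attention layer may be replaced by multi-head self-attention, trading accuracy against the computational cost discussed above.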
Yet, beyond these challenges lie immense opportunities. The identified research gaps, including inaccuracy, inefficiency, limited application range, and environmental adaptability, form the foundations for future innovation. The answer lies in the development of seamlessly integrated models, leveraging state-of-the-art technologies, promising not only to overcome these challenges but to extend HGR into uncharted territories. Imagine applications in AR gaming, where HGR can revolutionize user experiences on mobile devices.
Conclusion
This review provides an in-depth and comprehensive analysis of forty-seven studies across three research questions and finds that model performance is intrinsically linked to the underlying model, technical features, model composition, and dataset. CNN-based models are highly accurate, while multi-model fusion yields inconsistent gains. Models perform excellently on self-created datasets. The number of categories in a dataset affects performance more than its size, and multi-category cross-dataset evaluation appraises model generalization and robustness. Other major findings include: more research on feature extraction, less application of pre-processing, dispersed datasets, more self-created datasets, fewer ablation experiments, and weak generalization ability and robustness. These findings inform future research directions such as model enhancement: short-term efforts focus on hybrid underlying models for improved accuracy and efficiency, while long-term goals aim to develop multi-modal AR/VR systems for more immersive interaction. Therefore, this review is not just a collection of insights; it is a road map for progress, and as researchers and practitioners, it is our collective responsibility to embark on this journey. This work is envisioned as a catalyst for future projects, an inspiration to push the boundaries of HGR, and a guide through the intricate landscape of hand gesture recognition. It is also important to recognize the limitations of this review, particularly its scope and depth. Future work will expand its horizons to analyze the pivotal technologies, datasets, and evaluation metrics that have shaped the history of real-time HGR, enabling a more comprehensive contribution to this dynamic field. In conclusion, hand gesture recognition stands at a crossroads: the challenges have been laid bare and the path to progress illuminated. Let this review serve as a compass, propelling researchers to craft innovative solutions and to steer real-time HGR towards a future where the possibilities are boundless.
Supplemental Information
Funding Statement
This work was supported by the Ministry of Higher Education Malaysia under the Fundamental Research Grant Scheme [FRGS/1/2022/ICT10/UTM/02/1]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Additional Information and Declarations
Competing Interests
The authors declare that they have no competing interests.
Author Contributions
Cui Cui conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, reviewed drafts of the article, and approved the final draft.
Mohd Shahrizal Sunar conceived and designed the experiments, reviewed drafts of the article, supervision, and approved the final draft.
Goh Eg Su reviewed drafts of the article, and approved the final draft.
Data Availability
The following information was supplied regarding data availability:
This is a literature review.
References
- Al Farid et al. (2022).Al Farid F, Hashim N, Abdullah J, Bhuiyan MR, Shahida Mohd Isa WN, Uddin J, Haque MA, Husen MN. A structured and methodological review on vision-based hand gesture recognition system. Journal of Imaging. 2022;8(6):153. doi: 10.3390/jimaging8060153. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Al-Shamayleh et al. (2018).Al-Shamayleh AS, Ahmad R, Abushariah MA, Alam KA, Jomhari N. A systematic literature review on vision based gesture recognition techniques. Multimedia Tools and Applications. 2018;77(21):28121–28184. doi: 10.1007/s11042-018-5971-z. [DOI] [Google Scholar]
- Ameur, Khalifa & Bouhlel (2020).Ameur S, Khalifa AB, Bouhlel MS. A novel hybrid bidirectional unidirectional lstm network for dynamic hand gesture recognition with leap motion. Entertainment Computing. 2020;35(2):100373. doi: 10.1016/j.entcom.2020.100373. [DOI] [Google Scholar]
- Balaji & Prusty (2024).Balaji P, Prusty MR. Multimodal fusion hierarchical self-attention network for dynamic hand gesture recognition. Journal of Visual Communication and Image Representation. 2024;98(3):104019. doi: 10.1016/j.jvcir.2023.104019. [DOI] [Google Scholar]
- Bjørnson & Dingsøyr (2008).Bjørnson FO, Dingsøyr T. Knowledge management in software engineering: a systematic review of studied concepts, findings and research methods used. Information and Software Technology. 2008;50(11):1055–1068. doi: 10.1016/j.infsof.2008.03.006. [DOI] [Google Scholar]
- Bose & Kumar (2022).Bose SR, Kumar VS. In-situ recognition of hand gesture via enhanced xception based single-stage deep convolutional neural network. Expert Systems with Applications. 2022;193(4):116427. doi: 10.1016/j.eswa.2021.116427. [DOI] [Google Scholar]
- Cao, Li & Shin (2022).Cao Z, Li Y, Shin B-S. Content-adaptive and attention-based network for hand gesture recognition. Applied Sciences. 2022;12(4):2041. doi: 10.3390/app12042041. [DOI] [Google Scholar]
- Chakraborty et al. (2018).Chakraborty BK, Sarma D, Bhuyan MK, MacDorman KF. Review of constraints on vision-based gesture recognition for human–computer interaction. IET Computer Vision. 2018;12(1):3–15. doi: 10.1049/iet-cvi.2017.0052. [DOI] [Google Scholar]
- Chen et al. (2023).Chen G, Dong Z, Wang J, Xia L. Parallel temporal feature selection based on improved attention mechanism for dynamic gesture recognition. Complex & Intelligent Systems. 2023;9(2):1377–1390. doi: 10.1007/s40747-022-00858-8. [DOI] [Google Scholar]
- Cui, Sunar & Su (2024).Cui C, Sunar MS, Su GE. A dataset of egocentric and exocentric view hands in interactive senses. Data in Brief. 2024;57:111003. doi: 10.1016/j.dib.2024.111003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- De Smedt, Wannous & Vandeborre (2016).De Smedt Q, Wannous H, Vandeborre J-P. Skeleton-based dynamic hand gesture recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; Piscataway: IEEE; 2016. pp. 1206–1214. [DOI] [Google Scholar]
- Do et al. (2020).Do N-T, Kim S-H, Yang H-J, Lee G-S. Robust hand shape features for dynamic hand gesture recognition using multi-level feature lstm. Applied Sciences. 2020;10(18):6293. doi: 10.3390/app10186293. [DOI] [Google Scholar]
- dos Santos, Samatelo & Vassallo (2020).dos Santos CC, Samatelo JLA, Vassallo RF. Dynamic gesture recognition by using CNNs and star RGB: a temporal information condensation. Neurocomputing. 2020;400(3):238–254. doi: 10.1016/j.neucom.2020.03.038. [DOI] [Google Scholar]
- Dubey (2023).Dubey AK. Enhanced hand-gesture recognition by improved beetle swarm optimized probabilistic neural network for human-computer interaction. Journal of Ambient Intelligence and Humanized Computing. 2023;14(9):12035–12048. doi: 10.1007/s12652-022-03753-9. [DOI] [Google Scholar]
- Fang et al. (2019).Fang L, Wu G, Kang W, Wu Q, Wang Z, Feng DD. Feature covariance matrix-based dynamic hand gesture recognition. Neural Computing and Applications. 2019;31(12):8533–8546. doi: 10.1007/s00521-018-3719-3. [DOI] [Google Scholar]
- Guler & Yucedag (2022).Guler O, Yucedag I. Hand gesture recognition from 2D images by using convolutional capsule neural networks. Arabian Journal for Science and Engineering. 2022;47(2):1211–1225. doi: 10.1007/s13369-021-05867-2. [DOI] [Google Scholar]
- Haroon et al. (2022).Haroon M, Altaf S, Ahmad S, Zaindin M, Huda S, Iqbal S. Hand gesture recognition with symmetric pattern under diverse illuminated conditions using artificial neural network. Symmetry. 2022;14(10):2045. doi: 10.3390/sym14102045. [DOI] [Google Scholar]
- He et al. (2016).He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Piscataway: IEEE; 2016. pp. 770–778. [DOI] [Google Scholar]
- Hou et al. (2023).Hou X, Cen S, Zhang M, Jian C. Koa-CLSTM-based real-time dynamic hand gesture recognition on mobile terminal. Signal, Image and Video Processing. 2023;17(5):1841–1854. doi: 10.1007/s11760-022-02395-w. [DOI] [Google Scholar]
- Huang et al. (2017).Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Piscataway: IEEE; 2017. pp. 2261–2269. [DOI] [Google Scholar]
- Huang et al. (2023).Huang G, Tran SN, Bai Q, Alty J. Real-time automated detection of older adults’ hand gestures in home and clinical settings. Neural Computing and Applications. 2023;35(11):8143–8156. doi: 10.1007/s00521-022-08090-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jain, Karsh & Barbhuiya (2022).Jain R, Karsh RK, Barbhuiya AA. Encoded motion image-based dynamic hand gesture recognition. The Visual Computer. 2022;38(6):1957–1974. doi: 10.1007/s00371-021-02259-3. [DOI] [Google Scholar]
- Jiang et al. (2021).Jiang Y, Zhao M, Wang C, Wei F, Wang K, Qi H. Diver’s hand gesture recognition and segmentation for human–robot interaction on auv. Signal, Image and Video Processing. 2021;15(8):1899–1906. doi: 10.1007/s11760-021-01930-5. [DOI] [Google Scholar]
- Jingchun, Su & Sunar (2024).Jingchun Z, Su GE, Sunar MS. Low-light image enhancement: a comprehensive review on methods, datasets and evaluation metrics. Journal of King Saud University-Computer and Information Sciences. 2024;36(10):102234. doi: 10.1016/j.jksuci.2024.102234. [DOI] [Google Scholar]
- Kaur & Bansal (2022).Kaur A, Bansal S. Deep learning for dynamic hand gesture recognition: applications, challenges and future scope. 2022 5th International Conference on Multimedia, Signal Processing and Communication Technologies (IMPACT); Piscataway: IEEE; 2022. pp. 1–6. [DOI] [Google Scholar]
- Kitchenham et al. (2009).Kitchenham B, Brereton OP, Budgen D, Turner M, Bailey J, Linkman S. Systematic literature reviews in software engineering–a systematic literature review. Information and Software Technology. 2009;51(1):7–15. doi: 10.1016/j.infsof.2008.09.009. [DOI] [Google Scholar]
- Krizhevsky, Sutskever & Hinton (2017).Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Communications of the ACM. 2017;60(6):84–90. doi: 10.1145/3065386. [DOI] [Google Scholar]
- Lanza et al. (2023).Lanza B, Ferlinghetti E, Nuzzi C, Sani L, Garinei A, Maiorfi L, Naso S, Piccioni E, Bianchi F, Proietti M. Gesture recognition for healthcare 4.0: a machine learning approach to reduce clinical infection risks. 2023 IEEE International Workshop on Metrology for Industry 4.0 & IoT (MetroInd4. 0&IoT); Piscataway: IEEE; 2023. pp. 326–331. [DOI] [Google Scholar]
- Lavallée, Robillard & Mirsalari (2013).Lavallée M, Robillard P-N, Mirsalari R. Performing systematic literature reviews with novices: an iterative approach. IEEE Transactions on Education. 2013;57(3):175–181. doi: 10.1109/TE.2013.2292570. [DOI] [Google Scholar]
- Li et al. (2021a).Li J, Liu R, Kong D, Wang S, Wang L, Yin B, Gao R. Attentive 3D-ghost module for dynamic hand gesture recognition with positive knowledge transfer. Computational Intelligence and Neuroscience. 2021a;2021(1):1–12. doi: 10.1155/2021/5044916. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li et al. (2019).Li Y, Miao Q, Qi X, Ma Z, Ouyang W. A spatiotemporal attention-based ResC3D model for large-scale gesture recognition. Machine Vision and Applications. 2019;30(5):875–888. doi: 10.1007/s00138-018-0996-x. [DOI] [Google Scholar]
- Li et al. (2021b).Li L, Qin S, Lu Z, Zhang D, Xu K, Hu Z. Real-time one-shot learning gesture recognition based on lightweight 3D inception-resnet with separable convolutions. Pattern Analysis and Applications. 2021b;24(3):1173–1192. doi: 10.1007/s10044-021-00965-1. [DOI] [Google Scholar]
- Li et al. (2018).Li C, Xie C, Zhang B, Chen C, Han J. Deep fisher discriminant learning for mobile hand gesture recognition. Pattern Recognition. 2018;77(12):276–288. doi: 10.1016/j.patcog.2017.12.023. [DOI] [Google Scholar]
- Liu & Liu (2023).Liu H, Liu Z. A multimodal dynamic hand gesture recognition based on radar–vision fusion. IEEE Transactions on Instrumentation and Measurement. 2023;72:1–15. doi: 10.1109/tim.2023.3253906. [DOI] [Google Scholar]
- Liu et al. (2019).Liu F, Zeng W, Yuan C, Wang Q, Wang Y. Kinect-based hand gesture recognition using trajectory information, hand motion dynamics and neural networks. Artificial Intelligence Review. 2019;52(1):563–583. doi: 10.1007/s10462-019-09703-w. [DOI] [Google Scholar]
- Liu & Shao (2013).Liu L, Shao L. Learning discriminative representations from RGB-D video data. Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI) 2013;1:3. doi: 10.5555/2540128.2540343. [DOI] [Google Scholar]
- Lu et al. (2019).Lu Z, Qin S, Li L, Zhang D, Xu K, Hu Z. One-shot learning hand gesture recognition based on lightweight 3D convolutional neural networks for portable applications on mobile systems. IEEE Access. 2019;7:131732–131748. doi: 10.1109/access.2019.2940997. [DOI] [Google Scholar]
- Lu et al. (2024).Lu Z, Qin S, Lv P, Sun L, Tang B. Real-time continuous detection and recognition of dynamic hand gestures in untrimmed sequences based on end-to-end architecture with 3D densenet and lstm. Multimedia Tools and Applications. 2024;83(6):16275–16312. doi: 10.1007/s11042-023-16130-1. [DOI] [Google Scholar]
- Materzynska et al. (2019).Materzynska J, Berger G, Bax I, Memisevic R. The jester dataset: a large-scale video dataset of human gesture. Proceedings of the IEEE International Conference on Computer Vision Workshops; Piscataway: IEEE; 2019. pp. 2874–2882. [DOI] [Google Scholar]
- Mohamed, Mustafa & Jomhari (2021).Mohamed N, Mustafa MB, Jomhari N. A review of the hand gesture recognition system: current progress and future directions. IEEE Access. 2021;9:157422–157436. doi: 10.1109/access.2021.3129650. [DOI] [Google Scholar]
- Moysiadis et al. (2022).Moysiadis V, Katikaridis D, Benos L, Busato P, Anagnostis A, Kateris D, Pearson S, Bochtis D. An integrated real-time hand gesture recognition framework for human–robot interaction in agriculture. Applied Sciences. 2022;12(16):8160. doi: 10.3390/app12168160. [DOI] [Google Scholar]
- Naik et al. (2022).Naik V, Chebolu A, Chavan J, Chaudhari P, Chugh S, Memon A. The evolution of military operations: artificial intelligence to detect hand gestures in defence. International Journal of Computational Intelligence Studies. 2022;11(2):94–112. doi: 10.1504/ijcistudies.2022.126906. [DOI] [Google Scholar]
- Ng et al. (2022).Ng M-Y, Chng C-B, Koh W-K, Chui C-K, Chua MC-H. An enhanced self-attention and a2j approach for 3D hand pose estimation. Multimedia Tools and Applications. 2022;81(29):41661–41676. doi: 10.1007/s11042-021-11020-w. [DOI] [Google Scholar]
- Nogales & Benalcázar (2021).Nogales RE, Benalcázar ME. Hand gesture recognition using machine learning and infrared information: a systematic literature review. International Journal of Machine Learning and Cybernetics. 2021;12(10):2859–2886. doi: 10.1007/s13042-021-01372-y. [DOI] [Google Scholar]
- Oudah, Al-Naji & Chahl (2020).Oudah M, Al-Naji A, Chahl J. Hand gesture recognition based on computer vision: a review of techniques. Journal of Imaging. 2020;6(8):73. doi: 10.3390/jimaging6080073. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ozcan & Basturk (2019).Ozcan T, Basturk A. Transfer learning-based convolutional neural networks with heuristic optimization for hand gesture recognition. Neural Computing and Applications. 2019;31(12):8955–8970. doi: 10.1007/s00521-019-04427-y. [DOI] [Google Scholar]
- Page et al. (2020).Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE, Chou R, Glanville J, Grimshaw JM, Hróbjartsson A, Lalu MM, Li T, Loder EW, Mayo-Wilson E, McDonald S, McGuinness LA, Stewart LA, Thomas J, Tricco AC, Welch VA, Whiting P, Moher D. The prisma 2020 statement: an updated guideline for reporting systematic reviews. International Journal of Surgery. 2020;88(5):105906. doi: 10.1136/bmj.n71. [DOI] [PubMed] [Google Scholar]
- Patil & Subbaraman (2019).Patil AR, Subbaraman S. A spatiotemporal approach for vision-based hand gesture recognition using hough transform and neural network. Signal, Image and Video Processing. 2019;13(2):413–421. doi: 10.1007/s11760-018-1370-1. [DOI] [Google Scholar]
- Rahim, Shin & Islam (2020).Rahim MA, Shin J, Islam MR. Hand gesture recognition-based non-touch character writing system on a virtual keyboard. Multimedia Tools and Applications. 2020;79(17):11813–11836. doi: 10.1007/s11042-019-08448-6. [DOI] [Google Scholar]
- Rajalakshmi et al. (2023).Rajalakshmi E, Elakkiya R, Subramaniyaswamy V, Alexey LP, Mikhail G, Bakaev M, Kotecha K, Gabralla LA, Abraham A. Multi-semantic discriminative feature learning for sign gesture recognition using hybrid deep neural architecture. IEEE Access. 2023;11:2226–2238. doi: 10.1109/access.2022.3233671. [DOI] [Google Scholar]
- Rastgoo, Kiani & Escalera (2022).Rastgoo R, Kiani K, Escalera S. Real-time isolated hand sign language recognition using deep networks and SVD. Journal of Ambient Intelligence and Humanized Computing. 2022;13(1):591–611. doi: 10.1007/s12652-021-02920-8. [DOI] [Google Scholar]
- Rastgoo et al. (2024).Rastgoo R, Kiani K, Escalera S, Sabokrou M. Multi-modal zero-shot dynamic hand gesture recognition. Expert Systems with Applications. 2024;247(106616):123349. doi: 10.1016/j.eswa.2024.123349. [DOI] [Google Scholar]
- Rubin Bose & Sathiesh Kumar (2021).Rubin Bose S, Sathiesh Kumar V. In-situ identification and recognition of multi-hand gestures using optimized deep residual network. Journal of Intelligent & Fuzzy Systems. 2021;41(6):6983–6997. doi: 10.3233/jifs-210875. [DOI] [Google Scholar]
- Sarma & Bhuyan (2021).Sarma D, Bhuyan MK. Methods, databases and recent advancement of vision-based hand gesture recognition for hci systems: a review. SN Computer Science. 2021;2(6):436. doi: 10.1007/s42979-021-00827-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sha et al. (2020).Sha J, Ma J, Mou H, Hou J. A review of vision based dynamic hand gestures recognition. Computer Science and Application. 2020;10:990–1001. doi: 10.12677/CSA.2020.105102. [DOI] [Google Scholar]
- Shanmugam & Narayanan (2024).Shanmugam S, Narayanan RS. An accurate estimation of hand gestures using optimal modified convolutional neural network. Expert Systems with Applications. 2024;249(6):123351. doi: 10.1016/j.eswa.2024.123351. [DOI] [Google Scholar]
- Sharma & Singh (2021).Sharma S, Singh S. Vision-based hand gesture recognition using deep learning for the interpretation of sign language. Expert Systems with Applications. 2021;182(3):115657. doi: 10.1016/j.eswa.2021.115657. [DOI] [Google Scholar]
- Simonyan & Zisserman (2014).Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 2014. ArXiv preprint. [DOI]
- Su et al. (2022).Su GE, Zubir NS, Zakaria NH, Ahmad J. Handheld augmented reality application for 3D fruits learning. International Journal of Innovative Computing. 2022;12(2):69–75. doi: 10.11113/ijic.v12n2.378. [DOI] [Google Scholar]
- Szegedy et al. (2015).Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Piscataway: IEEE; 2015. pp. 1–9. [DOI] [Google Scholar]
- Tang et al. (2018).Tang J, Cheng H, Zhao Y, Guo H. Structured dynamic time warping for continuous hand trajectory gesture recognition. Pattern Recognition. 2018;80(5):21–31. doi: 10.1016/j.patcog.2018.02.011. [DOI] [Google Scholar]
- Tang et al. (2021).Tang X, Yan Z, Peng J, Hao B, Wang H, Li J. Selective spatiotemporal features learning for dynamic gesture recognition. Expert Systems with Applications. 2021;169:114499. doi: 10.1016/j.eswa.2020.114499. [DOI] [Google Scholar]
- Taranta et al. (2021).Taranta EM, II, Pittman CR, Maghoumi M, Maslych M, Moolenaar YM, Laviola JJ., Jr Machete: easy, efficient, and precise continuous custom gesture segmentation. ACM Transactions on Computer-Human Interaction (TOCHI) 2021;28(1):1–46. doi: 10.1145/3428068. [DOI] [Google Scholar]
- Tellaeche Iglesias et al. (2021).Tellaeche Iglesias A, Fidalgo Astorquia I, Vázquez Gómez JI, Saikia S. Gesture-based human machine interaction using rcnns in limited computation power devices. Sensors. 2021;21(24):8202. doi: 10.3390/s21248202. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tian et al. (2023).Tian H, Lee GA, Bai H, Billinghurst M. Using virtual replicas to improve mixed reality remote collaboration. IEEE Transactions on Visualization and Computer Graphics. 2023;29(5):2785–2795. doi: 10.1109/tvcg.2023.3247113. [DOI] [PubMed] [Google Scholar]
- Tran et al. (2015).Tran D, Bourdev L, Fergus R, Torresani L, Paluri M. Learning spatiotemporal features with 3D convolutional networks. Proceedings of the IEEE International Conference on Computer Vision; Piscataway: IEEE; 2015. pp. 4489–4497. [DOI] [Google Scholar]
- Tran et al. (2017).Tran D, Ray J, Shou Z, Chang S-F, Paluri M. Convnet architecture search for spatiotemporal feature learning. 2017. ArXiv preprint. [DOI]
- Verma (2022).Verma B. A two stream convolutional neural network with bi-directional gru model to classify dynamic hand gesture. Journal of Visual Communication and Image Representation. 2022;87(3):103554. doi: 10.1016/j.jvcir.2022.103554. [DOI] [Google Scholar]
- Verma & Choudhary (2020).Verma B, Choudhary A. Grassmann manifold based dynamic hand gesture recognition using depth data. Multimedia Tools and Applications. 2020;79(3):2213–2237. doi: 10.1007/s11042-019-08266-w. [DOI] [Google Scholar]
- Vuletic et al. (2019).Vuletic T, Duffy A, Hay L, McTeague C, Campbell G, Grealy M. Systematic literature review of hand gestures used in human computer interaction interfaces. International Journal of Human-Computer Studies. 2019;129:74–94. doi: 10.1016/j.ijhcs.2019.03.011. [DOI] [Google Scholar]
- Wang (2022).Wang Z. Automatic and robust hand gesture recognition by SDD features based model matching. Applied Intelligence. 2022;52(10):11288–11299. doi: 10.1007/s10489-021-02933-y. [DOI] [Google Scholar]
- Wang et al. (2023b).Wang Z, Zhang X, Li L, Zhou Y, Lu Z, Dai Y, Liu C, Su Z, Bai X, Billinghurst M. Evaluating visual encoding quality of a mixed reality user interface for human–machine co-assembly in complex operational terrain. Advanced Engineering Informatics. 2023b;58(1):102171. doi: 10.1016/j.aei.2023.102171. [DOI] [Google Scholar]
- Wang et al. (2023a).Wang S, Zhang S, Zhang X, Geng Q. A two-branch hand gesture recognition approach combining atrous convolution and attention mechanism. The Visual Computer. 2023a;39(10):4487–4500. doi: 10.1007/s00371-022-02602-2. [DOI] [Google Scholar]
- Xia et al. (2019).Xia Z, Lei Q, Yang Y, Zhang H, He Y, Wang W, Huang M. Vision-based hand gesture recognition for human-robot collaboration: a survey. 2019 5th International Conference on Control, Automation and Robotics (ICCAR); Piscataway: IEEE; 2019. pp. 198–205. [DOI] [Google Scholar]
- Xiao et al. (2023).Xiao Y, Liu T, Han Y, Liu Y, Wang Y. Realtime recognition of dynamic hand gestures in practical applications. ACM Transactions on Multimedia Computing, Communications and Applications. 2023;20(2):1–17. doi: 10.1145/3561822. [DOI] [Google Scholar]
- Xiong et al. (2021).Xiong D, Zhang D, Zhao X, Zhao Y. Deep learning for EMG-based human-machine interaction: a review. IEEE/CAA Journal of Automatica Sinica. 2021;8(3):512–533. doi: 10.1109/jas.2021.1003865. [DOI] [Google Scholar]
- Yadav et al. (2022).Yadav KS, Kirupakaran AM, Laskar RH, Bhuyan MK, Khan T. Design and development of a vision-based system for detection, tracking and recognition of isolated dynamic bare hand gesticulated characters. Expert Systems. 2022;39(7):e12970. doi: 10.1111/exsy.12970. [DOI] [Google Scholar]
- Yasen & Jusoh (2019).Yasen M, Jusoh S. A systematic review on hand gesture recognition techniques, challenges and applications. PeerJ Computer Science. 2019;5:e218. doi: 10.7717/peerj-cs.218. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yu et al. (2021).Yu J, Gao H, Zhou D, Liu J, Gao Q, Ju Z. Deep temporal model-based identity-aware hand detection for space human–robot interaction. IEEE Transactions on Cybernetics. 2021;52(12):13738–13751. doi: 10.1109/tcyb.2021.3114031. [DOI] [PubMed] [Google Scholar]
- Yusof et al. (2016).Yusof CS, Bai H, Billinghurst M, Sunar MS. A review of 3D gesture interaction for handheld augmented reality. Jurnal Teknologi. 2016;78(2–2):15–20. doi: 10.11113/jt.v78.6923. [DOI] [Google Scholar]
- Zhang et al. (2019).Zhang H, Bo Z-H, Yong J-H, Xu F. Interactionfusion: real-time reconstruction of hand poses and deformable objects in hand-object interactions. ACM Transactions on Graphics (TOG) 2019;38(4):1–11. doi: 10.1145/3306346.3322998. [DOI] [Google Scholar]
- Zhang, Tian & Zhou (2018).Zhang Z, Tian Z, Zhou M. Handsense: smart multimodal hand gesture recognition based on deep neural networks. Journal of Ambient Intelligence and Humanized Computing. 2018;15(2):1–16. doi: 10.1007/s12652-018-0989-7. [DOI] [Google Scholar]
- Zhang, Wang & Lan (2020).Zhang W, Wang J, Lan F. Dynamic hand gesture recognition based on short-term sampling neural networks. IEEE/CAA Journal of Automatica Sinica. 2020;8(1):110–120. doi: 10.1109/jas.2020.1003465. [DOI] [Google Scholar]
- Zulpukharkyzy Zholshiyeva et al. (2021).Zulpukharkyzy Zholshiyeva L, Kokenovna Zhukabayeva T, Turaev S, Aimambetovna Berdiyeva M, Tokhtasynovna Jambulova D. Hand gesture recognition methods and applications: a literature survey. The 7th International Conference on Engineering & MIS 2021; 2021. pp. 1–8. [DOI] [Google Scholar]