Enhanced feature fusion with hand gesture recognition system for sign language accessibility to aid hearing and speech impaired individuals

Najm Alotaibi; Reham Al-Dayil; Nojood O Aljehane; Mohammed Rizwanullah

doi:10.1038/s41598-025-34100-5

2026 Jan 2;16:3998. doi: 10.1038/s41598-025-34100-5

PMCID: PMC12855952 PMID: 41484456

Abstract

In the modern world, individuals with intellectual or communication disabilities face significant challenges in communicating with others. To reduce these difficulties, communication systems are designed and developed to convert sign language into text and speech. Dynamic hand gesture recognition (HGR) is a preferred option that focuses on human–computer interaction (HCI), and HGR research is attracting increasing attention globally. Beyond everyday applications, gesture recognition (GR) is beginning to enter education, virtual reality, automotive systems, and mobile devices. Owing to the rapid growth of artificial intelligence (AI), computer vision (CV)-based GR systems have recently become one of the most extensively researched fields. This paper presents a Feature Fusion-based Hand Gesture Recognition for Sign Language Accessibility using the Tornado Optimisation Algorithm (FFHGR-SLATOA) model to aid hearing- and speech-impaired people. The aim is to develop a deep learning-based HGR model that enhances communication accessibility for hearing- and speech-impaired individuals. The image pre-processing stage applies median filtering (MF) to improve image quality by removing noise. Next, a fusion of ConvNeXt Base, VGG16, and EfficientNet-V2 is employed for feature extraction. The FFHGR-SLATOA approach then employs a deep belief network (DBN) for classification. Finally, the tornado optimization algorithm (TOA) is used for parameter tuning. The experimental analysis of the FFHGR-SLATOA approach is performed on the GR dataset. In the comparison study, the FFHGR-SLATOA approach achieved a superior accuracy of 99.14% over existing models.

Keywords: Hand gesture recognition, Sign language accessibility, Tornado optimization algorithm, Fusion feature extraction, Image pre-processing

Subject terms: Computational biology and bioinformatics, Mathematics and computing

Introduction

Across the globe, 466 million individuals are hearing- and speech-impaired, of whom 34 million are children¹. The World Health Organization (WHO) reports that this number might rise to 900 million by 2050. Causes of hearing impairment include genetic factors and birth-related issues. Sign language helps people in the signing community connect with the general population². Within a family, a hearing- and speech-impaired person may use unique methods to communicate, so standard sign language gestures are not necessary; when talking with hearing- and speech-impaired people more generally, however, one should use the standard sign language gestures³. In such a setting, the interaction resembles human–computer interaction (HCI). In sign language recognition, acquiring appropriate input data is highly challenging due to numerous factors, including the environment and complications in the sign language data⁴. In everyday life, people communicate through speech and use gestures to guide, indicate, and emphasize their points. Gestures are more suitable and natural for HCI, creating a stronger link between humans and machines. For many hearing-impaired and deaf people, sign language is their main way of expression and forms a deep part of their cultural and social identity⁵.

HGR is employed in human–robot interaction (HRI) to build interfaces that are user-friendly and accessible to beginners. Sensors used for HGR comprise external sensors, such as video cameras, and wearable devices, such as data gloves⁶. Data gloves can deliver precise measurements of movement and hand posture; however, they need extensive calibration, limit natural hand motion, and are costly. Video-based GR avoids these problems but introduces a new one: finding the hand and separating it from the background in a sequence of images is challenging, specifically when there are variations in lighting, obstructions, fast movements, or skin-colored objects in the background⁷. Creating HGR systems, such as sign language applications, is crucial for addressing the communication barrier with individuals who are unfamiliar with sign language⁸. Technology that automatically interprets hand gestures into audible speech or text that a non-signing person can understand can help to reduce this barrier. Because of significant advances in camera technology and AI, CV-assisted gesture detection systems have become a commonly studied research area⁹. Deep learning (DL) techniques have rapidly received considerable attention from academia and industry, as they are powerful and have attained better performance in the HGR area¹⁰.

This paper presents a Feature Fusion-based Hand Gesture Recognition for Sign Language Accessibility using the Tornado Optimisation Algorithm (FFHGR-SLATOA) model to aid hearing- and speech-impaired people. The aim is to develop a deep learning-based HGR model that enhances communication accessibility for hearing- and speech-impaired individuals. The image pre-processing stage applies median filtering (MF) to improve image quality by removing noise. Next, a fusion of ConvNeXt Base, VGG16, and EfficientNet-V2 is employed for feature extraction. The FFHGR-SLATOA approach then employs a deep belief network (DBN) for classification. Finally, the tornado optimization algorithm (TOA) is used for parameter tuning. The experimental analysis of the FFHGR-SLATOA approach is performed on the GR dataset. The key contributions of the FFHGR-SLATOA approach are listed below.

The FFHGR-SLATOA method improves input quality by applying the MF model for image pre-processing, effectively mitigating noise. This step ensures cleaner data for subsequent processing, contributing to an enhanced feature extraction and classification accuracy within the GR process.
The FFHGR-SLATOA technique integrates a fusion of ConvNeXt Base, VGG16, and EfficientNet-V2 models for extracting rich and diverse features from gesture inputs, enabling more robust and discriminative representations. This fusion significantly improves the model’s ability to capture intrinsic spatial patterns, resulting in higher recognition accuracy.
The FFHGR-SLATOA approach implements the DBN technique for robust and hierarchical classification of gesture inputs, effectively capturing intricate feature relationships. This improves the decision-making capability of the model, contributing to more accurate and reliable GR across diverse input variations.
The FFHGR-SLATOA model uses the TOA technique to fine-tune the parameters of the model, thus enhancing overall accuracy and convergence. This optimization improves the adaptability of the model and plays a crucial part in improving performance across varying GR scenarios.
The novelty of the FFHGR-SLATOA methodology lies in its integration of multi-CNN feature fusion, DBN-based hierarchical classification, and TOA-driven parameter optimization. This integration allows for effective extraction, classification, and optimization, and the combination is not commonly explored in prior GR studies. This results in an efficient, scalable, and accurate end-to-end recognition framework.

Section "Related works" presents the related works in the field of hand GR for hearing- and speech-impaired individuals. Section "The proposed methodology" describes the proposed methodology, followed by Section "Experimental validation", which outlines the experimental validation of the approach. Finally, Section "Conclusion" concludes the study with a summary of the findings and future directions.

Related works

Alabduallah et al.¹¹ presented a new Sign Language Recognition through Hand Pose alongside a Hybrid Metaheuristic Optimiser Algorithm in DL (SLRHP-HMOADL) method for hearing-impaired persons. The method focuses on HGR to enhance the efficacy and precision of sign language understanding for deaf individuals. Alaimahal et al.¹² proposed a solution that uses DL models, specifically long short-term memory (LSTM) networks, to address these problems. The method concentrates on estimating and identifying movements made by people with disabilities from consecutive frames of activity. This application of LSTM and DL approaches presents a real-world solution for overcoming communication barriers and helps to advance accessible technologies. Singhal et al.¹³ explored the use of vision-based HGR to address communication problems encountered in HCI by people who are deaf or hard of hearing. Emphasizing the barriers presented by sign language interpretation and traditional communication techniques, the study presents the dumb aid phone system as a new solution. Shegokar et al.¹⁴ introduced a new adaptive sign language detection network that merges a CNN with the histogram of oriented gradients (HOG) method. The system bridges the communication gap between hearing and deaf individuals by identifying dynamic and static gestures with a standard web camera. This tool, which is intended to be both cost-effective and accessible, eliminates the need for expensive hardware, permitting broader application.

In¹⁵, an Inverted Residual Network Convolutional Vision Transformer-based Mutation Boosted Tuna Swarm Optimiser (IRNCViT-MBTSO) model is suggested to recognize both hand sign languages. The presented dataset is intended for identifying diverse dynamic names, and the images are pre-processed to enhance the model’s generalization potential and improve image quality. Local features are captured with the help of feature graining, whereas global features are extracted from the pre-processed images by the vision transformer (ViT) algorithm. Tan et al.¹⁶ introduced the stacking of distilled ViT (SDViT) for HGR. First, a pre-trained ViT with a self-attention mechanism is used to identify complex associations among image patches, augmenting its ability to manage higher-order relationships in hand signals. Then, knowledge distillation is applied to compress the ViT model and improve its generalizability. Alyami and Luqman¹⁷ recommended the Swin multi-scale temporal perception (Swin-MSTP) architecture, in which the Swin transformer (Swin T) is employed as the spatial feature extractor to capture clear spatial information and deliver a richer contextual interpretation among SL components in video frames.

The proposed methodology

In this manuscript, an FFHGR-SLATOA technique is proposed to aid hearing- and speech-impaired people. The primary objective is to propose a novel DL-based HGR technique that enhances communication accessibility for hearing- and speech-impaired individuals. It comprises distinct stages of image pre-processing, fusion of transfer learning models, classification, and parameter tuning. Figure 1 illustrates the overall process of the FFHGR-SLATOA model.

Fig. 1 — Overall process of the FFHGR-SLATOA model.

MF-based image pre-processing

Initially, the image pre-processing phase applies MF to improve image quality by eliminating noise¹⁸. MF preserves the edge details that are crucial for accurate hand gesture recognition. Unlike Gaussian or mean filters, it avoids edge blurring, so gesture contours remain sharp for feature extraction. Its simplicity, computational efficiency, and robustness also make it suitable for real-time applications.

MF is a nonlinear digital filtering method commonly used to remove noise from images; it is particularly effective at preserving edges while removing impulse (salt-and-pepper) noise. In HGR, MF is used in the pre-processing stage to improve image quality before feature extraction. By replacing each pixel value with the median of the neighbouring pixel values, the filter smooths the image without blurring significant details such as finger edges or hand contours, which is important for maintaining precise gesture shape detection. MF also helps to reduce background interference and improve segmentation performance. It therefore improves the reliability and robustness of the HGR system, particularly when handling real-time video input or variable lighting conditions.
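
The median replacement described above can be illustrated with a minimal pure-Python sketch. The 3 × 3 window, the edge-replication border policy, and the function name `median_filter` are assumptions made for illustration; the paper does not state the window size or border handling it uses.

```python
from statistics import median


def median_filter(image, size=3):
    """Apply a size x size median filter to a 2D grayscale image (list of rows).

    Border pixels are handled by replicating the nearest edge pixel
    (an assumption; other border policies are equally valid).
    """
    if size % 2 == 0:
        raise ValueError("window size must be odd")
    height, width = len(image), len(image[0])
    half = size // 2
    filtered = [[0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            window = []
            for d_row in range(-half, half + 1):
                for d_col in range(-half, half + 1):
                    # Clamp coordinates so the window never leaves the image.
                    r = min(max(row + d_row, 0), height - 1)
                    c = min(max(col + d_col, 0), width - 1)
                    window.append(image[r][c])
            filtered[row][col] = median(window)
    return filtered


# A flat region with one "salt" pixel (255) next to a vertical edge (10 -> 200).
noisy = [
    [10, 10, 10, 200, 200],
    [10, 255, 10, 200, 200],
    [10, 10, 10, 200, 200],
    [10, 10, 10, 200, 200],
]
clean = median_filter(noisy)
```

In this toy image the isolated 255 pixel is replaced by 10, while the vertical edge between the 10 and 200 regions stays in the same column, which is the edge-preserving behaviour the text attributes to MF.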

Fusion of feature extractor

Next, a fusion of ConvNeXt Base, VGG16, and EfficientNet-V2 is employed for feature extraction¹⁹. The fusion model maximizes the feature learning capability of the technique. Among the fused models, ConvNeXt enables efficient learning of intricate visual patterns while maintaining computational efficiency. VGG16 provides a simple yet effective hierarchical convolutional structure, supporting gradual abstraction and robust spatial feature extraction. EfficientNet-V2 uses MBConv blocks to ensure robust generalization at low computational cost. The fusion operation is accomplished via concatenation, which integrates the high-level features extracted from the prior layers into a unified, discriminative representation. The fusion thereby combines the merits of all three architectures, improving accuracy, robustness, and overall performance in hand gesture recognition.
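
As a minimal sketch of concatenation-based fusion, the snippet below joins three per-image feature vectors into one. The vector lengths (1024, 512, and 1280) and the placeholder values are assumptions chosen for illustration; the paper does not report the dimensionality of the features taken from each backbone, and `fuse_features` is a hypothetical helper name.

```python
def fuse_features(*feature_vectors):
    """Concatenate per-backbone feature vectors into one fused vector."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused


# Placeholder feature vectors standing in for the pooled outputs of each
# backbone (lengths are illustrative assumptions, not values from the paper).
convnext_features = [0.1] * 1024
vgg16_features = [0.2] * 512
efficientnet_features = [0.3] * 1280

fused = fuse_features(convnext_features, vgg16_features, efficientnet_features)
fused_length = len(fused)  # 1024 + 512 + 1280 = 2816
```

Concatenation keeps every backbone's features intact and leaves it to the downstream classifier to weight them, at the cost of a larger input dimension.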

ConvNeXt base model

ConvNeXt Base is a modern CNN architecture that represents a significant advancement in the CV field. Although not as widely recognized as some conventional architectures, it has gained attention for its strong feature extraction abilities and computational efficiency. Notable for its use of grouped convolutions, layer normalization (LN), and stochastic depth regularisation, the model aims to strike a balance between performance and complexity. It has shown promise in various image classification tasks; for example, researchers have applied ConvNeXt Base to classify complex medical images, such as those used for tumour recognition in histopathological analysis and mammography. Its capability to capture complex visual patterns while keeping computational complexity manageable makes it a strong choice for many applications. Structurally, the model consists of a sequence of CNBlocks, each featuring permutations, grouped convolutions, and linear transformations. These blocks, augmented with stochastic depth layers, improve model robustness and generalizability. Additionally, the model uses LN and activation functions to capture complex patterns effectively.

VGG16 method

VGG16 is an influential CNN architecture that has left a lasting mark on image classification. Well known for its effectiveness and simplicity, it remains a baseline in the domain even as newer models have been developed. Its design consists of repeated blocks of convolutional layers with rectified linear unit (ReLU) activation functions, followed by max-pooling operations, which together perform feature extraction and spatial downsampling. In practice, the model is widely applied in numerous CV tasks, ranging from image recognition to object localization. Its straightforward design and robust performance make it a preferred choice for benchmarking and experiments in research settings. This structure promotes hierarchical feature extraction, allowing the network to capture increasingly abstract representations as information passes deeper into the network. Moreover, the max-pooling layers provide spatial downsampling, lowering computational cost while preserving discriminative information.

EfficientNet-V2

EfficientNet-V2 has emerged as a prominent model among efficient and scalable CNN architectures, reflecting advances in model optimization and design. Designed to attain strong performance across varying computational budgets, EfficientNet-V2 has gained widespread recognition for its adaptability to different deployment settings. Its design, characterized by a standard convolutional layer followed by fused-MBConv and MBConv blocks (the mobile inverted bottleneck convolution introduced in MobileNet-V2), achieves efficiency without compromising classification precision. In practice, the model has proven highly efficient across a range of tasks, from image classification to object detection. Structurally, it consists of a cascade of MBConv blocks, which feature depth-wise separable convolutions and channel attention mechanisms. This design fosters effective feature aggregation and extraction, allowing the network to capture complex visual patterns while reducing computational complexity. In addition, the use of stochastic depth layers improves regularisation, contributing to improved robustness, generalizability, and performance on different datasets. Figure 2 represents the structure of the EfficientNet-V2 technique.

Fig. 2 — Architecture of EfficientNet-V2 model.

Classification using DBN model

Next, the FFHGR-SLATOA model uses the DBN technique for classification²⁰. This technique is robust in modelling intricate, high-dimensional feature distributions. By pretraining each layer in an unsupervised manner, it improves weight initialization and generalization and offers advantages over standard deep MLP or softmax head techniques. When applied as a fully connected classification head on the fused CNN features, a DBN can effectively capture complex dependencies. The model is also computationally lighter and easier to train than transformer-based models, making it appropriate for real-time applications. Moreover, DBNs can outperform simpler classifiers in handling hierarchical and nonlinear feature interactions. The DBN is also well aligned with the fused CNN feature structure because it uses both unsupervised pretraining and supervised fine-tuning.

A restricted Boltzmann machine (RBM) is a bipartite graphical model that consists of a visible layer (VL) and a hidden layer (HL), with full connections between the two layers and no intra-layer connections. The DBN is a DL architecture grounded in probabilistic graphical models, comprising several RBM layers arranged in a stack. The DBN’s ability to automatically learn higher-level abstract features from data is an essential benefit when handling high-dimensional, nonlinear datasets. The RBM acts as the basic component of the DBN. The nodes in the VL represent the input data, while the nodes in the HL are used to learn a feature representation of the data. In an RBM, every node is a binary stochastic variable whose state depends on the weights and the states of the connected nodes. The contrastive divergence (CD) algorithm is proficient at learning the probability distribution of the data, supporting both feature learning and data generation.
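
To make the RBM description concrete, the following is a minimal sketch of a binary RBM trained with one-step contrastive divergence (CD-1) on two toy patterns. The layer sizes, learning rate, weight initialisation, and the use of probabilities rather than samples in the negative phase are common textbook choices assumed here; none of them is taken from the paper, and the class is not the authors' implementation.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping an activation to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


class RBM:
    """Binary restricted Boltzmann machine trained with CD-1."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        # weights[i][j] links visible unit i to hidden unit j;
        # there are no visible-visible or hidden-hidden links.
        self.weights = [[self.rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
                        for _ in range(n_visible)]
        self.visible_bias = [0.0] * n_visible
        self.hidden_bias = [0.0] * n_hidden

    def hidden_probabilities(self, visible):
        """P(h_j = 1 | v) for every hidden unit."""
        return [
            sigmoid(bias + sum(v * row[j] for v, row in zip(visible, self.weights)))
            for j, bias in enumerate(self.hidden_bias)
        ]

    def visible_probabilities(self, hidden):
        """P(v_i = 1 | h) for every visible unit."""
        return [
            sigmoid(bias + sum(h * w for h, w in zip(hidden, row)))
            for bias, row in zip(self.visible_bias, self.weights)
        ]

    def cd1_update(self, visible, learning_rate=0.1):
        """Run one CD-1 step on a single binary sample; return its reconstruction."""
        # Positive phase: hidden statistics driven by the data.
        h0_prob = self.hidden_probabilities(visible)
        h0_sample = [1 if self.rng.random() < p else 0 for p in h0_prob]
        # Negative phase: one Gibbs step down to the visible layer and up again.
        v1_prob = self.visible_probabilities(h0_sample)
        h1_prob = self.hidden_probabilities(v1_prob)
        for i, row in enumerate(self.weights):
            for j in range(len(row)):
                row[j] += learning_rate * (visible[i] * h0_prob[j] - v1_prob[i] * h1_prob[j])
        for i in range(len(self.visible_bias)):
            self.visible_bias[i] += learning_rate * (visible[i] - v1_prob[i])
        for j in range(len(self.hidden_bias)):
            self.hidden_bias[j] += learning_rate * (h0_prob[j] - h1_prob[j])
        return v1_prob


patterns = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]
rbm = RBM(n_visible=6, n_hidden=3)
for _ in range(200):
    for pattern in patterns:
        reconstruction = rbm.cd1_update(pattern)
```

In a DBN, the hidden probabilities of one trained RBM would become the visible input of the next RBM in the stack.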

The lower part of the DBN uses a multi-layer RBM architecture. A greedy procedure is applied to train on the sample data layer by layer: the representation learned by CD-based training of the first RBM serves as the input to the next RBM, and this procedure is repeated for subsequent layers. This training procedure is unsupervised. The layer-wise pre-training approach mitigates the vanishing gradient problem encountered in deep network training, improving both generalizability and training efficiency. The model is not restricted to treating the data as a sequence problem; instead, the idea of phase space reconstruction from physics and mathematics is applied to describe and evaluate the intricate behaviour of dynamical systems. The basic concept of phase space reconstruction is to convert time-series data into a collection of points in phase space, facilitating a fuller understanding of the system’s features and evolution. It transforms the original 1D time series into higher-dimensional phase space vectors. These vectors can then act as inputs to the DBN, which processes them further to extract important nonlinear features and perform dimension reduction. In detail, start with a univariate time series $\{x(t)\}$, where $t$ runs from 1 to $N$ (with $N$ representing the dataset length). The phase space reconstruction converts it into vectors in an $m$-dimensional space, as follows:

$$X(t) = \left[x(t),\; x(t+\tau),\; \ldots,\; x\big(t+(m-1)\tau\big)\right] \qquad (1)$$

where $m$ is the embedding dimension that governs the complexity of the space, and $\tau$ is the delay time that determines the time interval between data points. Suitable values for the delay time and embedding dimension should be chosen so that the reconstruction captures the internal structure of the time series in the phase space.
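
The phase space reconstruction described above is commonly implemented as a time-delay embedding, in which each vector collects samples of the series spaced a fixed delay apart. The sketch below assumes that standard form; the function name `delay_embed` and the argument names `dimension` and `delay` are illustrative, and the example values are not taken from the paper.

```python
def delay_embed(series, dimension, delay):
    """Return delay vectors [x(t), x(t + delay), ..., x(t + (dimension - 1) * delay)]."""
    if dimension < 1 or delay < 1:
        raise ValueError("dimension and delay must be positive integers")
    span = (dimension - 1) * delay
    if span >= len(series):
        raise ValueError("series is too short for this dimension and delay")
    # Only start indices whose last sample still lies inside the series are valid.
    return [
        [series[start + k * delay] for k in range(dimension)]
        for start in range(len(series) - span)
    ]


series = [1, 2, 3, 4, 5, 6, 7, 8]
vectors = delay_embed(series, dimension=3, delay=2)
# vectors == [[1, 3, 5], [2, 4, 6], [3, 5, 7], [4, 6, 8]]
```

A series of length 8 with an embedding dimension of 3 and a delay of 2 yields 8 − (3 − 1)·2 = 4 vectors, each of which could serve as one input row for a downstream model.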

TOA-based parameter tuning process

Finally, parameter tuning is carried out through TOA to strengthen the classification performance of the DBN²¹. TOA optimizes the parameters and searches the hyperparameter space efficiently. It improves DBN accuracy and generalization and provides faster convergence than manual tuning or grid search. TOA also helps to avoid local minima, leading to more robust performance, and is considered more computationally efficient than other metaheuristic methods. It helps the model to exploit the discriminative power of the fused CNN features and maintains high optimization quality, resulting in robust, reliable hand gesture classification. TOA is a heuristic optimizer derived from the natural process of tornado formation. It mimics the interactions between windstorms, thunderstorms, and tornadoes, incorporating the physical phenomenon of the Coriolis force to guide the search procedure and ultimately discover the global best solution. The model mimics the following natural phenomena:

Tornadoes: They represent the best solution in the current population and have strong attraction abilities.

Thunderstorms: They represent sub-optimal solutions with local search capabilities.

Windstorms: They represent ordinary solutions that are responsible for exploring the search region.

By mimicking the interactions between these natural processes (such as windstorms developing into thunderstorms and tornadoes) and incorporating the properties of the Coriolis force, the model can effectively balance exploitation and exploration in the search space. At initialization, a specified number of tornadoes, windstorms, and thunderstorms are created. The location of each individual is distributed at random within the search region, and its fitness value is measured. The updates of the storm velocities and locations are implemented based on Eqs. (2)–(4). The parameters for this work are fixed at a population size of 50 and 300 iterations.

The storm location update has two parts: evolution towards the tornado and evolution towards the thunderstorm. In the evolution towards the tornado, Eq. (2) combines the storm’s location in a given dimension, the tornado’s position in that dimension, a random weight, and the storm’s velocity in that dimension.

In the evolution towards the thunderstorm, Eq. (3) combines the thunderstorm’s position and the tornado’s position in a given dimension with a random number drawn from a uniform distribution.

The storm’s velocity update is subject to the Coriolis force. Eq. (4) involves the storm’s velocity in a given dimension, a scaling factor, an inertia weight, a random coefficient, the Coriolis force coefficient, a radius parameter, and the Coriolis force term. Algorithm 1 illustrates the TOA model.
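
The sketch below illustrates only the three-tier population idea from the description (best solution as tornado, next-best solutions as thunderstorms, the rest as windstorms). It is explicitly not the TOA of ref. 21: the update rule is a plain inertia-plus-attraction step that stands in for Eqs. (2)–(4), and no Coriolis-force term is modelled. The function name `storm_search`, the `inertia` and `n_thunderstorms` parameters, and the sphere test objective are hypothetical; only the population size of 50 and the 300 iterations follow the text.

```python
import random


def sphere(position):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(x * x for x in position)


def storm_search(fitness, dim, bounds, population_size=50, iterations=300,
                 n_thunderstorms=5, inertia=0.7, seed=0):
    """Simplified tiered population search (a stand-in, not the published TOA)."""
    rng = random.Random(seed)
    low, high = bounds
    positions = [[rng.uniform(low, high) for _ in range(dim)]
                 for _ in range(population_size)]
    velocities = [[0.0] * dim for _ in range(population_size)]
    scores = [fitness(p) for p in positions]
    history = []
    for _ in range(iterations):
        ranked = sorted(range(population_size), key=lambda i: scores[i])
        tornado = list(positions[ranked[0]])  # best solution so far
        thunderstorms = [list(positions[i]) for i in ranked[1:1 + n_thunderstorms]]
        history.append(scores[ranked[0]])
        # The tornado itself is left untouched, so the best score never worsens.
        for i in ranked[1:]:
            guide = rng.choice(thunderstorms)
            for j in range(dim):
                velocities[i][j] = (inertia * velocities[i][j]
                                    + rng.random() * (tornado[j] - positions[i][j])
                                    + rng.random() * (guide[j] - positions[i][j]))
                moved = positions[i][j] + velocities[i][j]
                positions[i][j] = min(max(moved, low), high)  # stay inside bounds
            scores[i] = fitness(positions[i])
    best = min(range(population_size), key=lambda i: scores[i])
    return positions[best], scores[best], history


best_position, best_score, history = storm_search(sphere, dim=4, bounds=(-5.0, 5.0))
```

In the paper's setting, the objective would be the classification error of the DBN for a candidate hyperparameter vector rather than the toy sphere function used here.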

Table 1 lists the hyperparameters of the TOA method. The model is initialized with a population size of 50, covering all tornadoes, thunderstorms, and windstorms, and is run for 300 iterations to optimize the solution.

Table 1.

Hyperparameter settings of the TOA technique.

Parameter	Description	Value
POPU_SIZE	Number of storms	50
ITER	Maximum number of optimization iterations	300
	Random weight for tornado attraction	0–1
	Arbitrary number for thunderstorm evolution	0–1
	Scaling factor for velocity update	User-defined
	Inertia weight	User-defined
	Random coefficient	User-defined
	Coriolis force coefficient	User-defined
	Radius parameter	User-defined

The TOA uses a fitness function (FF) to achieve enhanced classification performance. The FF assigns a positive value that characterizes the performance of each candidate solution. In this paper, the classification error rate is used as the FF to be minimized, as shown in Eq. (5).
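
One plausible reading of this fitness function is the percentage of misclassified samples, which the optimizer then minimizes. The sketch below assumes that definition; the function name `classification_error_rate` and the toy labels are illustrative and not taken from the paper.

```python
def classification_error_rate(true_labels, predicted_labels):
    """Percentage of misclassified samples (lower is better)."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    wrong = sum(1 for truth, guess in zip(true_labels, predicted_labels) if truth != guess)
    return 100.0 * wrong / len(true_labels)


true_labels = ["Stop", "Stop", "Left Swipe", "Right Swipe"]
predicted = ["Stop", "Left Swipe", "Left Swipe", "Right Swipe"]
error = classification_error_rate(true_labels, predicted)  # one of four wrong: 25.0
```

A candidate hyperparameter setting with a lower error rate on held-out data would be ranked as a better solution by the optimizer.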

Experimental validation

The performance of the FFHGR-SLATOA approach is evaluated on the GR dataset²². The technique is simulated using Python 3.6.5 on a PC with an i5-8600K CPU, a 250 GB SSD, a GeForce 1050 Ti 4 GB GPU, 16 GB RAM, and a 1 TB HDD. Parameters include a learning rate of 0.01, ReLU activation, 50 epochs, a dropout of 0.5, and a batch size of 5. The dataset comprises 20,000 images in five classes: Thumbs UP, Thumbs Down, Left Swipe, Right Swipe, and Stop. Each class has 4000 images.

Figure 3 illustrates the confusion matrices of the FFHGR-SLATOA technique at different epochs on the GR dataset. At epoch 500, misclassifications were more frequent across gesture instances. As training progressed to epochs 1000 and 1500, the number of correctly classified instances increased, with the diagonal entries of the matrices growing larger, indicating improved per-class accuracy. By epochs 2000 to 3000, the matrices exhibit strong diagonal dominance, showing that most gestures were correctly predicted. The progression of the confusion matrices thus shows consistent improvement in distinguishing between similar hand gestures over training iterations.

Fig. 3 — Confusion matrices of the FFHGR-SLATOA technique under diverse epochs under the GR dataset.
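
For readers unfamiliar with how such matrices are assembled, the following minimal sketch builds a confusion matrix for the five gesture classes. The row/column convention (rows are true classes, columns are predicted classes) and the toy predictions are assumptions for illustration, not data from the paper.

```python
CLASSES = ["Thumbs UP", "Thumbs Down", "Left Swipe", "Right Swipe", "Stop"]


def confusion_matrix(true_labels, predicted_labels, classes=CLASSES):
    """Count samples per (true class, predicted class) pair."""
    index = {name: position for position, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


true_labels = ["Stop", "Stop", "Left Swipe", "Right Swipe", "Thumbs UP"]
predicted = ["Stop", "Left Swipe", "Left Swipe", "Right Swipe", "Thumbs UP"]
matrix = confusion_matrix(true_labels, predicted)
# Diagonal entries count correct predictions; off-diagonal entries are errors.
correct = sum(matrix[i][i] for i in range(len(CLASSES)))
```

Here four of the five toy samples fall on the diagonal, and the single error appears in the "Stop" row under the "Left Swipe" column.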

Table 2 reports the GR results of the FFHGR-SLATOA methodology at different epochs on the GR dataset. The results suggest that the FFHGR-SLATOA methodology recognized the instances correctly. At 500 epochs, the FFHGR-SLATOA methodology attains an average accuracy of 97.05%, precision of 92.64%, recall of 92.63%, F1-score of 92.63%, and AUC score of 95.39%. At 1500 epochs, it attains an average accuracy of 98.42%, precision of 96.06%, recall of 96.06%, F1-score of 96.06%, and AUC score of 97.54%. At 2500 epochs, it attains an average accuracy of 98.67%, precision of 96.68%, recall of 96.68%, F1-score of 96.68%, and AUC score of 97.93%. Finally, at 3000 epochs, it attains an average accuracy of 99.14%, precision of 97.85%, recall of 97.85%, F1-score of 97.85%, and AUC score of 98.66%. The consistently high metrics across all classes indicate the robustness and reliability of the FFHGR-SLATOA model for real-time sign language recognition applications.
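
The per-class figures of this kind can be derived from a confusion matrix in a one-vs-rest fashion, as in the minimal sketch below. Treating the fifth reported metric as the mean of recall and specificity (named `auc_score` here) is an assumption, although it is numerically consistent with the per-class rows of Table 2 for a balanced five-class split. The function names and the 3 × 3 example matrix are illustrative.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def one_vs_rest_metrics(confusion, class_index):
    """Metrics for one class; confusion[i][j] counts true class i predicted as j."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[class_index][class_index]
    fn = sum(confusion[class_index]) - tp
    fp = sum(row[class_index] for row in confusion) - tp
    tn = total - tp - fn - fp
    precision = safe_ratio(tp, tp + fp)
    recall = safe_ratio(tp, tp + fn)
    specificity = safe_ratio(tn, tn + fp)
    return {
        "accuracy": safe_ratio(tp + tn, total),
        "precision": precision,
        "recall": recall,
        "f1": safe_ratio(2 * precision * recall, precision + recall),
        # Assumed definition: balanced mean of recall and specificity.
        "auc_score": (recall + specificity) / 2,
    }


confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
metrics = one_vs_rest_metrics(confusion, class_index=0)
# For class 0: tp = 8, fn = 2, fp = 2, tn = 18.
```

For class 0 of this toy matrix the sketch gives an accuracy of 26/30, precision, recall, and F1 of 0.8, and a mean of recall and specificity of 0.85.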

Table 2.

GR outcome of the FFHGR-SLATOA model under distinct epochs on the GR dataset.

Classes	Accuracy (%)	Precision (%)	Recall (%)	F1-score (%)	AUC score (%)
Epoch—500
Thumbs UP	97.26	92.30	94.13	93.20	96.08
Thumbs Down	96.64	92.36	90.67	91.51	94.40
Left Swipe	97.24	94.02	92.05	93.03	95.29
Right Swipe	97.38	93.00	94.00	93.50	96.12
Stop	96.75	91.50	92.30	91.90	95.08
Average	97.05	92.64	92.63	92.63	95.39
Epoch—1000
Thumbs UP	98.27	95.36	96.03	95.69	97.43
Thumbs Down	97.97	94.66	95.20	94.93	96.93
Left Swipe	98.19	96.31	94.57	95.43	96.83
Right Swipe	98.24	95.13	96.13	95.62	97.45
Stop	97.87	94.90	94.40	94.65	96.57
Average	98.11	95.27	95.26	95.26	97.04
Epoch—1500
Thumbs UP	98.56	96.33	96.45	96.39	97.77
Thumbs Down	98.45	95.94	96.30	96.12	97.64
Left Swipe	98.40	96.09	95.90	96.00	97.46
Right Swipe	98.46	96.24	96.05	96.15	97.56
Stop	98.26	95.70	95.60	95.65	97.26
Average	98.42	96.06	96.06	96.06	97.54
Epoch—2000
Thumbs UP	98.68	96.35	97.08	96.71	98.08
Thumbs Down	98.63	95.95	97.25	96.60	98.11
Left Swipe	98.56	96.70	96.05	96.38	97.62
Right Swipe	98.55	96.49	96.25	96.37	97.69
Stop	98.47	96.71	95.57	96.14	97.38
Average	98.58	96.44	96.44	96.44	97.78
Epoch—2500
Thumbs UP	98.74	96.41	97.32	96.86	98.21
Thumbs Down	98.84	96.36	97.90	97.12	98.49
Left Swipe	98.65	97.17	96.03	96.59	97.66
Right Swipe	98.60	96.34	96.70	96.52	97.89
Stop	98.53	97.15	95.45	96.29	97.38
Average	98.67	96.68	96.68	96.68	97.93
Epoch—3000
Thumbs UP	99.21	97.95	98.10	98.03	98.79
Thumbs Down	99.29	97.80	98.70	98.25	99.07
Left Swipe	99.06	98.13	97.15	97.64	98.34
Right Swipe	99.09	97.49	97.95	97.72	98.66
Stop	99.05	97.89	97.35	97.62	98.41
Average	99.14	97.85	97.85	97.85	98.66

Fig. 4 shows the training (TRAN) accuracy and validation (VALD) accuracy of the FFHGR-SLATOA method over the epochs on the GR dataset. The values are calculated over an interval of 0–3000 epochs. The figure shows that the TRAN and VALD accuracy values follow an increasing trend, indicating that the FFHGR-SLATOA technique improves across iterations. Furthermore, the TRAN and VALD accuracy curves remain close to each other throughout the epochs, which signifies limited overfitting and supports reliable prediction on unseen instances.

Fig. 5 shows the TRAN and VALD loss of the FFHGR-SLATOA approach over the epochs on the GR dataset. The loss values are calculated over an interval of 0–3000 epochs. The TRAN and VALD loss values follow a decreasing trend, reflecting the ability of the FFHGR-SLATOA methodology to balance data fitting and generalization. The steady decrease in loss further supports the performance of the FFHGR-SLATOA methodology and the stability of its predictions.

In Fig. 6, the precision-recall (PR) curve of the FFHGR-SLATOA approach on the GR dataset provides insight into its performance by plotting precision against recall for every label. The figure shows that the FFHGR-SLATOA approach consistently achieves high PR values across labels: it keeps a large share of true positives among all positive predictions (precision) while also retrieving a large proportion of the actual positives (recall). The consistently high PR values across all class labels indicate the efficacy of the FFHGR-SLATOA methodology in classification.

In Fig. 7, the ROC curve of the FFHGR-SLATOA model on the GR dataset is examined. The results suggest that the FFHGR-SLATOA approach achieves high ROC values across all classes, indicating substantial proficiency in differentiating classes. This consistent trend across classes indicates the capability of the FFHGR-SLATOA approach in predicting class labels and underlines the strength of the classification procedure.
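
As background for how ROC curves of this kind are usually obtained, the sketch below computes one-vs-rest ROC points and the area under the curve with the trapezoidal rule. The scores and labels are toy values, and the function names `roc_points` and `area_under_curve` are illustrative; the paper does not describe how its curves were generated.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    labels: 1 for the class of interest, 0 for every other class.
    scores: classifier confidence for the class of interest.
    Assumes at least one positive and one negative label.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    index = 0
    while index < len(ranked):
        threshold = ranked[index][0]
        # Consume all samples sharing this score so ties yield a single point.
        while index < len(ranked) and ranked[index][0] == threshold:
            if ranked[index][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            index += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = area_under_curve(roc_points(labels, scores))  # 8/9 for this toy example
```

In the toy example, eight of the nine positive-negative pairs are ranked correctly, which matches the computed area of 8/9.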

Table 3 and Fig. 8 compare the results of the FFHGR-SLATOA technique with existing methods on the GR dataset^1,23. The RGB + Flow, DenseImage Net, 3DCNN + MLP, 3D CNN, Two-stream CNN-LSTM, Inception LSTM, and Xception-LSTM methods reported lower results. In contrast, the proposed FFHGR-SLATOA technique achieved the highest accuracy, precision, recall, and F1-score of 99.14%, 97.85%, 97.85%, and 97.85%, respectively.

Table 3.

Comparative analysis of FFHGR-SLATOA model with existing methods under the GR dataset.

Methods	Accuracy (%)	Precision (%)	Recall (%)	F1-score (%)
RGB + Flow	98.25	93.37	83.45	92.33
DenseImage Net	78.12	94.75	86.87	83.62
3DCNN + MLP	98.12	94.39	90.60	79.90
3D CNN	90.00	88.86	91.16	83.37
Two-stream CNN-LSTM	91.25	93.50	86.25	79.77
Inception LSTM	96.73	91.58	83.61	82.97
Xception-LSTM	70.53	81.66	86.25	96.75
FFHGR-SLATOA	99.14	97.85	97.85	97.85

Fig. 8 — Comparative analysis of FFHGR-SLATOA model with existing methods under the GR dataset.
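
To make the size of the reported gap concrete, the sketch below computes, for each metric in Table 3, the strongest existing method and the margin by which FFHGR-SLATOA exceeds it. The values are copied from Table 3; the column order (accuracy, precision, recall, F1-score) is assumed, and the function name `margins_over_best` is hypothetical.

```python
# Rows copied from Table 3, all values in percent.
# Assumed column order: accuracy, precision, recall, F1-score.
METRICS = ("accuracy", "precision", "recall", "f1")
BASELINES = {
    "RGB + Flow": (98.25, 93.37, 83.45, 92.33),
    "DenseImage Net": (78.12, 94.75, 86.87, 83.62),
    "3DCNN + MLP": (98.12, 94.39, 90.60, 79.90),
    "3D CNN": (90.00, 88.86, 91.16, 83.37),
    "Two-stream CNN-LSTM": (91.25, 93.50, 86.25, 79.77),
    "Inception LSTM": (96.73, 91.58, 83.61, 82.97),
    "Xception-LSTM": (70.53, 81.66, 86.25, 96.75),
}
PROPOSED = (99.14, 97.85, 97.85, 97.85)


def margins_over_best(baselines, proposed):
    """For each metric, find the strongest baseline and the margin over it."""
    result = {}
    for idx, metric in enumerate(METRICS):
        best_name = max(baselines, key=lambda name: baselines[name][idx])
        best_value = baselines[best_name][idx]
        result[metric] = (best_name, round(proposed[idx] - best_value, 2))
    return result


for metric, (name, margin) in margins_over_best(BASELINES, PROPOSED).items():
    print(f"{metric}: best baseline = {name}, margin = {margin:+.2f} points")
```

On these numbers the strongest baseline differs by metric (RGB + Flow for accuracy, Xception-LSTM for F1-score), and the accuracy margin over the best baseline is 0.89 percentage points.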

Table 4 and Fig. 9 compare the FFHGR-SLATOA approach with existing techniques on the Sign Language MNIST dataset^24,25. The CNN attained an accuracy of 89.00%, precision of 88.70%, recall of 88.00%, and F1-score of 88.50%, while RNN and LSTM reached accuracies of 90.00% and 85.00%, respectively. The Conv LSTM and GRU-LSTM techniques achieved accuracies of 87.00% and 78.89%. The FFHGR-SLATOA model obtained the best values, with an accuracy of 97.56%, precision of 97.79%, recall of 97.50%, and F1-score of 97.74%.

Table 4.

Comparison evaluation of the FFHGR-SLATOA approach with existing methods under the Sign Language MNIST dataset.

Approach	Accuracy (%)	Precision (%)	Recall (%)	F1-score (%)
CNN	89.00	88.70	88.00	88.50
RNN	90.00	88.20	87.90	89.00
LSTM	85.00	82.00	88.00	85.00
Conv LSTM	87.00	85.00	88.00	86.00
GRU-LSTM	78.89	85.95	84.40	82.55
FFHGR-SLATOA	97.56	97.79	97.50	97.74

Fig. 9 — Comparison evaluation of the FFHGR-SLATOA approach with existing methods under the Sign Language MNIST dataset.

Table 5 and Fig. 10 compare the FFHGR-SLATOA technique with existing methods on the American Sign Language (ASL) dataset^26,27. The LSTM-CNN model reached an accuracy of 91.00%, precision of 90.00%, recall of 89.00%, and F1-score of 89.50%, while the ML-CNN, DPCNN, VGG16, and LSTM-GRU techniques showed accuracy values ranging from 86.00% to 90.00% and lower precision, recall, and F1-score values. The FFHGR-SLATOA model achieved the highest accuracy of 97.77%, precision of 97.68%, recall of 97.80%, and F1-score of 97.74%, highlighting its efficiency in capturing both spatial and temporal features.

Table 5.

Comparison analysis of the FFHGR-SLATOA approach with existing models under the ASL dataset.

Approach	Accuracy (%)	Precision (%)	Recall (%)	F1-score (%)
LSTM-CNN	91.00	90.00	89.00	89.50
ML-CNN	88.00	87.00	86.00	87.50
DPCNN	90.00	89.00	87.00	88.00
VGG16	86.00	84.00	85.00	83.00
LSTM-GRU	88.11	77.91	80.29	84.45
FFHGR-SLATOA	97.77	97.68	97.80	97.74

Fig. 10 — Comparison analysis of the FFHGR-SLATOA approach with existing models under the ASL dataset.

Table 6 reports the ablation study of the FFHGR-SLATOA methodology. DBN with ConvNeXt Base and no parameter tuning achieved an accuracy of 94.69%, precision of 93.49%, recall of 93.34%, and F1-score of 93.26%; adding TOA raised these to 95.32%, 94.22%, 94.22%, and 94.10%. DBN with VGG16 without tuning reached an accuracy of 96.01%, precision of 94.82%, recall of 94.95%, and F1-score of 94.81%, increasing to 96.73%, 95.59%, 95.59%, and 95.71% with TOA. DBN with EfficientNet-V2 without tuning achieved an accuracy of 97.58%, precision of 96.37%, recall of 96.37%, and F1-score of 96.43%, which improved to 98.35%, 97.14%, 97.01%, and 97.25% with TOA tuning. The full FFHGR-SLATOA technique outperformed all of these combinations with an accuracy of 99.14%, precision of 97.85%, recall of 97.85%, and F1-score of 97.85%, showing the benefit of combining feature fusion with TOA tuning.

Table 6.

Ablation study outcomes of the FFHGR-SLATOA methodology.

Methodology	Accuracy (%)	Precision (%)	Recall (%)	F1-score (%)
DBN + ConvNeXt Base (Without parameter tuning)	94.69	93.49	93.34	93.26
DBN + ConvNeXt Base + TOA (With parameter tuning)	95.32	94.22	94.22	94.10
DBN + VGG16 (Without parameter tuning)	96.01	94.82	94.95	94.81
DBN + VGG16 + TOA (With parameter tuning)	96.73	95.59	95.59	95.71
DBN + EfficientNet-V2 (Without parameter tuning)	97.58	96.37	96.37	96.43
DBN + EfficientNet-V2 + TOA (With parameter tuning)	98.35	97.14	97.01	97.25
FFHGR-SLATOA (DBN with fusion-based feature extraction process with TOA parameter tuning)	99.14	97.85	97.85	97.85
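
The ablation figures can be read as two separate effects: the gain from TOA tuning on each backbone, and the gain from fusing the three backbones. The short sketch below works through that arithmetic on the accuracy column of Table 6. The variable names are hypothetical, and only values reported in the table are used.

```python
# Accuracy (%) from Table 6: (without TOA tuning, with TOA tuning).
ABLATION_ACCURACY = {
    "ConvNeXt Base": (94.69, 95.32),
    "VGG16": (96.01, 96.73),
    "EfficientNet-V2": (97.58, 98.35),
}
FUSION_ACCURACY = 99.14  # full FFHGR-SLATOA row

# Gain attributable to TOA tuning for each single-backbone variant.
toa_gain = {
    name: round(tuned - untuned, 2)
    for name, (untuned, tuned) in ABLATION_ACCURACY.items()
}

# Gain of the fused model over the best tuned single backbone.
best_single = max(tuned for _, tuned in ABLATION_ACCURACY.values())
fusion_gain = round(FUSION_ACCURACY - best_single, 2)

for name, gain in toa_gain.items():
    print(f"TOA gain with {name}: {gain:+.2f} points")
print(f"Fusion gain over best single backbone: {fusion_gain:+.2f} points")
```

On the reported values, TOA tuning adds between 0.63 and 0.77 accuracy points per backbone, and fusion adds a further 0.79 points over the best tuned single backbone (EfficientNet-V2).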

Table 7 presents the computational efficiency analysis of the FFHGR-SLATOA model^28. The FFHGR-SLATOA model is lightweight and fast, requiring only 21.08 G FLOPS, 589 MB of GPU memory, and an inference time of 1.67 s. Compared with the other listed methods, such as YOLOv3-tiny-T and YOLOv7, the FFHGR-SLATOA method has a substantially lower computational cost and a shorter inference time while maintaining competitive performance, making it appropriate for real-time deployment.

Table 7.

Computational efficiency comparison including FLOPS, GPU memory usage, and inference time.

Methods	FLOPS (G)	GPU memory (MB)	Inference time (s)
YOLOv3-tiny-T	144.90	4917	5.68
ShuffleNetv2-YOLOv3	72.60	3341	8.71
YOLOv51	93.80	3966	8.41
YOLOv7	93.00	3445	3.18
YOLOv51 + E-ELAN	108.00	3832	4.97
YOLOv51 + ShuffleNetv2	57.00	3466	5.42
FFHGR-SLATOA	21.08	589	1.67
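
The relative cost of each compared method can be expressed as a ratio against FFHGR-SLATOA. The sketch below computes those ratios from the values in Table 7; method names are copied as printed, and `cost_ratios` is a hypothetical helper name.

```python
# Rows copied from Table 7: (FLOPS in G, GPU memory in MB, inference time in s).
BASELINES = {
    "YOLOv3-tiny-T": (144.90, 4917, 5.68),
    "ShuffleNetv2-YOLOv3": (72.60, 3341, 8.71),
    "YOLOv51": (93.80, 3966, 8.41),
    "YOLOv7": (93.00, 3445, 3.18),
    "YOLOv51 + E-ELAN": (108.00, 3832, 4.97),
    "YOLOv51 + ShuffleNetv2": (57.00, 3466, 5.42),
}
PROPOSED = (21.08, 589, 1.67)


def cost_ratios(baseline, proposed):
    """Return how many times larger each baseline cost is than the proposed one."""
    return tuple(b / p for b, p in zip(baseline, proposed))


for name, costs in BASELINES.items():
    flops_x, mem_x, time_x = cost_ratios(costs, PROPOSED)
    print(f"{name}: {flops_x:.2f}x FLOPS, {mem_x:.2f}x memory, {time_x:.2f}x time")
```

On these figures, even the closest listed method in each column needs about 2.7 times the FLOPS, 5.7 times the GPU memory, and 1.9 times the inference time of FFHGR-SLATOA.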

Conclusion

This paper presents the FFHGR-SLATOA model to aid hearing- and speech-impaired people. The aim is to develop an innovative DL-based HGR model that enhances communication accessibility for hearing- and speech-impaired individuals. First, the image pre-processing stage employs MF to improve image quality by removing noise. Next, a fusion of the ConvNeXt Base, VGG16, and EfficientNet-V2 techniques is used for feature extraction. The FFHGR-SLATOA model then uses the DBN technique for classification. Finally, parameter tuning is performed with the TOA model to increase the classification performance of the DBN model. The experimental analysis of the FFHGR-SLATOA approach was performed on the GR dataset, where the comparison study showed a superior accuracy of 99.14% over existing models. The study has several limitations. Testing in real-world sign language scenarios was insufficient, which may restrict practical applicability. Usability and accessibility also remain unassessed, as no small-scale user-centric or deployment-oriented testing, such as trials involving real signers, has been conducted. The robustness of the FFHGR-SLATOA model under diverse lighting conditions, backgrounds, and camera types remains unclear because the analysis was performed on a controlled dataset. Furthermore, the performance of the model has not been properly explored for dynamic gestures or continuous sign sequences. Future work may concentrate on large-scale, real-world testing, inclusion of diverse user populations, and evaluation across varying environmental conditions. Improvements could also explore real-time deployment and adaptive learning for personalized gesture recognition.
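
The pre-processing and fusion stages summarised above can be illustrated with a minimal sketch. It assumes a 3×3 median window, border handling by truncating the window at the image edge, and fusion by plain concatenation of the per-backbone feature vectors; none of these specifics is confirmed by the text above, and `median_filter` and `fuse_features` are hypothetical helper names.

```python
from statistics import median


def median_filter(image, size=3):
    """Apply a size x size median filter to a 2-D list of pixel values.

    Border pixels use only the neighbours that fall inside the image
    (an assumed border policy).
    """
    rows, cols = len(image), len(image[0])
    r = size // 2
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            window = [
                image[y][x]
                for y in range(max(0, i - r), min(rows, i + r + 1))
                for x in range(max(0, j - r), min(cols, j + r + 1))
            ]
            out[i][j] = median(window)
    return out


def fuse_features(*feature_vectors):
    """Fuse per-backbone feature vectors by concatenation (an assumption)."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


# Illustrative image with a single salt-noise pixel in the centre.
noisy = [
    [10, 10, 10],
    [10, 255, 10],
    [10, 10, 10],
]
clean = median_filter(noisy)
print(clean[1][1])

# Placeholder feature vectors standing in for the three backbone outputs.
fused = fuse_features([0.1, 0.2], [0.3], [0.4, 0.5])
print(len(fused))
```

The median filter replaces the outlier with the typical value of its neighbourhood, which is why it suits impulse noise, and the fused vector would then serve as the input to the classifier.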

Acknowledgements

The authors extend their appreciation to the King Salman Center for Disability Research for funding this work through Research Group no. KSRG-2024-343.

Author contributions

Najm Alotaibi: conceptualization, methodology, validation, investigation, writing – original draft preparation. Reham Al-Dayil: conceptualization, methodology, writing – original draft preparation, writing – review and editing. Nojood O Aljehane: methodology, validation, writing – original draft preparation. Mohammed Rizwanullah: software, visualization, validation, data curation, writing – review and editing.

Data availability

The data that support the findings of this study are openly available at https://www.kaggle.com/datasets/imsparsh/gesture-recognition, https://www.kaggle.com/datasets/muhammadkhalid/sign-language-for-numbers, and https://www.kaggle.com/datasets/ayuraj/asl-dataset (references^22,24,26).

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Al-Hammadi, M. et al. Deep learning-based approach for sign language gesture recognition with efficient hand gesture representation. IEEE Access 8, 192527–192542 (2020).
2. Skaria, S., Al-Hourani, A. & Evans, R. J. Deep-learning methods for hand-gesture recognition using ultra-wideband radar. IEEE Access 8, 203580–203590 (2020).
3. Mujahid, A. et al. Real-time hand gesture recognition based on deep learning YOLOv3 model. Appl. Sci. 11(9), 4164 (2021).
4. Sugimoto, M., Zin, T. T., Toriu, T. & Nakajima, S. Robust rule-based method for human activity recognition. IJCSNS Int. J. Comput. Sci. Netw. Secur. 11, 37–43 (2011).
5. Côté-Allard, U. et al. Deep learning for electromyographic hand gesture signal classification using transfer learning. IEEE Trans. Neural Syst. Rehabil. Eng. 27(4), 760–771 (2019).
6. Almjally, A., Algamdi, S. A., Aljohani, N. & Nour, M. K. Harnessing attention-driven hybrid deep learning with combined feature representation for precise sign language recognition to aid deaf and speech-impaired people. Sci. Rep. 15(1), 32255 (2025).
7. Moin, A. et al. A wearable biosensing system with in-sensor adaptive machine learning for hand gesture recognition. Nat. Electron. 4(1), 54–63 (2021).
8. Daniel, E., Kathiresan, V. & Sindhu, P. Real time sign recognition using YOLOv8 object detection algorithm for Malayalam sign language. Fusion Pract. Appl. 1, 135–235 (2025).
9. Rastgoo, R., Kiani, K. & Escalera, S. Hand sign language recognition using multi-view hand skeleton. Expert Syst. Appl. 150, 113336 (2020).
10. Basheri, M. Automated gesture recognition using zebra optimization algorithm with deep learning model for visually challenged people. Fusion Pract. Appl. 16(1) (2024).
11. Alabduallah, B., Al Dayil, R., Alkharashi, A. & Alneil, A. A. Innovative hand pose based sign language recognition using hybrid metaheuristic optimization algorithms with deep learning model for hearing impaired persons. Sci. Rep. 15(1), 9320 (2025).
12. Alaimahal, A., Vasuki, S., Harini, T. P., Niranjana, B. & Lavaniya, M. Sign language recognition with image processing using deep learning LSTM Model. In 2025 IEEE International Students’ Conference on Electrical, Electronics and Computer Science (SCEECS) 1–6 (IEEE, 2025).
13. Singhal, P., Verma, S., Gupta, R., Kumar, R. & Arya, R. K. Vision-based hand gesture recognition system for assistive communication using neural networks and GSM integration. In 2025 2nd International Conference on Computational Intelligence, Communication Technology and Networking (CICTN) 891–895 (IEEE, 2025).
14. Shegokar, A., Kale, T., Patil, L. & Gupta, P. Sign language detection system using CNN and HOG: Bridging the communication gap for deaf and hearing communities. In 2025 IEEE International Conference on Interdisciplinary Approaches in Technology and Management for Social Innovation (IATMSI) vol. 3 1–6 (IEEE, 2025).
15. Vaidhya, G. K. & Anand, G. P. Dynamic Doubled-handed sign language Recognition for hearing- and speech-impaired people using Vision Transformers (2024).
16. Tan, C. K., Lim, K. M., Lee, C. P., Chang, R. K. Y. & Alqahtani, A. SDViT: Stacking of distilled vision transformers for hand gesture recognition. Appl. Sci. 13(22), 12204 (2023).
17. Alyami, S. & Luqman, H. Swin-MSTP: Swin transformer with multi-scale temporal perception for continuous sign language recognition. Neurocomputing 617, 129015 (2025).
18. Herbaz, N., El Idrissi, H. & Badri, A. Advanced sign language recognition using deep learning: A study on Arabic sign language (ArSL) with VGGNet and ResNet50 models (2025).
19. Aksoy, S. Multi-input melanoma classification using MobileNet-V3-large architecture. J. Autom. Mob. Robot. Intell. Syst. 73–84 (2025).
20. Liu, Y., Zhao, Z., Zhang, Z. & Yang, Y. A novel sea surface temperature prediction model using DBN-SVR and spatiotemporal secondary calibration. Remote Sens. 17(10), 1681 (2025).
21. Zhao, X. et al. Optimization design of lazy-wave dynamic cable configuration based on machine learning. J. Mar. Sci. Eng. 13(5), 873 (2025).
22. https://www.kaggle.com/datasets/imsparsh/gesture-recognition.
23. Hax, D. R. T., Penava, P., Krodel, S., Razova, L. & Buettner, R. A novel hybrid deep learning architecture for dynamic hand gesture recognition. IEEE Access 12, 28761–28774 (2024).
24. https://www.kaggle.com/datasets/muhammadkhalid/sign-language-for-numbers.
25. Baihan, A., Alutaibi, A. I., Alshehri, M. & Sharma, S. K. Sign language recognition using modified deep learning network and hybrid optimization: A hybrid optimizer (HO) based optimized CNNSa-LSTM approach. Sci. Rep. 14(1), 26111 (2024).
26. https://www.kaggle.com/datasets/ayuraj/asl-dataset.
27. Kothadiya, D. et al. Deepsign: Sign language detection and recognition using deep learning. Electronics 11(11), 1780 (2022).
28. Chen, R. & Tian, X. Gesture detection and recognition based on object detection in complex background. Appl. Sci. 13(7), 4480 (2023).

