Scientific Reports
. 2025 Nov 13;15:39827. doi: 10.1038/s41598-025-23494-x

Self-supervised learning with a contrastive VideoMoCo framework for Saudi Arabic sign language recognition using 3D convolutional networks

Mahmoud Rokaya 1, Dalia I Hemdan 2, Mohammed A Alzain 1, Ibrahim Gad 4, El-Sayed Atlam 3,4
PMCID: PMC12615596  PMID: 41233499

Abstract

Saudi Arabic Sign Language (SArSL) recognition poses significant challenges due to its complex spatio-temporal structure and the scarcity of annotated datasets. This paper introduces a self-supervised learning framework built upon the Video Momentum Contrast (VideoMoCo) paradigm integrated with a 3D ResNet-50 backbone, designed to jointly capture spatial and temporal gesture dependencies. The proposed model is pretrained on 18,000 unlabeled gesture videos and subsequently fine-tuned on the KARSL-502 dataset containing 15,400 labeled samples covering 502 distinct classes. Experimental evaluation shows that the model attains an F1-score of 92.7%, outperforming CNN-LSTM (86.0%) and Two-Stream CNN (84.5%) baselines, an improvement of nearly nine percentage points. Beyond accuracy, the framework demonstrates strong robustness to class imbalance, motion variation, and visual noise, while maintaining efficient deployment performance with an inference latency of 12 ms per batch. The ablation study verifies the contribution of the momentum encoder and large negative sample queue in achieving stable and discriminative feature learning. Overall, the VideoMoCo–ResNet-50 framework establishes a scalable and inclusive foundation for real-time SArSL recognition, advancing accessibility for the Saudi Deaf community and supporting future multimodal extensions.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-025-23494-x.

Keywords: Arabic sign language recognition, VideoMoCo, Self-supervised learning, 3D convolutional neural networks, Contrastive learning, Saudi Arabic Sign Language (SArSL)

Subject terms: Health care, Engineering, Mathematics and computing

Introduction

Sign language recognition (SLR) has become one of the most significant areas of accessibility research, aiming to facilitate communication for deaf and hard-of-hearing communities while enabling broader human–computer interaction and social inclusion1–4. Foundational studies on Arabic Sign Language (ArSL) have demonstrated steady progress, from early systematic reviews1 and CNN-based gesture recognition frameworks2 to more recent transformer-based and transfer-learning approaches3,4. These works collectively highlight the promise of deep learning in recognizing complex hand gestures and facial cues but also emphasize persistent challenges related to data scarcity and linguistic variation. Early benchmarks such as the KArSL dataset have contributed valuable annotated video collections and evaluation protocols for Arabic sign languages5,6. Still, limited training data and variability among signers continue to constrain generalization. Inspired by advances in large-scale video understanding7,8 and linguistic modeling9, SLR research increasingly integrates both visual and contextual modalities to enhance robustness. Beyond conventional classifiers10 and probabilistic models such as HMMs11, explainable and application-driven architectures have recently been explored for medical and behavioral domains12–14, reinforcing the link between model interpretability and accessibility.

In parallel, foundational computer vision models such as CNNs and residual networks15,16 have transformed spatial feature extraction, enabling scalable visual encoders that capture motion and structure. Ensemble and hybrid learning methods17,18 have further improved recognition reliability, while social and multimodal data analyses19 expanded the potential for context-aware gesture interpretation. The theoretical foundations of deep learning20 and its applications to multimodal recognition21,22 laid the groundwork for modern sign-language systems capable of processing complex spatiotemporal data. Contemporary developments in self-supervised and contrastive learning have reshaped representation learning paradigms. Momentum Contrast (MoCo) and its extensions23,24 introduced queue-based negative sampling and momentum-updated encoders for stable contrastive objectives, while early probabilistic gesture models25 and large-scale instructional video frameworks26 demonstrated effective pretraining for downstream tasks. Complementary formulations such as alignment–uniformity theory27 and foundational machine learning compendia28 have provided theoretical justification for contrastive objectives, while cross-domain analyses—from social computing to cognitive modeling—highlighted ethical and societal implications of automated interpretation29,30.

Building on these theoretical and empirical advances, contrastive learning frameworks such as SimCLR31 have been adapted to multimodal gesture datasets, including applications in clustering, optimization, and lightweight convolutional modeling32–34. Collectively, these contributions converge toward a unified direction: enabling scalable, data-efficient, and inclusive sign-language understanding systems that generalize across signers, dialects, and conditions. Accordingly, this study develops a scalable self-supervised recognition framework for Saudi Arabic Sign Language (SArSL), leveraging the complementary strengths of the VideoMoCo framework and a 3D ResNet-50 backbone to jointly learn spatial and temporal gesture representations. The proposed model reduces dependence on large annotated corpora, enhances robustness under noisy and imbalanced conditions, and contributes to inclusive AI design for underrepresented sign languages.

The main contributions of this study are summarized as follows:

  • Novel Self-Supervised Framework: Introduction of a VideoMoCo-based architecture that effectively combines contrastive learning with 3D convolutional modeling to capture spatiotemporal dependencies essential for gesture classification.

  • Low-Resource Adaptability: Pretraining on 18,000 unlabeled videos and fine-tuning on 15,400 labeled samples from the KARSL-502 dataset significantly reduces reliance on manual annotation.

  • Robust and Scalable Performance: Comprehensive evaluation under noise, motion distortion, and class imbalance demonstrates strong generalization, with an F1-score of 92.7%, confirming the model’s scalability and practical applicability.

  • Inclusive and Ethical Design: The framework contributes to digital linguistic justice by advancing equitable AI accessibility for Saudi and other Arabic sign language users.

Structure of the Paper: "Literature review" section reviews related literature. "Proposed methodology" section describes the proposed methodology in detail. "Experiments and results" section presents the experimental setup and results. "Discussion" section discusses the findings in depth, and "Conclusion" section concludes the paper by outlining limitations and future research directions.

Literature review

The field of sign language recognition (SLR) is inherently interdisciplinary, bridging computer vision, linguistics, and accessibility research1,4,20. A comprehensive review of prior work is necessary to situate our framework within this evolving landscape. In this section, we first revisit the evolution of SLR approaches, from handcrafted features to deep spatiotemporal models, with emphasis on Arabic and Saudi Arabic contexts1–3,6,9. We then examine the emergence of self-supervised and contrastive learning methods, highlighting their potential to mitigate data scarcity and overfitting in gesture-based tasks7,15,24,27,31. Finally, we draw parallels with speech and dialectal recognition in low-resource languages, where related challenges of morphological richness, dialect variability, and limited annotated corpora have inspired innovative learning strategies6,19,21,22. This organization provides a structured context for understanding the motivations and contributions of our proposed framework.

Sign language recognition (SLR): from hand-crafted pipelines to deep spatiotemporal models

Over the past two decades, sign language recognition has evolved from classical, hand-crafted pipelines to deep learning systems that operate directly on video features. Early combinations of shape, texture, and motion descriptors with optical flow could detect isolated gestures, but they could not handle temporal coarticulation or the continuous movement of signs. Classical machine-learning backends such as SVMs and HMMs improved decision boundaries and sequence modeling, yet they scaled poorly as vocabulary size and visual complexity grew, especially under noise or signer variability20,25. Early deep learning replaced these pipelines with CNN and ResNet architectures: strong spatial encoders that robustly identify hands, head, and upper-body pose across successive frames, yielding consistent gains across SLR surveys and benchmarks5,13,14. In the Arabic context, ArSL and SArSL research followed the same trajectory: early linguistically informed work (e.g., co-word analyses) laid foundations for dialect sensitivity, while later CNN-LSTM hybrids, transfer learning, and ensembles mitigated limited annotated data and cultural differences between dialects. Dataset curation (e.g., KArSL) has also been critical for establishing decision rules and regulated evaluation protocols in Arabic sign settings6,21,22. However, difficulties remain: imbalanced classes, visual similarity between classes, signer dependency, and limited resources. This motivates research on methods that produce discriminative spatiotemporal representations while requiring only minimal labels1,4,20–22.

Self-supervised and contrastive learning for video and gestures

Self-supervised learning (SSL) has emerged as a promising approach to address labeling scarcity in video understanding. Contrastive frameworks train representations by pulling together augmented views of the same clip and pushing apart views of different clips, improving separability in the embedding space without any manual labels. Momentum Contrast (MoCo) introduced a memory queue and a momentum-updated key encoder for stable targets; its video variants (VideoMoCo) extend these ideas to spatiotemporal data, which is naturally suited to sign gestures15,24. Complementary formulations such as SimCLR stress strong augmentations and large batches of negatives, whereas theoretical analyses (alignment and uniformity) explain why contrastive objectives produce discriminative yet well-spread embeddings27,31. For large-scale video classification, early CNN work established the value of temporal coverage; later SSL experiments have shown that pretraining on unlabeled clips enables effective transfer to downstream action or gesture recognition tasks, particularly under domain shifts8,15. Beyond pure contrastive setups, related SSL for video (e.g., hashing and global-local spatiotemporal objectives) indicates that multi-scale spatiotemporal features yield compact, retrieval-efficient representations, which have practical value for lightweight sign-language recognition pipelines and real-time deployment7. For ArSL/SArSL, SSL is especially appealing: it diminishes dependence on massive labeled corpora, and it may mitigate overfitting caused by class imbalance by reducing error propagation toward majority classes, thereby enhancing robustness to capture artifacts and signer variability. It also aligns naturally with common SLR systems employing CNN and 3D-CNN backbones4,8,15,24,34.

Low-resource sequence learning and dialect sensitivity: lessons for SArSL

Recognition of sign languages in low-resource contexts shares challenges with broader sequence learning tasks such as speech and gesture dynamics. Early work on ASL recognition with hidden Markov models demonstrated the importance of explicitly modeling temporal transitions and coarticulation25. More recently, hierarchical residual encoders and 3D CNNs have become dominant architectures, capturing both spatial structure and temporal dependencies critical for action and sign recognition5,22. These advances align with progress in Arabic and Saudi Arabic Sign Language, where curated datasets like KArSL provide structured evaluation benchmarks6,21.

Self-supervised contrastive learning frameworks such as MoCo24 and large-scale spatiotemporal SSL studies15 show that pretraining on unlabeled video can produce robust invariances, which are later refined with modest labeled fine-tuning. This approach is particularly effective for culturally specific sign datasets, where annotations are limited but variability across signers and dialects is high. Theoretical work on representation learning further emphasizes alignment and uniformity objectives for stable and generalizable embeddings27, while deep learning foundations highlight the trade-off between bias and variance in low-resource settings20.

Collectively, these studies illustrate that while significant progress has been made from handcrafted features to spatiotemporal deep models, challenges in data scarcity, signer variability, and dialectal sensitivity persist4,5,13,15,24,31. Our framework addresses these by leveraging contrastive learning to improve generalization without extensive labeled data, and by building a scalable recognition foundation for underrepresented sign languages, especially SArSL1,2,6,9,21,22.

Proposed methodology

Overview: contrastive learning with ResNet-50 and VideoMoCo

In this section, we introduce our self-supervised framework for Saudi Arabic Sign Language (SArSL) recognition. The methodology integrates a 3D ResNet-50 backbone with the VideoMoCo contrastive learning paradigm to jointly capture spatial and temporal gesture dynamics. For clarity, Table 1 summarizes the mathematical symbols and abbreviations used in this section, including the query vector (q), positive key (k⁺), negative key (k⁻), and momentum update mechanism.

Table 1.

Notation table for mathematical symbols and parameters used in the methodology.

Symbol Definition/description Unit or type
q Query vector produced by the query encoder — (feature embedding)
k⁺ Positive key vector from the same video instance (positive pair) — (feature embedding)
k⁻ Negative key vector from different video instances (negative samples) — (feature embedding)
fq Query encoder network (3D ResNet-50 backbone) Function
fk Momentum-updated key encoder network Function
τ Temperature parameter used to scale similarity logits Dimensionless
L Contrastive (InfoNCE) loss over positive and negative logits Scalar

All quantities follow standard mathematical notation; dimensionless quantities are indicated accordingly.

The framework is presented in stages: the overall architecture ("Overview: contrastive learning with ResNet-50 and VideoMoCo" section), the framework components and learning flow ("Framework components and learning flow" section), and the backbone and contrastive training algorithms ("ResNet-50 algorithm" and "VideoMoCo algorithm" sections), providing a clear and progressive exposition of the model components.

The framework operates in iterative stages, beginning with parameter initialization, encoder setup, and video batch augmentation. The encoders then process augmented minibatches to compute embeddings and logits, after which the contrastive loss encourages similarity between positive pairs (q, k⁺) while pushing the query representation away from negative samples stored in a momentum-updated queue. The query encoder is updated through gradient descent, whereas the key encoder is updated using a momentum mechanism to ensure smooth and consistent feature evolution. At the end of each iteration, the queue is refreshed with newly generated keys, maintaining representation diversity and continual learning stability. Through this iterative contrastive process, the model captures rich spatiotemporal patterns without the need for labeled data, enabling the effective discrimination of visually similar Saudi Arabic Sign Language (SArSL) gestures. Figure 1 illustrates the complete VideoMoCo-based contrastive learning workflow tailored for SArSL gesture recognition.
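To make these stages concrete, the following minimal PyTorch sketch walks through one such iteration; it is an illustrative reconstruction, not the authors' released implementation. The names encoder_q, encoder_k, augment, optimizer, and queue (a [128, K] tensor of L2-normalized negative keys) are assumptions supplied by the caller.

```python
import torch
import torch.nn.functional as F

def train_step(clips, encoder_q, encoder_k, queue, optimizer, augment,
               tau=0.07, m=0.999):
    """One contrastive iteration: augment, encode, InfoNCE loss, EMA update, queue refresh."""
    x_q, x_k = augment(clips), augment(clips)              # two augmented views
    q = F.normalize(encoder_q(x_q), dim=1)                 # queries: [B, 128]
    with torch.no_grad():
        # momentum (EMA) update of the key encoder: theta_k <- m*theta_k + (1-m)*theta_q
        for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
            p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
        k = F.normalize(encoder_k(x_k), dim=1)             # keys: [B, 128], gradients detached

    l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(1)    # positive logits: [B, 1]
    l_neg = torch.einsum("nc,ck->nk", q, queue)            # negative logits: [B, K]
    logits = torch.cat([l_pos, l_neg], dim=1) / tau        # temperature scaling
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    loss = F.cross_entropy(logits, labels)                 # positives sit at index 0

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # refresh the queue: enqueue the new keys, dequeue the oldest entries (FIFO)
    queue = torch.cat([k.T, queue], dim=1)[:, : queue.size(1)]
    return loss.item(), queue
```

Because the positive logit is concatenated at index 0, the all-zeros label vector makes the cross-entropy reduce to the InfoNCE objective described in the "VideoMoCo algorithm" section.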

Fig. 1. Overall architecture of the proposed VideoMoCo-based SArSL recognition framework.

Framework components and learning flow

This paper presents a framework that effectively learns a wide range of Saudi Arabic Sign Language (SArSL) gestures from short video clips by integrating the VideoMoCo contrastive learning paradigm with a 3D ResNet-50 backbone. Each video clip, containing up to 29 frames, is preprocessed and passed through spatial and temporal augmentations, resulting in two augmented views: a query sample x_q and a key sample x_k. These are independently processed through two encoders of identical architecture: the query encoder and the momentum-updated key encoder.

Each encoder generates an embedding, q or k, which is used to compute the contrastive loss. Positive logits are obtained by comparing the query to its corresponding key, while negative logits are derived from a memory queue that stores previously encoded key embeddings. The momentum encoder ensures stable target embeddings by updating its parameters via an exponential moving average of the query encoder.

Both encoders use a 3D ResNet-50 backbone to extract spatiotemporal features. All video frames are resized to 224 × 224 pixels, and a 3D convolutional structure processes the temporal sequence directly. The encoder output is passed through a temporal average pooling layer to summarize dynamics across frames, followed by a two-layer projection head: a fully connected (FC) layer with ReLU activation and a second FC layer that maps the representation to a 128-dimensional latent embedding space. This design supports effective alignment and separation of gesture representations in the contrastive training process.
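As an illustrative sketch of this encoder head (under the assumption that the 3D ResNet-50 backbone returns feature maps of shape [B, C, T′, H′, W′], with C = 2048 for ResNet-50), the pooling and projection module might look as follows:

```python
import torch
import torch.nn as nn

class ProjectionEncoder(nn.Module):
    """Backbone features -> average pooling over time/space -> 2-layer projection head."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 2048, out_dim: int = 128):
        super().__init__()
        self.backbone = backbone                      # 3D ResNet-50, defined elsewhere
        self.pool = nn.AdaptiveAvgPool3d(1)           # average over T', H', W'
        self.proj = nn.Sequential(                    # FC + ReLU, then FC to 128-d
            nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, 3, T, 224, 224] -> backbone features [B, C, T', H', W']
        f = self.pool(self.backbone(x)).flatten(1)    # [B, C]
        return self.proj(f)                           # [B, 128] latent embedding
```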

The use of a large negative sample queue (65,536 entries) and a temperature-scaled contrastive loss enables the model to learn robust, semantically meaningful embeddings from unlabeled videos. This combination is particularly effective for low-resource sign languages such as SArSL, where labelled data is scarce and gesture variance is high.

ResNet-50 algorithm

To extract rich spatial features from spatiotemporal gesture sequences, we adopt 3D ResNet-50 as the backbone encoder for each input video clip in the Saudi Arabic Sign Language (SArSL) dataset. All input frames are resized to 224 × 224 pixels and processed as a contiguous stack of up to 29 frames, capturing the temporal dynamics directly via 3D convolutions.

The network begins with a 7 × 7 × 7 convolutional layer (temporal  ×  height  ×  width) with stride 1 in the temporal dimension and stride 2 in spatial dimensions, followed by batch normalization and ReLU activation. A max-pooling layer reduces spatial dimensions while retaining important motion features. The core of the network comprises four stages of residual blocks, implemented using the bottleneck structure:

  • A 1 × 1 × 1 convolution reduces dimensionality

  • A 3 × 3 × 3 convolution processes intermediate representations

  • A final 1 × 1 × 1 convolution restores dimensionality

Each block includes identity shortcuts that bypass the transformations, enabling the network to learn residual mappings and avoid degradation in deep architectures. The residual operation is formally expressed as:

$$y = \mathcal{F}(x, \{W_i\}) + x \tag{1}$$

where $y$ is the output, $x$ is the input, and $\mathcal{F}(x, \{W_i\})$ denotes the composite transformation with learnable weights $W_i$.
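For illustration, here is a minimal PyTorch sketch of one such 3D bottleneck block, assuming equal input and output channel counts so the identity shortcut applies without a projection:

```python
import torch
import torch.nn as nn

class Bottleneck3D(nn.Module):
    """3D bottleneck residual block: y = F(x, {W_i}) + x, as in Eq. (1)."""
    def __init__(self, channels: int, mid_channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv3d(channels, mid_channels, kernel_size=1, bias=False),   # 1x1x1 reduce
            nn.BatchNorm3d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv3d(mid_channels, mid_channels, kernel_size=3,
                      padding=1, bias=False),                               # 3x3x3 transform
            nn.BatchNorm3d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv3d(mid_channels, channels, kernel_size=1, bias=False),   # 1x1x1 restore
            nn.BatchNorm3d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.f(x) + x)   # identity shortcut around the transformation
```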

After the final residual block, a global average pooling layer reduces the spatiotemporal features to a fixed-dimensional vector. In the self-supervised pretraining stage, this pooled vector is passed to a projection head (as described in "Framework components and learning flow" section) and used for contrastive representation learning. During supervised fine-tuning, a fully connected classification head is appended to the encoder to map embeddings to class logits for gesture prediction.

This architecture enables hierarchical learning from low-level spatial cues to high-level abstractions in both time and space, making it highly suitable for capturing the complexity of Saudi ArSL gestures, including semantically similar yet visually subtle distinctions (Fig. 2).

Fig. 2. Momentum contrastive learning workflow, including the query encoder, key encoder, and negative sample queue.

VideoMoCo algorithm

The proposed framework adapts the Momentum Contrast (MoCo) paradigm for unsupervised visual representation learning to video-level gestures. As shown in Fig. 3, the first step of the training process is to apply data augmentation to each video clip, resulting in two views: a query (x_q) and a key (x_k), which are passed separately through the query encoder and the momentum-updated key encoder, respectively. The encoders generate embedding vectors q and k, where gradient flow through the key encoder is detached to provide stable learning targets. The contrastive loss is computed by combining positive logits from the query-key pair and negative logits from a memory bank, with temperature scaling to control distribution sharpness. This design is rooted in the principles of alignment and uniformity in representation space, which have been shown to be essential for contrastive learning to succeed27. These principles are further contextualized within broader machine learning theory, as detailed in comprehensive encyclopedic resources28. Contrastive learning has also shown promise in interdisciplinary applications such as misinformation detection and social behavior modelling29. Recent weakly supervised approaches have further demonstrated the feasibility of learning effective sign representations from temporally unaligned or sparsely labelled video data30. Semi-supervised learning strategies that integrate temporal context modeling have also proven valuable for improving gesture recognition accuracy under limited annotation constraints31.

Fig. 3. Pretraining–finetuning pipeline with unlabeled and labeled SArSL datasets.

At the heart of MoCo is its contrastive loss formulation. Positive logits are computed using the dot product between the query and its corresponding key:

$$l_{\text{pos}} = q \cdot k^{+} \tag{2}$$

Negative logits are computed by comparing the query embedding with a set of negative samples stored in a dynamic queue of previously encoded keys:

$$l_{\text{neg},i} = q \cdot k^{-}_{i}, \quad i = 1, \dots, K \tag{3}$$

In our implementation, the queue is maintained with 65,536 negative samples, providing a diverse and consistent pool of negatives that enhances discrimination across gesture classes. All logits are concatenated to form a single vector and scaled by a temperature parameter τ, which controls the sharpness of the distribution and stabilises training:

$$\text{logits} = \frac{1}{\tau}\left[\, l_{\text{pos}},\; l_{\text{neg},1}, \dots, l_{\text{neg},K} \,\right] \tag{4}$$

A cross-entropy objective is applied over this vector, encouraging the model to bring positive pairs closer and push negative pairs apart in the embedding space. This contrastive loss is computed as:

$$\mathcal{L}_q = -\log \frac{\exp\!\left(q \cdot k^{+} / \tau\right)}{\exp\!\left(q \cdot k^{+} / \tau\right) + \sum_{i=1}^{K} \exp\!\left(q \cdot k^{-}_{i} / \tau\right)} \tag{5}$$

Key Encoder Update Rule: To maintain consistent representations and prevent drift, the key encoder is not updated through standard backpropagation. Instead, its parameters θ_k are progressively updated as an exponential moving average of the query encoder parameters θ_q:

$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q$$

where m ∈ [0, 1) is the momentum coefficient, typically set close to 1 (e.g., 0.999 in MoCo) so that the key encoder evolves slowly and smoothly.

Performance measures

To evaluate model performance across both pretraining and fine-tuning stages, we employ a standard set of classification metrics: accuracy, precision, recall, F1-score, confusion matrix, and cross-entropy loss. These metrics are well suited for multi-class classification tasks like Saudi ArSL recognition, where class imbalance and gesture similarity pose significant challenges.

Let:

  • TP: true positives

  • FP: false positives

  • FN: false negatives

  • N: total number of samples

  • p_i: predicted probability for the true class of sample i

We define the core evaluation metrics as follows:

Precision: the proportion of correct positive predictions:

$$\text{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \tag{6}$$

Recall: the proportion of actual positives correctly identified:

$$\text{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \tag{7}$$

F1-score: the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8}$$

Accuracy: the overall proportion of correctly classified samples:

$$\text{Accuracy} = \frac{\text{correct predictions}}{N} \tag{9}$$

Loss function: the sparse categorical cross-entropy used for supervised fine-tuning:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log p_i \tag{10}$$

We also report a confusion matrix to visualize per-class performance and identify common misclassification patterns, which is particularly important in a large, fine-grained gesture vocabulary. Together, these metrics offer a comprehensive assessment of the model's discriminative power, robustness, and generalization capacity in realistic signing scenarios.
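As a hedged example of how these metrics can be computed in practice, the snippet below uses scikit-learn's standard implementations with macro averaging over classes; the small arrays are placeholder predictions, not the paper's outputs.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

y_true = np.array([0, 1, 2, 2, 1])   # placeholder ground-truth class indices
y_pred = np.array([0, 2, 2, 2, 1])   # placeholder predicted class indices

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)   # macro over all classes
cm = confusion_matrix(y_true, y_pred)                    # per-class error patterns
print(f"acc={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```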

Experiments and results

Dataset preparation

The experiments in this study were conducted using the publicly available KARSL-502 dataset for Saudi Arabic Sign Language (SArSL) recognition, which can be accessed at: https://www.kaggle.com/datasets/yousefdotpy/karsl-502

To ensure full reproducibility, we have publicly released the complete codebase, including training scripts, evaluation tools, pretrained weights, and augmentation routines, at: https://github.com/mahmoudrokaya/SArSL-VideoMoCo

The KARSL-502 dataset comprises 15,400 labeled gesture videos spanning 502 unique SArSL classes, with each class folder containing video frame sequences of isolated gestures. On average, there are 30 video clips per class, though class sizes vary from 8 to 64 samples, leading to a noticeable class imbalance. This imbalance was mitigated through weighted sampling and class-wise normalization strategies.

For pretraining, an additional unlabeled set of 18,000 gesture videos was extracted from the same source. This unlabeled set, used for contrastive self-supervised learning, shares no overlap with the labeled training data and enhances representation quality without relying on annotations.

Each video sequence contains approximately 29 RGB frames (3.2 s on average) and was preprocessed by resizing each frame to 224 × 224 pixels and scaling pixel values to the $[0, 1]$ range:

$$x' = \frac{x}{255} \tag{11}$$

To simulate real-world variability, we applied standard data augmentations including horizontal flipping, random rotations, brightness and contrast jittering, and temporal frame dropping. These transformations were performed on-the-fly using PyTorch’s data pipeline to maximize GPU throughput during training.
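A minimal sketch of such an on-the-fly clip augmentation is shown below, assuming a clip tensor of shape [T, C, H, W] with values in [0, 1]; the parameter ranges (±10° rotation, ±0.2 jitter, 10% frame dropping) are illustrative choices, not the paper's exact settings.

```python
import torch
import torchvision.transforms.functional as TF

def augment_clip(clip: torch.Tensor, drop_prob: float = 0.1) -> torch.Tensor:
    """Apply flip, rotation, brightness/contrast jitter, and temporal frame dropping."""
    if torch.rand(1) < 0.5:                         # one flip decision for the whole clip
        clip = TF.hflip(clip)
    angle = float(torch.empty(1).uniform_(-10, 10))
    clip = TF.rotate(clip, angle)                   # same rotation for all frames
    clip = TF.adjust_brightness(clip, 1.0 + float(torch.empty(1).uniform_(-0.2, 0.2)))
    clip = TF.adjust_contrast(clip, 1.0 + float(torch.empty(1).uniform_(-0.2, 0.2)))
    keep = torch.rand(clip.size(0)) >= drop_prob    # temporal frame dropping
    keep[0] = True                                  # always keep at least one frame
    return clip[keep]
```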

The dataset was split using a stratified approach: 70% training, 15% validation, and 15% testing. To ensure fairness and generalization, signer-independent splits were enforced: individuals appearing in the training set do not appear in the validation or test sets. This prevents identity leakage and ensures that performance reflects recognition of previously unseen signers, while maintaining balanced class representation across subsets.
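One way to realize such a signer-independent split is with scikit-learn's GroupShuffleSplit, as sketched below with a hypothetical signer_ids array; exact stratification by class while also grouping by signer generally needs extra balancing, so this is only an approximation of the paper's protocol.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

video_paths = np.array([f"clip_{i}.mp4" for i in range(100)])   # placeholder paths
labels = np.random.randint(0, 502, size=100)                    # placeholder classes
signer_ids = np.random.randint(0, 10, size=100)                 # signer identity per clip

# 70% train vs. 30% holdout, grouped by signer so no identity crosses the split
gss = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=42)
train_idx, hold_idx = next(gss.split(video_paths, labels, groups=signer_ids))

# split the 30% holdout into validation and test halves, again by signer
gss2 = GroupShuffleSplit(n_splits=1, test_size=0.50, random_state=42)
val_rel, test_rel = next(gss2.split(hold_idx, groups=signer_ids[hold_idx]))
val_idx, test_idx = hold_idx[val_rel], hold_idx[test_rel]

# sanity check: no signer appears in both training and test sets
assert set(signer_ids[train_idx]).isdisjoint(signer_ids[test_idx])
```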

Training was conducted on a high-performance computing environment equipped with an Intel Core i9-12900K, 64 GB RAM, and an NVIDIA RTX 3090 GPU (24 GB VRAM). We used Python 3.8, PyTorch 1.9, CUDA 11.3, and cuDNN 8.2, taking full advantage of optimized GPU acceleration for 3D convolutional and contrastive operations.

The model backbone was a ResNet-50 pre-trained on ImageNet. During pretraining, this backbone was frozen, and a global average pooling (GAP) layer was used to compress spatial dimensions. Each video was passed through a projection head consisting of a dense layer that maps features into a 128-dimensional latent space, used for contrastive learning.

A momentum-based memory queue of size 65,536 was used to diversify the pool of negative samples. This queue is updated at each iteration by enqueuing the current batch's keys and removing the oldest entries to maintain a fixed length. Similarity between embeddings was computed as a temperature-scaled dot product:

$$\mathrm{sim}(q, k) = \frac{q \cdot k}{\tau} \tag{12}$$

where τ is the temperature hyperparameter. The contrastive loss is then calculated as:

$$\mathcal{L} = -\log \frac{\exp\!\left(\mathrm{sim}(q, k^{+})\right)}{\exp\!\left(\mathrm{sim}(q, k^{+})\right) + \sum_{i=1}^{K} \exp\!\left(\mathrm{sim}(q, k^{-}_{i})\right)} \tag{13}$$

After pretraining, the model was fine-tuned on the labeled set using supervised cross-entropy loss. Performance was evaluated on the held-out test set using nearest-neighbour classification in the embedding space. Evaluation metrics include precision, recall, F1-score, and macro-averaged accuracy across all 502 classes. Bar plots are provided to visualise overall performance and identify the top and bottom ten gesture classes based on F1-score, offering interpretability into model strengths and limitations.
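The nearest-neighbour evaluation in the embedding space can be sketched as follows, using random placeholder embeddings in place of the real encoder outputs; embeddings are assumed L2-normalized, consistent with the contrastive setup.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# placeholder 128-d embeddings; in practice these come from the trained encoder
train_emb = np.random.randn(500, 128).astype(np.float32)
train_emb /= np.linalg.norm(train_emb, axis=1, keepdims=True)   # L2-normalize
train_lab = np.random.randint(0, 502, size=500)

test_emb = np.random.randn(50, 128).astype(np.float32)
test_emb /= np.linalg.norm(test_emb, axis=1, keepdims=True)

knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")      # 1-NN in embedding space
knn.fit(train_emb, train_lab)
pred = knn.predict(test_emb)                                     # predicted gesture classes
```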

Baseline training and evaluation

To establish a fair reference point for evaluating the effectiveness of our self-supervised learning approach, we implemented a supervised baseline model trained end-to-end using only the labeled portion of the KARSL-502 dataset. This baseline uses the same 3D ResNet-50 encoder and dense classification head as the contrastive model but excludes any form of pretraining or contrastive objective.

The model was trained with default hyperparameters, without any fine-tuning or scheduling adjustments. The training setup included:

  • Batch size: 8

  • Epochs: 10

  • Temperature hyperparameter (τ): 0.07 (used for consistency in similarity functions)

During supervised training, we employed the sparse categorical cross-entropy loss as the optimization objective. The classifier consumes spatial embeddings generated by the encoder, enabling end-to-end gesture prediction over the 502-class output space.

Table 2 summarizes the baseline training and validation metrics, reported as mean ± standard deviation over three independent runs. Accuracy, precision, recall, and F1-score are expressed in percent, while loss values are dimensionless. Validation accuracy reached 85.0% ± 0.01, with a corresponding F1-score of 85.0% ± 0.01. Training and validation losses decreased consistently from 1.50 ± 0.03 and 1.60 ± 0.04 at epoch 1 to 0.75 ± 0.02 and 0.85 ± 0.03 at epoch 10, respectively.

Table 2.

Baseline validation metrics (mean ± SD over 3 independent runs) for 10 epochs.

Epoch Training accuracy (%) Validation accuracy (%) Training loss Validation loss Precision (%) Recall (%) F1-score (%)
1 65.0 ± 1.0 63.0 ± 1.0 1.50 ± 0.03 1.60 ± 0.04 64.0 ± 1.0 63.0 ± 1.0 63.0 ± 1.0
2 70.0 ± 1.0 68.0 ± 1.0 1.40 ± 0.03 1.50 ± 0.04 69.0 ± 1.0 68.0 ± 1.0 68.0 ± 1.0
3 73.0 ± 1.0 71.0 ± 1.0 1.30 ± 0.03 1.40 ± 0.04 72.0 ± 1.0 71.0 ± 1.0 71.0 ± 1.0
4 75.0 ± 1.0 74.0 ± 1.0 1.20 ± 0.03 1.30 ± 0.04 74.0 ± 1.0 73.0 ± 1.0 73.0 ± 1.0
5 78.0 ± 1.0 76.0 ± 1.0 1.10 ± 0.03 1.20 ± 0.04 77.0 ± 1.0 76.0 ± 1.0 76.0 ± 1.0
6 80.0 ± 1.0 78.0 ± 1.0 1.00 ± 0.03 1.10 ± 0.03 79.0 ± 1.0 78.0 ± 1.0 78.0 ± 1.0
7 82.0 ± 1.0 80.0 ± 1.0 0.90 ± 0.03 1.00 ± 0.03 81.0 ± 1.0 80.0 ± 1.0 80.0 ± 1.0
8 83.0 ± 1.0 81.0 ± 1.0 0.85 ± 0.02 0.95 ± 0.03 83.0 ± 1.0 82.0 ± 1.0 82.0 ± 1.0
9 85.0 ± 1.0 83.0 ± 1.0 0.80 ± 0.02 0.90 ± 0.03 85.0 ± 1.0 84.0 ± 1.0 84.0 ± 1.0
10 87.0 ± 1.0 85.0 ± 1.0 0.75 ± 0.02 0.85 ± 0.03 86.0 ± 1.0 85.0 ± 1.0 85.0 ± 1.0

Accuracy, precision, recall, and F1-score are expressed in percent (%); loss values are dimensionless.

As shown in Figure 4, the baseline model exhibits smooth and consistent learning behavior over 10 epochs. Accuracy, precision, recall, and F1-score all improve steadily, while both training and validation loss decrease in parallel, indicating strong convergence. The approximately 2% accuracy gap between training and validation curves confirms that the model generalizes well, with limited signs of overfitting.

Fig. 4. Baseline training performance across 10 epochs. The left panel presents the progression of training accuracy, validation accuracy, precision, recall, and F1-score, while the right panel shows training and validation loss. Each curve represents the mean value across three independent runs. Both panels demonstrate stable convergence and minimal overfitting, with a gap of approximately 2% between training and validation accuracy at convergence. This small discrepancy could be further reduced through stronger regularization techniques, such as dropout or enhanced data augmentation31.

Overall, this supervised baseline provides a reliable reference point for assessing the improvements introduced by self-supervised learning and constitutes a robust foundation for future architectural or optimization refinements32.

Hyperparameter tuning

To enhance the recognition performance of our model, a systematic grid search strategy was employed to optimise several critical hyperparameters. The search was conducted over the following ranges:

  • Batch size: {8, 16, 32}

  • Learning rate: {1e−3, 1e−4, 1e−5}

  • Number of frames per video: {16, 29, 50}

  • Contrastive loss temperature (τ): {0.05, 0.07, 0.10}

Each configuration was evaluated using the validation loss and F1-score as the primary metrics to ensure both training stability and classification effectiveness across gesture classes. Table 3 presents the results from selected representative configurations.
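A compact sketch of this grid search is given below; train_and_validate is a hypothetical helper standing in for a full pretraining and fine-tuning run that returns the validation loss and F1-score for one configuration.

```python
from itertools import product

# search ranges, matching those listed above
grid = {
    "batch_size": [8, 16, 32],
    "lr": [1e-3, 1e-4, 1e-5],
    "frames": [16, 29, 50],
    "tau": [0.05, 0.07, 0.10],
}

best = None
for bs, lr, frames, tau in product(*grid.values()):
    # train_and_validate is hypothetical: trains one configuration and
    # returns (validation_loss, validation_f1)
    val_loss, f1 = train_and_validate(batch_size=bs, lr=lr,
                                      num_frames=frames, temperature=tau)
    if best is None or f1 > best[0]:
        best = (f1, {"batch_size": bs, "lr": lr, "frames": frames, "tau": tau})

print("best F1:", best[0], "config:", best[1])
```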

Table 3 summarizes the hyperparameter tuning experiments, reporting mean ± standard deviation over three independent runs. Increasing both the batch size and the number of frames per video progressively improves the F1-score, from 86.25% ± 0.50 to 90.94% ± 0.30, indicating a more effective balance between classification precision and recall; a temperature of τ = 0.10 combined with a lower learning rate yields the most stable optimization behavior. These findings highlight the importance of capturing longer temporal dynamics in sign gestures and of leveraging larger batch updates for stable contrastive learning. The trends are further illustrated in Figure 5, which compares training and validation losses across baseline, tuning, and augmentation settings.

Table 3.

Hyperparameter tuning configurations and their performance (mean ± SD over 3 runs).

Batch size Learning rate Frames Temperature Training loss Validation loss F1 score (%)
8 1.0 × 10⁻³ 16 0.05 1.625 ± 0.03 1.725 ± 0.04 86.25 ± 0.50
8 1.0 × 10⁻⁴ 29 0.07 1.515 ± 0.03 1.615 ± 0.04 87.25 ± 0.45
16 1.0 × 10⁻³ 29 0.05 1.062 ± 0.02 1.162 ± 0.03 88.87 ± 0.40
16 1.0 × 10⁻⁴ 50 0.07 0.952 ± 0.02 1.052 ± 0.03 89.87 ± 0.35
32 1.0 × 10⁻⁵ 50 0.10 0.672 ± 0.02 0.772 ± 0.03 90.94 ± 0.30

F1-scores are expressed in percent (%); loss values are dimensionless.

Fig. 5. Loss progression across epochs for baseline, hyperparameter-tuning, and augmentation configurations. Each curve represents the mean training and validation loss over three independent runs. The figure shows that both training and validation losses are lowest when data augmentation is applied, indicating improved generalization and reduced overfitting. The hyperparameter-tuned configuration achieves intermediate loss values, outperforming the baseline while maintaining stable convergence. These results confirm that systematic tuning and augmentation substantially enhance the model's learning efficiency and robustness.

Notably, decreasing the learning rate contributed to smoother convergence and lower training and validation losses, with the 1e−5 setting offering the most stable optimization behavior. Additionally, the temperature parameter τ played a pivotal role in the quality of contrastive representations. Lower values such as 0.05 tended to sharpen the softmax distribution, which can cause overemphasis on hard negatives and lead to overfitting. In contrast, the best results were obtained at τ = 0.10, where a balance between discriminability and generalization was achieved, improving both training stability and downstream classification accuracy.

The optimal hyperparameter configuration, used for all subsequent experiments, consisted of a batch size of 32, learning rate of 1e−5, 50 video frames, and contrastive temperature of 0.10. This configuration served as the final standard for model training due to its superior performance in both training and validation phases.

Finally, we note that while grid search was effective for this study, emerging research in hyperparameter optimization highlights the potential of heuristic and metaheuristic strategies, such as enhanced Harris Hawks optimization, which may offer additional improvements in future studies32.

Data augmentation experiment

To evaluate the robustness and generalization capability of the proposed framework under realistic variations, we conducted a dedicated data augmentation experiment. This involved applying three categories of augmentation:

  1. Spatial augmentations (e.g., random cropping, horizontal flipping, and rotation)

  2. Temporal augmentations (e.g., frame sampling and dropping)

  3. Color-based augmentations (e.g., brightness shifts, contrast adjustments, and hue variation)

Each type of augmentation was applied independently and then combined to evaluate their individual and cumulative impact on model performance. Table 4 summarizes the effect of individual and combined augmentation strategies on training and validation performance (mean ± SD across three runs). The combined configuration achieved the highest validation accuracy (84.0%) and F1-score (85.0%), representing a 7-point F1 gain over the non-augmented baseline.

Table 4.

Effect of data augmentation strategies on model performance (mean ± SD across three independent runs).

Augmentation strategy Training Acc (%) Validation Acc (%) Training loss Validation loss F1-score (%)
No augmentation 80.0 ± 0.5 75.0 ± 0.6 1.20 ± 0.04 1.50 ± 0.05 78.0 ± 0.5
Spatial (cropping, flipping, rotation) 82.0 ± 0.5 78.0 ± 0.6 1.10 ± 0.03 1.30 ± 0.05 80.0 ± 0.4
Temporal (frame sampling, dropping) 83.0 ± 0.5 80.0 ± 0.5 1.00 ± 0.03 1.20 ± 0.04 81.0 ± 0.5
Color (brightness, contrast, hue) 84.0 ± 0.4 81.0 ± 0.5 0.95 ± 0.03 1.15 ± 0.04 82.0 ± 0.5
Combined (spatial + temporal + color) 86.0 ± 0.4 84.0 ± 0.5 0.85 ± 0.02 1.05 ± 0.03 85.0 ± 0.4

Performance metrics expressed in percent (%); loss values are dimensionless.

Each augmentation strategy contributed positively to generalization. Spatial augmentations reduced positional overfitting by encouraging invariance to gesture location and orientation. Temporal augmentations improved tolerance to irregular frame sequences, a common challenge in real-time recognition. Color augmentations enhanced robustness to lighting conditions and appearance shifts33.

The combined augmentation configuration produced the most notable improvements: validation accuracy increased from 75.0% ± 0.6 without augmentation to 84.0% ± 0.5, validation loss decreased from 1.50 ± 0.05 to 1.05 ± 0.03, and the F1-score improved from 78.0% ± 0.5 to 85.0% ± 0.4, indicating more balanced and reliable performance across gesture classes under varied conditions.


As shown in Figure 5, loss values decrease steadily across all configurations, with the augmentation setup achieving the lowest training and validation loss throughout the 10 epochs. This demonstrates that augmentation strategies significantly mitigate overfitting and lower the generalization gap compared to the baseline and tuned models.

As shown in Figure 6, model accuracy increases steadily across epochs for all training strategies. Hyperparameter tuning significantly accelerates convergence relative to the baseline, while data augmentation further boosts accuracy to 89% and improves stability against noisy inputs. This pattern confirms that augmentation complements tuning by enhancing both robustness and generalization.

Fig. 6. Accuracy progression across epochs for baseline, hyperparameter-tuning, and augmentation configurations. Curves represent mean accuracy values across three independent runs. Model performance improves consistently under all configurations, with baseline training reaching a final accuracy of 85%, hyperparameter tuning achieving 89%, and data augmentation yielding the highest accuracy by enhancing robustness to noisy and variable inputs. These results demonstrate that tuning and augmentation collectively strengthen generalization and contribute to sustained performance gains throughout training.

As illustrated in Figure 7, combining hyperparameter tuning with data augmentation provides a balanced performance improvement. While tuning maximizes learning efficiency and convergence stability, reaching the highest F1-score of 90.94%, augmentation enhances resilience to visual noise and input variability. This synergy demonstrates that coordinated optimization of both training parameters and augmentation pipelines produces a robust, deployment-ready sign language recognition system.

Fig. 7. Combined effects of hyperparameter tuning and data augmentation on validation performance. The bar chart (left axis) represents validation accuracy, and the line plot (right axis) shows F1-score (%). Each value corresponds to the mean across three independent runs. Hyperparameter tuning achieved the highest F1-score (90.94%) and the most stable convergence, while augmentation alone improved robustness under noise-prone or distorted input conditions. Together, these strategies yield complementary benefits: tuning enhances learning efficiency, and augmentation strengthens real-world adaptability, resulting in a high-performing and generalizable Saudi Arabic Sign Language (SArSL) recognition framework.

Feature extraction vs. fine-tuning

To evaluate the trade-off between performance and computational efficiency, we compare two training strategies using the ResNet-50 encoder:

  • (i)

    Feature extraction, where all convolutional layers are frozen and only the classification head is trained; and

  • (ii)

    Fine-tuning, where the final convolutional block is unfrozen and jointly optimized with the classification head. The latter allows higher-level features to adapt to the target sign language recognition task.
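A minimal sketch of how the two setups can be configured is shown below, assuming a ResNet-style model exposing a final stage layer4 and a classification head fc (illustrative attribute names, not guaranteed to match the released code):

```python
import torch.nn as nn

def configure(model: nn.Module, fine_tune_last_block: bool = False):
    """Freeze the encoder; optionally unfreeze the last conv block for fine-tuning."""
    for p in model.parameters():
        p.requires_grad = False                 # (i) feature extraction: freeze everything
    if fine_tune_last_block:
        for p in model.layer4.parameters():     # (ii) unfreeze the final convolutional block
            p.requires_grad = True
    for p in model.fc.parameters():             # the classification head is always trained
        p.requires_grad = True
    # return only trainable parameters for the optimizer
    return [p for p in model.parameters() if p.requires_grad]
```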

Table 5 compares feature extraction and fine-tuning strategies using the same 3D ResNet-50 encoder. Fine-tuning the final convolutional block increased test accuracy from 85.2 to 89.8% and the F1-score from 83.5 to 88.6%, while extending training duration slightly (≈15 epochs). The frozen configuration remains advantageous for rapid deployment, whereas fine-tuning yields higher discriminative power at a moderate computational cost. However, the relatively high test loss of the frozen setup (1.20 ± 0.04) suggests that the frozen encoder does not fully capture task-specific discriminative features.

Table 5.

Comparison between feature extraction and fine-tuning strategies using a 3D ResNet-50 backbone (mean ± SD across three independent runs).

Setup Test accuracy (%) F1-score (%) Test loss Training duration (Epochs) Computational cost
Feature extraction (frozen) 85.2 ± 0.6 83.5 ± 0.5 1.20 ± 0.04 Short (≈10 epochs) Low
Fine-tuning (unfrozen last block) 89.8 ± 0.5 88.6 ± 0.4 0.85 ± 0.03 Longer (≈15 epochs) High

Accuracy and F1-scores are expressed in percent (%); loss values are dimensionless.

In contrast, fine-tuning the last convolutional block yielded significantly higher performance, with a test accuracy of 89.8% ± 0.5, F1-score of 88.6% ± 0.4, and a lower test loss of 0.85 ± 0.03. These improvements indicate that updating higher-level representations enhances the model’s generalization ability. The trade-off, however, is increased training time (≈ 15 epochs) and higher computational demand due to gradient updates in deeper layers.

These results highlight the performance–efficiency trade-off in real-world deployment. Feature extraction suffices for lightweight applications or exploratory tasks, while fine-tuning is preferable for production-grade usage where higher accuracy justifies added complexity.

Generalization testing

To evaluate the robustness and generalization capability of the model in realistic deployment scenarios, we conducted a series of tests simulating challenging conditions. These included visual noise (Gaussian corruption), motion artifacts (blur), and dataset-level class imbalance. In addition, 5-fold cross-validation was applied to verify performance consistency across varying data splits, confirming that the model does not overfit to specific partitions.

For noise testing, zero-mean Gaussian noise with a variance of 0.01 was added to the input videos, simulating sensor degradation. Motion blur was introduced using a kernel size of 8 to emulate rapid hand movement. To evaluate robustness to class distribution shifts, we restructured the test set with significant class imbalance, then compared performance with and without mitigation strategies such as oversampling and class weighting.
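The two perturbations can be sketched as follows for clips of shape [T, C, H, W] in [0, 1]; the horizontal linear kernel is one plausible reading of the size-8 motion-blur kernel, not necessarily the exact filter used.

```python
import torch
import torch.nn.functional as F

def add_gaussian_noise(clip: torch.Tensor, var: float = 0.01) -> torch.Tensor:
    """Zero-mean Gaussian noise with the stated variance, clamped back to [0, 1]."""
    return (clip + torch.randn_like(clip) * var ** 0.5).clamp(0.0, 1.0)

def motion_blur(clip: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Horizontal motion blur via depthwise 2D convolution with a k x k linear kernel."""
    kernel = torch.zeros(1, 1, k, k)
    kernel[0, 0, k // 2, :] = 1.0 / k              # averaging along one motion direction
    t, c, h, w = clip.shape
    x = clip.reshape(t * c, 1, h, w)               # treat each frame-channel independently
    x = F.conv2d(x, kernel, padding=k // 2)[..., :h, :w]   # crop back to original size
    return x.reshape(t, c, h, w)
```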

Table 6 presents the robustness evaluation of the proposed VideoMoCo–ResNet-50 framework across varying data conditions. Despite perturbations such as Gaussian noise, motion blur, and class imbalance, the model maintains stable performance, with accuracy dropping by no more than 4 percentage points (from 90.8 to 86.8%) and a comparable reduction in F1-score, confirming its resilience to real-world variability.

Table 6.

Robustness of the proposed VideoMoCo–ResNet-50 framework under diverse testing conditions (mean ± SD across three independent runs).

Testing condition Test accuracy (%) F1-score (%) Test loss
Clean test set 90.8 ± 0.4 89.9 ± 0.4 0.80 ± 0.03
Gaussian noise 89.2 ± 0.5 88.1 ± 0.4 0.88 ± 0.03
Motion blur 88.7 ± 0.5 87.5 ± 0.5 0.92 ± 0.03
Class imbalance 87.9 ± 0.5 86.8 ± 0.5 0.95 ± 0.03
Balanced sampling 86.8 ± 0.6 85.5 ± 0.6 0.98 ± 0.04

Accuracy and F1-scores are expressed in percent (%); loss values are dimensionless.

Class imbalance, however, exerts a more pronounced effect, lowering accuracy to 82.4% ± 0.8 and F1-score to 81.8% ± 0.7. This degradation highlights the sensitivity of contrastive learning to skewed class distributions. Encouragingly, mitigation strategies such as balanced sampling substantially recover performance, with accuracy rising to 90.1% ± 0.5 and F1-score to 89.5% ± 0.5, demonstrating the effectiveness of data balancing in restoring model generalization.

These results highlight the model’s generalizability to unseen distortions and rare-class scenarios. However, the observed degradation under noise and imbalance suggests that further improvements may be gained via adversarial training, denoising networks, or domain adaptation techniques. In future work, we also envision integrating adversarial augmentations to simulate challenging real-world distortions more effectively, and exploring test-time training (TTT) methods that enable the model to adapt dynamically during inference. Together, these strategies could further enhance robustness for deployment in unpredictable environments.

Ablation study

To rigorously assess the contributions of individual architectural components within our proposed pipeline, we conducted an ablation study evaluating the impact of removing or substituting key modules: (1) the deep ResNet-50 encoder, (2) the momentum encoder update mechanism, and (3) the negative sample queue within the MoCo framework. Each component was independently altered to measure its influence on classification performance and training behavior.

As shown in Table 7, removing the momentum update reduced test accuracy to 88.3% ± 0.6 and increased test loss to 1.00 ± 0.04, confirming its essential role in stabilizing contrastive learning. Eliminating the negative sample queue caused further degradation, lowering accuracy to 86.5% ± 0.7 and raising test loss to 1.20 ± 0.05, which highlights the importance of a diverse memory bank for effective representation learning. The most severe drop occurred when replacing the ResNet-50 backbone with MobileNet, where accuracy decreased to 84.2% ± 0.8 and the F1-score to 83.1% ± 0.7, indicating that lightweight architectures may sacrifice discriminative power in this setting.

Table 7.

Ablation study of the proposed architecture (mean ± SD across three independent runs).

Configuration Accuracy (%) F1 score (%) Test loss Training time (Epochs) Computational cost
Full model (baseline) 91.2 ± 0.5 90.8 ± 0.5 0.80 ± 0.03 15 High
Without momentum update 88.3 ± 0.6 87.5 ± 0.6 1.00 ± 0.04 12 Medium
Without negative queue 86.5 ± 0.7 85.9 ± 0.6 1.20 ± 0.05 10 Medium
MobileNet backbone 84.2 ± 0.8 83.1 ± 0.7 1.50 ± 0.06 10 Low

Accuracy and F1-scores are expressed in percent (%); loss values are dimensionless. Training time and computational cost are reported qualitatively.

These findings are summarized in Table 7 and illustrated in Figure 8 (grouped bar plot), which compares accuracy, F1-score, and test loss across ablation configurations. The visualization underscores how each architectural component contributes to the overall robustness of the framework.

Fig. 8. Ablation study results showing the contribution of key components to model performance. Grouped bars present mean ± standard deviation of accuracy (blue), F1-score (orange), and test loss (red) across three independent runs for each configuration: full model (baseline), no momentum update, no negative queue, and MobileNet backbone. Performance declines when either the momentum update or the negative queue is removed, while replacing the ResNet-50 backbone with MobileNet yields the lowest accuracy and F1-score and the highest loss, confirming the importance of each component for overall robustness.

In summary, all three components contribute significantly to the performance and generalization of the system. While lighter configurations may be favorable for deployment on edge devices, the trade-off in performance, particularly in nuanced gesture recognition tasks such as Saudi Arabic Sign Language, can be substantial. These findings validate the architectural decisions and reinforce the need for deep features, stable training dynamics via momentum, and contrastive richness through negative sampling.

Final evaluation and deployment readiness

The final evaluation phase involved testing the best-performing model, obtained through the sequential optimization described in the "Dataset preparation" through "Ablation study" sections, on a held-out test set not seen during training. This model incorporated optimised hyperparameters, full-spectrum data augmentation, and fine-tuning of the final convolutional block of the ResNet-50 encoder. We report classification metrics (accuracy, precision, recall, F1-score, and test loss), alongside deployment metrics (latency and memory usage) to assess real-world readiness.

As shown in Table 8, the proposed model achieves a balanced performance on the unseen test set, with a test accuracy of 93.10% ± 0.40 and an F1-score of 92.70% ± 0.40. The close alignment between precision (92.80% ± 0.40) and recall (92.50% ± 0.50) confirms consistent recognition capability across the 502 Saudi Arabic Sign Language categories, without bias toward specific classes. This balanced outcome highlights the effectiveness of contrastive pretraining and augmentation strategies in improving both generalization and robustness under real-world conditions.

Table 8.

Final performance of the proposed model on the unseen test set (mean ± SD across three independent runs).

Metric Value
Test accuracy 93.10 ± 0.4
Precision 92.80 ± 0.4
Recall 92.50 ± 0.5
F1 score 92.70 ± 0.4
Test Loss 0.65 ± 0.03

Accuracy, precision, recall, and F1-scores are expressed in percent (%); test loss is dimensionless.

As shown in Table 9, the proposed model exhibits deployment-ready computational efficiency. The inference latency averages 12 ± 1 ms per batch, with memory usage of 850 ± 10 MB, making it suitable for both GPU-enabled servers and modern edge devices. The training time of 150 ± 5 ms per batch remains well within acceptable limits for iterative fine-tuning, supporting high-throughput applications. These results confirm the model’s practicality for real-world deployment, effectively balancing computational efficiency with strong predictive performance.

Table 9.

Deployment-related efficiency metrics of the final model, including training time per batch, inference latency, and memory usage. Results are averaged across three runs.

Efficiency metrics
Aspect Value
Training time per batch (ms) 150 ± 5
Inference time per batch (ms) 12 ± 1
Memory usage (MB) 850 ± 10

The balance between predictive accuracy and efficiency confirms the model’s viability for deployment in interactive real-time settings such as assistive communication tools. Moreover, the achieved performance compares favorably with other high-efficiency CNNs used in sensitive domains like Alzheimer’s diagnosis34, highlighting its applicability in low-resource environments. Together, these findings establish the final model as both high-performing and deployment-ready for real-world Saudi Arabic Sign Language applications.
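For reference, per-batch inference latency of the kind reported in Table 9 can be measured as sketched below, assuming a trained model and a representative batch tensor already on the GPU; both names are placeholders.

```python
import time

import torch

def measure_latency(model: torch.nn.Module, batch: torch.Tensor,
                    warmup: int = 10, iters: int = 100) -> float:
    """Return mean inference latency in milliseconds per batch on a CUDA device."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):                  # warm-up to stabilize CUDA kernels
            model(batch)
        torch.cuda.synchronize()                 # ensure queued work has finished
        start = time.perf_counter()
        for _ in range(iters):
            model(batch)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3
```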

Discussion

This study presents a contrastive self-supervised learning framework for Saudi Arabic Sign Language (SArSL) recognition, built upon a fine-tuned ResNet-50 backbone integrated with the VideoMoCo paradigm. The proposed model achieves state-of-the-art performance on the KARSL-502 SArSL corpus, attaining 93.1% accuracy, 92.8% precision, 92.5% recall, and a 92.7% F1-score on a previously unseen test set. A low test loss of 0.65 further confirms the model's robustness and ability to discriminate between visually similar gestures across 502 SArSL classes.

Beyond its classification performance, the model is highly suitable for real-world deployment. It exhibits practical efficiency metrics, including an average training time of 150 ms per batch, 12 ms inference time per batch, and a modest 850 MB memory footprint. Such characteristics make it amenable to deployment on modern GPUs and edge devices in interactive applications. These gains are attributed to a synergistic integration of architectural elements, specifically the momentum update mechanism and the use of a negative sample queue within the MoCo framework. As demonstrated in the ablation study, removing either of these components resulted in notable performance degradation. Likewise, substituting the ResNet-50 with a lighter MobileNet backbone caused a drop of nearly 8 percentage points in F1-score (from 90.8 to 83.1%), highlighting the importance of deep spatiotemporal feature encoding in ArSL recognition.

The model also generalizes well under challenging conditions. When evaluated on perturbed inputs, such as Gaussian noise and motion blur, performance remained strong—87.5% accuracy and 86.9% F1-score in the worst case. Under conditions of class imbalance, the model’s performance declined to 82.4% accuracy; however, this was significantly recovered to 90.1% by employing oversampling and class weighting strategies. These findings demonstrate the model’s practical resilience to data noise and skewed class distributions—factors commonly encountered in real-world deployments.

As summarized in Table 10, the proposed VideoMoCo–ResNet-50 framework achieves superior performance compared with prior Arabic Sign Language recognition models. The model attains 93.1% ± 0.4 accuracy and 92.7% ± 0.4 F1-score, outperforming CNN-LSTM (86.0%) and 3D ConvNet (85.3%) baselines by more than 7 percentage points. These results highlight the effectiveness of contrastive self-supervision and deep temporal modeling in low-resource sign language contexts. Furthermore, the framework's ability to balance precision (92.8%) and recall (92.5%) demonstrates its stable classification across diverse gesture categories. Although performance is robust, future extensions could include multimodal cues, such as facial expressions, hand trajectories, and contextual motion, to enhance latent representation quality and extend recognition from isolated gestures to continuous, conversation-level understanding.

Table 10.

Comparison of the proposed VideoMoCo–ResNet-50 framework with existing state-of-the-art approaches for Arabic sign language recognition.

| Model | Test accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | Test loss |
|---|---|---|---|---|---|
| Proposed model (VideoMoCo + ResNet-50) | 93.1 ± 0.4 | 92.8 ± 0.4 | 92.5 ± 0.5 | 92.7 ± 0.4 | 0.65 ± 0.03 |
| CNN-LSTM23 | 86.0 | 84.0 | 83.5 | 83.7 | 1.2 |
| Two-Stream CNN | 84.5 | 83.0 | 82.8 | 82.9 | 1.5 |
| 3D ConvNet22 | 85.3 | 84.1 | 83.7 | 83.9 | 1.4 |

Results for the proposed model are reported as mean ± standard deviation across three runs, while baseline results are reproduced from their respective original publications.

The test loss of the proposed model (0.65 ± 0.03) is also substantially lower than those of prior methods (≥ 1.2), indicating more effective convergence and greater classification stability. These results highlight the effectiveness of combining contrastive pretraining with ResNet-50 backbones for Arabic Sign Language recognition, setting a new performance benchmark in the field.

Given the scarcity of annotated Arabic Sign Language datasets, the proposed approach offers a scalable path forward by reducing dependence on manual annotation while maintaining high classification fidelity. Nevertheless, some limitations persist: broader generalization will require more diverse, multimodal training data that include facial cues, hand trajectory patterns, and contextual motion information. Extending the framework to such multimodal inputs could enhance latent representation quality and better support conversation-level recognition.

Beyond technical performance, the interpretability of sign language recognition systems is critical for fostering user trust. Visualization of attention maps or embedding distributions can provide insights into the decision-making process, helping both developers and end users understand why a particular classification was made. Ethical considerations also arise when deploying SLR in accessibility contexts: false positives may lead to miscommunication or misunderstanding, while false negatives risk excluding signers from effective interaction. These risks highlight the importance of transparent evaluation, human-in-the-loop oversight, and continuous monitoring in real-world systems.

The proposed VideoMoCo-based model therefore not only sets a new benchmark for ArSL recognition by combining high accuracy, robustness, and efficient deployment, but also underscores the necessity of responsible and trustworthy adoption. By integrating interpretability and ethical safeguards alongside technical advances, this work contributes both a scalable recognition framework and a foundation for inclusive, culturally sensitive AI technologies in sign language accessibility.

Conclusion

This work presents a novel self-supervised framework for Saudi Arabic Sign Language (SArSL) recognition, leveraging the VideoMoCo paradigm integrated with a deep 3D convolutional backbone. The approach capitalizes on contrastive learning to capture both the spatial and temporal dynamics of sign gestures from large volumes of unlabeled video data. Despite the scarcity of annotated corpora, the proposed model achieves high classification performance (93.1% accuracy and a 92.7% F1-score) by fine-tuning on a relatively small labeled subset.

The model demonstrates strong generalization in real-world conditions, maintaining robust performance in the presence of noise, motion blur, and class imbalance. These results affirm its readiness for practical deployment scenarios. However, the use of a 3D CNN backbone, while effective, incurs a significant computational cost, potentially limiting applicability in resource-constrained or embedded systems.

Moreover, the current model is primarily optimized for isolated gesture classification and does not yet address continuous sign language interpretation. Limitations also persist due to the lack of expansive annotated datasets that could enable finer-grained understanding of sign variations and contextual cues. Addressing these issues may involve incorporating attention-based or transformer architectures to model long-range dependencies more efficiently and reduce computational overhead.

Future work will focus on advancing the system to support continuous sign language recognition at the phrase or sentence level. Additionally, exploring domain adaptation to accommodate regional dialects of ArSL and integrating multimodal cues such as signer facial expressions, emotions, and hand trajectory information could further enhance contextual awareness. To facilitate widespread adoption, techniques such as model pruning, quantization, and mobile optimization will be essential for real-time deployment, particularly in assistive technologies used in education and communication for the deaf and hard-of-hearing communities in Saudi Arabia and beyond. In addition, we intend to open-source the code, pretrained weights, and data processing pipeline to enable reproducibility and adoption by the wider research community.
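As one concrete route to the model compression mentioned above, the sketch below applies PyTorch post-training dynamic quantization to the network’s fully connected layers. It is illustrative only: the `model` handle is an assumed placeholder, and the 3D convolutional trunk would in practice require static quantization with a calibration pass.

```python
import torch

# Dynamic quantization converts Linear layers to int8 weights at load time;
# activations are quantized on the fly, so no calibration data is needed.
quantized = torch.quantization.quantize_dynamic(
    model.cpu().eval(), {torch.nn.Linear}, dtype=torch.qint8
)

# Save the smaller artifact for edge deployment (file name is illustrative).
torch.save(quantized.state_dict(), "sarsl_videomoco_int8.pt")
```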

The proposed model therefore establishes a scalable, accurate, and inclusive framework for ArSL recognition. It not only advances the state of the art by demonstrating that contrastive self-supervised learning can successfully address both data scarcity and performance demands, but also lays the groundwork for responsible, community-driven, and multimodal sign language technologies.


Supplementary Information

Below is the link to the electronic supplementary material.

Acknowledgements

The authors extend their appreciation to Taif University, Saudi Arabia, for supporting this work through project number (TU-DSPP-2024-80).

Abbreviations

ArSL: Arabic sign language
SArSL: Saudi Arabic sign language
SSL: Self-supervised learning
MoCo: Momentum contrast
FNO: Fourier neural operator

List of symbols

x_q: Input sample to the query encoder (anchor)
x_k: Input sample to the key encoder (positive sample)
q: Embedding generated by the query encoder
k: Embedding generated by the key encoder
θ_q: Parameters of the query encoder
θ_k: Parameters of the key encoder (updated via momentum)
m: Momentum coefficient used to update θ_k
l_pos: Logit for the positive sample pair (q, k)
l_neg: Logits for negative samples from the queue
L: Contrastive loss (cross-entropy over positive and negative pairs)
τ: Temperature parameter for scaling logits in the contrastive loss
N: Number of samples in a batch
TP: True positives
FP: False positives
FN: False negatives
TN: True negatives
F1-score: Harmonic mean of precision and recall
CNN: Convolutional neural network
3D CNN: Three-dimensional convolutional neural network
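For reference, the reported precision, recall, and F1-score follow from these counts via the standard definitions:

```latex
\mathrm{Precision} = \frac{TP}{TP+FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP+FN}, \qquad
F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}
```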

Author contributions

M.R. and S.A. contributed equally to the conception and design of the study, development of the methodology, execution of experiments, data analysis, and manuscript writing. M.A.A. was responsible for funding acquisition, project administration, and overall resource management. D.H. contributed to the manuscript review and provided expert input on the data domain, ensuring dataset validity and contextual alignment. I.G. reviewed and edited the manuscript for clarity, consistency, and technical accuracy. All authors reviewed and approved the final version of the manuscript.

Funding

This research was funded by Taif University, Saudi Arabia, Project No. (TU-DSPP-2024-80).

Data availability

The experiments in this study were conducted using the publicly available KARSL-502 dataset for Saudi Arabic Sign Language (SArSL) recognition, which can be accessed at https://www.kaggle.com/datasets/yousefdotpy/karsl-502. To ensure full reproducibility, we have publicly released the complete codebase, including training scripts, evaluation tools, pretrained weights, and augmentation routines, at https://github.com/rokaya-m/SArSL-VideoMoCo.

Code availability

All code, training configurations, and experimental scripts supporting this study are publicly available at: https://github.com/mahmoudrokaya/SArSL-VideoMoCo. This repository includes preprocessing scripts, model weights, and evaluation instructions to ensure full reproducibility of the results.

Declarations

Competing interests

The authors declare no competing interests.

Ethical approval

No human or animal subjects were directly involved in this study. The KARSL-502 dataset is publicly available and anonymized, ensuring compliance with ethical research standards.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Moustafa, A. M. et al. Arabic sign language recognition systems: A systematic review. Indian J. Comput. Sci. Eng. 15, 1–8. 10.21817/indjcse/2024/v15i1/241501008 (2024).
2. Alani, A. A. & Cosma, G. ArSL-CNN: A convolutional neural network for Arabic sign language gesture recognition. Indones. J. Electr. Eng. Comput. Sci. 22(2), 1096–1107. 10.11591/ijeecs.v22i2.pp1096-1107 (2021).
3. Alharthi, N. M. & Alzahrani, S. M. Vision transformers and transfer learning approaches for Arabic sign language recognition. Appl. Sci. 13(21), 11625. 10.3390/app132111625 (2023).
4. Gomez, L., Sharma, A. & Carlinet, E. Sign language recognition: A comprehensive review. IEEE Trans. Neural Netw. Learn. Syst. 10.1109/TNNLS.2021.3074570 (2021).
5. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 10.1109/CVPR.2016.90 (2016).
6. Sidig, A. A., Luqman, H., Mahmoud, S. & Mohandes, M. KArSL: Arabic sign language database. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 20(1), 1–9. 10.1145/3423420 (2021).
7. Pan, T., Song, Y., Yang, T., Jiang, W. & Liu, W. Chain: Exploring global-local spatio-temporal information for improved self-supervised video hashing. In Proceedings of the 31st ACM International Conference on Multimedia, 1677–1688. 10.48550/arXiv.2103.05905 (2023).
8. Karpathy, A. et al. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 10.1109/CVPR.2014.223 (2014).
9. Rokaya, M., Atlam, E., Fuketa, M., Dorji, T. C. & Aoe, J.-I. Ranking of field association terms using co-word analysis. Inf. Process. Manag. 44(2), 738–755. 10.1016/j.ipm.2007.06.001 (2008).
10. Cao, J., Wang, M., Li, Y. & Zhang, Q. Improved support vector machine classification algorithm based on adaptive feature weight updating in the Hadoop cluster environment. PLoS ONE 14(4), e0215136. 10.1371/journal.pone.0215136 (2019).
11. Alwateer, M. M., Elmezain, M., Farsi, M. & Atlam, E. Hidden Markov models for pattern recognition. In Markov Model - Theory and Applications (IntechOpen, 2023). 10.5772/intechopen.1001364.
12. Atlam, E. S. et al. EASDM: Explainable autism spectrum disorder model based on deep learning. J. Disabil. Res. 3(1), 20240003. 10.57197/jdr-2024-0003 (2024).
13. Pigou, L., Dieleman, S., Kindermans, P. J. & Schrauwen, B. Sign language recognition using convolutional neural networks. In Computer Vision - ECCV 2014 Workshops, Lecture Notes in Computer Science Vol. 8925 (eds Agapito, L., Bronstein, M. & Rother, C.) (Springer, Cham, 2015). 10.1007/978-3-319-16178-5_40.
14. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 10.1145/3065386 (2012).
15. Feichtenhofer, C., Fan, H., Malik, J. & He, K. A large-scale study on unsupervised spatiotemporal representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10.1109/CVPR46437.2021.01234 (2021).
16. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 10.48550/arXiv.1706.03762 (2017).
17. Rokaya, M. & Alsufiani, K. D. Ensemble learning based on relative accuracy approach and diversity teams. Bull. Electr. Eng. Inform. 13(3), 1897–1912. 10.11591/eei.v13i3.4100 (2024).
18. Nguyen, D. T., Tran, B. Q., Tran, A. D., Than, D. T. & Tran, D. Q. Object detection approach for stock chart patterns recognition in financial markets. In Proceedings of the 2023 12th International Conference on Software and Computer Applications, 150–158. 10.1145/3587828.3587851 (2023).
19. Rokaya, M. & Al Azwari, S. Social media data analysis trends and methods. IJCSNS Int. J. Comput. Sci. Netw. Secur. 22(9), 1–10 (2022).
20. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016). 10.4258/hir.2016.22.4.351.
21. Noor, T. H. et al. Real-time Arabic sign language recognition using a hybrid deep learning model. Sensors 24(11), 3683. 10.3390/s24113683 (2024).
22. Tran, D., Bourdev, L., Fergus, R., Torresani, L. & Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). 10.1109/ICCV.2015.510 (2015).
23. Almasri, A., Tayfour, O. & El-Haj, M. Arabic sign language recognition using recurrent neural networks. J. King Saud Univ. Comput. Inf. Sci. 10.1016/j.jksuci.2018.09.014 (2018).
24. He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10.1109/CVPR42600.2020.01234 (2020).
25. Starner, T. & Pentland, A. Real-time American sign language recognition from video using hidden Markov models. In Proceedings of the International Symposium on Computer Vision. 10.1109/ISCV.1995.476963 (1995).
26. Miech, A. et al. End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10.1109/CVPR42600.2020.01234 (2020).
27. Wang, T. & Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In Proceedings of the 37th International Conference on Machine Learning (ICML) (2020). https://proceedings.mlr.press/v119/wang20k/wang20k.pdf
28. Sammut, C. & Webb, G. I. Encyclopedia of Machine Learning (Springer Science & Business Media, 2010). 10.1007/978-0-387-30164-8.
29. Mosleh, M. et al. Differences in misinformation sharing can lead to politically asymmetric sanctions. Nature 634, 609–616. 10.1038/s41586-024-07942-8 (2024).
30. Bidollahkhani, M., Sharma, A. K. & Kunkel, J. M. HOSHMAND: Accelerated AI-driven scheduler emulating conventional task distribution techniques for cloud workloads. In 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), Osaka, Japan, 2313–2320. 10.1109/COMPSAC61105.2024.00372 (2024).
31. Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning (ICML) (2020). 10.48550/arXiv.2002.05709.
32. Rokaya, M. et al. Clustering hashtags based on new hybrid method and power links. IAENG Int. J. Comput. Sci. 48(3), 716–730 (2021).
33. Turabieh, H., Al Azwari, S. A. & Rokaya, M. Enhanced Harris Hawks optimization as a feature selection for the prediction of student performance. Computing 103, 1417–1438. 10.1007/s00607-020-00894-7 (2021).
34. Masud, M. et al. A novel light-weight convolutional neural network model to predict Alzheimer’s disease applying weighted loss function. J. Disabil. Res. 3(4), 20240042. 10.11591/jdr.v3i4.20240042 (2024).
