Abstract
This study aims to explore methods for classifying and describing volleyball training videos using deep learning techniques. By developing an innovative model that integrates Bi-directional Long Short-Term Memory (BiLSTM) and attention mechanisms, referred to as BiLSTM-Multimodal Attention Fusion Temporal Classification (BiLSTM-MAFTC), the study enhances the accuracy and efficiency of volleyball video content analysis. Initially, the model encodes features from various modalities into feature vectors, capturing different types of information such as positional and modal data. The BiLSTM network is then used to model multi-modal temporal information, while spatial and channel attention mechanisms are incorporated to form a dual-attention module. This module establishes correlations between different modality features, extracting valuable information from each modality and uncovering complementary information across modalities. Extensive experiments validate the method's effectiveness and state-of-the-art performance. Compared to conventional recurrent neural network algorithms, the model achieves recognition accuracies exceeding 95 % under Top-1 and Top-5 metrics for action recognition, with a recognition speed of 0.04 s per video. The study demonstrates that the model can effectively process and analyze multimodal temporal information, including athlete movements, positional relationships on the court, and ball trajectories. Consequently, precise classification and description of volleyball training videos are achieved. This advancement significantly enhances the efficiency of coaches and athletes in volleyball training and provides valuable insights for broader sports video analysis research.
Keywords: Recurrent neural networks, Volleyball exercise videos, BiLSTM, Attention mechanisms, Multi-modal
1. Introduction
Volleyball is a highly demanding team sport, requiring both technical and tactical proficiency. Efficient training and guidance are crucial for success. Athletes and coaches need a thorough understanding and mastery of volleyball techniques, tactics, and technical details. In modern training, videos are extensively used to enhance skills, analyze game performance, and develop tactical strategies, becoming essential tools for teaching and analysis [[1], [2], [3]]. Traditional methods of manual description, which rely on coach demonstrations and verbal instructions, can be labor-intensive, inconsistent, and subjective. Additionally, athletes’ visual perception and reaction abilities vary due to individual differences [4]. Thus, there is a need for an innovative approach to make volleyball training more flexible and efficient, offering personalized guidance to athletes and ultimately enhancing their skill levels.
The primary task of video classification and description is to extract key information from videos using effective feature extraction methods, analyze this information, and ultimately classify the video content while generating the most relevant text descriptions [5,6]. Volleyball training videos contain rich temporal information and multimodal features such as motion actions, postures, and ball trajectories. Capturing and utilizing this information for accurate classification and description is a challenging problem. Deep learning technology, a branch of artificial intelligence (AI), can be used to extract key information from volleyball video data, including athlete movements, ball trajectories, and positional changes on the court [7]. Before the widespread adoption of deep learning-based methods for volleyball training video classification and description, prior research primarily utilized recurrent neural networks (RNNs) and their variants, such as Long Short-Term Memory (LSTM) and Gated Recurrent Units, for related video content tasks. For instance, Zhang et al. (2020) proposed a long-term RNN that combines convolutional neural networks (CNNs) with LSTM for video classification and description [8]. This method effectively captures temporal dependencies between video frames, enhancing the accuracy of video content classification. However, the model encounters issues such as vanishing gradients and high computational costs when dealing with long-duration sequences. Wang et al. (2021) focused on action recognition using LSTM, validated on a large-scale video dataset [9]. Their approach demonstrated the superiority of LSTM in capturing temporal dynamics in videos, but also highlighted performance bottlenecks when processing long-duration video sequences. Almahadin et al. (2024), in their study on video prediction and generation, utilized LSTM networks for sequential prediction of video frames to explore the spatiotemporal characteristics of video content [10]. Although this approach performed well in short-term video prediction, its effectiveness diminished when handling long-term dependency information. RNNs are particularly useful for establishing time-series models to capture the dynamic changes and decision-making processes of athletes [11,12]. Modeling and classifying the temporal information in volleyball videos enable the automatic identification and classification of different actions or scenes, providing a foundation for generating accurate and useful descriptions of volleyball training videos. This approach not only enhances the analysis and understanding of training videos but also offers valuable guidance for improving volleyball training methodologies. These studies establish the theoretical foundation but also highlight limitations, including the high computational costs associated with processing complex temporal information and performance degradation when dealing with long-term dependencies in extended video sequences. To address these challenges, this study introduces a novel deep learning model: the BiLSTM-Multimodal Attention Fusion Temporal Classification (BiLSTM-MAFTC) model. This model integrates Bi-directional Long Short-Term Memory (BiLSTM) with an attention mechanism to alleviate the drawbacks of current approaches and achieve improved performance.
The motivation behind this study is to explore the application of RNN technology in deep learning to enhance the automation and accuracy of volleyball training video descriptions. The innovation of this study lies in incorporating the Bi-directional Long Short-Term Memory (BiLSTM) algorithm for modeling multi-modal temporal information and using spatial and channel attention mechanisms to construct a dual-attention module. This approach results in a volleyball motion video classification model based on BiLSTM and attention mechanisms, aimed at improving the generation of volleyball training video descriptions. Consequently, this study is expected to provide athletes with a more efficient and personalized training experience, thereby enhancing their skill levels. Additionally, it serves as an experimental reference for interdisciplinary research in sports science and AI.
The structure of this study is organized as follows: Section 1 introduces the background of volleyball training exercises and the motivation for developing video description methods. Section 2 reviews related work in the field. Section 3 describes in detail the constructed volleyball motion video classification model, the methodology used, and the experimental evaluations conducted. Section 4 analyzes the experimental results. Section 5 summarizes and discusses the research findings and their significance, as well as the limitations and future prospects of this study.
2. Literature review
This section reviews the current state of research on video description methods, covering both traditional manual techniques and their associated challenges, as well as recent advancements in AI for automated video description generation. Additionally, it summarizes the current applications of RNNs in video classification and recognition, highlighting their effectiveness in handling sequential data and capturing temporal information. These reviews provide readers with a theoretical foundation and an overview of relevant work, thereby setting the stage for the detailed introduction and methodology presented in subsequent sections.
2.1. Current state of research on video description methods
In the past, video descriptions primarily relied on manually written text descriptions or voiceovers. While these methods provided some information, they were plagued by issues of subjectivity, inconsistency, and high workload. Recently, researchers have explored the use of AI to automate video description generation, resulting in numerous studies on the topic. Akbari et al. (2022) [13] introduced a court video data recognition model based on AI technology, detailing the creation and analysis methods of this database, which contributes to research and practice in digital forensics. Zhong et al. (2022) [14] proposed a deep semantic and attention network for unsupervised video summarization, aiding in the automated generation of video summaries and the extraction of key information. Ur Rehman et al. (2023) [15] examined the application of deep learning in video classification, emphasizing its effectiveness in handling video data. Zaoad et al. (2023) [16] introduced a hybrid deep learning approach based on attention mechanisms for Bengali video caption generation, demonstrating potential in automatic caption generation. Wang et al. (2023) [17] discussed the use of blockchain technology for risk prediction and credibility detection in public opinion in online videos, showing that this technology can effectively identify and categorize public opinion, enhancing the credibility and accuracy of video descriptions. These studies illustrate the advancements in AI and deep learning for automating video description, highlighting the potential for improved efficiency and accuracy in handling video data.
2.2. Current applications of RNNs in video classification and recognition
RNNs are a powerful neural network architecture well-suited for processing sequential data. Their ability to capture time-series information in videos makes them highly promising for video description tasks. Several researchers have explored their applications in this context. Li et al. (2021) [18] proposed a deep learning approach for medical image recognition and analysis. Their results showed that deep learning techniques could fuse multiple medical images into a single image, enhancing the quality and informativeness of the images. Ghosh (2022) [19] utilized Faster R–CNN and RNNs for gait recognition, including scenarios involving carried objects. This method efficiently identifies gait features and has potential applications in security. Li & Cao (2023) [20] introduced a human action recognition information processing system based on Long Short-Term Memory (LSTM) with RNNs. This system is used for identifying human actions and has broad applications in areas such as sports analysis and health monitoring. Wang (2023) [21] presented an intelligent music performance assistance system based on LSTM with RNNs. Operating in edge computing environments, this system contributes to improving the quality of music performances. Sörensen et al. (2023) [22] used sequential deep neural networks to uncover the mechanisms of dynamic object recognition in humans. Li (2023) utilized Long Short-Term Memory (LSTM) networks for posture recognition. Experimental results show that the proposed method significantly enhances the recognition accuracy of five types of ball sports postures [23]. The findings indicate that models with attention mechanisms effectively tackle the problem of visual information loss. Li et al. (2023) investigated volleyball movement pose recognition using keypoint sequences, proposing a data preprocessing method that enhanced features like angles and relative distances. Their approach, coupled with an LSTM-Attention model, significantly improved action recognition accuracy. For instance, transforming keypoint coordinates through coordinate system conversion alone boosted recognition accuracy by at least 0.01 for five volleyball movement poses. Moreover, their study underscored the scientifically designed LSTM-Attention model's competitive performance in action recognition [24]. In a related study, Gao et al. (2021) tackled issues of sentence errors and visual information loss in volleyball video descriptions caused by insufficient linguistic learning information. They introduced a multi-head model combining LSTM networks with attention mechanisms to enhance the intelligent generation of volleyball video descriptions. The integration of attention mechanisms allowed the model to prioritize crucial areas of the video during sentence generation. Comparative experiments across different models confirmed the effectiveness of integrating attention mechanisms in mitigating visual information loss [25]. Despite the advancements made by these studies, there remains ample opportunity for further exploration and improvement in areas such as multimodal information fusion, real-time processing capabilities, and model generalization. Building upon these foundations, this research explores the synergistic application of BiLSTM and attention mechanisms to enhance the accuracy and efficiency of volleyball training video content analysis. 
This study aims to contribute precise and efficient technical tools for volleyball training and competitions, while also introducing novel perspectives and methodologies for advancing research in sports video analysis.
2.3. Summary
Through the analysis of research conducted by the aforementioned scholars, it is evident that video description recognition is becoming increasingly intelligent and finding applications across various domains. Studies such as those by Akbari et al. (2022), Ur Rehman et al. (2023), and Zaoad et al. (2023) demonstrate its relevance in areas like courtroom evidence, video summarization, classification, and automatic caption generation. Moreover, the utilization of RNNs and LSTM in video recognition and analysis extends across diverse fields, including medical image processing, gait recognition, human action recognition, music performance assistance, and dynamic object recognition. This broad spectrum of applications underscores the versatility and significance of these technologies in advancing video-related tasks. While existing studies offer theoretical foundations and technical insights for related fields, there remains a research gap in the domain of volleyball training video classification description. This study addresses this gap by introducing an innovative model that integrates BiLSTM and attention mechanisms to enhance the accuracy and efficiency of volleyball training video content description. The model incorporates the encoding of different modal features, modeling of temporal information, and dual attention mechanisms to effectively handle multimodal temporal information, thus improving the quality of classification descriptions. Unlike previous research, the novelty of this study lies in the design of a model that combines BiLSTM and attention mechanisms to comprehensively process multimodal temporal information, exploiting the complementarity of information between different modalities and offering new insights into volleyball training video analysis. By bridging this research gap, this study provides coaches and athletes with more efficient tools for technical analysis, holding significant practical and research significance.
3. Methodology for volleyball motion video description based on RNNs
The methodology section presents a detailed description of the construction process for the multimodal volleyball motion video temporal information modeling and classification model proposed in this study, utilizing the BiLSTM fusion attention mechanism. Initially, the section outlines the collection and preprocessing steps for volleyball motion video data, covering tasks such as data acquisition, frame sampling, resizing, and data augmentation. Subsequently, it delves into the optimization challenges of RNNs, with a specific focus on the structure and functionality of BiLSTM and its ability to capture temporal features in videos. Furthermore, the section analyzes the application of attention mechanisms in extracting temporal features from videos, including the combined use of channel attention and spatial attention mechanisms. Finally, a comprehensive model framework is proposed in this section, aiming to effectively integrate multimodal features and temporal information to enhance the classification accuracy of volleyball motion videos.
3.1. Data collection and preprocessing for volleyball motion videos
Volleyball motion videos encompass visual media content that documents volleyball matches or training sessions. Such matches typically take place indoors or on beach courts. These videos showcase the layout of the playing area, featuring details like court dimensions, markings, and fencing. Moreover, they capture various volleyball actions, such as serving, passing, spiking, blocking, and digging. These actions constitute fundamental elements of volleyball matches, crucial for scoring points and dictating game control [[26], [27], [28]]. In the realm of volleyball, these motion videos find extensive use in training sessions, technical analysis, match reviews, and educational purposes.
In this study, volleyball motion video data is sourced from the SportsMOT dataset (https://deeperaction.github.io/datasets/sportsmot.html) [29]. This dataset tracks all players on soccer, volleyball, and basketball courts, addressing the lack of benchmarks for multi-object tracking in sports scenes. It comprises approximately 150,000 frames and 1.6 million annotated bounding boxes, featuring fast-moving and dynamically changing targets with similar but distinguishable appearances. The dataset is not only large in scale but also of high quality, densely annotating the position bounding boxes and unique IDs of all on-field players. It maintains the original IDs when a player re-enters the frame after exiting [30]. The dataset includes 240 videos, each with a resolution of 720P and a frame rate of 25FPS. For this study, relevant volleyball motion videos from the dataset were selected, and preprocessing was applied to the obtained video data, as illustrated in Fig. 1.
Fig. 1.
Preprocessing workflow for volleyball motion video images.
In Fig. 1, video data preprocessing stands as a crucial step conducted before deep learning or computer vision tasks are performed on volleyball motion videos. Initially, experiments collect video data and convert it into a sequence of frame images. This process involves decoding the video file frame by frame and saving each frame in a common image format, such as JPEG or PNG. Following this, experiments proceed to sample and resize the frames. Frame sampling can be achieved through regular interval sampling or based on specific conditions (such as frame rate) to ensure that the selected frames adequately represent the video content. Resizing frames utilizes image processing techniques like bilinear interpolation or nearest-neighbor interpolation to ensure uniform dimensions across all frames, facilitating subsequent processing. Depending on task requirements, experiments may need to select frames from specific time periods for analysis, enabling models to learn actions or scenes within that timeframe. The selection of these periods can vary based on video content characteristics, such as game phases or critical actions. Moreover, generating additional training samples by applying techniques like mirror flipping, rotation, cropping, scaling, and color transformation to the original frames enhances data augmentation, thus improving the model's generalization capability. In supervised learning, experiments provide labels or annotations for the frames, further enhancing the model's generalization ability. Subsequently, experiments extract features from each frame, after which the dataset is segmented, and data normalization and standardization are performed. This involves scaling pixel values to the range of 0–1 or applying zero-mean unit variance normalization. The data is then organized and managed for model training and evaluation purposes. These steps ensure data quality, consistency, and suitability, effectively supporting subsequent deep learning tasks.
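To make the preprocessing pipeline concrete, the following is a minimal sketch of the frame extraction, sampling, resizing, and normalization steps described above, assuming OpenCV and NumPy; the sampling interval, target resolution, and the mirror-flip augmentation are illustrative choices rather than the exact settings used in the experiments.

```python
import cv2
import numpy as np

def extract_frames(video_path, sample_every=5, size=(224, 224)):
    """Decode a video, sample every N-th frame, resize, and scale pixel values to [0, 1]."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            frame = cv2.resize(frame, size, interpolation=cv2.INTER_LINEAR)  # bilinear resizing
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(frame.astype(np.float32) / 255.0)                  # normalize to 0-1
        idx += 1
    cap.release()
    return np.stack(frames)            # (T, H, W, 3) frame sequence

def mirror_flip(frame):
    """Simple horizontal-flip augmentation; rotation, cropping, and scaling can be added analogously."""
    return frame[:, ::-1, :]
```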
3.2. Optimization of RNNs
RNN is a type of neural network designed for processing sequential data. It achieves this by incorporating loops between adjacent time steps, enabling it to retain memory of previous information. This structure allows RNN to handle sequences of variable lengths and use previous outputs as inputs for subsequent steps [31]. RNN finds wide application in fields like natural language processing, speech recognition, and other domains requiring time series data processing. RNNs maintain previous information through self-looped neurons and are capable of capturing temporal dependencies in data. However, they are susceptible to the problems of vanishing or exploding gradients, especially when dealing with long sequences [32]. BiLSTM represents a variant of RNNs. It simultaneously considers both forward and backward information of time series data through two opposing-direction LSTM layers [33]. This structure proves particularly adept at addressing sequence prediction problems, given its capability to capture long-term dependencies. In this study, BiLSTM is employed to model the temporal information present in volleyball training videos. BiLSTM incorporates two independent LSTM layers—one processes the sequence from the beginning, while the other processes it from the end—after which their outputs are combined. This enables BiLSTM to simultaneously consider both the past and future contextual information of the video, thus better capturing long-range dependencies in video sequences [[34], [35], [36]]. The process of applying the BiLSTM algorithm to extract temporal features from videos is illustrated in Fig. 2.
Fig. 2.
Schematic representation of the process of applying the BiLSTM algorithm to video temporal feature extraction.
In Fig. 2, BiLSTM comprehensively captures temporal features and contextual information in video sequences by simultaneously considering both forward and backward dependencies. During execution, BiLSTM initially processes the video sequence through a forward LSTM layer, followed by a backward LSTM layer. Subsequently, it combines the hidden states of these forward and backward LSTM layers to generate the ultimate temporal feature representation. This approach facilitates the utilization of both past and future information within the video sequence, thereby enhancing the representational capacity of the features [37,38]. In this context, $x_{t-1}$, $x_t$, and $x_{t+1}$ respectively denote the feature input values at times $t-1$, $t$, and $t+1$, while $y_{t-1}$, $y_t$, and $y_{t+1}$ represent the output values at times $t-1$, $t$, and $t+1$.
When applying the BiLSTM algorithm to extract temporal features from video, the input feature $x_t$ and the previous hidden state $h_{t-1}$ are fed into the BiLSTM model. The volleyball video data is processed through a sigmoid function to obtain the gate coefficients $i_t$ and $f_t$, and the input is passed through an activation function to generate the temporary unit variable $\tilde{c}_t$, as shown in Equations (1), (2), (3):

| $i_t = \sigma\left(w_i \cdot [h_{t-1}, x_t] + b_i\right)$ | (1) |
| $f_t = \sigma\left(w_f \cdot [h_{t-1}, x_t] + b_f\right)$ | (2) |
| $\tilde{c}_t = \tanh\left(w_c \cdot [h_{t-1}, x_t] + b_c\right)$ | (3) |

In Equations (1), (2), (3), $w$ represents the weight matrix, and $b$ represents the bias vector. Therefore, the calculation of the hidden state $h_t$ at time $t$ is given by Equations (4), (5):

| $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)$ | (4) |
| $\overrightarrow{h_t} = \overrightarrow{\mathrm{LSTM}}\left(x_t, \overrightarrow{h}_{t-1}\right), \qquad \overleftarrow{h_t} = \overleftarrow{\mathrm{LSTM}}\left(x_t, \overleftarrow{h}_{t+1}\right), \qquad h_t = \overrightarrow{h_t} \oplus \overleftarrow{h_t}$ | (5) |

Here, $w$ and $b$ represent the relevant weights and biases for the gate units and memory cells, and the output gate $o_t$ is computed analogously to Equations (1) and (2). $c_t$ and $h_t$ denote the state of the memory cell and the value of the LSTM's hidden state at time $t$. The computation of $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ is given by Equation (5), where $\rightarrow$ and $\leftarrow$ respectively represent the forward and backward directions.

Finally, the Softmax function is employed to classify the volleyball video data information, as shown in Equation (6):

| $y_k = \mathrm{Softmax}\left(W_f \cdot \mathrm{Flatten}(h) + b_f\right)_k = \dfrac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}$ | (6) |

In Equation (6), $W_f$ refers to the weight matrix of the fully connected layer, $\mathrm{Flatten}(\cdot)$ indicates the flattening of the matrix into a vector, $b_f$ represents the bias terms of the fully connected layer, $k$ denotes the index of the classification category, $K$ signifies the total number of categories, $z = W_f \cdot \mathrm{Flatten}(h) + b_f$, and $y$ represents the resulting output vector.
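The temporal modeling summarized in Equations (1), (2), (3), (4), (5), (6) corresponds closely to a standard bidirectional LSTM followed by a softmax classifier. The following is a minimal PyTorch sketch under that assumption; the feature dimension, hidden size, and number of classes are illustrative placeholders.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM over per-frame features, followed by a fully connected classifier."""
    def __init__(self, feat_dim=512, hidden=128, num_classes=10):
        super().__init__()
        # forward and backward passes (Eq. 5) are handled internally by bidirectional=True
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)   # concatenated forward/backward states

    def forward(self, x):                # x: (batch, T, feat_dim)
        h, _ = self.bilstm(x)            # h: (batch, T, 2 * hidden)
        return self.fc(h[:, -1, :])      # logits; apply softmax for Eq. (6)-style probabilities

# usage: probs = torch.softmax(BiLSTMClassifier()(torch.randn(4, 16, 512)), dim=-1)
```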
3.3. Analysis of applying attention mechanisms to video temporal feature extraction
In videos, numerous factors, including objects, lighting, and locations, may undergo continuous changes. Moreover, convolution operations demonstrate local connectivity, potentially causing disparities between the visual features extracted from a specific video region and manually annotated features, thereby impacting the accuracy of textual descriptions [39]. Hence, to establish correlations between different modal features, this study introduces a soft attention structure incorporating both channel attention and spatial attention mechanisms. The attention mechanism serves as a resource allocation strategy enabling models to dynamically focus on the most crucial parts of the information under processing [40]. In the realm of video description, this mechanism empowers the model to concentrate on key frames within the video, thereby enhancing the accuracy of the description. In the BiLSTM-Multimodal Attention Fusion Temporal Classification (BiLSTM-MAFTC) model outlined here, the attention mechanism is leveraged to augment the model's capability to recognize pivotal actions within the video. The channel attention mechanism is employed to establish dependencies between features, while the spatial attention mechanism fosters connectivity among different positions in video frames, thereby aggregating global contextual semantic information [41,42]. Post-training, the model identifies the optimal allocation parameters to optimize the encoding features of the video. The channel attention mechanism is depicted in Fig. 3.
Fig. 3.
Schematic representation of the channel attention mechanism applied to volleyball video images.
In Fig. 3, matrix Z represents the input feature maps generated by the intermediate layers of the CNN, with dimensions H × W × C. Matrix S is the attention score matrix computed from matrix Z, indicating the importance or weight of each channel in the attention calculation, with dimensions H × W × C. Assuming the input visual features are denoted as $U = [u_1, u_2, \dots, u_C] \in \mathbb{R}^{H \times W \times C}$, these feature maps are compressed and dimensionality-reduced using global average pooling operations. Simultaneously, within the spatial domain, the feature maps with an input size of $H \times W$ are squeezed along the channel domain, allowing layers closer to the input to obtain a larger receptive field. The compressed channel features can be represented as a set of vector weights $z \in \mathbb{R}^{C}$. The $C$-th vector weight $z_C$ is expressed as in Equation (7):

| $z_C = \dfrac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_C(i, j)$ | (7) |

In Equation (7), the features need to pass through a gate unit consisting of an activation function to acquire the global spatial receptive field formed by the compressed channels, as shown in Equation (8):

| $s = \sigma\left(W_2\, \delta\left(W_1 z\right)\right)$ | (8) |

In Equation (8), $\delta$ represents the ReLU activation function, $\sigma$ is the sigmoid gate activation, and $W_1$ and $W_2$ are the weights of the gate unit. The input features are weighted channel-wise by the resulting weight distribution $s$ and further multiplied by a self-learned scaling parameter to obtain the channel attention output, as demonstrated in Equation (9):

| $\tilde{u}_C = \gamma\left(s_C \otimes u_C\right) + u_C$ | (9) |

In Equation (9), $\otimes$ denotes the element-wise multiplication between the input features and the gate unit output, applied channel-wise. $\gamma$ represents the self-learned scaling parameter, which starts from 0 and is continuously learned and optimized to adjust the output proportions.
After undergoing compression and activation modules, the input visual features yield output visual features with channel attention. This process enhances the feature's discriminative ability by utilizing the weight distribution and interdependence among channels [43].
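As a concrete illustration, the squeeze, gate, and rescaling steps of Equations (7), (8), (9) can be sketched as a PyTorch module as follows; the reduction ratio and the residual connection paired with the zero-initialized scaling parameter are assumptions consistent with the description above rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention with a zero-initialized scale (Eqs. 7-9)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                  # Eq. (7): global average pooling
        self.gate = nn.Sequential(
            nn.Linear(channels, max(channels // reduction, 1)),
            nn.ReLU(inplace=True),                           # delta in Eq. (8)
            nn.Linear(max(channels // reduction, 1), channels),
            nn.Sigmoid(),                                    # sigmoid gate in Eq. (8)
        )
        self.gamma = nn.Parameter(torch.zeros(1))            # self-learned scale, starts at 0

    def forward(self, u):                                    # u: (B, C, H, W)
        b, c, _, _ = u.shape
        z = self.pool(u).view(b, c)                          # squeezed channel descriptor
        s = self.gate(z).view(b, c, 1, 1)                    # channel weight distribution
        return self.gamma * (s * u) + u                      # Eq. (9): rescale plus residual
```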
Videos contain a wealth of content information, leading to numerous target features that may not necessarily be the focus of video descriptions. Spatial attention mechanisms empower the model to concentrate on the most crucial regions, aggregating contextual information into local features and enhancing feature representation [44]. The structure of the spatial attention mechanism is illustrated in Fig. 4.
Fig. 4.
Schematic representation of the application of spatial attention mechanism to volleyball video images.
As depicted in Fig. 4, assuming the input visual features are $U \in \mathbb{R}^{C \times H \times W}$, these features are transformed into three new feature vectors, $J$, $K$, and $L$, after passing through a convolutional layer. Among them, the attention weights are computed from $J$ and $K$, while $L$ carries the feature content to be reweighted. Feature vectors $J$ and $K$ are reshaped and transposed into $\mathbb{R}^{C \times N}$, where $N = H \times W$ denotes the number of pixel positions in the image. Additionally, the two transformed feature vectors undergo matrix multiplication. The resulting matrix is passed through a softmax function to obtain the spatial attention weight distribution $S$, as shown in Equation (10):

| $s_{ji} = \dfrac{\exp\left(J_i \cdot K_j\right)}{\sum_{i=1}^{N} \exp\left(J_i \cdot K_j\right)}$ | (10) |

The weight distribution $S$ is multiplied with the feature vector $L$, and the result of this multiplication is reshaped back to $\mathbb{R}^{C \times H \times W}$. Finally, it is further multiplied by a self-learned scaling factor $\lambda$ and summed element-wise with the input feature $U$ to obtain the weighted feature with spatial attention, as shown in Equation (11):

| $E_j = \lambda \sum_{i=1}^{N} \left(s_{ji} L_i\right) + U_j$ | (11) |

In Equation (11), $U_i$ and $U_j$ refer to the $i$-th and $j$-th regions in the input feature maps, while $s_{ji}$ represents the correlation between region $i$ and region $j$. Spatial attention, as measured by $s_{ji}$, quantifies the correlation between the features corresponding to two regions in the video. A higher numerical value indicates a stronger degree of correlation.
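Similarly, a minimal sketch of the spatial attention of Equations (10), (11) is given below, assuming a position-attention-style layout in which 1 × 1 convolutions produce $J$, $K$, and $L$; the channel reduction factor is illustrative.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Position-style spatial attention: pairwise region affinities reweight features (Eqs. 10-11)."""
    def __init__(self, channels):
        super().__init__()
        inner = max(channels // 8, 1)
        self.proj_j = nn.Conv2d(channels, inner, kernel_size=1)
        self.proj_k = nn.Conv2d(channels, inner, kernel_size=1)
        self.proj_l = nn.Conv2d(channels, channels, kernel_size=1)
        self.lam = nn.Parameter(torch.zeros(1))                      # self-learned scaling factor

    def forward(self, u):                                            # u: (B, C, H, W)
        b, c, h, w = u.shape
        n = h * w
        j = self.proj_j(u).view(b, -1, n)                            # (B, C', N)
        k = self.proj_k(u).view(b, -1, n)                            # (B, C', N)
        l = self.proj_l(u).view(b, c, n)                             # (B, C, N)
        s = torch.softmax(torch.bmm(j.transpose(1, 2), k), dim=1)    # Eq. (10): (B, N, N) affinities
        out = torch.bmm(l, s).view(b, c, h, w)                       # reweight features by region affinity
        return self.lam * out + u                                    # Eq. (11): scale plus residual
```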
3.4. Analysis and construction of a multimodal volleyball motion video temporal information modeling and classification model based on BiLSTM fusion attention mechanism
Multimodal learning involves simultaneously processing and integrating data from various sources like vision and audition [45]. In sports video analysis, this approach aids models in thoroughly understanding athletes' movements and scenes. The model proposed in this study enhances the accuracy of classifying and describing volleyball training videos by integrating both visual and audio information. Video classification description entails automatically analyzing video content using computer vision and natural language processing techniques to generate descriptive text [46]. The aim is to automatically classify and describe different actions and scenes in volleyball training videos to assist coaches and athletes in technical analysis. This study introduces a video classification method for multi-modal temporal information modeling and fusion based on BiLSTM with an integrated attention mechanism, as illustrated in Fig. 5. The input of the BiLSTM-MAFTC model includes multimodal features of volleyball sports videos, such as athlete movements, spatial relationships on the court, and ball trajectories.

The processing pipeline of the model consists of several key steps. First, the input volleyball videos are preprocessed, including the extraction of video features and the corresponding audio features. Next, the preprocessed video and audio features undergo feature encoding, where position encoding ensures that the model perceives the relative positions of features in the sequence, and modality encoding is used to differentiate features from different modalities. BiLSTM layers are then utilized to model the temporal information of the encoded video and audio features separately, with BiLSTM capturing past and future context information in video sequences through independent forward and backward LSTM layers to better understand video content. The model also introduces spatial attention and channel attention mechanisms, constructing a dual attention module, where the spatial attention mechanism enables the model to focus on the most important regions in the video, and the channel attention mechanism helps the model recognize dependencies between different feature channels. Through the dual attention module, video and audio features are effectively fused with temporal information to extract valuable information inherent in each modality and discover complementary information brought by different modalities. Finally, the fused features are classified through a fully connected layer, and the softmax activation function is used to output the final classification results.

The design of the entire model aims to enhance the automation and accuracy of describing volleyball training video content through deep learning techniques, thereby providing efficient technical analysis tools for volleyball training and matches. This way, the model can accurately process and analyze multimodal temporal information and generate text descriptions most relevant to the video content, which is crucial for improving the efficiency of coaches and athletes.
Fig. 5.
Framework of the multi-modal volleyball video temporal information modeling and classification model with BiLSTM fusion attention mechanism.
In this model, to effectively capture multi-modal temporal information, the approach in this study leverages the SlowFast network model to extract video features and audio features separately. When given a video segment $V$, this method concatenates the Slow and Fast channel features of the corresponding audio $A$ using average pooling, resulting in the feature vector $f_A$. Finally, these feature vectors are mapped to a space of dimension $d_a$ using a fully connected neural network with a softmax activation function, yielding audio features represented as $A \in \mathbb{R}^{d_a}$, where $d_a$ denotes the predefined feature dimension. Consequently, video features $V \in \mathbb{R}^{d_v}$ for video $V$ and the corresponding audio features are obtained.
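A hedged sketch of this feature head is shown below; the Slow/Fast channel dimensions and the target dimension $d_a$ are illustrative placeholders, and the pretrained SlowFast backbone that produces the pathway features is assumed to be available separately.

```python
import torch
import torch.nn as nn

class ModalityFeatureHead(nn.Module):
    """Pool and concatenate Slow/Fast pathway features, then project to a fixed dimension d_a."""
    def __init__(self, slow_dim=2048, fast_dim=256, d_a=512):   # dimensions are illustrative
        super().__init__()
        self.fc = nn.Linear(slow_dim + fast_dim, d_a)

    def forward(self, slow_feat, fast_feat):     # (B, slow_dim, T, H, W), (B, fast_dim, T', H, W)
        slow = slow_feat.mean(dim=(2, 3, 4))     # global average pooling over time and space
        fast = fast_feat.mean(dim=(2, 3, 4))
        fused = torch.cat([slow, fast], dim=1)   # concatenate the two pathways
        return torch.softmax(self.fc(fused), dim=-1)   # softmax-activated feature vector
```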
The approach to modeling temporal information with BiLSTM and encoding positional and modal information for multimodal features can be outlined as follows: Firstly, BiLSTM is utilized to capture temporal information within the video sequence. This allows for the effective capture of temporal correlation and contextual information between video frames. Specifically, BiLSTM traverses the entire video sequence, progressively processing each frame and passing the hidden state of each time step to the next, thereby modeling the temporal features of the video sequence. Secondly, handling multimodal features involves encoding positional and modal information. Positional encoding typically employs methods such as Sin-Cosine positional encoding or absolute positional encoding to ensure that the model can perceive the relative positions of features in the sequence. Modal encoding, on the other hand, aims to differentiate features from different modalities. This can be achieved through simple unique encoding or more complex modal relationship learning methods. These encoded pieces of information are then concatenated or weighted fused with the original features to enhance the model's ability to perceive positional and modal information. Finally, the multimodal temporal information modeling layer adopts a BiLSTM model to simultaneously capture the correlation between time and modality. During this process, the multimodal features at each time step undergo forward and backward propagation through the BiLSTM layer, comprehensively capturing the temporal information within the video sequence and the correlation between modalities.
In the multi-modal temporal modeling layer, the input video features and audio features are initially projected to a lower dimension to filter out unnecessary noise. To effectively utilize the temporal information of actions in the input sequence, this study introduces position encoding $p$ and employs the BiLSTM algorithm to learn absolute position encoding for context. This position encoding is shared between the video and audio modalities, allowing the modeling of their respective temporal information while maintaining continuity of information between past and future. Additionally, to distinguish between video and audio features, modality encodings $e_v$ and $e_a$ are introduced as learnable vectors, as shown in Equations (12), (13):

| $\tilde{V} = \mathrm{FC}_v(V) + p + e_v$ | (12) |
| $\tilde{A} = \mathrm{FC}_a(A) + p + e_a$ | (13) |

In Equations (12), (13), $\mathrm{FC}_v$ and $\mathrm{FC}_a$ represent the fully connected layers that project the video features and audio features from dimensions $d_v$ and $d_a$ to a lower dimension $d$, respectively. $p \in \mathbb{R}^{n \times d}$ stands for the position encoding, where $n$ represents the number of consecutive actions in this video segment. $d_v$ and $d_a$ respectively denote the input dimensions of the video and audio modality sequences.
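A minimal sketch of Equations (12), (13) as a PyTorch module is given below; the use of a learned absolute position table and zero-initialized modality vectors is an assumption consistent with the description above.

```python
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    """Project video/audio features to a shared dimension d and add a shared positional
    encoding plus modality-specific encodings (Eqs. 12-13)."""
    def __init__(self, d_v, d_a, d, max_len=64):
        super().__init__()
        self.fc_v = nn.Linear(d_v, d)
        self.fc_a = nn.Linear(d_a, d)
        self.pos = nn.Parameter(torch.zeros(max_len, d))   # learned absolute position encoding p
        self.e_v = nn.Parameter(torch.zeros(1, 1, d))      # learnable modality encoding for video
        self.e_a = nn.Parameter(torch.zeros(1, 1, d))      # learnable modality encoding for audio

    def forward(self, v, a):                               # v: (B, n, d_v), a: (B, n, d_a)
        n = v.size(1)
        p = self.pos[:n].unsqueeze(0)                      # position encoding shared by both modalities
        v_enc = self.fc_v(v) + p + self.e_v                # Eq. (12)
        a_enc = self.fc_a(a) + p + self.e_a                # Eq. (13)
        return v_enc, a_enc
```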
In constructing the Dual Attention Module, inspiration was drawn from the concept of the Convolutional Block Attention Module to form a structure by concatenating channel and spatial attention [47]. Firstly, this concatenated structure processes visual features to capture interdependencies between channels, thereby enhancing feature distinctiveness. Secondly, this module can discern the contribution of different modalities to feature data, thereby generating higher-quality feature representations. This aids in obtaining encoded features for the entire video.
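Conceptually, the dual attention module applies the channel attention stage followed by the spatial attention stage to the fused features, as in the hedged sketch below; the two sub-modules are passed in as arguments and are assumed to follow the structures of Figs. 3 and 4.

```python
import torch.nn as nn

class DualAttention(nn.Module):
    """Dual attention: channel attention followed by spatial attention, concatenated CBAM-style."""
    def __init__(self, channel_att: nn.Module, spatial_att: nn.Module):
        super().__init__()
        self.channel_att = channel_att   # e.g. the ChannelAttention sketch of Section 3.3
        self.spatial_att = spatial_att   # e.g. the SpatialAttention sketch of Section 3.3

    def forward(self, x):                # x: (B, C, H, W) fused multimodal feature map
        x = self.channel_att(x)          # first re-weight feature channels
        return self.spatial_att(x)       # then aggregate global spatial context
```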
In this study's model, the specific algorithm for the BiLSTM fusion attention mechanism is shown in pseudo-code in Algorithm 1. Algorithm 1 necessitates the implementation of the BiLSTM_forward and BiLSTM_backward functions to execute the forward and backward propagation logic based on the BiLSTM's working principle. The Fusion function integrates video and audio features, while the Attention function computes channel attention and spatial attention. The Classifier function makes the final classification decision. Additionally, the Loss function and optimization steps must be implemented according to the chosen loss function and optimization algorithm.
Algorithm 1.
Volleyball Sports Video Classification Based on BiLSTM Fusion Attention Mechanism
| Input: Video dataset D, including multimodal video features V and audio features A; Model parameters: learning rate α, number of iterations T, and other hyperparameters |
| Output: Classification result |
| 1: Initialize model parameters, including the weights of the BiLSTM network and the attention mechanism |
| 2: For each sample $(V_i, A_i, y_i)$ in the dataset, perform the following steps: |
| 3: Execute data preprocessing, including frame sampling, size adjustment, data augmentation, etc., to obtain preprocessed video features $V_i'$ and audio features $A_i'$ |
| 4: Apply position encoding and modality encoding to $V_i'$ and $A_i'$ to obtain encoded features $\tilde{V}_i$ and $\tilde{A}_i$ |
| 5: Use forward and backward BiLSTM networks to process the encoded video and audio features separately: |
| 6: $H_V = [\mathrm{BiLSTM\_forward}(\tilde{V}_i);\ \mathrm{BiLSTM\_backward}(\tilde{V}_i)]$, $\quad H_A = [\mathrm{BiLSTM\_forward}(\tilde{A}_i);\ \mathrm{BiLSTM\_backward}(\tilde{A}_i)]$ |
| 7: Fuse the outputs of the BiLSTM networks to obtain fusion features F: |
| 8: $F = \mathrm{Fusion}(H_V, H_A)$ |
| 9: Apply channel attention and spatial attention mechanisms to optimize the feature representation: |
| 10: $F' = \mathrm{Attention}(F)$ |
| 11: Input the optimized features into the classifier to obtain classification results: |
| 12: $\hat{y} = \mathrm{Classifier}(F')$ |
| 13: Calculate the loss function and update model parameters using the gradient descent algorithm: |
| 14: $L = \mathrm{Loss}(\hat{y}, y_i)$, $\quad \theta \leftarrow \theta - \alpha \nabla_\theta L$ |
| 15: Optimize model weights |
| 16: End loop |
| 17: Return the final classification result |
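Algorithm 1 maps naturally onto a standard PyTorch training loop. The sketch below is a hedged illustration under the assumption that a single `model` object bundles the encoding, BiLSTM, dual-attention, and classifier stages of Fig. 5, and that a data loader yields preprocessed video features, audio features, and labels.

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer, device="cuda"):
    """One pass of the Algorithm 1 loop: encode, model temporally, attend, classify, and update."""
    criterion = nn.CrossEntropyLoss()          # step 13: loss function
    model.train()
    for video_feats, audio_feats, labels in loader:          # steps 2-3: preprocessed samples
        video_feats, audio_feats, labels = (t.to(device) for t in (video_feats, audio_feats, labels))
        logits = model(video_feats, audio_feats)              # steps 4-12: encode, BiLSTM, attention, classify
        loss = criterion(logits, labels)                      # step 13: compute the loss
        optimizer.zero_grad()
        loss.backward()                                       # step 14: gradients for gradient descent
        optimizer.step()                                      # step 15: optimize model weights
```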
4. Results and discussion
The Results and Discussion section presents the findings of the experimental evaluation and compares the performance of various algorithms. Utilizing metrics like Top-1 and Top-5, this section offers a comprehensive analysis of the recognition accuracy of the BiLSTM-MAFTC model in volleyball motion video classification tasks, contrasting it with algorithms proposed by other researchers. Moreover, it investigates the efficiency of different algorithms in terms of video classification recognition speed, assessing the model's effectiveness. These results highlight the efficacy and superiority of the BiLSTM-MAFTC model in describing volleyball motion videos, offering more precise assistance for volleyball training and competitions.
4.1. Experimental evaluation
To assess the performance of the proposed multi-modal volleyball sports video temporal information modeling and classification model based on the BiLSTM fusion attention mechanism, the SportsMOT dataset serves as the primary data source in this study. The SportsMOT dataset is a comprehensive collection tailored for multi-object tracking in sports scenarios. Encompassing approximately 150,000 image frames and 1.6 million annotated bounding boxes, it spans various sports disciplines including volleyball, soccer, and basketball. During the preprocessing phase, relevant video segments pertaining to volleyball are selected from the dataset and subjected to necessary preprocessing operations to ensure data quality and consistency.
Preprocessing involves several key steps such as converting videos into sequences of frame images, sampling, resizing frames, and potentially applying data augmentation techniques like mirror flipping, rotation, and scaling. These measures are undertaken to bolster the model's ability to generalize across different scenes and actions while ensuring uniform frame sizes for subsequent deep learning tasks. The specific model network parameters are shown in Table 1.
Table 1.
Model network parameters.
| Parameter Name | Value |
|---|---|
| LSTM Hidden Units | 128 |
| Channel Attention Layers | 2 |
| Spatial Attention Layers | 2 |
| Feature Reduction Dimension | 512 |
| Position Encoding Method | Sin-Cosine |
| Modality Encoding Method | Learnable |
| Train/Test Ratio | 7:3 |
| Optimizer | SGD |
| Batch Size | 32 |
| Learning Rate | 0.01 |
| Learning Rate Decay | 0.1 (at 50th and 75th epochs) |
| Regularization | Weight decay 0.0005, Dropout 0.1 |
| Training Epochs | 100 |
| Loss Function | Cross-Entropy |
In Table 1, the experimental setup employs the following configurations for training and evaluating the model:
-
1)
Hardware Environment: The experiments are conducted on a computer featuring an Intel(R) Xeon(R) E5-2680 CPU and an NVIDIA GeForce RTX 2080 Ti GPU. This hardware setup ensures ample computational power to handle large-scale datasets effectively.
-
2)
Software Environment: The operating system utilized is Ubuntu 16.04.1, with PyTorch 1.8.0 serving as the chosen deep learning framework. Python is employed as the programming language, PyCharm serves as the integrated development environment, and CUDA version 11.3 is utilized for GPU acceleration.
-
3)
Network Parameters: Carefully designed network parameters are tailored to accommodate the characteristics of volleyball motion videos. Specific parameter settings include the optimizer (SGD), batch size (32), learning rate (0.01), and training epochs (100). Additionally, the learning rate is decayed by a factor of 0.1 at the 50th and 75th epochs to facilitate convergence.
-
4)
Regularization: To mitigate the risk of overfitting, weight decay (0.0005) and dropout (0.1) are employed as regularization techniques during the training process.
These techniques help enhance the model's generalization capability and prevent it from memorizing the training data excessively.
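Under these settings, the optimizer and learning-rate schedule from Table 1 can be assembled as in the following sketch; the momentum value is an assumption not listed in Table 1, and `model` stands for the BiLSTM-MAFTC network with dropout 0.1 applied inside its layers.

```python
import torch

def build_training_setup(model):
    """Optimizer, schedule, and loss following Table 1: SGD with lr 0.01 and weight decay 0.0005,
    cross-entropy loss, and a 0.1 learning-rate decay at the 50th and 75th epochs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 75], gamma=0.1)
    criterion = torch.nn.CrossEntropyLoss()
    return optimizer, scheduler, criterion   # call scheduler.step() once per epoch for 100 epochs
```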
To assess the efficacy of the model, Top-1 and Top-5 metrics are employed for classification evaluation. Top-1 accuracy gauges the proportion of model predictions in a classification task that precisely match the true labels. This metric holds paramount importance in overall classification accuracy assessment, directly reflecting the model's correctness in individual sample predictions. For the BiLSTM-MAFTC model, which focuses on multimodal volleyball motion video temporal information modeling and classification using the BiLSTM fusion attention mechanism, Top-1 accuracy aids in evaluating the model's precision in identifying sports actions depicted in the videos. Conversely, Top-5 accuracy offers more leniency, determining whether the true label falls within the model's top five predictions. This metric allows for a degree of tolerance, especially in volleyball motion video classification tasks that may involve subtle or ambiguous actions. Top-5 accuracy provides a comprehensive assessment of the model's performance while accommodating certain errors, thus bolstering evaluation robustness and objectivity. Together, Top-1 and Top-5 accuracy offer a comprehensive evaluation of the model's performance in volleyball motion video classification tasks, effectively validating the reliability and practicality of the multimodal temporal information modeling and classification model based on the BiLSTM fusion attention mechanism. In the comparative analysis, the performance of the BiLSTM-MAFTC model is benchmarked against RNNs [48], BiLSTM [49], and models proposed by researchers in the relevant field, including Zaoad et al. (2023) and Li & Cao (2023). All accuracy metrics are obtained through 10-fold cross-validation under optimal parameters. Additionally, each algorithm's performance in terms of video classification recognition speed is compared.
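For reference, a minimal sketch of how the Top-1 and Top-5 metrics can be computed from model outputs is shown below; it assumes class scores (logits) and integer ground-truth labels.

```python
import torch

def topk_accuracy(logits, labels, ks=(1, 5)):
    """Top-k accuracy: the true label must appear among the k highest-scoring classes."""
    _, pred = logits.topk(max(ks), dim=1)            # (batch, max_k) predicted class indices
    correct = pred.eq(labels.view(-1, 1))            # compare each prediction against ground truth
    return {k: correct[:, :k].any(dim=1).float().mean().item() for k in ks}

# usage: topk_accuracy(torch.randn(8, 20), torch.randint(0, 20, (8,)))
```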
During model training and fine-tuning, specific technical methods are employed to ensure the effectiveness of the proposed approach. In comparative experiments with other methods, identical data preprocessing and training steps are strictly adhered to, ensuring fairness and result comparability. Each comparative method undergoes fine-tuning to optimize its performance under identical conditions. Thus, the fine-tuning process does not introduce bias but rather validates the superiority of the proposed method through fair comparison.
4.2. Analysis of classification accuracy results of different algorithms
An analysis of the convergence of various algorithms on the training set is shown in Fig. 6.
Fig. 6.
Loss function results for various algorithms.
In Fig. 6, an analysis of the loss values for different algorithms across multiple iterations demonstrates their convergence on the training set. As the number of iterations increases, the loss values for all algorithms generally decrease and gradually stabilize. This trend indicates ongoing optimization of the model with increasing iterations, leading to progressively stable performance. Of note, the model proposed in this study consistently achieves the lowest loss values in most iterations, hovering around 0.11. This indicates significant effectiveness in minimizing the disparity between predicted and actual values, highlighting strong convergence performance. Consequently, the model algorithm developed in this study exhibits robust stability and reliability.
The classification accuracy of each algorithm is analyzed using Top-1 and Top-5 metrics, as shown in Fig. 7.
Fig. 7.
Top-1 and Top-5 Accuracy Curve Graphs for Different Model Algorithms (a. Top-1; b. Top-5).
In Fig. 7, the recognition results of algorithm models developed in this study, as well as those of RNNs, BiLSTM, and models proposed by researchers Zaoad et al. (2023) and Li & Cao (2023), are analyzed in terms of both Top-1 and Top-5 accuracy. The BiLSTM-MAFTC model outperforms other models significantly, achieving Top-1 and Top-5 recognition accuracies surpassing 95.03 %. This is at least 3.61 % higher than the accuracy attained by algorithms utilized by other researchers (such as RNNs, BiLSTM, etc.). The hierarchy of recognition accuracy for the algorithms, from highest to lowest, is as follows: the model algorithm developed in this study > the model algorithm proposed by Li & Cao (2023) > the model algorithm proposed by Zaoad et al. (2023) > BiLSTM > RNNs. Consequently, the multi-modal volleyball motion video temporal information modeling and classification model based on the BiLSTM fusion attention mechanism constructed in this study exhibits superior recognition accuracy compared to algorithms by other researchers. This enables the model to effectively capture motion information in volleyball motion videos, thereby offering more precise support for the intelligent development of volleyball.
4.3. Analysis of recognition efficiency results of different algorithms
A comparison analysis of the time required for different algorithms to change with the number of iterations is shown in Fig. 8.
Fig. 8.
Recognition time results of different algorithms for volleyball videos.
In Fig. 8, the recognition time for various algorithms across multiple iterations is analyzed. As the number of iterations increases, the required time initially decreases and then stabilizes, indicating a convergence trend. Compared to other models, the BiLSTM-MAFTC model stabilizes at around 0.04 s per iteration. This demonstrates a significant efficiency advantage over models proposed by other researchers. The recognition times, from least to greatest, are as follows: the BiLSTM-MAFTC model < the model by Li & Cao (2023) < the model by Zaoad et al. (2023) < BiLSTM < RNNs. Notably, the RNN algorithm has the slowest recognition speed for volleyball motion videos, taking approximately 2.02 s per iteration. Therefore, the multi-modal volleyball motion video temporal information modeling and classification model based on the BiLSTM fusion attention mechanism achieves better performance in video description within a shorter time frame.
4.4. Analysis of ablation experiments
The BiLSTM-MAFTC model's performance is further analyzed through ablation experiments. These experiments compare the classification accuracy of the BiLSTM-MAFTC model with two variations: a BiLSTM model without the attention mechanism and an LSTM model without the attention mechanism. The results of these comparisons are illustrated in Fig. 9.
Fig. 9.
Ablation experiment results.
Fig. 9 illustrates a comparison of the classification accuracy of the BiLSTM-MAFTC model against a BiLSTM model without an attention mechanism and an LSTM model without an attention mechanism, as analyzed through ablation experiments. The experimental data reveal that, starting from the first epoch, the BiLSTM-MAFTC model achieves higher Top-1 accuracy than the traditional BiLSTM and LSTM models. This performance gap widens progressively as training continues. By the 100th epoch, the BiLSTM-MAFTC model attains an accuracy of 95.822 %, significantly surpassing the BiLSTM (88.982 %) and LSTM (80.781 %) models. These findings validate the effectiveness of integrating an attention mechanism and multimodal feature fusion in volleyball video classification. The BiLSTM-MAFTC model's superior performance in capturing key actions and scenes is evident. Even though the baseline models show performance improvement in later training stages, the BiLSTM-MAFTC model consistently maintains a stable lead. These results underscore the critical role of deep learning techniques and multimodal information fusion strategies in enhancing the accuracy of volleyball training video classification.
To evaluate the performance of the Transformer model in volleyball video classification tasks, a direct comparison is made with the proposed BiLSTM with attention mechanism model. Using the same volleyball video dataset, the Transformer model is adjusted to handle video data characteristics. A 3D CNN is utilized to extract spatiotemporal features from video frames, which are then fed into the Transformer model. For audio data, a pre-trained acoustic model is employed to extract audio features, which are also input into the Transformer model. Additionally, a multimodal fusion strategy is designed to effectively combine video and audio information. Under identical training and testing protocols, the performance of both the BiLSTM model and the Transformer model is evaluated using metrics such as classification accuracy, recall, F1 score, training time, and inference time. The performance comparison is presented in Table 2:
Table 2.
Performance comparison between BiLSTM and transformer models.
| Metric/Model | BiLSTM with Attention Mechanism | Transformer Model |
|---|---|---|
| Accuracy (%) | 95.03 | 94.22 |
| Recall (%) | 94.89 | 93.67 |
| F1 Score | 94.96 | 93.94 |
| Training Time (s/epoch) | 120 | 150 |
| Inference Time (s/video) | 0.04 | 0.06 |
| GPU Utilization (%) | 70 | 75 |
As shown in Table 2, the BiLSTM with attention mechanism model slightly outperforms the Transformer model in classification accuracy, recall, and F1 score. This may be due to BiLSTM's strength in capturing long-term dependencies in video sequences. However, the Transformer model has slightly longer training and inference times, likely due to the extensive self-attention computations. Both models show high GPU utilization, with the BiLSTM model being slightly more computationally efficient. The performance comparison across different datasets is shown in Table 3:
Table 3.
Performance comparison across different datasets.
| Dataset/Metric | BiLSTM Accuracy | Transformer Accuracy | Improvement |
|---|---|---|---|
| SportsMOT | 95.03 % | 94.22 % | 0.81 % |
| Volleyball-X | 93.67 % | 92.15 % | 1.52 % |
| DAVIS Challenge | 91.45 % | 90.23 % | 1.22 % |
| Multi-Sport | 92.89 % | 91.56 % | 1.33 % |
Table 3 indicates that on the SportsMOT dataset, the BiLSTM model achieves slightly higher accuracy than the Transformer model, with an improvement of 0.81 %. On the Volleyball-X dataset, the BiLSTM model outperforms the Transformer model by 1.52 % in accuracy. For the DAVIS Challenge dataset, which focuses on sports video segmentation tasks, the BiLSTM model demonstrates a 1.22 % higher accuracy. On the Multi-Sport dataset, the BiLSTM model's accuracy is 1.33 % higher than that of the Transformer model, indicating an advantage in analyzing various types of sports videos. Although the Transformer model slightly underperforms the BiLSTM model in specific tasks, it still shows good generalization capabilities across different datasets. With further structural adjustments and parameter optimizations, the performance of the Transformer model could be improved. Overall, considering both performance and computational efficiency, the BiLSTM with attention mechanism model currently performs better for this task. However, the Transformer model holds potential, especially in multimodal information fusion.
4.5. Discussion
The use of a BiLSTM model fused with an attention mechanism for modeling temporal information in multimodal volleyball motion videos has yielded remarkable results. Achieving Top-1 and Top-5 accuracies exceeding 95.03 % marks a significant improvement over traditional RNNs and standalone BiLSTM algorithms. This highlights the effectiveness of incorporating attention mechanisms and leveraging bidirectional information flow in capturing complex temporal patterns within volleyball motion videos. These advancements have notable implications for various applications, including sports analytics, motion recognition, and human-computer interaction. The data not only validate the model's effectiveness in action recognition but also reflect the powerful potential of deep learning technology in sports video analysis. In volleyball training videos, accurately identifying and describing athletes' movements can lead to precise classification descriptions, which are crucial for enhancing training efficiency and competitive performance. The BiLSTM-MAFTC model can capture subtle movement variations and complex tactical patterns, providing coaches and athletes with more accurate training feedback and analysis tools.
Furthermore, the model's high recognition speed enables real-time training and match analysis. In volleyball training, coaches and athletes often need to quickly identify key moments in game segments. The BiLSTM-MAFTC model provides accurate classification results in a very short time, significantly improving training and analysis efficiency. This rapid responsiveness is crucial for real-time tactical adjustments and immediate feedback, helping athletes make quick decisions during matches and enhancing their performance.
The practical value of the BiLSTM-based model fused with attention mechanism in volleyball training video analysis is evident in several aspects. For instance, in technical action analysis, coaches can leverage the model to precisely identify and describe key technical actions such as spikes, serves, and blocks, thus offering targeted improvement suggestions. Detailed video descriptions generated by the model enable athletes to independently review their performance after training, deepen their understanding of technical aspects, and hasten skill enhancement. Moreover, in tactical strategy formulation, the model can discern and analyze opponents' tactical layouts and movement patterns, providing coaches with data support to devise response strategies and enhance match preparation and execution efficiency. These applications not only enhance the specificity and efficiency of training but also offer robust support for decision-making in matches, significantly bolstering the team's competitiveness. With ongoing technological advancements and further optimization of the model, its potential applications in volleyball training and matches will broaden, contributing significantly to the scientific training and competitive level improvement of volleyball sports.
The methodology and experimental design of this study offer valuable insights and serve as a reference for future work. The integration of BiLSTM with an attention mechanism effectively enhances the model's capacity to handle multimodal information. Through the incorporation of spatial and channel attention mechanisms, the model not only identifies key actions in videos but also captures the correlations between different modalities, resulting in richer and more precise video descriptions. This multimodal fusion approach introduces a novel perspective for the comprehensive analysis of volleyball training videos, enabling coaches and athletes to grasp the intricacies of games and training from multiple dimensions. Furthermore, the extensive validation conducted on the SportsMOT dataset underscores the model's generalization ability and practical utility. Future research could explore more intricate model structures building on this foundation or extend the proposed method to other domains of sports video analysis, facilitating the convergence of sports science and AI technology and contributing to the advancement of volleyball and other sports.
5. Conclusion
This study implements volleyball sports video classification and description using deep learning techniques, introducing a multimodal video analysis model based on BiLSTM and attention mechanisms. By leveraging advanced data processing and a carefully designed neural network structure, the model achieves a recognition accuracy exceeding 95 % on volleyball video classification tasks, enabling accurate recognition of both actions and scenes. It thereby provides precise and efficient technical support for volleyball training and competition, while offering valuable insights and guidance for researchers and practitioners in sports video analysis. Additionally, the methodological approach and experimental findings contribute new ideas and potential applications for intelligent video analysis across other sports disciplines.
Despite these significant advancements, several limitations and challenges are evident. Firstly, the model's performance is contingent upon the quality and diversity of the training data. Larger or more complex volleyball video datasets may necessitate further optimization and adjustments to enhance model robustness. Secondly, the training and fine-tuning processes of the model require substantial computational resources and time, potentially restricting its deployment in resource-constrained environments. Moreover, the model may exhibit sensitivity to noise and outliers in input data, underscoring the importance of rigorous data preprocessing and cleaning during practical applications.
Data availability statements
All data generated or analyzed during this study are included in this published article [Database DOI: 10.6084/m9.figshare.25560870].
CRediT authorship contribution statement
Zhao Ruiye: Writing – review & editing, Writing – original draft, Supervision, Software, Methodology, Investigation, Formal analysis, Data curation, Conceptualization.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Footnotes
✩ This document is the result of the research project funded by xxx.
Supplementary data to this article can be found online at https://doi.org/10.1016/j.heliyon.2024.e34735.
Appendix A. Supplementary data
The following are the Supplementary data to this article:
References
- 1. DeCouto B.S., Smeeton N.J., Williams A.M. Skilled performers show right parietal lateralization during anticipation of volleyball attacks. Brain Sci. 2023;13(8):1204. doi: 10.3390/brainsci13081204.
- 2. Siregar S., Kasih I., Pardilla H. The effectiveness of E-learning-based volleyball service video media on students affected by covid-19 at faculty of sports science, universitas negeri medan. Phys. Educ. Theory Methodol. 2022;22(1):7–13.
- 3. Limroongreungrat W., Mawhinney C., Kongthongsung S., et al. Landing error scoring system: data from youth volleyball players. Data Brief. 2022;41. doi: 10.1016/j.dib.2022.107916.
- 4. Brown D.M.Y., Cairney J., Azimi S., et al. Towards the development of a quality youth sport experience measure: understanding participant and stakeholder perspectives. PLoS One. 2023;18(7). doi: 10.1371/journal.pone.0287387.
- 5. Prudviraj J., Reddy M.I., Vishnu C., et al. AAP-MIT: attentive atrous pyramid network and memory incorporated transformer for multisentence video description. IEEE Trans. Image Process. 2022;31:5559–5569. doi: 10.1109/TIP.2022.3195643.
- 6. Tao H., Lu M., Hu Z., et al. Attention-aggregated attribute-aware network with redundancy reduction convolution for video-based industrial smoke emission recognition. IEEE Trans. Ind. Inf. 2022;18(11):7653–7664.
- 7. Sharma V., Gupta M., Pandey A.K., et al. A review of deep learning-based human activity recognition on benchmark video datasets. Appl. Artif. Intell. 2022;36(1).
- 8. Zhang X., Yu Y., Gao Y., Chen X., Li W. Research on singing voice detection based on a long-term recurrent convolutional network with vocal separation and temporal smoothing. Electronics. 2020;9(9):1458.
- 9. Wang W., Huang Z., Tian R. Deep learning networks-based action videos classification and search. Int. J. Pattern Recogn. Artif. Intell. 2021;35(7).
- 10. Almahadin G., Subburaj M., Hiari M., Sathasivam Singaram S., Kolla B.P., Dadheech P., Vibhute A.D., Sengan S. Enhancing video anomaly detection using spatio-temporal autoencoders and convolutional LSTM networks. SN Computer Science. 2024;5(1):2024.
- 11. Srilakshmi G., Joe I.R.P. Sports video retrieval and classification using focus u-net based squeeze excitation and residual mapping deep learning model. Eng. Appl. Artif. Intell. 2023;126.
- 12. Dasgupta M., Bandyopadhyay O., Chatterji S. Detection of helmetless motorcycle riders by video captioning using deep recurrent neural network. Multimed. Tool. Appl. 2023;82(4):5857–5877.
- 13. Akbari Y., Al-Maadeed S., Al-Máadeed N., et al. A new forensic video database for source smartphone identification: description and analysis. IEEE Access. 2022;10:20080–20091.
- 14. Zhong S.H., Lin J., Lu J., et al. Deep semantic and attentive network for unsupervised video summarization. ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) 2022;18(2):1–21.
- 15. Ur Rehman A., Belhaouari S.B., Kabir M.A., et al. On the use of deep learning for video classification. Appl. Sci. 2023;13(3):2007.
- 16. Zaoad M.S., Mannan M.M.R., Mandol A.B., et al. An attention-based hybrid deep learning approach for Bengali video captioning. J. King Saud Univ. - Comput. Inf. Sci. 2023;35(1):257–269.
- 17. Wang Z., Zhang S., Zhao Y., et al. Risk prediction and credibility detection of network public opinion using blockchain technology. Technol. Forecast. Soc. Change. 2023;187.
- 18. Li Y., Zhao J., Lv Z., et al. Medical image fusion method by deep learning. Int. J. Cogn. Comput. Eng. 2021;2:21–29.
- 19. Ghosh R. A Faster R-CNN and recurrent neural network based approach of gait recognition with and without carried objects. Expert Syst. Appl. 2022;205.
- 20. Li X., Cao X. Human motion recognition information processing system based on LSTM Recurrent Neural Network Algorithm. J. Ambient Intell. Humaniz. Comput. 2023;14(7):8509–8521.
- 21. Wang Y. Intelligent auxiliary system for music performance under edge computing and long short-term recurrent neural networks. PLoS One. 2023;18(5). doi: 10.1371/journal.pone.0285496.
- 22. Sörensen L.K.A., Bohté S.M., De Jong D., et al. Mechanisms of human dynamic object recognition revealed by sequential deep neural networks. PLoS Comput. Biol. 2023;19(6). doi: 10.1371/journal.pcbi.1011169.
- 23. Li X. Study on volleyball-movement pose recognition based on joint point sequence. Comput. Intell. Neurosci. 2023:2023. doi: 10.1155/2023/2198495.
- 24. Li X. Study on volleyball-movement pose recognition based on joint point sequence. Comput. Intell. Neurosci. 2023;2023(1). doi: 10.1155/2023/2198495.
- 25. Gao Y., Mo Y., Zhang H., Huang R., Chen Z. Research on volleyball video intelligent description technology combining the long-term and short-term memory network and attention mechanism. Comput. Intell. Neurosci. 2021;2021(1):2021. doi: 10.1155/2021/7088837.
- 26. İmamoğlu M. The effect of interactive videos on volleyball education. J. Learn. Teach. Digital Age. 2023;8(2):267–275.
- 27. Suryadi S., Sumiharsono R., Triwahyuni E. The effect of training methods and video media on the reaction speed of volleyball players [Pengaruh metode latihan dan media video terhadap kecepatan reaksi pemain bola voli]. J. Pendidik. Pembelajaran. 2023;4(2):1455–1462.
- 28. Yamada Y. Measuring block reaction time in volleyball players using a novel and accurate reaction time measurement system. Int. J. Sport Health Sci. 2023;21:31–35.
- 29. Maglo A., Orcesi A., Denize J., et al. Individual locating of soccer players from a single moving view. Sensors. 2023;23(18):7938. doi: 10.3390/s23187938.
- 30. Nasseri M.H., Babaee M., Moradi H., et al. Online relational tracking with camera motion suppression. J. Vis. Commun. Image Represent. 2023;90.
- 31. Fang W., Chen Y., Xue Q. Survey on research of RNN-based spatio-temporal sequence prediction algorithms. Journal on Big Data. 2021;3(3):97.
- 32. Phuc D.T., Tran Q.T., Van Tinh T., et al. Video captioning in Vietnamese using deep learning. Int. J. Electr. Comput. Eng. 2022;12(3):3092.
- 33. Samee N.A., Mahmoud N.F., Aldhahri E.A., Rafiq A., Muthanna M.S.A., Ahmad I. RNN and BiLSTM fusion for accurate automatic epileptic seizure diagnosis using EEG signals. Life. 2022;12(12):1946. doi: 10.3390/life12121946.
- 34. Sangwan N., Bhatnagar V. Video popularity prediction using stacked BiLSTM layers. Malays. J. Comput. Sci. 2021;34(3):242–254.
- 35. Paula L.P.O., Faruqui N., Mahmud I., et al. A novel front door security (FDS) algorithm using GoogleNet-BiLSTM hybridization. IEEE Access. 2023;11:19122–19134.
- 36. Manoharan T.A., Radhakrishnan M. Region-wise brain response classification of ASD children using EEG and BiLSTM RNN. Clin. EEG Neurosci. 2023;54(5):461–471. doi: 10.1177/15500594211054990.
- 37. Du G., Wang Z., Gao B., Mumtaz S., Abualnaja K.M., Du C. A convolution bidirectional long short-term memory neural network for driver emotion recognition. IEEE Trans. Intell. Transport. Syst. 2020;22(7):4570–4578.
- 38. Zhou J., Xu Z., Wang S. A novel hybrid learning paradigm with feature extraction for carbon price prediction based on Bi-directional long short-term memory network optimized by an improved sparrow search algorithm. Environ. Sci. Pollut. Control Ser. 2022;29(43):65585–65598. doi: 10.1007/s11356-022-20450-4.
- 39. Alnaggar M., Siam A.I., Handosa M., et al. Video-based real-time monitoring for heart rate and respiration rate. Expert Syst. Appl. 2023;225.
- 40. He N., Yang S., Li F., Trajanovski S., Zhu L., Wang Y., Fu X. Leveraging deep reinforcement learning with attention mechanism for virtual network function placement and routing. IEEE Trans. Parallel Distr. Syst. 2023;34(4):1186–1201.
- 41. Ding P., Qian H., Zhou Y., et al. Object detection method based on lightweight YOLOv4 and attention mechanism in security scenes. J. Real-Time Image Process. 2023;20(2):34.
- 42. Wang W., Li Q., Xie J., et al. Research on emotional semantic retrieval of attention mechanism oriented to audio-visual synesthesia. Neurocomputing. 2023;519:194–204.
- 43. Yang G., Yang Y., Lu Z., et al. Sta-tsn: spatial-temporal attention temporal segment network for action recognition in video. PLoS One. 2022;17(3). doi: 10.1371/journal.pone.0265115.
- 44. Feng L., Cheng C., Zhao M., et al. EEG-based emotion recognition using spatial-temporal graph convolutional LSTM with attention mechanism. IEEE J. Biomed. Health Inform. 2022;26(11):5406–5417. doi: 10.1109/JBHI.2022.3198688.
- 45. Zhang C., Yang Z., He X., Deng L. Multimodal intelligence: representation learning, information fusion, and applications. IEEE Journal of Selected Topics in Signal Processing. 2020;14(3):478–493.
- 46. El-Komy A., Shahin O.R., Abd El-Aziz R.M., Taloba A.I. Integration of computer vision and natural language processing in multimedia robotics application. Inf. Sci. 2022;7(6):1–12.
- 47. Jiang M., Yin S. Facial expression recognition based on convolutional block attention module and multi-feature fusion. Int. J. Comput. Vis. Robot. 2023;13(1):21–37.
- 48. Wahyudi D., Sibaroni Y. Deep learning for multi-aspect sentiment analysis of TikTok app using the RNN-LSTM method. Build. Inform. Technol. Sci. (BITS) 2022;4(1):169–177.
- 49. Li J., Huang F., Qin H., et al. Research on remaining useful life prediction of bearings based on MBCNN-BiLSTM. Appl. Sci. 2023;13(13):7706.