Abstract
Motion capture and analysis is a field of application in constant development, with improvements in both advanced models and new devices. Its integration with Virtual Reality (VR) can further expand the field of application. Here, a system is built to train body movements that require precision and accuracy, using VR to immerse the user in the relevant environment and the Xsens MVN suit to track and analyze the movements. A model processes motion data by constructing graphs where nodes represent key movement features. These are input to an Autoencoder (AE), AEforGraph, composed of a Graph Convolutional Network (GCN) for spatial dependencies and a Long Short-Term Memory (LSTM) network for temporal modeling. The encoded representations undergo Semi-Supervised Clustering to classify movements based on their similarity to predefined centroids representing correct execution. The decoder reconstructs the movement to highlight deviations and provide real-time corrective feedback. Live tests confirm the system’s effectiveness in recognizing and analyzing movement patterns, making it a valuable tool for training applications.
Subject terms: Engineering, Mathematics and computing
Introduction
The current epoch is characterized by the rapid growth of new technologies, and their application is becoming important in different fields. The development of more powerful and precise hardware, particularly in Computer Vision, allows for detailed and complete analyses, enabling the application of these types of technologies to be extended to a wide range of fields. In this context, the advancement of Virtual Reality (VR) and Motion Capture has opened new possibilities in the most widely differing sectors.
Among the various applications, one that stands out is training and improving performance in fields that require correct posture, movements performed correctly and in total safety, or activities demanding high precision. These devices can be of great support, as they can replace training sites by recreating them in a virtual environment or simulate situations without the need to test them directly in the field. The computing power and rendering speed, combined with the precision of the analysis, enable fast and effective training and development of subjects in areas where these characteristics can be exploited.
Here, a model is constructed to train specific body movements that can be performed with or without specific equipment for the situation. The case for which it is applied and tested consists of the movements of archaeologists during excavation, using various objects specific to this task, to have a system that could improve safety and precision in digging movements, allowing archaeologists to train on a virtual excavation site without the need to test movements on an actual site. As shown, the model can be easily extended to other applications.
To achieve this goal, the potential of two devices was exploited: the Meta Quest 3 for virtual reality and the Xsens MVN Awinda suit for motion capture. The Quest 3 simulates the specific environment, in this case, an excavation site, built entirely in Unity. On the other hand, the suit was used to capture movements and describe all their characteristics with high precision. At the base of this, a deep learning model was built to determine the correctness or incorrectness of the movement carried out during the simulation phase and provide the user with a response as quickly as possible. The model uses the node details provided by the Xsens to construct a graph, where each node carries the features necessary to describe the movement. The graphs are then used as input for an Autoencoder (AE), namely AEforGraph, structured with encoders and decoders divided into spatial and temporal parts, to obtain a latent-space representation of the constructed graphs that captures hidden relationships between the features of the sequences and reduces their size.
The AEforGraph is therefore composed of a Graph Convolutional Network (GCN) for the spatial part and a Long Short-Term Memory (LSTM) network for the temporal part. Once the latent representation has been obtained, a personalised Semi-Supervised Clustering (SSC) algorithm assigns each extracted sequence to the nearest cluster. The centroid of the cluster, built by the system itself, represents the correct movement. Finally, the decoder reconstructs the original sequence to provide the user with a complete description of the movement performed and any errors, based on the distance of each node of interest from the centroid itself.
The architectural innovation of AEforGraph lies in its design for spatio-temporal graph sequences focused on representation, where spatial structure, temporal dynamics, and node relevance are jointly encoded into a compact and structured latent space. Unlike conventional models that primarily aim at prediction or local pattern extraction, AEforGraph is explicitly designed to learn a global and compressive representation of graph evolution over time. A key architectural contribution is the incorporation of node importance directly into the encoding process, allowing the model to modulate spatial aggregation and temporal encoding based on the biomechanical relevance of each node.
Therefore, AEforGraph enforces a global and dynamic compression of graph sequences, producing structured latent embeddings that are particularly suited for movement characterization and error detection in spatio-temporal graph data.
State of the art and technologies
Virtual reality
VR immerses users in computer-generated 3D environments through devices like head-mounted displays (HMDs), enabling natural interaction and creating a strong sense of presence.
VR has revolutionized immersion in gaming, amplifying emotional and physiological responses compared to flat-screen gaming1,2.
Recent studies highlight the feasibility of cloud-based VR gaming via technologies such as Wi-Fi 6 and Unity Render Streaming, reducing the need for high-end local devices3. Cloud-native VR leverages cloud computing to enable scalable and high-performance experiences on several devices, though challenges like latency and bandwidth remain. Solutions such as edge computing, 5G, and AI-driven management are being developed to address these issues4. In construction health and safety, VR enhances hazard recognition and training through immersive simulations5, despite challenges with infrastructure and system interoperability6. For tourism, VR is used mainly in the pre-travel phase to allow virtual exploration7 of destinations, influencing traveler decisions and shifting marketing strategies, especially since 20158,9.
In art, VR and AR open new possibilities for creative expression and audience interaction. Artists use game engines (e.g., Unity3D (https://unity.com/), Unreal Engine (https://www.unrealengine.com/)) to design immersive works. At the same time, VR also aids exhibition planning with sparse 3D mapping and motion-based interaction technologies like Kinect10,11. In art education, VR deepens engagement and understanding of artworks, offering immersive learning experiences that require thoughtful instructional design and technological support12.
Finally, in medical education, VR provides realistic, risk-free anatomy and surgical training environments. It enhances learning outcomes, though further research is needed to confirm its long-term educational impact13,14.
VR in rehabilitation and training
VR has become a powerful tool in rehabilitation and training, offering immersive, interactive environments that improve traditional therapeutic and educational approaches15,16. In rehabilitation, VR engages patients through simulated environments tailored to their needs, enhancing motivation and adherence. X. Dai et al.17 proposed a system that reconstructs full-body poses using only standard VR devices, avoiding additional sensors. Their approach combines deep learning for real-time accuracy and natural language processing for personalized feedback. In neurological rehabilitation, especially for Parkinson’s disease, VR effectively improves gait speed, stride length, and balance, often outperforming conventional therapies. C. Lei et al.18 attributed these benefits to VR’s interactive, feedback-rich environment, which also integrates cognitive and sensory tasks for a comprehensive treatment.
In cognitive rehabilitation, VR supports stroke patients with post-stroke cognitive impairment (PSCI), enhancing memory, attention, executive function, and daily activities. X. Chen et al.19 showed that VR-based programs significantly improve cognitive performance and independence in activities of daily living (ADLs) compared to standard therapy.
Beyond healthcare, VR is widely used in training. In occupational safety, companies like UPS use VR simulations to train employees in hazard recognition and emergency response, reducing real-world risks20. In sports, VR improves athletes’ decision-making, reaction times, and technical skills by replicating real-world scenarios in controlled virtual settings. Studies by K. Witte et al.21 and F. Richlan et al.22 confirm its effectiveness in enhancing on-field performance.
Finally, in manufacturing, VR supports early-stage workplace optimization. V. Gorobets et al.23 found that Methods-Time Measurement (MTM) analyses performed in VR align closely with real-world data, despite slightly slower task execution, highlighting VR’s potential for improving production planning and reducing costs.
Motion capture
Motion capture (MoCap) is a technology that records human or object movements and translates them into digital models. It is widely used in animation, sports science, and human-computer interaction. Its roots trace back to early animation studies in the late 19th and early 20th centuries, evolving significantly in the late 20th century with digital systems that transitioned from marker-based to markerless methods24, enabling accurate 3D joint tracking25. A major recent development is depth-sensing markerless MoCap, using RGB-D cameras (e.g., Kinect, RealSense, LiDAR) and AI algorithms to track and predict motion without markers26. While promising for healthcare, gaming, and interactive media, these systems face challenges with occlusion and environmental conditions.
In clinical rehabilitation, markerless MoCap offers objective, non-invasive assessment, especially for neurodegenerative conditions such as dementia and Parkinson’s disease. Jeyasingh-Jacob et al.27 reviewed its effectiveness in gait analysis and mobility monitoring.
In ergonomics, MoCap is used to analyze movements in the workplace, helping prevent musculoskeletal disorders and improve safety. A review by S. Salisu et al.28 highlighted its widespread application in evaluating and optimizing occupational tasks.
In animation and entertainment, MoCap enables lifelike character animations. J. Wang et al.29 introduced a system using multiple cameras for markerless capture, while D. K. Jang et al.30 presented “MOVIN,” a real-time motion capture system using a single LiDAR sensor and a conditional variational autoencoder to track full-body movements cost-effectively.
MoCap using devices involves several approaches. Optical MoCap uses infrared cameras and reflective or active markers for high-precision tracking, widely applied in film (e.g., Avatar), video games (e.g., FIFA), and biomechanics. Markerless optical methods analyze body structure directly from video using deep learning tools like OpenPose31 and MediaPipe32. Inertial MoCap relies on wearable IMUs (accelerometers, gyroscopes, magnetometers) to measure body segment motion without external cameras, making it ideal for outdoor and VR applications. Recent work by R. Armani et al.33 introduced “Ultra Inertial Poser,” which combines sparse IMUs and ultra-wideband (UWB) radios for accurate, drift-reduced full-body tracking. This category includes the Xsens suit used in this work. Electromagnetic MoCap uses electromagnetic fields to determine the 3D position and orientation in real time. It is suited for medical research, robotics, prosthetics, and high-precision animation, though it requires careful setup to avoid interference.
The model
Figure 1 shows an overview of the model. The first step in constructing the model is to transform each sample into a graph or, rather, a sequence of graphs. Then, the Autoencoder (AEforGraph), through which the sequences are processed to extract each sample’s latent representation, is constructed34. Next, the centroids are generated, followed by the development of the semi-supervised clustering (SSC) model, through which the dataset elements are assigned to their respective centroids. The assigned samples are then reconstructed through the decoder and compared to the reconstruction of their cluster centroid to describe the characteristics of the movement performed and detect any errors. Everything is integrated into the Unity system so that the user receives a response directly in the headset as soon as the movement is performed.
Fig. 1.
Overall Model.
The data provided by MVN is available in .xlsx format. Each file has several sheets describing the captured movement under different metrics. It is essential to point out that the MVN data gives access to the individual components of a three-dimensional vector along each of its Cartesian axes (X, Y, Z) for every quantity represented as a vector. This makes it possible to exploit the full spatial properties of each vector and to develop the model on its complete 3D representation, ensuring that all directional variations are accurately accounted for in the computational process. The code of the entire pipeline can be accessed at the following link: https://doi.org/10.5281/zenodo.19005035.
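As an illustrative sketch only (not the project’s actual loading code), the following Python snippet shows how a per-movement .xlsx export could be read with pandas; the sheet names mirror the quantities listed in the Dataset section, but the exact naming in a given MVN export version, as well as the example file name, are assumptions.

```python
# Hedged sketch: load one MVN .xlsx export into per-quantity DataFrames.
# The sheet names below are assumptions based on the quantities listed later.
import pandas as pd

SHEETS = [
    "Segment Velocity",
    "Segment Acceleration",
    "Segment Angular Velocity",
    "Segment Angular Acceleration",
    "Joint Angles XZY",
    "Sensor Free Acceleration",
]

def load_mvn_export(path: str) -> dict[str, pd.DataFrame]:
    """Return one DataFrame (frames x columns) per kinematic sheet of interest."""
    all_sheets = pd.read_excel(path, sheet_name=None)  # read every sheet once
    return {name: df for name, df in all_sheets.items() if name in SHEETS}

# Example (hypothetical file name):
# velocities = load_mvn_export("movement_001.xlsx")["Segment Velocity"]
```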
Graphs: creation of the data representation
As seen in the previous section, Xsens provides a large amount of data for each specific node or angle. The most straightforward data structure to represent and exploit this information is a graph that illustrates the evolution of the features over time. The idea is therefore to assign features to each node, thus creating a feature vector on each node.
Once the quantities characterizing the graph had been selected, the importance of the features was analyzed and added to each node as a global feature, i.e., one that does not change over time. The variance of each feature per node was calculated by summing the variances of the x, y, and z components, and the Euclidean norm of these values was then computed to obtain a final variance measure per node. This assigns different importance to the nodes and allows the model to focus more on those most relevant to the required movement.
So, for each sheet considered, the variance of each feature and the consequent feature importance were derived by normalizing each node’s variance relative to the total variance across all nodes:
$$I_i = \frac{\sigma_i^2}{\sum_{j=1}^{N} \sigma_j^2} \tag{1}$$
where $\sigma_i^2$ is the per-node variance measure defined above and $N$ is the number of nodes. This gives the importance $I_i$ of each node, which is then assigned to the node as a feature corresponding to its importance weight.
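The variance-based importance described above can be sketched as follows for a single 3D kinematic signal; this is an illustrative reading of the procedure, not the authors’ code, and the array layout (nodes × frames × axes) is an assumption.

```python
# Sketch: node importance from per-axis variances, normalized over all nodes.
import numpy as np

def node_importance(signal: np.ndarray) -> np.ndarray:
    """signal: (num_nodes, num_frames, 3) x/y/z trajectory of one quantity.
    Returns (num_nodes,) importance weights that sum to 1."""
    var_xyz = signal.var(axis=1)                  # (num_nodes, 3) variance over time per axis
    per_node = np.linalg.norm(var_xyz, axis=1)    # combine x/y/z variances into one scalar per node
    return per_node / per_node.sum()              # normalize by the total variance across nodes

# Example: w = node_importance(np.random.randn(19, 240, 3)); w.sum() is 1.0
```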
Other values that are useful for the complete description and analysis of the movement are the angles provided. The idea is also to add the evolution of the considered angles per frame, but Xsens provides several perspectives of each angle. A statistical comparison between joint angles was performed to identify significant differences in movement patterns and choose the best orientation. So, a one-way Analysis of Variance (ANOVA) was used to test whether the mean joint angles are equal across different movement recordings. The null hypothesis $H_0$ states that all group means are equal, $H_0: \mu_1 = \mu_2 = \dots = \mu_k$.
The total variability is divided into between-group variance and within-group variance:

$$SS_{\text{total}} = SS_{\text{between}} + SS_{\text{within}} \tag{2}$$
The F-statistic was computed as:
$$F = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)} \tag{3}$$
where $k$ is the number of groups (recordings) and $N$ the total number of observations. A large F value and a p-value below 0.05 indicate that the joint means are not all equal. This test highlights which joint angles vary significantly between the samples that will later form the basis for composing the centroids.
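As a sketch of this selection step, the test can be run per joint angle with SciPy’s f_oneway; the grouping of recordings and the 0.05 threshold follow the text, while the dictionary layout and variable names are assumptions.

```python
# Sketch: one-way ANOVA per joint angle across movement recordings.
import numpy as np
from scipy.stats import f_oneway

def significant_angles(angle_groups: dict[str, list[np.ndarray]], alpha: float = 0.05):
    """angle_groups: {angle_name: [1D array of that angle for each recording]}.
    Returns the angles whose group means differ significantly (p < alpha)."""
    selected = {}
    for name, groups in angle_groups.items():
        f_stat, p_value = f_oneway(*groups)   # compares between-group vs. within-group variance
        if p_value < alpha:                   # reject H0: all group means are equal
            selected[name] = (f_stat, p_value)
    return selected
```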
So, to summarize, the constructed graph is composed of the nodes considered, and for each of these, the associated feature vector presents the 3D components of the quantities taken into consideration, the feature importance associated with each node as a global feature, and the most relevant angles among those provided. The progress of the frames describes their evolution over time.
Note: Fig. 2 only shows the conversion to .xlsx and the subsequent graph creation. As already mentioned in this section, for each movement, the graph sequence is constructed as follows:
Fig. 2.
Motion Capture process and construction of graph-based motion sequences.
For each joint, five 3D kinematic signals (segment velocity, acceleration, angular velocity, angular acceleration, and free acceleration) are extracted over time. When available, the corresponding joint angle is appended as an additional feature. Kinematic features are z-score normalized with a minimum standard deviation threshold to ensure numerical stability, while angle features are normalized separately using masked normalization to account for missing values. The resulting per-joint feature tensors are combined according to predefined skeletal connectivity to construct the final spatio-temporal graph representation.
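A minimal sketch of this construction is shown below, assuming one scalar joint angle per node with a validity mask; the stacking of the five 3D signals, the minimum standard-deviation floor, and the tensor layout are illustrative assumptions rather than the authors’ implementation.

```python
# Sketch: build the (N, T, F) node-feature tensor for one movement.
import numpy as np

MIN_STD = 1e-3  # floor on the standard deviation for numerical stability

def zscore(x: np.ndarray) -> np.ndarray:
    mu, sigma = x.mean(axis=1, keepdims=True), x.std(axis=1, keepdims=True)
    return (x - mu) / np.maximum(sigma, MIN_STD)

def build_graph_sequence(signals, angles, mask, importance):
    """signals: (N, T, 15) five 3D kinematic signals per joint, stacked on the last axis.
    angles: (N, T) joint angle (NaN where unavailable); mask: (N, T) validity flags.
    importance: (N,) global node weights.  Returns x: (N, T, 17)."""
    N, T, _ = signals.shape
    kin = zscore(signals)                                     # z-score normalize kinematic features
    filled = np.where(mask, angles, 0.0)
    valid = mask.sum(axis=1, keepdims=True).clip(min=1)
    mu = filled.sum(axis=1, keepdims=True) / valid            # masked mean of the angle
    sd = np.sqrt(((filled - mu) ** 2 * mask).sum(axis=1, keepdims=True) / valid).clip(min=MIN_STD)
    ang = np.where(mask, (angles - mu) / sd, 0.0)             # masked normalization of the angle
    imp = np.repeat(importance[:, None, None], T, axis=1)     # importance broadcast as a global feature
    return np.concatenate([kin, ang[..., None], imp], axis=2)

# The edge index follows the predefined skeletal connectivity, e.g. pairs of joint indices
# such as (Pelvis, L3), (L3, T8), ... arranged as a (2, E) array.
```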
Autoencoder for graphs (AEforGraph): the core model
Once the graph sequences have been constructed, they are processed using an autoencoder (AEforGraph) to extract their latent representation35. The latent representation provides a compressed and abstract representation of the input data, capturing its most essential features. In this case, the latent space could capture important movement features, such as trajectory patterns or joint correlations, making it easier to compare performed actions with predefined “Optimal movements”, i.e., the initial samples for the centroids.
The model processes input data $x$ with shape:

$$x \in \mathbb{R}^{N \times T \times F}$$

where $N$ is the number of nodes in the graph, $T$ is the number of time steps in the sequence, and $F$ is the number of node features, including an importance weight as the last feature. Each node in the graph is connected through an edge index matrix

$$E_{\text{idx}} \in \mathbb{R}^{2 \times E}$$

where $E$ is the number of edges.
The input tensor is first split into:

- Spatial features: $x_s \in \mathbb{R}^{N \times T \times (F-1)}$, excluding the importance.
- Importance weights: $w \in \mathbb{R}^{N}$, the last column of the input.

The features are then weighted by importance, $x^{(w)} = x_s \odot w$, with the element-wise multiplication ensuring that higher-weighted features contribute more.
Each time step t is independently passed through the GCN encoder:
$$H_t = \mathrm{ReLU}\!\left(\hat{A}\, x^{(w)}_t\, W_{\text{enc}}\right) \tag{4}$$

where $\hat{A}$ is the normalized adjacency matrix derived from the edge index and $W_{\text{enc}}$ is the learnable weight matrix of the encoder GCN.
Then layer normalization is applied and all the time steps are concatenated to obtain a sequence.
The sequence is then passed through the LSTM encoder:
$$z,\ (h, c) = \mathrm{LSTM}_{\text{enc}}\!\left(H_{1:T}\right) \tag{5}$$

where $z \in \mathbb{R}^{N \times D}$ is the latent representation, $(h, c)$ are the hidden and cell states, and $D$ is the latent dimension.
Decoding reverses the process. Each time step t is decoded temporally:
$$\hat{H}_t = \mathrm{LSTM}_{\text{dec}}(z) \tag{6}$$

where $\hat{H}_t$ is the reconstructed hidden representation.
Then there is a layer of normalization, and, finally, spatial reconstruction is performed using the decoder GCN:
$$\hat{X}_t = \hat{A}\, \hat{H}_t\, W_{\text{dec}} \tag{7}$$

where $W_{\text{dec}}$ is the learnable weight matrix of the decoder GCN.
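To make Eqs. (4)–(7) concrete, the following PyTorch / PyTorch Geometric sketch mirrors the described structure; layer sizes, the per-node treatment of the temporal dimension, and the way the latent vector is expanded back over the T decoding steps are our assumptions, not the authors’ implementation.

```python
# Hedged sketch of the AEforGraph encoder/decoder (Eqs. 4-7).
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class AEforGraphSketch(nn.Module):
    def __init__(self, in_feats: int, hidden: int, latent: int):
        # in_feats: number of node features excluding the importance weight
        super().__init__()
        self.gcn_enc = GCNConv(in_feats, hidden)                    # spatial encoder (Eq. 4)
        self.norm_enc = nn.LayerNorm(hidden)
        self.lstm_enc = nn.LSTM(hidden, latent, batch_first=True)   # temporal encoder (Eq. 5)
        self.lstm_dec = nn.LSTM(latent, hidden, batch_first=True)   # temporal decoder (Eq. 6)
        self.norm_dec = nn.LayerNorm(hidden)
        self.gcn_dec = GCNConv(hidden, in_feats)                    # spatial decoder (Eq. 7)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor):
        # x: (N, T, F) with the importance weight as the last feature
        feats, w = x[..., :-1], x[..., -1:]          # split spatial features / importance
        feats = feats * w                            # importance-weighted features
        N, T, _ = feats.shape
        h = torch.stack(
            [torch.relu(self.gcn_enc(feats[:, t], edge_index)) for t in range(T)], dim=1)
        h = self.norm_enc(h)                         # (N, T, hidden)
        seq, _ = self.lstm_enc(h)                    # per-node temporal encoding
        z = seq[:, -1]                               # latent representation z, (N, latent)
        dec_in = z.unsqueeze(1).repeat(1, T, 1)      # expand z over the T decoding steps
        dec, _ = self.lstm_dec(dec_in)               # reconstructed hidden sequence
        dec = self.norm_dec(dec)
        x_hat = torch.stack(
            [self.gcn_dec(dec[:, t], edge_index) for t in range(T)], dim=1)
        return z, x_hat                              # latent code and reconstruction
```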
The training function optimizes the model using an importance-weighted MSE loss.
The forward pass computes:

- Latent embeddings $Z = \mathrm{Encoder}(x)$
- Reconstructed output $\hat{x} = \mathrm{Decoder}(Z)$
The loss function applies importance-weighted Mean Squared Error:
$$\mathcal{L} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} w_i \left( x_{i,t} - \hat{x}_{i,t} \right)^2 \tag{8}$$

where $w_i$ is the importance factor, $x_{i,t}$ is the ground truth, and $\hat{x}_{i,t}$ is the reconstructed input.
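A sketch of this loss, consistent with Eq. (8) and the NT-normalization detailed in the Experimental results section, could look as follows (tensor layout as in the sketch above).

```python
# Sketch: importance-weighted MSE averaged per node and per time step.
import torch

def importance_weighted_mse(x_hat: torch.Tensor, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """x_hat, x: (N, T, F) reconstruction and ground truth; w: (N,) node importance."""
    sq_err = (x - x_hat) ** 2                  # per-node, per-step, per-feature squared error
    weighted = sq_err * w.view(-1, 1, 1)       # scale each node's error by its importance
    N, T = x.shape[0], x.shape[1]
    return weighted.sum() / (N * T)            # average over nodes and time steps (Eq. 8 / Eq. 12)
```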
Semi-supervised clustering: creation of the centroids and assignment of all the samples
Clustering is generally an unsupervised learning technique that involves training a model without labeled samples. The idea of constructing a semi-supervised model (SSC) comes from the final goal: a movement can be considered correct if it is as close as possible to the same movement considered optimal by the instructor. So, for each movement to be trained, a cluster is created to which the movements performed during the actual training phase are assigned, to describe their correctness and their distance, in terms of the calculated features, from the ideal movement.
Once the instructor has selected the optimal samples, one for each desired movement, the centroids are constructed. The function is designed to take a sequence with dimensions (num nodes, num frames, num features) and generate multiple overlapping sub-sequences (windows) of a fixed length, moving by a specific step (stride).
So, for each medoid $m_k$ of temporal length $T_k$, a set of windows is defined:

$$\mathcal{W}_k = \left\{ m_k[\, :,\ s : s + L \,] \ \middle|\ s = 0, S, 2S, \dots, T_k - L \right\} \tag{9}$$

where $L$ is the window length and $S$ is the stride for window extraction inside the centroid; each window $W$ is mapped to its latent representation $Z = E(W)$ provided by the encoder.
The final centroids are calculated as the mean of all the latent representations of the newly labeled set of files (the top-$k$ samples closest to the initial one):

$$c = \frac{1}{K} \sum_{j=1}^{K} E(x_j) \tag{10}$$

where $K$ is the number of the closest samples and $E$ denotes the output of the encoder.
Once the centroids are computed, for each sample sequence encoding Z, overlapping windows are extracted to perform localized matching. This step enables the comparison of test sequences with centroid windows in a temporally flexible manner. The distance between the centroids and the relative test sample is computed similarly, using the Mean Euclidean Distance (or L2 norm).
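The windowing and centroid construction of Eqs. (9)–(10) can be sketched as follows using the encoder sketch above; window length, stride, the top-k value, and the inclusion of the optimal sample itself in the mean are illustrative assumptions.

```python
# Sketch: sliding-window extraction, encoding, and centroid construction.
import torch

def extract_windows(seq: torch.Tensor, length: int, stride: int):
    """seq: (N, T, F) -> list of (N, length, F) overlapping windows."""
    T = seq.shape[1]
    return [seq[:, s:s + length] for s in range(0, T - length + 1, stride)]

@torch.no_grad()
def build_centroid(model, optimal_seq, candidate_seqs, edge_index, length=60, stride=10, k=5):
    """Centroid = mean latent code of the optimal sample and its k closest candidates."""
    def encode(seq):
        zs = [model(win, edge_index)[0] for win in extract_windows(seq, length, stride)]
        return torch.stack(zs).mean(dim=0)                            # average the window embeddings
    z_opt = encode(optimal_seq)
    z_cands = torch.stack([encode(s) for s in candidate_seqs])
    dists = torch.linalg.norm((z_cands - z_opt).flatten(1), dim=1)    # L2 distance in latent space
    closest = z_cands[dists.argsort()[:k]]                            # top-k samples closest to the optimal one
    return torch.cat([z_opt.unsqueeze(0), closest]).mean(dim=0)       # mean latent code (Eq. 10)
```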
Decoding and comparison: decoding all the samples and comparing them with the centroids to provide answers
The last part of the model concerns decoding samples and centroids to analyze their differences and provide node-wise distance calculations that measure how much each node in the test sample deviates from the centroid over time. The time taken to complete the entire process on a single sample is also computed, which is important for understanding how long the user will have to wait in virtual reality to receive an evaluation of the movement performed.
This decoding process consists of Temporal decoding, which converts the centroid’s latent representation into a structured sequence, and graph spatial decoding that uses the model’s graph convolutional decoder to reconstruct spatial relationships.
After decoding the test sample and its assigned centroid, the function calculates node-level distances to measure per-node deviation. For each node i, the Mean Absolute Deviation (MAD) is computed as:
$$\mathrm{MAD}_i = \frac{1}{T} \sum_{t=1}^{T} \left| \hat{x}^{\,\text{test}}_{i,t} - \hat{x}^{\,\text{centroid}}_{i,t} \right| \tag{11}$$

where $T$ is the sequence length; it measures how much the node’s trajectory in the test sample differs from the centroid.
However, MAD aggregates all the features of a node, so on its own it is not very informative, because each node represents a joint or segment of the body with multiple features. For this reason, feature-wise differences are computed for each node, analyzing deviations in specific features (for instance, Segment Acceleration). By breaking down differences feature by feature and node by node, the function ensures that movement assessments are accurate and interpretable, and the whole evaluation can be regulated with a threshold.
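The node- and feature-wise comparison can be sketched as below; the decoded tensor layout and the reporting threshold are assumptions, chosen only to illustrate Eq. (11) and the feature-wise breakdown.

```python
# Sketch: per-node MAD (Eq. 11) and feature-wise deviations with a threshold.
import numpy as np

def node_deviations(test_rec: np.ndarray, centroid_rec: np.ndarray, threshold: float = 0.5):
    """test_rec, centroid_rec: (N, T, F) decoded test sample and decoded centroid."""
    abs_diff = np.abs(test_rec - centroid_rec)        # (N, T, F) absolute deviations
    mad_per_node = abs_diff.mean(axis=(1, 2))         # Eq. (11), averaged over time and features
    featurewise = abs_diff.mean(axis=1)               # (N, F): deviation per node and per feature
    flagged = np.argwhere(featurewise > threshold)    # (node, feature) pairs to report as warnings
    return mad_per_node, featurewise, flagged
```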
Unity setup and live streaming
The archaeologist virtual training system was implemented in Unity as an immersive archaeological excavation site where the user performs specific gestures in a realistic working environment. The Unity scene serves as the central interface of the system, not only as a visualization platform but also as the integrative layer that connects motion capture, data processing, model inference, and user feedback into a unified operational pipeline.
The entire system is connected to Unity, leveraging Unity’s compatibility with Xsens for real-time motion data streaming. Through a dedicated Unity plugin provided by the Xsens MVN software, full-body motion capture data is streamed directly into the engine. This integration ensures that the user’s movements are accurately reproduced in the virtual environment with minimal latency. The combination of Unity, Meta Quest 3, and Xsens enables the creation of immersive training experiences in which users observe their movements precisely reflected in virtual space, thus helping the user to interact and immerse themselves in the environment.
As mentioned before, within this framework, full-body motion data are represented as a skeletal structure composed of multiple body segments. Each segment corresponds to a node in a graph-based model, and each node is described by a feature vector containing motion-related measurements. So, the dimensionality of the skeleton representation is defined by the number of body segments and the number of features associated with each segment. This structured representation is analyzed by the trained model, which assigns the performed movement to a cluster in latent space and computes node-specific deviations from learned reference centroids.
The output of this analysis is transmitted to Unity in real time via UDP. This streaming mechanism enables low-latency data exchange between the processing module and the virtual environment, ensuring that feedback is delivered immediately after movement execution. From the user’s perspective, movement performance and system evaluation appear seamlessly integrated within the same virtual workspace.
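A minimal sketch of this UDP channel is given below; the host, port, and JSON payload layout are assumptions chosen for illustration and do not reflect the project’s actual message format.

```python
# Sketch: send the movement evaluation to the Unity listener as one UDP datagram.
import json
import socket

UNITY_HOST, UNITY_PORT = "127.0.0.1", 5005   # hypothetical endpoint of the Unity-side listener

def send_feedback(cluster: str, deviations: dict[str, float]) -> None:
    """cluster: assigned movement cluster; deviations: per-joint deviation magnitudes."""
    payload = json.dumps({"cluster": cluster, "deviations": deviations}).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (UNITY_HOST, UNITY_PORT))

# Example: send_feedback("trowel", {"Right Hand": 0.12, "Right Forearm": 0.31})
```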
To provide structured and interpretable feedback, three dedicated logs were implemented inside the Unity scene:
Information Log: This component presents a descriptive summary of the performed movement. It reports the assigned movement cluster and provides an interpretation of the action according to the learned representation space. This allows the user to understand how their movement has been categorized within the system.
Warning Log: This component highlights execution-related deviations. When significant differences from the reference cluster are detected, the system identifies the involved joint(s) and displays the corresponding deviation magnitude. By explicitly associating deviations with specific anatomical segments (e.g., forearm or hand), the system translates latent-space distances into meaningful biomechanical feedback.
Error Log: This component reports issues related to motion capture quality or data inconsistencies. When anomalies arise from the acquisition process, the user is informed that the detected issue is due to registration errors rather than incorrect movement execution.
It should be emphasized that, to ensure that the MVN suit and Meta Quest 3 work correctly in Unity, an interpolation function was built to connect the last joint of the arms in the suits to the hands already present in the headset. This function was built directly in Unity, reconstructing the missing forearm as a truncated cone shaped to connect smoothly to the notable points of the respective parts.
Within this structure, the virtual environment serves as the convergence point for the entire workflow. Movement is performed within the archaeological site; skeletal data are captured and streamed into Unity in real time; the model evaluates performance in the latent space; and the resulting assessment is transmitted back into the same environment as structured feedback. This unified pipeline emphasizes the system-level design of the archaeologist training framework, in which data capture and data presentation are tightly coupled within a coherent real-time architecture. See Section “Live test” for more details about the Live Test.
Involved technologies
Meta Quest 3: Meta Quest 3, released in October 2023, marks a major advancement in standalone VR headsets, offering improved performance, sharper visuals, and greater comfort. Compared to Quest 2, it has a 40% slimmer profile thanks to pancake lenses, enhancing wearability for long sessions. It features a convenient layout with the power button and USB-C port on the right and a touch-sensitive passthrough button on the left. Its dual 2064 × 2208 pixels-per-eye RGB-stripe LCD panels deliver a highly immersive visual experience, supporting 90 Hz and 120 Hz refresh rates and a 110-degree horizontal field of view. Powered by the Qualcomm Snapdragon XR2 Gen 2 processor and 8GB of LPDDR5 RAM, it supports advanced mixed reality experiences through full-color passthrough.
Xsens MVN Awinda: Xsens MVN Awinda is an inertial motion capture (MoCap) system with 17 wireless sensors placed strategically across the body: one on the forehead, one on the neck, two on the upper back, two on the upper arms, two on the wrists, two on the hands, one on the lower back, two on the upper legs, two below the knees, and two on the feet. Each Motion Tracker (MTw) integrates a gyroscope (measuring angular velocity), accelerometer (capturing linear acceleration), magnetometer (detecting magnetic field strength and direction), as well as a thermometer and barometer, making it highly reliable for joint angle and biomechanics measurement. After wearing the sensors, a calibration phase involves static and dynamic movements to ensure accurate tracking. Data are then saved in both .mvn and .xlsx formats, with the former containing the complete motion capture and the latter providing detailed movement breakdowns across multiple sheets.
Combination: Combining the head-mounted display (HMD) and MoCap systems enables natural and full-body interaction within VR environments. This integration allows precise mapping of real-world movements into virtual space. Recent developments like DivaTrack36, a deep learning framework, reconstruct full-body motion from minimal input (VR headset and controllers) using IMU data and a dual reference frame approach to handle diverse motions and orientations. Another notable innovation is HapMotion by K. Jung et al.37, which uses a wearable haptic system to translate performers’ upper body movements into real-time vibrotactile feedback for the audience, enriching immersive performance experiences.
Dataset
The goal is to build a system to train archaeologists in excavation techniques without involving a real site. To achieve this objective, three objects in particular were identified as the focus of the analysis: Pickaxe, Trowel, and Shovel. It was not possible to find a dataset that could meet the needs of this type of project, so it was necessary to build one suitable for the case in question.
The dataset comprises movements performed using the Xsens by eight professional archaeologists from the DigiLab, an Interdepartmental Research Center in Sapienza. It consists of 509 recorded movements performed with the three mentioned objects. Data augmentation was also performed, but to prevent over-modifying the data, each augmentation was applied probabilistically rather than deterministically. A predefined probability determines whether a specific transformation is applied to a given sample, as sketched below.
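A sketch of such a probabilistic policy is given below; the specific transformations (additive noise and amplitude scaling) and their probabilities are illustrative assumptions, not necessarily the augmentations actually used.

```python
# Sketch: apply each augmentation only with a predefined probability.
import numpy as np

def augment(sample: np.ndarray, rng: np.random.Generator,
            p_noise: float = 0.5, p_scale: float = 0.3) -> np.ndarray:
    """sample: (N, T, F) graph-sequence tensor; returns a possibly transformed copy."""
    out = sample.copy()
    if rng.random() < p_noise:                        # noise injection fires with probability p_noise
        out += rng.normal(0.0, 0.01, size=out.shape)
    if rng.random() < p_scale:                        # mild amplitude scaling with probability p_scale
        out *= rng.uniform(0.9, 1.1)
    return out
```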
As mentioned, the dataset was built from scratch. Eight professional archaeologists were involved. Their heights differed, and they all recorded single movements made with the three tools to make up the total of 509 samples collected. They all performed the movements as if they were right-handed; two of them had to adapt the movement to this case, as this condition was necessary for the training.
Since the dataset is composed mainly of movements from professionals or aspiring professionals, incorrect samples typically correspond to minor deviations from the constructed centroids, while larger deviations result in higher distance values in the model’s feedback. Although the dataset is limited, a wider collection effort involving both professional and non-professional archaeologists is ongoing to improve coverage and support more reliable statistical analyses.
The dataset is available at this link: MVN Dataset.
It consists of 8 folders, each containing the .mvn files for each individual movement performed, the corresponding .xlsx file for each movement extracted with the numerical values in each frame captured for each physical characteristic considered, and an .mvna file with the physical characteristics of each individual user. In order to read the .mvn files, you must have the corresponding software. Finally, there is a folder with the samples used for training with the 509 movements and a text file that indicates which object was used to perform the movement (shovel, pickaxe, or trowel).
In constructing the input to be fed into the graph, 19 of the 21 nodes provided by Xsens were considered (’Pelvis’, ’L3’, ’T8’, ’Neck’, ’Head’, ’Right Shoulder’, ’Right Upper Arm’, ’Right Forearm’, ’Right Hand’, ’Left Shoulder’, ’Left Upper Arm’, ’Left Forearm’, ’Left Hand’, ’Right Upper Leg’, ’Right Lower Leg’, ’Right Foot’, ’Left Upper Leg’, ’Left Lower Leg’, ’Left Foot’), and only a few quantities were taken into account to study the details of the movement: Segment Velocity, Segment Acceleration, Segment Angular Velocity, Segment Angular Acceleration, Joint Angles XZY, Sensor Free Acceleration.
Experimental results
The Xsens suit alone, combined with the real equipment, was sufficient to record the movements that make up the dataset. On the other hand, the entire system was used for the live tests, i.e., both the suit and the Meta Quest 3, with the entire virtual environment simulating an archaeological area recreated in Unity and reproduced in VR.
The live test requires the support of an instructor: when the user performs the movement, the instructor records it and saves the .mvn file in a folder constantly monitored by the model. This is done manually, as there is no other way to extract the .xlsx data directly from the MVN software. Once the folder receives the file, the model quickly returns all the answers and a description of the movement performed to the virtual environment, with any errors and corrections.
AEforGraph
The first phase was to divide the dataset into training, validation, and test sets to evaluate the effectiveness of the proposal. This division is only relevant for AEforGraph since, for clustering, the split into train, validation, and test loses its meaning.
The three labeled files (LF) were excluded from the division. The validation set was used to perform hyperparameter tuning. Specifically, the model was empirically tested with different values, and the best hyperparameters found were:
![]() |
The training function optimizes the model using an importance-weighted MSE loss.
The forward pass computes:

- Latent embeddings $Z = \mathrm{Encoder}(x)$
- Reconstructed output $\hat{x} = \mathrm{Decoder}(Z)$
The loss function applies importance-weighted Mean Squared Error:
$$\mathcal{L} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} w_i \left( x_{i,t} - \hat{x}_{i,t} \right)^2 \tag{12}$$

where $w_i$ is the importance factor, $x_{i,t}$ is the ground truth, and $\hat{x}_{i,t}$ is the reconstructed input. Moreover, $N$ is the number of nodes in the graph and $T$ the number of time steps; dividing by $NT$ ensures that the loss is computed as the average per-node, per-time-step error rather than a sum over all nodes and time steps. This prevents the loss magnitude from growing with dataset size and ensures consistent scaling during optimization.
So the whole training process is classical, composed of a forward pass, loss computation, backpropagation, and an optimization step. The activation functions, as stated before, are ReLU for the GCN, while the LSTM uses sigmoid activations in its gates and tanh activation in its cell state.
In this case, the Loss analysis allows for a better understanding of the performance of the model (Fig. 3).
Fig. 3.
Train Loss and Validation Loss AEforGraph.
From epoch 6 to the last one, the consistent drop in training and validation losses without sudden increases suggests that the model is learning efficiently without overfitting.
Here, the RMSE indicated is the square root of the importance-weighted Mean Squared Error. Although it may seem redundant, RMSE (in the sense used in the paper) is reported for interpretability, since it expresses reconstruction error in the original feature scale.
The gradual reduction in loss and RMSE indicates that the model is refining its feature representations, and the fact that validation loss continues to decrease alongside training loss is an excellent sign that the model is still improving and generalizing well (Table 1). These results are significant in this sense, considering that one of the objectives of AEforGraph is to provide an excellent reconstruction of the input through the decoder.
Table 1.
Results obtained on Train, Validation, and Test set. AEforGraph.
| Set | Loss | RMSE |
|---|---|---|
| Train | 0.08787118784320636 | 0.29643074713 |
| Validation | 0.07342213139899316 | 0.27096518485 |
| Test | 0.0745231444016099 | 0.272989275251 |
The performance of the autoencoder confirms that the model has successfully generalized beyond the training data, and the RMSE suggests that the average reconstruction deviation is small, which means that the model can effectively reconstruct the input graph features with minimal error.
Another metric taken into consideration to evaluate in detail AEforGraph is Explained Variance Score (EVS):
$$\mathrm{EVS} = 1 - \frac{\mathrm{Var}(x - \hat{x})}{\mathrm{Var}(x)} \tag{13}$$
A high EVS (close to 1) means that the autoencoder retains most of the variance in the original features, while a low EVS (close to 0) suggests that the model is failing to preserve important information.
In this case, EVS = 0.945498, meaning that the autoencoder captures over 94.5% of the original variance.
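For reference, the EVS of Eq. (13) can be computed with scikit-learn on the flattened original and reconstructed tensors, as in the following sketch (array shapes are assumptions).

```python
# Sketch: Explained Variance Score between originals and reconstructions.
import numpy as np
from sklearn.metrics import explained_variance_score

def evs(x: np.ndarray, x_hat: np.ndarray) -> float:
    """x, x_hat: (num_samples, N, T, F) original inputs and their reconstructions."""
    return explained_variance_score(x.reshape(len(x), -1), x_hat.reshape(len(x_hat), -1))
```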
A comparison was made to assess the differences in performance with one of the closest and most similar baselines, ST-GCN38, with all hyperparameters kept the same as in AEforGraph. The difference lies in how the two models handle graph sequences in their spatial and temporal parts.
Fig. 4.
Train Loss and Validation Loss for ST-GCN.
Table 2.
Results obtained on Train, Validation, and Test set. ST-GCN.
| Set | Loss | RMSE |
|---|---|---|
| Train | 0.2397771381532638 | 0.48967043831 |
| Validation | 0.17410068660974504 | 0.41725374367 |
| Test | 0.17263097267646293 | 0.4154888358 |
As can be seen, AEforGraph is much more efficient than ST-GCN, as it is better able to control the temporal part of the provided sequences. This difference is immediately noticeable in the less regular training loss of ST-GCN, and the same trend is also reflected in the RMSE over the epochs.
Moreover, in this case, EVS = 0.757781, indicating that the autoencoder captures 75.7% of the original variance, which is lower compared to AEforGraph.
AEforGraph achieves higher variance preservation because its architecture explicitly separates spatial and temporal modeling and incorporates importance weighting. By first extracting spatial correlations via GCN and then modeling temporal dynamics through an LSTM, AEforGraph captures long-range and different motion patterns more effectively than ST-GCN’s fixed convolutional kernels: the recurrent module in AEforGraph provides adaptive temporal integration because the LSTM updates its hidden state recursively, and this allows the information to accumulate over variable time spans. This makes the latent representation inherently more robust to temporal scale variations, since slower and faster executions of the same archaeological movement can be aligned in latent space through memory dynamics rather than fixed convolutional windows like ST-GCN.
Semi-supervised clustering (SSC): analysis
One of the crucial points of the model is to construct the centroids for the subsequent clustering, starting from the so-called “Optimal samples”. AEforGraph is trained in standard mode, after which the frozen model acts as a fixed feature extractor to generate latent embeddings for all samples. The subsequent clustering phase is entirely unsupervised and does not update model parameters; therefore, using the full dataset avoids data leakage and provides a more comprehensive and stable view of the latent space. This practice is well supported in the literature, as clustering on all available data better reflects the underlying data distribution and prevents biased or incomplete cluster structures39,40.
One limitation of this method concerns the choice of the optimal samples, or rather, their lengths. The lengths of the LF must be very similar to obtain evaluable results. This is particularly the case analyzed in this study, as three specific movements are trained in the same model. In general, this can be mitigated by reducing the number of LF, and therefore of centroids, or by constructing more than one similar model.
Two metrics are used to evaluate the clusters. The first is to verify the manual assignment by manually labeling the samples to be able to compare the outcome. This is always possible thanks to the limited size of the dataset.
Over the 509 considered samples, 15 were assigned to the wrong cluster, giving an overall accuracy of $494/509 \approx 97.05\%$.
This process was supported by 2D and 3D visualizations obtained through Principal Component Analysis (PCA): since the latent space is high-dimensional, PCA is used to reduce it to 2D or 3D.
The second metric used is Mean Intra-Cluster Distance, which measures the internal cohesion of each cluster, evaluating the average distance between each point and its assigned centroid:
$$\mathrm{MIC} = \frac{1}{|C|} \sum_{z_i \in C} d(z_i, c) \tag{14}$$

where $c$ is the centroid of the cluster $C$ to which each sample $z_i$ is assigned and $d$ is the Euclidean distance. Notably, the feature values are normalized; therefore, a value close to 0 indicates good internal cohesion.
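A sketch of the metric of Eq. (14) on the flattened latent embeddings is shown below; the dictionary-based cluster representation is an assumption.

```python
# Sketch: Mean Intra-Cluster Distance per cluster.
import numpy as np

def mean_intra_cluster_distance(embeddings: np.ndarray, labels: np.ndarray,
                                centroids: dict[int, np.ndarray]) -> dict[int, float]:
    """embeddings: (num_samples, D) flattened latent codes; labels: cluster id per sample."""
    result = {}
    for k, c in centroids.items():
        members = embeddings[labels == k]
        result[k] = float(np.linalg.norm(members - c, axis=1).mean()) if len(members) else 0.0
    return result
```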
As shown in Fig. 5, the trowel movement shows evident differences from the other two, and its cluster is visually better separated from the others. The separation is less evident between the remaining two, as they have very similar characteristics in the physical metrics considered (acceleration, velocity, and angular velocity). Despite this, the model distinguishes the small differences between the two movements very well, as is evident from the accuracy obtained: among the 15 errors, 14 were in fact misassignments between pickaxe and shovel.
Fig. 5.
3D PCA projection of high-dimensional latent representations.
Clustering was compared with a classic k-means model, in which the number of centroids was chosen to match the model setup, i.e., $k = 3$, obtaining worse and unbalanced results since the sliding-window technique is not used in k-means.
It can be observed that, although the dataset is well balanced in terms of the number of elements for each object, the resulting clustering in this case is heavily biased towards a specific class (pickaxe), providing a very low overall accuracy, especially when compared to that obtained previously.
It is important to clarify that clustering is performed in the original high-dimensional latent space. In the semi-supervised setting (Fig. 5), the centroids are not learned by the model nor computed through an optimization procedure. Instead, they are manually selected representative prototypes, as described previously. Consequently, they do not necessarily coincide with the geometric center of the projected clusters in PCA space.
In contrast, in the K-means experiment (Fig. 6), the algorithm automatically computes the centroids, which are then projected using the same PCA transformation. The different visual behavior of the centroids across the two figures is therefore expected.
Fig. 6.
K-means clustering results visualized in 3D PCA projection of latent representations.
This clear imbalance can also be seen when evaluating the Mean IC distance of the two approaches, i.e., SSC and the classic clustering algorithm (k-means):
Table 3.
Mean IC Distance, SSC and K-Means.
| Clustering Algo | Trowel | Pickaxe | Shovel |
|---|---|---|---|
| SSC | 0.143 | 0.127 | 0.203 |
| K-Means | 1.662 | 0.962 | 1.284 |
Here, it can be observed how the SSC system shows some more problems with the movements performed with the shovel. However, the values obtained for the Mean IC distance show the efficiency of the clustering when performed with SSC.
The k-means results confirm a deep imbalance and little internal cohesion within the clusters, especially with regard to Trowel, showing that the clustering strategy applied in SSC leads to better results and handles more effectively those samples that are similar to two distinct centroids, as analyzed previously.
Live test
As previously mentioned, the model is built to train users live on precise movements. In this case, the archaeologists performed live tests, with all devices active simultaneously: the Meta Quest 3, the Xsens suit, and the objects, all integrated in the virtual environment built with Unity. The complete setup allowed for good immersion in the constructed environment and a quick response.
For each user, the time required for the entire test is divided into various phases. The first step involves setting up the Xsens, putting on the suit, and positioning the sensors correctly (approximately 5 minutes in total). This is followed by a brief explanation from the instructor on how to use all the features in virtual reality and the various functions available. Finally, the test consists of performing individual movements at intervals of a few seconds, approximately 20 for each movement, with the response received directly in the virtual environment, in the specifically designed “Information” section. Any errors in the movement and the relative distance from the correct movement are reported in the “Warning” section, also included in the virtual environment. Out of 30 tests performed live, 29 were correctly assigned to the right cluster. In other words, the model assigned the movement to the correct cluster, and the distance from it was small enough to classify the movement as correct. In only one case was the movement assigned to the wrong cluster and consequently reported as an error.
Once the MVN file was exported in .xlsx format, a procedure that must be carried out manually, the response time was about 1 second for each capture. The result was correctly visualized in the virtual environment with the right specifications on the nodes: in fact, the system is designed to give the user feedback on specific nodes based on the assigned cluster. For example, for the trowel, all the results are concentrated on the hand node (the right hand, in this case), while for the pickaxe, the forearm nodes are also included, as they are considered fundamental. In this case, it means that specific data on those nodes is provided to the user.
Here is an example of the details provided for each individual node in the case of a correct and an incorrect movement. In the first case, we have the evolution of the ‘Right Hand’ node for acceleration and angular velocity for a correct movement (Fig. 7), while in the second case, we again have the ‘Right Hand’ node with acceleration and angular velocity for an incorrect movement (i.e., distant from the centroid) (Fig. 8).
Fig. 7.
Right Hand acceleration (left) and angular velocity (right) of a correct sample.
Fig. 8.
Right Hand acceleration (left) and angular velocity (right) of an incorrect sample.
NOTE: The term “feature value” on the Y axis in the graphs refers to the specific feature requested by the user to analyze its evolution over the entire duration of the movement and compare it with the same feature in the centroid at the end of the execution.
In this case, therefore, the value refers to acceleration (for the images on the left in the referred figures) and angular velocity (images on the right in the referred figures) for the “right hand” node.
To allow the instructor to visualize everything that happens in the system, the live-test setup used two screens. On the first (the one on the left in Fig. 9), the foreground displays what the user sees in virtual reality, while at the bottom, in the black part (Python), the results are shown to the instructor in front of the screen, just as they are shown to the user in the dedicated section of the Unity environment.
Fig. 9.
Setup for the live test from the instructor’s point of view: on the left-hand side, the user’s view; on the right-hand side, the movements monitored by the software. Note: the same moment is captured on two different screens at the same time.
On the second screen, the MVN software is kept open to check continuous motion capture and extract the .xlsx file as soon as necessary (Fig. 9, on the right).
Conclusions
In this work, a multi-part system was presented, aimed at creating support for training all those movements that require precision and control of the body.
The integration of virtual reality and motion capture using specific devices is not new, but the field of application addressed here is. Similarities can be seen in fields such as rehabilitation or medical education, but the application and its possible extensions are multiple. Furthermore, the system can be expanded by considering more features made available by Xsens, such as different angles or even including positions.
Starting from a specific case study, such as the movements of archaeologists, it has been possible to construct a model that can be generalized to other applications. The results obtained are promising and strongly dependent on the data, but this is part of the logic of constructing any specific training program. The structure and logic remain valid and applicable in different situations.
The idea of exploiting latent representations to capture hidden relationships between apparently unrelated features is one of the system’s strengths, and the accuracy of the autoencoders in the reconstruction has made it possible to exploit both the subsequent clustering and the final decoding to provide users with accurate data.
The Semi-Supervised Clustering system, on the other hand, was inspired by the general logic of training: the subject attempts to perform an ideal movement, trying to imitate as closely as possible what is considered perfect. This is the logic that led to the creation of SSC with specific centroids.
Furthermore, it is fascinating to be able to receive the responses and details of the movements performed directly in virtual reality, thus being able to totally immerse the user in the execution of the movements without necessarily having all the equipment available.
Experiments have shown a high dependence on data, but the model fully satisfies the set objective and can be generalized, provided that very refined datasets are always used. AEforGraph’s performance showed that the constructed model was appropriate for the case under consideration. Semi-supervised clustering proved effective in recognizing live, i.e., unlabeled, movements and in providing detailed information for improving the movement.
Author contributions
V.P. wrote the paper and developed the software. M.R.M. provided feedback on the writing and supported the methodology. F.C.G.D.Z. and E.P. supported the instrumental requirements for the experimental phase and provided support on the physiological aspects of the data collection. E.B. and S.G.M. collected the data. L.C. supervised the research.
Funding
This research did not receive funding.
Data availability
All data generated or analysed during this study are included in this published article and available at this link: https://drive.google.com/drive/folders/1GLzUj6nkdJvh787RqKk8dgLg_awQ6WkT.
Declarations
Competing interests
The authors declare no competing interests.
Ethical approval
This study involved only non-identifiable motion capture data collected via IMUs only. No visual recordings, personal identifiers, or sensitive information were collected. In accordance with Sapienza guidelines, this study qualifies for an exemption from full ethical review for research involving benign interventions and anonymous data collection. All participants provided informed consent prior to participation. This study also fully respected the principles of the Declaration of Helsinki.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Anthes, C., Hernandez, R. J. G., Wiedemann, M. & Kranzlmüller, D. State of the art of virtual reality technologies. In IEEE Aerospace Conference10.1109/AERO.2016.7500674 (2016). [Google Scholar]
- 2.Vatsal, R. et al. An analysis of physiological and psychological responses in virtual reality and flat screen gaming. IEEE Trans. Affect. Comput.15, 1696–1710. 10.1109/TAFFC.2024.3368703 (2024). [Google Scholar]
- 3.Casasnovas, M., Michaelides, C., Carrascosa-Zamacois, M. & Bellalta, B. Experimental evaluation of interactive edge/cloud virtual reality gaming over wi-fi using unity render streaming. Computer Communications10.1016/j.comcom.2024.08.001 (2024). [Google Scholar]
- 4.Santos, J. Towards cloud-native virtual reality applications: State-of-the-art and open challenges. In 2024 IEEE Symposium on Computers and Communications (ISCC), 1–6, 10.1109/ISCC61673.2024.10733608 (Paris, France, 2024).
- 5.Tang, J. et al. Human-centred design and fabrication of a wearable multimodal visual assistance system. Nat. Mach. Intell.7, 627–638. 10.1038/s42256-025-01018-6 (2025). [Google Scholar]
- 6.Akindele, N. et al. A state-of-the-art analysis of virtual reality applications in construction health and safety. Results Eng.23, 102382. 10.1016/j.rineng.2024.102382 (2024). [Google Scholar]
- 7.Avola, D., Cinque, L., Foresti, G. L. & Marini, M. R. A novel low cybersickness dynamic rotation gain enhancer based on spatial position and orientation in virtual environments. Virtual Reality27, 3191–3209. 10.1007/s10055-023-00865-1 (2023). [Google Scholar]
- 8.Beck, J., Rainoldi, M. & Egger, R. Virtual reality in tourism: a state-of-the-art review. Tourism Review10.1108/TR-03-2017-0049 (2019). [Google Scholar]
- 9.Marini, M. R., Mocerino, L., Leopardi, L., Malatesta, S. G. & Cinque, L. Wsus: A novel usability metric based on sus for vr-based tasks in cultural heritage contexts. In 10th International Conference on Virtual Reality (ICVR), 347–355, 10.1109/ICVR62393.2024.10868543 (Bournemouth, United Kingdom, 2024).
- 10.Wang, F., Zhang, Z., Li, L. & Long, S. Virtual reality and augmented reality in artistic expression: A comprehensive study of innovative technologies. International Journal of Advanced Computer Science and Applications (IJACSA) 15, 10.14569/IJACSA.2024.0150365 (2024).
- 11.Jiahui, C. Application and challenges of virtual reality (vr) in art exhibition planning. Applied Mathematics and Nonlinear Sciences10.2478/amns-2024-1454 (2024). [Google Scholar]
- 12.Zhao, J. & Zhang, L. Virtual reality in art appreciation education: A systematic review of the literature. Journal of Educational Technology Development and Exchange17, 96–108. 10.18785/jetde.1701.05 (2024). [Google Scholar]
- 13.Mergen, M., Graf, N. & Meyerheim, M. Reviewing the current state of virtual reality integration in medical education - a scoping review. BMC Med. Educ.24, 788. 10.1186/s12909-024-05777-5 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Marini, M. R., Mecca, A., Foresti, G. L. & Cinque, L. A natural interaction system for medical training through vr technology. In IEEE 37th International Symposium on Computer-Based Medical Systems (CBMS), 237–242, 10.1109/CBMS61543.2024.00047 (Guadalajara, Mexico, 2024).
- 15.Georgarakis, A. M. et al. A textile exomuscle that assists the shoulder during functional movements for everyday life. Nat. Mach. Intell.4, 574–582. 10.1038/s42256-022-00495-3 (2022). [Google Scholar]
- 16.Kroczek, L. O. H. & Mühlberger, A. Public speaking training in front of a supportive audience in virtual reality improves performance in real-life. Sci. Rep.13, 13968. 10.1038/s41598-023-41155-9 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Dai, X., Zhang, Z., Zhao, S., Liu, X. & Chen, X. Full-body pose reconstruction and correction in virtual reality for rehabilitation training. Front. Neurosci.18, 1388742. 10.3389/fnins.2024.1388742 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Lei, C. et al. Effects of virtual reality rehabilitation training on gait and balance in patients with Parkinson’s disease: A systematic review. PLoS One14, e0224819. 10.1371/journal.pone.0224819 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Chen, X., Liu, F., Lin, S., Yu, L. & Lin, R. Effects of virtual reality rehabilitation training on cognitive function and activities of daily living of patients with post-stroke cognitive impairment: A systematic review and meta-analysis. Archives of Physical Medicine and Rehabilitation10.1016/j.apmr.2022.03.012 (2022). [DOI] [PubMed] [Google Scholar]
- 20.Diao, Z., Yamashita, H. & Abe, M. A metaverse laboratory setup for interactive atom visualization and manipulation with scanning probe microscopy. Sci. Rep.15, 17490. 10.1038/s41598-025-01578-y (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Witte, K. et al. Sports training in virtual reality to improve response behavior in karate kumite with transfer to real world. Front. Virtual Real.3, 903021. 10.3389/frvir.2022.903021 (2022). [Google Scholar]
- 22.Richlan, F., Weiß, M., Kastner, P. & Braid, J. Virtual training, real effects: A narrative review on sports performance enhancement through interventions in virtual reality. Front. Psychol.14, 1240790. 10.3389/fpsyg.2023.1240790 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Gorobets, V., Holzwarth, V., Hirt, C., Jufer, N. & Kunz, A. A vr-based approach in conducting mtm for manual workplaces. Int. J. Adv. Manuf. Technol.117, 2501–2510. 10.1007/s00170-021-07260-7 (2021). [Google Scholar]
- 24.Das, K., de Paula Oliveira, T. & Newell, J. Comparison of markerless and marker-based motion capture systems using 95% functional limits of agreement in a linear mixed-effects modelling framework. Sci. Rep.13, 22880. 10.1038/s41598-023-49360-2 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Mündermann, L., Corazza, S. & Andriacchi, T. P. The evolution of methods for the capture of human movement leading to markerless motion capture for biomechanical applications. J. Neuroeng. Rehabil.3, 1–11. 10.1186/1743-0003-3-6 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Shenoy, M. A. et al. Bi-directional convlstm networks for early recognition of human activities and action prediction. Scientific Reports15, 38936 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Jeyasingh-Jacob, J. et al. Markerless motion capture to quantify functional performance in neurodegeneration: Systematic review. JMIR Aging7, e52582. 10.2196/52582 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Salisu, S. et al. Motion capture technologies for ergonomics: A systematic literature review. Diagnostics13, 2593. 10.3390/diagnostics13152593 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Wang, J., Lu, K. & Xue, J. Markerless body motion capturing for 3d character animation based on multi-view cameras. arXiv preprint arXiv:2212.05788 (2022).
- 30.Jang, D. K. et al. Movin: Real-time motion capture using a single lidar. Comput. Graph. Forum42, e14961. 10.1111/cgf.14961 (2023). [Google Scholar]
- 31.Cao, Z., Hidalgo, G., Simon, T., Wei, S. E. & Sheikh, Y. Openpose: Real-time multi-person 2d pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell.43, 172–186 (2019). [DOI] [PubMed] [Google Scholar]
- 32.Lugaresi, C. et al. Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172, 10.48550/arXiv.1906.08172 (2019).
- 33.Armani, R., Qian, C., Jiang, J. & Holz, C. Ultra inertial poser: Scalable motion capture and tracking from sparse inertial sensors and ultra-wideband ranging. In ACM SIGGRAPH 2024 Conference Papers, 1–11, 10.1145/3641519.365746 (2024).
- 34.Zhu, L., Liu, Z. & Liu, G. A deep multiple self-supervised clustering model based on autoencoder networks. Sci. Rep.15, 18372. 10.1038/s41598-025-00349-z (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Zhang, Z. et al. Interpretable unsupervised learning enables accurate clustering with high-throughput imaging flow cytometry. Sci. Rep.13, 20533. 10.1038/s41598-023-46782-w (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Yang, D. et al. Divatrack: Diverse bodies and motions from acceleration-enhanced three-point trackers. Comput. Graph. Forum43, e15057. 10.1111/cgf.15057 (2024). [Google Scholar]
- 37.Jung, K., Kim, S., Oh, S. & Yoon, S. H. Hapmotion: Motion-to-tactile framework with wearable haptic devices for immersive vr performance experience. Virtual Reality28, 13. 10.1007/s10055-023-00910-z (2024). [Google Scholar]
- 38.Yan, S., Xiong, Y. & Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. Proceedings of the AAAI Conference on Artificial Intelligence 32, 10.1609/aaai.v32i1.12328 (2018).
- 39.Beer, A. et al. Shade: Deep density-based clustering. In Proceedings of the 2024 IEEE International Conference on Data Mining (ICDM), 675–680, 10.1109/ICDM59182.2024.00075 (IEEE, 2024).
- 40.Do, K., Tran, T. & Venkatesh, S. Clustering by maximizing mutual information across views. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9928–9938, 10.1109/ICCV48922.2021.00978 (2021).
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
All data generated or analysed during this study are included in this published article and available at this link: https://drive.google.com/drive/folders/1GLzUj6nkdJvh787RqKk8dgLg_awQ6WkT.