Abstract
Various robotic arm control systems have been proposed to support the daily lives of patients with severe motor impairments; however, existing robotic arm control devices typically require physical input devices, such as joysticks, which are often difficult for patients with motor impairments to use. To overcome these limitations, this paper proposes a new method for controlling a robotic arm using augmented reality and object detection. The system automatically configures the path from the robotic arm to the object, allowing all operations to be performed using only eye tracking. With precise object localization and an intuitive gaze-based interface, the proposed robotic arm control system offers a significant advantage for patients with motor impairments, providing a more accessible and user-friendly alternative to traditional control methods.
Keywords: Augmented reality, Object detection, Robotic arm
Subject terms: Computer science, Software
Introduction
Patients with severe motor impairments, such as those caused by spinal cord injuries or locked-in syndrome, often require extensive support to perform daily activities1,2. To assist these individuals, various assistive technologies, including robotic devices and rehabilitation tools, have been developed3–6. Among these, robotic arm control systems have emerged as a practical solution to help patients regain some level of independence7–12. However, most existing robotic arm control systems rely on physical input devices, such as joysticks, which can be difficult or impossible for patients with severe motor impairments to use effectively13.
To overcome these challenges, this project aims to develop a novel robotic arm control system based entirely on eye tracking technology integrated with augmented reality (AR). Eye tracking allows for the precise monitoring of a user’s gaze, making it a suitable and intuitive input method for individuals with limited motor function but intact eye movement. By integrating AR, our system seamlessly overlays interactive digital content onto the user’s physical environment14–16, enabling natural and immersive interactions17–20.
Unlike existing gaze-based systems that typically rely on fixed monitors or external sensors, our approach leverages the built-in spatial mapping and eye-tracking capabilities of HoloLens2, combined with real-time object detection using a lightweight YOLOv8 model. This allows the system not only to detect what the user is looking at, but also where it is located in 3D space, enabling autonomous, gaze-driven object interaction without additional hardware or calibration. Furthermore, the interface includes an adaptive AR-based control menu (“Bring”, “Move”, “Cancel”) to support post-grasp manipulation, contributing to a more flexible and intuitive user experience.
To validate the effectiveness of this system, we conduct a series of experiments that quantitatively evaluate its performance in real-world-like settings. Specifically, we measure key empirical metrics such as localization accuracy and grasp success rate under various spatial conditions. Through these evaluations, we identify practical operational limits (e.g., a reliable working range within 1 meter) and reveal how performance degrades with distance, information that is essential for practical deployment.
In summary, this research contributes a novel AR-assisted, gaze-controlled robotic arm system that emphasizes real-time object awareness, minimal user burden, and empirical validation. By quantifying the system’s grasp success rate and spatial accuracy, this work offers valuable insights into the feasibility and limitations of gaze-based assistive technologies, ultimately aiming to enhance accessibility and autonomy for individuals with severe motor impairments.
Related works
Overview of gaze-based robotic arm control
Several studies have explored the use of monitors and eye trackers to control robotic arms21–23 and the application of AR combined with eye tracking24. However, these approaches present significant limitations. Innovations are required to enhance user experience and address these issues.
Early systems, such as monitor-based gaze control methods, are based on guiding the user’s gaze to a specific location on the screen to execute commands23. One major limitation arises when using a monitor with an eye tracker attached. The robotic arm risks colliding with the monitor during operation, leading to the need to restrict the arm’s workspace to avoid collision. This constraint limits the robotic arm’s range of motion, making it difficult to perform tasks freely around the user. From a user experience perspective, controlling the robotic arm while constantly shifting focus between the monitor and the robot’s status is distracting and reduces task efficiency25. Furthermore, requiring continuous user input to command the robotic arm can be inconvenient and mentally taxing.
Integration of AR in robotic control
To overcome these challenges, AR-based systems with eye tracking have been proposed24 to improve robotic arm control. AR head-mounted displays (AR-HMD) can replace traditional monitors, offering an expanded workspace and eliminating spatial constraints. The see-through nature of AR devices allows users to view the robotic arm in real-time while interacting with the control interface, which is superimposed onto the real-world environment14–16. This method enables users to control the robotic arm more naturally without constantly switching between monitor and arm. McMullen et al.24 improved robotic arm control through an AR interface, and Weber et al.26 implemented external robot calibration and gaze-based cooperation using HoloLens2, but focused on object segmentation. However, such systems rely on predefined object sets or external calibration, limiting their adaptability in dynamic real-world environments.
Object detection in assistive technology
Object detection has been relatively less integrated with gaze-based control. Chen et al.25 combined AR and computer vision to implement high-level robotic arm control, but the complexity increased with the addition of a brain-computer interface. Wang et al.27 focused on grasping using near-eye-tracking and transformer-based models, but did not address real-time object detection or AR feedback. In contrast, Cio et al.21 attempted scene reconstruction using stereo vision, but lacked the intuitive AR overlay provided by our study.
Comparison and differentiation
Previous studies have attempted to estimate the user’s gaze using a near-eye camera and RGB-D sensor and control the robotic arm based on this information27. In that study, the sensor was attached to the robotic arm to estimate the direction of the user’s gaze, and the robotic arm performed the action of picking up an object. This approach was possible because the experimental setup did not need to consider the user’s position or movement, and the focus was on verifying interactions where grasping was possible using only gaze within a relatively fixed environment. However, in this case, it is difficult to respond to changes in the relative position between the robotic arm and the user, and there are limitations to applying this approach in environments where the user moves freely while interacting with the robotic arm.
Our study was designed to go beyond simply picking up objects that users looked at with a robotic arm, extending to a series of interactive scenarios that included directly delivering objects (cups) to users and allowing them to drink water. To achieve this, it was essential to estimate the spatial position of users, and to this end, we adopted HoloLens2, a commercial AR-HMD with built-in SLAM functionality. The device is equipped with four cameras and supports real-time user position recognition along with eye tracking, enabling flexible interaction with the robotic arm.
Furthermore, some existing systems only use coordinate-based gaze input and do not reflect semantic information of objects. For example, in human-robot collaboration systems using AR, interaction was implemented by aligning gaze input and virtual interfaces, but recognition of the objects themselves depended on user anchoring or manual settings26. In our study, we adopted a YOLOv8-based object detection algorithm to interpret in real-time what the user’s gaze is directed at and to enable accurate object selection even when multiple objects exist in the field of view.
In previous studies, voice commands or gesture recognition had to be used in combination with eye gaze to select objects. However, our system is designed to enable object selection using only eye gaze, providing a more intuitive and effortless operating environment for users with physical limitations. In addition, some studies did not include visual feedback or a cancel selection function after selecting an object, which could cause malfunctions in real-world environments21. Therefore, our study provides clarity in object selection through a bounding box-based real-time feedback interface and allows users to cancel selections using only their gaze when necessary.
Lastly, our study is significant in terms of technical usability and extensibility, as it demonstrates that the system can be configured using only commercially available AR devices (HoloLens2) without any external calibration procedures or customized hardware.
Methods
Materials
Microsoft HoloLens2 (AR-HMD)
AR-HMDs offer capabilities that traditional displays cannot, augmenting the real environment and transforming how users interact with it. The Microsoft HoloLens2 (Fig. 1a), pivotal in this project, integrates critical hardware components. Noteworthy features include an eye tracker for gaze control, a camera for object detection, 6-degree-of-freedom (DoF) tracking for precise positioning, and spatial mapping for object localization. This technology bridges the physical and virtual realms, markedly improving the user experience. The hardware components that are critical to this project are listed in Table 1.
Fig. 1.
Materials of system.
Table 1.
Critical hardware specifications of Microsoft HoloLens2.
| Category | Feature | Specification |
|---|---|---|
| Sensors | Eye tracking | 2 IR cameras |
| Sensors | Camera | 8-MP stills, 1080p30 video |
| Human understanding | Eye tracking | Real-time tracking |
| Environment understanding | 6DoF tracking | World-scale positional tracking |
| Environment understanding | Spatial mapping | Real-time environment mesh |
| Compute and connectivity | Wi-Fi | Wi-Fi (802.11ac 2x2) |
Kinova’s Gen2 (7DoF Robotic Arm)
In this project, Kinova’s Gen2 7DoF robotic arm (Fig. 1b) was used for its flexibility, performance, and safety, making it ideal for human-robot interaction tasks such as bringing a cup or bottle to a user. Instead of manually controlling individual joints, the system calculated the relative coordinates between the end effector of the robotic arm and the target, then navigated the arm along the horizontal, vertical, and depth axes. An embedded program automatically adjusted the joints, simplifying control and ensuring efficient, accurate movements.
YOLOv8
YOLOv8 represents a significant advancement in the YOLO (You Only Look Once) series, introducing capabilities beyond traditional object detection, notably in instance segmentation and pose estimation. These enhancements enable the model to perform pixel-level object delineation and accurately identify human body keypoints, respectively.
In this research, we specifically utilize the YOLOv8n (Nano) model (Fig. 1c), a lightweight version developed by Ultralytics28 and optimized for edge computing and real-time inference on resource-constrained devices. The Nano model strikes a balance between speed and accuracy, making it particularly effective for real-time object detection and tracking. By leveraging its low computational footprint, we ensure efficient performance without compromising detection accuracy.
Instance Segmentation: YOLOv8 extends its functionality to instance segmentation, allowing for precise pixel-level identification of individual objects within an image. This capability is crucial for applications requiring detailed object boundaries, such as medical imaging and autonomous driving. The model maintains high accuracy and real-time processing speeds, making it suitable for practical deployment in various scenarios.
Pose Estimation: In addition to segmentation, YOLOv8 incorporates pose estimation features, enabling the detection and mapping of human body keypoints. This functionality is essential for applications in human-computer interaction, sports analysis, and surveillance. Studies have shown that YOLOv8’s pose estimation models deliver outstanding precision and rapid performance, rendering them effective for both live implementations and scenarios that require comprehensive pose evaluation. These advancements position YOLOv8 as a versatile tool in computer vision, capable of handling complex tasks with efficiency and precision.
Research setting and user interface system
Research setting
The system setup is shown in Fig. 2. HoloLens2 was wirelessly connected to the control PC via TCP communication, while the robotic arm was connected to the control PC through a USB cable. The control PC handled object detection and robotic arm control tasks, whereas HoloLens2 was responsible for program control and AR tasks. Object detection was developed in Python, and the robotic arm control was implemented using MATLAB on a Windows-operated control PC. The AR user interface (UI), developed in C# within the Unity3D environment, was built on ARM 64 architecture for deployment on HoloLens2.
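The wireless link between HoloLens2 and the control PC is plain TCP. As a minimal sketch of how such a link can frame variable-length messages (image bytes out, JSON results back), the following is illustrative only: the 4-byte length prefix and the `pack_message`/`unpack_messages` helpers are our assumptions, not the system's documented wire format.

```python
import json
import struct

def pack_message(payload: bytes) -> bytes:
    """Prefix a payload (e.g. a JPEG frame or JSON string) with a
    4-byte big-endian length header so the receiver can reassemble it."""
    return struct.pack(">I", len(payload)) + payload

def unpack_messages(buffer: bytes):
    """Extract complete payloads from a receive buffer.
    Returns (list of payloads, leftover bytes awaiting more data)."""
    messages = []
    while len(buffer) >= 4:
        (length,) = struct.unpack(">I", buffer[:4])
        if len(buffer) < 4 + length:
            break  # partial message; wait for the next TCP chunk
        messages.append(buffer[4:4 + length])
        buffer = buffer[4 + length:]
    return messages, buffer

# Example: frame a detection result as JSON, as the control PC might send it.
detections = {"objects": [{"name": "cup", "conf": 0.91}]}
wire = pack_message(json.dumps(detections).encode("utf-8"))
msgs, rest = unpack_messages(wire)
```

Length-prefixed framing is a common choice over raw TCP because the stream has no built-in message boundaries; any equivalent scheme (e.g., delimiter-based) would serve the same role here.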
Fig. 2.
Outline of system.
HoloLens2’s eye tracking feature, managed through Microsoft’s MRTK, was utilized to track the user’s eye movements as inputs, offering a more intuitive interaction with the AR interface. To ensure accurate eye tracking, the calibration process, in which users adjust the tracker to their eyes by focusing on specific targets in the AR environment, is recommended, as it enhances the precision and reliability of the function.
UI System
To select an object, the user simply gazes at the desired object identified by the object detection system for 5 seconds. The eye tracker monitors the user’s gaze, and once an object is selected, the robotic arm moves to grasp it. After selection, the user is presented with three options: “Bring,” “Move,” and “Cancel” (Fig. 3).
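The 5-second gaze dwell can be thought of as a small timer state machine: the timer resets whenever the gaze moves to a different object and fires once it accumulates the threshold. The Python sketch below is illustrative only; the actual selection logic runs in C# under MRTK on the HoloLens2, and the `DwellSelector` class name is hypothetical.

```python
DWELL_THRESHOLD_S = 5.0  # dwell time required to confirm a selection

class DwellSelector:
    """Accumulates gaze time on one object; fires after the threshold."""

    def __init__(self, threshold=DWELL_THRESHOLD_S):
        self.threshold = threshold
        self.current = None   # id of the object currently gazed at
        self.elapsed = 0.0

    def update(self, gazed_object, dt):
        """Advance by dt seconds; return the selected object id or None."""
        if gazed_object != self.current:
            self.current = gazed_object   # gaze moved: restart the timer
            self.elapsed = 0.0
            return None
        if gazed_object is None:
            return None                   # looking at nothing
        self.elapsed += dt
        if self.elapsed >= self.threshold:
            self.elapsed = 0.0
            return gazed_object
        return None

sel = DwellSelector()
selected = None
for _ in range(60):  # ~6 s of continuous gaze at a 10 Hz update rate
    selected = sel.update("cup", 0.1) or selected
```

Resetting the timer on every gaze change is what prevents accidental selections while the user scans the scene.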
Bring: The robotic arm brings the object to the user and displays a “Putback” option on the interface. The user can hold the object as long as needed and, when ready, choose the “Putback” option to return the object to its original location.
Move: When selected, five sub-options appear: left, right, push, pull, and done. These options allow the user to move the grasped object along the horizontal and depth axes. This feature is useful when the user needs to clear an object (B) that is obstructing the robotic arm’s path to another object (A). The user can move object B to a different location and select “Done” to release it, enabling the robotic arm to access object A.
Cancel: If chosen, the robotic arm will release the object and return to its home position.
Fig. 3.

System UI flow.
Flow of Operators
Image capture and object detection: The HoloLens2 captures an image and sends it to the control PC, where object detection is performed using YOLOv8n. The results are sent back to HoloLens2, which displays boundary boxes around detected objects.
Object selection and ray casting: The user selects an object by gazing at its boundary box for 5 seconds. Once selected, a ray is cast from the HoloLens2 toward the object’s center, colliding with the spatial mapping mesh. The system then calculates the relative 3D coordinates between the robotic arm (determined by QR code scanning) and the object.
Robotic arm navigation and grasping: The calculated coordinates are sent to the control PC, where they are converted into the robotic arm’s coordinate system. The robotic arm then moves by the relative distance of the object’s coordinates, first adjusting its elevation, followed by horizontal and depth positions. During this approach, the grasping pose is automatically adjusted to feasible joint angles according to the object’s position, ensuring that the end-effector aligns with the object coordinates. Upon reaching the object, the arm closes the gripper to grasp it.
UI options and action execution: After grasping the object, the control PC sends a completion signal to HoloLens2, which displays a UI with three options: Bring, Move, and Cancel.
- User commands: “Bring”: The robotic arm brings the object to the user’s mouth, and the user can later select “Putback” to return the object. “Move”: Subcommands (Left, Right, Push, Pull, Done) are provided to move the object in the horizontal or depth directions. “Cancel”: The robotic arm releases the object and returns to its home position.
- User mouth targeting: To implement the “Bring” command, which brings the selected object to the user’s mouth, the user’s reference position is defined by the “MainCamera” object in Unity3D, which represents the HoloLens2 head pose at application launch. A child object, “UserMouth”, is anchored to the “MainCamera” to continuously track head motion. The “UserMouth” is then placed at a fixed offset of +15 cm (Z) and −25 cm (Y) relative to the “MainCamera” to represent the position of the user’s mouth, considering a typical drinking situation in which a drinking straw extends upward (+Y) from the cup and toward the user (−Z). Additional explanations are provided in the supplementary materials (Figs. S1–S6).
This streamlined process enables efficient interaction between the user, the robotic arm, and the AR interface, facilitating intuitive object manipulation.
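The axis-sequenced approach described above (elevation first, then horizontal and depth) can be sketched as a simple motion plan over the relative coordinates between the end effector and the target. The coordinates and the `plan_approach` helper below are illustrative placeholders; the real controller is implemented in MATLAB and drives the Kinova Gen2 joints directly.

```python
def plan_approach(end_effector, target):
    """Return per-axis moves (metres), ordered as the system sequences
    them: elevation (Z) first, then horizontal (X), then depth (Y)."""
    dx = target[0] - end_effector[0]   # horizontal offset
    dy = target[1] - end_effector[1]   # depth offset
    dz = target[2] - end_effector[2]   # elevation offset
    return [("elevation", dz), ("horizontal", dx), ("depth", dy)]

# Example: end effector at the arm's origin, a cup 40 cm ahead,
# 30 cm to the left, and 10 cm above.
moves = plan_approach((0.0, 0.0, 0.0), (-0.30, 0.40, 0.10))
# moves == [("elevation", 0.10), ("horizontal", -0.30), ("depth", 0.40)]
```

Raising the arm before translating horizontally reduces the chance of sweeping through objects on the table, which is presumably why elevation is adjusted first.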
Control system development
The system for object detection and object locating in AR operates through a series of steps using HoloLens2 and Unity3D.
-
Image capture and data transmission.
The HoloLens2’s RGB camera captures an image, which is converted into byte format. This byte data is then sent to a control PC using TCP communication.
-
Object detection with YOLOv8n
On the control PC, the byte data is decoded back into an image and used as input for YOLOv8n, a real-time object detection model. YOLOv8n processes the image and identifies objects, such as “tv”, “keyboard”, “mouse” and “chair” shown in Fig. 4a. The results are formatted into a dataframe using the Pandas library for easy manipulation and extraction of specific object data.
-
Data transmission to HoloLens2
The object detection results are converted into JSON format and sent back to HoloLens2 via TCP communication. HoloLens2 uses this information to proceed with the next steps in the object locating process.
-
Converting 2D to 3D coordinates
The 2D coordinates from the image are used to compute 3D coordinates for the detected objects. These coordinates represent the center points of the detected objects in the real-world environment.
-
Spatial mapping for real-world context
To properly locate objects in the physical world, spatial mapping is used to create a 3D mesh of the user’s surroundings as shown in Fig. 4b. Microsoft’s Mixed Reality Toolkit (MRTK) is employed for spatial awareness, automatically starting the mapping process when the program begins. The Update Interval is set to 1 second for continuous tracking, and the mesh is generated with the highest level of detail for precise object localization. The mesh remains invisible to avoid obstructing the user’s view but is crucial for mapping real-world objects into virtual space.
-
Displaying the boundary box in AR
Using Unity3D, a light blue boundary box is instantiated at the calculated 3D coordinates. The size of the boundary box is scaled appropriately to fit around the object. This provides visual feedback of the object’s detected position within the AR environment.
-
Eye tracking and object selection
HoloLens2’s eye tracker enables hands-free interaction. When the user gazes at a detected object (identified by the light blue boundary box), the box turns red, indicating focus as shown in Fig. 4c. Object detection temporarily pauses to stabilize the interaction and prevent the boundary box from disappearing. If the user looks away, detection resumes. However, if the user maintains their gaze for 5 seconds, the object is selected, triggering the robotic arm control function.
-
Ray casting for precise object location
After an object is selected, Unity3D’s ray casting function is used to specify the position of the object; a summary of object locating is shown in Fig. 4d. A virtual ray is projected from the user’s location (HoloLens2) towards the center of the boundary box. When the ray collides with the 3D mesh generated by spatial mapping, the collision point provides the exact coordinates of the object in the real world. This process allows precise interaction with real-world objects through AR, enhancing object manipulation capabilities in the virtual space.
As shown in the flowchart in Fig. 5, object detection and spatial mapping are part of the perception update rate, which is set to 1 Hz. Object detection (covering steps 1, 2, 3, 4, and 6) and spatial mapping (step 5) operate in parallel. Additional explanations are provided in the supplementary materials (Figs. S7–S10).
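Steps 2 and 3 of the pipeline, formatting the detector output as a Pandas dataframe and serializing it as JSON for the TCP reply, might look roughly as follows. The box tuples stand in for real YOLOv8n output, and the column names and the centre-point convention are our own illustrative choices.

```python
import json
import pandas as pd

def boxes_to_frame(boxes):
    """boxes: (name, conf, x1, y1, x2, y2) tuples in image pixels,
    as one might extract them from YOLOv8n results."""
    df = pd.DataFrame(boxes, columns=["name", "conf", "x1", "y1", "x2", "y2"])
    # Bounding-box centres: HoloLens2 later casts a ray toward these points.
    df["cx"] = (df["x1"] + df["x2"]) / 2
    df["cy"] = (df["y1"] + df["y2"]) / 2
    return df

def frame_to_json(df):
    """Serialize the detections for the TCP reply to HoloLens2."""
    return json.dumps(df.to_dict(orient="records"))

frame = boxes_to_frame([
    ("cup", 0.91, 100, 200, 180, 300),
    ("keyboard", 0.84, 300, 350, 700, 450),
])
payload = frame_to_json(frame)
```

Keeping the results in a dataframe makes it easy to filter by class name or confidence before sending, which matches the paper's stated reason for using Pandas.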
Fig. 4.
Object detection and locating.
Fig. 5.
Object detection and spatial mapping flowchart and perception update-rate.
Robotic arm locating: QR code scanning
Pinpointing the coordinates of a fixed robotic arm is essential, especially as the user’s position (HoloLens2) is constantly moving and rotating. To address this, QR code-scanning technology was applied to accurately identify the robotic arm’s position.
QR code recognition and utilization
This study utilizes the Microsoft.MixedReality.QR package to detect and track QR codes using the HoloLens2’s built-in camera. When the QR code is detected, its coordinates are retrieved and mapped into Unity3D’s coordinate system, as illustrated in Fig. 6. A Unity3D GameObject (white cube) is then instantiated at the robotic arm’s origin position (0, 0, 0), ensuring precise positioning and alignment of the robotic arm within Unity3D.
Fig. 6.

QR code placement demonstration setup.
The QR code is scanned only once during the initial setup, meaning there is no need for continuous re-scanning unless the entire system (e.g., wheelchair, robotic arm) is moved. If the system is relocated, the QR code must be re-scanned to reestablish the robotic arm’s updated coordinates within Unity3D.
QR code placement and positioning principles
A QR code was placed 60 cm to the left of the robotic arm horizontally due to the physical constraints of the wheelchair setup. The key reasons for this placement are:
Avoiding obstruction by the robotic arm: Placing the QR code directly in front of the robotic arm could lead to occlusion when the arm moves, causing tracking issues.
Maintaining a fixed reference point: Since the robotic arm is installed on the right side of the wheelchair, the QR code is placed on the left to provide a stable reference point for position calibration.
Ensuring accurate tracking in varying conditions: A QR code size of at least 10 cm × 10 cm has been found to provide stable recognition while remaining small enough for practical use. Smaller sizes may still be detected, but a larger QR code improves visibility and consistency under varying lighting and camera angles.
Coordinate transformation and application in Unity3D
The QR code’s coordinates are detected within the HoloLens2 world coordinate system and mapped to Unity3D. The robotic arm’s relative position is then calculated based on the detected QR code. This process ensures that the robotic arm remains accurately positioned even when the user moves.
Once scanned, Unity3D assigns the detected QR code’s position to a GameObject. The robotic arm’s reference point is then computed relative to this QR code position. The final calculated coordinates are fed into the robotic arm control algorithm, allowing it to navigate precisely towards the desired object.
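Since the QR code sits 60 cm to the left of the arm, the arm origin is recovered by adding a fixed offset to the detected code position, and object positions are then expressed relative to that origin. The sketch below is a simplification we make for illustration: it assumes no rotation between the QR code's axes and the arm's axes, whereas the real system maps the full QR pose into Unity3D's coordinate frame.

```python
# Offset from the QR code to the arm base, in metres: the code is
# mounted 60 cm to the code's left of the arm (flat-mount assumption).
QR_TO_ARM_OFFSET = (0.60, 0.0, 0.0)

def arm_origin_from_qr(qr_position):
    """Robotic arm origin in world (Unity3D) coordinates."""
    return tuple(q + o for q, o in zip(qr_position, QR_TO_ARM_OFFSET))

def world_to_arm(point, arm_origin):
    """Express a world-space object position relative to the arm base."""
    return tuple(p - a for p, a in zip(point, arm_origin))

origin = arm_origin_from_qr((1.00, 0.00, 0.50))     # QR detected here
target = world_to_arm((1.90, 0.40, 0.50), origin)   # cup in world coords
# target is approximately (0.30, 0.40, 0.0): 30 cm right of and
# 40 cm ahead of the arm base, at the same height.
```

This is why a single scan suffices: once the offset is applied, all subsequent targets are computed relative to the stored arm origin until the wheelchair moves.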
QR code re-recognition conditions
While the QR code is scanned only once under normal conditions, there are scenarios where re-scanning is required:
Wheelchair or robotic arm relocation: If the wheelchair moves, the QR code must be scanned again to update the robotic arm’s coordinates.
Obstruction or poor lighting conditions: If the QR code becomes obscured, additional scanning may be needed.
System recalibration: If there is a discrepancy between the expected and detected robotic arm position, re-scanning can be triggered for correction.
For this study, QR codes were chosen due to their simplicity, low computational cost, and ease of integration with HoloLens2’s built-in camera system. By integrating QR code-based tracking with spatial mapping and ray-casting, the robotic arm can accurately determine its location and move to manipulate objects with high precision.
Experimental methodology
To assess the applicability of the AR-based robotic arm control system, two experiments were conducted to evaluate the functional limits, accuracy, and performance of object localization using the proposed system. This study investigates how errors in object localization impact robotic arm operation, particularly in terms of grasping accuracy and success rates.
The experiment consists of two main objectives:
Evaluating localization accuracy of objects in the AR environment by measuring spatial errors along horizontal (X-axis) and depth (Y-axis) directions.
Evaluating the grasping limits of the robotic arm by identifying the depth threshold beyond which successful grasping is no longer reliable.
Experimental setup
Participants: A total of four participants (3 males, 1 female), aged between 24 and 30, with varying levels of AR experience, took part in the experiments. Each participant was instructed to fixate on the bounding box shown by the object detection system while keeping their head still, enabling consistent eye-tracking interaction regardless of prior AR familiarity. Given that the experiment was conducted using a waveguide-based AR-HMD equipped with two-dimensional pupil expansion and Exit Pupil Expansion (EPE) technologies, the device provided a wide eyebox capable of accommodating diverse interpupillary distances and variations in eye position. This optical design ensured inherently stable visual alignment across participants, regardless of individual facial differences. Therefore, we determined that no separate training or adaptation phase was necessary to enable reliable and consistent interaction.
Object Placement: The test object (cup) was placed at 64 systematically distributed points in a coordinate system ranging from (− 10, 10) to (− 80, 80) centimeters. The object placement was controlled across horizontal (X-axis) and depth (Y-axis) variations to evaluate localization accuracy (Fig. 7).
User Position: The user’s reference position was set to (− 30, − 40) centimeters relative to the robotic arm, aligned with the structure of the wheelchair (Fig. 7, 8).
Measurement Approach: Ray-casting measurements were taken at each of the 64 coordinates, with each participant performing the measurement once per coordinate. This setup ensured the reliability of the data for calculating localization errors, with all measurements conducted by four participants to ensure both reliability and reproducibility. To eliminate head movement variability, only eye-tracking interactions were used for object selection. In the subsequent robotic arm grasping limit evaluation, the same four participants performed grasping tasks 4 times at selected spatial locations derived from the localization results. This repeated testing allowed us to empirically validate the spatial boundary within which grasping was consistently successful.
Fig. 7.

Setting of evaluation test.
Fig. 8.

Overview of the system.
Experiment procedure
The experiment was conducted in two phases:
Phase 1: Localization Accuracy Assessment
This phase aimed to quantify spatial errors in object localization across different positions.
Error Measurement: In the experiment, error measurements were conducted by varying one distance parameter at a time while keeping the other fixed. For depth error measurement, the horizontal distance (X-axis) was fixed at multiple points ranging from − 10 cm to − 80 cm, and the depth distance (Y-axis) was increased in 10 cm increments. Conversely, for horizontal error measurement, the depth distance (Y-axis) was fixed at values from 10 cm to 80 cm, while the horizontal distance (X-axis) was varied in 10 cm increments.
Comparison of Measured vs. Actual Coordinates: The experimental ray-casted object coordinates were compared with actual object positions to compute absolute localization errors.
Data Collection: Localization errors were analyzed to determine whether object placement accuracy deteriorates as distance from the user increases.
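The Phase-1 error computation reduces to taking the absolute difference between ray-casted and actual coordinates and averaging per ground-truth level. A minimal sketch follows; the sample readings are made-up values for illustration, not measured data from the experiment.

```python
from statistics import mean

def mean_error_by_level(samples):
    """samples: {ground_truth_level_cm: [(measured, actual), ...]}.
    Returns the mean absolute localization error (cm) per level."""
    return {level: mean(abs(m - a) for m, a in pairs)
            for level, pairs in samples.items()}

errors = mean_error_by_level({
    10: [(10.4, 10.0), (9.8, 10.0)],    # hypothetical near-range readings
    60: [(63.1, 60.0), (57.6, 60.0)],   # hypothetical far-range readings
})
# errors is approximately {10: 0.30, 60: 2.75}, i.e. larger error at depth
```

Grouping by ground-truth level rather than pooling all points is what allows the distance-dependence of the error to be seen.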
Phase 2: Robotic Arm Grasping Limit Evaluation
This phase investigated the relationship between localization errors and robotic arm grasping success rates.
Instead of testing all positions, representative horizontal and depth positions were chosen based on their impact on grasping performance in real-world scenarios.
The success and failure rates of robotic grasping were recorded at each selected depth to identify a critical depth threshold beyond which grasping was no longer reliable.
Reproducibility Analysis: Multiple trials were conducted to assess the consistency of grasping success rates.
Objective and expected outcomes
Objective
This study aims to extend the understanding of localization accuracy in AR environments and its direct impact on robotic arm operation. Specifically, the objectives include:
Investigating how spatial errors (horizontal and depth-based) affect object localization in AR environments.
Determining the maximum permissible localization error that still allows for successful robotic arm grasping.
Establishing a depth threshold where grasping success significantly declines due to localization inaccuracy.
Expected impact
Through this experiment, we expect to achieve the following key insights:
Quantification of AR localization errors: This study will provide empirical data on how localization accuracy varies across different positions in an AR environment.
Identification of operational limits for robotic grasping: The results will establish grasping constraints in terms of depth thresholds, contributing to optimal workspace design for robotic arms in assistive applications.
Optimization of AR-based robotic control systems: By understanding localization error trends, improvements can be made to enhance object detection algorithms and improve real-time robotic arm precision.
Insights for future AR-integrated robotic systems: The findings will support the development of more reliable, user-friendly robotic assistance technologies for individuals with motor impairments.
Results and discussion
First experiment: localization accuracy assessment
The first experiment focused on evaluating how the accuracy of the localization of objects in HoloLens2 is influenced by changes along the horizontal and depth axes, with the aim of analyzing how variations in these two directions affect the performance of the system. By systematically adjusting object positions and measuring the resulting errors, calculated as the absolute difference between the measured and actual positions, the study provides key insights into how HoloLens2 manages spatial accuracy when detecting and interacting with objects across different axes.
In this experiment, depth and horizontal errors were measured by fixing the horizontal distance at values between −10 cm and −80 cm and gradually increasing the depth in 10 cm increments. As shown in Fig. 9a,b, for each axis (depth, horizontal), we summarized error as a function of the corresponding ground-truth level (i.e., depth-wise and horizontal-wise). At each ground-truth level, we report the mean absolute error and a 95% percentile bootstrap confidence interval (CI) based on B = 10,000 resamples. These intervals are used only for descriptive visualization (shaded bands in the line plots). In addition, to assess monotonic changes across ground-truth levels, we tested adjacent level pairs for each axis. For each pair, we performed a Mann-Whitney U (MWU) test. We then quantified the size and direction of change using the Hodges-Lehmann (HL) shift. For each adjacent pair, we computed a 95% CI by Wilcoxon inversion (Fig. 9c,d).
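The descriptive statistics above can be sketched as follows. The error arrays are synthetic stand-ins for the per-subject samples at two adjacent depth levels, and the CI here is the percentile bootstrap on the mean, as described in the text; the Wilcoxon-inversion CI for the HL shift is omitted for brevity.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

def bootstrap_ci(x, n_boot=10_000, alpha=0.05):
    """95% percentile bootstrap CI for the mean absolute error."""
    means = [np.mean(rng.choice(x, size=len(x), replace=True))
             for _ in range(n_boot)]
    return (np.percentile(means, 100 * alpha / 2),
            np.percentile(means, 100 * (1 - alpha / 2)))

def hodges_lehmann_shift(x, y):
    """HL estimator: median of all pairwise differences between samples."""
    return float(np.median(np.subtract.outer(y, x)))

# Synthetic adjacent-level comparison, e.g. errors (cm) at 50 cm vs 60 cm depth.
err_50 = np.array([1.0, 1.2, 0.9, 1.4, 1.1])
err_60 = np.array([2.6, 3.1, 2.9, 3.4, 2.8])

u_stat, p_value = mannwhitneyu(err_50, err_60, alternative="two-sided")
shift = hodges_lehmann_shift(err_50, err_60)   # positive: error increased
lo, hi = bootstrap_ci(err_60)
```

The HL shift is preferred over a difference of means here because it is robust to outliers, matching the non-parametric MWU test used for significance.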
Fig. 9.
(a) Mean absolute depth error across ground truth depths with 95% percentile bootstrap CI (10,000 resamples; shaded band). (b) Mean absolute horizontal error across ground truth horizontal positions (same CI). (c) Adjacent depth-level comparisons: Hodges–Lehmann (HL) location shifts (points) with 95% CI obtained by inverting the Wilcoxon rank-sum test; points colored red for Mann–Whitney U (MWU) p < 0.05 and blue otherwise. (d) Adjacent horizontal-level comparisons (same conventions as (c)).
Up to ground truth depth 50 cm, the depth error remained relatively consistent with no significant differences between adjacent levels (all MWU p > 0.05; HL shift 95% CI included zero). Between 50 and 60 cm, a modest but statistically significant increase was observed (MWU p = 0.003; HL shift = 1.79 cm [95% CI: 0.53–2.79]). Beyond 60 cm, the error rose more markedly, with stronger effects (e.g., 60 → 70 cm: MWU p < 0.001; HL shift = 5.72 cm [95% CI: 3.88–6.91]) (Fig. 9c).
Pivoting the analysis to the horizontal axis (Fig. 9b,d), we observed that errors increased as the horizontal distance of the object deviated from the user's fixed horizontal location (−30 cm). Notably, errors increased consistently as the object moved away from the −30 cm point in either direction (all adjacent pairs from −60 to −10 cm: MWU p < 0.05; HL shift 95% CI excluded zero), indicating that localization accuracy was highest near the user's own position and declined significantly with increasing lateral displacement.
Overall, the analysis showed that while both depth and horizontal errors increase as the distance from the user grows, errors caused by changes in depth were more pronounced (min HL shift: 1.79 cm at 50 → 60 cm; max HL shift: 5.72 cm at 60 → 70 cm). Horizontal errors were generally smaller and more stable (min HL shift: 0.69 cm at −20 → −10 cm; max HL shift: −1.79 cm at −60 → −50 cm) compared to depth errors. In this context, negative HL shift values indicate that the predicted positions tended to deviate further to the left (i.e., more negative on the x-axis) than the ground truth, while positive values correspond to rightward deviations. This observation highlights the importance of proximity in maintaining localization accuracy, particularly along the depth axis. These results are based on across-subject averaged data (n = 4). Detailed values are reported in the supplementary materials (Tables S1–S4), and results for each individual subject are provided in the supplementary materials (Figs. S11–S14 and Tables S5–S8).
Predominance of depth error in localization accuracy
The ground truth coordinates from the experiment and the predicted coordinates obtained through ray casting were visualized in 2D (Fig. 10). Visual inspection qualitatively indicated that the depth error was larger than the horizontal error, which was quantitatively confirmed by the error deviations: depth error (y-direction, min: 5.17 cm; max: 22.28 cm) compared with horizontal error (x-direction, min: 0.02 cm; max: 6.41 cm).
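The per-axis error deviations quoted above can be recomputed directly from the raw 2D predictions with a few lines of NumPy. The sketch below uses hypothetical coordinates (not the study's data); `x` is the horizontal axis and `y` the depth axis, in cm, matching the convention of Fig. 10.

```python
import numpy as np

def per_axis_errors(pred, truth):
    """Absolute per-axis localization errors for 2D points.

    pred, truth: arrays of shape (n, 2) as (x = horizontal, y = depth) in cm.
    Returns (horizontal_errors, depth_errors).
    """
    err = np.abs(np.asarray(pred, float) - np.asarray(truth, float))
    return err[:, 0], err[:, 1]

# Hypothetical example points (user's horizontal position is -30 cm):
truth = np.array([[-30.0, 50.0], [-30.0, 70.0]])
pred  = np.array([[-29.0, 56.0], [-32.0, 85.0]])
hx, dz = per_axis_errors(pred, truth)
```

From arrays like these, the min/max comparison in the text is simply `dz.min()`, `dz.max()` versus `hx.min()`, `hx.max()`.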
Fig. 10.
2D visualization of the result. Left: individual subjects’ predictions; Right: mean prediction.
Second experiment: robotic arm grasping limit evaluation
In the second experiment, we selected the optimal horizontal positions of −50 cm and −60 cm for the arm to grasp the cup, and then experimented with different depth values based on the results of the first experiment. Rather than performing experiments at all depths, we selected a representative set of depths, informed by the first experiment, that would make a meaningful difference in real-world robotic grasping environments, and closely compared and analyzed the success and failure rates and the reproducibility of cup grasping at these depths. We aimed to identify the depth threshold at which the arm can reliably grasp cups and to provide insights into the optimal grasping strategy and work environment settings.
The results of the first experiment showed that the depth error increased significantly when the object was placed beyond 1 m from the user. Accordingly, an experiment was conducted to analyze the grasping probability (%) near the threshold distance of 1 m (approximately ground truth depth 60 cm) by increasing the number of trials.
First, experiments were conducted at coordinates with a fixed horizontal position of −50 cm, and the results are reported in Table 2 and Fig. 11. From depth 61 cm, which corresponds to a user distance of 1.01 m, the depth error increased (mean: 7.17 cm to 10.97 cm), leading to a drop in the grasping probability (100% to 60%). Further analysis at depth 61 cm revealed that the robotic arm failed to grasp the cup when the depth error exceeded 12 cm, while successful grasping occurred only when it was below 10.4 cm. Additionally, at depth 62 cm and beyond, most depth error values were greater than 12 cm (71 out of 80; 88.75%), which prevented successful grasping.
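The depth-error thresholds observed at the −50 cm horizontal position can be summarized as a simple classifier. This is an illustrative sketch based only on the reported observations (success below ~10.4 cm, failure above 12 cm, with the band in between not cleanly separated in the trials), not part of the system itself.

```python
def grasp_outcome(depth_error_cm,
                  success_below=10.4, failure_above=12.0):
    """Classify the expected grasp outcome at horizontal -50 cm from the
    observed depth-error thresholds. Trials with errors between the two
    thresholds were not cleanly separable, so they are labeled uncertain.
    """
    if depth_error_cm < success_below:
        return "success"
    if depth_error_cm > failure_above:
        return "failure"
    return "uncertain"
```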
Table 2.
Grasping probability at different depths with horizontal fixed at − 50 cm.
| Depth [cm] | Mean depth error [cm] | Grasping probability [%] |
|---|---|---|
| 50 | 5.01 | 100 |
| 60 | 7.17 | 100 |
| 61 | 10.97 | 60 |
| 62 | 12.15 | 0 |
| 63 | 15.20 | 0 |
| 64 | 13.18 | 0 |
| 65 | 12.28 | 0 |
| 70 | 12.02 | 0 |
Fig. 11.
Grab success rate with horizontal fixed at −50 cm.
Similarly, experiments were conducted at coordinates with a horizontal position of −60 cm, and the results are summarized in Table 3 and Fig. 12. Beyond the threshold distance of 1.01 m, the increase in depth error (mean: 7.64 cm to 9.01 cm) resulted in a decreased grasping probability (100% to 56.3%). At depth 61 cm, the robotic arm could only grasp the cup when the depth error was below 9.6 cm; beyond this threshold, successful grasping was no longer possible.
Table 3.
Grasping probability at different depths with horizontal fixed at − 60 cm.
| Depth [cm] | Mean depth error [cm] | Grasping probability [%] |
|---|---|---|
| 55 | 5.92 | 100 |
| 56 | 6.79 | 100 |
| 60 | 7.64 | 100 |
| 61 | 9.01 | 56.3 |
| 62 | 9.90 | 0 |
Fig. 12.
Grab success rate with horizontal fixed at −60 cm.
Predominance of depth error in grasping probability
In the second experiment, once the object's depth exceeded 61 cm, the robotic arm invariably failed to grasp it, regardless of whether the horizontal coordinate was set to −50 cm or −60 cm. This finding reaffirms that the predominance of depth error in localization accuracy, observed in the first experiment, has a substantial impact on the arm's actual grasping performance.
Comparison to prior work and practical implications
Our system achieved 100% grasping success at selected horizontal positions (–50 cm and –60 cm) within a 1 m depth range, showing performance comparable to or better than prior gaze-based control systems such as PoGARA (92%)21 and Wang et al. (87% for familiar objects)27. Moreover, unlike these systems, ours integrates real-time object detection (YOLOv8) with AR and gaze-based input, providing both intuitive UI and autonomous object-awareness.
While the precision of our localization does not reach sub-centimeter levels achieved by some AR calibration methods, it is sufficient to support reliable grasping within a clearly defined working envelope. The AR-based visual feedback and “Bring/Move/Cancel” UI further enhance usability, especially for users with severe motor impairments.
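The "clearly defined working envelope" can be expressed as a conservative gate derived from the reported results. The function below is a hypothetical illustration, not the system's planner: it accepts only the depth range that achieved 100% grasp success (≤ 60 cm) at the two evaluated horizontal positions.

```python
def within_working_envelope(depth_cm, horizontal_cm):
    """Return True if a target lies inside the empirically validated
    grasping envelope: depth <= 60 cm (100% success in both experiments)
    at the evaluated horizontal positions of -50 cm and -60 cm.

    Conservative sketch based only on the reported tables; positions
    outside the tested grid are rejected rather than extrapolated.
    """
    return depth_cm <= 60.0 and horizontal_cm in (-50.0, -60.0)
```

In a deployed system such a check could trigger the AR UI to ask the user to move closer (e.g., via a powered wheelchair) before initiating a grasp.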
These results empirically define the spatial performance limits of our system and validate its practical potential for real-world assistive applications.
Conclusion
This study presents an AR-based, gaze-controlled robotic arm system designed to enhance autonomy for individuals with motor impairments. Traditional input methods such as joysticks are often inaccessible for users with severe physical limitations13, making alternative hands-free interfaces crucial. By leveraging the HoloLens2’s eye-tracking and spatial mapping capabilities in combination with real-time object detection (YOLOv8n), the system allows users to perform complex tasks-such as grasping and moving objects-using only eye movements.
Our empirical findings highlight that the system achieves high localization accuracy and grasp success within a 1-meter operating range. Notably, a grasp success rate of 100% was observed up to 60 cm, with performance degrading sharply beyond 61 cm due to increased depth error. These results validate the system’s feasibility for practical deployment in close-range, daily-life contexts.
Compared to existing gaze-controlled systems, our approach provides greater autonomy by incorporating 3D spatial awareness, object detection, and intuitive AR feedback without the need for external sensors or manual recalibration. In addition, post-grasp manipulation options (e.g., “Bring” or “Move”) offer enhanced interactivity that can support users beyond simple object selection.
While the system performs well under its current configuration, it remains limited by AR-based depth estimation accuracy. Future improvements should focus on refining spatial accuracy, expanding usable range, and incorporating comparative baselines (e.g., joystick-based or voice-controlled interfaces) to quantify usability gains. Moreover, integrating additional input modalities such as voice commands or biosignals could broaden accessibility to individuals unable to utilize eye tracking.
A limitation of this study is that the present experiments involved only four participants. Although consistent patterns were observed across all individuals (Supplementary Figs. S11–S14 and Tables S5–S8), the small sample size inevitably limits the statistical validity and generalizability of the findings. Thus, these results should be interpreted as preliminary and exploratory, and future work with larger cohorts is required to confirm the robustness of the observed effects.
Despite this limitation, our system provides encouraging preliminary evidence of its potential as a user-friendly, empirically validated assistive solution. It contributes not only a novel integration of AR and gaze control for robotic manipulation but also a clear experimental framework for evaluating future systems. Broader adoption and further refinement could enable applications in education, creative arts, and beyond, ultimately improving quality of life for individuals with diverse needs. A user study with clinical populations is recognized as an important direction for future work. In real-world scenarios where the system is integrated with a powered wheelchair, objects beyond the 1-meter range could be accessed by moving the user closer to the target before initiating robotic arm control. Future studies may explore seamless coordination between user mobility and arm manipulation to extend the system’s functional workspace.
Supplementary Information
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1A2C1013205) and by the National Research Foundation (NRF) (RS-2023-00280938) grant funded by the Korea government.
Author contributions
J-W.H. implemented the system, K.W. and D-J.K. conceived the experiment(s), J-W.H. and W.N. conducted the experiment(s), J-W.H. and W.N. and K.W. analysed the results. All authors reviewed the manuscript.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Ethics approval
The study protocol was approved by the Korea University Institutional Review Board (KUIRB-2021-0378-08), and all methods were performed in accordance with relevant guidelines and regulations. Informed consent was obtained from all subjects prior to participation, and the study did not collect any information that could be used to identify individuals.
Informed consent
Written informed consent was obtained from all individual participants (or their legal guardians) included in the study for the publication of any identifying information or images in this open-access journal.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Joung-Woo Hyung and Woochan Na contributed equally to this work.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-025-19514-5.
References
- 1.Müller, R., Peter, C., Cieza, A. & Geyh, S. The role of social support and social skills in people with spinal cord injury a systematic review of the literature. Spinal cord50, 94–106 (2012). [DOI] [PubMed] [Google Scholar]
- 2.Rousseau, M.-C., Pietra, S., Nadji, M. & Billette de Villemeur, T. Evaluation of quality of life in complete locked-in syndrome patients. J. Palliat. Med.16, 1455–1458 (2013). [DOI] [PubMed] [Google Scholar]
- 3.National Stroke Foundation (Australia). Clinical Guidelines for Stroke Management 2010 (National Stroke Foundation, 2010).
- 4.Kwakkel, G., Wagenaar, R. C., Koelman, T. W., Lankhorst, G. J. & Koetsier, J. C. Effects of intensity of rehabilitation after stroke: A research synthesis. Stroke28, 1550–1556 (1997). [DOI] [PubMed] [Google Scholar]
- 5.Miller, D. P. Assistive robotics: an overview. Assistive Technology and Artificial Intelligence: Applications in Robotics, User Interfaces and Natural Language Processing 126–136 (2006).
- 6.Hillman, M. Rehabilitation robotics from past to present: a historical perspective. In Advances in Rehabilitation Robotics: Human-Friendly Technologies on Movement Assistance and Restoration for People with Disabilities 25–44 (2004).
- 7.Cowan, R. E. et al. Recent trends in assistive technology for mobility. J. Neuroeng. Rehabilit.9, 1–8 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Beaudoin, M. et al. Impacts of robotic arm use on individuals with upper extremity disabilities: A scoping review. Can. J. Occup Ther.85, 397–407 (2018). [DOI] [PubMed] [Google Scholar]
- 9.Uehara, H., Higa, H. & Soken, T. A mobile robotic arm for people with severe disabilities. In 2010 3rd IEEE RAS & EMBS International Conference on Biomedical Robotics and Biomechatronics, 126–129 (IEEE, 2010).
- 10.Topping, M. J. & Smith, J. K. The development of handy 1. a robotic system to assist the severely disabled. Technol. Disabil.10, 95–105 (1999). [Google Scholar]
- 11.Driessen, B. J., Kate, T. T., Liefhebber, F., Versluis, A. & Woerden, J. V. Collaborative control of the manus manipulator. Univ. Access Inform. Soc.4, 165–173 (2005). [Google Scholar]
- 12.Hillman, M., Hagan, K., Hagan, S., Jepson, J. & Orpwood, R. The Weston wheelchair mounted assistive robot-the design story. Robotica20, 125–132 (2002). [Google Scholar]
- 13.Lin, C.-S., Ho, C.-W., Chen, W.-C., Chiu, C.-C. & Yeh, M.-S. Powered wheelchair controlled by eye-tracking system. Opt. Appl.36, 401–412 (2006). [Google Scholar]
- 14.Carmigniani, J. & Furht, B. Augmented reality: an overview. Handbook of augmented reality 3–46 (2011).
- 15.Azuma, R. et al. Recent advances in augmented reality. IEEE Comput. Gr. Appl.21, 34–47 (2001). [Google Scholar]
- 16.Arena, F., Collotta, M., Pau, G. & Termine, F. An overview of augmented reality. Computers11, 28 (2022). [Google Scholar]
- 17.Ajmi, F., Abdallah, S. B., Othman, S. B., Zgaya-Biau, H. & Hammadi, S. An innovative system to assist the mobility of people with motor disabilities. In 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), 4037–4043 (IEEE, 2019).
- 18.Rashid, Z., Melià-Seguí, J., Pous, R. & Peig, E. Using augmented reality and internet of things to improve accessibility of people with motor disabilities in the context of smart cities. Future Gener. Comput. Syst.76, 248–261 (2017). [Google Scholar]
- 19.Abdallah, S. B., Ajmi, F., Ben Othman, S., Vermandel, S. & Hammadi, S. Augmented reality for real-time navigation assistance to wheelchair users with obstacles management. In Computational Science–ICCS 2019: 19th International Conference, Faro, Portugal, June 12–14, 2019, Proceedings, Part V 19, 528–534 (Springer, 2019).
- 20.Ridha, A. M. & Shehieb, W. Assistive technology for hearing-impaired and deaf students utilizing augmented reality. In 2021 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), 1–5 (IEEE, 2021).
- 21.Cio, Y.-S.L.-K., Raison, M., Ménard, C. L. & Achiche, S. Proof of concept of an assistive robotic arm control using artificial stereovision and eye-tracking. IEEE Trans. Neural Syst. Rehabilit. Eng.27, 2344–2352 (2019). [DOI] [PubMed] [Google Scholar]
- 22.Zahir, E., Hossen, M. A., Al Mamun, M. A., Amin, Y. M. & Ishfaq, S. Implementation and performance comparison for two versions of eye tracking based robotic arm movement. In 2017 International Conference on Electrical, Computer and Communication Engineering (ECCE), 203–208 (IEEE, 2017).
- 23.Sunny, M. S. H. et al. Eye-gaze control of a wheelchair mounted 6dof assistive robot for activities of daily living. J. Neuro Eng. Rehabilit.18, 1–12 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.McMullen, D. P. et al. Design and preliminary evaluation of an augmented reality interface control system for a robotic arm. Johns Hopkins APL Tech. Dig35, 220–230 (2020). [Google Scholar]
- 25.Chen, X., Huang, X., Wang, Y. & Gao, X. Combination of augmented reality based brain-computer interface and computer vision for high-level control of a robotic arm. IEEE Trans. Neural Syst. Rehabilit. Eng.28, 3140–3147 (2020). [DOI] [PubMed] [Google Scholar]
- 26.Weber, D., Kasneci, E. & Zell, A. Exploiting augmented reality for extrinsic robot calibration and eye-based human-robot collaboration. In 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 284–293 (IEEE, 2022).
- 27.Wang, S. et al. What you see is what you grasp: User-friendly grasping guided by near-eye-tracking. In 2023 IEEE International Conference on Development and Learning (ICDL), 194–199 (IEEE, 2023).
- 28.Jocher, G., Qiu, J. & Chaurasia, A. Ultralytics YOLO (2023).
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.