ABSTRACT
Numerous minimally invasive procedures, including biopsies, ablations and neurostimulation, rely on the accurate placement of a needle under image guidance. Mixed reality (MR) head‐mounted displays (HMDs) offer a promising solution to enhance the guidance of these procedures without radiation. While the performance of the established HoloLens 2, now discontinued, is well‐documented, the clinical viability of its potential successors remains unproven. This study provides the first direct comparative benchmark of the HoloLens 2, Magic Leap 2 and Apple Vision Pro for a high‐precision needle insertion task, using sacral nerve stimulation (SNS) as a validation scenario. We developed custom applications for each platform and evaluated performance with 11 users, including nine clinicians without prior HMD experience, through a randomized protocol. Quantitative and qualitative analyses were conducted to assess procedural efficiency and user experience. Results identified the Magic Leap 2 as the most effective platform, demonstrating significantly higher success rates and superior usability scores, driven by its ergonomic design and stable tracking. In contrast, the Apple Vision Pro, despite offering superior visual fidelity, proved unsuitable for this navigation task. Its performance was critically hampered by unstable marker tracking, significant device weight, and a lack of accommodation for prescription glasses, which disrupted the clinical workflow. The HoloLens 2 performed as a reliable, albeit less usable, baseline. We conclude that, for this surgical navigation purpose, the optimal HMD is determined by a balanced combination of tracking reliability, user comfort, and practical workflow integration rather than by the device's raw technical specifications. These findings emphasize the importance of device selection in enhancing procedural outcomes and clinical training.
Keywords: augmented reality, biomedical imaging, computer vision, medical computing, needles, neuromuscular stimulation, object tracking, rendering (computer graphics), surgery, tracking
This study provides the first direct comparative benchmark of the HoloLens 2, Magic Leap 2 and Apple Vision Pro for a high‐precision, image‐guided needle insertion task. The Magic Leap 2 proved to be the most effective platform, demonstrating significantly higher success rates and superior usability due to its stable tracking and ergonomic design. In contrast, the Apple Vision Pro was found unsuitable for this clinical workflow, highlighting that a balance of tracking reliability and user comfort is more critical for surgical navigation than raw visual fidelity.

Abbreviations
- AVP
Apple Vision Pro
- HL2
HoloLens 2
- HMD
head‐mounted display
- ML2
Magic Leap 2
- MR
mixed reality
- OST
optical see‐through
- VST
video see‐through
1. Introduction
Minimally invasive surgeries (MIS) frequently depend on image‐guided needle insertions for procedures like biopsies, ablations and neurostimulation therapies. The clinical success of these interventions hinges on precise spatial alignment to reach deep or narrow anatomical targets safely [1]. However, conventional 2D visualization on monitors disrupts the surgical workflow by forcing clinicians to constantly shift their focus away from the patient, breaking natural hand‐eye coordination [2]. This cognitive gap between the 2D image and the 3D patient space can lead to increased procedure times, repeated needle passes, and a higher risk of complications [3]. Mixed reality (MR) directly addresses this fundamental human–computer interaction (HCI) challenge in computer‐assisted interventions (CAI) by overlaying 3D anatomical models into the clinician's direct field of view. This creates an intuitive and unified perceptual space, enhancing spatial awareness and procedural efficiency [4].
Head‐mounted displays (HMDs) are the primary vehicle for delivering these MR experiences. This concept traces back to Sutherland's pioneering HMD in 1968 [5], with medical applications appearing in the late 1980s [6]. Since then, HMDs have been progressively gaining traction in surgical research, supported by a growing body of literature highlighting their potential to enhance intraoperative visualization, spatial awareness, and clinical decision‐making [7]. The Microsoft HoloLens 2 (HL2), an optical see‐through (OST) device, has become the gold‐standard in surgical research. Numerous studies have validated its utility, demonstrating reduced needle passes and enhanced spatial perception in tasks like ultrasound‐guided biopsies and orthopaedic interventions [8, 9, 10, 11, 12]. To properly assess its value, researchers have conducted different comparative analyses. These studies have benchmarked both generations of HL systems against the conventional 2D monitors standard of care, often demonstrating improved task times and accuracy [13, 14]. They have also established benchmarks for AR‐guided procedures, with placement accuracies reported in the range of 4 mm in a phantom study [15]. Furthermore, a comparative study confirmed the generational improvements of HL2 in ergonomics and stability over the HL1, solidifying its role as a mature clinical research tool [16].
However, despite its proven role, the HL2 has been discontinued, creating a pressing need to identify and validate the next generation of HMDs. The most promising successors, the Magic Leap 2 (ML2) and the Apple Vision Pro (AVP), introduce significant technological shifts. The ML2 enhances the OST paradigm with a wider field of view and dynamic dimming, with studies highlighting its potential in surgical scenarios through novel infrared marker tracking strategies [17]. An earlier comparative study involving the previous generation of Magic Leap already highlighted a user preference for the ML1 due to better visualization and interaction [18]. The AVP, meanwhile, pioneers a high‐fidelity video see‐through (VST) approach, with early work emphasizing its potential for immersive visualization in laparoscopy [19], but no specific research has been reported on marker tracking in the surgical navigation domain.
Despite this preliminary work, there is a critical lack of validated, comparative data on their performance for high‐precision surgical navigation tasks. Crucially, for the AVP, no formal studies have yet reported its end‐to‐end tracking accuracy in a surgical context, leaving its clinical viability for such applications as an open and pressing question.
To address this technological gap, our study introduces a reproducible and broadly applicable benchmark based on a deliberately simple yet representative needle insertion task. This provides the first rigorous comparative validation of these devices' performance—in terms of task time, success rate, target error, usability, and qualitative feedback—within a consistent framework.
This leads to a central hypothesis that newer devices like the ML2 and AVP will offer superior navigation performance over the established HL2. However, a simple comparison of technical specifications is insufficient, as factors beyond raw performance—such as virtual content stability, latency, ergonomics, and the fundamental difference between OST and VST display—can profoundly impact surgical utility. Therefore, our study was designed to conduct a holistic evaluation. Our primary contributions are:
The first clinical benchmark, in a surgical navigation context, of the leading OST (HL2, ML2) and VST (AVP) devices;
A quantitative analysis of tracking accuracy and task efficiency, providing the first reported distance error metrics for the AVP in a phantom‐based surgical guidance scenario;
Evidence‐based insights into the clinical suitability and limitations of OST and VST MR devices, particularly regarding spatial awareness and comfort for prolonged use.
By addressing both technical and experiential factors, this study aims to guide the design and implementation of next‐generation MR systems in surgical navigation and medical training.
2. Methods and Materials
2.1. Device Evaluation
This study evaluates three state‐of‐the‐art HMDs for their suitability in MR‐guided needle insertion procedures: the HoloLens 2 (HL2), Magic Leap 2 (ML2) and Apple Vision Pro (AVP) (see Figure 1). These devices were selected based on their prominence in clinical research, technological maturity, and potential impact in spatial computing applications. The HL2 serves as a well‐established benchmark, having been widely validated in surgical navigation studies that confirmed its feasibility but also highlighted limitations related to its restricted field of view (FoV) and the stability of virtual content. In contrast, the ML2 and AVP represent recent technological advancements. The ML2 introduces innovations such as a wider FoV and dynamic dimming to improve legibility of 3D models in brightly lit environments. The AVP pioneers a high‐fidelity video pass‐through approach, which promises superior visual immersion and photorealistic rendering by leveraging powerful onboard processing and micro‐OLED displays. Table 1 summarizes the key technical specifications of each device.
FIGURE 1.

(a) HoloLens 2 device. (b) Magic Leap 2 device. (c) Apple Vision Pro device.
TABLE 1.
Overview of HMD technical specifications.
| Specification | HoloLens 2 [20] | Magic Leap 2 [21, 22] | Apple Vision Pro [23] |
|---|---|---|---|
| Display type | See‐through waveguide | See‐through with dynamic dimming | Video pass‐through (micro‐OLED) |
| Resolution (per eye) | 2048 × 1080 | 1440 × 1760 | 3660 × 3200 |
| Field of view | diagonal | diagonal | diagonal (estimated) |
| Interaction | Hand tracking, gaze, voice | Hand tracking, eye tracking, controller | Hand gestures, eye tracking, voice |
| Tracking system | Inside‐out (depth + IMU) | Inside‐out with spatial anchors | Inside‐out with LiDAR + cameras |
| Operating system | Windows Holographic OS | Android 10 | VisionOS |
| Processor | Qualcomm Snapdragon 850 | AMD Quad‐core Zen2 x86 | Apple M2 chip |
| Weight | 566 g | 248 g (headset only) | 600–650 g (headset only) |
| Notable features | Ergonomic; remote assist | High FoV; ambient light dimming | High‐fidelity visuals; immersive |
| Commercial launch | 2020 | 2022 | 2024 |
The three HMDs differ significantly in terms of display technology, interaction modalities, system architecture, and development frameworks—factors that may influence spatial tracking and the stability of virtual content, which are hypothesized to directly impact performance in clinical MR guidance scenarios. To assess this, the sacral nerve stimulation (SNS) procedure was selected as a representative use case. This minimally invasive therapy, in which procedural success hinges on accurate placement of the needle through a target sacral foramen to deliver electrical stimuli, is conventionally guided by fluoroscopy.
To systematically evaluate the impact of these device differences, the following sections describe the application development, experimental design, and performance metrics used in this study, including both objective technical benchmarks and subjective user assessments.
2.2. Navigation System Based on MR
An MR‐based navigation system was developed to assist with needle insertion procedures by providing virtual visualizations of key anatomical structures. The system uses marker‐based tracking to overlay virtual representations of the anatomy, helping the user accurately guide the needle to the target. The system was specifically tested in an experimental setup for the SNS procedure to demonstrate its potential effectiveness in a clinical setting.
The initial development of the MR system was conducted using the Unity Engine 2022.3.30f1 LTS version [24]. Unity provided the tools necessary for building the 3D models and interactive elements that form the core navigation system, including real‐time spatial tracking, visualizations and user interaction. To enhance the functionality of the system on MR devices, the Mixed Reality Toolkit 3 (MRTK3) [25]—a cross‐platform framework developed by Microsoft—was integrated with Unity. By using MRTK3, the system incorporated key features, such as gesture‐based interaction, spatial awareness and intuitive user interface components.
Marker‐based tracking was implemented using the Vuforia library [26], which enabled real‐time detection and tracking of a physical marker placed on the posterior–inferior region of the anatomical phantom. This marker served as the spatial reference point for rendering 3D models in their corresponding anatomical position, including the skin, sacral bone, needle trajectory, insertion depth indicators and a guiding cone to assist with trajectory pivoting.
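The registration principle described above is independent of the specific tracking library: once the marker's pose is known in world coordinates, every virtual model pre‐registered relative to the marker is mapped into the scene with a single homogeneous transform. The following is a minimal conceptual sketch of that mapping; the function names, pose and point values are illustrative, not the study's actual Unity/Vuforia code.

```python
# Conceptual sketch of marker-based registration: virtual content defined in
# the marker's coordinate frame is placed in the world by applying the
# marker pose reported by the tracking library each frame.

def mat_vec(T, p):
    """Apply a 4x4 homogeneous transform T to a 3D point p."""
    x, y, z = p
    v = (x, y, z, 1.0)
    return tuple(sum(T[r][c] * v[c] for c in range(4)) for r in range(3))

# Example pose: marker translated 10 cm along the world X-axis, no rotation.
T_world_marker = [
    [1.0, 0.0, 0.0, 0.10],
    [0.0, 1.0, 0.0, 0.00],
    [0.0, 0.0, 1.0, 0.00],
    [0.0, 0.0, 0.0, 1.00],
]

# A hypothetical target point defined 5 cm below the marker origin
# (marker frame); its world position follows from the marker pose.
target_marker = (0.0, 0.0, -0.05)
target_world = mat_vec(T_world_marker, target_marker)
```

In practice the tracking library (Vuforia on the OST devices, ARKit image anchors on the AVP) updates `T_world_marker` continuously, so the overlaid anatomy follows the physical phantom.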
The system also featured an interactive user interface displayed within the MR environment. The interface included a horizontal menu with on/off buttons based on an MRTK3 prefab (see Figure 2). The menu allowed the user to toggle the visibility of each digital component according to their guidance needs.
FIGURE 2.

MRTK3 horizontal menu prefab adapted for the procedure.
Given the incompatibility between Vuforia and VisionOS, a separate implementation of the navigation system was developed natively for the AVP using SwiftUI [27] rather than Unity. SwiftUI is a declarative, multiplatform framework designed by Apple for constructing user interfaces, built on top of the Swift programming language.
To enable environmental understanding and appropriate placement of 3D models in the physical space, we utilized ARKit and RealityKit. ARKit provides access to multiple sensor data streams to facilitate spatial awareness. Specifically, we employed its image anchoring capability to pre‐register a marker image that served as the tracking reference for the system. Upon detection, ARKit established and continuously updated an anchor as the image moved within the environment.
Rendering of 3D content in the Vision Pro environment was achieved through RealityKit, the native rendering engine for VisionOS. While SwiftUI handles the display of conventional visual elements through view structures, RealityKit is required for rendering advanced immersive content such as volumes and immersive spaces. In our implementation, an immersive space was deployed to enhance user comfort and accessibility, providing a fully immersive experience.
The developed user interface of AVP comprised two main components: a control window and a RealityView. The control window included the menu with on/off switches to toggle the visibility of individual 3D elements, while the RealityView is a specialized view responsible for rendering the preloaded 3D assets during application startup.
2.3. Experimental Setup
The experimental setup employed in this study builds on an existing phantom‐based evaluation method for needle insertion navigation on a custom sacral phantom [28]. While the original setup included multiple navigation modalities—simulated fluoroscopy, optical tracking, and smartphone‐based MR—the present work focuses exclusively on MR‐based guidance using HMDs.
An anthropomorphic sacral phantom, derived from preoperative CT data of a patient undergoing a SNS procedure, was used as the anatomical substrate for validation. The phantom included both hard (bone) and soft (skin and fat) tissue analogues. To support spatial registration, it incorporated six conical surface landmarks [28].
The existing setup included an MR cubic marker for MR guidance with the smartphone. For this experiment, a planar two‐dimensional marker (4 × 4 cm) compatible with the detection libraries for the HMDs was designed and 3D printed. The marker was rigidly fixed to a custom adapter positioned over the superior gluteal region of the phantom, and was slightly tilted to optimize line‐of‐sight recognition for all evaluated HMDs.
In addition to MR visualization, the validation setup integrated an optical tracking system based on the OptiTrack V120:Duo [29], serving as the gold standard for verifying the proximity of the needle tip to the target. Two infrared (IR) markers were used: one fixed to the phantom and another attached to the needle. These markers allowed for validation of the needle's proximity to the anatomical target (the sacral foramen) throughout the insertion process.
Proximity validation was performed using a 3D Slicer module designed for assessing navigation accuracy in phantom‐based procedures [28, 30]. This module computed the distance between the needle tip and the target, and collected data about the number of punctures, successful attempts and time spent per puncture, providing quantitative performance feedback. This setup (see Figure 3) enabled an isolated and repeatable evaluation of each HMD's performance under identical anatomical conditions. Figure 4 shows the digital view from the user's perspective with each of the devices.
FIGURE 3.

Virtual and physical phantom with MR and IR markers.
FIGURE 4.

First‐person perspective of the digital guidance from the three HMDs. The apparent registration error in the Magic Leap 2 image is a known artifact caused by an offset between the device's video capture sensor and the user's viewpoint.
2.4. Evaluation
To systematically assess the developed navigation system, we designed a comprehensive evaluation protocol. The following sections detail the methodology used, including the participant demographics, the experimental procedure each user followed, the objective performance metrics collected, the subjective usability instruments employed, and the statistical methods used for data analysis.
2.4.1. Participants
A total of N = 17 participants were recruited for this study, which was conducted in two sequential phases: a preliminary study to refine the protocol and a main clinical evaluation.
2.4.1.1. Phase 1: Preliminary Study (N = 6)
The initial phase was designed to refine the experimental protocol and gather early feedback on the system's usability. This phase included six participants: one clinician and five engineers with varying levels of experience in MR systems. These participants performed an initial version of the needle insertion task and subsequently completed the usability questionnaires.
2.4.1.2. Phase 2: Main Clinical Evaluation (N = 11)
This phase constituted the main evaluation, involving a cohort of 11 users who performed the complete MR‐guided needle insertion task. This clinically‐focused group was composed of nine clinicians with varying levels of surgical experience (ranging from first‐year residents to specialists) and two senior engineers familiar with the procedure. All participants in this phase completed the usability questionnaires immediately following the task.
2.4.1.3. Data Analysis Cohorts
Based on this two‐phase design, the data were separated for analysis. For the performance and accuracy analysis, only data from the main clinical evaluation were used to ensure methodological consistency (phase 2, N = 11). For the usability analysis, data from all participants across both phases were pooled to provide a more robust assessment (N = 17).
2.4.2. Procedure
Each participant evaluated all three MR devices (HL2, ML2, and AVP) in a single session. To mitigate learning effects, the order in which the devices were presented was randomized for each participant.
For each device, participants first received a brief introduction to the visual guidance system. Subsequently, they were allotted a 5 min timed session to perform as many needle insertions as possible. The core task for each insertion attempt involved the following steps:
1. The participant wore the HMD and detected the 2D MR marker to register the virtual content with the physical world.
2. Using the virtual menu, the participant could toggle the visibility of anatomical models (skin, bone) and guidance aids (trajectory, depth indicators) according to their preference.
3. Guided by the MR overlay, the participant performed the needle insertion, manually aligning the physical needle with the virtual trajectory using their spatial perception and hand‐eye coordination, as no real‐time rendering of the instrument was displayed.
4. If bone contact occurred, the attempt was recorded as an insertion without stimulation, and the participant was required to withdraw the needle and start a new trial.
5. Once the participant judged the target depth to be reached, the final needle placement was verified using the external optical tracking system, which served as the ground truth.
After completing the sessions with all three devices, participants filled out the subjective questionnaires comparing their experience.
2.4.3. Performance Evaluation
To evaluate task performance, an insertion attempt was defined as a full cycle of needle insertion into the phantom's soft tissue and its subsequent withdrawal. Attempts were classified based on their outcome: a successful insertion required the needle tip to enter the correct sacral foramen regardless of the depth, while a failed insertion was recorded when the operator believed the placement was complete, but the tracked position was outside the foramen. Both successful and failed insertions were considered stimulations, as they represented completed procedures from the operator's perspective. In contrast, attempts aborted due to bone contact were not classified as stimulations and were excluded from positional accuracy analysis, though they were included in the total attempt count. Based on these outcomes, the following metrics were quantified:
Attempts to first success: The median number of attempts required to achieve the first successful insertion.
Number of punctures: The median number of insertions performed per user within the 5 min session.
Time per puncture (s): The median time taken for each insertion attempt.
Success rate (%): The percentage of successful insertions out of the total number of attempts.
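The attempt classification and session metrics defined above can be reduced from a per‐session log in a straightforward way. The sketch below is a minimal illustration of that bookkeeping, assuming a hypothetical log format of (outcome, duration) tuples; the field names and example data are invented, not the study's 3D Slicer module.

```python
# Reduce one 5-minute session log to the reported metrics. Outcomes follow
# the definitions in the text: 'success' and 'fail' both count as
# stimulations; 'bone' contacts count toward total insertions only.

def session_metrics(attempts):
    """attempts: list of (outcome, duration_s) tuples in trial order."""
    total = len(attempts)
    successes = sum(1 for o, _ in attempts if o == "success")
    stimulations = sum(1 for o, _ in attempts if o in ("success", "fail"))
    # Attempts to first success (None if the user never succeeded).
    first = next((i + 1 for i, (o, _) in enumerate(attempts)
                  if o == "success"), None)
    return {
        "punctures": total,
        "stimulations": stimulations,
        "success_rate_pct": 100.0 * successes / total if total else 0.0,
        "attempts_to_first_success": first,
    }

# Illustrative log: one bone contact, two successes, one failed stimulation.
log = [("bone", 10), ("success", 14), ("fail", 12), ("success", 11)]
metrics = session_metrics(log)
```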
Puncture accuracy was quantified by calculating the error between the final needle tip position and the intended target point (see Figure 5). The error metrics are reported as median (IQR) and were decomposed into three components:
Distance error (mm): The total Euclidean distance in 3D space.
Lateral error (mm): The 2D radial error in the plane perpendicular to the ideal trajectory, calculated as √(Δx² + Δy²).
Depth error (mm): The absolute error along the insertion axis (Z‐axis).
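This decomposition is a direct consequence of expressing the tip‐to‐target vector in a frame whose Z‐axis is aligned with the ideal trajectory. A minimal sketch, assuming coordinates are already in that frame:

```python
import math

# Decompose the 3D tip-to-target error into the three reported components,
# assuming the insertion axis is the Z-axis of the tracking frame.

def decompose_error(tip, target):
    dx, dy, dz = (t - g for t, g in zip(tip, target))
    distance = math.sqrt(dx**2 + dy**2 + dz**2)   # total 3D Euclidean error
    lateral = math.sqrt(dx**2 + dy**2)            # radial error in XY-plane
    depth = abs(dz)                               # error along insertion axis
    return distance, lateral, depth

# Illustrative case: tip 3 mm lateral of and 4 mm short of the target.
d, lat, dep = decompose_error((3.0, 0.0, -4.0), (0.0, 0.0, 0.0))
```

Note that distance² = lateral² + depth², so the three components are mutually consistent by construction.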
FIGURE 5.

Illustration of the distance error. The 3D error vector between the needle tip and the target is decomposed into a lateral error (2D radial distance in the XY‐plane) and a depth error (Z‐axis offset). Views: (a) right‐side and (b) posterior.
2.4.4. Usability Evaluation
Subjective user experience was assessed using two instruments:
System usability scale (SUS): A validated 10‐item questionnaire yielding a global usability score from 0 to 100 [31].
Custom Likert‐scale questionnaire: A 13‐item questionnaire designed to evaluate specific aspects of the user experience. These items were adapted from previous HMD comparison studies [16, 18] in medical contexts and were grouped into four categories: ergonomics and comfort; visual quality and immersive experience; interaction and controls; and physical discomfort. Participants responded using a 5‐point agreement scale following the Likert technique [32].
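The SUS score is computed with the standard Brooke scoring rule: odd‐numbered (positively worded) items contribute their rating minus 1, even‐numbered (negatively worded) items contribute 5 minus their rating, and the sum is scaled by 2.5 to a 0–100 range. A minimal sketch:

```python
def sus_score(responses):
    """Standard SUS scoring: responses are ten ratings on a 1-5 scale.
    Odd-numbered items contribute (r - 1), even-numbered items (5 - r);
    the sum of contributions (0-40) is scaled by 2.5 to 0-100."""
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

# All-neutral responses (3s) map to the scale midpoint of 50.
midpoint = sus_score([3] * 10)
```

Per‐device means of these per‐participant scores yield the values reported in Table 5.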
2.4.5. Statistical Analysis
All statistical analyses were conducted with a significance level of α = 0.05. Given that the data were not consistently normally distributed, non‐parametric tests were primarily used.
To compare metrics across the three devices (HL2, ML2, AVP), a Kruskal–Wallis H test was performed. If a statistically significant difference was found, pairwise comparisons were conducted using a Dunn post‐hoc test with a Bonferroni correction for multiple comparisons.
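For reference, the Kruskal–Wallis H statistic is computed from pooled ranks across the three device groups. The sketch below implements the textbook formula without tie correction, purely for illustration; in practice a library routine (for example `scipy.stats.kruskal`, with a Dunn post‐hoc from a package such as scikit‐posthocs) would be used, and the data values here are invented.

```python
# Kruskal-Wallis H from pooled ranks (no tie correction; assumes unique
# values). H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1), where R_i is the
# rank sum of group i and N the pooled sample size.

def kruskal_wallis_h(*groups):
    pooled = sorted(v for g in groups for v in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # 1-based ranks
    n = len(pooled)
    s = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * s - 3 * (n + 1)

# Illustrative per-user success rates (invented) for HL2, ML2, AVP.
h = kruskal_wallis_h([68.4, 51.9, 73.1],
                     [84.6, 71.5, 96.2],
                     [52.5, 25.0, 69.5])

# A Bonferroni correction for the three pairwise post-hoc comparisons
# simply multiplies each raw p-value by 3 (capped at 1.0):
def bonferroni(p_raw, m=3):
    return min(1.0, p_raw * m)
```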
To analyse the learning effect, a two‐way ANOVA was used to test for an interaction effect between the independent variables device (categorical) and trial order (continuous) on the binary success rate.
3. Results
3.1. Performance Evaluation
A summary of the total procedural attempts and their clinical outcomes for each MR‐HMD is presented in Table 2. Quantitative performance metrics, shown in Table 3, were compared using non‐parametric Kruskal–Wallis tests.
TABLE 2.
Summary of procedural metrics across different MR‐HMD devices.
| Metric | HoloLens 2 | Magic Leap 2 | Apple Vision Pro |
|---|---|---|---|
| Total number of insertions | 189 | 210 | 179 |
| Total number of stimulations | 149 | 182 | 123 |
| Total number of successful attempts | 122 | 164 | 87 |
| Total number of bone punctures | 40 | 28 | 56 |
TABLE 3.
Median (IQR) of performance metrics for the three devices, with corresponding ‐values.
| Parameter | HoloLens 2 | Magic Leap 2 | Apple Vision Pro | ‐value |
|---|---|---|---|---|
| Attempts to first success | 2.00 [1.00 ‐ 2.50] | 2.00 [1.00 ‐ 3.00] | 2.00 [1.00 ‐ 8.25] | 0.6874 |
| Number of punctures | 19.00 [12.00 ‐ 21.50] | 14.00 [13.00 ‐ 25.00] | 16.50 [13.00 ‐ 23.50] | 0.6790 |
| Time/Puncture [s] | 14.00 [11.00 ‐ 20.00] | 12.00 [9.00 ‐ 19.00] | 13.00 [9.00 ‐ 21.00] | 0.0069 * |
| Success rate [%] | 68.42 [51.92 ‐ 73.10] | 84.62 [71.54 ‐ 96.15] | 52.46 [25.00 ‐ 69.46] | 0.0385 * |
Statistically significant value (p < 0.05).
Statistically significant differences were found for time per puncture (p = 0.0069) and success rate (p = 0.0385) metrics. Descriptively, ML2 showed the highest median success rate (84.62%) and the fastest median puncture time (12.00 s). No statistically significant differences were observed for attempts to first success (p = 0.6874) or the median number of punctures per user (p = 0.6790). The distribution of success rates for each device is visualized in Figure 6. While variability remains high, a trend towards improved performance with HL2 and ML2 is observed. AVP displays broader inter‐subject variability.
FIGURE 6.

Distribution of success rates across all the devices.
The effect of practice was analysed to assess learning curves (Figure 7). A two‐way ANOVA revealed a statistically significant interaction effect between the device and the trial order on success rate (p < 0.05), showing that the change in performance over time was different for each HMD.
FIGURE 7.

Interaction plot showing the learning curves for each device. The mean success rate improves differently with trial order depending on the device, indicating a significant interaction effect. Shaded areas represent the 95% confidence interval.
Puncture accuracy was quantified by analysing the error between the final needle tip position and the intended target for successful and failed insertions (Table 4).
TABLE 4.
Median (IQR) distance error, lateral error, and depth error for successful versus failed insertions.
| Outcome | Device | Total distance error (mm) | Lateral error (mm) | Depth error (mm) |
|---|---|---|---|---|
| Successful | HL2 | 7.90 [5.06–11.21] | 2.44 [1.52–3.35] | 7.14 [4.13–10.95] |
| | ML2 | 7.99 [4.27–11.49] | 2.90 [1.79–4.06] | 7.19 [2.74–10.98] |
| | AVP | 6.34 [5.13–8.68] | 2.76 [1.87–4.02] | 5.36 [3.30–8.23] |
| Failed | HL2 | 8.54 [6.32–10.63] | 8.40 [5.83–10.40] | 1.22 [0.52–3.22] |
| | ML2 | 12.70 [10.24–14.82] | 10.76 [6.60–13.51] | 5.64 [1.29–8.60] |
| | AVP | 20.09 [13.96–28.14] | 19.39 [13.74–26.38] | 6.23 [1.89–10.20] |
During successful insertions, median lateral errors were low for all devices, remaining below 3.0 mm. The AVP was associated with the lowest median total distance error at 6.34 [5.13−8.68] mm. The lateral errors for all devices were comparable, with HL2 at 2.44 [1.52−3.35] mm, ML2 at 2.90 [1.79–4.06] mm, and AVP at 2.76 [1.87–4.02] mm.
During failed insertions, a notable divergence in performance was recorded. Failed attempts with the AVP resulted in a large median lateral error (19.39 [13.74–26.38] mm) and total distance error (20.09 [13.96–28.14] mm). These error values were considerably higher than those for both the HL2 (8.40 [5.83–10.40] mm lateral error) and the ML2 (10.76 [6.60–13.51] mm lateral error).
3.2. Usability Evaluation
User experience was quantified via the SUS, with results shown in Table 5 and Figure 8. The responses from one participant were excluded due to suspected anomalies suggesting potential misunderstanding or misjudgment in rating the devices. The mean SUS scores for ML2 (75.47) and AVP (70.17) correspond to ‘good’ usability ratings. In contrast, HL2 scored lower (64.69), indicating ‘marginal’ usability, and also exhibited the greatest variability in user ratings (SD = 18.00).
TABLE 5.
System usability scale (SUS) results for each headset.
| Metric | HoloLens 2 | Magic Leap 2 | Apple Vision Pro |
|---|---|---|---|
| Mean SUS score | 64.69 | 75.47 | 70.17 |
| Standard deviation | 18.00 | 15.66 | 16.10 |
| Interpretation | Marginal usability | Good usability | Good usability |
FIGURE 8.

Boxplot comparing the perceived usability via SUS scores across the three devices. The results show a higher median usability score for the AVP compared to the HL2 and ML2.
In addition to the overall SUS score, a custom Likert‐scale questionnaire provided a detailed breakdown of user experience. These results, summarized in Table 6 and visualized in the grouped boxplots in Figure 9, indicate distinct usability profiles for each device.
TABLE 6.
Category scores with mean and standard deviation values expressed as the percentage based on Likert scoring for each HMD.
| Category | HoloLens 2 | Magic Leap 2 | Apple Vision Pro |
|---|---|---|---|
| Ergonomics and comfort | 69.53 ± 17.95 | 81.25 ± 18.68 | 51.25 ± 22.93 |
| Visual quality and immersive experience | 59.38 ± 18.40 | 78.52 ± 16.61 | 72.08 ± 18.73 |
| Interaction and control | 69.53 ± 20.52 | 75.00 ± 14.25 | 79.58 ± 20.11 |
| Physical discomfort | 67.19 ± 28.46 | 67.19 ± 35.02 | 53.13 ± 30.10 |
FIGURE 9.

Boxplot comparing the user experience in the described four categories for each HMD.
The ML2 received the highest scores for ergonomics and comfort and visual quality and immersive experience. In contrast, the AVP was rated highest in interaction and control, while simultaneously receiving the lowest scores for ergonomics and comfort and physical discomfort. The HL2 scores were positioned between the other two devices across most categories.
A detailed breakdown of all individual Likert‐scale responses is visualized in Figure 10.
FIGURE 10.

Likert‐questionnaire percentage responses for the three HMDs.
4. Discussion
This study provides a direct comparative evaluation of three state‐of‐the‐art MR HMDs (HoloLens 2, Magic Leap 2, and Apple Vision Pro) for a representative surgical needle‐insertion guidance task. Despite employing a consistent user interface across all devices, the results demonstrated meaningful differences between them, both in task performance and user‐perceived usability.
4.1. Performance
Our quantitative results indicate that the choice of HMD significantly impacts procedural efficiency. The ML2 demonstrated the highest success rate and the fastest median time per puncture, suggesting its combination of a wide field of view, stable tracking, and ergonomic design provides a more effective guidance environment.
Regarding the number of attempts to first success, our observed median of 2.0 across all HMDs (see Table 3) compares favourably to the mean of 3.4 ± 0.9 attempts to target reported by [8] for a similar HL2‐guided task. It is, however, higher than the median of 1 [1–2] reported by [28] for a smartphone‐based AR system tested on the same clinical procedure. This difference relative to the smartphone study is likely multifactorial, reflecting both its different evaluation methodology, based on a ‘first‐success‐ends‐trial’ protocol, and the initial learning curve associated with a novel HMD interface for first‐time users.
Conversely, the time per puncture was substantially lower in our study, with the ML2 recording 12.0 s compared to 41.67 s for the smartphone system [28]. Taken together, these results suggest that despite a minimal familiarization hurdle, the hands‐free HMD workflow is highly efficient, requiring fewer attempts than comparable HMD‐based procedures and being considerably faster per attempt than handheld AR.
Regarding accuracy, prior studies on HL2‐guided procedures have established benchmarks for total placement error. The work conducted in [15] on prostate interventions reported a mean total error of 4.14 ± 1.08 mm. While the median total distance errors in our study were higher (ranging from 6.34 to 7.99 mm), this is primarily driven by the depth component of the error. However, considering the deviation from the needle's trajectory, the obtained median lateral errors of 2.44 to 2.90 mm demonstrate a close alignment of the needle tip to the target.
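The distinction between total, depth and lateral error can be made concrete with a short sketch. The following function (names and structure are illustrative, not taken from the study's processing pipeline, and it assumes a straight planned trajectory from the entry point to the target) decomposes the tip‐to‐target error vector into a signed along‐trajectory (depth) component and a perpendicular (lateral) component:

```python
import math

def error_components(tip, target, entry):
    """Split the total tip-to-target error into a depth component
    (signed, along the planned trajectory) and a lateral component
    (perpendicular to it).

    tip, target, entry: (x, y, z) points in mm; the planned trajectory
    is assumed to run in a straight line from `entry` to `target`.
    """
    axis = [t - e for t, e in zip(target, entry)]
    norm = math.sqrt(sum(a * a for a in axis))
    axis = [a / norm for a in axis]                # unit vector along trajectory
    err = [p - t for p, t in zip(tip, target)]     # tip-to-target error vector
    total = math.sqrt(sum(e * e for e in err))
    depth = sum(e * a for e, a in zip(err, axis))  # projection onto the axis
    lateral = math.sqrt(max(total**2 - depth**2, 0.0))
    return total, depth, lateral
```

For example, a tip that stops 3 mm short of the target along the axis and sits 2 mm off‐axis has a lateral error of only 2 mm even though its total error exceeds 3.6 mm, which mirrors how a large depth error can inflate the total error despite close trajectory alignment.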
Interestingly, in this study the AVP presented a paradoxical performance profile. While its median distance error to target on successful attempts was the lowest of all three devices (6.34 mm), indicating high potential for accuracy, it also had the lowest overall success rate (52.46%) and by far the largest error on failed attempts (20.10 mm). This 'high‐risk, high‐reward' pattern aligns directly with qualitative feedback from users, who consistently reported that the AVP's marker‐based tracking was unstable, causing virtual models to 'jump' or drift. This suggests that while the AVP's core spatial understanding is accurate, its current software implementation for marker tracking is constrained by visionOS's strict performance management. For standard applications, the operating system throttles computational power to manage heat, likely causing a drop in tracking FPS and leading to high errors when tracking stability is momentarily lost. In contrast, the HL2 performed as a reliable but less efficient baseline, with moderate success rates and low distance error metrics.
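The instability users described could, in future work, be quantified rather than only reported qualitatively. One simple metric (purely illustrative; no such measurement was performed in this study, and the function name is ours) is the RMS frame‐to‐frame displacement of the tracked marker's reported position, which rises when the virtual model 'jumps':

```python
import math

def marker_jitter(positions):
    """RMS frame-to-frame displacement of a tracked marker's reported
    position (same units as the input, e.g. mm). A stable tracker on a
    static marker yields values near zero; 'jumping' poses inflate it.

    positions: sequence of (x, y, z) positions, one per rendered frame.
    """
    steps = [math.dist(a, b) for a, b in zip(positions, positions[1:])]
    if not steps:
        return 0.0
    return math.sqrt(sum(s * s for s in steps) / len(steps))
```

Logging such a metric alongside the throttling state of the device would help confirm whether the observed drops in tracking stability coincide with the operating system's thermal management.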
The observed learning curve further supports the potential for user adaptation and skill acquisition over repeated exposures, highlighting the feasibility of MR‐based SNS training paradigms. The significant interaction effect observed in the learning curve analysis further underscores these device‐specific characteristics. While performance improved with practice across all devices, the rate and ceiling of that improvement varied, suggesting that each HMD imposes a different cognitive and adaptive load on the user. The strong performance of the nine clinical users in the cohort reinforces the value of anatomical and procedural knowledge, but the overall results suggest that device stability and ergonomics are equally critical factors for success.
4.2. Usability
The usability assessment corroborates the performance findings, with ML2 achieving the highest SUS score (75.47, ‘Good'), followed by AVP (70.17, ‘Good') and HL2 (64.69, ‘Marginal'). The detailed Likert‐scale responses explain why. The ML2 was rated most favourably in the ‘ergonomics and comfort’ category, a critical factor for adoption in lengthy surgical procedures. In contrast, the AVP, despite scoring highest on ‘interaction and control’ thanks to its intuitive eye‐tracking, was heavily penalized on comfort metrics due to its significant weight and frontal form factor.
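The SUS scores cited above follow Brooke's standard scoring procedure [31]. As a self‐contained illustration (the function name and input format are ours, not from the study's materials), the 0–100 score is derived from the ten 1–5 Likert responses as follows:

```python
def sus_score(responses):
    """Compute the System Usability Scale score from ten 1-5 Likert
    responses using Brooke's standard scoring: odd-numbered (positively
    worded) items contribute (response - 1), even-numbered (negatively
    worded) items contribute (5 - response), and the summed
    contributions are scaled by 2.5 to yield a 0-100 score.
    """
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("expected ten responses in the range 1-5")
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # items 1,3,5,... sit at even indices
        for i, r in enumerate(responses)
    ]
    return 2.5 * sum(contributions)
```

A uniformly neutral respondent (all 3s) scores 50, which is why values such as the HL2's 64.69 fall in the 'marginal' band while the ML2's 75.47 is rated 'good'.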
The ‘marginal’ SUS score for the HL2 in our study is noteworthy, as prior work has generally reported positive user acceptance (with user‐friendliness rated 3.9/5) [33] and confirmed its significant ergonomic improvements over the HL1 [16]. Our result may reflect the nature of direct comparison; when evaluated alongside newer devices with superior ergonomics like the ML2, the relative shortcomings of the HL2's visualization and virtual model drifting may have become more apparent to users.
The high rating for the ML2 aligns with this trend. The study by [18] found that even the previous generation ML1 was strongly preferred by surgeons over the HL2 for visualization and interaction quality. It is therefore logical that the lighter, more advanced ML2 would solidify this user preference, as reflected in our results.
Crucially, our study identified a major real‐world usability barrier not captured by standard questionnaires: compatibility with prescription glasses. The HL2 was the only device that comfortably accommodated users' own glasses. Several participants struggled to use the ML2 and AVP without their corrective lenses, and one clinician could not complete the AVP trial at all due to severe dizziness (cybersickness). This is not an isolated finding: [18] also reported a participant being unable to complete their study with the ML1 due to this exact issue. This is a critical workflow limitation in a clinical setting, as it would require costly, custom‐per‐user prescription inserts for the AVP or ML2, making device sharing between staff impractical and increasing costs significantly.
4.3. Implications and Limitations
Our findings suggest that for marker‐based surgical navigation the ML2 offers the best balance of performance, usability and comfort. It provides a significant visual upgrade over the HL2 without sacrificing the ergonomic stability required for precision tasks. The HL2, while technologically older, remains a viable and reliable option, especially given its superior handling of prescription eyewear, a non‐trivial logistical advantage in a hospital environment.
The AVP, in its current state, is not yet suitable for this type of high‐precision, marker‐based navigation. Its tracking instability, weight and lack of glasses compatibility present major hurdles for clinical adoption in interventional workflows. This is supported by our performance data, which show that while its potential accuracy is high, its reliability and precision do not yet meet the standards set by prior AR navigation error metrics. However, its outstanding visual fidelity and powerful interaction model make it a compelling platform for other medical applications such as surgical training, pre‐operative planning, and patient consultation, where its limitations are less critical and its immersive strengths can be fully leveraged. Its reliance on video see‐through (VST), which prevents a direct line of sight to the patient, also remains a philosophical and safety concern for live surgery that the community must address.
The primary limitations of this study include the phantom‐based model, the small sample size and the lack of real‐time instrument tracking. Future work should focus on three key areas: (1) Enlarging participant cohorts and validating the ML2's performance in a more formal clinical trial; (2) Investigating alternative tracking methods for the AVP, such as Apple's native object tracking framework, which may yield more stable results than the current marker‐based library; (3) Conducting longitudinal studies to assess long‐term user fatigue and comfort with these devices in realistic surgical settings.
5. Conclusions
In conclusion, this study provides the first direct comparative benchmark of HoloLens 2, Magic Leap 2 and Apple Vision Pro for an MR‐guided surgical needle insertion. Our results demonstrate that while next‐generation headsets offer substantial improvements in visualization and interaction, these benefits must be weighed against the critical factors of tracking stability, ergonomics and practical clinical workflow.
The ML2 emerged as the most effective and well‐rounded device for this application, combining high success rates with superior user‐rated comfort. It represents a clear and viable successor to the HL2 for surgical navigation. In contrast, the AVP, despite its revolutionary display and interaction system, is currently hampered by unstable marker tracking and significant ergonomic challenges (weight, incompatibility with glasses) that make it unsuitable for high‐precision interventional guidance. Its potential, for now, lies firmly in non‐navigational applications like training and planning.
Ultimately, this work underscores that the optimal HMD for surgery is not necessarily the one with the highest technical specifications, but the one that provides a stable, comfortable and reliable fusion of digital information and physical reality. While MR HMDs are proving to be increasingly viable tools in medicine, our insights emphasize the need for continued refinement in comfort, stability and seamless clinical integration in order to translate technological potential into real‐world surgical impact.
Supporting Information
Code. The source code for compiling the apps is available at: https://github.com/Vicomtech/NAVMR.git (and mirrored at: https://github.com/BSEL‐UC3M/NAVMR.git).
Video. The video shows the puncture procedure with each of the HMDs. The misalignment in the case of the Magic Leap 2 device is a known artifact caused by an offset between the device's video capture sensor and the user's viewpoint.
Author Contributions
Amaia Iribar‐Zabala: conceptualization, formal analysis, investigation, methodology, software, validation, writing – original draft. Joseba Ruiz‐Olalla‐Del‐Fresno: software. Inés Rubio‐Pérez: validation. Rafael Moreta‐Martínez: methodology, software. Andoni Beristain‐Iraola: supervision, methodology, writing – review and editing. Javier Pascau: conceptualization, supervision, writing – review and editing. Mónica García‐Sevilla: conceptualization, project administration, supervision, validation, writing – original draft, writing – review and editing.
Funding
This work was supported by Project CER‐20231013 (‘Aid for Technological Centres Cervera’, from the Ministerio de Ciencia e Innovación/AEI/10.13039/501100011033 and the European Union's ‘NextGenerationEU'/PRTR).
Conflicts of Interest
The authors declare no conflicts of interest.
Acknowledgements
The authors would like to express their sincere gratitude to the Department of General Surgery at the Hospital Universitario La Paz (Madrid, Spain) for their participation in the clinical evaluation and usability assessment in the comparative study.
Data Availability Statement
The developed applications that support the findings of this study are available in the supplementary material of this article as links to GitHub repositories.
References
- 1. Guiu B., De Baère T., Noel G., et al., “Feasibility, Safety and Accuracy of a CT‐Guided Robotic Assistance for Percutaneous Needle Placement in a Swine Liver Model,” Scientific Reports 11 (2021): 5218, 10.1038/s41598-021-84878-3.
- 2. Storz P., Buess G., Kunert W., and Kirschniak A., “3D HD Versus 2D HD: Surgical Task Efficiency in Standardised Phantom Tasks,” Surgical Endoscopy 26 (2012): 1454–1460, 10.1007/s00464-011-2055-9.
- 3. Hatzipanayioti A., Bodenstedt S., von Bechtolsheim F., et al., “Associations Between Binocular Depth Perception and Performance Gains in Laparoscopic Skill Acquisition,” Frontiers in Human Neuroscience 15 (2021): 675700, 10.3389/fnhum.2021.675700.
- 4. Navab N., Martin‐Gomez A., Seibold M., et al., “Medical Augmented Reality: Definition, Principle Components, Domain Modeling, and Design‐Development‐Validation Process,” Journal of Imaging 9, no. 1 (2023): 4, 10.3390/jimaging9010004.
- 5. Sutherland I. E., “A Head‐Mounted Three Dimensional Display,” in Proceedings of the December 9‐11, 1968, Fall Joint Computer Conference, Part I (ACM, 1968), 757–764, 10.1145/1476589.1476686.
- 6. Roberts G. W., Evans A., Dodson A., Denby B., Cooper S., and Hollands R., “The Use of Augmented Reality, GPS and INS for Subsurface Data Visualization,” in FIG XXII International Congress 4 (2002): 1–12.
- 7. Birlo M., Edwards P. J. E., Clarkson M., and Stoyanov D., “Utility of Optical See‐Through Head Mounted Displays in Augmented Reality‐Assisted Surgery: A Systematic Review,” Medical Image Analysis 77 (2022): 102361, 10.1016/j.media.2022.102361.
- 8. Park B. J., Hunt S. J., Nadolski G. J., and Gade T. P., “Augmented Reality Improves Procedural Efficiency and Reduces Radiation Dose for CT‐Guided Lesion Targeting: A Phantom Study Using HoloLens 2,” Scientific Reports 10 (2020): 18620, 10.1038/s41598-020-75676-4.
- 9. Gsaxner C., Li J., Pepe A., Schmalstieg D., and Egger J., “Inside‐Out Instrument Tracking for Surgical Navigation in Augmented Reality,” in Proceedings of the 27th ACM Symposium on Virtual Reality Software and Technology (Association for Computing Machinery, 2021), 4, 10.1145/3489849.3489863.
- 10. Jiang B., Wang L., Xu K., et al., “Wearable Mechatronic Ultrasound‐Integrated AR Navigation System for Lumbar Puncture Guidance,” IEEE Transactions on Medical Robotics and Bionics 5, no. 4 (2023): 966–977, 10.1109/tmrb.2023.3319963.
- 11. Theivendrampillai S., Yang B., Little M., and Blick C., “Targeted Augmented Reality‐Guided Transperineal Prostate Biopsies Study: Initial Experience,” Therapeutic Advances in Urology 16 (2024): 17562872241232582, 10.1177/17562872241232582.
- 12. Trojak M., Stanuch M., Kurzyna M., Darocha S., and Skalski A., “Mixed Reality Biopsy Navigation System Utilizing Markerless Needle Tracking and Imaging Data Superimposition,” Cancers 16, no. 10 (2024): 1894, 10.3390/cancers16101894.
- 13. Heinrich F., Schwenderling L., Joeres F., Lawonn K., and Hansen C., “Comparison of Augmented Reality Display Techniques to Support Medical Needle Insertion,” IEEE Transactions on Visualization and Computer Graphics 26, no. 12 (December 2020): 3568–3575.
- 14. Hwang S., Lee S., and Kim S., “Surgical Navigation System for Pedicle Screw Placement Based on Mixed Reality,” International Journal of Control, Automation and Systems 21 (2023): 3983–3993, 10.1007/s12555-023-0083-6.
- 15. Li M., Mehralivand S., Xu S., et al., “HoloLens Augmented Reality System for Transperineal Free‐Hand Prostate Procedures,” Journal of Medical Imaging 10, no. 2 (March 2023): 025001, 10.1117/1.JMI.10.2.025001.
- 16. Pose‐Díez‐de‐la Lastra A., Moreta‐Martinez R., García‐Sevilla M., et al., “HoloLens 1 vs. HoloLens 2: Improvements in the New Model for Orthopedic Oncological Interventions,” Sensors 22, no. 13 (2022): 4915, 10.3390/s22134915.
- 17. Żelechowski M., Zubizarreta‐Oteiza J., Karnam M., et al., “Augmented Reality Navigation in Orthognathic Surgery: Comparative Analysis and a Paradigm Shift,” Healthcare Technology Letters 12 (2025): e12109, 10.1049/htl2.12109.
- 18. Zari G., Condino S., Cutolo F., and Ferrari V., “Magic Leap 1 Versus Microsoft HoloLens 2 for the Visualization of 3D Content Obtained From Radiological Images,” Sensors 23, no. 6 (2023): 3040, 10.3390/s23063040.
- 19. Tsai Y.‐C., Hsiao C.‐C., Lin C.‐W., Chern M.‐C., and Huang S.‐W., “Apple Vision Pro‐Guided Laparoscopic Radio Frequency Ablation for Liver Tumors: The Pioneer Experience,” Surgical Innovation 32, no. 3 (2025): 312–314, 10.1177/15533506251316001.
- 20. “HoloLens 2 Specifications,” Microsoft, accessed April 6, 2025, https://learn.microsoft.com/en‐us/hololens/hololens2‐hardware.
- 21. “Magic Leap 2 Technical Specifications,” Magic Leap, accessed April 6, 2025, https://www.magicleap.com/en‐us/magic‐leap‐2.
- 22. “Magic Leap 2 Technical Specifications Developer,” Magic Leap Developer, accessed April 6, 2025, https://developer‐docs.magicleap.cloud/docs/guides/ml2‐overview.
- 23. “Apple Vision Pro Tech Specs,” Apple Inc., accessed April 6, 2025, https://www.apple.com/apple‐vision‐pro/specs/.
- 24. “Unity Real‐Time Development Platform,” Unity Technologies, accessed April 6, 2025, https://unity.com/.
- 25. “MRTK3 Overview—Mixed Reality Toolkit for Unity,” Microsoft Learn, accessed April 7, 2025, https://learn.microsoft.com/en‐us/windows/mixed‐reality/mrtk‐unity/mrtk3‐overview/.
- 26. “Vuforia Developer Portal,” Vuforia Developer Portal, accessed April 7, 2025, https://developer.vuforia.com/home.
- 27. “SwiftUI ‐ Xcode ‐ Apple Developer,” Apple Inc., accessed April 8, 2025, https://developer.apple.com/xcode/swiftui/.
- 28. Moreta‐Martínez R., Rubio‐Pérez I., García‐Sevilla M., García‐Elcano L., and Pascau J., “Evaluation of Optical Tracking and Augmented Reality for Needle Navigation in Sacral Nerve Stimulation,” Computer Methods and Programs in Biomedicine 224 (2022): 106991, 10.1016/j.cmpb.2022.106991.
- 29. “V120:Duo ‐ An Optical Tracking System in a Single, Plug‐and‐Play Package,” OptiTrack, accessed April 7, 2025, https://www.optitrack.com/cameras/v120‐duo/.
- 30. Fedorov A., Beichel R., Kalpathy‐Cramer J., et al., “3D Slicer as an Image Computing Platform for the Quantitative Imaging Network,” Magnetic Resonance Imaging 30, no. 9 (2012): 1323–1341, 10.1016/j.mri.2012.05.001.
- 31. Brooke J., “SUS: A ‘Quick and Dirty’ Usability Scale,” in Usability Evaluation in Industry, ed. Jordan P. W., Thomas B., Weerdmeester B. A., and McClelland I. L. (London: Taylor & Francis, 1996), 189–194.
- 32. Likert R., A Technique for the Measurement of Attitudes, Vol. 140 (Columbia University Press, 1932).
- 33. Rohmer K., Becker M., Georgiades M., et al., “Acceptance and Feasibility of an Augmented Reality‐Based Navigation System With Optical Tracking for Percutaneous Procedures in Interventional Radiology ‐ A Simulation‐Based Phantom Study,” RöFo 197, no. 8 (August 2025): 936–944, 10.1055/a-2416-1080.