Healthcare Technology Letters. 2025 Dec 16;12(1):e70044. doi: 10.1049/htl2.70044

Markerless Tracking of Robotic Surgical Instruments With Head Mounted Display for Augmented Reality Applications

Nicholas Greene 1, Aoqi Long 2, Yonghao Long 2, Zheng Han 2, Qi Dou 2, Peter Kazanzides 1

ABSTRACT

In robotic‐assisted laparoscopic surgery, an assistant surgeon stands at the bedside assisting the intervention, while the surgeon sits at the console teleoperating the robot. Tasks for the assistant include navigating new instruments into the surgeon's field‐of‐view and passing materials into, or retracting them from, the body using hand‐held tools. We previously developed ARssist, an augmented reality application based on an optical see‐through head‐mounted display (HMD), to aid the assistant. Localization of the HMD with respect to the robot was achieved via the attachment of markers. In this paper, we propose a novel markerless tracking method for robotic instruments using a HoloLens 2 HMD. We first run off‐the‐shelf YOLOv11 and SAMURAI (an adaptation of Segment Anything 2) networks to detect instrument primitives (shaft lines and keypoints). We then recover full 6D poses via a geometrically interpretable pipeline combining perspective‐n‐point (PnP) and a multi‐view least‐squares optimization. We experimentally compare the markerless tracking accuracy to a baseline marker‐based tracking solution, and show similar instrument tip accuracy. This suggests that the markerless method is an acceptable substitute for marker‐based tracking in this augmented reality application, while avoiding the workflow issues of sterilizing and attaching markers.

Keywords: augmented reality, medical robotics, pose estimation


We propose a novel markerless tracking method for robotic instruments using a HoloLens 2 head‐mounted display. This enables an augmented reality overlay of the robotic instruments and endoscope field of view for the assistant (bedside) surgeon in minimally‐invasive robotic surgery. Experimental results indicate that the markerless method achieves tracking accuracy comparable to that of a marker‐based method.



Abbreviations

AR: augmented reality
DoF: degrees of freedom
HMD: head‐mounted display
RCM: remote centre of motion

1. Introduction

Robotic‐assisted laparoscopic surgery has various advantages compared to conventional laparoscopic surgery, including the restoration of hand‐eye coordination, improved dexterity of instruments, and 3D immersive visualization [1]. The most common approach is for the surgeon to sit at a console and teleoperate robotic instruments inside the patient, while viewing images from a stereo endoscope. This setup requires a sterile, assistant surgeon at the patient bedside to perform tasks such as changing the robotic instruments and inserting or removing materials to or from the surgical site.

Researchers have identified that the performance of the assistant surgeon is critical to the overall outcome of the surgery [2, 3, 4]. However, the environment, workflow and facilities for the assistant to perform his/her tasks have not significantly changed since the introduction of surgical robots. The assistant surgeon essentially operates like a laparoscopic surgeon, except that the pose of the camera is controlled by the console surgeon and often does not provide intuitive hand/eye coordination for the assistant surgeon. This has motivated the development of augmented reality (AR), provided through a head‐mounted display (HMD), to attempt to improve the perception and performance of the assistant surgeon, as described in our prior work on ARssist [5, 6]. However, one significant limitation was the requirement to attach markers or fiducials to the surgical robot (in that particular case, to the cannulas) to enable registration of the HMD to the robot for accurate alignment of the overlays. Embedding markers into conventional surgical protocols is far from straightforward. Preoperative sterilization of these markers, their strategic placement to avoid encroaching on the operative field, and registration requirements all interrupt established surgical practices and training workflows, and may increase procedure time. Furthermore, introducing a new sterile device would add requirements for design, validation, regulatory approval, manufacturing, distribution and inventory.

In this work, we address these concerns and develop a markerless approach for localizing an HMD with respect to the robot by tracking the robotic instruments, because they are among the few parts of the surgical robot that are not covered by sterile drapes. In addition, we perform experiments to evaluate the instrument tip tracking accuracy of our markerless tracking method compared to a baseline marker‐based method on the Microsoft HoloLens 2 HMD. Our contributions are the following:

  1. A markerless pose estimation method for tracking robotic surgical instruments on a HoloLens 2 HMD.

  2. The integration of easily trainable networks coupled with classical methods and a fast multi‐view least‐squares optimization for tracking the instrument.

  3. Experimental results demonstrating that the proposed markerless pose estimation method achieves tip localization accuracy comparable to the previous marker‐based method.

2. System Overview

This section provides an overview of the proposed HMD‐based augmented reality visualization for the assistant surgeon, which is based on ARssist [5, 6]. Figure 1 (left) shows a block diagram of the application, which integrates vision‐based pose estimation with robot kinematics for robotic surgery. The Unity App captures RGB and greyscale images from the HoloLens 2 and forwards them to a desktop PC. On the PC, feature detection, localization, and shaft refinement are performed to compute an optimized 6D pose. That pose is sent back to the Unity App along with kinematic data from the surgical robot. The Unity App fuses the two to achieve precise instrument tracking.

FIGURE 1.

(Left) High‐level flowchart of application system architecture. (Right) Robotic instruments inside a training phantom as seen through the HoloLens 2 display, with registered AR overlays of the two instruments and the endoscope. The endoscope's viewing frustum is also visible.

2.1. Visualization of Robotic Instruments and Endoscope

In order to visualize robot‐controlled instruments on the HMD, it is necessary to obtain the real‐time pose of each instrument relative to the HMD. This requires an initial registration between the HMD and the robot, and real‐time updates to that transformation as the HMD moves with respect to the robot, which are further described in Section 4.2. For the endoscope, the viewing frustum is also visualized, extending from the tip of the endoscope. An example of the AR overlays can be seen in Figure 1 (right).

2.2. HMD Localization With Respect to Robot

The relevant transformations are shown in Figure 2. The HoloLens includes a self‐localization capability, based on real‐time visual‐inertial SLAM (simultaneous localization and mapping), where it determines its pose with respect to a World coordinate system that is defined with respect to the room. Similarly, the robot kinematics determines the pose of each instrument with respect to a Robot Base coordinate system. Thus, for augmented reality overlay, the minimum requirement is to register the HMD World coordinate system to the Robot Base coordinate system. Robot kinematics can provide the 6D pose of any feature with respect to the Robot Base and computer vision methods can be applied to the HoloLens camera image to obtain the 6D pose of the same feature with respect to the HMD and, using the self‐localization, with respect to the World.

FIGURE 2.

Transformations required to register the HoloLens World frame to the Robot Base frame. Robot kinematics provides feature positions with respect to Robot Base, while computer vision (and HoloLens self‐localization) provides feature positions with respect to HoloLens World.

Virtual objects defined with respect to the robotic instruments can be placed in the HoloLens World frame by composing the kinematics transformation and the registration transformation. The registration needs to be updated occasionally because the HoloLens self‐localization is known to drift over time. Note that our tracking method, which updates the registration transformation, does not need to run in real‐time. The streamed robot kinematics data and HoloLens self‐localization both update in real‐time, which is sufficient to display a real‐time visualization. In fact, updating the registration transformation too quickly could lead to jitter, or 'swimming', of the virtual overlays, which would need to be handled separately.
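
Since the registration is updated only occasionally, it can be computed and cached whenever the vision pipeline produces a fresh instrument pose. A minimal NumPy sketch of this composition, with hypothetical variable names for the three poses described above, is:

```python
import numpy as np

def update_registration(T_world_hmd, T_hmd_feat, T_base_feat):
    """Compose the HoloLens-World-to-Robot-Base registration.

    T_world_hmd : 4x4 HMD pose in the HoloLens World frame
                  (HoloLens self-localization).
    T_hmd_feat  : 4x4 pose of a tracked feature (e.g., the instrument)
                  in the HMD camera frame (vision pipeline).
    T_base_feat : 4x4 pose of the same feature in the Robot Base frame
                  (robot kinematics).
    Returns T_world_base, mapping Robot Base coordinates into the
    HoloLens World frame.
    """
    return T_world_hmd @ T_hmd_feat @ np.linalg.inv(T_base_feat)

# Between (slow) registration updates, overlays use only the cached
# registration plus the real-time streams:
#   p_world = T_world_base @ T_base_feat_now @ p_feat
```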

In this work, we focus on determining the pose of a single robotic instrument with respect to the HMD, as depicted by the red lines in Figure 2, without introducing markers to the scene. This is the first step in a tracking framework where the registration is updated over time to maintain its accuracy.

The reasons for tracking the robotic instruments directly are threefold: (1) sterile drapes visually occlude all parts of the robot except for the instruments, cannulas and part of the robot base, making computer vision impractical for the covered regions; (2) the instruments are more likely to be in the field‐of‐view (FOV) of the assistant surgeon, and therefore in the FOV of the HMD cameras; (3) since we plan to overlay parts of the instruments that are inside the patient, tracking the instruments directly can minimize the effect of tracking errors, especially compared to tracking features much farther away, such as on the robot base.

3. Background and Related Work

Our specific problem is tracking of long, cylindrical surgical instruments from cameras mounted on an HMD from a view of the instrument that is outside the patient. Notably, most prior efforts have focused on tracking the intra‐corporeal part of instruments from endoscopic camera images [7], whereas our goal is to track the part of the instrument that is outside the patient.

3.1. HMD‐Integrated Tracking Pipelines

Most methods involving surgical tool tracking from an HMD are marker‐based. Recent HMD surgical navigation studies have evaluated tools tracked with planar markers, such as ArUco [8], in procedure‐relevant tasks. We previously reported an overlay error of 4.27 mm with a standard deviation of 3.09 mm for ARssist [5].

Heining et al. used ArUco‐tracked standard orthopaedic instruments and an HMD to guide pelvic screw placement in an ex vivo study [9]. The overall translational deviation was reported as 3.99 ± 1.77 mm, and the overall rotational deviation as 4.3 ± 1.8°. Challenges included line‐of‐sight dropouts and workflow issues when markers left the HMD field of view. Sterility, occlusion and ergonomics remain barriers to clinical use with planar markers.

Gsaxner et al. leveraged the HoloLens 2's onboard stereo greyscale cameras to track passive retro‐reflective sphere‐based markers affixed to a biopsy needle via stereo triangulation followed by an extended Kalman filter; they reported a mean lateral error of 1.7 mm and an angular error of 1.1° against an external optical tracker, demonstrating that inside‐out HMD sensing alone can rival ceiling‐mounted IR optical tracking systems [10].

Martin‐Gomez et al. developed STTAR, which leveraged the near‐infrared camera in the HoloLens 2 to localize retro‐reflective spheres attached to surgical tools, achieving sub‐millimetre lateral accuracy in a K‐wire insertion task [11]. Although marker‐based, STTAR demonstrates that HMD sensors can support surgical tool pose estimation with high accuracy.

3.2. Markerless Instrument Pose Estimation With Learning

There is recent work on markerless tracking of tools by AR headsets. Doughty et al. introduced HMD‐EgoPose, a single‐shot CNN that jointly estimates the 6‐DoF pose of a surgical drill and the user's hand from HoloLens RGB video [12]; the drill tip error was reported as 28.07 ± 19.81 mm and the drill bit direction error as 3.89 ± 2.74°. In a related study, Hein et al. evaluated state‐of‐the‐art multi‐view deep pose estimation for surgical tools using images from multiple fixed cameras as well as cameras on two HMDs [13]. They found that single‐view networks still fall short of surgical accuracy requirements, and that the multi‐view networks performed very well on images from fixed cameras but dramatically worse when the images came from two moving HMDs, with a reported translation error of 7.45 ± 9.49 mm and rotation error of 3.82 ± 5.89°. Both of these purely neural‐network‐based methods had large errors when using HMD images. These results support our rationale for adopting a hybrid approach in our own markerless tracking method.

To the best of our knowledge, there is no other work which directly demonstrates markerless tracking of robotic surgical instruments, in particular those with long cylindrical shafts, with an HMD.

4. Methods

We describe the hardware we used in Section 4.1, and our real‐time markerless tracking method in Section 4.2.

4.1. Hardware and Implementation Details

We use a Microsoft HoloLens 2 as our HMD, which serves as both the image capture device and the augmented reality display. We utilize the RGB camera set to 720 × 1280 resolution at 30 FPS and the left‐front and right‐front visible light (greyscale) cameras, each 640 × 480 resolution at 30 FPS, available through the HoloLens 2 Research Mode [14]. We use hl2ss [15] to stream these images to an off‐board desktop PC for real‐time image processing; the factory camera calibrations from our HoloLens 2 were used for all vision tasks throughout this work. We use UDP sockets to stream pose information back to the HoloLens 2. The PC contains an Intel Core i9‐10850K 10‐core CPU (2020) and an Nvidia RTX 3090 GPU (2020). We also used the Sentire Surgical System from Cornerstone Robotics Ltd., Hong Kong SAR, China.

4.2. Markerless Tracking Method

Initially, we investigated general‐purpose 6D pose estimation networks and methods such as those found in the BOP challenge [16]; however, with the amount of training data we could collect, even with an automatic collection system [17], we were unable to achieve a reliable tracking system. It is clear from the BOP challenge results that 6D pose estimation networks are still not fully mature, and they are usually very slow. As of October 2025, the minimum mean runtime among the top 10 methods on the BOP challenge leaderboard is 21.07 s, though a runtime of 0.75 s can be achieved with methods within the top 50.

Our method employs neural networks for bounding box localization [18] and object segmentation [19, 20]. It requires very little, easily collected training data (only a few hundred images with labelled bounding boxes), which simplifies implementation. Otherwise, we use classical methods, which allow us to easily understand our results due to the interpretable geometry. At each step in a classical pipeline, it is possible to check for anomalies and reject the pose. This is especially important in the context of robotic surgery, where undetected and inexplicable failures may be distracting at best or harmful at worst.

In summary, we acquire RGB and greyscale stereo images from the HoloLens 2 and introduce a novel pipeline that fuses classical feature detection, neural network‐based shaft segmentation, and optimization techniques to precisely estimate the 6‐DoF pose of the surgical instrument, as described in the following subsections. Our current implementation requires about 0.75 s per image, but it can be further optimized. This does not affect real‐time AR visualization, which relies solely on the robot kinematics and HMD self‐localization.

4.2.1. Instrument Feature Detection

Our approach relies on distinct features that already exist on the instruments, and are similar even between surgical robots from different manufacturers. As illustrated in the low‐level processing diagram (Figure 3), the feature detection pipeline targets key geometric elements on the instrument for robust pose estimation: fitting a square to the round‐cornered square‐shaped button to determine precise corner locations, and detecting two lines on the high‐contrast boundaries of black and white regions.

FIGURE 3.

Low‐level processing of the instrument feature detection and pose estimation pipeline: (a) RGB image captured by HoloLens 2. (b) YOLO bounding box overlaid on the RGB image. (c) Extracted contour of the target square‐shaped button. (d) Scaled‐up convex hull with potential corners removed. (e) RANSAC‐based edge line fitting and corner computation via line intersections. (f) Resulting rectangular fit to the button. (g) Detected primary and secondary separator lines. (h) Eight identified 2D image feature points. (i) Initial pose estimation with projection of a fixed‐offset shaft point serving as a point prompt for RGB segmentation. (j) Reprojection of the point prompt to the left greyscale stereo image using factory calibration matrices, and segmentation result on the left greyscale image. (k) RGB segmentation results, including shaft centreline and silhouette edges. (l) Optimized pose projecting the full‐length 3D shaft onto RGB (yellow) compared with 3D shaft projection by initial pose (red).

4.2.1.1. Instrument Housing Localization

In complex surgical environments, traditional methods lack robustness for detecting the square button amid noise and disturbances. Thus, we employ a lightweight, real‐time YOLOv11 model [18], which requires minimal data (fine‐tuned on approximately 200 annotated images with varied views, lighting, and occlusions), to localize the instrument housing ROI (region of interest). While we achieved tracking with only about 200 images, using more images increased the reliability of YOLO. An example bounding box ROI is shown in Figure 3b.
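
For illustration, running such a detector through the Ultralytics API takes only a few lines; the weights file, confidence threshold, and helper name below are placeholder assumptions, not the trained model from this work:

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # placeholder weights; in practice, fine-tuned
                            # on the ~200 annotated housing images

def detect_housing_roi(rgb_image, conf=0.5):
    """Return the highest-confidence instrument-housing bounding box
    (x1, y1, x2, y2) in pixels, or None if nothing is detected."""
    results = model(rgb_image, conf=conf, verbose=False)
    boxes = results[0].boxes
    if len(boxes) == 0:
        return None
    best = boxes[int(boxes.conf.argmax())]
    return tuple(best.xyxy[0].int().tolist())
```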

4.2.1.2. Instrument Feature Detection

We apply bilateral filtering and Gaussian blurring to the ROI before applying a brightness threshold. Candidates for the rounded‐square button are detected via contour extraction on the resulting binary image. The button's shape varies dramatically with instrument motion due to perspective distortion and rounded corners, necessitating a composite scoring system to reliably select the correct contour and avoid erroneous ones. Contours are scored using the geometric mean of four normalized metrics: aspect ratio (near 1 for squares), extent (area ratio to bounding rectangle), solidity (area ratio to convex hull), and orthogonality (average vertex angle deviation from 90°). For the orthogonality check, the contour is first simplified with the Douglas–Peucker algorithm [21]. It is rejected if it contains fewer than four or more than eight edges; otherwise it receives an orthogonality score. Contours are also rejected if their size relative to the ROI is too small or too large, or if the contour touches the border of the ROI. The highest‐scoring non‐rejected contour is chosen, as shown in Figure 3c.
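
A sketch of this composite scoring with OpenCV follows; the Douglas–Peucker epsilon and the normalizations are illustrative assumptions rather than the tuned values used in this work:

```python
import cv2
import numpy as np

def score_contour(contour):
    """Geometric mean of four normalized squareness metrics; returns 0.0
    for rejected contours."""
    x, y, w, h = cv2.boundingRect(contour)
    aspect = min(w, h) / max(w, h)              # near 1 for squares
    area = cv2.contourArea(contour)
    extent = area / (w * h)                     # area vs. bounding rectangle
    hull = cv2.convexHull(contour)
    solidity = area / max(cv2.contourArea(hull), 1e-6)

    # Orthogonality: simplify with Douglas-Peucker, then score the mean
    # vertex-angle deviation from 90 degrees.
    eps = 0.02 * cv2.arcLength(contour, True)   # assumed epsilon
    approx = cv2.approxPolyDP(contour, eps, True)
    if not (4 <= len(approx) <= 8):             # reject: wrong edge count
        return 0.0
    pts = approx.reshape(-1, 2).astype(float)
    devs = []
    for i in range(len(pts)):
        a, b, c = pts[i - 1], pts[i], pts[(i + 1) % len(pts)]
        v1, v2 = a - b, c - b
        cosang = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
        angle = np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))
        devs.append(abs(angle - 90.0))
    ortho = max(0.0, 1.0 - np.mean(devs) / 90.0)

    return (aspect * extent * solidity * ortho) ** 0.25
```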

The target button presents challenges for corner identification compared to conventional markers, due to its small size and rounded corners. The small size makes Hough line transforms difficult, and the rounded corners prevent exact corner localization by standard corner detection algorithms. We instead find the corner points as the intersection points of the contour edges.

Since the image is noisy, we fit four straight edges to the contour using a RANSAC‐based method. First, the convex hull is scaled up to a fixed area via centre‐cropping and uniform scaling of the contour vertices. Additional vertices are added via interpolation. We then identify locally straight sections by computing the angle between consecutive edges. If the deviation from 180° (i.e., from collinearity) is below a threshold, the points are retained as edge candidates; otherwise they are discarded. This removes edges with sharp turns, namely the edges along the rounded corners, as seen in Figure 3d.

RANSAC line fitting follows: we choose random vertex pairs to define a line, count inliers within a distance tolerance, and refine the best lines via least squares on the inliers. This yields four edges whose pairwise intersections define the four 'corners' of the contour, as in Figure 3e.
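
The edge fitting can be sketched as follows (NumPy only; the iteration count and inlier tolerance are assumed values). Each of the four edges is fitted in turn on its candidate points, and the corners then follow as pairwise line intersections:

```python
import numpy as np

def ransac_line(points, n_iters=200, tol=1.5):
    """Fit a 2D line to noisy candidate points with RANSAC, then refine
    by least squares on the inliers. Returns (point, unit_direction)."""
    rng = np.random.default_rng(0)
    best_inliers = None
    for _ in range(n_iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        d = points[j] - points[i]
        if np.linalg.norm(d) < 1e-6:
            continue
        d = d / np.linalg.norm(d)
        normal = np.array([-d[1], d[0]])
        dist = np.abs((points - points[i]) @ normal)  # point-to-line distance
        inliers = points[dist < tol]
        if best_inliers is None or len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Least-squares refinement: principal direction of the inliers.
    centroid = best_inliers.mean(axis=0)
    _, _, vt = np.linalg.svd(best_inliers - centroid)
    return centroid, vt[0]

# Corners are then the pairwise intersections of the four fitted edges.
```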

Two additional linear features are detected: Boundary A, the interface between the instrument housing and the sterile adapter, and Boundary B, the interface between the sterile adapter and the motor pack. Both are high‐contrast black–white transitions, detected using edge detection and probabilistic Hough transforms. Boundary A is found first, based on its proximity to the button contour centroid and the number of Hough‐line votes. To find Boundary B, we search along a vector from the button centre toward Boundary A, with the direction taken from one of the previously fitted edges, and select the Hough line with the most votes among lines that intersect the search vector. The two boundary lines are depicted in Figure 3g.

Upon finding the two boundary lines, the 2D feature point set is constructed from the four button corners plus four intersections between the button edges and separators, providing eight points for PnP as shown in Figure 3h.

4.2.1.3. Initial Pose Estimation with PnP

An initial pose is estimated via Perspective‐n‐Point (PnP) with a disambiguation step. Because the instrument is symmetrical, before PnP we must first infer whether we are viewing its left or right side. We determine the view (left or right) by considering the slope and endpoints of Boundary A, and the button centroid.

Next, we find pose candidates using PnP. The eight 2D image points (Figure 3h) correspond to coplanar 3D model points. Because the 3D points are coplanar, a two‐pose ambiguity exists for any PnP solver. We use a PnP solver which yields both of the ambiguous pose candidates [22].

Finally, we disambiguate the pose by choosing the correct candidate. For both pose candidates, the +Z unit vector is projected onto the image to evaluate its direction $\theta_k$. In views where the instrument face is sufficiently non‐parallel to the image plane, we choose the pose candidate whose +Z axis projection aligns with the inferred view of the instrument; for example, we choose the candidate where $\theta_k$ points left when viewing the left side of the instrument. When the instrument face is nearly parallel to the image plane, we run a different PnP solver [23] which provides only a single pose candidate. This may be the incorrect one because of the pose ambiguity; however, our subsequent shaft refinement method (Section 4.2.3) usually converges to the correct pose.
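
OpenCV exposes the IPPE solver of [22] through solvePnPGeneric, which returns both candidates for coplanar points; a sketch of the candidate generation and the +Z projection rule follows, where the sign convention for a 'left' view is an assumption:

```python
import cv2
import numpy as np

def pnp_candidates(obj_pts, img_pts, K, dist):
    """Both ambiguous pose candidates for coplanar points via IPPE [22]."""
    _, rvecs, tvecs, _ = cv2.solvePnPGeneric(
        obj_pts.astype(np.float32), img_pts.astype(np.float32),
        K, dist, flags=cv2.SOLVEPNP_IPPE)
    return list(zip(rvecs, tvecs))

def disambiguate(candidates, K, view_is_left):
    """Pick the candidate whose projected +Z axis matches the inferred
    side of the instrument (assumed sign convention: +Z projects to the
    left of the origin for a left-side view)."""
    for rvec, tvec in candidates:
        axis = np.float32([[0, 0, 0], [0, 0, 1]])
        proj, _ = cv2.projectPoints(axis, rvec, tvec, K, None)
        dx = proj[1, 0, 0] - proj[0, 0, 0]   # image-plane direction of +Z
        if (dx < 0) == view_is_left:
            return rvec, tvec
    return candidates[0]
```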

4.2.2. Shaft Detection

To precisely detect the instrument shaft in images captured by the HoloLens 2, we employ a real‐time variant of the SAMURAI segmentation network [19, 20] to isolate the shaft from the background. SAMURAI provides robust object tracking across sequential video frames.

The segmentation model requires a prompt to identify and track the target object. In our approach, we provide a point prompt derived by projecting a 3D point $S^{\text{shaft}} \in \mathbb{R}^3$ in the instrument frame, located at a fixed offset from the proximal shaft end, using the initial pose $T \in SE(3)$ obtained from PnP (Section 4.2.1). The 3D point in the RGB camera frame is computed as $\tilde{S}^{\text{rgb}} = T \tilde{S}^{\text{shaft}}$, where the tilde denotes homogeneous coordinates. The corresponding 2D point prompt in the RGB image is $p^{\text{rgb}} = \pi(K_{\text{rgb}}, S^{\text{rgb}})$, where $\pi(K, S)$ is the perspective projection of 3D points onto the image plane and $K_{\text{rgb}}$ is the intrinsic matrix of the RGB camera. Point prompts are also generated for the left and right greyscale stereo images using the left and right camera intrinsics and extrinsics $(K_L, T_{\text{rgb}}^{L})$ and $(K_R, T_{\text{rgb}}^{R})$. The 3D point is transformed to each stereo camera frame as $\tilde{S}^{L} = T_{\text{rgb}}^{L} \tilde{S}^{\text{rgb}}$ and $\tilde{S}^{R} = T_{\text{rgb}}^{R} \tilde{S}^{\text{rgb}}$.
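
The prompt generation reduces to a chain of rigid transforms and pinhole projections; a minimal NumPy sketch, assuming 4×4 homogeneous transforms and distortion‐free intrinsics, is (the right greyscale prompt is analogous):

```python
import numpy as np

def project(K, S_cam):
    """Pinhole projection of a 3D point in the camera frame to pixels."""
    p = K @ S_cam[:3]
    return p[:2] / p[2]

def make_prompts(T_pose, S_shaft, K_rgb, K_L, T_rgb_to_L):
    """Project the fixed shaft-offset point into the RGB image (the SAM
    prompt) and into the left greyscale image via the factory extrinsics."""
    S_h = np.append(S_shaft, 1.0)          # homogeneous instrument-frame point
    S_rgb = T_pose @ S_h                   # point in the RGB camera frame
    p_rgb = project(K_rgb, S_rgb)
    S_L = T_rgb_to_L @ S_rgb               # transform to the left grey frame
    return p_rgb, project(K_L, S_L)
```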

Given the binary segmentation mask of the shaft from SAMURAI, PCA (principal component analysis) is applied to compute the centroid and unit‐normalized principal direction (which lies along the shaft centreline). The shaft pixels are projected onto this frame to get axial distances along the centreline and perpendicular distances across the width. For robustness, we use the min/max axial extents and the 5th/95th percentiles of perpendicular distances to ignore outliers. We verify that the aspect ratio exceeds a threshold to confirm a shaft‐like shape.

For all images (RGB and stereo greyscale), the centreline endpoints are derived by shifting the centroid along the principal direction by the min/max extents. For RGB only, we also find the left/right edges of the shaft by using the width percentiles at the endpoints. This yields the centreline and edges for RGB as endpoint‐defined lines $\ell_1^{\text{rgb}}$, $\ell_2^{\text{rgb}}$ and $\ell_3^{\text{rgb}}$, and the centrelines in the left and right greyscale images, $g^L$ and $g^R$. See Figure 3j,k.
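
A compact sketch of the PCA‐based centreline extraction from a binary mask follows; the aspect‐ratio threshold is an assumed value:

```python
import numpy as np

def shaft_centreline(mask, ar_thresh=4.0):
    """Recover the shaft centreline endpoints from a binary mask via PCA;
    returns None if the blob is not shaft-like."""
    ys, xs = np.nonzero(mask)
    if len(xs) < 2:
        return None
    pts = np.stack([xs, ys], axis=1).astype(float)
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid, full_matrices=False)
    axis, perp = vt[0], vt[1]              # principal + perpendicular directions
    t = (pts - centroid) @ axis            # axial coordinates
    s = (pts - centroid) @ perp            # perpendicular coordinates
    length = t.max() - t.min()
    width = np.percentile(s, 95) - np.percentile(s, 5)   # robust width
    if length / max(width, 1e-6) < ar_thresh:
        return None                        # reject non-shaft blobs
    return centroid + t.min() * axis, centroid + t.max() * axis
```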

4.2.3. Pose Refinement via Shaft Correction

Given the initial pose obtained from the PnP result in the RGB image, the known 2D–3D correspondences in the RGB image, and the shaft lines from all images, we refine the predicted pose via a non‐linear least‐squares optimization.

Let the 6D instrument pose be parameterized by a vector $[r \mid t] \in \mathbb{R}^6$, where $r \in \mathbb{R}^3$ is a Rodrigues (axis‐angle) vector representing rotation and $t \in \mathbb{R}^3$ is the translation vector. These six values are the parameters updated by the least‐squares optimization. Let $\{X_i\}_{i=1}^{N} \subset \mathbb{R}^3$ be 3D object keypoints in the instrument frame, with corresponding 2D observations $\{x_i\}_{i=1}^{N}$ in the RGB image. The 3D object keypoints in the RGB frame are given by $\tilde{X}_i^{\text{rgb}} = T \tilde{X}_i$, where $T$ is the transformation matrix corresponding to the instrument pose. The reprojection residuals for these points are:

$$L_{\text{pnp}} = \sum_{i=1}^{N} \left\| \pi\!\left(K_{\text{rgb}}, X_i^{\text{rgb}}\right) - x_i \right\|^2.$$

We also define two 3D points in the instrument frame, $S_1, S_2 \in \mathbb{R}^3$, at the two endpoints of the instrument shaft and lying along its centreline. These are projected into the RGB and stereo greyscale images. For the RGB image, we compute their projected 2D locations and use their perpendicular distances to the previously located shaft lines (left edge, centreline, right edge) $\ell_k^{\text{rgb}}$ in the image. The 3D shaft keypoints in the RGB frame are given by $\tilde{S}_j^{\text{rgb}} = T \tilde{S}_j$. The corresponding residual is $L_{\text{rgbline}}$, shown in Equation (1), where $d(\cdot,\cdot)$ denotes the perpendicular distance from a 2D point to a 2D line.

Given the intrinsic and extrinsic parameters of the left and right greyscale cameras, $(K_L, T_{\text{rgb}}^{L})$ and $(K_R, T_{\text{rgb}}^{R})$, respectively, the shaft points are transformed from the RGB camera frame to each greyscale camera frame as $\tilde{S}_j^{L} = T_{\text{rgb}}^{L} \tilde{S}_j^{\text{rgb}}$ and $\tilde{S}_j^{R} = T_{\text{rgb}}^{R} \tilde{S}_j^{\text{rgb}}$. We then project these points and compute their perpendicular distances to the known lines $g^L$ and $g^R$ in each greyscale image to obtain the term $L_{\text{greylines}}$ shown in Equation (2):

$$L_{\text{rgbline}} = \sum_{k=1}^{3} \sum_{j=1}^{2} d\!\left( \pi(K_{\text{rgb}}, S_j^{\text{rgb}}),\; \ell_k^{\text{rgb}} \right)^2 \qquad (1)$$
$$L_{\text{greylines}} = \sum_{j=1}^{2} \left[ d\!\left( \pi(K_L, S_j^{L}),\; g^L \right)^2 + d\!\left( \pi(K_R, S_j^{R}),\; g^R \right)^2 \right]. \qquad (2)$$

Finally, we include a term that prevents the optimizer from adjusting the translation too drastically with respect to the initial translation estimate $t_0$: $L_t = \left\| t - t_0 \right\|^2$.

The full objective function is then a weighted sum of all residuals, where $w_{\text{pnp}}, w_{\text{rl}}, w_{\text{gl}}, w_t$ are fixed scalar weights:

$$\min_{r,t}\; L_{\text{total}} = w_{\text{pnp}} L_{\text{pnp}} + w_{\text{rl}} L_{\text{rgbline}} + w_{\text{gl}} L_{\text{greylines}} + w_t L_t. \qquad (3)$$
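
A condensed sketch of this refinement with scipy.optimize.least_squares is shown below. For brevity it includes only the RGB terms of Equation (3); the greyscale term $L_{\text{greylines}}$ is analogous after applying the stereo extrinsics, and the weights here are placeholders:

```python
import cv2
import numpy as np
from scipy.optimize import least_squares

def point_line_dist(p, a, b):
    """Perpendicular distance from 2D point p to the line through a and b."""
    d = (b - a) / np.linalg.norm(b - a)
    v = p - a
    return abs(d[0] * v[1] - d[1] * v[0])

def refine_pose(rt0, X, x, S, lines_rgb, K_rgb, t0, w=(1.0, 1.0, 0.1)):
    """RGB-only sketch of Eq. (3). rt0: initial [rvec | t] from PnP;
    X: (N, 3) model keypoints with 2D observations x: (N, 2);
    S: (2, 3) shaft endpoints; lines_rgb: three fitted lines, each as a
    pair of 2D endpoints (centreline and both silhouette edges)."""
    def residuals(rt):
        rvec, t = rt[:3], rt[3:]
        proj_X, _ = cv2.projectPoints(X, rvec, t, K_rgb, None)
        r_pnp = w[0] * (proj_X.reshape(-1, 2) - x).ravel()       # L_pnp
        proj_S, _ = cv2.projectPoints(S, rvec, t, K_rgb, None)
        # Per Eq. (1): each projected endpoint scored against each line.
        r_line = [w[1] * point_line_dist(p, *ln)
                  for ln in lines_rgb for p in proj_S.reshape(-1, 2)]
        r_t = w[2] * (t - t0)                                    # L_t
        return np.concatenate([r_pnp, np.array(r_line), r_t])

    return least_squares(residuals, rt0, method="lm").x
```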

5. Experiments

We collected data to experimentally compare the tracking accuracy of our markerless tracking method to a marker‐based method similar to the one used in ARssist [5]. Our experiment was designed to eliminate the effect of robot kinematic error, allowing us to compare the tracking accuracy of both methods via tip position errors. Due to the difference in device hardware (HoloLens 1 vs. HoloLens 2) and robot hardware, we could not replicate the original marker‐based method exactly. However, our marker‐based tracking method was implemented to be as similar as possible.

Both our markerless tracking method and the marker‐based tracking method are single‐frame pose estimation methods. Despite having access to video streams, we do not use any temporal information or filtering in our comparison.

5.1. Marker Based Tracking

We used a single large ArUco [8] marker which was slightly larger in size compared to the ARToolKit [24] markers used by ARssist. We placed the marker such that it was rigidly attached to the instrument, and such that it would be visible in the same RGB images used by our markerless tracking method, so that we could compare the estimated tip position of each method for each frame directly. We also placed the marker closer to the instrument tip than the primary visual features on the instrument housing that our markerless tracking method uses (aside from the instrument shaft). This gives the marker‐based method a small advantage in that orientation errors will be magnified less at the instrument tip; however, it does not replicate the ARssist solution of attaching markers to the cannulas, which would further reduce the effect of orientation errors.

Our ArUco tracking follows standard methods outlined by OpenCV [25]. After detecting the ArUco marker in the image, we run corner sub‐pixel refinement, and then use the standard iterative PnP solver to find the marker pose.
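
For reference, the baseline follows standard OpenCV ArUco usage, roughly as sketched below; the dictionary, window size, and termination criteria are assumptions:

```python
import cv2
import numpy as np

def aruco_pose(gray, K, dist, marker_len_m):
    """Baseline marker pose: detect, refine corners to sub-pixel accuracy,
    then run iterative PnP on the four marker corners."""
    dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
    corners, ids, _ = cv2.aruco.ArucoDetector(dictionary).detectMarkers(gray)
    if ids is None:
        return None
    c = corners[0].reshape(-1, 1, 2).astype(np.float32)
    term = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.01)
    cv2.cornerSubPix(gray, c, (5, 5), (-1, -1), term)
    half = marker_len_m / 2.0
    obj = np.float32([[-half,  half, 0], [ half,  half, 0],   # TL, TR
                      [ half, -half, 0], [-half, -half, 0]])  # BR, BL
    ok, rvec, tvec = cv2.solvePnP(obj, c.reshape(-1, 2), K, dist,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    return (rvec, tvec) if ok else None
```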

5.2. Robot System

We use the surgical robot system to hold the robotic instrument and teleoperate it as it would normally be used. The robot has no modifications, and we do not require any kinematics data from the robot for this evaluation.

A typical robotic instrument of this style has 7 degrees of freedom, where the first three joints are responsible for the ‘gross’ positioning of the instrument end effector. These three joints make large motions of the robotic instrument which are visible by our tracking methods from outside the patient body. The fourth joint causes the instrument shaft to rotate about its centreline, however due to the rotational symmetry of the cylindrical shaft, this is visually imperceptible. The final three joints are responsible for the ‘fine’ positioning of the end effector. These joints make small motions of the instrument wrist and gripper, which are not visible from outside the patient and therefore cannot be tracked by external cameras. In the intended application, we would rely on the robot kinematics to obtain the joint positions of the final four joints and the true tip position.

In order to evaluate tip tracking accuracy from the visual tracking methods without depending on robot kinematics to compute the tip position, we used a 3D printed ‘instrument cap’ to fix the final three joints in place, which can be seen in Figure 4. The instrument cap itself is also rotationally symmetric, and has a tip that aligns with the centreline of the instrument shaft. Therefore, the rotation of the instrument shaft from the fourth robot joint will not affect the tip position. With this instrument cap, we effectively constrain the robot instrument tip to the 3 degrees of freedom corresponding to the first three joints of the robot's kinematic chain. By doing so, the tip is rigidly fixed to the visible parts of the instrument, and we can estimate the tip position purely from our tracking methods.

FIGURE 4.

Data collection setup showing the instrument with the instrument cap and the grid points.

5.3. Data Collection

The ground truth for our accuracy evaluation is a 6 × 6 grid with 20 mm spacing that was printed on paper using a laser printer. We placed the paper on a flat table, at a distance where it could fit inside the field of view of the endoscopic camera, as shown in Figure 4. We positioned the robot's remote centre of motion (RCM) such that the instrument could reach all 36 printed points on the grid. We then placed the HoloLens in a fixed position so that both tracking methods could locate the instrument when the tip was touching each of the grid points.

A user then teleoperated the robot to sequentially touch each grid point with the instrument cap. When the instrument was touching each grid point, we saved the RGB and greyscale images for offline processing of the pose estimation algorithms.

We collected images in eight trials: four trials with the instrument's left side visible and four trials with its right side visible. A different ArUco marker was placed on each side. The markers were oriented such that the left‐side marker's y‐axis was pointing to the right, while the right‐side marker's y‐axis was pointing up. Each set of four trials contained two different RCM positions of the instrument; two trials were recorded for each RCM pose.

The HoloLens camera was fixed throughout each trial, however it was repositioned, if only slightly, between trials. Therefore, each trial can be considered independent.

5.4. Data Analysis

Our goal is to evaluate the accuracy of the marker‐based and markerless methods by comparing the measured grid points to their known ground‐truth positions. This comparison requires the following calibrations: (1) the transformation between the measurement coordinate frame (HMD) and the grid coordinate frame (i.e., printed paper) and (2) the offset from the instrument coordinate frame to the instrument tip (tip offset).

Normally, the tip offset is obtained by a pivot calibration procedure, but that is not feasible when the instrument (with instrument cap) is moved by the robot, since the RCM constraint prevents the same point from being touched at multiple orientations. Instead, we used an optimization approach to estimate the tip offset from the available data. This approach takes advantage of the RCM constraint that ensures that the shaft always passes through a fixed point in space. We treat the RCM point and the direction of the physical shaft (not just the vector to the tip) as unknowns in the optimization problem.

We perform two different calibrations for each tracking method, one for the left side view of the instrument, and one for the right side view of the instrument. In total there are four batches of data: marker‐left, marker‐right, markerless‐left and markerless‐right. These correspond to the quadrants of Table 1. Our calibration method consists of a large non‐linear optimization problem which we solve iteratively.

TABLE 1.

Instrument tip accuracy (mm), 3D and 2D (in image plane), of markerless and marker methods, with outliers removed.

Trial            Error type    |  Marker method (mm)                             |  Markerless method (mm)
                               |  Mean   Median  Std    Min    Max    Outliers   |  Mean   Median  Std    Min    Max    Outliers
Right Trial 1    3D (XYZ)      |  3.52   2.73    2.10   0.87    9.34  3          |  3.25   2.85    1.86   0.52    9.34  0
                 Planar (XY)   |  2.15   1.99    1.54   0.31    9.34             |  2.07   1.83    1.18   0.52    5.18
Right Trial 2    3D (XYZ)      |  3.40   2.67    2.17   0.59    9.21  3          |  4.14   4.28    2.10   0.54   11.03  0
                 Planar (XY)   |  2.24   1.87    1.75   0.25    8.11             |  2.32   2.10    1.34   0.12    6.15
Right Trial 3    3D (XYZ)      |  4.68   4.66    2.94   0.36   12.60  0          |  4.04   3.11    3.23   0.77   17.39  2
                 Planar (XY)   |  3.18   2.81    2.02   0.29    8.03             |  1.91   1.69    1.05   0.45    6.17
Right Trial 4    3D (XYZ)      |  4.15   3.09    2.48   0.90    9.88  0          |  4.96   4.09    3.90   0.96   19.59  2
                 Planar (XY)   |  2.81   2.38    1.55   0.52    7.07             |  2.72   2.43    1.48   0.30    6.09
Right Overall    3D (XYZ)      |  3.96   3.14    2.51   0.36   12.60  6          |  4.09   3.22    2.94   0.52   19.59  4
                 Planar (XY)   |  2.61   2.30    1.78   0.25    9.34             |  2.25   1.98    1.31   0.12    6.17
Left Trial 1     3D (XYZ)      |  6.67   6.07    3.83   0.58   18.99  4          |  3.36   3.13    1.62   0.50    6.63  0
                 Planar (XY)   |  2.78   2.71    1.80   0.31    9.05             |  2.49   2.22    1.50   0.28    5.18
Left Trial 2     3D (XYZ)      |  7.18   6.87    4.45   0.83   18.51  2          |  3.39   2.81    1.51   1.31    6.29  0
                 Planar (XY)   |  2.72   2.54    1.74   0.61    8.32             |  2.67   2.30    1.62   0.48    5.76
Left Trial 3     3D (XYZ)      |  5.75   4.72    3.88   0.74   19.87  3          |  2.50   2.16    1.74   0.37    9.56  0
                 Planar (XY)   |  1.62   1.78    0.75   0.31    3.31             |  1.10   0.99    0.51   0.17    2.09
Left Trial 4     3D (XYZ)      |  5.23   4.12    3.22   0.67   12.99  4          |  2.32   1.92    1.51   0.36    7.25  0
                 Planar (XY)   |  1.46   1.27    0.84   0.10    3.52             |  1.00   0.83    0.59   0.18    2.44
Left Overall     3D (XYZ)      |  6.22   5.51    3.95   0.58   19.87  13         |  2.89   2.43    1.67   0.36    9.56  0
                 Planar (XY)   |  2.15   1.89    1.50   0.10    9.05             |  1.82   1.33    1.40   0.17    5.76
All Trials       3D (XYZ)      |  5.06   4.22    3.48   0.36   19.87  19         |  3.48   2.80    2.45   0.36   19.59  4
                 Planar (XY)   |  2.39   2.10    1.67   0.10    9.34             |  2.03   1.75    1.37   0.12    6.17

For each batch, we have the following batch‐wide unknowns: $P^{\text{tip}} \in \mathbb{R}^3$, the instrument tip coordinates in the instrument's frame (tip offset), and the unit vector $u^{\text{shaft}} \in \mathbb{R}^3$, $\|u^{\text{shaft}}\| = 1$, the shaft direction, also in the instrument's frame.

Within a batch, let $i \in \{1, \ldots, 4\}$ index the four trials. For each trial, let $j \in \{1, \ldots, 36\}$ index the touches collected. For each touch, we have the instrument pose $T_{i,j}^{\text{inst}} = [r_{i,j}^{\text{inst}} \mid t_{i,j}^{\text{inst}}]$.

We have additional unknowns for each trial: $P_i^{\text{rcm}}$, the location of the RCM point (which is mechanically constrained by the robot) in the RGB camera frame, and the 6D pose of the paper grid $T_i^{\text{grid}} = [r_i^{\text{grid}} \mid t_i^{\text{grid}}]$ in the RGB camera frame. Let $P_j^{\text{grid}}$ be the known grid points in the grid frame. The total number of unknowns is $5 + 9N$, where $N = 4$ is the number of trials.

The objective function is a weighted sum of the following terms:

$$L_{\text{grid}} = \sum_{i=1}^{4} \sum_{j=1}^{36} \left\| T_{i,j}^{\text{inst}} \tilde{P}^{\text{tip}} - T_i^{\text{grid}} \tilde{P}_j^{\text{grid}} \right\|^2 \qquad (4)$$
$$L_{\text{rcm}} = \sum_{i=1}^{4} \sum_{j=1}^{36} d\!\left( T_{i,j}^{\text{inst}} \tilde{P}^{\text{tip}},\; u^{\text{shaft}},\; P_i^{\text{rcm}} \right)^2 \qquad (5)$$
$$L_{\text{shaft}} = \left\| u^{\text{shaft}} - u_0^{\text{shaft}} \right\|^2 \qquad (6)$$

where $d(P^{\text{tip}}, u^{\text{shaft}}, P^{\text{rcm}})$ denotes the perpendicular distance from the 3D point $P^{\text{rcm}}$ to the 3D line defined by $P^{\text{tip}}$ and $u^{\text{shaft}}$, and $u_0^{\text{shaft}}$ is the initial guess for the shaft direction.

For initializing the parameters of this optimization, we were able to choose reasonably close values for $P^{\text{tip}}$ and $u_0^{\text{shaft}}$ based on the known geometry. For each trial, we found a guess for $P_i^{\text{rcm}}$ by finding the 3D point of best fit for the intersection of all 36 shaft lines. We found an initial guess for the grid pose $[r_i^{\text{grid}} \mid t_i^{\text{grid}}]$ by computing the Procrustes paired‐point registration between the tip locations and the grid points.
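
A single‐trial sketch of this calibration (Equations (4)–(6)) with SciPy is given below; the parameter packing, weights, and helper names are illustrative, and the full batch version stacks the per‐trial unknowns while sharing $P^{\text{tip}}$ and $u^{\text{shaft}}$:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def calibrate_tip(T_inst, grid_pts, theta0, w=(1.0, 1.0, 10.0)):
    """Single-trial tip calibration. theta packs the unknowns:
    [p_tip(3), u_shaft(3), p_rcm(3), r_grid(3), t_grid(3)].
    T_inst: (36, 4, 4) instrument poses in the camera frame per touch;
    grid_pts: (36, 3) known grid points in the grid frame."""
    u0 = theta0[3:6] / np.linalg.norm(theta0[3:6])

    def residuals(theta):
        p_tip, p_rcm = theta[0:3], theta[6:9]
        u = theta[3:6] / np.linalg.norm(theta[3:6])   # unit shaft direction
        R_grid = Rotation.from_rotvec(theta[9:12]).as_matrix()
        t_grid = theta[12:15]
        res = []
        for T, g in zip(T_inst, grid_pts):
            R, t = T[:3, :3], T[:3, 3]
            tip_cam = R @ p_tip + t
            res.extend(w[0] * (tip_cam - (R_grid @ g + t_grid)))  # Eq. (4)
            u_cam = R @ u                      # shaft line in camera frame
            v = p_rcm - tip_cam                # Eq. (5): RCM-to-line distance
            res.append(w[1] * np.linalg.norm(v - (v @ u_cam) * u_cam))
        res.extend(w[2] * (u - u0))            # Eq. (6): direction regularizer
        return np.array(res)

    return least_squares(residuals, theta0).x
```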

We handled outliers in the following manner for each trial. After the initial tip calibration, we use a RANSAC Procrustes paired point registration on half of the remaining points. We take the set of points with the lowest fiducial registration error (FRE), and remove outliers from the points not used for registration based on a threshold on their target registration error (TRE). The threshold for outliers was chosen to be 20 mm as that was the spacing of the grid points.

6. Results and Discussion

The results from our accuracy evaluation are summarized in Table 1, which reports the tip localization errors (in mm) for both the marker‐based and markerless methods across all trials. The 3D errors were computed as the Euclidean distance between the projected 3D tip positions and the known ground‐truth grid points after tip calibration and outlier removal, whereas the 2D (planar) errors were computed by projecting all points to the HoloLens 2 RGB camera image plane (orthographic projection to a plane orthogonal to the principal axis of the camera). Right corresponds to trials viewing the right side of the instrument (four trials, 144 touches total), and left corresponds to the left side (four trials, 144 touches total).

For the marker method, the combined mean error without outliers was 3.96 mm for right (across 138 points) and 6.22 mm for left (across 131 points), yielding an overall weighted mean of approximately 5.06 mm across all 269 inlier points. Median errors were 3.14 and 5.51 mm for right and left, respectively, with standard deviations of 2.51 and 3.95 mm indicating moderate variability. Maximum errors reached 19.87 mm in left, often due to partial marker occlusion or unfavourable viewing angles. Outlier rates were higher in left (9% of points removed) compared to right (4%), likely attributable to marker pose ambiguity, where the PnP solver found the wrong one of the two ambiguous solutions for a planar object.

In contrast, the markerless method exhibited a combined mean error without outliers of 4.09 mm in right (140 points) and 2.89 mm in left (144 points), resulting in an overall weighted mean of approximately 3.48 mm across 284 points—lower than the marker method despite fewer outliers overall (2% removal rate in right, 0% in left). Median errors were 3.22 and 2.43 mm, with standard deviations of 2.94 and 1.67 mm, suggesting greater consistency, particularly in left. Maximum errors were 19.59 mm in right, primarily from shaft segmentation failures or 2D image feature detection failures due to variations in lighting or observation point, but remained below 10 mm in left.

These results clearly demonstrate that our hybrid markerless method achieves comparable or superior tip localization accuracy to the marker‐based approach, with an overall combined mean error 31% lower than that of the marker‐based approach. Notably, the markerless method yielded lower mean errors in six out of eight individual trials. The number of outliers in the markerless trials is roughly one‐third that of the marker‐based approach, indicating greater overall stability for the markerless method. A similar set of comparisons can be performed for the planar errors; in short, the markerless method had a lower mean, median, and standard deviation than the marker method for overall planar error. Although the markerless method is more susceptible to pose ambiguity (owing to the coplanar and considerably smaller feature points detected on the instrument compared to those on markers), the shaft refinement step typically converges to the correct solution even when initialized with an incorrect pose, thereby accounting for the reduced number of outliers. The performance disparity between groups highlights method‐specific sensitivities: the marker method struggled more on the left side, possibly due to the ArUco orientation affecting corner detection, while the markerless method benefited from robust shaft line fitting in the multi‐view optimization, excelling on the left side. Both methods maintained sub‐5 mm median errors in most trials, sufficient for AR visualizations in robotic surgery, where the assistant surgeon requires approximate instrument guidance for spatial awareness rather than sub‐millimetre precision for navigation. One surgeon who used our system subjectively estimated that a tip‐tracking error up to 10 mm would be tolerable for instrument insertion when aiming at the centre of a 'safe' zone in the endoscope frustum.

7. Conclusions and Future Work

In this work, we presented a novel markerless tracking method for robotic surgical instruments using the onboard cameras of a Microsoft HoloLens 2 head‐mounted display, aimed at enabling seamless augmented reality applications for the assistant surgeon in minimally‐invasive robotic surgery. By integrating off‐the‐shelf deep learning models (YOLOv11 for bounding box detection and SAMURAI for segmentation) with classical geometric techniques (PnP for initial pose estimation and multi‐view least‐squares optimization for refinement), our hybrid approach recovers full 6‐DoF poses with low latency and without requiring fiducial markers, addressing key workflow disruptions associated with marker‐based systems.

Experimental evaluations demonstrated that our markerless method achieves comparable or superior tip localization accuracy to a marker‐based (ArUco) baseline, with an overall combined mean error approximately 31% lower and with fewer outliers across eight trials. This performance, coupled with sub‐5 mm median errors, supports its suitability for AR visualizations where approximate instrument guidance suffices for bedside tasks.

Our contributions advance markerless tracking in medical augmented reality by leveraging interpretable primitives to enhance robustness while requiring minimal training data, thereby facilitating practical clinical integration. Future work will extend to dynamic multi‐instrument scenarios, including the tracking of handheld instruments and the real‐time display of endoscopic video to better assist the surgical team during surgical procedures.

Author Contributions

Nicholas Greene: conceptualization, formal analysis, methodology, software, validation, writing – original draft, writing – review and editing. Aoqi Long: conceptualization, formal analysis, methodology, software, validation, writing – original draft, writing – review and editing. Yonghao Long: resources, validation, writing – review and editing. Zheng Han: resources, validation, writing – review and editing. Qi Dou: project administration, resources, writing – review and editing. Peter Kazanzides: conceptualization, funding acquisition, project administration, supervision, writing – review and editing.

Funding

This study was supported in part by the InnoHK initiative of the Innovation and Technology Commission of the Hong Kong Special Administrative Region Government.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgements

The authors thank Cornerstone Robotics for providing access to its Sentire Surgical System. The authors are also grateful to Wenkai Luo and Derek Hang Chun Kwok for technical support. The authors thank Dr. Jeremy Teoh, Director of Urology Center, and Director of Robotic Services, at the Chinese University of Hong Kong Medical Center for testing our system and providing valuable feedback.

Greene N., Long A., Long Y., Han Z., Dou Q., and Kazanzides P., “Markerless Tracking of Robotic Surgical Instruments With Head Mounted Display for Augmented Reality Applications.” Healthcare Technology Letters 12, no. 1 (2025): e70044. 10.1049/htl2.70044

Data Availability Statement

Research data are not shared.

References

  • 1. Lanfranco A. R., Castellanos A. E., Desai J. P., and Meyers W. C., “Robotic Surgery: A Current Perspective,” Annals of Surgery 239, no. 1 (2004): 14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Sgarbura O. and Vasilescu C., “The Decisive Role of the Patient‐Side Surgeon in Robotic Surgery,” Surgical Endoscopy 24, no. 12 (2010): 3149–3155. [DOI] [PubMed] [Google Scholar]
  • 3. Potretzke A. M., Knight B. A., Brockman J. A., et al., “The Role of the Assistant During Robot‐Assisted Partial Nephrectomy: Does Experience Matter?,” Journal of Robotic Surgery 10, no. 2 (2016): 129–134. [DOI] [PubMed] [Google Scholar]
  • 4. Nayyar R., Yadav S., Singh P., and Dogra P. N., “Impact of Assistant Surgeon on Outcomes in Robotic Surgery,” Indian Journal of Urology (IJU) 32, no. 3 (2016): 204. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Qian L., Deguet A., and Kazanzides P., “ARssist: Augmented Reality on a Head‐Mounted Display for the First Assistant in Robotic Surgery,” Healthcare Technology Letters 5, no. 5 (2018): 194–200. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Qian L., Deguet A., Wang Z., Liu Y.‐H., and Kazanzides P., “Augmented Reality Assisted Instrument Insertion and Tool Manipulation for the First Assistant in Robotic Surgery,” in IEEE International Conference on Robotics and Automation (ICRA) (IEEE, 2019), 5173–5179. [Google Scholar]
  • 7. Yang C., Zhao Z., and Hu S., “Image‐Based Laparoscopic Tool Detection and Tracking Using Convolutional Neural Networks: A Review of the Literature,” Computer Assisted Surgery 25, no. 1 (2020): 15–28. [DOI] [PubMed] [Google Scholar]
  • 8. Garrido‐Jurado S., Muñoz‐Salinas R., Madrid‐Cuevas F. J., and Marín‐Jiménez M. J., “Automatic Generation and Detection of Highly Reliable Fiducial Markers Under Occlusion,” Pattern Recognition 47, no. 6 (2014): 2280–2292. [Google Scholar]
  • 9. Heining S.‐M., Raykov V., Wolff O., Alkadhi H., Pape H.‐C., and Wanner G. A., “Augmented Reality‐Based Surgical Navigation of Pelvic Screw Placement: An Ex‐Vivo Experimental Feasibility Study,” Patient Safety in Surgery 18, no. 1 (2024): 3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Gsaxner C., Li J., Pepe A., Schmalstieg D., and Egger J., “Inside‐Out Instrument Tracking for Surgical Navigation in Augmented Reality,” in ACM Symposium on Virtual Reality Software and Technology (ACM, 2021), 1–11. [Google Scholar]
  • 11. Martin‐Gomez A., Li H., Song T., et al., “STTAR: Surgical Tool Tracking Using off‐the‐Shelf Augmented Reality Head‐Mounted Displays,” IEEE Transactions on Visualization and Computer Graphics 30, no. 7 (2023): 3578–3593. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Doughty M. and Ghugre N. R., “HMD‐EgoPose: Head‐Mounted Display‐Based Egocentric Marker‐Less Tool and Hand Pose Estimation for Augmented Surgical Guidance,” International Journal of Computer Assisted Radiology and Surgery 17, no. 12 (2022): 2253–2262. [DOI] [PubMed] [Google Scholar]
  • 13. Hein J., Cavalcanti N., Suter D., et al., “Next‐Generation Surgical Navigation: Marker‐Less Multi‐View 6DoF Pose Estimation of Surgical Instruments,” Medical Image Analysis 103 (2025): 103613. [DOI] [PubMed] [Google Scholar]
  • 14. Ungureanu D., Bogo F., Galliani S., et al., “Hololens 2 Research Mode as a Tool for Computer Vision Research,” preprint, arXiv, August 25, 2020, 10.48550/arXiv.2008.11239. [DOI]
  • 15. Dibene J. C. and Dunn E., “HoloLens 2 Sensor Streaming,” preprint, arXiv, November 4, 2022, 10.48550/arXiv.2211.02648. [DOI]
  • 16. Nguyen N. V., Tyree S., Guo A., et al., “Bop Challenge 2024 on Model‐Based and Model‐Free 6D Object Pose Estimation,” preprint, arXiv, April 23, 2025, 10.48550/arXiv.2504.02812. [DOI]
  • 17. Greene N., Luo W., and Kazanzides P., “dVPose: Automated Data Collection and Dataset for 6D Pose Estimation of Robotic Surgical Instruments,” in IEEE International Symposium on Medical Robotics (ISMR) (IEEE, 2023), 1–7. [Google Scholar]
  • 18. Jocher G. and Qiu J., “Ultralytics Yolo11,” Github, published October 1, 2024, https://github.com/ultralytics/ultralytics.
  • 19. Ravi N., Gabeur V., Hu Y.‐T., et al., “SAM 2: Segment Anything in Images and Videos,” preprint, arXiv, October 28, 2024, https://arxiv.org/abs/2408.00714.
  • 20. Yang C.‐Y., Huang H.‐W., Chai W., Jiang Z., and Hwang J.‐N., “SAMURAI: Adapting Segment Anything Model for Zero‐Shot Visual Tracking With Motion‐Aware Memory,” preprint, arXiv, November 30, 2024, 10.48550/arXiv.2411.11922. [DOI]
  • 21. Douglas D. H. and Peucker T. K., “Algorithms for the Reduction of the Number of Points Required to Represent a Digitized Line or Its Caricature,” Cartographica: The International Journal for Geographic Information and Geovisualization 10, no. 2 (1973): 112–122. [Google Scholar]
  • 22. Collins T. and Bartoli A., “Infinitesimal Plane‐Based Pose Estimation,” International Journal of Computer Vision 109, no. 3 (2014): 252–286. [Google Scholar]
  • 23. Lepetit V., Moreno‐Noguer F., and Fua P., “EPnP: An Accurate O(n) Solution to the PnP Problem,” International Journal of Computer Vision 81, no. 2 (2009): 155–166. [Google Scholar]
  • 24. Kato H. and Billinghurst M., “Marker Tracking and HMD Calibration for a Video‐Based Augmented Reality Conferencing System,” in IEEE/ACM International Workshop on Augmented Reality (IWAR) (IEEE, 1999), 85–94. [Google Scholar]
  • 25. Bradski G., Kaehler A., et al., “OpenCV,” Dr. Dobb's Journal of Software Tools 3, no. 2 (2000): 120–126. [Google Scholar]


