Abstract
The neuroendoscopic approach to deep-brain targets imparts deformation of the ventricles and adjacent parenchyma, limiting the accuracy of conventional neuronavigation. We report a method for 3D endoscopic reconstruction and registration via simultaneous localization and mapping (SLAM) for real-time guidance with or without robotic assistance. The aim is to permit augmented video overlay of structures registered from preoperative or intraoperative 3D images within and beyond the endoscopic field of view for more accurate targeting in the presence of deep-brain deformation. Phantom studies were performed to evaluate geometric accuracy and uncertainty in distinct scenarios of limited data (feature sparsity and scene occlusion), demonstrating performance over a broad range of challenges to endoscopic data. Reconstruction and registration accuracy were maintained even with up to 40% loss in feature density or 120° of the visual scene occluded. Overall, the method achieved a high degree of geometric accuracy, with target registration error of 1.02 mm and runtime supporting real-time guidance (3.45 Hz, representing a >16× speedup with the SLAM approach compared to previous work). The studies establish quantitative performance characteristics and validation that are essential to future translation to clinical studies.
Index Terms—Image-guided surgery, intraoperative imaging, computer vision, augmented reality, neurosurgery, simultaneous localization and mapping
I. Introduction
NEUROENDOSCOPY enables a minimally invasive approach with direct visualization of deep-brain targets about the cerebral ventricles for a variety of diagnostic and therapeutic procedures, including biopsy, cyst resection/drainage, ventriculostomy for hydrocephalus, and emerging approaches for deep-brain stimulation (DBS) [1], [2], [3], [4]. Transventricular neuroendoscopy is particularly helpful in accessing structures in proximity to the lateral and third ventricles, which can be difficult to reach via lateral or transsphenoidal endoscopy [5]. An important consideration for high-precision neuronavigation, however, is geometric deformation due to brain shift and change in cerebrospinal fluid (CSF) volume in the course of endoscopic approach. Brain shift is associated primarily with deformation of the cortical surface and peripheral regions of brain parenchyma (e.g., ~5–24 mm [6] upon incision of the dura) and deformation in periventricular structures (~4–10 mm [7] due to CSF volume changes). Precise target localization can be challenging even with direct neuroendoscopic visualization due, for example, to inconspicuous appearance, occlusion in hard-to-reach recesses of the ventricles, and presentation beyond the visible surface of the ventricles.
Conventional neuronavigation relies on rigid registration to preoperative imaging and is limited in its ability to resolve such deformations. Traditional approaches employ a stereotactic frame [8] to co-register the patient coordinates with 3D image data. Both framed and frameless navigation systems [9] using optical or electromagnetic (EM) tracking similarly employ rigid registration with preoperative imaging via correspondence of fiducial markers [10] or surfaces [11]. The same is true of robotic systems for tool stabilization, cooperative control, and trajectory positioning in neurosurgery and neuroendoscopy [12] – for example, systems such as the Neuromate® (Renishaw, Wotton-under-Edge, United Kingdom) for placement of DBS electrodes [13], the Excelsius® (Globus Medical, Audubon, PA) for spine and brain surgery [14], the Galen Surgical Platform (Galen Robotics, Baltimore, MD) for ENT microsurgery [15], and the ROSA ONE® Brain (Zimmer Biomet, Warsaw, IN) for DBS and stereoelectroencephalography (SEEG) [10]. Such tracking and robotic assistance technologies have provided major strides for confident localization of intracranial targets; however, the reliance on rigid registration to preoperative images fundamentally limits the accuracy with which such systems can be expected to perform in the presence of brain deformation.
Intraoperative magnetic resonance (MR) imaging can provide exquisite depiction of intracranial structures and has demonstrated benefit in a variety of neurosurgical procedures [16]; however, limited field strength (associated with long scan times and/or poor image quality), relatively high operational costs, workflow disruptions, and the need for MR-compatible instrumentation and site requirements challenge its widespread use [17], [18]. Intraoperative computed tomography (CT) [19], [20] and cone-beam CT (CBCT) [21] offer high spatial resolution and clear visualization of bone and intravascular/intraarterial contrast. Recent developments in 3D image reconstruction and artifact correction algorithms [22] aim to improve the soft-tissue contrast resolution in CBCT to a level supporting visualization of gray/white matter parenchyma, the cerebral ventricles, and hemorrhage [23]. Intraoperative ultrasound imaging is an effective means to visualize intracranial structures, including cysts, tumors, and hemorrhage [24]. While it offers advantages of real-time visualization and versatility, the technique is not without challenges, including operator variability and difficulties with image quality, artifacts, and image interpretation [25]. Despite these obstacles, ultrasound image guidance remains a valuable tool for neurosurgeons in the intraoperative setting.
Alternatively, investigators have sought to address challenges of brain shift via biomechanical modeling to predict tissue deformation and provide more accurate, model-based guidance. Early work simulated brain shift due to CSF loss [6], and later work enriched the deformation models by accounting for the material characteristics of various tissues [24]. Recent work employs intraoperative measurements of geometric shift to guide the biomechanical models [25], as investigated using laser range scanners [26] and stereo cameras [27]. More generally, multi-modality image-based deformable registration is an active area of ongoing research, including physics-based [28] and neural network-based [29] solutions aiming to geometrically resolve preoperative and intraoperative imaging in the presence of brain shift.
Computer vision-based registration methods pertinent to neurosurgery have been used to support tracking and guidance in a variety of interventional procedures. Previous work reported video and tracker augmented guidance [30] in bronchoscopy for lung biopsy, wedge resection, and lymph node staging via registration to preoperative CT. Similarly, in laparoscopy, video tracking of deformable tissue surfaces has been demonstrated in liver surgery [31], with some using models for motion compensation and tissue biomechanics [32]. Other work examines whole stomach 3D reconstruction using monocular chromoendoscopic imaging [33] for gastric navigation and lesion localization. Endoscopic video registered to preoperative CT or MR images has also been used to guide trans-sphenoid skull-base surgery [34] for treatment of skull base tumors. Other work demonstrates video-to-CT registration for navigation in rhinoscopy [35] to better localize nerves and arteries and thus monitor deformation over the course of the surgery.
Robot-assisted ventriculoscopy (RAV) is a promising technology that combines the precision of robotic positioning with vision-based guidance to address the challenges posed by deep-brain deformation. In current clinical implementation, a RAV system can provide stable alignment and positioning of the ventriculoscope [36] along a planned trajectory relative to the cranium. More advanced RAV implementations aim to reconstruct and register the neuroendoscopic video scene with 3D imaging and planning data for up-to-date guidance [37]. Registration of the video scene with intraoperative imaging (MR, CT, CBCT, or ultrasound) – in turn deformably registered to preoperative MR and planning data – provides the potential to overcome geometric errors associated with deep-brain deformation. Augmentation of the video scene with deformably registered planning data could thereby provide accurate guidance and visualization beyond the visible endoscopic scene and assist in localizing inconspicuous targets about the lateral and third ventricles even in the presence of deformation.
3D reconstruction of the video scene can be performed via two prevalent classes of computer vision algorithms – structure-from-motion (SfM), and simultaneous localization and mapping (SLAM). SfM uses a collection of unstructured sets of images to estimate camera poses and reconstruct a point-cloud representation by backprojection of covisible features across multiple views. Due to the lack of structure during image acquisition, such an approach is best suited to applications for which a one-time, offline reconstruction is sufficient. SLAM, on the other hand, uses a sequence of video frames to estimate camera poses in a local window, followed by simultaneous reconstruction for tracking. The constrained problem of frame-to-frame camera localization makes SLAM potentially suitable to online, real-time application. Previous work demonstrated the capability of both SfM and SLAM for 3D video reconstruction and visual tracking in endoscopy [38]. Moreover, recent advances in deep learning methods promise to extend such capability further – e.g., neural networks for dense depth estimation and reconstruction in sinus endoscopy [39] along with appearance and/or geometric priors [40].
Previous work by the authors [37] demonstrated the initial development of a RAV prototype for intracranial neurosurgery, including geometric calibration and the ability to use the robot pose for improved 3D video reconstruction using SfM. The system demonstrated accuracy of 1.2 mm in 3D reconstruction and registration but was limited in its runtime to 4.9 s per frame. The work reported below aims to develop SLAM for 3D video reconstruction more suitable to practical application in neuroendoscopy by providing real-time guidance, following early work describing parameter selection [41]. As a precursor to future clinical studies, a series of experiments is reported in rigorous, systematic development and evaluation of the method in laboratory studies. Such work provides essential guidance and validation of the algorithm, rigorous analysis of parameter selection, geometric accuracy, runtime capability, and characterization of uncertainties in system performance that are requisite to clinical translation. Specific contributions of the work include first reporting of: (1) development of a preclinical mobile platform for real-time RAV with SLAM video reconstruction; (2) investigation of SLAM parameter sensitivity and optimal/nominal parameter selection; (3) experimental characterization of reconstruction and registration uncertainty associated with limited data (viz., feature sparsity and regional/spatial ablation); and (4) evaluation of geometric accuracy and runtime in comparison to SfM.
II. Methods
A. System for Robot-Assisted Ventriculoscopy
1). System Components:
The preclinical RAV prototype illustrated in Fig. 1 was designed as a platform for 3D video reconstruction and registration integrated with intraoperative imaging and navigation. The RAV was implemented on a mobile cart for tableside positioning superior to the patient, as shown in Fig. 1b. The ventriculoscope (Lotta 28164LA 6°, Karl Storz, Tuttlingen, Germany) was coupled to a camera head and console (AIM1588, Stryker, Kalamazoo, MI, USA), with video acquired at fixed (1X) zoom and focus at 1920 × 1080 pixel resolution and 10 frames per second (fps). Video was captured with a frame grabber (USB2DVIE 3.0, Epiphan Systems, Ottawa, Canada) on a computer workstation (Dell Precision T7600 with Intel Xeon E5-2620, 32 GB memory, and NVIDIA GTX 960Ti GPU). The ventriculoscope was rigidly coupled to a robotic arm (UR3e, Universal Robots, Denmark) with a pose repeatability of ±0.03 mm (per UR3e specifications) to ensure repeatable, consistent data acquisition and ease of experimentation. Previous work [37] demonstrated advantages in geometric accuracy and runtime by exploiting robot pose information in 3D point-cloud reconstruction, but robot pose was not used in the current work, which focused on aspects of a reconstruction algorithm that could translate to either robot-assisted or freehand use. The O-arm™ (Medtronic, Littleton, MA, USA) was used for intraoperative CBCT with standard cranial scan protocols and algorithms for high-quality 3D image reconstruction sufficient for visualization of the ventricles [42]. A surgical tracker (StealthStation™, Medtronic, Boulder, CO, USA) was used to track the ventriculoscope using a rigidly attached dynamic reference frame (DRF, SureTrak™) and a reference marker attached to the cranium (shown in Fig. 1b on a head phantom; LiquiPhil, The Phantom Lab, Greenwich, NY, USA). A custom 3D-printed anthropomorphic ventricle phantom with simulated neurovasculature is shown inside the cranial phantom in Fig. 1b and in the example video frame in Fig. 1c.
Fig. 1. Preclinical system for robot-assisted 3D neuroendoscopy. (a) Systems for intraoperative CBCT imaging, surgical tracking, and robot-assisted ventriculoscopy. (b) Positioning of the robot end effector and neuroendoscope in relation to the head phantom. (c) Neuroendoscopic view of the anthropomorphic ventricle phantom, showing the lateral ventricles, Foramen of Monro, simulated vasculature, and BB target points.
2). Coordinate Frames, Transforms, and Calibration:
Coordinate frames and transformations pertinent to the RAV system are shown in Fig. 2, taking the surgical tracker as the world frame. The camera frame has its origin at the center of the endoscope lens. The DRF attached to the endoscope is tracked relative to the reference marker. Following CBCT acquisition, the ventricular surface was segmented in CBCT coordinates. Each transform is a homogeneous rigid transformation locating one frame with respect to another. Intraoperatively, the tracking system yields the poses of the DRF and reference marker, and consequently their relative transform. The endoscope is first localized in the tracker frame using a one-time, rigid hand-eye calibration. The CBCT image is registered with the reference frame automatically by the StealthStation™ tracker, which observes the reference marker during CBCT acquisition. This registration is used as a one-time initialization to localize the CBCT surface with respect to the camera. Preoperative imaging (CT and/or MR) can be registered to intraoperative CBCT via image registration (CBCT-to-CT [43] or CBCT-to-MR [29]) or co-registered fiducial points. As detailed below, SLAM (alternatively, SfM) was used to reconstruct a 3D point-cloud representation of the ventricles from endoscopic video, defining the point-cloud coordinate frame. Real-time camera tracking is achieved relative to the phantom, and the point cloud is registered with the CBCT surface. The transformation chain thereby yields guidance and augmented visualization of 3D structures from the CBCT reference frame directly in the endoscopic view.
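The transformation chain described above can be sketched with homogeneous 4×4 matrices. The frame abbreviations and variable names below (`T_w_d`, `T_d_c`, etc.) are hypothetical stand-ins for the notation of Fig. 2, and random rigid transforms stand in for actual tracker readings:

```python
import numpy as np

def make_transform(R, t):
    """Build a 4x4 homogeneous rigid transform from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def random_rotation(rng):
    """Random rotation for illustration (QR of a Gaussian matrix, sign-corrected)."""
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1
    return Q

rng = np.random.default_rng(0)
# Convention: T_x_y maps homogeneous points expressed in frame y into frame x.
# Hypothetical frames: w = tracker/world, d = DRF, c = camera, r = reference marker, i = CBCT.
T_w_d = make_transform(random_rotation(rng), rng.standard_normal(3))  # tracker reading of DRF
T_w_r = make_transform(random_rotation(rng), rng.standard_normal(3))  # tracker reading of reference
T_d_c = make_transform(random_rotation(rng), rng.standard_normal(3))  # one-time hand-eye calibration
T_r_i = make_transform(random_rotation(rng), rng.standard_normal(3))  # reference-to-CBCT registration

# Camera pose in CBCT coordinates via the transformation chain:
T_i_c = np.linalg.inv(T_r_i) @ np.linalg.inv(T_w_r) @ T_w_d @ T_d_c
```

Composing the chain once per tracker update localizes the camera in CBCT coordinates for overlay of segmented structures.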
Fig. 2. Coordinate frames and transformations for video reconstruction and registration. The hand-eye transformation (gray) is a one-time calibration performed preoperatively. The reference-to-CBCT transform (gray) is automatically obtained using StealthStation™ and used for a one-time initialization. All other transformations are dynamically updated for real-time registration of the 3D endoscopic point cloud and intraoperative CBCT.
B. Point-Cloud Reconstruction and Registration
A SLAM approach was used for real-time video tracking of the interior ventricular surface, with algorithmic components summarized in Fig. 3. Section II–B1 details the data acquisition and distortion correction procedure, followed by simultaneous tracking (feature correspondence and localization) and mapping (sparse reconstruction), summarized as follows. First, a one-time system calibration is performed to estimate the camera intrinsics and the tracker-to-camera transformation. Ventriculoscopic video data are input and processed on a per-frame basis after correction for lens distortion. For each image, features are extracted and matched with features in consecutive frames to yield an approximate relative camera pose and thus track the ventriculoscope. Salient frames observed during tracking are designated as keyframes according to duration elapsed since the last keyframe and minimum visual-change conditions [44] for further mapping. Simultaneously, the mapping thread maintains a pose graph [45], [46] whose nodes correspond to the keyframe pose estimates and whose edges represent relative camera-pose constraints. Triangulation of matched features yields a point-cloud representation of the scene, and a local bundle adjustment is performed to yield an optimized point cloud along with further improved relative poses that minimize reprojection error. Finally, as described in Section II–B2, the resulting point cloud is filtered and registered with ground truth, enabling real-time overlay of target structures in the ventriculoscopic view.
Fig. 3. Algorithmic pipeline illustrating preprocessing along with simultaneous feature detection/pose prediction (“Tracking” – blue) and 3D point-cloud reconstruction (“Mapping” – green). The resulting point cloud is registered to intraoperative image coordinates for overlay of CT or MR image/planning data in the video scene. The resulting transform registers the 3D point cloud to the CBCT surface (ventricle segmentation) and maps data defined in CBCT coordinates directly to the endoscopic view; alternatively, it maps the camera pose to CBCT coordinates for 3D navigation.
1). Real-Time Point-Cloud Reconstruction via SLAM:
SLAM-based methods offer potential advantages for continuous, real-time tracking and reconstruction, as opposed to processing unstructured image sets as reported previously with SfM methods, which may be more appropriate for one-time, offline reconstruction of the scene. The SLAM pipeline summarized in Fig. 3 was developed to enable real-time tracking of camera pose along with an iteratively updated 3D point-cloud reconstruction of the visual scene. In the current work, the terms ‘localization’ and ‘mapping’ are used interchangeably with ‘tracking’ and ‘reconstruction’, respectively. A parallel architecture [47] combined with the use of keyframes for keypoint backprojection and mapping presents an important advantage in computational cost for real-time tracking. Similarly, bundle adjustment performed in a local window of frames gives a substantial reduction in computational cost. The PySLAM v2 library [48] along with OpenCV [49] and TREK [50] were used for the reconstruction pipeline.
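The parallel tracking/mapping architecture can be illustrated with a minimal producer-consumer sketch. This is not the PySLAM implementation; a simple frame-count rule stands in for the actual keyframe criterion, and the queue and thread names are illustrative:

```python
import queue
import threading

# Keyframes pass from the tracking thread (producer) to the mapping thread (consumer).
keyframe_queue = queue.Queue()
done = object()  # sentinel signaling end of the video sequence

def tracking(frames):
    """Tracking thread: per-frame processing; inserts keyframes for mapping."""
    for i, frame in enumerate(frames):
        # ... feature detection, matching, and pose prediction would go here ...
        if i % 20 == 0:  # stand-in for the visual-change keyframe decision
            keyframe_queue.put(frame)
    keyframe_queue.put(done)

mapped = []

def mapping():
    """Mapping thread: consumes keyframes as they arrive."""
    while True:
        kf = keyframe_queue.get()
        if kf is done:
            break
        # ... triangulation and local bundle adjustment would go here ...
        mapped.append(kf)

t = threading.Thread(target=tracking, args=(list(range(100)),))
m = threading.Thread(target=mapping)
t.start(); m.start()
t.join(); m.join()
```

Decoupling the two threads via a queue is what allows per-frame tracking to run at video rate while the costlier mapping work proceeds only on keyframes.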
Neuroendoscopic video was acquired as a series of video frames. The RAV setup of Fig. 1 was used to avoid freehand tremor and motion artifacts and to reproducibly focus the analysis on the geometric accuracy of the SLAM vs. SfM algorithms. Linear motion along the axis of the endoscope was executed via the robot in a trajectory analogous to clinical approach between the lateral and third ventricles via the Foramen of Monro. Previous work demonstrated the ability to perform 3D video reconstruction for alternative motion trajectories and described the preoperative calibration procedure for estimation of camera intrinsics and correction of lens distortion effects [51]. Hand-eye calibration was used to estimate the rigid transform localizing the camera coordinate frame with respect to the DRF.
As illustrated in the upper arm (blue) of Fig. 3, the tracking thread performs the tasks of feature detection and pose prediction in sequential camera frames. For each video frame, Speeded-Up Robust Features (SURF) [52] were used to establish correspondences between sequential images and estimate homography. For each gray-scale-converted video frame, SURF features were extracted, each comprising a feature descriptor at a pixel location. A Hessian blob detector over precomputed integral images in scale space [53] was used to identify interest points. The determinant of the Hessian matrix at interest points was then thresholded to limit the maximum number of keypoints. For this purpose, a series of response maps was obtained by convolving the input image with filters of increasing size to construct the scale space, divided into octaves and scale-level subdivisions, with the corresponding feature detection runtime recorded. The SURF-128 extended feature descriptor at each keypoint location was obtained by computing 4D Haar wavelet responses over 4 × 4 subregions of the image. The descriptors are computed separately for wavelet responses <0 and ≥0, resulting in a 128D descriptor for each keypoint location.
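Selecting the strongest keypoints by thresholding the Hessian detector response can be sketched as follows; this is a simplified stand-in for the OpenCV SURF implementation, with illustrative response values and keypoint budget:

```python
import numpy as np

def select_strongest(responses, n_keypoints):
    """Keep the indices of the n strongest detector responses
    (e.g., determinant-of-Hessian values at candidate interest points)."""
    order = np.argsort(responses)[::-1]      # sort descending by response strength
    return np.sort(order[:n_keypoints])      # return kept indices in original order

# Hypothetical detector responses at five candidate interest points:
resp = np.array([0.2, 0.9, 0.1, 0.7, 0.5])
idx = select_strongest(resp, 3)              # keeps the three strongest candidates
```

Capping the keypoint count this way bounds the per-frame cost of descriptor extraction and matching, which is the quantity varied in Experiment 1.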
The resulting features were matched to refine tracked camera poses and assess visual saliency for the keyframe decision. First, a constant-velocity motion model was used to predict relative camera poses, with the initial relative pose defined as the identity matrix. Features were then projected using the approximate pose over the subsequent image to perform a guided search for matches. A Fast Library for Approximate Nearest Neighbors (FLANN) matcher [54] with a norm-distance metric was used to establish a match matrix describing the strength of correspondence between features in consecutive images. A pair of interest points is matched if the descriptor distance is closer than a distance ratio of the second-nearest neighbor, resulting in successful matches. Finally, if the current frame tracks fewer than a fraction of keypoints compared to the previous keyframe, and at least a minimum number of frames have passed since the last keyframe insertion, the frame is designated a keyframe and inserted into a pose graph for the mapping thread.
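A minimal sketch of the ratio-test matching and keyframe rule described above, assuming a precomputed descriptor-distance matrix in place of FLANN; the parameter names (`ratio`, `alpha`, `min_frames`) are illustrative:

```python
import numpy as np

def ratio_test_matches(dist, ratio=0.7):
    """Lowe-style ratio test: accept a match if the best descriptor distance is
    closer than `ratio` times the second-best. `dist` is an (N_query, N_train)
    matrix of descriptor distances (illustrative stand-in for FLANN output)."""
    matches = []
    for i, row in enumerate(dist):
        order = np.argsort(row)
        best, second = row[order[0]], row[order[1]]
        if best < ratio * second:
            matches.append((i, int(order[0])))
    return matches

def is_keyframe(tracked, total, frames_since_last, alpha=0.9, min_frames=20):
    """Keyframe rule sketched from the text: insert a keyframe when fewer than a
    fraction `alpha` of keypoints are still tracked AND at least `min_frames`
    frames have elapsed since the last keyframe insertion."""
    return (tracked < alpha * total) and (frames_since_last >= min_frames)
```

The ratio test rejects ambiguous correspondences (two near-equidistant candidates), which is important in the ventricles where repetitive texture can produce spurious matches.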
In parallel with the tracking thread, a mapping thread (lower arm in Fig. 3, marked in green) operates on the pose graph to observe inserted keyframes and reconstruct the 3D scene. For each inserted keyframe, matched features are triangulated using the approximate camera poses obtained by tracking to estimate a coarse point-cloud reconstruction. A local bundle adjustment routine is then used to optimize the current keyframe pose along with other keyframes sharing matches in a local window, jointly refining the approximate camera poses ($T_j$) and coarse point cloud ($P = \{p_i\}$) by minimizing the net reprojection error (RPE):

$\{\hat{T}_j, \hat{P}\} = \arg\min_{\{T_j\},\,P} \sum_{i,j} \mathrm{RPE}(p_i, T_j)$  (1)
The reprojection error (RPE) for a 3D point ($p_i$) obtained by triangulation of corresponding feature pixel locations is:

$\mathrm{RPE}(p_i, T_j) = \left\| x_{ij} - \pi\left( K\, T_j\, p_i \right) \right\|^2$  (2)

where the feature, $x_{ij}$, is detected in image $j$ with estimated pose, $T_j$, using the calibrated camera intrinsics matrix, $K$, and $\pi(\cdot)$ denotes perspective projection. The resulting refined point cloud and precise camera poses are obtained within a scale factor for registration. To reduce computational load, redundant keyframes are detected and culled: all keyframes in which more than a fraction of keypoints have been observed in at least 3 other keyframes are marked as redundant and removed. Further, outlier points in the reconstruction are detected and removed based on thresholding of the triangulation angle and focal length within realistic values supported by the system geometry and anatomical context. Finally, a statistical outlier filter over nearest neighbors with a standard-deviation threshold was used to trim outlier points, yielding the reconstructed sparse point cloud (also within a scale factor).
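The reprojection error minimized in bundle adjustment can be sketched for a single point and camera, assuming a simple pinhole model with a 4×4 camera-from-world pose and 3×3 intrinsics (values below are illustrative):

```python
import numpy as np

def project(K, T, X):
    """Project 3D point X (world coordinates) to pixels using the 4x4
    camera-from-world pose T and 3x3 intrinsics K (pinhole, no distortion)."""
    Xc = (T @ np.append(X, 1.0))[:3]   # point in camera coordinates
    uvw = K @ Xc
    return uvw[:2] / uvw[2]            # perspective division

def reprojection_error(K, T, X, x_obs):
    """Squared pixel residual between an observed feature and the projection of X."""
    r = project(K, T, X) - x_obs
    return float(r @ r)

# Illustrative camera: 800-pixel focal length, principal point at (320, 240).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
T = np.eye(4)                          # camera at the world origin
X = np.array([0.0, 0.0, 2.0])          # point 2 units in front of the lens
x_obs = np.array([320.0, 240.0])       # observed at the principal point
err = reprojection_error(K, T, X, x_obs)
```

Bundle adjustment sums this residual over all point-camera observations and refines poses and points jointly, typically via Levenberg-Marquardt.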
2). Point-Cloud to Surface Registration:
The pose of the reconstructed point cloud was initialized to the CBCT surface segmentation (or preoperative CT or MR images registered to CBCT) via an initial affine transform using the tracked endoscope pose with a manual initialization of scale. The point cloud was then registered to the surface using iterative optimization via the trimmed iterative closest point (ICP) method [55] to estimate the similarity transform with scale. The Levenberg-Marquardt optimizer was used with a point-to-plane distance metric optimized over a fixed number of iterations. The ICP registration routine minimizes the normal distance to the mesh:

$\hat{T} = \arg\min_{T} \sum_{i} d_i^2$  (3)

and

$d_i = \left( T p_i - s_i \right) \cdot n_i$  (4)

for a given point ($p_i$) closest to surface point ($s_i$) with face normal ($n_i$). The sum of squared residual distances was calculated for all points and sorted by distance. A fraction of inlier points with the lowest residual distance was chosen for minimization, yielding an incremental update converging to the solution. CloudCompare [56] was used for filtering and visualization, along with ICP routines from MATLAB 2020a (MathWorks, Natick, MA, USA) for registration. The resulting registration enables visualization and overlay of target structures directly in the video frame, along with conventional mapping of the endoscope trajectory with respect to the ventricle surface and other data registered to CBCT coordinates. Blender [57] was used to render target structures as viewed from the estimated camera poses, and OpenCV [49] was used for alpha blending and overlay of rendered views in the endoscopic video.
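The trimmed point-to-plane residual used within each ICP iteration can be sketched as follows, assuming closest-point correspondences and face normals have already been found; this is a simplification of the MATLAB trimmed-ICP routine, with `xi` as the illustrative inlier-fraction parameter:

```python
import numpy as np

def point_to_plane_residuals(points, surf_pts, normals):
    """Signed point-to-plane distance d_i = (p_i - s_i) . n_i for each point and
    its closest surface point s_i with (unit) face normal n_i."""
    return np.einsum('ij,ij->i', points - surf_pts, normals)

def trimmed_inliers(residuals, xi=0.95):
    """Keep the fraction `xi` of correspondences with the lowest squared
    residuals, as in trimmed ICP; returns sorted indices of the inliers."""
    n_keep = int(np.floor(xi * len(residuals)))
    order = np.argsort(residuals ** 2)
    return np.sort(order[:n_keep])
```

Each ICP iteration would minimize the squared residuals over the inlier set only, making the alignment robust to the partial overlap between the sparse reconstruction and the full ventricular surface.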
C. Experimental Evaluation and Validation
A series of three experiments was performed for quantitative evaluation of the accuracy, uncertainty, and runtime in 3D point-cloud reconstruction and registration: (1) the sensitivity of feature detection performance to SLAM parameter selection; (2) the uncertainty of 3D point-cloud reconstruction and registration under conditions of limited data – viz., as a function of feature sparsity (loss in density of features) and spatial ablation (regional loss of features); and (3) evaluation of the geometric accuracy in point-cloud reconstruction and registration along with system runtime in comparison to prior work based on SfM.
1). Experiment 1 (Selection of SLAM Reconstruction Parameters):
The first experiment was designed to inform optimal and nominal selection of parameters affecting feature detection and reconstruction using an anthropomorphic ventricle phantom [37]. The ventricle phantom [Fig. 1(c)] presents semirealistic structures corresponding to the lateral and third ventricles, the Foramen of Monro, and neurovasculature such as the thalamostriate vein and smaller vessels. The walls of the ventricles were embedded with radiopaque Teflon BBs (green, Fig. 1c) for measurement of target registration error (TRE). High-resolution CBCT of the phantom was acquired using the O-arm™ with a high-definition cranial scan protocol: 745 X-ray projections; 100 kV and 150 mAs; 360° scan orbit; and 0.3 × 0.3 × 0.3 mm³ voxel size reconstructed using 3D filtered backprojection. The reconstructed volume was segmented using a fast-marching algorithm [58] to extract a triangulated surface mesh along with surface normals, used as ground truth for registration. Target fiducials (Teflon BBs) were manually localized for analysis of TRE.
Ventriculoscopic video was acquired by translating the RAV end-effector in a linear trajectory, traversing the lateral ventricles and entering the third ventricle through the Foramen of Monro. The trajectory mimics clinical ventriculoscopic motion through a sheath inserted via a Kocher point on the cranium, with a length of 50 mm traversed at approximately 2 mm/sec to avoid motion artifacts. Reconstruction in each experiment was based on a single video sequence capturing images (e.g., spanning a 50 mm path length for the linear trajectory). Each trajectory was highly reproducible due to rigid mounting of the ventriculoscope to the robot end effector and high degree of precision in robot positioning (± 0.03 mm).
The performance of feature detection and reconstruction/registration accuracy was investigated as a function of algorithm parameter settings. Fixed parameters in these experiments were set according to previous and other work not shown for brevity: the feature-descriptor ratio for successful matches was set to 0.7; for keyframe selection, the tracked-keypoint fraction was set to 0.9, and the minimum number of frames was set to 20; a local window of 10 frames was used for bundle adjustment, and global bundle adjustment was not performed; the reconstructed point cloud was filtered for 3 iterations; and the trimmed ICP fraction of overlap was set to 0.95, with registration assumed to have converged within tolerance limits in rotation and translation between two successive iterations. The window size for bundle adjustment did not have a strong effect on reconstruction quality – viz., less than 10% increase in the number of points and minimal influence on track length for window sizes up to 40 frames – attributed to the relatively small frame-to-frame camera motion and static phantom.
Primary experimental variables in the sensitivity analysis were the number of scale-space octaves, the number of scale levels per octave, and the number of keypoints. A complete bivariate analysis was conducted with the octaves and scale levels as experimental variables to measure the effect on the number of 2D image features extracted and on the feature detection runtime: the number of octaves was varied from 3 to 8, and for each, the number of scale levels was varied from 3 to 17. The corresponding feature counts and detection runtimes were then measured, ignoring data-transfer overheads. Sensitivity to the strength (saliency) of extracted keypoints was investigated by selecting the strongest keypoints from candidate interest points via thresholding of the Hessian detector response. The effect of the number of keypoints on geometric accuracy was evaluated in terms of projected error (PE):
$\mathrm{PE}_i = \left\| \Pi_S\left( \hat{T} p_i \right) - \hat{T} p_i \right\|$  (5)

where $\Pi_S(\cdot)$ denotes the projection of the residual error for each registered point onto the mesh surface. A k-nearest-neighbor interpolation of PE was then performed over the surface. To reduce extrapolation artifacts, the ground-truth surface was first truncated to the region within a convex hull of the reconstructed point cloud, and the projected error (PE) was subsequently measured over this truncated region of the ground-truth surface.
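A simplified sketch of the projected-error measurement, substituting nearest-vertex distance for true projection onto the triangulated mesh (an approximation; a mesh library would project onto faces):

```python
import numpy as np

def projected_error(points, surface_vertices):
    """Illustrative projected error: distance from each registered point to its
    nearest ground-truth surface vertex (brute-force, for small point sets)."""
    d = np.linalg.norm(points[:, None, :] - surface_vertices[None, :, :], axis=2)
    return d.min(axis=1)
```

For dense meshes, the nearest-vertex distance converges to the true point-to-surface distance as vertex spacing shrinks, which is why it serves as a reasonable illustration here.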
The system was also evaluated in terms of overall runtime along with its individual components, the tracking runtime and the mapping runtime. For SfM-based methods, the offline, one-time execution results in a single net duration. For applications in neuroendoscopy, however, the procedure is performed in two stages. First, the endoscope is oriented and the surrounding anatomy is mapped using views of the lateral ventricles. In the second stage, the endoscope is advanced through the Foramen of Monro into the third ventricle to locate critical target structures, during which only tracking is performed following registration with the ground-truth surface. The per-frame system runtime was also measured for a one-to-one comparison of the SLAM and SfM methods.
2). Experiment 2 (Uncertainty in Video Reconstruction and Registration):
The second experiment was designed to quantify the uncertainty in reconstruction and registration using two distinct ablation models of limited data, motivated by two challenges observed clinically in neuroendoscopy. In each case, only matched features were ablated (cf. random features that may or may not contribute to the point-cloud reconstruction). The first model emulates the scenario of anatomy with a lower density of features throughout the scene (e.g., a relatively “featureless” ventricular surface with less conspicuous neurovasculature or other salient structures). The second model approximates a regional loss of visibility (e.g., an air bubble on the endoscope lens or large-scale glare obscuring a relatively large spatial extent). For the second model, given a linear trajectory with relatively small incremental motion between frames, the ablated image region occluded approximately the same part of the 3D scene and thereby provides a reasonable approximation of large-scale artifacts such as glare and specular reflection. Each experiment employed the phantom as a typical best-case presentation of feature density and visibility (e.g., neurovasculature on the ventricular surface) – the first model evaluating the sensitivity of SLAM to feature density by systematically reducing the density of salient feature matches in the video data (feature sparsity), and the second evaluating performance as a function of regional ablation in the image domain (spatial ablation). 3D reconstructions were computed from a single video sequence acquired along the linear trajectory, with 20 trials conducted over various degrees of ablation for both models.
The first model (feature sparsity) was developed to emulate loss of salient features observed in video frames, noting that detected SURF features may or may not be pertinent to reconstruction due to spurious detections. Only successfully matched features, which would subsequently be triangulated and thereby contribute to the reconstructed 3D point cloud, were therefore ablated. A fraction of successful matches between consecutive frames was randomly ablated, with the ablation fraction varied from 0.0 (un-ablated reconstruction, retaining 100% of matches) to 0.7 (70% of successfully matched features randomly ablated, retaining 30%) in steps of 0.1. For each of the 8 ablation fractions, 20 independent trials were performed, yielding an assessment of uncertainty over 160 3D point-cloud reconstructions.
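The feature-sparsity ablation can be sketched as follows. The `ablate_matches` helper is hypothetical (not the reported implementation); the trial bookkeeping mirrors the 8 fractions × 20 trials = 160 reconstructions described above:

```python
import random

def ablate_matches(matches, fraction, seed=None):
    """Randomly discard a given fraction of successfully matched features.
    'matches' may be any list of match records; a surviving subset is
    returned. Sketch of the feature-sparsity ablation model."""
    rng = random.Random(seed)
    n_keep = round(len(matches) * (1.0 - fraction))
    return rng.sample(matches, n_keep)

# 8 ablation fractions (0.0-0.7 in steps of 0.1) x 20 trials each
fractions = [round(0.1 * i, 1) for i in range(8)]
trials = [(f, t) for f in fractions for t in range(20)]
print(len(trials))  # 160 point-cloud reconstructions
```

Ablating only matched features (rather than raw detections) ensures every removed feature would otherwise have contributed a triangulated 3D point.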
The second model (spatial ablation) targeted regional removal of observed features, discarding all features within an angular sector about the image center. Wedge-shaped sectors of the video image were selected, with the sector extent varied from 0° (un-ablated reconstruction) to 240° (features ablated over two-thirds of the image) in 20° increments. For each ablation extent, the orientation of the ablated sector was rotated from 0° to 360° in increments of 18° to obtain 20 trials, yielding 260 point-cloud reconstructions that sampled a broad range of regional ablations.
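The spatial-ablation model can be sketched similarly (hypothetical helper names, assuming features are discarded when their polar angle about the image center falls within the ablated wedge):

```python
import math

def in_ablated_sector(x, y, cx, cy, start_deg, extent_deg):
    """True if feature (x, y) falls in the wedge of angular width extent_deg
    beginning at start_deg, measured about the image center (cx, cy)."""
    theta = math.degrees(math.atan2(y - cy, x - cx)) % 360.0
    return (theta - start_deg) % 360.0 < extent_deg

def ablate_sector(features, cx, cy, start_deg, extent_deg):
    """Discard all (x, y) features inside the ablated wedge."""
    return [(x, y) for (x, y) in features
            if not in_ablated_sector(x, y, cx, cy, start_deg, extent_deg)]

# 13 sector extents (0deg-240deg in 20deg steps) x 20 orientations
# (0deg-360deg in 18deg steps) = 260 reconstructions
extents = range(0, 241, 20)
orientations = range(0, 360, 18)
print(len(extents) * len(orientations))  # 260
```

Rotating the wedge orientation across trials samples the uncertainty for a fixed ablation extent, analogous to glare appearing in different parts of the scene.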
The uncertainty in point-cloud reconstruction was assessed in terms of PE in the registered point cloud, the number of points in the 3D reconstruction (reflecting the density of the point cloud), and the precision and recall of points therein. The median registration error was anticipated to degrade with increased feature sparsity or spatial ablation, and the variability in registration error across the distribution of ablations was analyzed to characterize the uncertainty associated with the registration result. Analysis of precision and recall metrics as defined in [59] further quantified the accuracy and completeness of the reconstruction. For a reconstructed point, p, the distance to the ground truth point cloud, G, is:

d_{p→G} = min_{g∈G} ‖p − g‖ (6)

Precision is defined as the fraction of points from the reconstruction (R) within a distance threshold (τ) of the ground truth G:

P(τ) = (1/|R|) Σ_{p∈R} [d_{p→G} < τ] (7)

where [·] is the Iverson bracket binarizing the outcome, and P(τ) is defined to lie in the range [0, 1]. Conversely, for a ground truth point, g, the distance to the reconstruction (R) is defined as:

d_{g→R} = min_{p∈R} ‖g − p‖ (8)

and recall is thus defined as the fraction of points from the ground truth (G) within a distance threshold (τ) of the reconstruction (R):

R(τ) = (1/|G|) Σ_{g∈G} [d_{g→R} < τ] (9)

where R(τ) also lies in the range [0, 1]. Low recall is associated with lower scene coverage, in that a smaller number of ground truth points have a corresponding point in the reconstruction within the distance threshold. Precision and recall curves were computed by varying the threshold τ to characterize the operating curves for a particular level of geometric accuracy.
3) Experiment 3 (Evaluation of Geometric Accuracy):
Following the parameter selection of Experiment 1 and the uncertainty assessment of Experiment 2, a third experiment assessed the geometric accuracy of SLAM reconstruction compared to previously reported SfM methods using the experimental setup of Fig. 1. The COLMAP library [60], [61] was used for SfM reconstruction, with the scale-invariant feature transform (SIFT) [53] for feature detection and description. Linear motion of the endoscope from the lateral to the third ventricle via the Foramen of Monro was used to obtain reconstructions without use of end-effector pose information, facilitating direct comparison of SLAM and SfM methods. Shared parameters were fixed in common between methods, for example, statistical outlier filtering (with a fixed number of neighbors and standard-deviation threshold) applied for 3 iterations prior to trimmed ICP registration with the ground truth surface.
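The statistical outlier filtering applied prior to trimmed ICP can be sketched as below. The parameter values (k, std_ratio) are placeholders, since the specific settings fixed in the experiments are not restated here; the brute-force pairwise distances are for illustration (a KD-tree would be used at scale):

```python
import numpy as np

def statistical_outlier_filter(points, k=20, std_ratio=2.0, iterations=3):
    """Iterative statistical outlier removal: drop points whose mean distance
    to their k nearest neighbors exceeds mean + std_ratio * std of that
    statistic over the cloud. Placeholder parameters, for illustration."""
    pts = np.asarray(points, dtype=float)
    for _ in range(iterations):
        # Pairwise distances (fine for small clouds)
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        d.sort(axis=1)
        mean_d = d[:, 1:k + 1].mean(axis=1)  # skip self-distance in column 0
        keep = mean_d < mean_d.mean() + std_ratio * mean_d.std()
        pts = pts[keep]
    return pts

# Dense cluster plus one gross outlier; the outlier is removed.
rng = np.random.default_rng(0)
cloud = np.vstack([rng.normal(0, 1, (200, 3)), [[50.0, 50.0, 50.0]]])
filtered = statistical_outlier_filter(cloud, k=10)
print(len(cloud), len(filtered))
```

Removing such gross outliers before trimmed ICP prevents spurious triangulations from biasing the surface registration.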
The reconstructed point clouds for both methods were assessed in terms of geometric accuracy and system runtime. In addition to analysis of PE, the BBs embedded in the phantom were manually segmented, with the 3D position of the i-th BB in the reconstructed point cloud following registration denoted b_i and the position of the corresponding BB in the ground truth image denoted b_i*. The TRE was calculated as the RMS distance between corresponding BBs:

TRE = √( (1/N_BB) Σ_{i=1}^{N_BB} ‖b_i − b_i*‖² ) (10)

where N_BB is the number of corresponding BBs.
A Mann-Whitney U test was used to evaluate the statistical significance of observed differences in both the PE and TRE measurements under the null hypothesis that the two independent samples are drawn from identical distributions with equal median. This form of test is appropriate for comparison of distributions without prior assumptions on the shape of the error distribution (e.g., no assumption of a Gaussian distribution as with Student t-tests). The tracking and mapping runtimes for both SLAM and SfM methods were measured, along with the system runtime (t_sys) and per-frame runtime (t_I). The system runtimes were assessed as a function of the number of images used in forming the reconstruction, N_I, ranging from 10 to 300 images. Finally, target structures annotated in CBCT (or preoperative CT or MR registered to CBCT, e.g., simulated target lesions outside the ventricular wall) were rendered using the resulting registration and overlaid to qualitatively depict an augmented view of the endoscopic scene.
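The TRE computation and the Mann-Whitney U comparison can be sketched with scipy.stats. The error samples below are synthetic, for illustration only (not measured data), and the function name is hypothetical:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def tre_rms(bb_recon, bb_truth):
    """Target registration error as the RMS distance between corresponding
    BB positions in the registered reconstruction and ground truth."""
    diff = np.asarray(bb_recon, float) - np.asarray(bb_truth, float)
    return float(np.sqrt((np.linalg.norm(diff, axis=1) ** 2).mean()))

# Synthetic per-point error samples for two methods (illustrative only)
rng = np.random.default_rng(1)
pe_method_a = rng.gamma(2.0, 0.2, 500)
pe_method_b = rng.gamma(2.0, 0.2, 500)

# Nonparametric test: no Gaussian assumption on the error distributions
stat, p = mannwhitneyu(pe_method_a, pe_method_b, alternative="two-sided")
print(tre_rms([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]],
              [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]))
print(p)
```

A large p-value here would indicate only that no difference in medians was detected by the test, not that the distributions are equivalent, matching the interpretation given in the Results.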
III. Results
A. Experiment 1: Selection of SLAM Reconstruction Parameters
Figure 4 summarizes the effect of SLAM tracking and reconstruction parameters on geometric accuracy and runtime. As shown by the isocontours in Figure 4a, the mean number of features detected is a function of both the number of octaves and the number of scale levels. For any number of scale levels, increasing the number of octaves results in a sharp increase in the number of features detected for small values (~3–5), beyond which the mean number of features is roughly unchanged. Similarly, increasing the number of scale levels sharply increases the number of features up to a point of gradual saturation. Nominal values of 6 octaves and 10 scale levels were selected, ensuring a sufficient mean number of features for tracking.
Fig. 4.

Parameter selection and sensitivity analysis for feature detection. The number of octaves and number of scale levels were evaluated in terms of (a) the number of features detected and (b) feature detection runtime, depicted as isocontours. (c-d) Effect of the number of candidate keypoints used for feature detection on the projected error (PE) and tracking runtime.
Figure 4b shows the effect of the number of octaves and scale levels on feature detection runtime, demonstrating a larger sensitivity to the number of scale levels than octaves. The increase in runtime at a constant number of octaves was approximately linear, with the mean increasing from 0.01 s to ~0.08 s over the range of scale levels investigated. To achieve runtime consistent with clinical application (at least ~10 fps desired for real-time overlay in freehand endoscopy) while maintaining a sufficient number of features, a nominal value of 10 scale levels was selected, resulting in a mean feature detection runtime of 0.07 s.
Figure 4(c-d) demonstrates the sensitivity of PE and tracking runtime with respect to the number of candidate keypoints. Selecting too few keypoints (<500) results in poor feature detection and tracking, causing a steep increase in median PE to ~0.85 mm. Increasing the number of candidate keypoints up to ~500 steeply reduces geometric error, beyond which mean PE saturates at ~0.2 mm with no further improvement over an order of magnitude increase in keypoints. The tracking runtime increased sharply over the corresponding range, up to 1.06 s. Therefore, to balance tradeoffs between runtime and geometric accuracy, 500 candidate keypoints were selected for further experiments.
B. Experiment 2: Assessment of Uncertainty in Reconstruction and Registration
Figure 5 summarizes the system performance for varying levels of feature sparsity. With the fraction of features ablated ranging from 0.0 to 0.7, the median PE increased from ~0.5 mm to ~1.1 mm. The variation in PE also increased for a lower density of features, with the interquartile range (IQR) in PE varying from 0.05 mm to 0.31 mm (with a corresponding increase in the 95% confidence interval, CI95), signifying an increase in the uncertainty of reconstruction and registration at higher ablation fraction (i.e., lower feature density). Figure 5b shows the distribution in the number of 3D points and PE over the range of resulting point-cloud reconstructions, with the number of points decreasing, and the mean and uncertainty in PE increasing, at higher ablation.
Fig. 5.

Uncertainty in reconstruction associated with the sparsity of visual features. (a) Projected error (|PE|) measured for various match ablation fractions, demonstrating an increase in error and uncertainty as the ablation fraction increases. (b) |PE| measured as a function of the number of reconstructed 3D points. (c) Precision and recall for a nominal distance threshold. (d) Precision and recall measured for various distance thresholds (τ) from 0.5 mm to 2 mm. (e-f) PE for a reconstructed point cloud (with BBs in green and vasculature in red) at an ablation fraction of 0.5, visualized as colorwash, indicating an unbiased distribution of errors across 3D space.
The effect of feature density on the precision and recall of point-cloud reconstructions is shown in Figure 5c for a nominal distance threshold taken as the desired level of geometric accuracy. Precision and recall are reduced from ~0.95 and ~0.81, respectively, in best-case (un-ablated) reconstructions to ~0.85 and ~0.55, respectively, for an ablation fraction of 0.4. These metrics quantify the quality of the point-cloud reconstruction in that high recall requires a large spatial coverage. For ablation fraction > 0.4, the reconstruction process is seen to break down, evident in the distribution of points with nearly zero recall. Such large ablations (i.e., relatively featureless scenes) result in failed reconstructions with insufficient coverage and degenerate registration solutions (i.e., diminishing recall and a broad range in precision). Figure 5d shows the (median) precision-recall curves for various settings of the distance threshold τ, with ablation fraction increasing from right to left along each curve. These operating curves quantify the expected precision and recall for any level of feature density (down to ablation fraction ~0.4) and any requirement in distance accuracy (from 0.5 mm to 2.0 mm, corresponding approximately to the best-case accuracy of surgical navigation systems [8], [9]). Precision is observed to decrease more slowly than recall, showing a stable, accurate operating curve even with a loss of 40% of salient features.
Figure 5(e-f) depicts the spatial distribution of (signed) PE visualized with respect to the ground truth surface within the convex hull of the reconstructed point cloud for an ablation fraction of 0.5, where positive values represent distance along the surface normal “outward” from the surface. Little or no directional bias is evident, with a relatively even distribution of error across the surface, consistent with the spatially random nature of ablation in this experiment.
The effects of spatial/regional feature ablation are summarized in Figure 6. As shown in Figure 6a, the median PE was observed to increase only slightly over a broad range of spatial ablation (angular extent of the ablated sector), varying from ~0.82 mm with no occlusion to ~0.95 mm at 220° ablation, beyond which the reconstruction and registration collapsed due to degeneracy. However, the uncertainty in PE increased substantially over the same range, with the IQR increasing from 0.10 mm to 0.56 mm. The increase in variability signifies the more uncertain reconstruction and registration resulting from ablation of larger regions of the image. Figure 6b shows the number of 3D points in the resulting reconstructions with respect to PE and ablation extent, where a larger extent of ablation is seen to increase the magnitude and range (uncertainty) of PE.
Fig. 6.

Uncertainty in reconstruction associated with spatial/regional ablation of the video scene. (a) Projected error (|PE|) measured as a function of the spatial ablation sector angle, demonstrating an increase in registration error and uncertainty as the extent of ablation increases. (b) |PE| measured as a function of the number of reconstructed 3D points. (c) Precision and recall for a nominal distance threshold. (d) Precision and recall measured for various distance thresholds (τ) from 0.5 mm to 2 mm. (e-f) Distance from ground truth visualized in an example point-cloud reconstruction with 180° spatial ablation, with larger errors seen in the ablated (anterior) region and a corresponding reduction in recall.
Precision and recall for reconstructions from regionally ablated scenes are shown in Figure 6(c-d). Taking a nominal distance threshold as the desired level of geometric accuracy, Fig. 6c shows a relatively stable level of median precision (up to ~0.95) over a broad range of ablation, whereas recall reduced from ~0.87 to ~0.56 at 120° of ablation. The range (uncertainty) in precision, however, similarly worsened at high levels of ablation, with IQR ~0.07 (CI95 = 0.18) at an ablation angle of 120°. Furthermore, large spatial ablations (≥120°) resulted in noisy reconstructions and/or degenerate registrations leading to low recall (and arbitrarily high precision), evident in the cluster of points with recall ≤ 0.2. Figure 6d shows the operating characteristic in precision-recall for various settings of the distance threshold τ. For each curve, lower median recall and precision were observed with increased ablation extent, and precision was seen to decrease more slowly than recall for all thresholds. Compared to the characteristic curves of Fig. 5d for various levels of feature density (i.e., spatially random ablations), the operating curves exhibit similar areas under the curves, suggesting robustness to either model of ablation.
Finally, an example reconstruction is shown in Figure 6(e-f) for a spatial ablation of 180° (across the anterior horn of the ventricles in this example, e.g., glare or an air bubble occluding the “upper” half of the visual scene). The colorwash overlay shows the distance to ground truth. The reconstruction is seen to be precise (median precision = 0.92); however, owing to the relatively strong ablation (50% of the image), recall was poor (~0.42) due to poor coverage.
C. Experiment 3: Evaluation of Geometric Accuracy
Guided by the experiments described above for nominal parameter selection and the observed uncertainty in point-cloud reconstruction, the SLAM method was evaluated in comparison to a previously reported SfM method in terms of geometric accuracy and system runtime. As shown in Fig. 7a, the median (IQR) PE for the SfM method was 0.33 mm (0.38 mm). The SLAM method achieved median (IQR) PE = 0.35 mm (0.35 mm), with no statistically significant difference measured by the Mann-Whitney U test. Both methods were evaluated over the 4422 ground truth surface points, and while the large p-value does not prove that the PE distributions are the same, it shows that the current experiments did not detect a meaningful difference between the median values.
Fig. 7.

Evaluation of geometric accuracy of registration using SLAM-based reconstruction compared to structure-from-motion. (a) Projected error (|PE|). (b) Target registration error. (c) System runtime per frame for each reconstruction method. (d) System runtime as a function of the number of images, with SLAM demonstrating a major improvement in runtime. (e) Colorwash overlay of projected error on the ground truth surface along with the reconstructed point cloud. (f) Augmented view of the neuroendoscopic scene with vasculature and Foramen of Monro (FoM), overlaid with target BBs (yellow), green-scale depth cue, anatomical region (orange), and simulated target (purple). A video version of (f) is available as supplemental material.
Fig. 7b shows the measured TRE for both methods, with each achieving sub-2 mm error, typical of stereotactic navigation systems. The median (IQR) TRE for the SfM method was 0.73 mm (0.77 mm) from 30 BBs visible within the reconstructions, and the SLAM method achieved a similar median (IQR) TRE of 1.02 mm (1.16 mm) based on 18 BBs visible in the reconstruction. Again, there was no statistically significant difference between the measurements, and while this does not prove equivalence, it suggests no difference detectable by these experiments. The SfM and SLAM methods yielded dense 3D point clouds, achieving recall of 0.90 and 0.87, respectively, for a threshold of 2 mm.
However, as shown in Fig. 7c, a major improvement in system runtime per frame (for N_I = 300 images) is achieved due to the parallelized architecture of the SLAM method. The offline, sequential SfM method required 4.79 s on average for tracking and mapping a single frame, while the SLAM method required only 0.29 s for tracking (t_tr) and 0.91 s for mapping (t_map). Vision-based tracking is the component most pertinent to neuroendoscopic procedures, since mapping can be performed in non-critical regions of anatomy before guidance is required for targeting. The system achieved more than 16× speedup in tracking runtime compared to the SfM method.
Fig. 7d shows a similarly drastic reduction in system runtime (t_sys), observed across the range of N_I used for reconstruction and tracking. Across all N_I, the SLAM method demonstrated a lower runtime than SfM, with over an order of magnitude of improvement measured. Whereas the runtime of the SfM method grows superlinearly with N_I [61], the runtime of the SLAM method approaches an asymptote due to the use of a local window for tracking and mapping.
The PE of the SLAM method is visualized in Fig. 7e (along with the reconstructed point cloud and BBs in green), overlaid as a colorwash on the ground truth surface. The error was observed to be unbiased across both the lateral and third ventricles, with the reconstructed BBs in strong visual agreement with ground truth positions.
Fig. 7f illustrates the augmented overlay view of the endoscopic scene during tracking, with the ground truth surface in green and a target region rendered in purple. Other anatomical annotations, such as the mamillary bodies on the floor of the third ventricle, are overlaid in orange. The BBs embedded within the surface of the phantom are seen as yellow spheres, demonstrating overlap with the BBs visible in green in the endoscopic view. The method not only enables tracking in the scene but also permits localization of structures beyond the current endoscopic field of view. A video version of Fig. 7f is provided as supplemental material.
Finally, Table I presents the components of the tracking thread along with the runtime of each step for N_I = 300 frames. The pose prediction step required the most time at 0.18 s (0.15 s) [median (IQR)], followed by the feature detection and matching steps, requiring 0.07 s (0.01 s) and 0.04 s (0.03 s), respectively. The SLAM method thereby achieved an update rate of (1/0.29 s) = 3.45 Hz during tracking, demonstrating the capability for real-time navigation with intermittent updates to the reconstruction via the mapping thread.
TABLE I.
Tracking Runtime (t_tr) for SLAM (N_I = 300 Images)

| Step | Time/Frame [Median (IQR)] |
|---|---|
| Feature Detection | 0.07 s (0.01 s) |
| Feature Matching | 0.04 s (0.03 s) |
| Pose Prediction | 0.16 s (0.13 s) |
| Keyframe Decision | 0.02 s (0.01 s) |
| Runtime per Frame (t_I) | 0.29 s (0.17 s) |
| Update Rate | 1/0.29 s = 3.45 Hz |
IV. Discussion
A series of laboratory studies was reported using an anthropomorphic phantom of the cerebral ventricles to rigorously evaluate the performance and variability of 3D video reconstruction and to provide a quantitative basis for translation to future clinical studies. Experiment 1 informed the selection of parameters that affect feature extraction and SLAM reconstruction. Key findings showed that accurate reconstruction was obtained even with relatively small parameter values of 6 octaves, 10 scale levels, and 500 keypoints. A distribution in the number of detected features was observed (varying from ~100 to 800 features) as a function of parameter selection, consistent with the fact that some regions of the phantom (and, for that matter, real anatomy) are more feature-rich than others. Extraction of a larger number of features was found to assist in tracking feature-sparse regions, with a tradeoff of feature detection runtime increased by ~0.1 s at larger numbers of octaves and scale levels. The observed saturation in the number of features with increasing scale levels suggests that the maximal number of features required for accurate reconstruction was indeed extracted from the image. Similarly, minimal improvement in registration accuracy was observed beyond ~500 keypoints, indicating that the features salient to reconstruction and registration were captured and that a further increase in keypoints does not improve precision. To maintain runtime potentially suitable to clinical application, parameter values were selected to balance accuracy and runtime, achieving >1 frame/s. Other work [62], [63] has compared various feature descriptors and matching methods, including hand-crafted [53] and deep-learning-based descriptors [64], achieving feature counts similar to those reported above.
Experiment 2 provided information pertinent to translation of laboratory studies to clinical evaluation, where video of natural human anatomy in vivo may present different forms of limited data. System performance in sparsely textured regions of the ventricles was evaluated by ablating the number of matched feature points, demonstrating robustness even with a loss of up to 40% of the features present in the phantom studies. The system demonstrated 3D reconstruction and registration with PE <1 mm under such conditions. The uncertainty in PE was also observed to increase with lower feature density, quantified by the increased IQR in PE measured as a function of ablation. The precision and recall analysis quantified the effect further, showing the extent of ablation for which a high recall (i.e., point-cloud coverage) was achieved, also consistent with the expected feature density in clinical neuroendoscopy. By contrast, noisy 3D reconstructions demonstrated lower recall and arbitrarily high precision due to degenerate registration. The precision-recall analysis quantified the accuracy achieved for various settings of the threshold τ. For a given τ, the analysis showed the changes in precision and recall expected across a broad range of ablation. A relatively smooth trend was observed up to an ablation of 40% of features, suggesting that the system was robust to ablation before suffering a substantial loss in precision and recall. The steep falloff in performance beyond this point indicated that feature points salient to 3D reconstruction tended to lie within a threshold of 1 mm, also confirmed by measurements of PE for various ablation fractions.
Experiment 2 further evaluated the uncertainty in 3D reconstruction accuracy as a function of spatial ablation, i.e., angular sectors wholly ablated of feature content, analogous to a region of the video image being obscured by glare or other visual occlusion. Uncertainty increased with the extent of ablation, evident as an increase in the IQR and 95th percentile of PE. In observational studies of clinical neuroendoscopy, the view is commonly occluded over a range of ~30°–90°. The system developed in this work demonstrated robust performance for ablation angles up to ~120° (one-third of the endoscopic field of view). Larger variability (and PE) was observed with increasing ablation angle due to a reduction in the number of matched features and a corresponding reduction in reconstructed points. Compared to findings in the feature sparsity experiment (above), performance under spatial ablation exhibited robustness, tolerating up to ~70% loss before the data failed to reliably reconstruct and register. Precision-recall curves showed a smooth trend under spatial ablation, demonstrating a large operating region and a gradual decline of precision with reduced recall. The experiments thus quantify the precision and recall over a range expected to capture challenging, realistic clinical scenarios in which feature density and/or spatial distribution are lost, and the system consistently demonstrated sub-mm registration accuracy.
Experiment 3 involved a detailed analysis of video tracking and reconstruction based on SLAM, with a quantitative comparison to a previously reported SfM-based method. A similar magnitude and distribution of geometric error, in both PE and TRE, was observed for each reconstruction method, with no statistically significant difference in geometric accuracy between the methods. That both methods achieved similar geometric accuracy suggests that both SLAM and SfM yield a sufficient number and distribution of features for accurate reconstruction of the 3D scene. The current experiments involved a linear trajectory of the endoscope, which is pertinent to the anticipated clinical trajectory of a ventriculoscope traversing from the lateral ventricles to the third ventricle via the Foramen of Monro along a straight path, and previous work [37] showed this trajectory to perform similarly to circle or arc trajectories. Further, the experiments were conducted using a robot for stable, reproducible positioning in quantitative performance characterization, and the method is applicable in principle to either robot-assisted or freehand motion of the endoscope. Vitally important to eventual clinical translation, a major improvement in system runtime was demonstrated with the SLAM approach, yielding up to 16× speedup compared to SfM and an update rate of 3.45 Hz. Recently, end-to-end differentiable deep-learning-based methods [40] have emerged for SLAM with the inclusion of appearance and geometric priors, demonstrating similar real-time update rates. Augmented overlay of the video scene with virtual structures registered from 3D CBCT or MRI illustrated accurate alignment with visual features and the ability to visualize virtual target structures not only in the direct field of view but also beyond the ventricular walls.
Overall, the experiments provided quantitative validation in support of translation of the approach to clinical studies. Of course, such work is not without limitations. First, the ventricle phantom did not contain CSF, which varies in optical quality on a case-by-case basis in clinical situations and could introduce optical distortion that dynamically perturbs the camera calibration. Although the phantom is based on a segmentation model derived from real ventricular anatomy and thereby gives a fairly realistic representation of shape, it may not reflect tissue-specific specularity or other photometric properties of the inner ventricular wall, causing an unknown degree of mismatch between the density and spatial distribution of features in the phantom studies versus what can be expected in clinical use. The experimental studies addressed this question as a function of “limited data,” i.e., loss of feature density and regional ablation over a broad range of such variations. Furthermore, some surgical procedures (e.g., endoscopic third ventriculostomy, which involves creation of a stoma on the third ventricular floor) could introduce floating debris that challenges reliable feature detection for 3D video reconstruction. Addressing visual drift (i.e., compounding of errors in localization and mapping over time) in SLAM also remains a topic of active research [65] and could arise in procedures of longer duration. Visual drift could also be caused by tissue deformation induced after acquisition of the intraoperative CBCT or MR images, motivating deformable point-cloud-to-surface registration for accurate tracking. In the current studies, the robotic arm was used to provide steady positioning and motion of the ventriculoscope for rigorous, reproducible measurements, and the suitability of the methods to true freehand motion invites future investigation.
Translation to clinical studies benefits from the rigorous evaluation of key system performance characteristics, and future work will address additional, high-level clinical challenges, such as the ease in maneuvering the ventriculoscope in proximity to sensitive anatomical structures and compatibility with the overall operating setup in procedures such as ETV and transventricular DBS. Further, deep-learning based methods will be investigated for real-time feature description and end-to-end SLAM based navigation with translation to clinical studies.
V. Conclusion
The first real-time SLAM method for 3D video tracking and reconstruction in ventriculoscopy was presented. The system is integral to a RAV prototype being developed for translation to clinical studies, and the quantitative evaluation of geometric accuracy, robustness, and runtime is an essential step in such development. The parallel algorithm architecture and real-time tracking intrinsic to SLAM enabled ~16× speedup (with an update rate of 3.45 Hz) compared to a previously reported SfM implementation, without loss in geometric accuracy. Studies conducted using an anthropomorphic ventricle phantom identified nominal/optimal parameters for real-time guidance using SLAM, and uncertainty in geometric performance was measured using two limited-data models (feature sparsity and spatial ablation), demonstrating robust performance (≤1 mm PE) even for large degrees of ablation (e.g., up to 40% loss of features or 120° of spatial ablation). Precision-recall curves quantified the geometric accuracy and lack of bias in 3D reconstruction/registration, with PE = 0.35 mm and TRE = 1.0 mm and no statistically significant difference compared to SfM. These studies establish a quantitative characterization of algorithm settings and performance that supports translation of the system to clinical studies.
Supplementary Material
Acknowledgments
This work was supported in part by the National Institutes of Health through the National Institute of Neurological Disorders and Stroke Project under Grant U01-NS-107133, and in part by the Biomedical Research Partnership (BRP) with Medtronic, Littleton, MA, USA.
Contributor Information
Prasad Vagdargi, Computer Science Department, Johns Hopkins University, Baltimore, MD 21218 USA.
Ali Uneri, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA.
Xiaoxuan Zhang, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA.
Craig K. Jones, Computer Science Department, Johns Hopkins University, Baltimore, MD 21218 USA.
Pengwei Wu, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA.
Runze Han, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA.
Alejandro Sisniega, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA.
Junghoon Lee, Department of Radiation Oncology, Johns Hopkins University, Baltimore, MD 21218 USA.
Patrick Helm, Research and Development Division, Medtronic, Littleton, MA 01460 USA.
Mark Luciano, Department of Neurosurgery, Johns Hopkins University, Baltimore, MD 21218 USA.
William S. Anderson, Department of Neurosurgery, Johns Hopkins University, Baltimore, MD 21218 USA.
Gregory D. Hager, Computer Science Department, Johns Hopkins University, Baltimore, MD 21218 USA.
Jeffrey H. Siewerdsen, Department of Computer Science and the Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218 USA; Department of Imaging Physics, The University of Texas M. D. Anderson Cancer Center, Houston, TX 77030 USA.
References
- [1].Little S et al., “Bilateral adaptive deep brain stimulation is effective in Parkinson’s disease,” J. Neurol. Neurosurg. Psychiat., vol. 87, no. 7, pp. 717–721, 2016.
- [2].Mayberg HS et al., “Deep brain stimulation for treatment-resistant depression,” Neuron, vol. 45, no. 5, pp. 651–660, 2005.
- [3].Apetauerova D et al., “End of day dyskinesia in advanced Parkinson’s disease can be eliminated by bilateral subthalamic nucleus or globus pallidus deep brain stimulation,” Movement Disord., vol. 21, no. 8, pp. 1277–1279, 2006.
- [4].Sinha S, McGovern RA, and Sheth SA, “Deep brain stimulation for severe autism: From pathophysiology to procedure,” Neurosurg. Focus, vol. 38, no. 6, p. E3, 2015.
- [5].Liu S et al., “Neuronavigation-guided transcortical-transventricular endoport-assisted endoscopic resection for thalamic lesions: Preliminary experience,” World Neurosurg., vol. 166, pp. 19–27, Oct. 2022.
- [6].Nimsky C, Ganslandt O, Cerny S, Hastreiter P, Greiner G, and Fahlbusch R, “Quantification of, visualization of, and compensation for brain shift using intraoperative magnetic resonance imaging,” Neurosurgery, vol. 47, no. 5, pp. 1070–1079, 2000.
- [7].Khan MF, Mewes K, Gross RE, and Skrinjar O, “Assessment of brain shift related to deep brain stimulation surgery,” Stereotact. Funct. Neurosurg., vol. 86, no. 1, pp. 44–53, 2008.
- [8].Bjartmarz H and Rehncrona S, “Comparison of accuracy and precision between frame-based and frameless stereotactic navigation for deep brain stimulation electrode implantation,” Stereotact. Funct. Neurosurg., vol. 85, no. 5, pp. 235–242, 2007.
- [9].Palys V and Holloway KL, “Frameless functional stereotactic approaches,” Progr. Neurol. Surg., vol. 33, pp. 168–186, Jan. 2018.
- [10].Liu L et al., “Frameless ROSA® robot-assisted lead implantation for deep brain stimulation: Technique and accuracy,” Oper. Neurosurg., vol. 19, no. 1, pp. 57–64, 2020.
- [11].Shamir RR, Freiman M, Joskowicz L, Spektor S, and Shoshan Y, “Surface-based facial scan registration in neuronavigation procedures: A clinical study—Clinical article,” J. Neurosurg., vol. 111, no. 6, pp. 1201–1206, 2009.
- [12].Maarouf M and Neudorf C, “Robotics in stereotactic neurosurgery,” in Introduction to Robotics in Minimally Invasive Neurosurgery. Cham, Switzerland: Springer, 2022, pp. 25–38.
- [13].Engelhardt J et al., “A Phase 2 randomized trial of asleep versus awake subthalamic nucleus deep brain stimulation for Parkinson’s disease,” Stereotact. Funct. Neurosurg., vol. 99, no. 3, pp. 230–240, May 2021.
- [14].Benech CA, Perez R, Benech F, Greeley SL, Crawford N, and Ledonio C, “Navigated robotic assistance results in improved screw accuracy and positive clinical outcomes: An evaluation of the first 54 cases,” J. Robot. Surg., vol. 14, no. 3, pp. 431–437, Jun. 2020.
- [15].Olds KC, Chalasani P, Pacheco-Lopez P, Iordachita I, Akst LM, and Taylor RH, “Preliminary evaluation of a new microsurgical robotic system for head and neck surgery,” in Proc. IEEE Int. Conf. Intell. Robots Syst., 2014, pp. 1276–1281.
- [16].Jolesz FA, “Intraoperative imaging in neurosurgery: Where will the future take us?” Acta Neurochir. Suppl., vol. 109, pp. 21–25, Jan. 2011.
- [17].Lipson AC, Gargollo PC, and Black PMCL, “Intraoperative magnetic resonance imaging: Considerations for the operating room of the future,” J. Clin. Neurosci., vol. 8, no. 4, pp. 305–310, 2001.
- [18].Zaffino P, Moccia S, De Momi E, and Spadea MF, “A review on advances in intra-operative imaging for surgery and therapy: Imagining the operating room of the future,” Ann. Biomed. Eng., vol. 48, no. 8, pp. 2171–2191, 2020.
- [19].Schichor C, Terpolilli N, Thorsteinsdottir J, and Tonn JC, “Intraoperative computed tomography in cranial neurosurgery,” Neurosurg. Clin. North America, vol. 28, no. 4, pp. 595–602, 2017.
- [20].Carl B, Bopp M, Saß B, and Nimsky C, “Intraoperative computed tomography as reliable navigation registration device in 200 cranial procedures,” Acta Neurochir., vol. 160, no. 9, pp. 1681–1689, 2018.
- [21].Holewijn RA, Bot M, van den Munckhof P, and Schuurman PR, “Implementation of intraoperative cone-beam computed tomography (O-arm) for stereotactic imaging during deep brain stimulation procedures,” Oper. Neurosurg., vol. 19, no. 3, pp. E224–E229, Sep. 2020.
- [22].Sisniega A et al., “High-fidelity artifact correction for cone-beam CT imaging of the brain,” Phys. Med. Biol., vol. 60, no. 4, pp. 1415–1439, Jan. 2015.
- [23].Wu P et al., “Cone-beam CT for imaging of the head/brain: Development and assessment of scanner prototype and reconstruction algorithms,” Med. Phys., vol. 47, no. 6, pp. 2392–2407, Jun. 2020.
- [24].Hamidian H, Soltanian-Zadeh H, Faraji-Dana R, and Gity M, “Data-guide for brain deformation in surgery: Comparison of linear and nonlinear models,” Biomed. Eng. Online, vol. 9, p. 51, Sep. 2010.
- [25].Miga MI, “Computational modeling for enhancing soft tissue image guided surgery: An application in neurosurgery,” Ann. Biomed. Eng., vol. 44, no. 1, pp. 128–138, Jan. 2016.
- [26].Sun K, Pheiffer TS, Simpson AL, Weis JA, Thompson RC, and Miga MI, “Near real-time computer assisted surgery for brain shift correction using biomechanical models,” IEEE J. Transl. Eng. Health Med., vol. 2, pp. 1–13, 2014.
- [27].Sun H et al., “Stereopsis-guided brain shift compensation,” IEEE Trans. Med. Imag., vol. 24, no. 8, pp. 1039–1052, 2005.
- [28].Drakopoulos F et al., “Adaptive physics-based non-rigid registration for immersive image-guided neuronavigation systems,” Front. Digit. Health, vol. 2, Feb. 2021, Art. no. 613608.
- [29].Han R et al., “Joint synthesis and registration network for deformable MR-CBCT image registration for neurosurgical guidance,” Phys. Med. Biol., vol. 67, no. 12, Jun. 2022, Art. no. 125008.
- [30].Reaungamornrat S et al., “An on-board surgical tracking and video augmentation system for C-arm image guidance,” Int. J. Comput. Assist. Radiol. Surg., vol. 7, no. 5, pp. 647–665, Sep. 2012.
- [32].Mountney P and Yang G-Z, “Motion compensated SLAM for image guided surgery,” in Medical Image Computing and Computer-Assisted Intervention (Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). Heidelberg, Germany: Springer, 2010. [Google Scholar]
- [33].Widya AR, Monno Y, Okutomi M, Suzuki S, Gotoda T, and Miki K, “Stomach 3D reconstruction using virtual chromoendoscopic images,” IEEE J. Transl. Eng. Health Med, vol. 9, pp. 1–11, 2021. [Google Scholar]
- [34].Mirota DJ, Wang H, Taylor RH, Ishii M, Gallia GL, and Hager GD, “A system for video-based navigation for endoscopic endonasal skull base surgery,” IEEE Trans. Med. Imag, vol. 31, no. 4, pp. 963–976, Apr. 2012. [Google Scholar]
- [35].Leonard S et al. , “Evaluation and stability analysis of video-based navigation system for functional endoscopic sinus surgery on in vivo clinical data,” IEEE Trans. Med. Imag, vol. 37, no. 10, pp. 2185–2195, Oct. 2018. [Google Scholar]
- [36].Hoshide R, Calayag M, Meltzer H, Levy ML, and Gonda D, “Robot-assisted endoscopic third ventriculostomy: Institutional experience in 9 patients,” J. Neurosurg. Pediatr, vol. 20, no. 2, pp. 125–133, 2017. [DOI] [PubMed] [Google Scholar]
- [37].Vagdargi P et al. , “Pre-clinical development of robot-assisted ventriculoscopy for 3-D image reconstruction and guidance of deep brain neurosurgery,” IEEE Trans. Med. Robot. Bionics, vol. 4, no. 1, pp. 28–37, Feb. 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [38].Su YH, Lindgren K, Huang K, and Hannaford B, “A comparison of surgical cavity 3D reconstruction methods,” in Proc. IEEE/SICE Int. Symp. Syst. Integr, 2020, pp. 329–336. [Google Scholar]
- [39].Liu X et al. , “Dense depth estimation in monocular endoscopy with self-supervised learning methods,” IEEE Trans. Med. Imag, vol. 39, no. 5, pp. 1438–1447, May 2020. [Google Scholar]
- [40].Liu X, Li Z, Ishii M, Hager GD, Taylor RH, and Unberath M, “SAGE: SLAM with appearance and geometry prior for endoscopy,” in Proc. IEEE Int. Conf. Robot. Autom., 2022, pp. 5587–5593. [Google Scholar]
- [41].Vagdargi P et al. , “Robot-assisted neuroendoscopy for real-time 3D guidance of transventricular approach to deep-brain targets,” in Proc. SPIE Med. Imag, 2022, Art. no. 120340E. [Google Scholar]
- [42].Wu P et al. , “Using uncertainty in deep learning reconstruction for cone-beam CT of the brain,” in Proc. Fully3D, 2021, pp. 1–6. [Google Scholar]
- [43].Johnson HJ, Harris G, and Williams K, “BRAINSFit: Mutual information registrations of whole-brain 3D images, using the insight toolkit,” Insight J., vol. 180, no. 10, pp. 1–10, 2007.
- [44].Mur-Artal R, Montiel JMM, and Tardos JD, “ORB-SLAM: A versatile and accurate monocular SLAM system,” IEEE Trans. Robot., vol. 31, no. 5, pp. 1147–1163, Oct. 2015.
- [45].Thrun S and Montemerlo M, “The graph SLAM algorithm with applications to large-scale mapping of urban structures,” Int. J. Robot. Res., vol. 25, nos. 5–6, pp. 403–429, 2006.
- [46].Grisetti G, Kümmerle R, Stachniss C, and Burgard W, “A tutorial on graph-based SLAM,” IEEE Intell. Transp. Syst. Mag., vol. 2, no. 4, pp. 31–43, Dec. 2010.
- [47].Klein G and Murray D, “Parallel tracking and mapping for small AR workspaces,” in Proc. Int. Symp. Mixed Augmented Real., 2007, pp. 225–234.
- [48].Freda L, “PySLAM v2,” 2022. Accessed: Sep. 9, 2022. [Online]. Available: https://github.com/luigifreda/pyslam
- [49].Bradski G, “The OpenCV Library,” Dr. Dobb’s J. Softw. Tools, vol. 25, no. 11, pp. 120–125, 2000.
- [50].Uneri A et al., “TREK: An integrated system architecture for intraoperative cone-beam CT-guided surgery,” Int. J. Comput. Assist. Radiol. Surg., vol. 7, no. 1, pp. 159–173, Jan. 2012.
- [51].Vagdargi P et al., “Robot-assisted ventriculoscopic 3D reconstruction for guidance of deep-brain stimulation surgery,” in Proc. SPIE Med. Imag., 2021, Art. no. 1159809.
- [52].Bay H, Ess A, Tuytelaars T, and Van Gool L, “Speeded-up robust features (SURF),” Comput. Vis. Image Underst., vol. 110, no. 3, pp. 346–359, 2008.
- [53].Lowe DG, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
- [54].Muja M and Lowe DG, “Fast approximate nearest neighbors with automatic algorithm configuration,” in Proc. Int. Conf. Comput. Vis. Theory Appl., 2009, pp. 1–10.
- [55].Chetverikov D, Svirko D, Stepanov D, and Krsek P, “The trimmed iterative closest point algorithm,” in Proc. Int. Conf. Pattern Recognit., 2002, pp. 545–548.
- [56].Girardeau-Montaut D, “CloudCompare: 3D point cloud and mesh processing software,” 2015. [Online]. Available: https://www.cloudcompare.org/
- [57].“Blender: Free and open 3D creation software,” Blender.org, 2015. [Online]. Available: https://www.blender.org/
- [58].Sethian JA, “A fast marching level set method for monotonically advancing fronts,” Proc. Nat. Acad. Sci., vol. 93, no. 4, pp. 1591–1595, 1996.
- [59].Knapitsch A, Park J, Zhou Q-Y, and Koltun V, “Tanks and temples: Benchmarking large-scale scene reconstruction,” ACM Trans. Graph., vol. 36, no. 4, p. 78, 2017.
- [60].Schönberger JL, Zheng E, Frahm JM, and Pollefeys M, “Pixelwise view selection for unstructured multi-view stereo,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 501–518.
- [61].Schönberger JL and Frahm J-M, “Structure-from-motion revisited,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4104–4113.
- [62].Schönberger JL, Hardmeier H, Sattler T, and Pollefeys M, “Comparative evaluation of hand-crafted and learned local features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6959–6968.
- [63].Chen L, Rottensteiner F, and Heipke C, “Feature detection and description for image matching: From hand-crafted design to deep learning,” Geo-Spatial Inf. Sci., vol. 24, no. 1, pp. 58–74, 2021.
- [64].Liu X et al., “Extremely dense point correspondences using a learned feature descriptor,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2020, pp. 4847–4856.
- [65].Li Y, Brasch N, Wang Y, Navab N, and Tombari F, “Structure-SLAM: Low-drift monocular SLAM in indoor environments,” IEEE Robot. Autom. Lett., vol. 5, no. 4, pp. 6583–6590, Oct. 2020.
