Author manuscript; available in PMC: 2026 Feb 18.
Published in final edited form as: IEEE Trans Biomed Eng. 2025 Jul;72(7):2249–2260. doi: 10.1109/TBME.2025.3538683

Self-Supervised Feature Detection and 3D Reconstruction for Real-Time Neuroendoscopic Guidance

Prasad Vagdargi 1, Ali Uneri 2, Stephen Z Liu 3, Craig K Jones 4, Alejandro Sisniega 5, Junghoon Lee 6, Patrick A Helm 7, Ryan P Lee 8, Mark G Luciano 9, Gregory D Hager 10, Jeffrey H Siewerdsen 11
PMCID: PMC12910575  NIHMSID: NIHMS2125896  PMID: 40031732

Abstract

Objective:

The transventricular approach to deep-brain targets offers direct visualization but also imparts deformation that challenges accurate neuronavigation. 3D reconstruction and registration of the endoscopic view could provide up-to-date, real-time guidance. We develop and evaluate a self-supervised feature detection method for 3D reconstruction and navigation in neuroendoscopy.

Methods:

Unlabeled neuroendoscopic video data from 15 clinical cases, yielding 11,527 video frames, were used to train a self-supervised learning method (R2D2-E) with 5-fold cross validation, integrated into a simultaneous localization and mapping (SLAM) pipeline for 3D reconstruction. A series of experiments guided nominal hyperparameter selection and evaluated performance in comparison to SIFT, SURF, and SuperPoint in terms of the accuracy of feature matching and 3D reconstruction.

Results:

R2D2-E demonstrated superior performance in feature matching and 3D reconstruction. R2D2-E features achieved a median projected error of 0.64 mm compared to 0.90 mm, 0.99 mm, and 0.83 mm for SIFT, SURF, and SuperPoint, respectively. The method also improved F1 score by 14%, 25%, and 22% compared to SIFT, SURF, and SuperPoint, respectively.

Conclusion:

The proposed feature detection approach enables accurate, real-time 3D reconstruction in neuroendoscopy, offering robust feature detection in the presence of endoscopic artifacts and providing up-to-date navigation following soft-tissue deformation.

Significance:

The self-supervised feature detection method advances capabilities for vision-based guidance and augmented visualization of target structures in neuroendoscopic procedures. The approach could enhance the accuracy and precision of neurosurgery to improve patient outcomes.

Index Terms— image-guided surgery, simultaneous localization and mapping, 3D reconstruction, augmented reality, neurosurgery, feature detection

I. Introduction

Neuroendoscopic procedures are emerging as a valuable approach for accessing deep-brain targets, offering direct visualization of structures that are typically challenging to reach through conventional methods [1], [2]. Clinical applications of the technique include endoscopic third ventriculostomy (ETV), choroid plexus cauterization, cyst fenestration or biopsy, intraventricular tumor resection, basal cistern exploration, and potentially deep-brain stimulation [3], [4], [5]. A transventricular approach is useful for accessing targets in close proximity to the lateral or third ventricles that are difficult to reach by other procedures. However, brain shift and egress of cerebrospinal fluid (CSF) during neuroendoscopic procedures induce geometric deformation of deep-brain structures, challenging accurate neuronavigation [6]. Brain shift induces deformation ranging from 5 to 24 mm upon incision of the dura [7], primarily in regions close to the cortical surface, whereas CSF egress causes deformation of 4-10 mm about deep-brain structures in proximity to the ventricles [8]. Even with direct visualization, targets can present with inconspicuous appearance, occlusion, or location beyond the visible surface of the ventricles, leading to challenging target localization.

Conventional neuronavigation approaches for these procedures rely on rigid registration with preoperative magnetic resonance (MR) or computed tomography (CT) imaging. Both framed and frameless navigation systems utilizing optical [9] or electromagnetic tracking [10] employ fiducial markers [11] or surface delineation [12] for registration. While emerging robotic systems, such as ROSA (Zimmer Biomet, Warsaw, IN), Neuromate (Renishaw, Wotton-under-Edge, United Kingdom), and Excelsius (Globus Medical, Audubon, PA), offer advanced capabilities such as tool stabilization and cooperative control, they also rely on rigid registration of the cranium, thereby limiting accuracy in the presence of deep-brain deformation. Intraoperative 3D imaging, including MR, CT, and cone-beam CT (CBCT), has been explored as a means to capture accurate representations of deformed anatomy [13]. Intraoperative 3D imaging can provide clear visualization of intracranial structure, but workflow disruptions and high operational cost may limit broad utilization [14], [15]. Intraoperative ultrasound imaging presents another promising approach, offering real-time visualization of intracranial structures, but may be challenged by factors such as operator variability and poor image quality [16], [17].

Vision-based navigation has emerged as a promising method to support intraoperative tracking and guidance in various medical procedures [18]. Previous studies reported its application in bronchoscopy [19], transnasal skull-base surgery [20], and laparoscopy [21]. In these contexts, navigation was facilitated through 3D reconstruction of anatomy [22], which served as a means for registration with prior anatomical data. Two primary classes of algorithms were employed for this purpose: structure-from-motion (SfM) and simultaneous localization and mapping (SLAM). While SfM is best suited for one-time offline reconstruction, SLAM utilizes a sequence of video frames for pose estimation and reconstruction, making it potentially suitable for online, real-time navigation [23]. Previous work demonstrated development of a prototype Robot Assisted Ventriculostomy (RAV) system with 3D reconstruction and registration using SfM [24] and SLAM [25]. Recent advances in deep learning methods have also demonstrated significant promise in reconstructing 3D anatomy through dense visual tracking [26], depth estimation [20], and end-to-end volumetric reconstruction [27], with or without geometric priors [28].

Feature detection is a crucial step for camera localization in both SfM and SLAM, involving extraction of distinctive features across a sequence of image frames; this task may be challenged by variations in lighting, viewpoint, occlusions, and other artifacts observed in endoscopy. Classical feature detection algorithms, such as SIFT [29], SURF [30], and ORB [31], use hand-crafted, deterministic representations to identify unique features, represented as vector descriptors of the local image context at a detected feature location. Matching similar descriptor vectors establishes correspondence between images across frames, enabling the tracking of detected features in successive frames for estimating the camera pose.

Deep learning methods present an alternative means of feature detection that could address the limitations of classical detectors, including robustness to glare and occlusion. The development of robust feature detection methods for endoscopic imaging, particularly in neuroendoscopy, is limited by the scarcity of annotated ground truth data, owing to the labor-intensive process of labeling individual keypoints and establishing their correspondences across images. Traditional supervised learning approaches, which typically require large amounts of labeled data, are thus challenging to implement in this context.

Self-supervised methods have emerged as a promising alternative [32], [33], [34], [35], offering the potential to leverage abundant unlabeled data available from various sources. Recent advancements in self-supervised feature detection techniques [34], [35] demonstrated significant potential for application in vision-based navigation in endoscopy without the need for extensive manual annotation.

The work reported below aims to adapt self-supervised feature detection methods for navigation using 3D video reconstruction in neuroendoscopy. Specific contributions of the work include: (1) fine-tuning and adaptation of a data-driven, self-supervised approach for feature detection and description based on R2D2 [32] to the neuroendoscopy context; (2) the first report of clinical translation of the system for 3D reconstruction in a pilot study with 15 patients; (3) experiments evaluating nominal hyperparameter selection for feature detection and performance in image matching tasks; and (4) evaluation of the geometric accuracy of 3D reconstruction in clinical data using the novel feature detector with SLAM. The robustness of feature detection against typical artifacts observed in endoscopy was also investigated. Augmentation of neuroendoscopic video with color overlay of structures registered from MR images was also implemented to demonstrate direct visualization of 3D structures beyond the visible surface for assistance in target localization.

II. Methods

A. Prototype Neuroendoscopy System for Clinical Studies

The neuroendoscopy prototype consists of the components illustrated in Fig. 1. The system provides the basis for an IRB-approved clinical pilot study (IRB No.: IRB00283437) involving 15 patients undergoing left or right ETV with neuroendoscopy of the lateral ventricles as part of their treatment under the standard of care. Each of the 15 cases accrued under the clinical pilot study yielded video segments of 10 s – 47 s acquired at 30 frames per second (fps), resulting in a total of 11,527 ventriculoscopic video frames. Reconstruction was based on a continuous video segment acquired in a sweep of the lateral ventricles during an early phase of the procedure. The large number of frames was further augmented (style and geometric perturbations) to form a rigorous, quantitative basis for validation. 5-fold cross validation was performed to evaluate geometric accuracy, where each of the five folds comprised 12 cases for training and 3 cases held out for validation. The held-out cases served as ground truth such that all cases were eventually analyzed for validation. Pre-operative T1-weighted MR images were collected from each patient in the study (along with intraoperative CBCT when available), and features of interest were segmented as detailed below [Fig. 1(c)]. The ventriculoscope was calibrated intraoperatively using a checkerboard pattern mounted on a calibration jig [36], as shown in Fig. 1(b). With the patient positioned supine – optionally with a radiolucent frame or headboard compatible with intraoperative CBCT – a straight 6° ventriculoscope (Lotta 28164LA, Karl Storz, Tuttlingen, Germany) was coupled with the camera (Image1HD H3-Z TH100, Karl Storz, Tuttlingen, Germany) for intraoperative neuroendoscopy. The endoscopic view was displayed on the integrated monitor (Image1 Hub, Karl Storz, Tuttlingen, Germany) – e.g., as illustrated in Fig. 1(d) – and the video stream was recorded using the Advanced Image and Data Acquisition/Archiving System (AIDA, Karl Storz, Tuttlingen, Germany) for offline analysis.

Fig. 1.

Fig. 1.

System and data acquisition for clinical neuroendoscopic procedures. (a) Patient setup, ventriculoscope, and ventriculoscopic display. (b) Intraoperative ventriculoscope calibration setup and the custom calibration jig. (c) Pre-operative MR image with corresponding deep-brain structures segmented for the right hemisphere. (d) Endoscopic image showing the Foramen of Monro, choroid plexus, and adjacent neurovasculature. The ability to accurately overlay structures from (c) within real-time ventriculostomy (d) could improve targeting in an emerging range of neuroendoscopic interventions.

An overview of the SLAM reconstruction pipeline with the proposed self-supervised feature detection network is shown with notation in Fig. 2(a). Preoperative MR images were denoted by the coordinate frame M, including segmentation of anatomical structures of interest, such as surfaces of the lateral and third ventricles, denoted S. When available, intraoperative CBCT imaging was denoted by the frame CT. Neuroendoscopic video data J = {I_i : i = 1…N_i} comprised individual frames I_i acquired over N_i frames and taken as input to the feature detection and 3D point-cloud reconstruction methods described below. The resulting 3D point cloud was denoted P = {X_k : k = 1…n_k}, where X_k ∈ ℝ³. The notation T_A^B denotes the homogeneous rigid transformation locating frame B with respect to frame A. T_M^S was obtained by segmentation of the lateral ventricles in the MR image, forming the ground truth surface in the MR frame. The transform of each video frame with respect to the point-cloud reconstruction was similarly denoted T_Ii^P. The 3D point cloud was registered to surface segmentations via T_P^S, and 3D annotations of target structures were fused with the endoscopic video via T_M^Ii = T_M^S (T_P^S)^-1 (T_Ii^P)^-1. Alternatively, registration of preoperative MR to intraoperative CBCT [37] (when available) yields the transform T_CT^M. The ventricular surface and segmented structures were then localized in the CBCT frame as T_CT^S = T_CT^M T_M^S. Target annotations were then fused with endoscopic video via T_CT^Ii = T_CT^S (T_P^S)^-1 (T_Ii^P)^-1.
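The chaining of rigid transforms described above can be sketched in a few lines of NumPy. The sketch below composes the MR-to-camera transform from the surface registration and camera pose; the numeric translations are hypothetical placeholders for illustration, not values from the study.

```python
import numpy as np

def inv(T):
    """Invert a 4x4 homogeneous rigid transform analytically."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

# Hypothetical example transforms (identity rotations, pure translations).
T_M_S = np.eye(4);  T_M_S[:3, 3] = [1.0, 2.0, 3.0]   # surface S in MR frame M
T_P_S = np.eye(4);  T_P_S[:3, 3] = [0.5, 0.0, 0.0]   # surface S in point-cloud frame P
T_Ii_P = np.eye(4); T_Ii_P[:3, 3] = [0.0, 0.2, 0.0]  # point cloud P in camera frame Ii

# Chain the transforms: MR frame -> surface -> point cloud -> camera frame.
T_M_Ii = T_M_S @ inv(T_P_S) @ inv(T_Ii_P)
```

The same composition pattern applies when intraoperative CBCT is available, with the CBCT-to-MR registration prepended to the chain.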

Fig. 2.

Fig. 2.

3D reconstruction and video fusion using self-supervised keypoint detection and description. (a) Flowchart depicting the steps for data acquisition, feature detection, and 3D reconstruction. Integration of ventriculoscopic video with reconstruction methods provides direct fusion of planned targets with the live endoscopic video. (b) Dual-branch self-supervised network architecture for feature detection, with the upper branch computing descriptors for the input image and the lower branch detecting keypoints at locations of maximal repeatability and reliability. (c) Keypoint error estimation in feature matching using simulated warping of images.

As shown in Fig. 2(b), the network extracts features, which are then matched across image pairs I_i and I_i+1 to obtain correspondences between the two images. These matches were filtered using MAGSAC [38] with a homography model to estimate the relative pose of the camera frame between the two images, T_Ii^Ii+1. Triangulation of these features (with camera intrinsics obtained from calibration) yields the sparse 3D point cloud (P), with bundle adjustment and statistical outlier filtering [39] to reduce noise from mismatched features. The point cloud was then registered using trimmed iterative closest point (ICP) registration for video fusion and visualization of the target structures directly in the endoscopic view.
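A minimal sketch of the descriptor-matching step, assuming L2-normalized descriptor arrays as input. Mutual nearest-neighbor filtering stands in here for the one-to-one matching described above; in the full pipeline, the retained matches would additionally be filtered with a robust estimator such as MAGSAC (e.g., OpenCV's USAC_MAGSAC flag).

```python
import numpy as np

def mutual_nn_matches(D1, D2):
    """Mutual nearest-neighbour matching of descriptor arrays.

    D1: (N1, d), D2: (N2, d). Returns a (k, 2) array of index pairs
    (i, j) such that D2[j] is the closest descriptor to D1[i] and
    vice versa (one-to-one matches)."""
    d = np.linalg.norm(D1[:, None, :] - D2[None, :, :], axis=2)  # pairwise L2
    nn12 = d.argmin(axis=1)          # best match in image 2 for each feature in 1
    nn21 = d.argmin(axis=0)          # best match in image 1 for each feature in 2
    keep = nn21[nn12] == np.arange(len(D1))   # keep only mutual agreements
    return np.stack([np.arange(len(D1))[keep], nn12[keep]], axis=1)
```

For realistically sized descriptor sets, the brute-force distance matrix would be replaced by an approximate nearest-neighbor index.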

B. Feature Detection and Description

Classic feature detection via SIFT and SURF, along with a learned feature detection method [40], served as the basis of comparison to the proposed approach detailed below. For both SIFT and SURF, keypoint localization was performed by identifying local extrema (minima and maxima) in the scale space, ensuring that detected features were stable across different scales. Each keypoint was assigned an orientation based on the local gradient directions in its neighborhood, providing invariance to image rotation. A 128-dimensional descriptor was obtained for each SIFT keypoint by computing histograms of gradient magnitude and orientation in a local window. Similarly, 128-dimensional SURF descriptors were computed from Haar wavelet responses in the local window centered on the keypoint location. Descriptors from different images were matched using the L2 distance metric to establish correspondences between detected features.
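As a toy illustration of the scale-space extrema search underlying SIFT/SURF keypoint localization, the sketch below marks a pixel as a candidate keypoint when it is the maximum or minimum of its 3×3×3 neighborhood in a difference-of-Gaussians-like stack. The brute-force loops and the contrast threshold are illustrative only; production detectors add subpixel refinement and edge-response rejection.

```python
import numpy as np

def scale_space_extrema(stack, thresh=0.01):
    """Find local extrema across a (S, H, W) scale-space stack.

    A pixel is a keypoint candidate if it is the maximum or minimum of
    its 3x3x3 neighbourhood (scale, y, x) and exceeds a contrast
    threshold. Returns a list of (scale, y, x) tuples."""
    S, H, W = stack.shape
    kps = []
    for s in range(1, S - 1):
        for y in range(1, H - 1):
            for x in range(1, W - 1):
                v = stack[s, y, x]
                if abs(v) < thresh:          # reject low-contrast responses
                    continue
                nb = stack[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
                if v == nb.max() or v == nb.min():
                    kps.append((s, y, x))
    return kps
```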

The recently reported learning-based feature detection approach, SuperPoint [40], employs a fully convolutional neural network architecture that jointly performs interest point detection and description in a single forward pass. The network consists of a shared encoder that processes a grayscale input image, followed by two decoder heads: one for interest point detection and another for descriptor computation. The detector head outputs a tensor of feature probabilities for each pixel, while the descriptor head generates a semi-dense grid of 256-dimensional descriptors. The training procedure involves (1) basic corner detection on a synthetic dataset of simple geometric shapes and (2) self-supervised homographic adaptation applied to adapt the method to real-world images, in which multiple random homographic warps of each training image serve as pseudo-ground-truth labels. SuperPoint was adapted to the neuroendoscopy context following Step 2, using the clinical dataset described above and the same partitioning of 12 cases for training and 3 cases for validation with 5-fold cross validation over all cases. The same camera calibration, image extraction, and cropping steps were followed, and, as described below, data augmentation entailed spatial variations (randomized resizing, skew transformation, and rotation) and intensity variations (randomized glare, noise, brightness, hue, contrast, and saturation) applied to each image during training.

The R2D2 [32] fully convolutional neural network was adapted in this work to endoscopic video, referred to as “R2D2-E” and developed to jointly learn keypoint detection, local descriptors, and descriptor reliability from endoscopic training data. The architecture is similar to R2D2, as shown in Fig. 2(b), and predicts three outputs: dense descriptors, a repeatability heatmap for keypoint detection, and a reliability heatmap for descriptor discriminability. The network consists of a shared backbone of weights that couples the tasks of keypoint detection and description. The R2D2-E architecture used an L2Net backbone [41] with 7 convolutional blocks, with downsampling achieved by stride-2 convolution. The last 8×8 convolutional layer was replaced by three 2×2 convolutional layers to reduce the number of weights. Batch normalization was used after each convolutional layer along with ReLU activation. The shared backbone forms a 128-dimensional tensor serving as input to a detector head. The detector head yields two branches, each following a 1×1 convolution and a softmax operation, forming the repeatability heatmap denoted RP ∈ [0,1]^{W×H} and the reliability heatmap denoted RL ∈ [0,1]^{W×H}. Simultaneously, the backbone output serves as input to an L2 normalization layer, resulting in the per-pixel descriptor tensor denoted D ∈ [0,1]^{H×W×128}. The loss function for the descriptor branch is based on a differentiable Average Precision (AP) metric [42] that learns to rank descriptors for matching. The repeatability head was trained to encourage consistent feature localization at local image maxima across viewpoints, whereas the reliability heatmap was trained jointly with the descriptor loss to avoid ambiguous or repetitive regions across different images.

R2D2-E was trained using self-supervision across pairs of endoscopic images, augmented to impart a broad variety of image characteristics in the training set, including blur, illumination glare, and perspective distortion. Given an image I_1 of size W×H, the region outside the endoscopic circular view was masked, and a spatial transformation was applied (denoted T_12, applying perspective transformations), followed by an image-domain transformation (denoted H_12, imparting effects of blur, illumination glare, and color shift). The resulting image, I_2, has a known ground-truth transformation mapping its pixels from the original image. Corresponding image patches p ∈ P of size N×N px were sampled from both images for training the network. Prior work [32] details the individual loss terms along with the weighting factors between losses.

During inference, the input image was masked to the endoscopic circular field and processed by the network to obtain the repeatability and reliability heatmaps and the descriptors in a single forward pass. The feature confidence score was computed as RP × RL and thresholded, followed by non-maximum suppression to select a single feature located at pixel (x, y) from each patch p, thus allowing choice in the density of features by varying the patch size. The corresponding L2-normalized feature descriptors were extracted at the feature locations, yielding N_f features denoted F_i = {(f_j(x, y), D_j) : j = 1…N_f}. Feature matching was performed between a pair of images with features (F_1, F_2) using the L2 descriptor vector distance as the matching criterion, resulting in N_M one-to-one matches. These matches were further filtered using MAGSAC [38] with a confidence threshold of 0.99, a reprojection error threshold of 1.5 px, and 1000 maximum iterations. The resulting filtered matches formed a subset of N_FM matches consistent with the image homography.
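The confidence scoring and per-patch keypoint selection can be sketched as follows. This is a simplified stand-in for the non-maximum suppression described above: it keeps at most one maximum per non-overlapping patch, with the patch size and score threshold chosen here purely for illustration.

```python
import numpy as np

def select_keypoints(RP, RL, patch=16, score_thresh=0.7):
    """Pick at most one keypoint per patch from the repeatability (RP)
    and reliability (RL) heatmaps, scoring each pixel as RP * RL.

    Returns a list of (y, x, score) tuples."""
    score = RP * RL
    H, W = score.shape
    kps = []
    for y0 in range(0, H, patch):
        for x0 in range(0, W, patch):
            tile = score[y0:y0 + patch, x0:x0 + patch]
            y, x = np.unravel_index(tile.argmax(), tile.shape)
            if tile[y, x] >= score_thresh:   # confidence threshold
                kps.append((y0 + y, x0 + x, tile[y, x]))
    return kps
```

Varying `patch` trades off feature density against spacing, mirroring the role of the patch size in the method above.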

C. 3D Surface Reconstruction and Registration

The pipeline in Fig. 2(a) was developed for real-time 3D video reconstruction and video fusion, consisting of two major components: feature detection and matching, followed by 3D reconstruction and registration – the latter using PySLAM v2 [43], OpenCV, and TREK [44] in a parallel, multithreaded architecture for tracking and mapping. As part of the tracking thread, each image frame (I_i) was taken as input to R2D2-E to compute features and corresponding descriptors (F_i). Matching was then performed for features detected in consecutive frames, followed by filtering. Successfully matched features were then designated as keypoints and, based on the fraction of keypoints tracked in the current frame (compared to the previous frame), the frame was designated a keyframe and inserted into a pose graph for mapping.

In parallel with the tracking thread, the mapping thread observes the inserted keyframes in the pose graph to reconstruct the 3D point cloud. After obtaining matched feature correspondences between image pairs, epipolar geometry constraints were applied to estimate the relative camera motion and 3D structure. For a known set of corresponding features f_1 and f_2 in two images, the essential matrix (E) was estimated using the five-point algorithm with MAGSAC [38], requiring at least five point-to-point correspondences satisfying the epipolar constraint f_2^T K_2^-T E K_1^-1 f_1 = 0, where K_1 and K_2 are the camera intrinsic matrices obtained from calibration. The essential matrix encodes the relative rotation (R) and translation (t) between the two camera positions (T_I1^I2), decomposed as E = [t]_× R, where [t]_× represents the skew-symmetric matrix of the translation vector. For each triangulated point with homogeneous 3D coordinates denoted X, a linear system of equations was solved: [f_1]_× P_1 X = 0 and [f_2]_× P_2 X = 0, where P_1 and P_2 are the projection matrices for the two views. This system was solved using the Direct Linear Transformation (DLT) [45] to yield the triangulated points, followed by nonlinear optimization to minimize the geometric reprojection error:

min_{X_i} Σ_i [ ‖f_1i − π(P_1 X_i)‖² + ‖f_2i − π(P_2 X_i)‖² ]  (1)

where π represents the perspective projection. The resulting 3D points and camera poses were further refined through bundle adjustment in a local window of keyframes to minimize the total reprojection error across multiple views simultaneously. Outlier points were filtered according to realistic system geometry (e.g., minimum or maximum focal length and triangulation angles). A local window of 10 frames was used for bundle adjustment, and global bundle adjustment was not performed. Keyframes in which more than 90% of keypoints were observed in at least 3 other keyframes were deemed redundant and removed to reduce computational load. Finally, the resulting point cloud was refined using statistical filtering to remove outliers, returning a sparse point cloud of N_X 3D points, denoted P = {X_i : i = 1…N_X}. R2D2-E was integrated with PySLAM [43] for feature detection.
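The DLT triangulation step can be sketched as follows: for each view, two rows of a linear system are built from u·(p_3ᵀX) − p_1ᵀX = 0 and v·(p_3ᵀX) − p_2ᵀX = 0 (p_k being the rows of the projection matrix), and the system is solved via SVD. This is a minimal sketch assuming normalized image coordinates; outlier filtering, nonlinear refinement, and bundle adjustment are omitted.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Triangulate one 3D point from two views by DLT.

    P1, P2: 3x4 projection matrices; x1, x2: (u, v) image coordinates.
    Stacks two linear constraints per view and takes the SVD null
    vector as the homogeneous solution."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],   # u1 * p3 - p1 for view 1
        x1[1] * P1[2] - P1[1],   # v1 * p3 - p2 for view 1
        x2[0] * P2[2] - P2[0],   # u2 * p3 - p1 for view 2
        x2[1] * P2[2] - P2[1],   # v2 * p3 - p2 for view 2
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                   # null-space vector (up to scale)
    return X[:3] / X[3]          # dehomogenize
```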

The 3D point cloud was registered to the structures segmented from MRI using trimmed ICP [46] with a point-to-plane metric. The resulting transformation (T_P^S) maps the point cloud to the surface and thereby to the MR imaging (T_M^P = T_M^S (T_P^S)^-1). The individual camera poses and video frames were then localized with respect to the MR frame as T_M^Ii = T_M^P (T_Ii^P)^-1, enabling augmentation of the video scene with the surgical plan, target structures, and any critical anatomy annotated in the MR imaging.
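A minimal sketch of one trimmed-ICP iteration is shown below. For brevity it uses a point-to-point metric with brute-force nearest neighbors and the Kabsch/SVD alignment, whereas the pipeline above used a point-to-plane metric (via CloudCompare); the trimming fraction gamma plays the same role as the fraction of overlap γ.

```python
import numpy as np

def trimmed_icp_step(src, dst, gamma=0.95):
    """One trimmed-ICP iteration (point-to-point metric for brevity).

    Finds nearest neighbours, keeps the gamma fraction of closest
    correspondences (trimming outliers), then solves the rigid
    alignment R, t via the Kabsch/SVD method."""
    d = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=2)
    idx = d.argmin(axis=1)                        # nearest dst point per src point
    keep = np.argsort(d.min(axis=1))[: int(gamma * len(src))]
    s, t = src[keep], dst[idx[keep]]
    sc, tc = s.mean(axis=0), t.mean(axis=0)
    U, _, Vt = np.linalg.svd((s - sc).T @ (t - tc))  # cross-covariance SVD
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                      # guard against reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, tc - R @ sc
```

In practice the step is iterated until the rotation and translation updates fall below convergence tolerances, as described above.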

III. Experimental Methods

A. Training Data Generation

The dataset exhibits diverse anatomy and a variety of endoscopic imaging artifacts. MR images were automatically segmented using Fastsurfer [47], [48] to annotate target structures such as the ventricular surfaces, amygdala, thalamus, and nucleus accumbens. Vasculature such as the distal basilar artery and proximal posterior cerebral arteries was manually segmented. Video data were corrected for distortion according to intraoperative geometric calibration [Fig. 1(b)] and cropped to the endoscopic field of view. Video frames were randomly resized to between 384 and 1024 px in length, maintaining aspect ratio, followed by cropping to (W, H) = (256 px, 256 px). Spatial augmentations (T_12) included a perspective skew transform ranging randomly over [0, 0.4]·W. Images were further rotated randomly over ±180°. Image-domain (intensity) augmentations included the addition of random distributions in glare: [0, 0.8], pixel noise: [0, 30 px], brightness: [0, 0.2], hue: [0, 0.1], contrast: [0, 0.3], and saturation: [0, 0.3]. The applied spatial augmentations were chosen to reflect a range of commonly observed variations in endoscopic procedures. Endoscopic artifacts such as glare, partial occlusion, bleeding, and smoke from cauterization were partially reflected in augmentation via local intensity perturbations to create a more robust training routine.
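The intensity augmentations can be sketched as follows, assuming images normalized to [0, 1]. The jitter ranges echo those listed above, and the Gaussian blob is a simplified stand-in for the randomized glare model (its width and placement are illustrative assumptions).

```python
import numpy as np

def augment(img, rng):
    """Randomized intensity augmentation for a training pair.

    img: float image in [0, 1], grayscale (H, W) or color (H, W, 3).
    Applies contrast and brightness jitter, then adds a Gaussian
    'glare' blob at a random location, and clips back to [0, 1]."""
    out = img.astype(np.float64)
    out = out * (1.0 + rng.uniform(-0.3, 0.3))    # contrast jitter: [0, 0.3]
    out = out + rng.uniform(-0.2, 0.2)            # brightness shift: [0, 0.2]
    H, W = out.shape[:2]
    yy, xx = np.mgrid[:H, :W]
    cy, cx = rng.integers(0, H), rng.integers(0, W)
    glare = rng.uniform(0.0, 0.8) * np.exp(
        -((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * (W / 8.0) ** 2))
    out = out + glare[..., None] if out.ndim == 3 else out + glare
    return np.clip(out, 0.0, 1.0)
```

The spatial augmentations (resizing, skew, rotation) would be applied to the same frame with a recorded transform so that ground-truth pixel correspondences remain known for self-supervised training.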

Each fold of the network was trained on a workstation with a single NVIDIA Titan RTX GPU with 24 GB memory. 3D reconstruction (SLAM) was performed on a separate workstation with NVIDIA GTX 980Ti GPU with 6 GB memory.

B. Experimental Evaluation and Validation

A series of experiments was conducted to optimize and evaluate the performance of the R2D2-E feature detector. The first experiment focused on hyperparameter selection and validation in the image domain. The second experiment evaluated the accuracy of keypoint detection and matching. The final experiment assessed the 3D geometric accuracy of reconstruction and registration.

1). Selection of Network Hyperparameters

The first experiment was designed to inform nominal selection of hyperparameters affecting feature detection and matching. The network was first pretrained on natural images of indoor and outdoor scenery as in [32]. The loss terms for repeatability and reliability were equally weighted, and other image sampling parameters were set to the recommended values from prior work [32]. A fixed batch size of 6 was selected according to GPU memory constraints, with the network trained for 30 epochs. The network was trained using the Adam optimizer with λ = 5×10⁻⁴, β₁ = 0.9, β₂ = 0.999, and ε = 1×10⁻⁸. To adapt the network to endoscopic imaging, the descriptor loss weighting factor κ was set to 0.2 for epoch 1, with an increase of 0.01 per epoch. Two univariate studies were conducted: 1) the learning rate (LR) was varied over the range [10⁻⁵, 10⁻²]; and 2) the patch size p was varied over the range [4, 128] to assess the effect of feature density on error.
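The stated optimizer settings and the κ ramp can be recorded as a small configuration sketch. The role of λ (e.g., weight decay) is not specified in the text, so it is simply stored verbatim here.

```python
# Nominal training hyperparameters as stated in the text; the role of
# lam (lambda) is an open assumption and is recorded verbatim.
adam_cfg = dict(lam=5e-4, betas=(0.9, 0.999), eps=1e-8, batch_size=6, epochs=30)

def descriptor_loss_weight(epoch):
    """Descriptor-loss weighting factor kappa: 0.2 at epoch 1,
    increasing by 0.01 per epoch thereafter."""
    return 0.2 + 0.01 * (epoch - 1)
```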

Performance of the feature detector was evaluated in terms of the following metrics. The median validation loss across epochs 20-30 was used to assess the convergence of network optimization. The number of features obtained for each patch size was noted as a measure of feature density. The fraction of matches (F_match) for a pair of images (I_1, I_2) presenting N_f^1 and N_f^2 detected features, respectively, and yielding N_M successful matches was computed as:

F_match(%) = N_M / min(N_f^1, N_f^2) × 100  (2)

The number of inlier matches (following MAGSAC filtering) was denoted N_FM, and the filtered match percentage (F_match^f) was computed as:

F_match^f(%) = N_FM / N_M × 100  (3)

Finally, the keypoint error (KPE) for matched features f_j in I_1 and f_k in I_2, as shown in Fig. 2(c), was defined as the distance between successfully matched and filtered keypoints on the image:

KPE = ‖T_12 f_j(x_j, y_j) − f_k(x_k, y_k)‖  (4)
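The metrics of Eqs. (2)-(4) can be computed as follows. The homography T12 and keypoint arrays in the usage below are hypothetical inputs for illustration.

```python
import numpy as np

def match_metrics(N1f, N2f, NM, NFM):
    """Match fraction (Eq. 2) and filtered match fraction (Eq. 3)."""
    f_match = 100.0 * NM / min(N1f, N2f)
    f_match_filtered = 100.0 * NFM / NM
    return f_match, f_match_filtered

def keypoint_error(T12, kp1, kp2):
    """Keypoint error (Eq. 4): distance between each keypoint of image 1
    warped by the known homography T12 and its matched keypoint in
    image 2, in pixels.

    T12: 3x3 homography; kp1, kp2: (N, 2) arrays of matched pixel coords."""
    h = np.hstack([kp1, np.ones((len(kp1), 1))]) @ T12.T   # warp to image 2
    warped = h[:, :2] / h[:, 2:3]                          # dehomogenize
    return np.linalg.norm(warped - kp2, axis=1)
```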

2). Evaluation of Feature Matching Accuracy

With parameter selections determined by the previous experiment, performance in feature matching was evaluated for the SIFT, SURF, and R2D2-E approaches. The number of features extracted from the image was chosen as the primary experimental variable, measuring its effect on the fraction of keypoints successfully matched across a series of video frames, defined as:

K_match(%) = N_FM / min(N_f^1, N_f^2) × 100  (5)

The resulting fraction of matched keypoints (before and after filtering) was compared while varying parameters that affect the number of detected keypoints, namely: 1) the contrast threshold for SIFT; 2) the Hessian threshold for SURF; and 3) the confidence score threshold for R2D2-E. For each method, the number of features extracted was varied over the range [500, 5000] in increments of 500 by varying the above thresholds. R2D2-E was trained with 5-fold cross validation, retaining 3 cases in each fold for validation. The KPE was quantitatively evaluated on these cases, resulting in performance validation for all cases individually. Finally, the distribution of track lengths was evaluated for each method. The accuracy of feature matching was evaluated in terms of KPE as well as qualitative assessment across a range of endoscopic views varying in quality (e.g., scenes with blur or glare) and content (scenes varying in feature density from relatively sparse to dense).

3). Evaluation of 3D Geometric Accuracy

The third experiment assessed the geometric accuracy of 3D point-cloud reconstruction with the feature detection method as the primary experimental variable. For all methods, the reconstructed point cloud was filtered using k_nn = 5 neighbors and σ_nn = 0.1 mm over 3 iterations, and the trimmed-ICP fraction of overlap (γ) was set to 0.95. Registration was taken to be converged at tolerance limits in rotation (Δr < 0.05°) and translation (Δt < 0.01 mm) between successive iterations. CloudCompare [49] was used for point-cloud filtering and registration.

The geometric accuracy of registration was evaluated in terms of projected error (PE). The right lateral ventricular surface was segmented from MR imaging using previously established methods [50], [51] to provide the ground truth surface, denoted S. For a given point X_i in the cloud and its closest point S_i on S, the PE is:

PE = proj_S(X_i − S_i)  (6)

where proj_S is the projection of the error onto S, followed by k-nearest-neighbor interpolation over the surface with k = 5. To reduce extrapolation artifacts, S was truncated to the region within a convex hull of the reconstructed point cloud. In a similar manner as in Section III.B.2, the PE was evaluated for all methods across the same 15 cases using 5-fold cross validation, with 3 cases held out in each fold.

Further assessment of the point-cloud reconstruction to quantify its accuracy and completeness was performed in terms of precision P(d_t) and recall R(d_t), as described in [52], with respect to a variable distance threshold (d_t). For a reconstructed point X_i ∈ 𝒳, the distance to S was:

e_X→S = min_{S_i ∈ S} ‖X_i − S_i‖  (7)

Precision is therefore the fraction of points from the point-cloud reconstruction (𝒳) that lie within the distance threshold (d_t) of S:

P(d_t) = (1/N_X) Σ_{i=1…N_X} [e_X→S < d_t]  (8)

where the square brackets [·] represent the Iverson binarizing operator, giving P(d_t) in the range [0, 1]. Similarly, for a ground truth point S_i ∈ S, the distance to the point-cloud reconstruction (𝒳) is:

e_{S→𝒳} = min_{X_i ∈ 𝒳} ‖S_i − X_i‖  (9)

yielding recall as the fraction of points from S to 𝒳 within threshold d_t:

R(d_t) = (1/N_S) Σ_{i=1}^{N_S} [e_{S→𝒳} < d_t]  (10)

with R(d_t) in the range [0, 1]. Precision and recall therefore form a basis for assessing system performance under an accuracy requirement (d_t), similar to an operating curve. The F1 score was also computed as the harmonic mean of precision and recall for each distance threshold, as derived in prior work [53].
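Eqs. (7)-(10) can be sketched directly in NumPy; the function name is illustrative, and brute-force nearest-neighbor search stands in for the accelerated structures one would use on full-size clouds:

```python
import numpy as np

def precision_recall_f1(X, S, d_t):
    """Precision/recall of a reconstruction X against ground truth S at
    distance threshold d_t, per Eqs. (7)-(10), with F1 as their harmonic mean."""
    X, S = np.asarray(X, dtype=float), np.asarray(S, dtype=float)
    # e_{X->S}: distance from each reconstructed point to its nearest GT point
    e_xs = np.min(np.linalg.norm(X[:, None, :] - S[None, :, :], axis=-1), axis=1)
    # e_{S->X}: distance from each GT point to its nearest reconstructed point
    e_sx = np.min(np.linalg.norm(S[:, None, :] - X[None, :, :], axis=-1), axis=1)
    p = float(np.mean(e_xs < d_t))   # Iverson bracket -> fraction in [0, 1]
    r = float(np.mean(e_sx < d_t))
    f1 = 0.0 if p + r == 0 else 2.0 * p * r / (p + r)
    return p, r, f1
```

Sweeping d_t and plotting P, R, and F1 reproduces the operating-curve style of assessment described above.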

Finally, visualization of the resulting registration in endoscopic video was achieved by overlay of target structures annotated in preoperative MR images (such as the amygdala, thalamus and nucleus accumbens) transformed to the camera frame. Blender [54] was used to define the 3D scene and camera poses, and segmented structures were rendered in alpha-blended overlay for direct visualization in the endoscopic field of view.
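The overlay step amounts to a rigid transform of segmented structure points into the camera frame followed by pinhole projection. The actual rendering was performed in Blender [54], so the function name, the 4×4 registration transform T_cam_mr, and the intrinsic matrix K below are illustrative assumptions:

```python
import numpy as np

def project_structure(points_mm, T_cam_mr, K):
    """Transform MR-space structure points (N x 3, mm) into the camera frame
    via the registration T_cam_mr (4x4), then project with pinhole intrinsics
    K (3x3). Returns pixel coordinates and depths for alpha-blended overlay."""
    pts = np.asarray(points_mm, dtype=float)
    hom = np.hstack([pts, np.ones((len(pts), 1))])   # homogeneous coordinates
    cam = (T_cam_mr @ hom.T).T[:, :3]                # points in the camera frame
    z = cam[:, 2]                                    # depth, usable as intensity cue
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                      # perspective divide
    return uv, z
```

Depth z can then modulate overlay intensity, giving the depth cues described for the rendered structures.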

IV. Results

A. Selection of Network Hyperparameters

Figure 3 summarizes the effect of network hyperparameters on validation loss and image matching accuracy, with 5-fold cross-validation during training resulting in 15 data points for each boxplot shown. The validation loss achieved for different learning rates (LRs) was evaluated using the default N = 16 px as in [32], as shown in Figure 3(a-c). A slight decrease in loss was observed with increasing learning rate, with a loss of 1.06 achieved for LR = 10⁻³. However, for larger LRs, the network failed to converge due to gradient explosion, resulting in final losses of 1.56 and 1.79 for LRs of 5×10⁻³ and 10⁻², respectively. Figure 3(b) shows the effect of LR on the match fraction, with an increase in both F_match^f and F_match up to maxima of 84% and 41%, respectively, as LR increased to 10⁻³. Beyond this value, the network failed to reliably predict features, resulting in no matches. Similarly, larger LRs yielded lower median KPE up to an LR of 10⁻³, as shown in Fig. 3(c), for which a KPE of 1.01 px was achieved. Notably, all successfully trained networks consistently achieved a KPE of less than 2 px. Based on these findings, a nominal LR of 10⁻³ was selected for further experiments.

Fig. 3.

Hyperparameter selection and sensitivity analysis for feature detection and description by evaluation of (a) validation loss, (b) successful matches, and (c) keypoint error as a function of learning rate (α), and (d, e, f) the same metrics as a function of patch size (N) for keypoint detection.

Figure 3(d-f) demonstrates the sensitivity of validation loss and match accuracy with respect to patch size. Smaller patch sizes, such as 4 px and 8 px, yielded a larger number of features; however, the corresponding networks did not converge sufficiently, with final losses of 2.08 and 1.51, respectively. Conversely, larger patch sizes led to lower final losses and a reduced number of features, as shown in Fig. 3(d). Figure 3(e) further illustrates the variation of F_match^f with increasing patch size. While F_match values were comparable across the evaluated patch sizes, a sudden increase in F_match^f was observed for N > 8 px. A maximum F_match^f of 94.5% was achieved for a patch size of 128 px, albeit with the tradeoff of obtaining only 167 features. The KPE decreased only slightly with larger patch sizes, varying from a maximum of 1.03 px for N = 8 px to 0.94 px for N = 128 px. To achieve feature density and accuracy consistent with clinical applications, a nominal value of N = 32 px was selected for further experiments, yielding a median of 3142 features and a KPE of 0.97 px.

B. Evaluation of Feature Matching Accuracy

A quantitative evaluation of R2D2-E for feature matching tasks was performed in comparison to SIFT, SURF, and SuperPoint, as shown in Figure 4. The sensitivity of the matched keypoint fraction (K_match) with respect to the number of extracted features (N_F) is shown in Figure 4(a). A sharp reduction in K_match was observed for both SIFT and SURF with increasing N_F, falling from 18% and 16%, respectively, at 500 features to 5.8% at 5,000 features. K_match using SuperPoint declined more gradually, from 12.1% at 500 features to 7.0% at 5,000 features. Classical feature detectors outperformed SuperPoint for N_F < 3000, beyond which SuperPoint was marginally better. In contrast, R2D2-E successfully matched a higher number of features for all N_F > 1000, with or without filtering, achieving a higher K_match and exhibiting a gradual decrease from 13.55% at 500 features to 8.3% at 5,000 features.
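For illustration, a matched-keypoint fraction of the kind reported as K_match can be computed from two descriptor sets with mutual nearest-neighbor matching plus Lowe's ratio test; the specific matching criterion and function name here are assumptions for the sketch, not necessarily the matcher used in the study:

```python
import numpy as np

def match_fraction(desc1, desc2, ratio=0.8):
    """Fraction of descriptors in desc1 that pass both Lowe's ratio test and a
    mutual nearest-neighbor check against desc2 (brute-force L2 matching)."""
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)                       # neighbors sorted by distance
    rows = np.arange(len(desc1))
    best, second = d[rows, nn[:, 0]], d[rows, nn[:, 1]]
    ratio_ok = best < ratio * second                 # Lowe's ratio test
    back = np.argmin(d, axis=0)                      # best row for each column
    mutual = back[nn[:, 0]] == rows                  # mutual nearest-neighbor check
    return float(np.mean(ratio_ok & mutual))
```

Distinctive descriptors yield a high fraction, while indistinguishable descriptors fail the ratio test, mirroring the drop in K_match reported for unreliable feature sets.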

Fig. 4.

Assessment of keypoint detection and matching accuracy: (a) Median matching accuracy (validation) measured as a function of epochs, compared to classical feature detectors before and after filtering, demonstrating superior performance following training. (b) KPE of the self-supervised method compared to classical feature detectors across 5 folds. (c) Track lengths (number of images) following feature matching for each feature detector.

Figure 4(b) shows that SIFT, SURF, and SuperPoint achieved median KPEs of 2.20 px, 1.70 px, and 1.46 px, respectively, while R2D2-E achieved a superior, sub-pixel median KPE of 0.83 px with fewer outliers.

The performance during feature tracking was quantified in Figure 4(c), showing the track lengths measured in the number of images, where a larger track length signified better performance. The median (IQR) track length for SIFT, SURF and SuperPoint methods was measured as 9 images (6 images), 11 images (9 images) and 13 images (7 images) respectively. R2D2-E demonstrated a major improvement in track length, with a median (IQR) of 19 images (7 images).

The qualitative performance of the methods was visualized in Fig. 5, with (a) showing a feature-dense endoscopic scene viewing the ventricular adhesions and (b) depicting a feature-sparse lateral ventricle surface with additional simulated artifacts. Successfully matched features with a KPE of less than 1 px were shown in green, while matches with a KPE greater than 1 px were shown in red. For the same number of extracted features, R2D2-E demonstrated a higher fraction of correct matches (Fmatchf) and a lower KPE, with features evenly distributed across the image.

Fig. 5.

Qualitative assessment of keypoint detection on original and simulated images for (a) feature-dense and (b) feature-sparse regions of the lateral ventricles. Successful matches (KPE < 1 px) are depicted in green, whereas matches with higher errors (KPE > 1 px) are marked in red. Blue arrows mark example areas of interest. SIFT and SURF tended to cluster features around high-contrast structures (including artifacts). R2D2-E tended to compute more evenly distributed features associated with anatomical landmarks and vasculature (better avoiding artifacts), resulting in more successful matches across the entire image.

SIFT and SURF methods typically detected high contrast structures and edges as features, shown in Fig. 5(c) by blue arrows, resulting in feature clustering around tissues such as the choroid plexus. In contrast, R2D2-E detected more evenly distributed features that appear more clearly associated with anatomical landmarks such as vasculature and other intraventricular anatomy, and assigned low confidence to regions affected by artifacts.

C. Evaluation of 3D Geometric Accuracy

The geometric accuracy of R2D2-E for 3D reconstruction was evaluated in comparison to the other methods as shown in Fig. 6. Projected error (PE) following reconstruction is shown in Fig. 6(a), with all methods achieving sub-2 mm accuracy, comparable to the error typical of stereotactic navigation systems. The median (IQR) PE for SIFT and SURF was 0.90 mm (1.45 mm) and 0.99 mm (1.60 mm), evaluated on 7 and 13 cases, respectively; 6 cases failed to track and reconstruct with SIFT, implying a success rate of 54%. SuperPoint achieved a somewhat lower PE of 0.83 mm (1.26 mm), evaluated on 10 cases, implying a success rate of 77%. R2D2-E achieved the lowest PE and reduced variability across all cases, with a median of 0.64 mm (IQR: 0.84 mm) and successful reconstructions for all 13 cases. The median (IQR) runtime for feature detection with R2D2-E was 0.39 s (0.01 s), with a tracking runtime of 1.11 s (0.06 s) per frame, permitting tracking at 0.90 fps in the current implementation.

Fig. 6.

Evaluation of geometric accuracy of registration when using SLAM-based reconstruction with R2D2-E feature detection, compared to other feature detectors: (a) projected error (PE) for different reconstructions, (b) F1 score, (c) precision, and (d) recall for point-cloud reconstructions.

The F1 score was used to further evaluate the point-cloud reconstructions as shown in Fig. 6(b). As the distance threshold (d_t) increased, the F1 score gradually improved for all methods. At d_t values of 1 mm and 2 mm, SIFT achieved F1 scores of 0.63 and 0.83, while SURF achieved scores of 0.48 and 0.75, respectively. SuperPoint achieved scores of 0.59 and 0.91 at the same distance thresholds, indicating better performance than SURF. R2D2-E outperformed all methods at all distance thresholds, consistently maintaining a smaller IQR and achieving scores of 0.72 and 0.94 for d_t values of 1 mm and 2 mm, respectively. This represents an improvement of 14%, 25%, and 22% in F1 score compared to SIFT, SURF, and SuperPoint, respectively, at d_t = 1 mm.

The precision and recall breakdown is shown in Fig. 6(c-d). Precision gradually improved for all methods with increasing d_t. SIFT and R2D2-E achieved similar median precision up to d_t = 1.5 mm, while SURF and SuperPoint exhibited lower overall precision. R2D2-E achieved a precision of 0.66 (0.26) and 0.91 (0.10) for d_t values of 1 mm and 2 mm, respectively, with a consistently lower IQR compared to the other methods. In terms of recall, R2D2-E consistently outperformed the other methods for d_t ≤ 1.5 mm, exhibiting a sharp increase between d_t of 0 and 1 mm and achieving a recall of 0.81 (0.14) and 0.97 (0.06) at 1 mm and 2 mm, respectively.

The segmentation and annotation of typical neuroendoscopic targets in MR imaging, along with the augmented overlay view of the right lateral ventricle, are illustrated in Figure 7. Following point-cloud registration, target structures surrounding the lateral ventricles, including the nucleus accumbens (cyan), amygdala (yellow), pallidum (green), and thalamus (magenta), were rendered and overlaid with depth cues encoded as intensity. The augmented visualizations are demonstrated for a feature-dense scene (Fig. 7a) along with a feature-sparse scene (Fig. 7b).

Fig. 7.

Augmented neuroendoscopic guidance with corresponding structures shown in MR imaging as demonstrated with a (a) feature-dense scene and (b) feature-sparse scene. Deep-brain targets such as accumbens (cyan), amygdala (yellow), pallidum (green) and thalamus (magenta) were augmented with depth cues as intensity for overlay. (c) Critical structures in neuroendoscopy such as the distal basilar artery and proximal posterior cerebral arteries are shown in red, providing spatial context and depth cues in endoscopic third ventriculostomy.

Critical structures in neuroendoscopy such as the distal basilar artery and proximal posterior cerebral arteries were also augmented with the neuroendoscopic view as shown in Fig. 7(c), demonstrating localization of targets both within and beyond the neuroendoscopic field of view to provide valuable spatial context for the surgeon. A video version of the augmented overlay is provided as supplemental material to demonstrate the dynamic nature of the visualization and its potential to enhance intraoperative guidance during neuroendoscopic procedures.

V. Discussion

A series of experiments were conducted to quantitatively evaluate the performance of R2D2-E for feature matching and 3D video reconstruction in clinical data. The first experiments informed the selection of nominal parameters for feature detection and matching in neuroendoscopic images. The findings showed that accurate feature detection was achieved with LR ≤ 1×10⁻³, whereas higher LR resulted in failure to converge, lower F_match, and KPE > 1 px. Analysis of patch size further showed an inverse relationship between N_F and patch size. A strong drop in F_match^f was observed for patch size < 16 px, associated with noisy feature sets that were unreliable for image matching. Larger patch size similarly reduced KPE at the cost of reduced N_F, and a nominal patch size of 32 px was determined to best balance the tradeoffs between quantity and quality of keypoints for robust scene mapping and 3D reconstruction. Previous work reported a similar tradeoff between accuracy and N_F [32] in outdoor video scenes and achieved similar matching performance.

The feature matching performance of R2D2-E was superior to both the classical methods (SIFT and SURF) and the learning-based SuperPoint, as evidenced in the second set of experiments. R2D2-E extracted strong features across a range of N_F, while the other methods showed a sharp decline in K_match at larger N_F, indicating unreliable features. R2D2-E also demonstrated a higher and more stable matching rate and achieved lower median KPE with reduced variance. The resulting improvement in homography and camera pose estimation following matching by R2D2-E could potentially reduce the error propagation associated with scale/camera drift compounded over time, a common problem in traditional endoscopic SLAM systems [55]. If longer video segments are available, loop detection and closure methods [56], [57] could further mitigate localization errors and correct for deviations with a continuous map update computed on a separate thread. As shown by arrows in Fig. 5(c), unlike classical methods that tend to cluster keypoints in high-contrast regions such as the choroid plexus, R2D2-E exhibited a more even distribution of keypoints over clinically relevant structures such as vasculature and enabled robust tracking in the presence of occlusions. The method intrinsically assigned low reliability and repeatability to regions with image artifacts, thereby rejecting keypoints in areas of glare or blur. The longer track lengths and higher median also demonstrated robust tracking of features across a series of video frames, which is crucial for consistent localization and reconstruction in SLAM. These findings suggest that R2D2-E not only outperformed the other evaluated methods in terms of accuracy but also provided a more robust and clinically practical approach to feature detection and matching in neuroendoscopic applications.

3D reconstructions performed with R2D2-E features consistently outperformed those from SIFT and SURF, with lower median and IQR in PE. All feature detectors yielded registration accuracy of PE < 2 mm, which may be clinically acceptable, but only in cases where the reconstruction contained a large number of points and registration was successful. R2D2-E demonstrated more consistent tracking and stable reconstructions, with all reconstructions successfully completed, compared to the other methods. R2D2-E yielded a slight improvement in PE and, more importantly, was much less susceptible to failure in real data. The augmented frame in Fig. 5(c) illustrates the improvement in detected features and track lengths with R2D2-E. Prior work [22] reported similar values of error and tracking runtime, as evaluated in phantom studies of varying anatomy using SLAM. The robustness of R2D2-E was evident in the higher median F1 score and reduced IQR across all tested distance thresholds. While reconstructions with classical features demonstrated a high level of precision in some instances, they were subject to larger variability and failure rate than R2D2-E. The superior recall achieved with the learning-based methods (SuperPoint and R2D2-E) was attributed to a more even distribution of keypoints across the entire surface of the lateral ventricle. The superior precision of R2D2-E was attributed to its network architecture and augmentation techniques, and its superior recall to the more uniform distribution of keypoints across the image.

The present study has several limitations. First, the dataset of 15 cases, while sufficient for early validation studies, may not fully represent the spectrum of anatomical variations and surgical scenarios encountered in clinical practice. Although the cases exhibited diverse patient-specific anatomical features in both endoscopic and MR imaging, thereby partially mitigating this limitation, the robustness of the approach warrants evaluation in future work with larger, more diverse datasets featuring greater anatomical variation. While the method appears well suited to procedures involving direct visualization of the lateral ventricles (e.g., ETV, CPC, intraventricular tumor resection, and deep brain stimulation), further investigation could also address the generalizability of the trained network across other neuroendoscopic procedures. By comparison, classical feature detectors, without reliance on learned exemplars, may be intrinsically more generalizable across different anatomical regions. Five-fold cross-validation during training improved robustness in the challenging scenario investigated in this work, and future studies could incorporate additional data to improve generalizability. The specific parameter values for image augmentation could also be tuned for generalizability across other endoscopic procedures. While the overlay visualization approach demonstrated feasibility, the current implementation lacks sophisticated depth cues such as motion parallax or depth contours, which could enhance surgeons' perception of subsurface structures. Future work could further improve augmented visualization using an interactive display and viewpoint rendering to improve depth perception of the overlaid 3D structures. Other methods to improve depth perception could use stereoscopic endoscopes and displays or haptic depth feedback.

Several strategies employed in this work for implementation of the R2D2-E pipeline may prove valuable for future researchers. Careful management of PySLAM dependencies and version control was crucial, particularly when integrating multiple libraries. Interfacing between Python-based deep learning frameworks and C++-based computer vision libraries employed custom binding wrappers to ensure interoperability. Precise calibration of the endoscopic camera, along with masking the image area per frame to avoid the sheath and other instrumentation was essential for accurate tracking and reconstruction. Additionally, a modular architecture for individual components of reconstruction, registration, and visualization tools facilitated debugging within the complex pipeline. System runtime could be further improved by optimization in compiled libraries (cf. scripted languages like Python) and more performant GPU hardware. To aid reproducibility, detailed documentation of specific software versions, configuration files, and custom scripts developed for data preprocessing or validation were maintained within a local git repository.
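The per-frame masking mentioned above (excluding the sheath and instrumentation from feature detection) can be sketched as a circular field-of-view mask; the function name, default center, and radius fraction are illustrative assumptions rather than the values used in the pipeline:

```python
import numpy as np

def endoscope_mask(h, w, center=None, radius_frac=0.45):
    """Boolean mask (h x w) selecting the circular endoscopic field of view,
    excluding the sheath/black border from downstream feature detection."""
    cy, cx = center if center is not None else (h / 2.0, w / 2.0)
    r = radius_frac * min(h, w)
    yy, xx = np.mgrid[0:h, 0:w]
    return (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
```

Keypoints falling outside the mask can simply be discarded before matching, which avoids spurious features on the static sheath edge.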

Looking ahead to potential clinical application, the ability of R2D2-E to provide accurate 3D reconstruction at a tracking runtime of 0.90 fps is potentially suitable for clinical neuroendoscopy. Augmented overlay of 3D structures segmented in MR, CT, or CBCT and registered directly to the endoscopic view could improve visualization and targeting of deep-brain structures in neuroendoscopic interventions and warrants further investigation to demonstrate its clinical value.

VI. Conclusion

The study presented R2D2-E, a self-supervised approach to robust feature detection for real-time 3D video reconstruction and navigation in clinical neuroendoscopy. Experiments identified nominal hyperparameters and demonstrated superior performance in image matching tasks compared to SIFT and SURF, for example: median KPE = 0.83 px (compared to 2.20 px and 1.70 px, respectively); median PE = 0.64 mm (compared to 0.90 mm and 0.99 mm, respectively); and F1 = 0.72 (at 1 mm distance threshold) and 0.94 (at 2 mm threshold), representing a 14% and 25% improvement over SIFT and SURF, respectively. The method achieved a tracking runtime of 0.90 fps, demonstrating its potential for real-time reconstruction in clinical neuroendoscopy. A clinical pilot study conducted in the context of endoscopic third ventriculostomy showed the potential value of augmenting real-time endoscopy with spatial context depicting critical structures beyond the visible field of view. These experiments established a rigorous quantitation of performance in clinical neuroendoscopy and support future integration with clinical systems.

Supplementary Material

R2D2E_AugmentedEndoscopy_VagdargiEtAl
Download video file (13.3MB, mp4)

Acknowledgments

This work is supported by NIH grant U01-NS-107133 and Biomedical Research Partnership (BRP) with Medtronic, Littleton, MA.

Contributor Information

Prasad Vagdargi, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA.

Ali Uneri, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA.

Stephen Z. Liu, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA

Craig K. Jones, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA

Alejandro Sisniega, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA.

Junghoon Lee, Departments of Radiation Oncology and Biomedical Engineering, Johns Hopkins School of Medicine, Baltimore, MD, USA.

Patrick A. Helm, Cranial and Spinal Technologies, Medtronic, Boxborough, MA, USA

Ryan P. Lee, Department of Neurosurgery, Johns Hopkins Medicine, Baltimore, MD, USA

Mark G. Luciano, Department of Neurosurgery, Johns Hopkins Medicine, Baltimore, MD, USA

Gregory D. Hager, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA.

Jeffrey H. Siewerdsen, Departments of Biomedical Engineering, Computer Science, and Neurosurgery, Johns Hopkins University, Baltimore, MD, USA and the Departments of Imaging Physics, Neurosurgery, and Radiation Physics at The University of Texas MD Anderson Cancer Center, Houston, TX, USA.

References

  • [1].Liu S et al. , “Neuronavigation-Guided Transcortical-Transventricular Endoport-Assisted Endoscopic Resection for Thalamic Lesions: Preliminary Experience,” World Neurosurg, vol. 166, pp. 19–27, Oct. 2022, doi: 10.1016/J.WNEU.2022.06.110. [DOI] [PubMed] [Google Scholar]
  • [2].Di Rocco F et al. , “Neuroendoscopic transventricular ventriculocystostomy in treatment for intracranial cysts,” J Neurosurg, vol. 103 PEDIAT, no. SUPPL. 1, pp. 54–60, 2005, doi: 10.3171/ped.2005.103.1.0054. [DOI] [PubMed] [Google Scholar]
  • [3].Kulkarni A et al. , “Transventricular Endoscopic Approach for Cystic Craniopharyngioma: Case Series,” J Neurol Surg B Skull Base, vol. 84, no. 6, pp. 591–597, 2022, doi: 10.1055/a-1934-9307. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [4].Teping F and Oertel J, “The minimally invasive transventricular endoscopic approach to third ventricular lesions in pediatric patients—all-rounder with limitations?,” Child’s Nervous System, vol. 39, no. 12, pp. 3381–3389, 2023, doi: 10.1007/s00381-023-06096-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [5].Poulen G et al. , “Microendoscopic transventricular deep brain stimulation of the anterior nucleus of the thalamus as a safe treatment in intractable epilepsy: A feasibility study,” Rev Neurol (Paris), vol. 178, no. 9, pp. 886–895, 2022, doi: 10.1016/j.neurol.2022.03.023. [DOI] [PubMed] [Google Scholar]
  • [6].Ivan ME et al. , “Brain shift during bur hole-based procedures using interventional MRI: Clinical article,” J Neurosurg, vol. 121, no. 1, pp. 149–160, 2014, doi: 10.3171/2014.3.JNS121312. [DOI] [PubMed] [Google Scholar]
  • [7].Nimsky C et al. , “Quantification of, visualization of, and compensation for brain shift using intraoperative magnetic resonance imaging,” Neurosurgery, vol. 47, no. 5, pp. 1070–9; discussion 1079-80, Nov. 2000, doi: 10.1097/00006123-200011000-00008. [DOI] [PubMed] [Google Scholar]
  • [8].Khan MF et al. , “Assessment of brain shift related to deep brain stimulation surgery.,” Stereotact Funct Neurosurg, vol. 86, no. 1, pp. 44–53, 2008, doi: 10.1159/000108588. [DOI] [PubMed] [Google Scholar]
  • [9].Palys V and Holloway KL, “Frameless Functional Stereotactic Approaches,” Prog Neurol Surg, vol. 33, pp. 168–186, 2018, doi: 10.1159/000481102. [DOI] [PubMed] [Google Scholar]
  • [10].Candy NG et al. , “Neuronavigation in endoscopic skull base surgery and the accuracy of different MRI sequences,” Journal of Clinical Neuroscience, vol. 123, pp. 203–208, 2024, doi: 10.1016/j.jocn.2024.04.006. [DOI] [PubMed] [Google Scholar]
  • [11].Liu L et al. , “Frameless ROSA® Robot-Assisted Lead Implantation for Deep Brain Stimulation: Technique and Accuracy,” Operative Neurosurgery, vol. 19, no. 1, pp. 57–64, 2020, doi: 10.1093/ons/opz320. [DOI] [PubMed] [Google Scholar]
  • [12].Duque SG et al. , “Endoscopic endonasal transphenoidal surgery using the BrainLAB ® headband for navigation without rigid fixation,” J Neurol Surg A Cent Eur Neurosurg, vol. 75, no. 4, pp. 267–269, 2014, doi: 10.1055/s-0032-1331383. [DOI] [PubMed] [Google Scholar]
  • [13].Holloway K and Docef A, “A quantitative assessment of the accuracy and reliability of O-arm images for deep brain stimulation surgery,” Neurosurgery, vol. 72, 2013, doi: 10.1227/NEU.0b013e318273a090. [DOI] [PubMed] [Google Scholar]
  • [14].Manfield J et al. , “Evaluation of 3D C-Arm Fluoroscopy versus Diagnostic CT for Deep Brain Stimulation Stereotactic Registration and Post-Operative Lead Localization,” Stereotact Funct Neurosurg, vol. 102, no. 3, pp. 195 – 202, 2024, doi: 10.1159/000536017. [DOI] [PubMed] [Google Scholar]
  • [15].Zaffino P et al. , “A Review on Advances in Intraoperative Imaging for Surgery and Therapy: Imagining the Operating Room of the Future,” 2020. doi: 10.1007/s10439-020-02553-6. [DOI] [PubMed] [Google Scholar]
  • [16].Ivanov M et al. , “Intraoperative ultrasound in neurosurgery - A practical guide,” Br J Neurosurg, vol. 24, no. 5, pp. 510–517, 2010, doi: 10.3109/02688697.2010.495165. [DOI] [PubMed] [Google Scholar]
  • [17].Ganau M et al. , “Real-time intraoperative ultrasound in brain surgery: Neuronavigation and use of contrast-enhanced image fusion,” 2019. doi: 10.21037/qims.2019.03.06. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [18].Schmidt A et al. , “Tracking and mapping in medical computer vision: A review,” 2024. doi: 10.1016/j.media.2024.103131. [DOI] [PubMed] [Google Scholar]
  • [19].Visentini-Scarzanella M et al. , “Deep monocular 3D reconstruction for assisted navigation in bronchoscopy,” Int J Comput Assist Radiol Surg, vol. 12, no. 7, pp. 1089–1099, Jul. 2017, doi: 10.1007/s11548-017-1609-2. [DOI] [PubMed] [Google Scholar]
  • [20].Liu X et al. , “Dense Depth Estimation in Monocular Endoscopy with Self-Supervised Learning Methods,” IEEE Trans Med Imaging, vol. 39, no. 5, pp. 1438–1447, May 2020, doi: 10.1109/TMI.2019.2950936. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [21].Lin B et al. , “Video-based 3D reconstruction, laparoscope localization and deformation recovery for abdominal minimally invasive surgery: a survey,” 2016. doi: 10.1002/rcs.1661. [DOI] [PubMed] [Google Scholar]
  • [22].Yang Z et al. , “3D reconstruction from endoscopy images: A survey,” Comput Biol Med, vol. 175, p. 108546, 2024, doi: 10.1016/j.compbiomed.2024.108546. [DOI] [PubMed] [Google Scholar]
  • [23].Klein G and Murray D, “Parallel tracking and mapping for small AR workspaces,” in International Symposium on Mixed and Augmented Reality, 2007. doi: 10.1109/ISMAR.2007.4538852. [DOI] [Google Scholar]
  • [24].Vagdargi P et al. , “Pre-Clinical Development of Robot-Assisted Ventriculoscopy for 3-D Image Reconstruction and Guidance of Deep Brain Neurosurgery,” IEEE Trans Med Robot Bionics, vol. 4, no. 1, pp. 28–37, Feb. 2022, doi: 10.1109/TMRB.2021.3125322. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [25].Vagdargi P et al. , “Real-Time 3-D Video Reconstruction for Guidance of Transventricular Neurosurgery,” IEEE Trans Med Robot Bionics, vol. 5, no. 3, pp. 669–682, 2023, doi: 10.1109/TMRB.2023.3292450. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [26].Mahmoud N et al. , “Live tracking and dense reconstruction for handheld monocular endoscopy,” IEEE Trans Med Imaging, vol. 38, no. 1, pp. 79–89, 2019, doi: 10.1109/TMI.2018.2856109. [DOI] [PubMed] [Google Scholar]
  • [27].Vagdargi P et al. , “End-to-end 3D neuroendoscopic video reconstruction for robot-assisted ventriculostomy,” in Medical Imaging 2024: Image-Guided Procedures, Robotic Interventions, and Modeling, Siewerdsen JH and Rettmann ME, Eds., SPIE, 2024, p. 129280M. doi: 10.1117/12.3008758. [DOI] [Google Scholar]
  • [28].Liu X et al. , “SAGE: SLAM with Appearance and Geometry Prior for Endoscopy,” Proc IEEE Int Conf Robot Autom, pp. 5587–5593, 2022, doi: 10.1109/ICRA46639.2022.9812257. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [29].Lowe DG, “Distinctive image features from scale-invariant keypoints,” Int J Comput Vis, vol. 60, no. 2, pp. 91–110, 2004, doi: 10.1023/B:VISI.0000029664.99615.94. [DOI] [Google Scholar]
  • [30].Bay H et al. , “Speeded-Up Robust Features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008, doi: 10.1016/j.cviu.2007.09.014. [DOI] [Google Scholar]
  • [31].Rublee E et al. , “ORB: An efficient alternative to SIFT or SURF,” in Proceedings of the IEEE International Conference on Computer Vision, 2011, pp. 2564–2571. doi: 10.1109/ICCV.2011.6126544. [DOI] [Google Scholar]
  • [32].Revaud J et al. , “R2D2: Repeatable and reliable detector and descriptor,” in Advances in Neural Information Processing Systems, 2019. [Google Scholar]
  • [33].Detone D et al. , “SuperPoint: Self-supervised interest point detection and description,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2018. doi: 10.1109/CVPRW.2018.00060. [DOI] [Google Scholar]
  • [34].Wang Q et al. , “Learning Feature Descriptors Using Camera Pose Supervision,” in Computer Vision – ECCV 2020, Vedaldi A et al., Eds., Cham: Springer International Publishing, 2020, pp. 757–774. [Google Scholar]
  • [35].Liu C et al. , “A Review of Keypoints’ Detection and Feature Description in Image Registration,” Sci Program, vol. 2021, no. 1, p. 8509164, 2021, doi: 10.1155/2021/8509164. [DOI] [Google Scholar]
  • [36].Vagdargi P et al. , “Real-time 3D neuroendoscopic guidance using SLAM: first clinical studies,” 2023, p. 3. doi: 10.1117/12.2654595. [DOI] [Google Scholar]
  • [37].Han R et al. , “Joint synthesis and registration network for deformable MR-CBCT image registration for neurosurgical guidance,” Phys Med Biol, vol. 67, no. 12, p. 125008, Jun. 2022, doi: 10.1088/1361-6560/ac72ef. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [38].Barath D et al. , “MAGSAC++, a fast, reliable and accurate robust estimator,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2020, pp. 1301–1309. doi: 10.1109/CVPR42600.2020.00138. [DOI] [Google Scholar]
  • [39].Rusu RB et al. , “Towards 3D Point cloud based object maps for household environments,” Rob Auton Syst, vol. 56, no. 11, pp. 927–941, 2008, doi: 10.1016/j.robot.2008.08.005. [DOI] [Google Scholar]
  • [40].DeTone D et al. , “SuperPoint: Self-Supervised Interest Point Detection and Description,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 337–33712. doi: 10.1109/CVPRW.2018.00060. [DOI] [Google Scholar]
  • [41].Tian Y et al. , “L2-Net: Deep Learning of Discriminative Patch Descriptor in Euclidean Space,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6128–6136. doi: 10.1109/CVPR.2017.649. [DOI] [Google Scholar]
  • [42].He K et al. , “Local Descriptors Optimized for Average Precision,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018, pp. 596–605. doi: 10.1109/CVPR.2018.00069. [DOI] [Google Scholar]
  • [43].Freda L, “PySLAM, Monocular Visual Odometry Pipeline.” [Online]. Available: https://github.com/luigifreda/pyslam [Google Scholar]
  • [44].Uneri A et al. , “TREK: An integrated system architecture for intraoperative cone-beam CT-guided surgery,” Int J Comput Assist Radiol Surg, vol. 7, no. 1, pp. 159–173, Jan. 2012, doi: 10.1007/s11548-011-0636-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [45].Nister D, “An efficient solution to the five-point relative pose problem,” IEEE Trans Pattern Anal Mach Intell, vol. 26, no. 6, pp. 756–770, 2004, doi: 10.1109/TPAMI.2004.17. [DOI] [PubMed] [Google Scholar]
  • [46].Chetverikov D et al. , “The Trimmed Iterative Closest Point algorithm,” in Proceedings - International Conference on Pattern Recognition, IEEE Comput. Soc, 2002. doi: 10.1109/icpr.2002.1047997. [DOI] [Google Scholar]
  • [47].Shao M et al. , “Brain ventricle parcellation using a deep neural network: Application to patients with ventriculomegaly,” Neuroimage Clin, 2019, doi: 10.1016/j.nicl.2019.101871. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [48].Henschel L et al. , “FastSurfer - A fast and accurate deep learning based neuroimaging pipeline,” Neuroimage, vol. 219, 2020, doi: 10.1016/j.neuroimage.2020.117012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [49].Girardeau-Montaut D, “CloudCompare: 3D point cloud and mesh processing software.” [Online]. Available: https://www.cloudcompare.org/ [Google Scholar]
  • [50].Shao M et al. , “Brain ventricle parcellation using a deep neural network: Application to patients with ventriculomegaly,” Neuroimage Clin, 2019, doi: 10.1016/j.nicl.2019.101871. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [51].Henschel L et al. , “FastSurfer - A fast and accurate deep learning based neuroimaging pipeline,” Neuroimage, vol. 219, 2020, doi: 10.1016/j.neuroimage.2020.117012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [52].Knapitsch A et al. , “Tanks and temples: Benchmarking large-scale scene reconstruction,” ACM Trans Graph, vol. 36, no. 4, 2017, doi: 10.1145/3072959.3073599. [DOI] [Google Scholar]
  • [53].Knapitsch A et al. , “Tanks and temples: Benchmarking large-scale scene reconstruction,” ACM Trans Graph, vol. 36, no. 4, 2017, doi: 10.1145/3072959.3073599. [DOI] [Google Scholar]
  • [54].blender.org, “blender.org - Home of the Blender project - Free and Open 3D Creation Software,” Blender.Org. [Online]. Available: https://www.blender.org/ [Google Scholar]
  • [55].Ma R et al. , “Real-Time 3D Reconstruction of Colonoscopic Surfaces for Determining Missing Regions,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2019, pp. 573–582. doi: 10.1007/978-3-030-32254-0_64. [DOI] [Google Scholar]
  • [56].Gao X et al. , “LDSO: Direct Sparse Odometry with Loop Closure,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 2198–2204. doi: 10.1109/IROS.2018.8593376. [DOI] [Google Scholar]
  • [57].Lin L et al. , “Unsupervised monocular visual odometry with decoupled camera pose estimation,” Digit Signal Process, vol. 114, p. 103052, 2021, doi: 10.1016/j.dsp.2021.103052. [DOI] [Google Scholar]

Associated Data


Supplementary Materials

R2D2E_AugmentedEndoscopy_VagdargiEtAl
Download video file (13.3MB, mp4)
