Author manuscript; available in PMC: 2024 Apr 30.
Published in final edited form as: Otolaryngol Head Neck Surg. 2023 Mar 8;169(4):988–998. doi: 10.1002/ohn.317

A Self-Configuring Deep Learning Network for Segmentation of Temporal Bone Anatomy in Cone-Beam CT Imaging

Andy S Ding 1,2, Alexander Lu 1,3, Zhaoshuo Li 2, Manish Sahu 2, Deepa Galaiya 1, Jeffrey H Siewerdsen 2,3, Mathias Unberath 2, Russell H Taylor 2, Francis X Creighton 1
PMCID: PMC11060418  NIHMSID: NIHMS1986158  PMID: 36883992

Abstract

Objective.

Preoperative planning for otologic or neurotologic procedures often requires manual segmentation of relevant structures, which can be tedious and time-consuming. Automated methods for segmenting multiple geometrically complex structures can not only streamline preoperative planning but also augment minimally invasive and/or robot-assisted procedures in this space. This study evaluates a state-of-the-art deep learning pipeline for semantic segmentation of temporal bone anatomy.

Study Design.

A descriptive study of a segmentation network.

Setting.

Academic institution.

Methods.

A total of 15 high-resolution cone-beam temporal bone computed tomography (CT) data sets were included in this study. All images were co-registered, with relevant anatomical structures (eg, ossicles, inner ear, facial nerve, chorda tympani, bony labyrinth) manually segmented. Predicted segmentations from no new U-Net (nnU-Net), an open-source 3-dimensional semantic segmentation neural network, were compared against ground-truth segmentations using modified Hausdorff distances (mHD) and Dice scores.

Results.

Fivefold cross-validation of nnU-Net predictions against ground-truth labels yielded the following: malleus (mHD: 0.044 ± 0.024 mm, Dice: 0.914 ± 0.035), incus (mHD: 0.051 ± 0.027 mm, Dice: 0.916 ± 0.034), stapes (mHD: 0.147 ± 0.113 mm, Dice: 0.560 ± 0.106), bony labyrinth (mHD: 0.038 ± 0.031 mm, Dice: 0.952 ± 0.017), and facial nerve (mHD: 0.139 ± 0.072 mm, Dice: 0.862 ± 0.039). Comparison against atlas-based segmentation propagation showed significantly higher Dice scores for all structures (p < .05).

Conclusion.

Using an open-source deep learning pipeline, we demonstrate consistently submillimeter accuracy for semantic CT segmentation of temporal bone anatomy compared to hand-segmented labels. This pipeline has the potential to greatly improve preoperative planning workflows for a variety of otologic and neurotologic procedures and augment existing image guidance and robot-assisted systems for the temporal bone.

Keywords: automated segmentation, deep learning, neural network, temporal bone


The past half-century has seen immense evolution in the landscape of temporal bone surgery as it adapts to and integrates new technological developments. In particular, patient-specific anatomic models can now be used by robotic systems to enforce safety barriers around critical structures or by “augmented reality” systems to enhance visual information available to the surgeon.1–6 Successful use of these innovative approaches requires precise preoperative planning to determine safe operating pathways and visualize surrounding structures. Currently, manual segmentation of relevant anatomic structures on computed tomography (CT) images is the hallmark of such planning. Proper planning for minimally invasive or robotic cochlear implantation requires accurate segmentation of the facial nerve, chorda tympani, and cochlea to delineate a safe drilling or electrode insertion path through the facial recess7,8; similarly, novel robot assistance for vestibular schwannoma resection relies on segmentation of the internal auditory canal (IAC), facial nerve, and bony labyrinth to determine safe approaches for tumor exposure.9 While manual segmentation is often highly accurate, it can be time-consuming and tedious when segmenting multiple structures.

Atlas-based segmentation methods, in which a pre-segmented reference volume is co-registered to a patient CT, have been the traditional technique for automatically segmenting patient anatomy. While our group has previously validated this technique for temporal bone anatomy,10–13 a few key shortcomings preclude its clinical use. First, anatomy-altering pathologies, such as Mondini dysplasia, cholesteatoma, and vestibular schwannoma, are difficult to label with atlas-based methods due to the differences in shape and structure of target anatomy and their counterparts in the labeled atlas. Second, atlas-based methods rely on the accuracy of deformable image registration between the atlas and target scans of interest. Soft tissue structures such as the chorda tympani and the vestibular nerve are highly variable in shape and trajectory and are particularly difficult to register with modern algorithms.

Due to the prohibitive disadvantages of atlas-based segmentation and the increase in computational power of modern computers, there has been rising interest in training deep learning networks to segment temporal bone anatomy.14–21 Deep learning networks for medical imaging segmentation are extensively used in orthopedics,22–24 neurosurgery,25–28 surgical oncology,29–31 and other surgical specialties32,33 but are relatively scarce in otology/neurotology. We, therefore, present an end-to-end deep learning model that accurately segments temporal bone structures with potential applications for surgical planning, surgical skills training, and image guidance.

Methods

This study was approved by the Johns Hopkins Medicine Institutional Review Board. High-resolution temporal bone CT scans were manually segmented to create a set of high-quality labeled data sets for neural network training. The 3-dimensional (3D) segmentation neural network no new U-Net (nnU-Net)34 was then used for generating predicted segmentations.

Creation of Manual Temporal Bone Segmentation Data Sets

A total of 15 deidentified and cropped cone-beam CT scans of the temporal bone were obtained from the Johns Hopkins Department of Otolaryngology-Head and Neck Surgery. The resolution of temporal bone images used was 0.1 mm per voxel length, with image dimensions of 512 × 512 × N voxels, where N refers to the number of axial CT slices for a given image. Scans with prior surgery or foreign bodies were excluded from this study. For labeled-data set creation, 16 anatomical structures, including the ossicles, bony labyrinth, facial nerve, chorda tympani, IAC, and internal carotid artery (ICA), were labeled for each temporal bone using the open-source software 3D Slicer (Table 1). Manual segmentations10 were carried out by 2 labelers with experience in temporal bone anatomy and were verified by the senior author. Since soft tissue structures are difficult to differentiate on CT, we implemented strategies to minimize segmentation variability in this study. In particular, the facial nerve was segmented from the labyrinthine segment to the distal end of the stylomastoid segment, while the chorda tympani was segmented along its course in the mastoid process until it enters the middle ear. Furthermore, the inferior and superior vestibular nerves were segmented from their branchpoint medial to the IAC.

Table 1.

Dice Scores and Modified Hausdorff Distances Calculated Between Predicted Labels and Ground Truth for All Anatomical Structures Segmented

| Structure | Dice score, mean | Dice score, SD | Modified Hausdorff, mean (mm) | Modified Hausdorff, SD (mm) |
| --- | --- | --- | --- | --- |
| Bone | 0.955 | 0.012 | 0.055 | 0.023 |
| Malleus | 0.914 | 0.035 | 0.044 | 0.024 |
| Incus | 0.916 | 0.034 | 0.051 | 0.027 |
| Stapes | 0.560 | 0.106 | 0.147 | 0.113 |
| Bony labyrinth | 0.952 | 0.017 | 0.038 | 0.031 |
| Internal auditory canal | 0.908 | 0.584 | 0.195 | 0.168 |
| Superior vestibular nerve | 0.584 | 0.471 | 0.326 | 0.072 |
| Inferior vestibular nerve | 0.471 | 0.200 | 1.222 | 2.458 |
| Cochlear nerve | 0.759 | 0.102 | 0.236 | 0.168 |
| Facial nerve | 0.862 | 0.039 | 0.139 | 0.072 |
| Chorda tympani | 0.608 | 0.210 | 0.877 | 1.512 |
| Internal carotid artery | 0.919 | 0.028 | 0.236 | 0.094 |
| Sigmoid sinus + dura | 0.800 | 0.032 | 0.817 | 0.305 |
| Vestibular aqueduct | 0.678 | 0.091 | 0.269 | 0.165 |
| Mandible | 0.939 | 0.042 | 0.087 | 0.105 |
| External auditory canal | 0.838 | 0.030 | 0.761 | 0.218 |

Abbreviation: SD, standard deviation.

nnU-Net Training and Validation

nnU-Net34 was selected for evaluation in this study due to its ability to automatically self-configure to new data sets. This framework was built upon U-Net,35 a convolutional network architecture that has revolutionized biomedical image segmentation. The nnU-Net repository was installed onto an NVIDIA GeForce RTX 3090 graphics processing unit (GPU) workstation running the Ubuntu 18.04 LTS operating system. Labeled data sets in this study were rigidly co-registered to standardize geometry, dimensions, and coordinate spaces. Five training folds were planned with an 80 to 20 train-validation split for each fold. Validation data sets did not overlap between folds to decrease bias and data leakage in validating this pipeline. For each fold, nnU-Net was trained for 300 epochs with a compound loss function36 consisting of cross entropy and Dice loss. Rather than using an entire image data set as a training instance, patches were randomly sampled from data sets for training nnU-Net. After nnU-Net model training, predicted segmentations were generated for the validation data sets in each fold. Inference on validation data sets utilized a sliding-window approach, in which the window size is equivalent to the patch size used for training. Predicted patches were then stitched together, and stitching artifacts were addressed with Gaussian importance weighting.37
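The sliding-window stitching step can be illustrated with a minimal one-dimensional sketch. It is not the nnU-Net implementation: the function names, the step size, and the choice of a Gaussian width equal to one eighth of the patch size are assumptions made for illustration, and `predict_patch` stands in for a trained network. Each window's prediction is weighted most heavily at its center, so values near patch edges contribute little where windows overlap.

```python
import math


def gaussian_weights(patch_size, sigma_scale=0.125):
    """Weights that peak at the patch centre and fall off toward its edges.

    sigma_scale is an assumed value chosen for illustration.
    """
    centre = (patch_size - 1) / 2.0
    sigma = patch_size * sigma_scale
    return [math.exp(-((i - centre) ** 2) / (2.0 * sigma ** 2))
            for i in range(patch_size)]


def window_starts(length, patch_size, step):
    """Start indices of windows that together cover the whole signal."""
    if length <= patch_size:
        return [0]
    starts = list(range(0, length - patch_size + 1, step))
    if starts[-1] != length - patch_size:
        starts.append(length - patch_size)  # final window sits flush with the end
    return starts


def sliding_window_predict(signal, predict_patch, patch_size, step):
    """Blend overlapping patch predictions using Gaussian importance weights."""
    weights = gaussian_weights(patch_size)
    weighted_sum = [0.0] * len(signal)
    weight_total = [0.0] * len(signal)
    for start in window_starts(len(signal), patch_size, step):
        prediction = predict_patch(signal[start:start + patch_size])
        for offset, value in enumerate(prediction):
            weighted_sum[start + offset] += weights[offset] * value
            weight_total[start + offset] += weights[offset]
    # Normalise by the accumulated weight so overlapping regions are averaged.
    return [s / w for s, w in zip(weighted_sum, weight_total)]
```

In three dimensions the same idea applies with a 3D Gaussian weight volume and a per-class probability map in place of the scalar values used here.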

Predicted Segmentation Accuracy Metrics

Predicted segmentations were evaluated against hand-segmented ground-truth labels using a 5-fold cross-validation method. Dice scores38 were calculated to measure the degree of overlap between predictions and ground truth. A Dice score of 0 corresponds to no overlap, while a score of 1 corresponds to complete overlap. Modified Hausdorff distances (mHDs)39 were also calculated to capture the shape-wise spatial error between predictions and ground-truth segments. An mHD of 0 mm corresponds to complete overlap while increasing distance corresponds to increased error. These accuracy metrics were averaged across all 5 folds to produce final metrics for the prediction of each anatomical structure. Accuracy metrics for nnU-Net were then compared against those using atlas-based segmentation propagation10 on the same data sets using multiple Mann-Whitney tests with a Holm-Šídák adjusted p value of .05.
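As a minimal sketch of the two metrics described above, the following assumes each segmentation is represented as a set of integer voxel coordinates and uses the 0.1 mm isotropic voxel spacing reported for these scans. The function names are illustrative and are not taken from the study's code. Distances are computed over all labeled voxels; an implementation restricted to surface voxels would be an equally plausible reading of the metric.

```python
import math


def dice_score(pred, truth):
    """Dice overlap between two sets of voxel coordinates (0 = none, 1 = complete)."""
    if not pred and not truth:
        return 1.0
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


def mean_nearest_distance(source, target, spacing):
    """Average distance (mm) from each source voxel to its nearest target voxel."""
    total = 0.0
    for p in source:
        total += min(
            math.sqrt(sum(((a - b) * s) ** 2 for a, b, s in zip(p, q, spacing)))
            for q in target
        )
    return total / len(source)


def modified_hausdorff(pred, truth, spacing=(0.1, 0.1, 0.1)):
    """Modified Hausdorff distance: the larger of the two mean directed distances."""
    if not pred or not truth:
        raise ValueError("both segmentations must contain at least one voxel")
    return max(mean_nearest_distance(pred, truth, spacing),
               mean_nearest_distance(truth, pred, spacing))


# Hypothetical example: a 4-voxel structure and a prediction shifted by one voxel.
truth = {(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)}
shifted = {(1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0)}
print(dice_score(shifted, truth))          # 0.75
print(modified_hausdorff(shifted, truth))  # approximately 0.025 mm
```

The brute-force nearest-neighbor search is quadratic in the number of voxels, so a distance transform or spatial index would be preferable for full-resolution volumes.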

Results

Qualitative Analysis

Representative predicted segmentations compared to ground-truth manual segmentations are shown in Figure 1. Natural anatomical borders between structures were preserved with high fidelity. Discrepancies between predicted and ground-truth segmentations occurred primarily at free ends of structures (eg, inferior extent of the mastoid facial nerve and petrous carotid artery, arbitrary superior border of the skull base dura, lateral extent of the external auditory canal). Well-enclosed structures such as ossicles and bony labyrinths were segmented without areas of discontinuity.

Figure 1.


Visual comparison between ground-truth (right) and predicted (left) segmentations from a representative data set. (A) Lateral view. (B) View of the facial recess. (C) View of the middle cranial fossa. Segment IDs are presented for clarity.

Descriptive Statistics of Accuracy

mHDs and Dice scores for all segments are listed in Table 1. Since the stapes was absent in 1 data set, metrics for the stapes were averaged among 14 data sets, while metrics for other structures were averaged among all 15 data sets. Predicted labels for all structures except the inferior vestibular nerve demonstrated submillimeter mHDs. We found that a single data set contributed a disproportionately high mHD of 9.95 mm for the inferior vestibular nerve. To protect against outliers, we calculated 90% Winsorized means40 for accuracy metrics (Table 2), which replace values less than the 5th percentile and greater than the 95th percentile with the closest valid value. This analysis showed Dice scores similar to the true mean Dice scores and submillimeter mHDs for all structures.
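The Winsorizing step can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the percentile convention (linear interpolation between ranked values) and the reading of "closest valid value" as the nearest observation inside the 5th to 95th percentile range are both assumptions, and the function names are hypothetical.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of already-sorted data."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def winsorized_mean(values, lower_q=5.0, upper_q=95.0):
    """Mean after replacing tail values with the nearest value inside the cut-offs."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    low_cut = percentile(ordered, lower_q)
    high_cut = percentile(ordered, upper_q)
    valid = [v for v in ordered if low_cut <= v <= high_cut]
    if not valid:  # only possible for very small samples
        valid = [low_cut, high_cut]
    floor, ceiling = valid[0], valid[-1]
    clipped = [min(max(v, floor), ceiling) for v in values]
    return sum(clipped) / len(clipped)


# Hypothetical example: 14 typical values and a single extreme outlier.
sample = [1.0] * 14 + [100.0]
print(sum(sample) / len(sample))  # 7.6, dominated by the outlier
print(winsorized_mean(sample))    # 1.0, the outlier is pulled back to the nearest valid value
```

With 15 data sets, this convention replaces only the single smallest and single largest observation, which matches the behavior described for the inferior vestibular nerve outlier.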

Table 2.

Ninety Percent Winsorized Mean Dice Scores and Modified Hausdorff Distances Calculated Between Predicted Labels and Ground Truth for All Anatomical Structures

| Structure | Dice score, mean | Dice score, SD | Modified Hausdorff, mean (mm) | Modified Hausdorff, SD (mm) |
| --- | --- | --- | --- | --- |
| Bone | 0.956 | 0.011 | 0.054 | 0.022 |
| Malleus | 0.917 | 0.028 | 0.042 | 0.019 |
| Incus | 0.918 | 0.028 | 0.050 | 0.025 |
| Stapes | 0.562 | 0.095 | 0.134 | 0.071 |
| Bony labyrinth | 0.953 | 0.014 | 0.035 | 0.019 |
| Internal auditory canal | 0.908 | 0.032 | 0.191 | 0.086 |
| Superior vestibular nerve | 0.583 | 0.139 | 0.323 | 0.164 |
| Inferior vestibular nerve | 0.471 | 0.196 | 0.838 | 1.022 |
| Cochlear nerve | 0.763 | 0.091 | 0.226 | 0.138 |
| Facial nerve | 0.861 | 0.034 | 0.137 | 0.060 |
| Chorda tympani | 0.623 | 0.162 | 0.777 | 1.204 |
| Internal carotid artery | 0.919 | 0.027 | 0.236 | 0.091 |
| Sigmoid sinus + dura | 0.801 | 0.031 | 0.805 | 0.259 |
| Vestibular aqueduct | 0.680 | 0.084 | 0.258 | 0.133 |
| Mandible | 0.943 | 0.030 | 0.074 | 0.056 |
| External auditory canal | 0.838 | 0.029 | 0.765 | 0.209 |

Abbreviation: SD, standard deviation.

Accuracy Comparison to Labeled-Atlas Segmentation Propagation

Since segmentation propagation is not feasible for data sets with abnormal anatomy, the same data set that was excluded from descriptive statistics analysis was also excluded from the accuracy comparison between nnU-Net and segmentation propagation. Among the remaining 14 data sets, 1 data set was chosen for creating the labeled atlas for segmentation propagation, leaving 13 data sets for final comparison. nnU-Net consistently performed with higher Dice scores for all structures and with comparable to lower mHDs for all structures (Table 3). Using a threshold of 0.75, Dice scores from nnU-Net were sufficiently high in 11 of 16 structures compared to 6 structures from segmentation propagation (Figure 2A). Furthermore, nnU-Net predictions exhibited submillimeter mHDs against ground-truth segmentations for 15 of 16 structures compared to 13 of 16 structures from segmentation propagation (Figure 2B).
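The threshold counts reported above can be reproduced directly from the mean values in Table 3. The short sketch below does only that; the variable and function names are illustrative, and treating the Dice threshold as inclusive (at least 0.75) and the distance threshold as strict (below 1 mm) is an assumption that does not affect the counts for these values.

```python
# (structure, nnU-Net Dice, Seg Prop Dice, nnU-Net mHD in mm, Seg Prop mHD in mm),
# mean values copied from Table 3.
TABLE_3_MEANS = [
    ("Bone", 0.954, 0.762, 0.055, 0.545),
    ("Malleus", 0.916, 0.821, 0.043, 0.124),
    ("Incus", 0.923, 0.830, 0.048, 0.125),
    ("Stapes", 0.555, 0.329, 0.153, 0.243),
    ("Bony labyrinth", 0.951, 0.840, 0.038, 0.216),
    ("Internal auditory canal", 0.906, 0.774, 0.198, 0.671),
    ("Superior vestibular nerve", 0.603, 0.337, 0.297, 0.509),
    ("Inferior vestibular nerve", 0.484, 0.060, 1.335, 0.832),
    ("Cochlear nerve", 0.776, 0.552, 0.220, 0.489),
    ("Facial nerve", 0.860, 0.600, 0.139, 0.495),
    ("Chorda tympani", 0.592, 0.073, 0.995, 2.232),
    ("Internal carotid artery", 0.922, 0.716, 0.227, 0.834),
    ("Sigmoid sinus + dura", 0.796, 0.628, 0.810, 1.324),
    ("Vestibular aqueduct", 0.678, 0.312, 0.280, 0.757),
    ("Mandible", 0.938, 0.609, 0.090, 1.291),
    ("External auditory canal", 0.843, 0.794, 0.724, 0.913),
]

DICE_THRESHOLD = 0.75
MHD_THRESHOLD_MM = 1.0


def count_meeting(rows, column, passes):
    """Count the structures whose value in the given column satisfies the test."""
    return sum(1 for row in rows if passes(row[column]))


dice_nnunet = count_meeting(TABLE_3_MEANS, 1, lambda d: d >= DICE_THRESHOLD)
dice_segprop = count_meeting(TABLE_3_MEANS, 2, lambda d: d >= DICE_THRESHOLD)
mhd_nnunet = count_meeting(TABLE_3_MEANS, 3, lambda m: m < MHD_THRESHOLD_MM)
mhd_segprop = count_meeting(TABLE_3_MEANS, 4, lambda m: m < MHD_THRESHOLD_MM)

print(dice_nnunet, dice_segprop)  # 11 6
print(mhd_nnunet, mhd_segprop)    # 15 13
```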

Table 3.

Comparison of Mean Dice Scores and Modified Hausdorff Distances Between nnU-Net and Seg Prop Predictions for All Anatomical Structures

| Structure | Dice score, nnU-Net | Dice score, Seg Prop | Dice score, adjusted p | Modified Hausdorff (mm), nnU-Net | Modified Hausdorff (mm), Seg Prop | Modified Hausdorff, adjusted p |
| --- | --- | --- | --- | --- | --- | --- |
| Bone | 0.954 (±0.013) | 0.762 (±0.025) | .000003 | 0.055 (±0.024) | 0.545 (±0.274) | .000003 |
| Malleus | 0.916 (±0.036) | 0.821 (±0.074) | .000187 | 0.043 (±0.025) | 0.124 (±0.081) | .000267 |
| Incus | 0.923 (±0.026) | 0.830 (±0.054) | .000004 | 0.048 (±0.028) | 0.125 (±0.062) | .000875 |
| Stapes | 0.555 (±0.108) | 0.329 (±0.123) | .000521 | 0.153 (±0.115) | 0.243 (±0.184) | .047045 |
| Bony labyrinth | 0.951 (±0.016) | 0.840 (±0.044) | .000003 | 0.038 (±0.031) | 0.216 (±0.134) | .00002 |
| Internal auditory canal | 0.906 (±0.035) | 0.774 (±0.086) | .000187 | 0.198 (±0.104) | 0.671 (±0.494) | .000205 |
| Superior vestibular nerve | 0.603 (±0.145) | 0.337 (±0.164) | .001152 | 0.297 (±0.163) | 0.509 (±0.288) | .048423 |
| Inferior vestibular nerve | 0.484 (±0.212) | 0.060 (±0.087) | .000069 | 1.335 (±2.633) | 0.832 (±0.463) | .448301 |
| Cochlear nerve | 0.776 (±0.097) | 0.552 (±0.139) | .001152 | 0.220 (±0.171) | 0.489 (±0.261) | .020734 |
| Facial nerve | 0.860 (±0.032) | 0.600 (±0.177) | .000003 | 0.139 (±0.066) | 0.495 (±0.442) | .000104 |
| Chorda tympani | 0.592 (±0.221) | 0.073 (±0.122) | .000187 | 0.995 (±1.598) | 2.232 (±1.464) | .030008 |
| Internal carotid artery | 0.922 (±0.025) | 0.716 (±0.142) | .000021 | 0.227 (±0.084) | 0.834 (±0.486) | .000047 |
| Sigmoid sinus + dura | 0.796 (±0.031) | 0.628 (±0.061) | .000003 | 0.810 (±0.315) | 1.324 (±0.485) | .003938 |
| Vestibular aqueduct | 0.678 (±0.090) | 0.312 (±0.148) | .000003 | 0.280 (±0.170) | 0.757 (±0.557) | .003938 |
| Mandible | 0.938 (±0.045) | 0.609 (±0.176) | .000004 | 0.090 (±0.113) | 1.291 (±1.077) | .00002 |
| External auditory canal | 0.843 (±0.028) | 0.794 (±0.062) | .02215 | 0.724 (±0.209) | 0.913 (±0.355) | .258453 |

Abbreviations: nnU-Net, no new U-Net; Seg Prop, segmentation propagation.

Figure 2.


No new U-Net (nnU-Net) and segmentation propagation (Seg Prop) predictions. (A) Dice scores with an accuracy threshold (dotted line) of 0.75. (B) Modified Hausdorff distances with an accuracy threshold (dotted line) of 1 mm. p values: *<.05; **<.01; ***<.001; ****<.0001.

Segmentation of smaller and/or thinner structures, in particular the stapes, superior and inferior vestibular nerves, and vestibular aqueduct, exhibited lower Dice scores compared to larger and/or thicker structures, a phenomenon that was also observed in segmentation propagation.10 Constrained linear regressions (y-intercept = 1.0) between Dice scores and mHDs demonstrated that Dice scores for small or thin structures decreased significantly more (p < .0001) with respect to mHDs (slope: −0.5354; 95% confidence interval: −0.7995 to −0.2713) than those for large or thick structures (slope: −0.2513; 95% confidence interval: −0.3448 to −0.1579) (Figure 3).

Figure 3.


Accuracy comparisons for smaller/thinner (unshaded) versus larger/thicker (shaded) structures. Constrained linear regressions show the effects of modified Hausdorff distance on Dice score.
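For a regression line forced through a fixed y-intercept, the least-squares slope has a closed form: minimizing the sum of squared residuals of y = b + m·x over m gives m = Σ x(y − b) / Σ x². The sketch below implements this with b = 1.0, the intercept used above; the function name and the sample points are hypothetical, and the study's own fits used per-data-set values that are not reproduced here.

```python
def constrained_slope(x_values, y_values, intercept=1.0):
    """Least-squares slope of a line forced through (0, intercept)."""
    if len(x_values) != len(y_values):
        raise ValueError("x_values and y_values must have the same length")
    numerator = sum(x * (y - intercept) for x, y in zip(x_values, y_values))
    denominator = sum(x * x for x in x_values)
    if denominator == 0:
        raise ValueError("at least one x value must be non-zero")
    return numerator / denominator


# Hypothetical points lying exactly on Dice = 1.0 - 0.5 * mHD.
mhd_mm = [0.1, 0.2, 0.4]
dice = [0.95, 0.90, 0.80]
print(constrained_slope(mhd_mm, dice))  # approximately -0.5
```

Fixing the intercept at 1.0 encodes the fact that a prediction with zero distance error overlaps the ground truth completely, so the slope alone describes how quickly the Dice score falls as spatial error grows.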

Discussion

Due to the innately compact nature of the temporal bone, advances in pre- and intraoperative techniques have arisen to better visualize and avoid critical structures encased in bone. Preoperative planning pipelines using 3D model visualization software,19,41 virtual reality,42 and physically printed temporal bone models43 for complex cases have shown improved visualization of patient-specific anatomy with a high degree of fidelity. These planning systems are also fundamental for minimally invasive procedures using multiport strategies44 or keyhole approaches45 to reach target anatomy without violating critical structures. Furthermore, image-guided navigation46 and robotic surgical systems1,41,47–52 have continued to shape the landscape of otology and neurotology by providing intraoperative methods for vital anatomy detection.

Crucial to these developing technologies is the presence of robust anatomical data, which typically involves hand-segmenting relevant anatomy on CT imaging. This requisite may impede workflow efficiency when multiple structures must be segmented with high precision and accuracy. To address this issue, we have presented a deep learning-based system using nnU-Net to segment a comprehensive set of critical structures accurately and reliably within the temporal bone. nnU-Net is a deep learning framework built upon the U-Net architecture,35 a widely used model for labeling objects and structures in image data sets. To accomplish this complex task, U-Net is divided into 2 main phases. The first phase, called the “encoding phase,” iteratively extracts key patterns and characteristics of an input image and encodes them as a compact, abstract map of where each segment resides. The second phase, called the “decoding phase,” then upscales this abstract map back to the original size of the input image while simultaneously learning the shape of each segment to output a final prediction. This concept of downsampling and encoding an image, then upsampling and segmenting relevant structures, is akin to the shape of a “U,” after which this architecture was named.

A key drawback of deep learning systems for segmentation is the frequent need for manual refinement of neural network settings, termed hyperparameter tuning, to maximize performance. As a result, applying the same deep learning architecture to different data sets often requires different hyperparameters to achieve optimal performance. nnU-Net, on the other hand, systematizes and effectively automates hyperparameter tuning to find high-quality configurations for a variety of data sets.34 Furthermore, nnU-Net is a patch-based algorithm in which random patches of data are used for training, which produces a sufficiently large number of training instances for deep learning despite a limited sample size of 15 data sets. This generalizability of nnU-Net is particularly well suited to medical imaging data sets, which are both heterogeneous and data-dense.

Accuracy analyses of this nnU-Net pipeline have shown submillimeter accuracy for most of the structures included in this study. By using Winsorized means to protect against outliers with a 2-standard deviation threshold, nnU-Net performs with submillimeter accuracy for segmenting all included structures. Predictably, Dice scores for smaller or thinner structures were significantly lower than those for larger or thicker structures. This disproportionate effect on the Dice score relative to small spatial perturbations for smaller structures aligns with other observations in the literature.38 Interestingly, Winsorizing the data presented in this study had no significant effect on Dice score measurements but had a notable effect on mHDs, particularly for the inferior vestibular nerve. Upon further analysis of predicted outputs, we found 1 outlier data set in which 1 portion of the inferior vestibular nerve was properly labeled while a second separate region of the temporal bone was improperly labeled. The resultant 2 “islands” that were predicted to be the inferior vestibular nerve, therefore, resulted in a disproportionately high mHD of 9.95 mm with a reasonable Dice score of 0.459. This example highlights the lack of context-specific spatial information in Dice score calculations, particularly for predictions consisting of islands (Figure 4).

Figure 4.


Example of Dice score insensitivity to shape variation and islands. While example (A) produces an inferior vestibular nerve prediction with a distant island and example (B) does not, they maintain similar Dice scores but significantly different modified Hausdorff distances (mHDs).
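Stray islands of this kind can be detected, and optionally removed, with connected-component analysis. The sketch below is illustrative only: no such post-processing was applied to the predictions reported in this study, the function names are hypothetical, and face connectivity (6 neighbors per voxel) is an assumed choice. A label whose prediction splits into more than one component is a candidate for review.

```python
from collections import deque

# Offsets to the 6 face-adjacent neighbours of a voxel (assumed connectivity).
NEIGHBOUR_OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def connected_components(voxels):
    """Group voxel coordinates into face-connected components, largest first."""
    remaining = set(voxels)
    components = []
    while remaining:
        seed = remaining.pop()
        component = {seed}
        queue = deque([seed])
        while queue:  # breadth-first flood fill from the seed voxel
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBOUR_OFFSETS:
                candidate = (x + dx, y + dy, z + dz)
                if candidate in remaining:
                    remaining.remove(candidate)
                    component.add(candidate)
                    queue.append(candidate)
        components.append(component)
    return sorted(components, key=len, reverse=True)


def keep_largest_component(voxels):
    """Discard every island except the largest connected component."""
    components = connected_components(voxels)
    return components[0] if components else set()


# Hypothetical prediction: a 5-voxel structure plus a distant 2-voxel island.
structure = {(i, 0, 0) for i in range(5)}
stray = {(20, 0, 0), (21, 0, 0)}
print(len(connected_components(structure | stray)))         # 2
print(keep_largest_component(structure | stray) == structure)  # True
```

Keeping only the largest component is a blunt rule: it would wrongly discard part of a structure that is genuinely discontinuous on imaging, so the component count is often more useful as a warning flag than as an automatic filter.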

Prior studies have also successfully developed deep learning models for temporal bone segmentation (Table 4). Neves et al compared the performance of the anisotropic hybrid network, U-Net, and residual neural network on segmenting the inner ear, ossicles, facial nerve, and sigmoid sinus with high Dice scores (0.75 or higher) for all structures.14 Lv et al and Wang et al implemented W-Net for facial nerve, ossicle, and bony labyrinth segmentation with similar reported Dice scores.15,16 Nikan et al developed PWD-3DNet, a patch-wise network adapted from the DenseVNet architecture, with post-processing in 3D Slicer to remove small islands.17 This network outperformed nnU-Net in stapes segmentation but underperformed in all other structures. Wang et al developed a 3D V-Net neural network to segment the middle ear ossicles, resulting in superior segmentation performance compared to existing neural networks in the literature.18 To address the computational demand of 3D U-Net, Fauser et al developed an ensemble method of 2D U-Net predictions from axial, coronal, and sagittal slices, followed by probabilistic active shape modeling from a labeled atlas to ensure anatomically accurate labels.19,53 This method, however, is not fully automated, since predictions must be manually input to an active shape modeling algorithm for final predictions. Furthermore, while this method ensures the removal of aberrant islands from its final output, its use of a labeled atlas may have similar limitations to solely atlas-based segmentation techniques, namely its vulnerability to anatomical variants or abnormalities. Li et al also developed an integrated U-Net and transformer-based model called UNETR for cochlear segmentation and achieved an average Dice score of 0.92.20 Finally, Dong et al presented a dedicated neural network for facial nerve segmentation called FNSegNet, which uses a 2D model as its backbone and a module to expand the network’s receptive field for small objects.21 This dedicated segmentation network achieved comparable results to nnU-Net with an average Dice score of 0.858.

Table 4.

Comparison of Mean Dice Scores Between nnU-Net and Other Neural Networks

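| Structure | nnU-Net34 | AH-Net14 | W-Net15,16 | PWD-3DNet17 | 3D V-Net18 | 2D U-Net + ASM19 | UNETR20 | FNSegNet21 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bone | 0.955 | - | - | - | - | - | - | - |
| Malleus | 0.914 | - | - | 0.84 | 0.92 | - | - | - |
| Incus | 0.916 | - | - | 0.85 | 0.925 | - | - | - |
| Stapes | 0.560 | - | - | 0.77 | 0.835 | - | - | - |
| Ossicles | 0.912 | - | 0.85 | 0.844a | 0.921a | 0.832 | - | - |
| Bony labyrinth | 0.952 | 0.91 | 0.91 | 0.9 | - | 0.871 | 0.92 | - |
| Internal auditory canal | 0.908 | - | - | 0.89 | - | 0.863 | - | - |
| Superior vestibular nerve | 0.584 | - | - | - | - | - | - | - |
| Inferior vestibular nerve | 0.471 | - | - | - | - | - | - | - |
| Cochlear nerve | 0.759 | - | - | - | - | - | - | - |
| Facial nerve | 0.862 | 0.75 | 0.703 | 0.74 | - | 0.692 | - | 0.858 |
| Chorda tympani | 0.608 | - | - | - | - | 0.545 | - | - |
| Internal carotid artery | 0.919 | - | - | 0.81 | - | 0.899 | - | - |
| Sigmoid sinus + dura | 0.800 | 0.86b | - | - | 0.86b | 0.738b | - | - |
| Vestibular aqueduct | 0.678 | - | - | - | - | - | - | - |
| Mandible | 0.939 | - | - | - | - | - | - | - |
| External auditory canal | 0.838 | - | - | - | - | 0.766 | - | - |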

Top Dice scores are bolded for structures predicted by multiple neural networks in the literature.

Abbreviations: AH-Net, anisotropic hybrid network; FNSegNet, facial nerve segmentation network; nnU-Net, no new U-Net; PWD-3DNet, patch-wise densely connected 3-dimensional network; UNETR, U-Net and transformer.

a Dice score estimated by a volumetric weighted average of the malleus, incus, and stapes.

b Dice scores reported only for the sigmoid sinus or jugular bulb.

With respect to existing neural networks, nnU-Net was consistently the top or second-best performing model for segmenting temporal bone structures. Moreover, data sets included in this study contained significantly more anatomical labels compared to others in the literature, rendering our trained nnU-Net model as the most comprehensive segmentation network for the temporal bone. It should be noted that these deep learning models were not compared using the same data sets, and therefore any comparisons between these models have not been formally validated.

One limitation of nnU-Net is its requirement of rigidly co-registered images to ensure consistent geometry and voxel spacing. While this standardizes segmentation and aids in self-configuration for nnU-Net, co-registration invariably results in loss of information near image edges. For structures that have been cropped at the edges (eg, ICA, sigmoid sinus + dura, mandible), co-registration may negatively affect the accuracy metrics presented in this study. A second limitation is the significant computational resources and time necessary for nnU-Net training. At a minimum, nnU-Net requires 8 gigabytes (GB) of video random access memory (VRAM). Even with a top-of-the-line 24 GB VRAM GPU, training per fold took 27.5 hours, while 5-fold cross-validation took almost 6 days to complete. Although training takes a significant amount of time up-front, labeling new images takes minutes. Finally, this study has a small sample size of 15 data sets, which poses a risk of overtraining and decreasing generalizability to other data sets. However, applying our trained nnU-Net model on unlabeled data sets can bootstrap manual segmentation and expand our set of labeled images for further training. Despite these limitations, nnU-Net represents a state-of-the-art neural network framework that is more easily applied and adapted to new data sets than other deep learning models.

While preoperative planning with labeled CT imaging has been integral for certain complex procedures, intraoperative image guidance in temporal bone surgery has not been widely adopted, mainly due to the requirement of exceptionally low target registration errors for safe temporal bone surgery. Given the density of complex structures, temporal bone surgery is thought to require a 3-standard deviation error of ≤0.5 mm and preoperative imaging slice thickness of ≤0.2 mm.54 These requirements for intraoperative image guidance in the temporal bone are markedly stricter than in sinus surgery, which can tolerate errors above 1 mm.55,56 Wider adoption of anatomy-aware image guidance in the form of robotic or free-hand navigation probes thus depends on the ability to (1) consistently register preoperative CTs and (2) efficiently segment anatomy such that the combined errors of these 2 tasks are sufficiently low. This study has demonstrated a vital step toward the latter by generating accurate 3D segmentations of relevant anatomy. We believe that future improvements in this method, together with improved CT image registration methods, will enable the meaningful use of robotic and image-guided temporal bone surgery.

We recognize that image guidance does not necessarily improve outcomes in otolaryngological cases. For sinus surgery, in which intraoperative navigation is used extensively, clinical outcomes with image guidance were noninferior to conventional surgery, though its cost-effectiveness is equivocal.57–59 At the very least, these systems can be used to augment surgical training and better visualize patient-specific anatomical landmarks. With the current dearth of literature surrounding the cost-effectiveness and clinical outcomes of these technologies for temporal bone surgery, further investigation into their clinical and didactic utility is necessary.

Conclusion

Within the landscape of powerful deep learning platforms, nnU-Net exemplifies a paradigm shift in neural network models by streamlining hyperparameter tuning and optimization. This automated end-to-end pipeline for segmenting temporal bone anatomy has the potential to augment robot-assisted, image-guided, and/or preoperative planning systems for comprehensive visualization of relevant anatomical structures. Given the competitive accuracy of nnU-Net compared to existing neural networks described in the literature, we believe that this study presents an advancement toward comprehensive anatomical segmentation of temporal bone imaging. Future work will focus on refining the accuracy of this model with novel evaluation methods during training, as well as on evaluating the generalizability of this model to novel data sets.

Funding source:

This study was sponsored under a K08 Grant (NIDCD 5K08DC019708-02) awarded to Francis X. Creighton. Funding and equipment support was provided by a contract between Galen Robotics and Johns Hopkins University.

Footnotes

This article was presented at the AAO-HNSF 2022 Annual Meeting & OTO Experience; September 10–14, 2022; Philadelphia, Pennsylvania.

Competing interests: Under a license agreement between Galen Robotics Inc and Johns Hopkins University, Russell H. Taylor and the University are entitled to royalty distributions on technology related to the technology described in the study discussed in this publication. Russell H. Taylor also is a paid consultant to and owns equity in Galen Robotics Inc. This arrangement has been reviewed and approved by Johns Hopkins University in accordance with its conflict-of-interest policies.

