The Journal of International Advanced Otology
2023 Sep 1;19(5):360–367. doi: 10.5152/iao.2023.231073

A Deep Learning Algorithm to Identify Anatomical Landmarks on Computed Tomography of the Temporal Bone

Zubair Hasan 1,2, Seraphina Key 3, Michael Lee 1, Fiona Chen 4, Layal Aweidah 2, Aaron Esmaili 5, Raymond Sacks 1,6, Narinder Singh 1,2
PMCID: PMC10645193  PMID: 37789621

Abstract

Background:

Petrous temporal bone cone-beam computed tomography scans aid diagnosis and the accurate identification of key operative landmarks in temporal bone and mastoid surgery. Our primary objective was to determine the accuracy of a deep learning convolutional neural network algorithm in augmenting the identification of structures on petrous temporal bone cone-beam computed tomography. Our secondary objective was to compare the accuracy of convolutional neural network structure identification when trained by a senior versus a junior clinician.

Methods:

A total of 129 petrous temporal bone cone-beam computed tomography scans were obtained from an Australian public tertiary hospital. Key intraoperative landmarks were labeled in 68 scans using bounding boxes on axial and coronal slices at the level of the malleoincudal joint by an otolaryngology registrar and a board-certified otolaryngologist. Automated structure identification was performed on axial and coronal slices of the remaining 61 scans using a convolutional neural network (Microsoft Custom Vision) trained on the labeled dataset. Convolutional neural network structure identification accuracy was manually verified by an otolaryngologist, and accuracy when trained on the registrar-labeled versus otolaryngologist-labeled datasets was compared.

Results:

The convolutional neural network performed automated structure identification in petrous temporal bone cone-beam computed tomography scans with a high degree of accuracy in both axial (area under the curve 0.958) and coronal (0.924) slices (P < .001). Convolutional neural network accuracy was proportionate to the seniority of the training clinician for structures with features more difficult to distinguish on single slices, such as the cochlea, vestibule, and carotid canal.

Conclusion:

Convolutional neural networks can perform automated structure identification in petrous temporal bone cone-beam computed tomography scans with a high degree of accuracy, with the performance being proportionate to the seniority of the training clinician. Training of the convolutional neural network by the most senior clinician is desirable to maximize the accuracy of the results.

Keywords: Otolaryngology, radiographic image interpretation, artificial intelligence, neural networks

Introduction

Surgery of the temporal bone and mastoid is often required for the clearance of chronic ear diseases such as chronic suppurative otitis media or cholesteatoma, for complications of acute mastoiditis, and prior to implanting hearing devices such as cochlear implants.1 Between July 2017 and June 2022, 4356 patients underwent mastoidectomy in Australia.2 Recognizing intraoperative landmarks such as the mastoid air cells, sigmoid sinus, tegmen, facial nerve, and ossicular chain is critical to avoiding complications of surgery such as hearing loss, vertigo, facial paralysis, or cerebrospinal fluid (CSF) leak.1,3-5 As a result, accurate identification of these landmarks is a crucial requirement for radiologists and otolaryngologists.3,6

Petrous temporal bone (PTB) cone-beam computed tomography (CBCT) is performed to aid pre-operative diagnosis and identification of landmarks, with 262 364 PTB CBCT scans performed in Australia between July 2017 and June 2022.2 Conventional interpretation of CT imaging is subject to several human factors, such as clinical experience, fatigue, and time pressures.7 Artificial intelligence (AI) may augment pre-operative identification through automated labeling and identification of important anatomical structures and landmarks on CT imaging in a consistent and reproducible manner, and may also be utilized as a valuable teaching instrument.

There is expanding interest in utilizing deep learning in radiological analysis given the data-centric nature of radiologic imaging. Deep learning may facilitate identification of these structures and has previously been utilized for structure and region identification on chest x-rays.8 Convolutional neural networks (CNNs) are a promising technique that can be applied to a range of computer vision tasks. Advances in hardware and software over the past 2 to 3 decades have expanded the role of CNNs, which have been applied increasingly to real-world computer-vision problems, including in medical imaging.9 The utility of deep learning in temporal bone imaging to date is sparse, although applications in other domains of radiology such as chest x-rays (CXR) and CT are better developed.10-14 CNN studies in radiology typically approach image analysis as a classification-based problem,6 although object detection by bounding boxes is also a well-established technique, popularized in facial detection and self-driving cars.15,16 In object detection by bounding boxes, a training dataset “trains” the algorithm to localize pre-labeled named objects, each carrying a class label, within the overall image. Subsequently, a test set is presented and the algorithm is challenged to localize the object in question in previously “unseen” images.17 As opposed to the binary approach of classification-based computer vision tasks, bounding boxes have increased complexity: they seek not only to identify whether a structure is present in an image but also where in the image it is located.18 Bounding boxes have been used in radiology to identify anatomy and pathology on CXR14 and have proposed uses elsewhere, such as identifying brain tumors/edema on MRI and identifying fractures on pelvic and wrist x-rays.8,19,20
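
The paper does not state the rule used to decide whether a predicted box counts as a correct identification; a common convention in object detection is an intersection-over-union (IoU) threshold. The sketch below is illustrative only, with made-up coordinates:

```python
# Illustrative only (the matching rule is an assumption, not from this study):
# a prediction of the right class is commonly scored as correct when its box
# overlaps the ground-truth box with intersection-over-union (IoU) >= 0.5.
from dataclasses import dataclass

@dataclass
class Box:
    left: float    # normalized [0, 1] image coordinates
    top: float
    width: float
    height: float

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a.left, b.left), max(a.top, b.top)
    x2 = min(a.left + a.width, b.left + b.width)
    y2 = min(a.top + a.height, b.top + b.height)
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a.width * a.height + b.width * b.height - inter
    return inter / union if union > 0 else 0.0

# Hypothetical predicted vs. ground-truth boxes:
print(iou(Box(0.42, 0.30, 0.10, 0.08), Box(0.40, 0.31, 0.11, 0.08)) >= 0.5)  # True
```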

Structure identification on temporal bone imaging remains sparsely addressed in the AI literature21 and hence may be a good test of the ability of CNNs to identify fine anatomical structures within a confined space. The aim of this study is to determine the accuracy of a deep-learning CNN algorithm in identifying critical structures in temporal bone CT imaging through object detection with bounding boxes. A secondary aim is to compare the accuracy of the CNN in identifying these structures when trained separately by an otolaryngology registrar as compared to a board-certified otolaryngologist.

Material and Methods

Institutional ethics approval was obtained from the Western Sydney Local Health District Ethics Committee (2021/PID03049), and the study was conducted in accordance with the principles of the Declaration of Helsinki. Verbal informed consent was obtained from the patients who agreed to take part in the study.

De-identified retrospective PTB CBCT scans were obtained from a large Australian public tertiary hospital radiology Picture Archiving and Communication System (PACS), with a total of 129 scans extracted. Scans were acquired on a scanner capable of fine-slice image acquisition (0.3 mm). Axial and coronal slices at the level of the malleoincudal joint were chosen for the purposes of this study, as these are the views an otolaryngologist would review pre-operatively to identify important landmarks. The exclusion criterion was previous temporal bone surgery, as determined by radiological appearance. Clinical information was not collected.

Images were divided into training and test sets. Azure’s Custom Vision (Microsoft Corporation, Redmond, Washington, USA) platform was utilized to draw bounding boxes around the relevant structures in the training set where the structures were present. Custom Vision allows application of machine learning (ML) algorithms to perform classification tasks as well as identification tasks using bounding boxes on custom datasets as small as 50 images.
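
The study used the Custom Vision web interface; as a rough illustration, the equivalent labeling workflow via the Python SDK (azure-cognitiveservices-vision-customvision) might look like the following sketch. The endpoint, key, project name, file name, tag, and box coordinates are placeholders, not values from the study:

```python
# A rough sketch (not the study's code) of bounding-box labeling via the
# Azure Custom Vision training SDK. All literals below are placeholders.
from azure.cognitiveservices.vision.customvision.training import CustomVisionTrainingClient
from azure.cognitiveservices.vision.customvision.training.models import (
    ImageFileCreateBatch, ImageFileCreateEntry, Region)
from msrest.authentication import ApiKeyCredentials

credentials = ApiKeyCredentials(in_headers={"Training-key": "<training-key>"})
trainer = CustomVisionTrainingClient("<endpoint>", credentials)

# An object-detection domain is required for bounding-box projects.
domain = next(d for d in trainer.get_domains()
              if d.type == "ObjectDetection" and d.name == "General")
project = trainer.create_project("temporal-bone-axial", domain_id=domain.id)
cochlea = trainer.create_tag(project.id, "cochlea")

# Regions are normalized (left, top, width, height) coordinates in [0, 1].
with open("scan_001_axial.jpg", "rb") as f:
    entry = ImageFileCreateEntry(
        name="scan_001_axial.jpg", contents=f.read(),
        regions=[Region(tag_id=cochlea.id, left=0.44, top=0.38,
                        width=0.07, height=0.06)])
trainer.create_images_from_files(project.id, ImageFileCreateBatch(images=[entry]))

trainer.train_project(project.id)  # starts a training iteration
```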

Bounding box mode was utilized on 68 axial and 66 coronal training images. Images were interpreted in Joint Photographic Experts Group (JPEG) format in their original resolution on high-resolution computer monitors with bounding boxes placed around structures by a board-certified otolaryngologist (ZH) and independently by an otolaryngology registrar (FC). Important temporal bone structures of interest to an operating surgeon were chosen for identification. On axial imaging, they were mastoid air cells, sigmoid sinus, internal acoustic meatus, facial nerve, ossicles, cochlea, and the vestibule. On coronal imaging, they were mastoid air cells, internal acoustic meatus, facial nerve, cochlea, the vestibule, carotid canal, ossicles, semi-circular canal, and tegmen. Training images were uploaded to the platform and subsequently the algorithm was trained on this image set using the 1-hour training mode.

A second, unseen test set composed of 61 axial and 61 coronal sequences was uploaded to the Custom Vision platform, and the trained algorithm was tested in automated structure identification using the training performed by the otolaryngology registrar (test 1) and by the board-certified otolaryngologist (test 2). Responses were recorded, including the number of instances the structure of interest was identified manually in the test set (ground truth) and the number of instances the algorithm correctly identified the structure of interest under each training condition.
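
Testing against the unseen set can likewise be sketched with the SDK's prediction client; the project ID, published iteration name, file list, and 0.5 confidence cut-off below are illustrative assumptions, not parameters reported by the study:

```python
# Sketch of scoring unseen test images with the Custom Vision prediction SDK.
from azure.cognitiveservices.vision.customvision.prediction import CustomVisionPredictionClient
from msrest.authentication import ApiKeyCredentials

credentials = ApiKeyCredentials(in_headers={"Prediction-key": "<prediction-key>"})
predictor = CustomVisionPredictionClient("<endpoint>", credentials)

project_id = "<project-id>"                 # placeholder
test_image_paths = ["test_001_axial.jpg"]   # hypothetical test JPEG list

detected_counts: dict[str, int] = {}  # structure -> test images where detected
for path in test_image_paths:
    with open(path, "rb") as f:
        results = predictor.detect_image(project_id, "Iteration1", f.read())
    # Count each structure at most once per image, mirroring the per-image
    # identification tally used in the Results.
    for tag in {p.tag_name for p in results.predictions if p.probability >= 0.5}:
        detected_counts[tag] = detected_counts.get(tag, 0) + 1
```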

Statistical analysis was performed using MedCalc 2011 (Ostend, Belgium). Sensitivity and specificity were calculated with 95% confidence intervals. Receiver-operating characteristic (ROC) curves were generated with area under the curve calculations and 95% confidence intervals based on the methodology described by DeLong et al.22
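
The paper does not state which interval method MedCalc applied; the exact (Clopper-Pearson) interval is the usual choice for sensitivity and specificity and closely reproduces the ranges reported below. A minimal sketch, assuming SciPy:

```python
# Assumed method (not the authors' MedCalc workflow): exact Clopper-Pearson
# 95% confidence interval for a proportion such as sensitivity or specificity.
from scipy.stats import beta

def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided confidence interval for k successes in n trials."""
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

# With the axial train-1 totals reported in the Results (325/400 detected),
# this approximately reproduces the published 77.1-85.0 interval.
print(clopper_pearson(325, 400))
```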

Results

The final dataset included 129 axial images and 127 coronal images. In 2 CT series, an adequate coronal slice where all the necessary structures were visible could not be identified. The axial images were split into 68 training images and 61 test images and the coronal images were divided into 66 training images and 61 test images.

Axial Imaging

Mastoid Air Cells

Mastoid air cells were present in all 61 test images. The mastoid air cells were identified reliably on the test set in 100% (61/61) of images when the algorithm was trained on labels from both the otolaryngology registrar (train 1) and the otolaryngologist (train 2). One image had a pneumatized petrous apex, which was also labeled by the algorithm as “mastoid air cells” (Figure 1).

Figure 1. Pneumatized petrous apex misclassified as mastoid air cells.

Sigmoid Sinus

The sigmoid sinus was present in 57 test images. In train 1, it was identified accurately in 55/57 (96.49%) images and in all 57 (100%) images in train 2.

Internal Acoustic Meatus

The internal acoustic meatus was present in 60 test images. In train 1, it was identified accurately in 53/60 (88.33%) images and in all 60 (100%) images in train 2.

Facial Nerve

The facial nerve was present in 52 test images. In train 1, it was identified accurately in 39/52 (75%) images and in 51/52 (98.08%) images in train 2.

Cochlea

The cochlea was present in 54 test images. In train 1, it was identified accurately in 33/54 (61.11%) images and in 53/54 (98.15%) images in train 2.

Vestibule

The vestibule was present in 55 test images. In train 1, it was identified accurately in 23/55 (41.82%) images and in 53/55 (96.36%) images in train 2.

Total Structures Identified

A total of 400 structures were present in the test set (see Table 1 – ground truth) and a total of 27 structures were not present across the test set images, producing a total of 427 identification tasks for the algorithm (Figure 2). A receiver operating characteristic (ROC) curve was generated, with an area under the curve of 0.851 (95% CI, 0.813-0.883) when the algorithm was trained by the registrar compared to 0.958 (95% CI, 0.934-0.975) when trained by an ENT surgeon (see Figure 3). Sensitivity was 81.25 (95% CI, 77.1-85.0) and specificity was 88.89 (95% CI, 70.8-97.6) when trained by the registrar, and 99.00 (95% CI, 97.5-99.7) and 92.59 (95% CI, 75.7-99.1), respectively, when trained by the surgeon.
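
As a sanity check, the reported sensitivities follow directly from the per-structure counts in Table 1, since sensitivity here is total structures detected over total structures present:

```python
# Sanity check (not from the paper's code): reported axial sensitivities
# reproduce from the Table 1 counts as detected / present.
ground_truth = [61, 57, 60, 52, 61, 54, 55]   # per-structure presence (Table 1)
train1_hits  = [61, 55, 53, 39, 61, 33, 23]   # registrar-trained CNN
train2_hits  = [61, 57, 60, 51, 61, 53, 53]   # surgeon-trained CNN

print(sum(train1_hits), sum(ground_truth), sum(train1_hits) / sum(ground_truth))
# 325 400 0.8125 -> the reported sensitivity of 81.25
print(sum(train2_hits) / sum(ground_truth))   # 0.99 -> the reported 99.00
```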

Table 1.

Ground Truth and Percentage Accuracy for Anatomical Landmarks in Axial CT

Anatomical Structure | Mastoid Air Cells | Sigmoid Sinus | IAM | Facial Nerve | Ossicles | Cochlea | Vestibule
Ground truth | 61 | 57 | 60 | 52 | 61 | 54 | 55
ENT registrar | 61 | 55 | 53 | 39 | 61 | 33 | 23
CNN train 1 (%) | 100.00 | 96.49 | 88.33 | 75.00 | 100.00 | 61.11 | 41.82
ENT surgeon | 61 | 57 | 60 | 51 | 61 | 53 | 53
CNN train 2 (%) | 100.00 | 100.00 | 100.00 | 98.08 | 100.00 | 98.15 | 96.36

CNN, convolutional neural network; IAM, internal acoustic meatus.

Figure 2. Labeled structures on axial CT imaging. CT, computed tomography.

Figure 3. Receiver operating characteristic (ROC) curves for axial CT training by ENT registrar (left) and ENT surgeon (right). CT, computed tomography.

Coronal Imaging

Mastoid Air Cells

The mastoid air cells were present in all 61 test images and were identified accurately in all 61 (100%) in both train 1 and train 2.

Internal Acoustic Meatus

The internal acoustic meatus was present in all 61 test images and was identified accurately in all 61 (100%) in both train 1 and train 2.

Facial Nerve

The facial nerve was present in all 61 test images and was identified accurately in all 61 (100%) in both train 1 and train 2.

Cochlea

The cochlea was present in all 61 test images and was identified accurately in all 61 (100%) in both train 1 and train 2.

Vestibule

The vestibule was present in all 61 test images. It was identified accurately in 58/61 (95.08%) images in train 1 and in all 61 (100%) images in train 2.

Carotid Canal

The carotid canal was present in 60 test images. In train 1, it was identified accurately in 52/60 (86.67%) images and in 55/60 (91.67%) images in train 2.

Ossicles

The ossicles were present in all 61 test images and were identified accurately in all 61 (100%) in both train 1 and train 2.

Semi-circular Canal

The semi-circular canal was present in all 61 test images and was identified accurately in all 61 (100%) in both train 1 and train 2.

Tegmen

The tegmen was present in all 61 test images and was identified accurately in all 61 (100%) in both train 1 and train 2.

Total Structures Identified

A total of 548 structures were present in the test set (see Table 2 – ground truth) and 1 structure (a carotid canal not clearly seen on 1 image) was absent across the test set, producing a total of 549 identification tasks for the algorithm. A receiver operating characteristic (ROC) curve was generated, with an area under the curve of 0.907 (95% CI, 0.879-0.930) when the algorithm was trained by the registrar compared to 0.924 (95% CI, 0.899-0.945) when trained by an ENT surgeon (see Figure 4). Sensitivity was 97.99 (95% CI, 96.4-99.0) and specificity 83.33 (95% CI, 35.9-99.6) when trained by the registrar; sensitivity was 99.09 (95% CI, 97.9-99.7) and specificity 85.71 (95% CI, 42.1-99.6) when trained by the surgeon.

Table 2.

Ground Truth and Percentage Accuracy for Anatomical Landmarks in Coronal CT

Anatomical Structure | Mastoid Air Cells | Internal Acoustic Meatus | Facial Nerve | Cochlea | Vestibule | Carotid Canal | Ossicles | Semi-circular Canal | Tegmen
Ground truth | 61 | 61 | 61 | 61 | 61 | 60 | 61 | 61 | 61
ENT registrar | 61 | 61 | 61 | 61 | 58 | 52 | 61 | 61 | 61
Train 1 (%) | 100.00 | 100.00 | 100.00 | 100.00 | 95.08 | 86.67 | 100.00 | 100.00 | 100.00
ENT surgeon | 61 | 61 | 61 | 61 | 61 | 55 | 61 | 61 | 61
Train 2 (%) | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 91.67 | 100.00 | 100.00 | 100.00

CT, computed tomography.

Figure 4. Receiver operating characteristic (ROC) curves for coronal CT training by ENT registrar (left) and ENT surgeon (right). CT, computed tomography.

Discussion

This study demonstrates that CNNs can perform structure identification in temporal bone imaging with a high degree of accuracy from a sample size of as few as 60 images in both axial and coronal planes. Areas under the receiver operating characteristic curve reached 0.958 for axial and 0.924 for coronal slices in the test 2 testing set, indicating excellent diagnostic accuracy of the model, with a statistically significant P value of <.001 for both curves. Sensitivity and specificity were similarly high for both axial and coronal test sets.

The CNN was trained on structures which are of relevance to the operating surgeon and paramount to safe mastoid surgery. Simple structures such as the mastoid air cells and ossicles were identified with a high degree of accuracy. In one instance, a pneumatized petrous apex was incorrectly identified by the algorithm as mastoid air cells (Figure 1). This reassuringly indicates that the algorithm is identifying cells based on their appearance rather than other features such as location, or by coincidence. The ossicles were also identified consistently by the algorithm, namely at the level of the malleoincudal joint. Fine structures such as the tympanic portion of the facial nerve (Figure 5) and the superior semi-circular canal (Figure 6) were also identified with a high degree of accuracy, including in coronal slices where the structure may only be a few pixels wide. Although it is unclear whether size, shape, or position allows the algorithm to identify a structure, it is postulated that a combination of these factors is at work. In Figure 5, for example, the facial nerve is correctly identified, potentially based on its ovoid shape, position, and density, whereas similar structures such as turns of the cochlea are correctly not assigned the same label. Position may be an additional factor, as observed in Figure 6, where an incomplete superior canal limb is still correctly identified based on its expected position. One common concern surrounding CNNs is that the method of deriving conclusions is a “black box,” wherein the convolutions prior to the final output cannot be examined.23 Class attribution maps have been utilized in other studies, for example in lung nodule detection, and can increase confidence that a structure is being assigned to the correct class for the right reasons.24 Although beyond the scope of the current study, this technique poses an avenue of future research in temporal bone imaging and may increase confidence that the algorithm is identifying structures correctly rather than by chance, promoting confidence in future clinical applications.
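
Class attribution (for example, Grad-CAM) weights a network's last convolutional feature maps by their spatially pooled gradients to show which image regions drove a class score. A minimal sketch under an assumed PyTorch setup with a generic backbone and random input, unrelated to the Custom Vision model used here:

```python
# Grad-CAM-style class attribution sketch (assumed tooling, not this study's model).
import torch
from torchvision import models

model = models.resnet18(weights=None)  # stand-in CNN backbone
model.eval()

store = {}
layer = model.layer4  # last convolutional block
layer.register_forward_hook(lambda m, i, o: store.update(feat=o))
layer.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0]))

x = torch.randn(1, 3, 224, 224)  # stand-in for a CT slice
model(x)[0, 0].backward()        # gradient of one class score

weights = store["grad"].mean(dim=(2, 3), keepdim=True)   # pooled gradients
cam = torch.relu((weights * store["feat"]).sum(dim=1))   # coarse heatmap
cam = cam / cam.max()  # normalize; upsample to image size for overlay
print(cam.shape)       # (1, 7, 7) spatial attribution over the slice
```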

Figure 5. Tympanic segment of facial nerve identification by CNN model in test set. CNN, convolutional neural network.

Figure 6. Lateral semi-circular canal identification by CNN model in test set. CNN, convolutional neural network.

Additionally, CNN training improved proportionately with the expertise of the training clinician. The algorithm’s performance was weaker when trained by the less senior clinician (train 1), in particular for features that may be difficult to identify clearly on single slices, such as the cochlea and vestibule on axial imaging and the carotid canal and vestibule in the skull base on coronal imaging. When trained by a board-certified otolaryngologist, the accuracy of CNN structure identification in temporal bone imaging was high, approaching 100% for most structures. In our study, structure identification in coronal CT was 100% in 8 of 9 structures for train 2 (ENT consultant), compared to 7 of 9 structures in train 1 (Table 2). Similarly, although overall accuracy was lower in axial CT for most structures, accuracy was consistently higher for train 2.

The greatest difficulty in automated structure identification was with the carotid canal, especially where the dome of the canal was not well defined and identification was not performed by the algorithm, or where other canals in the base of the skull, such as the hypoglossal canal, were incorrectly labeled as the carotid canal. In a few instances, the bounding box was drawn more widely than the structure itself, so that although the label was correct, the localization was not tight. Accuracy was lower when the algorithm was trained by the less senior clinician (52/60 images, 86.67%) than by the more senior clinician (55/60, 91.67%).

One limitation of this study is the limited number of frames per patient incorporated into the training and test sets. In our model, only one axial and one coronal image from each patient was utilized, forming a 2D sample. In clinical practice, CT scans may be scrutinized through several slices to form a 3D image for the observer. Training the algorithm on entire CT sequences to pick out relevant slices and label structures and pathology is a difficult task for the CNN model used. With future developments in CNN technology, the ultimate aim would be to produce a 3D-based CNN model, which is most relevant to clinical practice. This study explored the use of a pre-existing CNN for classifying image datasets, which limited our ability to introduce a validation set for hyperparameter optimization. Future studies using customizable CNN technology could introduce a validation set to potentially further optimize the accuracy of structure identification.

Additionally, the malleoincudal joint was used to select the frames for training and testing the CNN model, as it is the level at which most of these structures are present. It is unclear how the algorithm would perform with anatomical variation such as ossicular abnormalities, or whether finer anatomical detail, such as individual ossicles, can be discriminated. Temporal bone images with post-operative change or significant pathology (other than mastoids with effusions or sclerosis) were not included, as a uniform training and test set was required for this index CNN study of temporal bone structure identification. Future studies may investigate how this model performs when images with pathology or post-operative changes are included.

Another limitation of structure identification by bounding box is the requirement to constrain three-dimensional, irregular structures into rectangular boxes. While automated segmentation of structures can be performed with ML, these processes are semi-automated, time-consuming, and may require manual refinement. Convolutional neural network segmentation has been performed for larger, grosser anatomical structures such as whole-body adipose tissue25 and for complicated anatomical structures in the maxillofacial complex.26 Segmentation of fine temporal bone anatomy utilizing CNNs is worthy of investigation and may optimize temporal bone surgery.

Clinically relevant usage of AI is increasingly prominent in the medical sphere. Of the AI applications approved for clinical use by the FDA, radiology-based applications account for the largest number of approved software devices.27 Areas of utility in the broader medical landscape include radiological identification, augmentation of surgical instrumentation, and real-time identification of intraoperative structures.27 Although wider usage in otolaryngology is still developing,21 image-guided navigation systems in particular are an avenue AI may augment.

CNNs can also be used to increase efficiency and, therefore, better utilize the workforce in radiology practice.28 Similarly, preoperative planning of mastoid surgery such as cochlear implantation requires cognitive load and time to identify the relevant structures for planning the surgical dissection. With the use of CNNs, this cognitive load can be reduced and efficiency improved, allowing the surgeon to focus on other tasks. AI may also have the capacity to enhance medical education,29 and in this setting it could be introduced in the education of trainee otolaryngologists, trainee radiologists, and non-specialty services in the identification of important temporal bone structures, either in isolation or in relation to a disease process. In the future, the use of AI in temporal bone imaging to identify pathology such as ossicular chain erosion in cholesteatoma, especially when combined with AI technology used for video otoscopes,30 may render it a useful tool in teleradiology and telehealth medicine. Given the accuracy of identifying structures such as the facial nerve, CNNs may help in resolving diagnostic dilemmas on temporal bone CT, for example, differentiating an early-stage glomus tympanicum from a facial nerve schwannoma at the tympanic segment on coronal slices.31

Robotic mastoidectomy is increasing in popularity. Phantom and cadaver studies have been performed with a view to preparing the mastoid bed for more advanced otologic surgery or gaining more direct access to the round window with image-guided techniques.32,33 Preliminary investigations using CNNs in mastoidectomy to recognize surgeon movements and intraoperative landmarks have been successful.17,34 The authors anticipate that such applications will only increase as technology and expertise become more widespread.

Artificial intelligence is an exciting technology with the potential to significantly change clinical practice, including structure identification in otologic radiology. This study demonstrates that CNNs can identify structures with a high degree of accuracy, with performance correlating with the accuracy of the data labeling during the training phase. This suggests the seniority of the training clinician is paramount to ensuring the accuracy of the results. Further investigation of the identification of more complex structures, such as individual ossicles, and of other radiological planes would be desirable.

Data Sharing:

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Funding Statement

The authors declared that this study has received no financial support.

Footnotes

Ethics Committee Approval: Institutional ethics approval was obtained from the Western Sydney Local Health District Ethics Committee (Approval No: 2021/PID03049).

Informed Consent: Verbal informed consent was obtained from the patients who agreed to take part in the study.

Peer-review: Externally peer-reviewed.

Author Contributions: Concept – Z.H.; Design – Z.H., F.C.; Supervision – R.S., N.S.; Resources – Z.H., F.C.; Materials – Z.H., F.C.; Data Collection and/or Processing – Z.H., F.C.; Analysis and/or Interpretation – Z.H., S.K., M.L.; Literature Search – N/A; Writing – Z.H., S.K., M.L., L.A., A.E.; Critical Review – Z.H., S.K., M.L.

Declaration of Interests: The authors have no conflict of interest to declare.

References

1. Prakash B, Aggarwal N, Babu A, Sandhya D, Amulya T. A study on surgical implications and variations of suprameatal spine and other landmarks on the lateral surface of temporal bone. Indian J Otolaryngol Head Neck Surg. 2021:1-6.
2. Services Australia. Medicare Item Reports; 2022. Available at: http://medicarestatistics.humanservices.gov.au/statistics/mbs_item.jsp. Updated October 25, 2022.
3. George AP, De R. Review of temporal bone dissection teaching: how it was, is and will be. J Laryngol Otol. 2010;124(2):119-125. doi: 10.1017/S0022215109991617
4. Harkness P, Brown P, Fowler S, Grant H, Ryan R, Topham J. Mastoidectomy audit: results of the Royal College of Surgeons of England comparative audit of ENT surgery. Clin Otolaryngol Allied Sci. 1995;20(1):89-94. doi: 10.1111/j.1365-2273.1995.tb00020.x
5. Migirov L, Eyal A, Kronenberg J. Intracranial complications following mastoidectomy. Pediatr Neurosurg. 2004;40(5):226-229. doi: 10.1159/000082296
6. Ahuja AT, Yuen HY, Wong KT, Yue V, Van Hasselt AC. Computed tomography imaging of the temporal bone—normal anatomy. Clin Radiol. 2003;58(9):681-686. doi: 10.1016/s0009-9260(03)00209-5
7. Brady AP. Error and discrepancy in radiology: inevitable or avoidable? Insights Imaging. 2017;8(1):171-182. doi: 10.1007/s13244-016-0534-1
8. Wu J, Gur Y, Karargyris A, et al. Automatic bounding box annotation of chest x-ray data for localization of abnormalities. In: 17th International Symposium on Biomedical Imaging (ISBI). IEEE; 2020. doi: 10.1109/ISBI45749.2020.9098482
9. Dey S. Python Image Processing Cookbook: Over 60 Recipes to Help You Perform Complex Image Processing and Computer Vision Tasks with Ease. Packt Publishing; 2020.
10. Suzuki K, Abe H, MacMahon H, Doi K. Image-processing technique for suppressing ribs in chest radiographs by means of massive training artificial neural network (MTANN). IEEE Trans Med Imaging. 2006;25(4):406-416. doi: 10.1109/TMI.2006.871549
11. Suzuki K, Armato SG, Li F, Sone S, Doi K. Massive training artificial neural network (MTANN) for reduction of false positives in computerized detection of lung nodules in low-dose computed tomography. Med Phys. 2003;30(7):1602-1617. doi: 10.1118/1.1580485
12. Suzuki K, Doi K. How can a massive training artificial neural network (MTANN) be trained with a small number of cases in the distinction between nodules and vessels in thoracic CT? Acad Radiol. 2005;12(10):1333-1341. doi: 10.1016/j.acra.2005.06.017
13. Suzuki K, Li F, Sone S, Doi K. Computer-aided diagnostic scheme for distinction between benign and malignant nodules in thoracic low-dose CT by use of massive training artificial neural network. IEEE Trans Med Imaging. 2005;24(9):1138-1150. doi: 10.1109/TMI.2005.852048
14. Milam ME, Koo CW. The current status and future of FDA-approved artificial intelligence tools in chest radiology in the United States. Clin Radiol. 2023;78(2):115-122. doi: 10.1016/j.crad.2022.08.135
15. Farfade SS, Saberian MJ, Li L-J. Multi-view face detection using deep convolutional neural networks. In: Proceedings of the 5th ACM International Conference on Multimedia Retrieval; 2015:643-650. doi: 10.1145/2671188.2749408
16. Simhambhatla R, Okiah K, Kuchkula S, Slater R. Self-driving cars: evaluation of deep learning techniques for object detection in different driving conditions. SMU Data Sci Rev. 2019;2(1):23.
17. Choi J, Cho S, Chung JW, Kim N. Video recognition of simple mastoidectomy using convolutional neural networks: detection and segmentation of surgical tools and anatomical regions. Comput Methods Programs Biomed. 2021;208:106251. doi: 10.1016/j.cmpb.2021.106251
18. Szegedy C, Toshev A, Erhan D. Deep neural networks for object detection. Adv Neural Inf Process Syst. 2013;26.
19. Krogue JD, Cheng KV, Hwang KM, et al. Automatic hip fracture identification and functional subclassification with deep learning. Radiol Artif Intell. 2020;2(2):e190023. doi: 10.1148/ryai.2020190023
20. Thian YL, Li Y, Jagmohan P, Sia D, Chan VEY, Tan RT. Convolutional neural networks for automated fracture detection and localization on wrist radiographs. Radiol Artif Intell. 2019;1(1):e180001. doi: 10.1148/ryai.2019180001
21. Hasan Z, Key S, Habib AR, et al. Convolutional neural networks in ENT radiology: systematic review of the literature. Ann Otol Rhinol Laryngol. 2023;132(4):417-430. doi: 10.1177/00034894221095899
22. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837-845. doi: 10.2307/2531595
23. Sheu YH. Illuminating the black box: interpreting deep neural network models for psychiatric research. Front Psychiatry. 2020;11:551299. doi: 10.3389/fpsyt.2020.551299
24. Venugopal VK, Vaidhya K, Murugavel M, et al. Unboxing AI-radiological insights into a deep neural network for lung nodule characterization. Acad Radiol. 2020;27(1):88-95. doi: 10.1016/j.acra.2019.09.015
25. Küstner T, Hepp T, Fischer M, et al. Fully automated and standardized segmentation of adipose tissue compartments via deep learning in 3D whole-body MRI of epidemiologic cohort studies. Radiol Artif Intell. 2020;2(6):e200010. doi: 10.1148/ryai.2020200010
26. Preda F, Morgan N, Van Gerven A, et al. Deep convolutional neural network-based automated segmentation of the maxillofacial complex from cone-beam computed tomography: a validation study. J Dent. 2022;124:104238. doi: 10.1016/j.jdent.2022.104238
27. Benjamens S, Dhunnoo P, Meskó B. The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database. npj Digit Med. 2020;3(1):118. doi: 10.1038/s41746-020-00324-0
28. van Leeuwen KG, de Rooij M, Schalekamp S, van Ginneken B, Rutten MJ. How does artificial intelligence in radiology improve efficiency and health outcomes? Pediatr Radiol. 2021:1-7.
29. Duong MT, Rauschecker AM, Rudie JD, et al. Artificial intelligence for precision education in radiology. Br J Radiol. 2019;92(1103):20190389. doi: 10.1259/bjr.20190389
30. You E, Lin V, Mijovic T, Eskander A, Crowson MG. Artificial intelligence applications in otology: a state of the art review. Otolaryngol Head Neck Surg. 2020;163(6):1123-1133. doi: 10.1177/0194599820931804
31. Wiggins RH, Harnsberger HR, Salzman KL, Shelton C, Kertesz TR, Glastonbury CM. The many faces of facial nerve schwannoma. AJNR Am J Neuroradiol. 2006;27(3):694-699.
32. Yoo MH, Lee HS, Yang CJ, et al. A cadaver study of mastoidectomy using an image-guided human–robot collaborative control system. Laryngoscope Investig Otolaryngol. 2017;2(5):208-214. doi: 10.1002/lio2.111
33. Zagzoog N, Yang VXD. State of robotic mastoidectomy: literature review. World Neurosurg. 2018;116:347-351. doi: 10.1016/j.wneu.2018.05.194
34. Korte C, Schaffner G, McGhan CLR. A preliminary investigation into the feasibility of semi-autonomous surgical path planning for a mastoidectomy using LSTM-recurrent neural networks. J Med Devices. 2021;15(1). doi: 10.1115/1.4049559
