iScience. 2023 Jul 24;26(10):107463. doi: 10.1016/j.isci.2023.107463

Deep learning for real-time detection of nasopharyngeal carcinoma during nasopharyngeal endoscopy

Zicheng He 1,2,6, Kai Zhang 3,6, Nan Zhao 3, Yongquan Wang 1, Weijian Hou 4, Qinxiang Meng 5, Chunwei Li 1, Junzhou Chen 3,, Jian Li 1,2,7,∗∗
PMCID: PMC10502364  PMID: 37720094

Summary

Nasopharyngeal carcinoma (NPC) is highly curable when detected at an early stage, and early diagnosis relies on nasopharyngeal endoscopy and subsequent pathological biopsy. To enhance the early diagnosis rate by aiding physicians in the real-time identification of NPC and directing biopsy site selection during endoscopy, we assembled a dataset comprising 2,429 nasopharyngeal endoscopy video frames from 690 patients across three medical centers. With these data, we developed a deep learning-based NPC detection model using the you only look once (YOLO) network. Our model demonstrated high performance, with precision, recall, mean average precision, and F1-score values of 0.977, 0.943, 0.977, and 0.960, respectively, for the internal test set and 0.825, 0.743, 0.814, and 0.780 for the external test set at an intersection over union threshold of 0.5. Remarkably, our model demonstrated a high inference speed (52.9 FPS), surpassing the average frame rate (25.0 FPS) of endoscopy videos, thus making real-time detection in endoscopy feasible.

Subject areas: Cancer systems biology, Machine learning

Graphical abstract


Highlights

  • We employed the YOLO network to develop an NPC diagnostic model

  • Datasets from three clinical centers were used to develop and validate the model

  • The proposed model demonstrated outstanding performance and robustness

  • Real-time NPC detection in nasopharyngeal endoscopy videos can be achieved



Introduction

Nasopharyngeal carcinoma (NPC) is a malignant tumor originating from the mucosal epithelium of the nasopharynx. In 2020, there were 96,371 new cases and 58,094 deaths worldwide, with over 70% of new cases occurring in East and Southeast Asia, revealing a highly uneven global distribution.1,2

Patients with early stage NPC exhibit a high overall survival rate after treatment.3 However, due to the atypical symptoms often associated with early stage NPC and the possibility of asymptomatic cases, the majority (>70%) of patients are diagnosed at an advanced stage of NPC. For the early screening of NPC, endoscopy is considered to be indispensable. The ultimate gold standard for diagnosing NPC is nasopharyngeal endoscopy-guided biopsy of abnormal nasopharyngeal lesions.4 Hence, it is particularly important for endoscopists to observe the morphological characteristics of masses through endoscopy and make preliminary judgments. However, distinguishing nasopharyngeal inflammation, lymphoid hyperplasia, adenoid hypertrophy, and residual adenoid tissue from early NPC under endoscopy can be challenging, resulting in false-negative outcomes.5,6,7 Atypical and small lesions may require multisite and repeated biopsies to improve detection rates. Repeated biopsies increase patient trauma and may delay treatment. Therefore, accurately identifying lesions and localizing biopsy sites are critical for early tumor diagnosis. However, not all endoscopists possess the necessary training, experience, or equipment to adequately identify and localize nasopharyngeal lesions, particularly early stage, insidious lesions. In addition, repeatedly reviewing endoscopic images of NPC can be time-consuming and mentally exhausting for endoscopists, as the human eye and brain are less sensitive to identifying lesions. Consequently, developing automatic computer-aided detection (CADe) and diagnosis (CADx) systems to support physicians in diagnosing NPC is crucial.

Computer-aided systems employing machine learning (ML) and deep learning (DL) techniques, such as convolutional neural networks (CNNs), can enhance disease detection and diagnostic accuracy and efficiency. CNN models learn feature information from input images, recognize specific patterns, and correlate them with predefined outputs (detections or diagnoses) to train the network parameters. In recent years, CNNs have emerged as a promising method for image recognition or classification, serving as the foundation for automated image perception, processing, and decision-making. CNNs have proven highly beneficial in endoscopy and have been applied in various medical endoscopy imaging areas. For example, a large-dataset DL model was developed for the detection of upper gastrointestinal tumors in digestive endoscopy, an ear endoscopic image classification model based on DL was developed and validated, a DL model was applied to laryngoscopy for real-time laryngeal cancer detection, and a real-time system using DL was applied to detect and track ureteral orifices during urinary endoscopy.8,9,10,11,12,13,14 In the field of endoscopic NPC recognition, several studies have developed artificial intelligence models based on static endoscopic images, demonstrating their feasibility and recognition performance. However, these models often struggle to balance high accuracy with fast inference speed, impeding real-time dynamic detection in video nasopharyngeal endoscopy.15,16,17,18,19 Furthermore, existing studies have mainly focused on the recognition and classification of endoscopic images of NPC, which makes it difficult to accurately locate lesions in images. This is unfavorable for guiding inexperienced doctors in biopsy site selection. In addition, as these studies rely on single-center datasets, the actual performance, generalization, and robustness of the models still need to be investigated.

The First Affiliated Hospital of Sun Yat-sen University, Macau Kiang Wu Hospital, and Guangzhou First People’s Hospital are located in southeastern China, an area with a high global prevalence of NPC. By leveraging artificial intelligence technology and our extensive nasopharyngeal endoscopy data, we utilized the you only look once (YOLO) network to develop a fast and accurate real-time object detection model for NPC. In this study, we assessed the model’s performance, determined the optimal configuration, and validated the feasibility and effectiveness of the model for real-time automated NPC detection in nasopharyngeal endoscopy using both internal and external datasets. The proposed model represents a novel approach to assist physicians with NPC identification and guide biopsy site selection during nasopharyngeal endoscopy. In addition, the model’s predictive capabilities can validate a physician’s clinical judgment. Our contributions are as follows:

  • 1. We have harnessed the power of the YOLO network to create a real-time NPC diagnostic model designed specifically for video nasopharyngeal endoscopy. This model excels in terms of diagnostic accuracy and inference speed, thereby enabling rapid and precise localization of NPC.

  • 2. We used datasets from three different clinical centers to develop and validate our model. This approach facilitates a more realistic assessment of our model's performance and generalizability in real-world clinical environments.

  • 3. We have developed a robust model that maintains consistent performance across a wide array of scenarios, including variations in video brightness, hue, contrast, video quality, and lens stability. This level of robustness in varied conditions is an advancement in the field.

Results

Detection results

The purpose of this experiment was to evaluate the detection performance of various algorithms for NPC lesions of varying size, shape, and appearance in different datasets. The experimental results are displayed in Table 1. The results demonstrated that the object detection performance of the YOLOv8l model was superior to that of YOLOv6m, YOLOv7, Faster-RCNN, Cascade-RCNN, and SSD (single shot multibox detector) for the internal test set, with precision, recall, F1-score, and mAP (mean average precision) values of 0.977, 0.943, 0.960, and 0.977, respectively. For the external test set, YOLOv7 exhibited the highest precision among the six models, with a value of 0.862, and YOLOv6m achieved the highest recall, with a value of 0.750. However, the F1-score reflects model performance more comprehensively, and the F1-score of YOLOv8l for the external test set (0.780) was higher than that of YOLOv7, YOLOv6m, and the remaining three models. In terms of inference speed, YOLOv7 was the fastest and had the lowest delay among the six models. The inference speeds of YOLOv6m, YOLOv7, and YOLOv8l exceeded the average frame rate (25 FPS) of the nasopharyngeal endoscopy videos, enabling real-time detection in endoscopy, whereas those of Faster-RCNN, Cascade-RCNN, and SSD did not. Based on the performance comparison of the six models, we selected the top three performing models, YOLOv6m, YOLOv7, and YOLOv8l, for further comparative analysis; the detailed results are shown in Figure 1. In summary, these comparative experiments provided compelling evidence that the YOLOv8l model exhibited exceptional stability and accuracy across both internal and external datasets, together with real-time lesion detection capability.

Table 1.

Performance Evaluation of various models

Model | Parameters (millions) | Internal test set (P@.5iou / R@.5iou / F1@.5iou / mAP@.5) | External test set (P@.5iou / R@.5iou / F1@.5iou / mAP@.5) | Frame rate (FPS) | Delay (ms)
YOLOv8l | 43.7 | 0.977 / 0.943 / 0.960 / 0.977 | 0.825 / 0.743 / 0.780 / 0.814 | 52.9 | 18.9
YOLOv7 | 36.9 | 0.944 / 0.924 / 0.930 / 0.944 | 0.862 / 0.621 / 0.730 / 0.634 | 57.1 | 17.5
YOLOv6m | 35.9 | 0.946 / 0.937 / 0.941 / 0.946 | 0.746 / 0.750 / 0.758 / 0.705 | 43.0 | 23.3
Faster-RCNN | 41.3 | 0.563 / 0.391 / 0.406 / 0.563 | 0.235 / 0.377 / 0.290 / 0.235 | 8.0 | 125.0
Cascade-RCNN | 69.2 | 0.930 / 0.621 / 0.742 / 0.930 | 0.676 / 0.415 / 0.544 / 0.676 | 6.3 | 158.7
SSD | 24.4 | 0.858 / 0.683 / 0.759 / 0.858 | 0.779 / 0.454 / 0.573 / 0.779 | 10.4 | 96.2

RCNN = Region Convolutional neural network; SSD = Single Shot MultiBox Detector; P@.5iou = Precision with an Intersection over Union threshold of 0.5; R@.5iou = Recall with an Intersection over Union threshold of 0.5; F1@.5iou = F1 Score with an Intersection over Union threshold of 0.5; mAP@.5 = Mean Average Precision with an Intersection over Union threshold of 0.5; FPS = Frames Per Second; ms = Millisecond.

Figure 1.


Performance metrics of the YOLO models on the internal and external test sets for NPC detection

(A and E) Precision curves.

(B and F) Recall curves.

(C and G) F1-score curves.

(D and H) Precision-Recall curves.

Visualization of DL model prediction

In the field of medical image processing, ensuring the interpretability of a model is of utmost importance. Providing doctors with an understanding of the reasoning behind the model’s predictions allows for enhanced trust and utilization in clinical practice. To address this, we incorporated gradient-weighted class activation mapping (Grad-CAM) into our methodology.20 Grad-CAM generates activation maps specific to the predicted class by producing a weighted linear sum of visual patterns across different spatial locations. By employing Grad-CAM, we can determine which regions of an image the model relies on for its predictions. Our analysis of the Grad-CAM results revealed that the model consistently directs its attention toward lesion areas, with a particular focus on regions exhibiting distinct elevation and more intricate blood vessels, as visually depicted in Figure 2. This observation suggests that these areas hold crucial information for identifying and diagnosing potential cases of NPC. The focus on these regions indicates their significance in contributing to the model’s accurate predictions. As a result, the Grad-CAM technique shows promise as a reference tool for guiding precise lesion biopsies, potentially improving the diagnostic accuracy and treatment planning in clinical settings.
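As an illustration of how such activation maps can be produced, the sketch below applies the pytorch-grad-cam package (listed in the key resources table) to a single endoscopy frame. It uses a torchvision ResNet-50 as a stand-in backbone, since hooking Grad-CAM into the trained YOLOv8l detector requires its specific layer names; the file names and the choice of target layer are illustrative assumptions, not the exact procedure used in this study.

```python
import cv2
import numpy as np
import torch
from torchvision.models import resnet50
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image

# Stand-in backbone; for the trained detector, the target layer would instead be
# a late convolutional block of the YOLOv8l backbone.
model = resnet50(weights="IMAGENET1K_V1").eval()
target_layers = [model.layer4[-1]]  # assumed choice of layer to visualize

bgr = cv2.imread("frame_0001.jpg")  # hypothetical endoscopy frame
rgb = cv2.cvtColor(cv2.resize(bgr, (640, 640)), cv2.COLOR_BGR2RGB)
rgb = rgb.astype(np.float32) / 255.0
input_tensor = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)

cam = GradCAM(model=model, target_layers=target_layers)
heatmap = cam(input_tensor=input_tensor)[0]               # H x W activation map in [0, 1]
overlay = show_cam_on_image(rgb, heatmap, use_rgb=True)   # blend heatmap onto the frame
cv2.imwrite("frame_0001_gradcam.jpg", cv2.cvtColor(overlay, cv2.COLOR_RGB2BGR))
```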

Figure 2.


Examples of automatic NPC prediction provided by the model (YOLOv8l)

The first column on the left contains the original images. The second column contains images with ground truth bounding boxes. The third column contains images with YOLO-predicted bounding boxes. The fourth column contains images with predicted bounding boxes and heat maps. Cases A–E are nasopharyngeal endoscopic images of different NPC patients. Grad-CAM: Gradient-weighted Class Activation Mapping.

Verification of real-time detection

To evaluate the model’s suitability for real-time detection in video streams, this experiment focused primarily on running time. The YOLOv8l model was chosen for its superior performance, as evidenced by the aforementioned results. As shown in Table 2, the model’s inference speed on each of the six processed videos exceeded the videos’ frame rate. To illustrate the detection effect, we selected some original video frames and their corresponding frames processed by the model, which are displayed in Figure 3. Furthermore, to provide a more comprehensive demonstration, three sample videos processed by the model, referred to as Videos S1, S2, and S3 in supplemental information, were made available. The aforementioned experimental results validated the efficacy and feasibility of utilizing YOLOv8l for real-time NPC detection in nasopharyngeal endoscopy videos. The division of the dataset, the research process, and the definition of IoU are shown in Figures 4, 5 and 6.

Table 2.

Characteristics and Computation Times of the Testing Videos After Applying the Model (YOLOv8l) for NPC Detection

Video ID | Size (MB) | Video format | Video resolution | Video frame rate (FPS) | Total frame count | NPC | Average computation time per frame (s) | Model frame rate (FPS)
1 | 18.3/89.6 | MP4/MKV | 1920×1080 | 25.00 | 306 | Y | 0.0183 | 54.64
2 | 18.9/81.2 | MP4/MKV | 1920×1080 | 25.00 | 272 | Y | 0.0176 | 56.82
3 | 9.92/42.7 | MP4/MKV | 1920×1080 | 25.00 | 148 | Y | 0.0176 | 56.82
4 | 20.2/82.3 | MP4/MKV | 1920×1080 | 25.00 | 286 | Y | 0.0175 | 57.14
5 | 26.6/113 | MP4/MKV | 1920×1080 | 25.00 | 390 | Y | 0.0174 | 57.47
6 | 19.2/89.4 | MP4/MKV | 1920×1080 | 25.00 | 311 | N | 0.0170 | 58.82

NPC = nasopharyngeal carcinoma; MB = megabytes; FPS = frames per second.
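The per-frame computation times and model frame rates in Table 2 can in principle be measured with a loop of the following form. This is a minimal sketch assuming the Ultralytics YOLOv8 Python API, a hypothetical weights file, and a hypothetical video path; it is not the exact evaluation script used in the study.

```python
import time
import cv2
from ultralytics import YOLO

model = YOLO("npc_yolov8l.pt")                    # hypothetical trained weights
cap = cv2.VideoCapture("endoscopy_video_01.mp4")  # hypothetical test video

n_frames, elapsed = 0, 0.0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    t0 = time.perf_counter()
    results = model(frame, imgsz=640, conf=0.25, verbose=False)  # per-frame inference
    elapsed += time.perf_counter() - t0
    n_frames += 1
    annotated = results[0].plot()  # frame with predicted boxes; could be written to an output video
cap.release()

print(f"average computation time per frame: {elapsed / n_frames:.4f} s")
print(f"model frame rate: {n_frames / elapsed:.2f} FPS")
```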

Figure 3.


Panel of testing video frames extracted from six nasopharyngeal endoscopic videos

Each row represents a different video: the first two images in each row are extracted from the original videos, and the following four images are the same frames after prediction by the model (YOLOv8l).

Figure 4.


The flowchart of dataset creation

Figure 5.


The flowchart of research & YOLOv8l architecture

Figure 6.


Evaluation of whether the model’s diagnosis was correct. The red rectangle is the predicted NPC marked by the model

The green rectangle is the location of the NPC, manually marked by the expert physicians. Intersection over union (IoU) is the area of overlap divided by the area of union.

Video S1. Real-time detection videos of nasopharyngeal carcinoma in endoscopy, related to results section
Download video file (140.7MB, mp4)
Video S2. Real-time detection videos of nasopharyngeal carcinoma in endoscopy
Download video file (80.5MB, mp4)
Video S3. Real-time detection videos of nasopharyngeal carcinoma in endoscopy
Download video file (115MB, mp4)

Validation of robustness

The robustness of a model has significant implications for its practical application in the real world, as reflected by its ability to maintain strong detection performance in diverse environments. A series of rigorous tests were conducted to validate the robustness of the proposed model. We applied image corruption methods to the test sets to simulate various image characteristics in different scenes, including noise, blur, fog, and changes in brightness.21 The experimental results are displayed in Table 3. The experiments demonstrated that the model exhibited reliable detection performance when dealing with different video qualities, blurry visuals caused by lens shake, foggy scenes resulting from patient respiration, and changes in device brightness for both the internal and external datasets. Notably, apart from zoom blur and fog, which significantly decreased model performance, the average impact of the other corruption methods on the F1-score was −0.017 for the internal test set and −0.018 for the external test set. In summary, the proposed model demonstrated exceptional performance and reasonable generalizability across a diverse range of clinical scenarios.

Table 3.

Robustness validation of the model

Model and image corruption method | Internal test set (P@.5iou / R@.5iou / F1@.5iou / mAP@.5) | External test set (P@.5iou / R@.5iou / F1@.5iou / mAP@.5)
YOLOv8l (no corruption) | 0.977 / 0.943 / 0.960 / 0.977 | 0.825 / 0.743 / 0.780 / 0.814
Gaussian noise | 0.956 / 0.911 / 0.933 / 0.959 | 0.805 / 0.711 / 0.755 / 0.795
Shot noise | 0.948 / 0.911 / 0.929 / 0.960 | 0.792 / 0.710 / 0.749 / 0.797
Impulse noise | 0.954 / 0.889 / 0.920 / 0.949 | 0.805 / 0.684 / 0.739 / 0.787
Defocus blur | 0.965 / 0.932 / 0.948 / 0.971 | 0.812 / 0.735 / 0.771 / 0.807
Zoom blur | 0.810 / 0.758 / 0.783 / 0.837 | 0.680 / 0.587 / 0.630 / 0.705
Motion blur | 0.966 / 0.966 / 0.966 / 0.969 | 0.818 / 0.732 / 0.772 / 0.803
Fog | 0.945 / 0.661 / 0.778 / 0.830 | 0.796 / 0.493 / 0.610 / 0.689
Brightness + | 0.977 / 0.943 / 0.960 / 0.977 | 0.824 / 0.743 / 0.780 / 0.815
Brightness − | 0.956 / 0.932 / 0.944 / 0.971 | 0.804 / 0.731 / 0.766 / 0.803

P@.5iou = Precision with an Intersection over Union threshold of 0.5; R@.5iou = Recall with an Intersection over Union threshold of 0.5; F1@.5iou = F1 Score with an Intersection over Union threshold of 0.5; mAP@.5 = Mean Average Precision with an Intersection over Union threshold of 0.5.

Discussion

In this study, we successfully employed a YOLO network to develop an NPC diagnostic model for video nasopharyngeal endoscopy. The model demonstrated outstanding accuracy and inference speed with both internal and external datasets. By providing real-time and precise lesion localization during the early screening of NPC, our model has the potential to significantly assist clinicians in their decision-making processes. Furthermore, the model shows promising prospects in guiding biopsy procedures for NPC, ultimately contributing to more accurate diagnoses and improving patient outcomes.

In areas where NPC is endemic, early screening for high-risk individuals currently includes serum Epstein-Barr virus (EBV) DNA testing combined with nasopharyngeal endoscopy and magnetic resonance imaging (MRI). NPC exhibits characteristics that are distinct from those of most other tumors, and nasopharyngeal endoscopy plays a vital role in the early screening and auxiliary diagnosis of NPC that MRI cannot replace.22 Presently, there are two main types of nasopharyngeal endoscopy: white light imaging (WLI) and narrow-band imaging (NBI). The former mainly identifies the overall characteristics of a lesion, and the latter mainly identifies the microvascular morphology of a lesion.23 While NBI is more beneficial in identifying occult NPC, its clinical applicability is hampered by the intensive training and expertise required for optical image interpretation. Furthermore, NBI endoscopic equipment is more expensive than WLI endoscopic equipment. Given that many areas with high NPC prevalence in China and Southeast Asia are situated in rural or remote locations, WLI endoscopes are the most prevalent endoscopic equipment in local hospitals. Thus, the prevalence of NBI endoscopy is limited.24 After considering these factors, we primarily focused on detecting and diagnosing NPC in the WLI mode of nasopharyngeal endoscopy.

For physicians, identifying NPC by nasopharyngeal endoscopy is a significant challenge. Li et al. reported that the accuracy, sensitivity, specificity, and positive prediction value (PPV) of experts with five years of experience in identifying nasopharyngeal malignant and benign lesions in WLI images were 80.5%, 89.5%, 70.8%, and 76.6%, respectively, and these metrics were significantly lower for less experienced physicians.15 As many areas with high NPC prevalence in China and Southeast Asia are located in rural or remote locations, local doctors might lack sufficient experience and advanced endoscopic equipment for NPC detection. Consequently, we believe that primary care hospitals need a DL-based CADe model to assist physicians in the early diagnosis of NPC even more than national hospitals do, and such a model can help bridge the cancer diagnosis gap between them. Li et al.'s DL model demonstrated higher recognition accuracy and faster recognition speed than those of professional clinicians, proving that CADe technology can serve as a powerful assistant for clinicians.15

Over the past five years, the advent of the big data era has driven rapid development in the DL network represented by CNNs. In the processes of data acquisition, preprocessing, feature extraction, and data classification, CNNs have been widely used in tumor classification, detection, and segmentation due to their outstanding spatial feature extraction function and classification accuracy.11,13,25,26 Currently, numerous studies have developed object detection models based on YOLO, particularly in the field of video endoscopy, which requires real-time lesion recognition. With the exceptional accuracy and speed of the YOLO network, dynamic, real-time, and precise lesion recognition can be achieved in video endoscopy.11,27,28,29,30,31

To the best of our knowledge, only a few studies have employed artificial intelligence networks to construct NPC endoscopic diagnosis models. Li et al. retrospectively used 28,966 white light images of nasopharyngeal endoscopy to train and develop a CNN-based diagnostic model to identify endoscopic nasopharyngeal malignant tumors and guide biopsies.15 The accuracy, sensitivity, specificity, and PPV values of the model for the retrospective test set were 88.7%, 91.3%, 83.1%, and 92.2%, respectively; for the prospective test set, these values were 88.0%, 90.2%, 85.5%, and 86.9%, respectively. It took approximately 40 s to process 1,430 images. The model exhibited excellent segmentation performance and could accurately outline tumor boundaries. However, since the model can only make diagnoses based on preacquired endoscopic images rather than real-time video, it was challenging to achieve real-time detection in video endoscopy. Mohammed et al. used 381 endoscopic images of NPC to construct an ML model that can handle classification and segmentation tasks. By employing a genetic algorithm for feature selection and an artificial neural network (ANN) for image classification, the model achieved precision, sensitivity, and specificity values of 95.15%, 94.80%, and 95.20%, respectively. In addition, the segmentation accuracy of the model was 92.65%.18 Moreover, Mohammed’s research team used the same NPC dataset to develop a diagnostic model using support vector machine (SVM)-based decision-level fusion of three image texture schemes (local binary patterns, first-order statistics histogram properties, and grayscale histograms). This classifier approach achieved an accuracy of 94.07%, a sensitivity of 92.05%, and a specificity of 93.07%.19 For the detection of NPC in endoscopic images, Mohammed’s team then developed a detection model using a genetic algorithm and an ANN based on Haar features. The proposed model achieved accuracy, sensitivity, and specificity values of 96.22%, 95.35%, and 94.55%, respectively.17 Xu et al. developed a CNN-based NPC diagnostic model using 4,783 nasopharyngeal endoscopic images by combining the optical characteristics of NPC in white light and narrow-band images and used cross-validation to expand the test set sample size. They reported an accuracy of 94.9%, a sensitivity of 94.8%, a specificity of 95.0%, a PPV of 95.2%, and an AUC of 0.986. Additionally, the processing time for 2,000 images was 39.04 s, enabling real-time diagnosis during nasopharyngeal endoscopy.16 However, their models primarily perform image classification tasks and cannot accurately obtain the location, size, and other important information of an NPC lesion. While some of the aforementioned models exhibit fast inference speeds, their performance in video-based applications has not been verified. Therefore, further investigation is necessary to evaluate and optimize the real-time performance of these models. Moreover, since these studies used single-center datasets, the generalizability and applicability of the proposed models in the real world remain to be further discussed. The comparison of our proposed model with previous studies is summarized in Table 4.

Table 4.

A review of AI diagnosis of NPC based on endoscopic images

Authors, year, and country | Site, no. of cases (data type) | AI subfield (application) | AI methods and application | Performance metric(s)

Li et al.15 (2018), China | NPC, 28,966 (endoscopic images, white light imaging) | Deep learning (auto-contouring/diagnosis) | Detection: fully CNN | Detection performance: AUC 0.930; sensitivity 0.902 [CI 0.878–0.922]; specificity 0.855 [CI 0.827–0.880]; accuracy 0.880 [CI 0.861–0.896]; PPV 0.869 [CI 0.843–0.892]; NPV 0.892 [CI 0.865–0.914]; time taken 0.67 min (1,430 images). Segmentation performance: DSC 0.75 ± 0.26

Mohammed et al.18 (2018), Malaysia, Iraq, and India | NPC, 381 (endoscopic images, white light imaging) | Machine learning (auto-contouring/diagnosis) | Feature selection: genetic algorithm; classification: ANN & SVM | Segmentation performance: accuracy 0.9265. Classification performance: sensitivity 0.9480; specificity 0.9520; precision 0.9515

Abd Ghani et al.19 (2018), Malaysia, Iraq, and India | NPC, 381 (endoscopic images, white light imaging) | Machine learning (diagnosis) | Classification: SVM, ANN, KNN | Classification performance: sensitivity 0.9205; specificity 0.9307; accuracy 0.9407

Mohammed et al.17 (2018), Malaysia, Iraq, and India | NPC, 381 (endoscopic images, white light imaging) | Machine learning (diagnosis) | Feature selection: genetic algorithm; classification: ANN | Classification performance: sensitivity 0.9535; specificity 0.9455; accuracy 0.9622

Xu et al.16 (2021), China | NPC, 4,783 (endoscopic images, white light imaging & narrow-band imaging) | Deep learning (diagnosis) | Feature extraction: Xception; classification: deep CNN | Classification performance: AUC 0.986 [CI 0.982–0.992]; sensitivity 0.948 [CI 0.930–0.966]; specificity 0.950 [CI 0.937–0.964]; accuracy 0.949 [CI 0.933–0.965]; PPV 0.952 [CI 0.936–0.968]; NPV 0.946 [CI 0.933–0.960]. Inference time: 39.4 s (1,000 pairs of images)

Our proposed model | NPC, 2,429 (endoscopic images, white light imaging) | Deep learning (object detection/diagnosis) | Object detection: YOLOv6, YOLOv7, YOLOv8, Faster-RCNN, Cascade-RCNN, SSD | Object detection performance: internal dataset: precision 0.977, recall 0.943, F1-score 0.960, mAP 0.977; external dataset: precision 0.825, recall 0.743, F1-score 0.780, mAP 0.814. Inference speed: 52.9 FPS

NPC = Nasopharyngeal carcinoma; AI = Artificial intelligence; CNN = Convolutional neural network; AUC = Area under curve; PPV = Positive prediction value; NPV = Negative prediction value; DSC = Dice similarity coefficient; ANN = Artificial neural network; SVM = Support vector machines; KNN = k-nearest neighbors’ algorithm; RCNN = Region Convolutional neural network; SSD = Single Shot MultiBox Detector; mAP = Mean average precision; FPS = Frames per second.

To the best of our knowledge, we are the first to develop a DL model for real-time NPC object detection in video endoscopy. We used YOLOv6m, YOLOv7, YOLOv8l, Faster-RCNN, Cascade-RCNN, and SSD to construct object detection models and compared their results, ultimately finding that YOLOv8l provided the best accuracy and speed for tumor detection. The model demonstrated a good balance between precision and recall, with F1-scores of 0.960 and 0.780 for the internal and external test sets, respectively, which maximized the detection rate and reduced the misdiagnosis rate. The ability of CADe to detect small lesions has been shown to be comparable or even superior to that of professional doctors, which can help less experienced doctors accurately detect lesions during endoscopy.28 The YOLOv8l model required only 17 ms to analyze a video frame. Additionally, the model’s average frame rate was 57.6 FPS, while the average frame rate of nasopharyngeal endoscopy video was 25–30 FPS, indicating that the model was fully capable of real-time lesion detection in nasopharyngeal endoscopy. Our endoscopic video verification experiment also confirmed the real-time performance of the model. Most importantly, the proposed model demonstrated outstanding stability during robustness testing. We believe that the model can accurately detect NPC across various nasopharyngoscopy examination settings, differences in hospital equipment, and instances where the examination field is blurred by an inexperienced doctor’s operation. Similar to Xu et al., we used interpretability tools to visually explain which regions of an image the model focuses on to make predictions, which has important implications.16 On the one hand, when the model is significantly weaker than endoscopists in NPC detection, the goal of such explanations is to identify failure modes, thereby helping researchers focus their efforts on the most fruitful research directions. On the other hand, when the model is significantly stronger than endoscopists in NPC detection, the goal of explanations is machine teaching, i.e., the machine teaching an endoscopist how to make better decisions when detecting NPC during nasopharyngeal endoscopy. Unlike CAM, Grad-CAM can extract the heatmap of any layer of the feature map without modifying the network structure of the model, and it can be applied to architectures without global average pooling, providing more accurate visualization results.20 Furthermore, due to the lightweight structure of the YOLO network, this model can be widely used and promoted in grassroots or community hospitals.

Since DL models typically perform well for internal datasets and poorly for extrapolation, we incorporated data from other medical centers as external test sets to validate the models' generalizability across various patient populations and healthcare systems. The dataset from multiple clinical medical centers included characteristics of different populations and different endoscopic systems, which could make our sample population more consistent with the actual population and more accurately reflect the model’s performance in practical applications. Often, the metric values of a model for an external test set may be less than or equal to those for an internal test set because the external test set contains more unknown data, which may have different distributions from the data in the internal test set. This could be due to a variety of factors, including differences in the equipment used at different centers, variations in the skills of the technical staff, and differences in the types and stages of diseases among patient populations. In the process of model training, the model will attempt to adapt to the data distribution of the training set and the validation set but may overfit these data distributions, resulting in degraded performance with the external test set. Additionally, there may be some selection bias in the internal test set data, resulting in slightly higher model performance for the internal test set than in reality. However, despite these differences, our model demonstrated good generalizability, maintaining reasonable performance across different centers and patient populations. This outcome gives us confidence in the application of our model in the real world and provides directions for improvements to our model. Consequently, future research should focus on further validating and optimizing the model’s practicality and generalizability in real-world scenarios. Increasing the diversity and quantity of data used for training or employing cross-validation methodologies in limited data should be considered. Furthermore, it is crucial to test the model’s performance across different sex, age, region, and disease stage subgroups and assess its robustness when dealing with various video qualities, lighting conditions, and lens angles. However, it should be noted that achieving perfect generalization across all datasets is often unattainable, so it may be necessary to strike a balance between increased performance with additional data and the possibility of overfitting the model.

Limitations of the study

This study has several limitations. First, the model only distinguishes NPC from non-NPC. The non-NPC category includes benign and malignant lesions, such as hypertrophic adenoids, tuberculosis, lymph node hyperplasia, cysts, lymphoma, olfactory neuroblastoma, malignant melanoma, and adenoid cystic carcinoma. Among them, the number of lymphoma video frames is small, and its endoscopic morphology is similar to that of NPC, so the model is prone to mistaking lymphoma for NPC. Although NPC is the most common malignant tumor in the nasopharynx, further research is necessary, and the next stage of this study will focus on expanding the number of other pathological types to enrich the dataset and build a more reliable algorithm. Second, the dataset contains few video frames of NPC growing under the mucosa, that is, atypical and small lesions, which leads to insufficient training of the model for this type of NPC and makes it prone to missed diagnoses. In nasopharyngeal endoscopy, the local resolution can be improved by bringing the lens close to the lesion, thereby improving the detection rate. Third, the dataset consists of manually selected video frames rather than continuous video frames. Continuous video frames provide a more comprehensive representation of the dynamic changes and progression of lesions over time, so the model may fail to learn effectively and adapt to a wider range of scenarios. Fourth, the whole dataset was collected retrospectively, which might have led to a certain level of selection bias. Finally, the detection accuracy of our model needs to be compared with that of doctors of different experience levels to validate the model’s suitability for real clinical practice.

STAR★Methods

Key resources table

REAGENT or RESOURCE SOURCE IDENTIFIER
Deposited data

Nasopharyngeal Endoscopic Image Datasets This paper N/A

Software and algorithms

YOLOv6 Li et al.32 https://github.com/meituan/YOLOv6
YOLOv7 Wang et al.33 https://github.com/WongKinYiu/yolov7
YOLOv8 Ultralytics company https://github.com/ultralytics/ultralytics
Single Shot MultiBox Detector (SSD) Liu et al.34 https://github.com/weiliu89/caffe
Faster-RCNN Girshick et al.35 https://github.com/rbgirshick/py-faster-rcnn
Cascade-RCNN Cai et al.36 https://github.com/zhaoweicai/cascade-rcnn
Image Corruptions Hendrycks et al.21 https://github.com/bethgelab/imagecorruptions
Gradient-weighted Class Activation Mapping (Grad-CAM) Selvaraju et al.20 https://github.com/jacobgil/pytorch-grad-cam
PyTorch Version 1.11.0 https://pytorch.org/docs/1.11/
Matplotlib Version 3.7.1 https://pypi.org/project/matplotlib/
Python Version 3.8.13 https://www.python.org/downloads/release/python-3813/

Resource availability

Lead contact

Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Jian Li (lijianent@hotmail.com).

Materials availability

This study did not generate new unique reagents.

Experimental model and study participant details

This study was conducted in accordance with the principles of the Declaration of Helsinki and was approved by the Ethics Committees of the First Affiliated Hospital of Sun Yat-sen University, the Kiang Wu Hospital and the Guangzhou First People's Hospital. Due to the retrospective nature of the study and the negligible risk to subjects, informed consent was waived. All patients underwent examination in the endoscopy room using a high-definition video nasopharyngeal endoscope (KARL STORZ-endoskope, Tuttlingen, Germany) in white light mode after local anesthesia with bupivacaine and mucosal shrinkage with epinephrine.

All images were video frames captured from nasopharyngeal endoscopy videos. The inclusion criteria for images were: (1) a minimum resolution of 400×400 pixels; (2) a minimum size of 60 KB; (3) acquired during the initial diagnosis; (4) nasopharyngeal images without nasal structure; (5) clearly visible nasopharyngeal mucosa without overlying material; (6) clear focus; (7) standard white light used during inspection and image capture, with white balance correction performed before inspection; (8) definitive pathological diagnosis. The exclusion criteria for images included: (1) missing pathological information; (2) missing endoscopic images; (3) images that were out of focus, too low in brightness, or had motion artifacts.

We retrospectively collected 2,429 nasopharyngeal endoscopic video frame images, clinicopathological data, imaging reports, and medical records from 690 patients at three medical centers from January 1, 2020, to December 1, 2021. This included 2,000 images from 519 patients at the First Affiliated Hospital of Sun Yat-sen University and 429 images from 171 patients at the Kiang Wu Hospital in Macau and the First People's Hospital in Guangzhou. All patients were Chinese. In the internal dataset, we recorded a total of 369 male patients and 150 female patients. Similarly, in the external dataset, our findings showed 112 male patients and 59 female patients. The average age of the entire dataset was 41.05 years old. All images were anonymized and reconstructed in random order. Images pathologically confirmed as other than NPC, according to the World Health Organization histopathological classification, were considered to be in the non-NPC category, which included nasopharyngeal cysts, lymphomas, tuberculosis, fibrovascular tumors, malignant melanoma, etc. The ratio of NPC to non-NPC was approximately 1:1.

Subsequently, three expert physicians manually labeled the images using LabelImg software. In each image with histopathological evidence of NPC, a bounding box, defined as the ground truth (GT), was outlined along the largest boundary of the tumor so that it surrounded the entire tumor area and was labeled "NPC". The accuracy of the GT bounding boxes was cross-checked by the three expert physicians. We classified the NPC images into three categories based on the size of the GT bounding box relative to the image. The classification criteria were as follows: small, the GT bounding box occupies 10% of the image or less; medium, more than 10% but no more than 30% of the image; large, more than 30% of the image. Small, medium, and large bounding boxes accounted for 3.6%, 36.2%, and 60.2% of the images, respectively. The size distribution of ground-truth bounding boxes for the different datasets is shown in Table S1 and Figure S1 in the supplemental information.
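A simple helper of the following form can assign these size categories; the 10%/30% thresholds are assumed here to refer to the fraction of the image area covered by the GT box.

```python
def bbox_size_category(box_w, box_h, img_w, img_h):
    """Categorize a GT box by the fraction of the image it occupies (assumed area-based)."""
    frac = (box_w * box_h) / float(img_w * img_h)
    if frac <= 0.10:
        return "small"
    if frac <= 0.30:
        return "medium"
    return "large"

# Example: a 300 x 250 px lesion box in a 1920 x 1080 frame covers ~3.6% of the image.
print(bbox_size_category(300, 250, 1920, 1080))  # -> small
```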

We randomly divided the dataset of the main center into a training set, a validation set, and an internal test set in an 8:1:1 ratio. The dataset from the remaining two centers was used as an external test set to evaluate the model's generalization ability. Lastly, six unedited videos of nasopharyngeal endoscopy were selected to validate the real-time NPC detection performance of the model. The dataset division is shown in Figure 4.
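An illustrative random 8:1:1 split of the main-center images could look like the sketch below; the file names and fixed seed are placeholders, and any additional constraints used in the actual split (e.g., patient-level grouping) are not restated here.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

main_center_images = [f"img_{i:04d}.jpg" for i in range(2000)]  # 2,000 main-center frames
random.shuffle(main_center_images)

n = len(main_center_images)
train = main_center_images[: int(0.8 * n)]              # 80% training set
val = main_center_images[int(0.8 * n): int(0.9 * n)]    # 10% validation set
internal_test = main_center_images[int(0.9 * n):]       # 10% internal test set
print(len(train), len(val), len(internal_test))         # 1600 200 200
```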

Method details

Data augmentation

Data augmentation increases the number and diversity of training data by transforming and expanding the original data, thereby improving the model's generalization and robustness and reducing overfitting. The data augmentation techniques we employed included adjusting image brightness, contrast, saturation, and noise; random scaling; cropping; flipping; rotation; copy-paste; mixup; and mosaic data augmentation.33,37,38
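For illustration, an image-level approximation of several of these augmentations using torchvision transforms is sketched below; the parameter values are assumptions, in detection training the bounding boxes must be transformed correspondingly, and mosaic, mixup, and copy-paste are typically handled inside the YOLO training pipeline itself.

```python
from torchvision import transforms

# Illustrative photometric and geometric augmentations (applied to PIL images).
train_augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.RandomResizedCrop(size=640, scale=(0.5, 1.0)),  # random scaling + cropping
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ToTensor(),
])
```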

DL model training and testing

We utilized the YOLO network of open-source CNNs as the object detection model. YOLO is a single-stage DL object detector capable of identifying objects by framing them in a bounding box while simultaneously classifying the object based on probability. At the time of our analysis, the latest version of YOLO was YOLOv8, which demonstrated excellent accuracy and inference speed. In the backbone network of YOLOv8, additional branches have been introduced during feature extraction. These additional branches help enhance the model's training accuracy by capturing and leveraging more diverse and informative features from the input images. Importantly, during the inference process, these additional branches are not involved in the computations. This optimization ensures that the inference speed is not compromised, allowing for efficient real-time or near-real-time object detection. The head section adopts the popular decoupled head structure, separating the classification and regression heads. Moreover, it transitions from an anchor-based approach to an anchor-free approach. In the classification head, YOLOv8 utilizes the Binary Cross Entropy (BCE) Loss for efficient and accurate object classification. The regression head, on the other hand, incorporates the concepts from the Distribution Focal Loss (DFL) and Complete Intersection over Union (CIoU) Loss. DFL is a loss function proposed to address the class imbalance issue in object detection tasks while CIoU Loss is a localization-based loss function that measures the geometric similarity between predicted and GT bounding boxes.

YOLOv8 consists of five different models that vary in terms of the number of parameters, trainable weight sizes, and computation time. Models range from small to extra-large versions (YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x). In this study, we chose YOLOv8l as the target algorithm model. The flowchart of the research and the YOLOv8l architecture are depicted in Figure 5. Additionally, to verify the NPC detection performance of the YOLOv8l model, we selected five other algorithms, YOLOv7, YOLOv6m, Faster-RCNN, Cascade-RCNN, and SSD, for comparison tests.32,33,34,35,36

To further assess the model's robustness under various environmental conditions, including different video brightness, hue, contrast, video quality, and lens stability, we conducted a series of rigorous robustness tests. We applied image corruption methods to the test sets to simulate various image characteristics in different scenes, such as noise, blur, fog, and brightness changes.

Gaussian noise was added to the images, simulating random variations with a Gaussian distribution. This noise introduces randomness and decreases in intensity as the distance from the center increases. It can be represented mathematically as follows:

For each pixel (i, j) in the image:

I′(i, j) = I(i, j) + N(0, σ)

where I'(i, j) represents the corrupted pixel, I(i, j) is the original pixel, and N(0, σ) represents Gaussian noise with zero mean and standard deviation σ.

Shot noise was applied to simulate the random fluctuations in pixel intensities caused by variations in light. This type of noise is typically associated with uncertainties in light intensity and results in visible artifacts at areas of brightness changes. Shot noise can be modeled using the Poisson distribution:

For each pixel (i, j) in the image:

I′(i, j) = Poisson(I(i, j) × λ)

where I′(i, j) represents the corrupted pixel, I(i, j) is the original pixel, and λ controls the intensity of the noise.

Impulse noise, also known as salt-and-pepper noise, introduces sudden, isolated changes in brightness or color values. This type of noise is commonly observed due to sensor malfunctions or transmission errors, resulting in the appearance of bright and dark pixels. The corruption process involves randomly replacing a certain percentage of pixels with either the maximum intensity or minimum intensity.

For each pixel (i, j) in the image, with a random number u drawn uniformly from [0, 1):

I′(i, j) = 0, if u < p/2;

I′(i, j) = 255, if p/2 ≤ u < p;

I′(i, j) = I(i, j), otherwise,

where I′(i, j) represents the corrupted pixel, I(i, j) is the original pixel, and p represents the corruption ratio.

Defocus blur simulates the blurring effect caused by inaccurate focus settings in the endoscopic lens. It results in the loss of sharpness in image details, blurring of edges, or overall image blurriness. This effect is achieved by convolving the image with a given blur kernel, which represents the defocused point spread function.

For each pixel (i, j) in the image:

I′(i, j) = I(i, j) ⊗ K

where I'(i, j) represents the blurred pixel, I(i, j) is the original pixel, and ⊗ denotes convolution with the blur kernel K.

Zoom blur mimics the blur effect caused by movement of the endoscopic lens or the lesion during exposure. This blur results in the diffusion and blurring of details in the image. The process involves cropping and scaling the image, which enlarges or reduces the edges of the original image, leading to blurred edge information.

For each pixel (i, j) in the image:

I′(i, j) = (1/N) Σ_{k=1}^{N} I_zoomed(k)(i′, j′)

where I′(i, j) represents the zoom-blurred pixel, I_zoomed(k)(i′, j′) represents the pixel value from the k-th zoomed-in image at coordinates (i′, j′), and N represents the total number of zoomed-in images.

Motion blur replicates the blurring effect caused by the movement of the endoscopic lens or the lesions. It manifests as a blurred trajectory of moving objects or overall image blurring. The blur effect is achieved by shifting and weighting the input image based on given parameters and a randomly chosen angle.

For each pixel (i, j) in the image:

I′(i, j) = Σ_{(dx, dy)} I(i − dx, j − dy) × W(dx, dy)

where I'(i, j) represents the blurred pixel, I(i - dx, j - dy) are the shifted pixels, W(dx, dy) represents the motion blur kernel weights, and the sum is taken over the motion blur kernel.

Fog simulation introduces a decrease in lens clarity caused by fog from the patient's breathing. This degradation method results in a blurry appearance, faded colors, and loss of fine details. The fog function adds a generated plane fractal image to the input image, creating the desired foggy effect. The plane fractal image is a complex structure with self-similarity generated through a mathematical algorithm or process.

For each pixel (i, j) in the image:

I′(i, j) = I(i, j) + F(i, j)

where I'(i, j) represents the corrupted pixel, I(i, j) is the original pixel, and F(i, j) is the plane fractal image.

The brightness+ operation increases the overall brightness level of the image, resulting in a brighter appearance. This enhancement is achieved by adjusting the pixel values in the V (Value) channel of the HSV (Hue, Saturation, Value) color space.

Convert the image to HSV color space:

V′(i, j) = V(i, j) + c

where V'(i, j) represents the modified pixel value in the V channel, V(i, j) is the original pixel value, and c controls the brightness increment.

The brightness- operation decreases the overall brightness level of the image, resulting in a darker appearance. Similar to brightness+, it adjusts the pixel values in the V channel of the HSV color space.

Convert the image to HSV color space:

V′(i, j) = V(i, j) − c

where V'(i, j) represents the modified pixel value in the V channel, V(i, j) is the original pixel value, and c controls the brightness decrement. Figure S2 illustrates the schematic diagram of the image corruption methods.
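The corruptions above correspond to methods available in the imagecorruptions package cited in the key resources table; a hedged sketch of applying them to a test frame follows. The severity level and file names are assumptions, and the brightness-decrease variant is not part of the package's standard set, so it would be implemented separately via the HSV adjustment described above.

```python
import cv2
from imagecorruptions import corrupt

frame = cv2.imread("frame_0001.jpg")          # hypothetical test frame (BGR, uint8)
rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # the package expects an RGB uint8 image

corruption_names = ["gaussian_noise", "shot_noise", "impulse_noise",
                    "defocus_blur", "zoom_blur", "motion_blur", "fog", "brightness"]
for name in corruption_names:
    corrupted = corrupt(rgb, corruption_name=name, severity=3)  # severity 1-5 (assumed value)
    cv2.imwrite(f"frame_0001_{name}.jpg", cv2.cvtColor(corrupted, cv2.COLOR_RGB2BGR))
```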

The experimental platform for this study was based on Ubuntu, with an Intel(R) Xeon(R) CPU (2.40 GHz) and Nvidia GeForce RTX 3090 GPUs (24 GB). We implemented the YOLO model using the PyTorch deep learning framework, and the experimental environment was torch 1.11.0 + cu113 (CUDA 11.3). The hyperparameters used to train the YOLO network were as follows: batch size 160; image size 640; number of epochs 100; optimizer stochastic gradient descent (SGD); dropout rate 0.1; confidence threshold 0.25; intersection over union (IoU) threshold 0.5; initial learning rate (lr0) 0.01; learning rate decay factor (lrf) 0.1; momentum 0.937; weight decay 0.0005; warm-up epochs 3.0; warm-up momentum 0.8; warm-up bias learning rate 0.1; box loss gain 4.0; classification loss gain 4.5; DFL loss gain 1.5; focal loss gamma 0.0; label smoothing 0.0; nominal batch size (nbs) 64; HSV-Hue augmentation 0.015; HSV-Saturation augmentation 0.4; HSV-Value augmentation 0.4.
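A sketch of launching training with these hyperparameters through the Ultralytics Python API is shown below; "npc.yaml" is a hypothetical dataset configuration file, the starting weights are assumed, and the argument names follow a recent Ultralytics release rather than the exact training script used in this study.

```python
from ultralytics import YOLO

model = YOLO("yolov8l.pt")  # assumed starting weights for the YOLOv8l model
model.train(
    data="npc.yaml",        # hypothetical dataset config with the single "NPC" class
    epochs=100, imgsz=640, batch=160,
    optimizer="SGD", lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005,
    warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1,
    box=4.0, cls=4.5, dfl=1.5, label_smoothing=0.0, nbs=64, dropout=0.1,
    hsv_h=0.015, hsv_s=0.4, hsv_v=0.4,
)
metrics = model.val(data="npc.yaml", split="test", conf=0.25, iou=0.5)  # evaluate at IoU 0.5
```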

Quantification and statistical analysis

We evaluated the model performance using precision, recall, mean average precision (mAP), F1-score, and frame rate. Intersection over union (IoU) is the ratio of the intersection to the union of the predicted bounding box and the GT box, as shown in Figure 6. A true positive (TP) was defined as a detection with IoU ≥ 0.5. A false positive (FP) was defined as a detection with IoU < 0.5 or a duplicate bounding box for the same GT box. A false negative (FN) was defined as the model failing to detect a target object that was present. A true negative (TN) indicates that the model reports no target object when none is present. Object detection metrics do not consider TN because it does not reflect the algorithm's performance in detecting the target object. All statistical analyses were conducted using Python (version 3.8.13) and the Matplotlib library (version 3.7.1). Recall is the probability of correct identification among all positive samples and corresponds to the model’s sensitivity, expressed as:

Recall = TP / (TP + FN)

Precision is the probability of correct detection among all detected targets and corresponds to the model’s positive predicted value (PPV), expressed as:

Precision = TP / (TP + FP)

For the object detection model, mAP is the standard performance metric. mAP is the area under the precision-recall curve, defined by the following equation:

mAP = (1/Q) Σ_{q=1}^{Q} AveP(q)

where Q is the number of queries in the set and AveP(q) is the average precision for a given query q. In our study, as we set the model's IoU threshold to 0.5, mAP@.5 denotes that this value was achieved under the condition of IoU ≥ 0.5.

The F1-score takes into account both the precision and recall of the object detection model and is the harmonic mean of the two, expressed as:

F1 = 2 × Precision × Recall / (Precision + Recall)

Frame rate, in units of frames per second (FPS), is typically used to evaluate the inference speed of a model and is expressed as:

Frame rate = FrameNum / elapsedTime

where FrameNum is the number of video frames and elapsedTime is the time taken by the model to process the video frames. In addition, we recorded the time delay of the model in milliseconds (ms).
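A minimal sketch of these computations is given below; the IoU calculation is for a single predicted box against a single GT box, and the TP/FP/FN counts in the example are illustrative values chosen only to show the arithmetic, not the study's actual counts.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A prediction counts as a TP when its IoU with the matched GT box is >= 0.5.
print(round(iou((100, 100, 300, 250), (120, 110, 310, 260)), 3))  # ~0.757
# Illustrative counts that give precision/recall close to the internal test set values.
print(precision_recall_f1(tp=943, fp=22, fn=57))                  # ~ (0.977, 0.943, 0.960)
```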

Acknowledgments

Study funding: This study was funded by the National Natural Science Foundation of China (81974141) and the Provincial Natural Science Foundation of Guangdong Province (2022A1515010506).

Author contributions

Design or conceptualization of the study: Z.H., J.C., and J.L.; Acquisition of data: Y.W., Z.H., W.H., Q.M., and J.L.; Models training: K.Z. and N.Z.; Analysis or interpretation of the data: K.Z. and N.Z.; Drafting or revising the manuscript for intellectual content: Z.H., K.Z., N.Z., C.L., J.C., and J.L.; Grant proposal and funding acquisition: J.L. and J.C.; Supervision and mentoring: J.L. and J.C.

Declaration of interests

The authors declare no competing interests.

Inclusion and diversity

We worked to ensure gender balance in the recruitment of human subjects. We worked to ensure ethnic or other types of diversity in the recruitment of human subjects. We worked to ensure that the study questionnaires were prepared in an inclusive way. We avoided “helicopter science” practices by including the participating local contributors from the region where we conducted the research as authors on the paper.

Published: July 24, 2023

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.isci.2023.107463.

Contributor Information

Junzhou Chen, Email: chenjunzhou@mail.sysu.edu.cn.

Jian Li, Email: lijianent@hotmail.com.

Supplemental information

Document S1. Figures S1 and S2 and Table S1
mmc1.pdf (296.1KB, pdf)

Data and code availability

  • Nasopharyngeal endoscopic image data reported in this paper will be shared by the lead contact upon request.

  • This paper does not report original code.

  • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

References

  • 1.Chen Y.P., Chan A.T.C., Le Q.-T., Blanchard P., Sun Y., Ma J. Nasopharyngeal carcinoma. Lancet. 2019;394:64–80. doi: 10.1016/s0140-6736(19)30956-0. [DOI] [PubMed] [Google Scholar]
  • 2.Sung H., Ferlay J., Siegel R.L., Laversanne M., Soerjomataram I., Jemal A., Bray F. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. Ca - Cancer J. Clin. 2021;71:209–249. doi: 10.3322/caac.21660. [DOI] [PubMed] [Google Scholar]
  • 3.Lee A.W.M., Ng W.T., Chan L.L.K., Hung W.M., Chan C.C.C., Sze H.C.K., Chan O.S.H., Chang A.T.Y., Yeung R.M.W. Evolution of treatment for nasopharyngeal cancer--success and setback in the intensity-modulated radiotherapy era. Radiother. Oncol. 2014;110:377–384. doi: 10.1016/j.radonc.2014.02.003. [DOI] [PubMed] [Google Scholar]
  • 4.Bossi P., Chan A.T., Licitra L., Trama A., Orlandi E., Hui E.P., Halámková J., Mattheis S., Baujat B., Hardillo J., et al. Nasopharyngeal carcinoma: ESMO-EURACAN Clinical Practice Guidelines for diagnosis, treatment and follow-up(dagger) Ann. Oncol. 2021;32:452–465. doi: 10.1016/j.annonc.2020.12.007. [DOI] [PubMed] [Google Scholar]
  • 5.Cengiz K., Kumral T.L., Yıldırım G. Diagnosis of pediatric nasopharynx carcinoma after recurrent adenoidectomy. Case Rep. Otolaryngol. 2013;2013:653963. doi: 10.1155/2013/653963. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Wu Y.P., Cai P.Q., Tian L., Xu J.H., Mitteer R.A., Jr., Fan Y., Zhang Z. Hypertrophic adenoids in patients with nasopharyngeal carcinoma: appearance at magnetic resonance imaging before and after treatment. Chin. J. Cancer. 2015;34:130–136. doi: 10.1186/s40880-015-0005-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Kim D.H., Lee M.H., Lee S., Kim S.W., Hwang S.H. Comparison of Narrowband Imaging and White-Light Endoscopy for Diagnosis and Screening of Nasopharyngeal Cancer. Otolaryngol. Head Neck Surg. 2022;166:795–801. doi: 10.1177/01945998211029617. [DOI] [PubMed] [Google Scholar]
  • 8.Luo H., Xu G., Li C., He L., Luo L., Wang Z., Jing B., Deng Y., Jin Y., Li Y., et al. Real-time artificial intelligence for detection of upper gastrointestinal cancer by endoscopy: a multicentre, case-control, diagnostic study. Lancet Oncol. 2019;20:1645–1654. doi: 10.1016/s1470-2045(19)30637-0. [DOI] [PubMed] [Google Scholar]
  • 9.Chen Z., Lin L., Wu C., Li C., Xu R., Sun Y. Artificial Intelligence for Assisting Cancer Diagnosis and Treatment in the Era of Precision Medicine. Cancer Commun. 2021;41:1100–1115. doi: 10.1002/cac2.12215. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Zeng X., Jiang Z., Luo W., Li H., Li H., Li G., Shi J., Wu K., Liu T., Lin X., et al. Efficient and accurate identification of ear diseases using an ensemble deep learning model. Sci. Rep. 2021;11:10839. doi: 10.1038/s41598-021-90345-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Azam M.A., Sampieri C., Ioppi A., Africano S., Vallin A., Mocellin D., Fragale M., Guastini L., Moccia S., Piazza C., et al. Deep Learning Applied to White Light and Narrow Band Imaging Videolaryngoscopy: Toward Real-Time Laryngeal Cancer Detection. Laryngoscope. 2022;132:1798–1806. doi: 10.1002/lary.29960. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Liu D., Peng X., Liu X., Li Y., Bao Y., Xu J., Bian X., Xue W., Qian D. A real-time system using deep learning to detect and track ureteral orifices during urinary endoscopy. Comput. Biol. Med. 2021;128:104104. doi: 10.1016/j.compbiomed.2020.104104. [DOI] [PubMed] [Google Scholar]
  • 13.Min J.K., Kwak M.S., Cha J.M. Overview of Deep Learning in Gastrointestinal Endoscopy. Gut Liver. 2019;13:388–393. doi: 10.5009/gnl18384. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Sumiyama K., Futakuchi T., Kamba S., Matsui H., Tamai N. Artificial intelligence in endoscopy: Present and future perspectives. Dig. Endosc. 2021;33:218–230. doi: 10.1111/den.13837. [DOI] [PubMed] [Google Scholar]
  • 15.Li C., Jing B., Ke L., Li B., Xia W., He C., Qian C., Zhao C., Mai H., Chen M., et al. Development and validation of an endoscopic images-based deep learning model for detection with nasopharyngeal malignancies. Cancer Commun. 2018;38:59. doi: 10.1186/s40880-018-0325-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Xu J., Wang J., Bian X., Zhu J.Q., Tie C.W., Liu X., Zhou Z., Ni X.G., Qian D. Deep Learning for nasopharyngeal Carcinoma Identification Using Both White Light and Narrow-Band Imaging Endoscopy. Laryngoscope. 2022;132:999–1007. doi: 10.1002/lary.29894. [DOI] [PubMed] [Google Scholar]
  • 17.Mohammed M.A., Abd Ghani M.K., Arunkumar N., Hamed R.I., Abdullah M.K., Burhanuddin M.A. A real time computer aided object detection of nasopharyngeal carcinoma using genetic algorithm and artificial neural network based on Haar feature fear. Future Generat. Comput. Syst. 2018;89:539–547. doi: 10.1016/j.future.2018.07.022. [DOI] [Google Scholar]
  • 18.Mohammed M.A., Abd Ghani M.K., Arunkumar N., Mostafa S.A., Abdullah M.K., Burhanuddin M.A. Trainable model for segmenting and identifying Nasopharyngeal carcinoma. Comput. Electr. Eng. 2018;71:372–387. doi: 10.1016/j.compeleceng.2018.07.044. [DOI] [Google Scholar]
  • 19.Abd Ghani M.K., Mohammed M.A., Arunkumar N., Mostafa S.A., Ibrahim D.A., Abdullah M.K., Jaber M.M., Abdulhay E., Ramirez-Gonzalez G., Burhanuddin M.A. Decision-level fusion scheme for nasopharyngeal carcinoma identification using machine learning techniques. Neural Comput. Appl. 2018;32:625–638. doi: 10.1007/s00521-018-3882-6. [DOI] [Google Scholar]
  • 20.Selvaraju R.R., Cogswell M., Das A., Vedantam R., Parikh D., Batra D. 2017 IEEE International Conference on Computer Vision (ICCV) 2017. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. [Google Scholar]
  • 21.Hendrycks D., Dietterich T.G. International Conference on Learning Representations. 2019. Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations. [Google Scholar]
  • 22.Li S., Deng Y.Q., Zhu Z.L., Hua H.L., Tao Z.Z. A Comprehensive Review on Radiomics and Deep Learning for Nasopharyngeal Carcinoma Imaging. Diagnostics. 2021;11:1523. doi: 10.3390/diagnostics11091523. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Wen Y.H., Zhu X.L., Lei W.B., Zeng Y.H., Sun Y.Q., Wen W.P. Narrow-band imaging: a novel screening tool for early nasopharyngeal carcinoma. Arch. Otolaryngol. Head Neck Surg. 2012;138:183–188. doi: 10.1001/archoto.2011.1111. [DOI] [PubMed] [Google Scholar]
  • 24.Ni X.G., Zhang Q.Q., Wang G.Q. Classification of nasopharyngeal microvessels detected by narrow band imaging endoscopy and its role in the diagnosis of nasopharyngeal carcinoma. Acta Otolaryngol. 2017;137:546–553. doi: 10.1080/00016489.2016.1253869. [DOI] [PubMed] [Google Scholar]
  • 25.Pacal I., Karaboga D., Basturk A., Akay B., Nalbantoglu U. A comprehensive review of deep learning in colon cancer. Comput. Biol. Med. 2020;126:104003. doi: 10.1016/j.compbiomed.2020.104003. [DOI] [PubMed] [Google Scholar]
  • 26.Suzuki K. Overview of deep learning in medical imaging. Radiol. Phys. Technol. 2017;10:257–273. doi: 10.1007/s12194-017-0406-5. [DOI] [PubMed] [Google Scholar]
  • 27.Lee J.Y., Jeong J., Song E.M., Ha C., Lee H.J., Koo J.E., Yang D.H., Kim N., Byeon J.S. Real-time detection of colon polyps during colonoscopy using deep learning: systematic validation with four independent datasets. Sci. Rep. 2020;10:8379. doi: 10.1038/s41598-020-65387-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Guo Z., Nemoto D., Zhu X., Li Q., Aizawa M., Utano K., Isohata N., Endo S., Kawarai Lefor A., Togashi K. Polyp detection algorithm can detect small polyps: Ex vivo reading test compared with endoscopists. Dig. Endosc. 2021;33:162–169. doi: 10.1111/den.13670. [DOI] [PubMed] [Google Scholar]
  • 29.Pacal I., Karaboga D. A robust real-time deep learning based automatic polyp detection system. Comput. Biol. Med. 2021;134:104519. doi: 10.1016/j.compbiomed.2021.104519. [DOI] [PubMed] [Google Scholar]
  • 30.Ku Y., Ding H., Wang G. Efficient Synchronous Real-Time CADe for Multicategory Lesions in Gastroscopy by Using Multiclass Detection Model. BioMed Res. Int. 2022;2022:8504149. doi: 10.1155/2022/8504149. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Pacal I., Karaman A., Karaboga D., Akay B., Basturk A., Nalbantoglu U., Coskun S. An efficient real-time colonic polyp detection with YOLO algorithms trained by using negative samples and large datasets. Comput. Biol. Med. 2022;141:105031. doi: 10.1016/j.compbiomed.2021.105031. [DOI] [PubMed] [Google Scholar]
  • 32.Li C., Li L., Jiang H., Weng K., Geng Y., Liang L., Zaidan K., Li Q., Cheng M., Nie W., Li Y., et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arxiv. 2022 doi: 10.48550/arXiv.2209.02976. Preprint at. [DOI] [Google Scholar]
  • 33.Wang C.Y., Bochkovskiy A., Liao H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arxiv. 2022 doi: 10.48550/arXiv.2207.02696. Preprint at. [DOI] [Google Scholar]
  • 34.Liu W., Anguelov D., Erhan D., Szegedy C., Reed S., Fu C.-Y., Berg A.C. 2016. SSD: Single Shot MultiBox Detector (Computer Vision – ECCV 2016) [DOI] [Google Scholar]
  • 35.Ren S., He K., Girshick R., Sun J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017;39:1137–1149. doi: 10.1109/tpami.2016.2577031. [DOI] [PubMed] [Google Scholar]
  • 36.Cai Z., Vasconcelos N. Cascade R-CNN: Delving Into High Quality Object Detection. IEEE Conf. Comput. Vis. Pattern Recogn. 2018:6154–6162. [Google Scholar]
  • 37.Ghiasi G., Cui Y., Srinivas A., Qian R., Lin T.Y., Cubuk E.D., Le Q.V., Zoph B. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021. Simple Copy-Paste Is a Strong Data Augmentation Method for Instance Segmentation; pp. 2917–2927. [Google Scholar]
  • 38.Zhang H., Cisse M., Dauphin Y.N., Lopez-Paz D. The International Conference on Learning Representations. Vol. 2018. 2018. mixup: Beyond empirical risk minimization. [Google Scholar]

Associated Data


Supplementary Materials

Video S1. Real-time detection videos of nasopharyngeal carcinoma in endoscopy, related to results section
Download video file (140.7MB, mp4)
Video S2. Real-time detection videos of nasopharyngeal carcinoma in endoscopy
Download video file (80.5MB, mp4)
Video S3. Real-time detection videos of nasopharyngeal carcinoma in endoscopy
Download video file (115MB, mp4)
Document S1. Figures S1 and S2 and Table S1
mmc1.pdf (296.1KB, pdf)

Data Availability Statement

  • Nasopharyngeal endoscopic image data reported in this paper will be shared by the lead contact upon request.

  • This paper does not report original code.

  • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

