Abstract
The rising prevalence of colorectal cancer (CRC) leads to substantially increasing healthcare costs and mortality. It is widely accepted that early detection and removal of colonic polyps can prevent CRC. Detecting colonic polyps in colonoscopy videos is difficult because of the complex environment of the colon and the varied shapes of polyps. Prior work has demonstrated the feasibility of Convolutional Neural Network (CNN)-based polyp detection, but better feature extractors are needed to improve detection performance. In this paper, we investigated the potential of the single shot detector (SSD) framework for detecting polyps in colonoscopy videos. SSD is a one-stage method that uses a feed-forward CNN to produce a collection of fixed-size bounding boxes for each object from different feature maps. Three feature extractors, ResNet50, VGG16, and InceptionV3, were assessed, and multi-scale feature maps for SSD were designed for ResNet50 and InceptionV3, respectively. We validated this method on the 2015 MICCAI polyp detection challenge datasets, comparing it with the teams that attended the challenge, with YOLOV3, and with the two-stage method Faster RCNN. Our results demonstrate that the proposed method surpassed all the teams in the MICCAI challenge as well as YOLOV3, and was comparable with the two-stage method. In terms of detection speed, the proposed method ran far faster than all the MICCAI challenge methods and Faster RCNN, meeting the real-time application requirement. Among the feature extractors, InceptionV3 obtained the best precision and recall. In conclusion, the SSD-based method achieved excellent detection performance for polyp detection and can potentially improve diagnostic accuracy and efficiency.
INDEX TERMS: Colonic polyp detection, convolutional neural network, single shot detector (SSD)
I. INTRODUCTION
Colorectal cancer (CRC) is one of the major causes of death worldwide [1]. Therefore, CRC screening is widely used to detect CRC early and reduce morbidity. In particular, endoscopic screening contributes to early detection of cancer, surveillance of high-risk CRC patients, and removal of pre-cancerous adenomatous polyps [2]. It is reported that a polyp miss rate of up to 20%–24% puts patients at high risk of death from CRC [3], [4]. Thus, several efforts have been made to improve polyp detection. However, accurate polyp detection is difficult because of 1) the varied sizes and shapes of polyps, the quality of bowel preparation, and the time dedicated to mucosal inspection during withdrawal, and 2) the experience of the colonoscopist.
Computer-aided detection (CAD) technology for automated polyp detection aims to assist clinicians in medical diagnosis [5]–[20]. Polyp detection methods fall into two groups: traditional methods and Convolutional Neural Network (CNN)-based methods. Conventional polyp detection methods were based primarily on hand-crafted features followed by a separate classifier. The features used for polyp detection are mainly geometric, including shape, size, and texture information [5]–[7]. The authors in [5] used edge detection via a Canny filter to obtain the image contour. Researchers in [8] used Gabor texture features with K-means clustering to provide automated marker selection for polyps; geometric information from the resulting segments was then extracted to identify polyp candidates. Moreover, the authors of [9] proposed automated polyp detection using edge cross-section profiles, with a Support Vector Machine (SVM) for classification.
Other methods have exploited features for polyp detection under the assumption that polyps have rich texture information. For example, the authors of [14] combined the Local Binary Pattern (LBP) with average intensity to create a two-dimensional histogram feature for classification. The authors of [15] represented image regions using multi-size patches simultaneously; they trained an SVM classifier with the combination of multiple patch sizes, which recognized abnormal regions more effectively than single-size patches.
Recently, CNNs have attracted considerable attention in the computer vision community and have made great progress in various tasks such as classification [21], segmentation [22], and detection [23]–[25]. CNNs have also been applied to medical image analysis. For instance, Bejnordi et al. demonstrated that deep learning algorithms can achieve better diagnostic performance than pathologists [16], and a deep learning algorithm has shown high sensitivity and specificity for detecting diabetic retinopathy [17].
The authors of [10] suggested a two-level approach for automatic polyp detection. First, color, texture, temporal, and shape-in-context features were used to obtain candidate polyp regions; a CNN was then adopted for feature learning and classification. With the advantages of CNNs, the authors utilized all the information available in the image, including color, texture, and shape. The authors in [11] analyzed CNNs and compared models trained from scratch with models fine-tuned for different cancer types. They showed that the fine-tuning approach outperformed full training from scratch when comparing ROC curves.
Yu et al. [18] proposed an integration framework combining online and offline three-dimensional deep learning, which reduced the number of false positives and further improved the discrimination capability of the network. Zhang et al. [12] proposed an automatic detection and classification model by transferring low-level features learned from nonmedical datasets; the method first distinguished polyp images from non-polyp images and then predicted the polyp histology, showing that transfer learning can achieve great success in medical tasks. In [13], the authors summarized all the teams' methods in the MICCAI sub-challenge. CUMED proposed an end-to-end CNN-based detection method whose architecture includes a downsampling path and an upsampling path [19]; the highlight of this approach was that it explored multi-level feature representations with fully convolutional networks in an end-to-end way. OUS proposed a model based on AlexNet pretrained on ILSVRC 2012 [20]. UNS-UCLAN used three CNN models trained at different image scales followed by an MLP classifier. SNU proposed a two-step approach, detection followed by localization, using GoogleNet pretrained on ImageNet for both steps. Zhang et al. [26] presented a regression-based CNN pipeline for polyp detection during colonoscopy, constructed in two parts: a fast object detection model, ResYOLO, learned the spatial features of colorectal polyps, and temporal information was then incorporated via a tracker to refine the detection results. Shin et al. [27] applied a region-based deep CNN model for automatic polyp detection. They used different image augmentations during training to overcome the polyp detection obstacles and the small number of polyp images, and further proposed two efficient post-learning methods to improve the detection results. Recently, SSD was introduced into the polyp detection field. Ozawa et al. [28] did not alter the SSD algorithm but trained the CNN system on many more images, some of which were histologically proven, including colorectal cancers (CRCs); they showed the system has future potential for real-time detection during colonoscopy. Zhang et al. [29] re-used the max-pooling layers of the VGG network and concatenated deconvolved lower layers in the feature map; their results showed the method could improve the performance of SSD and meet the real-time application requirement. These examples show that CNN-based methods have achieved good results for polyp detection, but some problems still need further investigation: (i) two-stage detection methods usually achieve better results but slower detection speed, while one-stage methods show the opposite, so it is crucial to make a trade-off between accuracy and speed; and (ii) more feature extractors need to be explored, since efficient and effective feature extractors can boost detector performance. Therefore, we investigated the single shot detector (SSD), a fast, accurate, one-stage detection framework that is especially easy to integrate with different feature extractors in an end-to-end way.
In this paper, we presented an improved SSD method for polyp detection. Moreover, we validated our method on three datasets, CVC-ClinicDB, ETIS-Larib, and CVC-ColonDB, from the 2015 MICCAI sub-challenge on automatic polyp detection. Our main contributions include:
We integrated three different state-of-the-art CNNs, namely InceptionV3, ResNet50, and VGG16, as feature extractors into the SSD detection framework. SSD requires multiple feature maps to produce sets of detection predictions and was initially designed for VGG16; we therefore designed the multi-scale feature maps for ResNet50 and InceptionV3, respectively.
Compared with conventional and CNN-based polyp detection methods, our system provides an excellent balance between efficiency and accuracy. With a very high detection speed, our system achieved the highest number of true positive detections. Specifically, the proposed SSD method is 12.x times faster than Faster RCNN, and its accuracy is much higher than that of the one-stage method YOLOV3 [30].
Finally, through extensive experiments on public polyp images, we demonstrated that our detection model improved detection performance compared with other recent CNN-based studies on colonoscopy image datasets.
II. METHODOLOGY
With the development of deep learning in the computer vision field, many excellent models, such as Faster RCNN [24], YOLO [25], and SSD [23], have achieved breakthrough performance in object detection on natural images. Unlike Faster RCNN, YOLO and SSD do not require region proposals: they completely eliminate proposal generation and the subsequent feature resampling stage and encapsulate all computation in a single network. Accordingly, SSD and YOLO are classed as one-stage methods, in contrast to Faster RCNN as a two-stage method. Two-stage methods usually achieve higher accuracy, while one-stage methods perform much better in detection speed. Among the one-stage detection models, SSD achieves markedly better accuracy together with an excellent speedup, so it has potential for real-time detection applications.
A. SINGLE SHOT MULTIBOX DETECTOR
SSD, introduced by Liu et al. [23], uses a feed-forward Convolutional Neural Network to produce a fixed-size collection of bounding boxes with different aspect ratios and scales from multi-level feature maps. SSD scores the presence of an object in those boxes, and a non-maximum suppression (NMS) step then produces the final detections. Unlike region proposal methods, SSD encapsulates all computation in a single network, which makes it easy to train and to integrate into systems that require a detection component. Moreover, SSD combines predictions for default boxes of different scales and aspect ratios from multi-level feature maps; this characteristic makes it natural to handle objects of various sizes and shapes, thus improving detection performance. From the architecture in Fig. 2, we can see that SSD extracts features from multiple layers with different spatial and texture information. Combining these features, SSD predicts the offsets to the default boxes of different scales and aspect ratios, along with their associated confidences.
FIGURE 2.

Proposed SSD polyp detection framework architecture. The feature extractor (VGG16) is used to generate features from layers of different spatial resolution.
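To make the box-scoring and suppression step concrete, the following is a minimal NumPy sketch of greedy non-maximum suppression over scored boxes; the function name and the 0.45 IoU threshold are illustrative defaults, not values taken from this work.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes."""
    order = scores.argsort()[::-1]          # highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the chosen box with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # keep only boxes that overlap the chosen box less than the threshold
        order = order[1:][iou < iou_threshold]
    return keep
```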
B. FEATURE EXTRACTOR
The choice of feature extractor in SSD is crucial, as the number of parameters and the types of layers directly affect the memory footprint, speed, and performance of the detector. In this paper, we investigated three conventional yet representative feature extractors: ResNet50 [21], VGG16 [31], and InceptionV3 [32]. VGG16 [31] was originally designed for classification; it contains five CNN blocks with 3 × 3 convolutional filters and three fully connected (FC) layers following the last max-pooling layer. We extracted features from layers Conv4_3 and FC7 and, on the basis of the original SSD framework, added one extra layer, named Conv12_2 (256 channels, stride 2, 3 × 3 kernel), to the truncated base network.
The residual module [21] uses shortcut connections to reformulate network layers as learning residuals around an identity mapping, which enables convolutional neural networks to go deeper and obtain extremely deep image representations. The inception module [32] applies a set of parallel filters (3 × 3, 1 × 1, etc.) to learn different representations; it reduces computational complexity and enables the network to extract features over a wider range of receptive fields. The ResNet50 and InceptionV3 building blocks are shown in Fig. 1.
FIGURE 1.

The left part is the building block for ResNet and the right part is for InceptionV3.
To integrate these networks into the SSD framework, multiple feature maps needed to be designed. For ResNet50, we extracted features from the layers "plus12" and "plus15" of the standard ResNet50 as defined in the MXNet [33] library, and added four extra layers on top of "plus15" with 512, 256, 256, and 128 channels respectively, each with stride 2 and a 3 × 3 kernel. For InceptionV3, the selected feature layers were "ch_concat_mixed_2_chconcat", "ch_concat_mixed_7_chconcat", and "ch_concat_mixed_10_chconcat" from the MXNet [33] definition, and three extra feature layers were added on top of "ch_concat_mixed_10_chconcat" with 256, 256, and 128 channels respectively, each with stride 2 and a 3 × 3 kernel. These extra layers decrease in size progressively and allow predictions of detections at multiple scales.
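As an illustration of how such extra layers can be stacked, the following is a minimal MXNet sketch that takes the intermediate ResNet50 outputs named above and appends stride-2 convolutions; the helper name is ours, and details such as padding and the absence of 1 × 1 bottlenecks are simplifying assumptions rather than the exact training configuration.

```python
import mxnet as mx

def build_ssd_feature_maps(resnet_sym, extra_channels=(512, 256, 256, 128)):
    """Collect multi-scale SSD feature maps from a pretrained ResNet50 symbol."""
    internals = resnet_sym.get_internals()
    # intermediate layers named in the text
    feats = [internals['plus12_output'], internals['plus15_output']]
    x = feats[-1]
    for i, ch in enumerate(extra_channels):
        # each extra stride-2 conv halves the spatial resolution
        x = mx.sym.Convolution(data=x, num_filter=ch, kernel=(3, 3),
                               stride=(2, 2), pad=(1, 1),
                               name='ssd_extra_conv%d' % i)
        x = mx.sym.Activation(data=x, act_type='relu',
                              name='ssd_extra_relu%d' % i)
        feats.append(x)
    return feats  # one prediction head is attached to each of these maps
```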
C. LOSS FUNCTION
The objective loss function of the SSD method contains two parts, the classification loss $L_{cls}$ and the localization loss $L_{loc}$ [23], and is defined as:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{cls}(x, c) + \alpha L_{loc}(x, l, g)\right) \tag{1}$$

where $N$ denotes the number of matched default boxes and $\alpha$ weights the two terms. Let $x_{ij}^{p} \in \{0, 1\}$ be an indicator for matching the $i$-th default box to the $j$-th ground-truth box of category $p$. The classification loss is the softmax loss over polyp and background:

$$L_{cls}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_{i}^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_{i}^{0}\right) \tag{2}$$

where $\hat{c}_{i}^{p} = \frac{\exp(c_{i}^{p})}{\sum_{p}\exp(c_{i}^{p})}$.

Let $l$ be the predicted box and $g$ be the ground-truth box. The localization loss is a Smooth L1 loss between $l$ and $g$. Let $(d^{cx}, d^{cy})$ be the center of the default bounding box $d$, and $d^{w}$, $d^{h}$ its width and height. The localization loss can be written as:

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{p}\, S\!\left(l_{i}^{m} - \hat{g}_{j}^{m}\right) \tag{3}$$

where

$$\hat{g}_{j}^{cx} = \frac{g_{j}^{cx} - d_{i}^{cx}}{d_{i}^{w}}, \quad \hat{g}_{j}^{cy} = \frac{g_{j}^{cy} - d_{i}^{cy}}{d_{i}^{h}}, \quad \hat{g}_{j}^{w} = \log\frac{g_{j}^{w}}{d_{i}^{w}}, \quad \hat{g}_{j}^{h} = \log\frac{g_{j}^{h}}{d_{i}^{h}} \tag{4}$$

The smooth function $S$ is defined as:

$$S(x) = \begin{cases} 0.5x^{2} & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{5}$$
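For illustration, the following is a minimal NumPy sketch of the regression-target encoding of Eq. (4) and the smooth L1 penalty of Eq. (5); the function names are ours.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 penalty of Eq. (5): 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def encode_box(g, d):
    """Regression targets of Eq. (4) for ground-truth box g and default box d.

    Both boxes are (cx, cy, w, h); the localization loss of Eq. (3) sums
    smooth_l1 over the difference between the predicted offsets l and
    these targets for every matched default box.
    """
    return np.array([(g[0] - d[0]) / d[2],      # ghat_cx
                     (g[1] - d[1]) / d[3],      # ghat_cy
                     np.log(g[2] / d[2]),       # ghat_w
                     np.log(g[3] / d[3])])      # ghat_h
```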
III. EXPERIMENT
A. DATASETS
We evaluated the SSD model using the CVC-ColonDB and CVC-ClinicDB datasets, which together contain 912 images obtained from 44 video sequences acquired from 36 patients in the MICCAI 2015 challenge. We also evaluated SSD on the ETIS-Larib dataset. The details of these datasets are listed below.
CVC-ColonDB [34] contains 300 images with associated polyp masks obtained from 13 polyp video sequences acquired from 13 patients.
CVC-ClinicDB [35] contains 612 images with associated segmentation masks for polyps and background (here, mucosa and lumen), obtained from 31 polyp video sequences acquired from 23 patients.
ETIS-Larib Polyp DB [36] is a database of frames extracted from colonoscopy videos. These frames contain several examples of polyps. In addition to the frames, ground truth is provided for the polyps, consisting of a mask corresponding to the region covered by the polyp in the image.
All the annotations provided by the datasets are pixel-wise masks, whereas SSD requires ground-truth bounding boxes. To feed the data to SSD, we therefore converted each mask to a rectangular box, ensuring that the whole polyp lay within the box, as shown in Fig. 3.
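A minimal NumPy sketch of this conversion, assuming one polyp per binary mask (multi-polyp masks would need connected-component analysis first), could look as follows.

```python
import numpy as np

def mask_to_box(mask):
    """Convert a binary polyp mask to the tight (xmin, ymin, xmax, ymax)
    bounding box used as the SSD ground truth."""
    ys, xs = np.nonzero(mask)               # coordinates of all mask pixels
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1
```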
FIGURE 3.

The shadow in the left column images is the pixel-wise annotation; the red rectangular bounding box is the annotation used during SSD training.
B. EVALUATION METRICS
Detection accuracy was evaluated in terms of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). Specifically, if a detected region fell within the polyp area, only one TP was counted per polyp; any detection falling outside a polyp was considered a false positive (FP). The absence of an alarm in an image with a polyp was counted as a false negative (FN), one per undetected polyp in the image. For images without polyps, a true negative (TN) was counted whenever no output was produced for that image. The performance metrics (Precision, Recall, Specificity, F1-measure, F2-measure) are listed in Table 1.
TABLE 1.
Performance metrics for polyp detection.
| Metric | Abbreviation | Calculation |
|---|---|---|
| Precision | Prec | $TP/(TP+FP)$ |
| Recall | Rec | $TP/(TP+FN)$ |
| Specificity | Spec | $TN/(TN+FP)$ |
| F1-measure | F1 | $2 \cdot Prec \cdot Rec/(Prec+Rec)$ |
| F2-measure | F2 | $5 \cdot Prec \cdot Rec/(4 \cdot Prec+Rec)$ |
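For illustration, the metrics in Table 1 can be computed from the raw counts as in the following sketch (the function name is ours).

```python
def detection_metrics(tp, fp, fn, tn=0):
    """Compute the Table 1 metrics from raw detection counts."""
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = 2 * prec * rec / (prec + rec)
    f2 = 5 * prec * rec / (4 * prec + rec)
    return prec, rec, spec, f1, f2

# e.g. the SSD-InceptionV3 row of Table 2 (TN not reported):
# detection_metrics(167, 60, 41) gives prec 0.736, rec 0.803, F1 0.768, F2 0.789
```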
C. IMPLEMENTATION
Our model was implemented with the MXNet library [33] on a single NVIDIA P40 GPU. Owing to the limited data, feature extractor models pretrained on the PascalVOC dataset were applied. During training, the SGD optimizer was used with a batch size of 4 and a learning rate of 0.0001. Mean average precision (mAP) [37] was used to evaluate the model. To increase robustness and reduce overfitting, the following data augmentations were applied (the cropping step is sketched after this list):
Random cropping, mirroring, and color changing.
All images were resized to 512 × 512, and patches were sampled so that the minimum jaccard overlap with the objects was 0.1, 0.3, 0.5, 0.7, or 0.9.
Random patch sampling.
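For illustration, the following sketch shows how a patch can be sampled so that its jaccard overlap with the polyp box meets a randomly chosen minimum, following the SSD sampling strategy; the size bounds and trial count are illustrative assumptions, not the exact training configuration.

```python
import random

def jaccard(box_a, box_b):
    """IoU between two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def sample_patch(width, height, gt_box, max_trials=50):
    """Sample a crop whose overlap with the polyp box meets a randomly
    chosen minimum jaccard value from the set used during training."""
    min_iou = random.choice([0.1, 0.3, 0.5, 0.7, 0.9])
    for _ in range(max_trials):
        w = random.uniform(0.3, 1.0) * width
        h = random.uniform(0.3, 1.0) * height
        x = random.uniform(0, width - w)
        y = random.uniform(0, height - h)
        patch = (x, y, x + w, y + h)
        if jaccard(patch, gt_box) >= min_iou:
            return patch
    return (0.0, 0.0, float(width), float(height))  # fall back to whole image
```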
IV. RESULTS AND DISCUSSION
Detecting colonic polyps is challenging because they appear in various sizes and shapes, and clinical application also places high demands on real-time speed. In this paper, we investigated the SSD framework, which offers relatively high accuracy and fast detection speed. To explore the power of the SSD method, we also investigated three state-of-the-art feature extraction backbones: VGG16, ResNet50, and InceptionV3. Finally, we compared the SSD-based method with the teams from the MICCAI 2015 challenge, YOLOV3, and Faster RCNN on polyp detection with the standard evaluation datasets (http://endovis.grand-challenge.org).
A. ACCURACY AND COMPARISON WITH OTHER METHODS
1). COMPARISON WITH TEAMS IN MICCAI 2015
The performance of the proposed SSD models with the three feature extractors is compared with other methods in Table 2. In general, the proposed SSD methods significantly improved detection in terms of true positives, and produced relatively fewer false positives than the other teams that attended the MICCAI challenge, thus achieving much better precision, recall, F1, and F2 scores, as shown in Fig. 4. The methods in [13] can be divided into three groups: end-to-end learning, hybrid approaches, and hand-crafted feature based methods. CUMED, UNS-UCLAN, OUS, and SNU proposed end-to-end learning; PLS, ETIS-LARIB, and ASU used hybrid approaches; CVC-CLINIC was the only team that used hand-crafted features. Among all the teams, CUMED obtained the best performance over all metrics, with a CNN-based detection method comprising a downsampling and an upsampling path for feature extraction. Compared with it, our SSD methods use multi-level features from layers of different spatial resolution for box generation, and therefore capture polyp features in a multi-scale way. In particular, we employed three state-of-the-art CNNs as feature extractors, which further strengthened the power of SSD. From Table 2, we also note that, compared with CUMED, SSD generated slightly more false positives. It is commonly understood that low-level CNN features contain more geometric information whereas high-level features contain more semantic information; the CUMED model's downsampling and upsampling paths transfer features from low level to high level, which reduced its FP count.
TABLE 2.
SSD results compared with the MICCAI endoscopic vision challenge.
| Method | TP | FP | FN | Prec | Rec | F1 | F2 |
|---|---|---|---|---|---|---|---|
| SSD-ResNet50 | 167 | 63 | 41 | 72.6 | 80.3 | 76.3 | 78.6 |
| SSD-InceptionV3 | 167 | 60 | 41 | 73.6 | 80.3 | 76.8 | 78.9 |
| SSD-VGG16 | 158 | 96 | 50 | 62.2 | 75.9 | 68.4 | 72.7 |
| YOLOV3 | 120 | 37 | 88 | 76.4 | 57.7 | 65.8 | 60.7 |
| CUMED | 144 | 55 | 64 | 72.3 | 69.2 | 70.7 | 69.8 |
| CVC-CLINIC | 102 | 920 | 106 | 10.0 | 49.0 | 16.5 | 27.5 |
| ETIS-LARIB | 103 | 1373 | 105 | 6.9 | 49.5 | 12.2 | 22.3 |
| OUS | 131 | 57 | 77 | 69.7 | 63.0 | 66.1 | 64.2 |
| PLS | 119 | 630 | 89 | 15.8 | 57.2 | 24.9 | 37.6 |
| SNU | 20 | 176 | 188 | 10.2 | 9.6 | 9.9 | 9.7 |
| UNS-UCLAN | 110 | 226 | 98 | 32.7 | 52.8 | 40.4 | 47.1 |
| Faster RCNN [27] | 167 | 26 | 41 | 86.5 | 80.3 | 83.3 | 81.5 |
FIGURE 4.

Precision-Recall curves for all the methods. The performance of SSD with the three backbones is much better than that of the teams that attended the MICCAI challenge, but slightly lower than the two-stage method. InceptionV3 clearly performed the best among the three backbones.
2). COMPARISON WITH FASTER-RCNN
We also compared the proposed method with the two-stage method reported in [27]. As shown in Table 2, the results of Faster RCNN were slightly better than ours in terms of precision, F1, and F2 scores. Their best model detected 167 polyps out of a total of 208 polyps in the ETIS-LARIB dataset, comparable with our SSD-ResNet50 and SSD-InceptionV3 models; at a lower confidence threshold, our SSD-ResNet50 even achieved a much larger TP count, correctly detecting 180 polyps, as shown in Table 4. During training, they used a post-processing step with additional datasets to reduce false positives: they first trained the detector on the CVC dataset, then ran it on 5 additional videos to collect high-scoring false positives, and re-trained the detector with these collected false positives.
TABLE 4.
SSD-ResNet50 results with different threshold values.
| Threshold | TP | FP | FN | Prec | Rec | F1 | F2 |
|---|---|---|---|---|---|---|---|
| 0.1 | 180 | 168 | 28 | 51.7 | 86.5 | 64.7 | 76.3 |
| 0.2 | 176 | 91 | 32 | 65.9 | 84.6 | 74.1 | 80 |
| 0.25 | 173 | 78 | 35 | 68.9 | 83.2 | 75.4 | 79.9 |
| 0.3 | 167 | 63 | 41 | 72.6 | 80.3 | 76.3 | 78.6 |
| 0.35 | 159 | 59 | 49 | 72.9 | 76.4 | 74.6 | 75.7 |
| 0.4 | 156 | 54 | 52 | 74.3 | 75 | 74.6 | 74.9 |
| 0.5 | 148 | 45 | 60 | 76.7 | 71.2 | 73.8 | 72.2 |
| 0.6 | 139 | 36 | 69 | 79.4 | 66.8 | 72.6 | 69 |
| 0.7 | 129 | 28 | 79 | 82.2 | 62 | 70.7 | 65.2 |
| 0.8 | 122 | 22 | 86 | 84.7 | 58.7 | 69.3 | 62.5 |
| 0.9 | 100 | 8 | 108 | 92.6 | 48.1 | 63.3 | 53.2 |
3). COMPARISON WITH YOLOV3
We also compared our SSD system with another one-stage detection method, YOLOV3. From the results in Table 2, we observe that although YOLOV3 generated fewer FPs than SSD, it also produced far fewer TPs. The training mAP shown in Fig. 5 likewise indicates that YOLOV3 performed worse than the SSD methods; its highest mAP remained below the others. A main reason may be that YOLOV3 generates candidate bounding boxes from only three feature scales, while SSD generates them from more feature layers with different spatial resolutions. This way of generating bounding boxes fits the fact that polyps naturally vary in shape and size, and thus improves detection performance.
FIGURE 5.

Training mAP for SSD-ResNet50, SSD-InceptionV3, SSD-VGG16, and YOLOV3. The InceptionV3 model shows a better and more stable learning curve than the other two feature extractors, while YOLOV3 performs poorly compared with the SSD methods.
B. DETECTION SPEED
We computed the average detection time over all test images. The detection speeds for the other methods listed in Table 3 are based on [13] and [27]. Despite using very deep feature extractor networks, the SSD methods achieved high frames-per-second (FPS) rates (30 FPS, 32 FPS, and 33 FPS for SSD-VGG16, SSD-InceptionV3, and SSD-ResNet50, respectively), exceeding every MICCAI challenge method and the two-stage method, and surpassing the minimum requirement for clinical application (25 FPS [13]). Compared with the two-stage method, our SSD-based method was 12.x times faster than Faster RCNN while achieving the same number of true positive detections.
TABLE 3.
Detection speed (FPS) comparison of the different methods.
| Method | Speed (Frames Per Second) |
|---|---|
| CUMED | 5 |
| CVC-CLINIC | 0.1 |
| ETIS-LARIB | 0.46 |
| OUS | 0.2 |
| PLS | 6.9 |
| SNU | 1.25 |
| UNS-UCLAN | 0.2 |
| SSD-ResNet50 | 33 |
| SSD-VGG16 | 30 |
| SSD-InceptionV3 | 32 |
| Faster RCNN [27] | 2.5 |
| YOLOV3 | 46 |
Detection speed plays an important role in clinical application, and our results show that SSD-based methods are fast enough to have clinical potential. One major reason is that SSD eliminates the region proposal step. A two-stage detection method such as Faster RCNN first generates regions of interest with a region proposal network and then performs box regression and classification on those regions. In contrast, SSD as a one-stage method detects objects in a unified framework without region proposals. The reason we did not choose the other one-stage method, YOLO, is that SSD performed significantly better in accuracy, owing to its default boxes generated from multiple feature maps and its training strategy [23].
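For reference, the FPS figures above can be measured with a simple timing loop such as the following sketch, where `detector` stands for any callable that runs one forward pass plus NMS on a single image (the names are illustrative).

```python
import time

def average_fps(detector, images):
    """Average frames per second over a set of test images."""
    start = time.time()
    for img in images:
        detector(img)                       # one forward pass + NMS
    return len(images) / (time.time() - start)
```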
C. INFLUENCE OF DIFFERENT FEATURE EXTRACTORS
Detailed results for the different feature extractors (backbones) applied in the SSD method are listed in Tables 4, 5, and 6. The threshold in the tables denotes the confidence level of the detected polyp region. Fig. 5 shows the training evaluation metrics for the three feature extractors. The following conclusions can be drawn (a sketch of the threshold sweep follows the list):
The lower the threshold value, the more TPs and FPs were detected; both TP and FP decreased as the threshold increased.
Under the same confidence level, ResNet50 produced the most TPs while VGG16 produced the fewest; for FPs, InceptionV3 achieved the lowest counts.
Overall, InceptionV3 reached the best performance considering the trade-off between true and false positives, producing relatively more TPs and the fewest FPs.
InceptionV3 and ResNet50 converged quickly, both reaching their maximum mAP within 50 epochs, while VGG16 was slower and reached a relatively lower mAP.
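For illustration, the threshold sweep behind Tables 4–6 can be sketched as follows, assuming the detections have already been matched against the ground truth; the helper name and input format are ours.

```python
def sweep_thresholds(detections, n_polyps=208,
                     thresholds=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Recompute TP/FP/FN at each confidence threshold, as in Tables 4-6.

    Assumes `detections` is a list of (score, is_true_positive) pairs
    matched one-to-one against the ground-truth polyps; n_polyps defaults
    to the 208 polyps of the ETIS-Larib test set used here.
    """
    rows = []
    for t in thresholds:
        kept = [(s, hit) for s, hit in detections if s >= t]
        tp = sum(1 for _, hit in kept if hit)
        fp = len(kept) - tp
        rows.append((t, tp, fp, n_polyps - tp))   # (threshold, TP, FP, FN)
    return rows
```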
TABLE 5.
SSD-InceptionV3 results with different threshold values.
| Threshold | TP | FP | FN | Prec | Rec | F1 | F2 |
|---|---|---|---|---|---|---|---|
| 0.1 | 167 | 60 | 41 | 73.6 | 80.3 | 76.8 | 78.9 |
| 0.15 | 164 | 48 | 44 | 77.4 | 78.9 | 78.1 | 78.5 |
| 0.2 | 161 | 45 | 47 | 78.2 | 77.4 | 77.8 | 77.6 |
| 0.25 | 159 | 36 | 49 | 81.5 | 76.4 | 78.9 | 77.4 |
| 0.3 | 154 | 34 | 54 | 81.9 | 74 | 77.8 | 75.5 |
| 0.4 | 144 | 26 | 64 | 84.7 | 69.2 | 76.2 | 71.9 |
| 0.5 | 138 | 23 | 70 | 85.7 | 66.4 | 74.8 | 69.5 |
| 0.6 | 134 | 21 | 74 | 86.5 | 64.4 | 73.8 | 67.9 |
| 0.7 | 125 | 18 | 83 | 87.4 | 60.1 | 71.2 | 64.1 |
| 0.8 | 107 | 13 | 101 | 89.2 | 51.4 | 65.2 | 56.2 |
| 0.9 | 96 | 9 | 112 | 91.4 | 46.2 | 61.3 | 51.2 |
TABLE 6.
SSD-VGG16 results with different threshold values.
| Threshold | TP | FP | FN | Prec | Rec | F1 | F2 |
|---|---|---|---|---|---|---|---|
| 0.1 | 164 | 210 | 44 | 43.9 | 78.9 | 56.4 | 67.9 |
| 0.2 | 158 | 96 | 50 | 62.2 | 75.9 | 68.4 | 72.7 |
| 0.3 | 142 | 53 | 66 | 72.8 | 68.3 | 70.5 | 69.1 |
| 0.4 | 132 | 32 | 76 | 80.5 | 63.5 | 71.0 | 66.3 |
| 0.5 | 119 | 24 | 89 | 83.2 | 57.2 | 67.8 | 61 |
| 0.6 | 113 | 15 | 95 | 88.3 | 54.3 | 67.3 | 58.9 |
| 0.7 | 96 | 9 | 112 | 91.4 | 46.2 | 61.3 | 51.2 |
| 0.8 | 84 | 6 | 124 | 93.3 | 40.4 | 56.4 | 45.6 |
| 0.9 | 60 | 2 | 148 | 96.8 | 28.9 | 44.4 | 33.6 |
Polyp size and shape vary from one polyp to another, which makes detection difficult. To overcome this, SSD uses multi-level feature maps from layers of different spatial resolution. The inception module extracts input features using different combinations of filters, i.e., receptive fields; by applying different convolutional kernel sizes in parallel, the InceptionV3 structure enables the network to capture features covering different polyp sizes and shapes, mitigating the large variation in size and shape and thus producing fewer FPs. For ResNet50, the shortcut connections that reformulate the convolutional layers enable the network to preserve more polyp features as it goes deeper; this structure helps SSD detect regions that contain polyps. Compared with InceptionV3, ResNet50 and VGG16 performed worse, which indicates that varied filter sizes and shortcut connections are very important for extracting polyp features. Some detection samples from the different feature extractors are shown in Fig. 6.
FIGURE 6.

Some results of our SSD polyp detector; the number above each box indicates the probability of the area being a polyp. The first row shows results for InceptionV3, the second row for ResNet50, and the last row for VGG16.
This is the first work to investigate SSD for polyp detection by exploring three state-of-the-art feature extraction backbones. It should be noted that further improvement could be made by exploring different ways of fusing features, as observed above: low-level features contain more geometric information while high-level features contain more semantic information, and combining them efficiently could improve detection accuracy.
D. LIMITATIONS
This study has several limitations, most of them related to the complicated environment of the lumen. The detectors were trained specifically to discriminate colonic polyps from normal mucosa against the lumen background, but they still have difficulty with intestinal content, plicae, and hemorrhagic foci, as shown in Fig. 7. In a real clinical setting, some of these could be resolved with endoscopic flushing. In addition, the detectors may miss small, flat, or distant polyps; collecting a larger amount of training data would help improve detection accuracy. Moreover, the test dataset was enriched with cases containing colonic polyps and is therefore not directly comparable with the case mix endoscopists encounter in real clinical practice. Given that most lumen views do not contain polyps, a more representative dataset is needed, and a further prospective study is required to validate the performance of the algorithm.
FIGURE 7.

Failure detection samples. The three images in the first row show FPs and the three in the second row show FNs.
V. CONCLUSION
In this paper, we investigated a single shot detector (SSD)-based method with three different feature extractors for polyp detection in colonoscopy videos/images with limited training data. The SSD architecture uses a feed-forward Convolutional Neural Network to produce a collection of fixed-size bounding boxes for each object from different feature maps. Extensive experiments on the ETIS-LARIB dataset showed that the SSD models achieved the best true positive detection results and comparable F1 and F2 scores, while their detection speed dramatically outperformed the existing methods. We integrated three state-of-the-art networks as feature extractors into the SSD framework and designed the multi-scale feature maps used in the SSD system for ResNet50 and InceptionV3. Among them, InceptionV3 showed an excellent balance between true positive and false positive detections. Our results indicate the feasibility of the SSD-based method, especially with InceptionV3 as the feature extractor, for providing complementary information to endoscopists. The development of an automatic polyp detection platform would have a positive impact on future CRC management.
ACKNOWLEDGMENT
We would like to thank the Computer Vision Centre (CVC) for providing the CVC-ColonDB and CVC-ClinicDB datasets, and Lariboisière Hospital-APHP and the ETIS laboratory, ENSEA, University of Cergy-Pontoise, for providing the ETIS-Larib dataset. (Ming Liu and Jue Jiang are co-first authors.)
This work was supported by the Digestive Medical Coordinated Development Center of Beijing Hospitals Authority, No. XXT12.
Biographies

MING LIU was born in Heyang, Shanxi, China. He received the B.S. degree in computational mathematics from Beijing Jiaotong University, in 2008, and the M.S. and Ph.D. degrees from the Chinese Academy of Sciences, in 2010 and 2013, respectively, both in computational mathematics. After graduation, he joined Statoil ASA as an Engineer. In 2016, he joined JD Finance as an Algorithm Engineer, where he developed fully convolutional neural networks for crowd counting. His research interests include computer vision and medical image analysis with deep learning methods.

JUE JIANG was born in Heyang, Shanxi, China. He received the B.S. and M.S. degrees in electrical engineering from Xi’an Jiaotong University, in 2008 and 2011, respectively. He obtained the Ph.D. degree from the University of Tokyo, Japan, in 2016. He is currently working as a Research Scholar at Memorial Sloan Kettering Cancer Center (MSKCC), New York, USA. His research areas include machine learning, deep learning, 3D reconstruction, and image processing.

ZENAN WANG was born in Chaozhou, Guangdong, China. She received the B.S., M.S., and Ph.D. degrees in clinical medicine from Southern Medical University, in 2011, 2012, and 2014, respectively. She is currently working as an Endoscopist with the Department of Gastroenterology, Beijing Chaoyang Hospital, the Third Clinical Medical College of Capital Medical University, China. Her research interests include early gastrointestinal cancer detection and endoscopy image analysis.
Footnotes
The associate editor coordinating the review of this manuscript and approving it for publication was Vicente Alarcon-Aquino.
REFERENCES
- [1] Hassan C, Quintero E, Dumonceau J-M, Regula J, Brandão C, Chaussade S, Dekker E, Dinis-Ribeiro M, Ferlitsch M, Gimeno-García A, Hazewinkel Y, Jover R, Kalager M, Loberg M, Pox C, Rembacken B, and Lieberman D, "Post-polypectomy colonoscopy surveillance: European Society of Gastrointestinal Endoscopy (ESGE) guideline," Endoscopy, vol. 45, no. 10, pp. 842–851, 2013.
- [2] Løberg M, Kalager M, Holme O, Hoff G, Adami H-O, and Bretthauer M, "Long-term colorectal-cancer mortality after adenoma removal," N. Engl. J. Med., vol. 371, no. 9, pp. 799–807, 2014.
- [3] Leufkens AM, Van Oijen MGH, Vleggaar FP, and Siersema PD, "Factors influencing the miss rate of polyps in a back-to-back colonoscopy study," Endoscopy, vol. 44, no. 5, pp. 470–475, 2012.
- [4] Van Rijn JC, Reitsma JB, Stoker J, Bossuyt PM, Van Deventer SJ, and Dekker E, "Polyp miss rate determined by tandem colonoscopy: A systematic review," Amer. J. Gastroenterol., vol. 101, no. 2, p. 343, February 2006.
- [5] Krishnan SM, Yang X, Chan KL, Kumar S, and Goh PMY, "Intestinal abnormality detection from endoscopic images," in Proc. 20th Annu. Int. Conf. IEEE Eng. Med. Biol. Soc., vol. 2, October 1998, pp. 895–898.
- [6] Van Wijk C, Van Ravesteijn VF, Vos FM, and Van Vliet LJ, "Detection and segmentation of colonic polyps on implicit isosurfaces by second principal curvature flow," IEEE Trans. Med. Imag., vol. 29, no. 3, pp. 688–698, March 2010.
- [7] Dhandra BV, Hegadi R, Hangarge M, and Malemath VS, "Analysis of abnormality in endoscopic images using combined HSI color space and watershed segmentation," in Proc. 18th Int. Conf. Pattern Recognit. (ICPR), vol. 4, August 2006, pp. 695–698.
- [8] Hwang S and Celebi ME, "Polyp detection in wireless capsule endoscopy videos based on image segmentation and geometric feature," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), March 2010, pp. 678–681.
- [9] Wang Y, Tavanapong W, Wong J, Oh J, and de Groen PC, "Part-based multiderivative edge cross-sectional profiles for polyp detection in colonoscopy," IEEE J. Biomed. Health Informat., vol. 18, no. 4, pp. 1379–1389, July 2014.
- [10] Tajbakhsh N, Gurudu SR, and Liang J, "Automatic polyp detection in colonoscopy videos using an ensemble of convolutional neural networks," in Proc. IEEE 12th Int. Symp. Biomed. Imag. (ISBI), April 2015, pp. 79–83.
- [11] Tajbakhsh N, Shin JY, Gurudu SR, Hurst RT, Kendall CB, Gotway MB, and Liang J, "Convolutional neural networks for medical image analysis: Full training or fine tuning?" IEEE Trans. Med. Imag., vol. 35, no. 5, pp. 1299–1312, May 2016.
- [12] Zhang R, Zheng Y, Mak TWC, Yu R, Wong SH, Lau JYW, and Poon CCY, "Automatic detection and classification of colorectal polyps by transferring low-level CNN features from nonmedical domain," IEEE J. Biomed. Health Inform., vol. 21, no. 1, pp. 41–47, January 2017.
- [13] Bernal J et al., "Comparative validation of polyp detection methods in video colonoscopy: Results from the MICCAI 2015 endoscopic vision challenge," IEEE Trans. Med. Imag., vol. 36, no. 6, pp. 1231–1249, June 2017.
- [14] Wang P, Krishnan SM, Kugean C, and Tjoa MP, "Classification of endoscopic images based on texture and neural network," in Proc. 23rd Annu. Int. Conf. IEEE Eng. Med. Biol. Soc., vol. 4, October 2001, pp. 3691–3695.
- [15] Li P, Chan KL, and Krishnan SM, "Learning a multi-size patch-based hybrid kernel machine ensemble for abnormal region detection in colonoscopic images," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 2, June 2005, pp. 670–675.
- [16] Bejnordi BE et al., "Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer," JAMA, vol. 318, no. 22, pp. 2199–2210, December 2017.
- [17] Gulshan V, Peng L, Coram M, Stumpe M, Wu D, Narayanaswamy A, Venugopalan S, Widner K, Madams T, Cuadros J, Kim R, Raman R, Nelson P, Mega J, and Webster D, "Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs," JAMA, vol. 316, no. 22, pp. 2402–2410, 2016.
- [18] Yu L, Chen H, Dou Q, Qin J, and Heng PA, "Integrating online and offline three-dimensional deep learning for automated polyp detection in colonoscopy videos," IEEE J. Biomed. Health Informat., vol. 21, no. 1, pp. 65–75, January 2017.
- [19] Chen H, Qi XJ, Cheng JZ, and Heng PA, "Deep contextual networks for neuronal structure segmentation," in Proc. AAAI, 2016, pp. 1167–1173.
- [20] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, and Fei-Fei L, "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, December 2015.
- [21] He K, Zhang X, Ren S, and Sun J, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), June 2016, pp. 770–778.
- [22] Shelhamer E, Long J, and Darrell T, "Fully convolutional networks for semantic segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 640–651, April 2017.
- [23] Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, and Berg AC, "SSD: Single shot multibox detector," in Proc. 14th Eur. Conf. Comput. Vis. (ECCV), Amsterdam, The Netherlands: Springer, October 2016, pp. 21–37.
- [24] Ren S, He K, Girshick R, and Sun J, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91–99.
- [25] Redmon J, Divvala S, Girshick R, and Farhadi A, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), June 2016, pp. 779–788.
- [26] Zhang R, Zheng Y, Poon CCY, Shen D, and Lau JYW, "Polyp detection during colonoscopy using a regression-based convolutional neural network with a tracker," Pattern Recognit., vol. 83, pp. 209–219, November 2018.
- [27] Shin Y, Qadir HA, Aabakken L, Bergsland J, and Balasingham I, "Automatic colon polyp detection using region based deep CNN and post learning approaches," IEEE Access, vol. 6, pp. 40950–40962, 2018.
- [28] Ozawa T, Ishihara S, Fujishiro M, Aoyama K, and Tada T, "Sa1971 automated endoscopic detection and classification of colorectal polyps using convolutional neural networks," Gastrointestinal Endoscopy, vol. 87, no. 6, p. AB271, 2018.
- [29] Zhang X, Chen F, Yu T, An J, Huang Z, Liu J, Hu W, Wang L, Duan H, and Si J, "Real-time gastric polyp detection using convolutional neural networks," PLoS One, vol. 14, no. 3, 2019, Art. no. e0214133.
- [30] Redmon J and Farhadi A, "YOLOv3: An incremental improvement," April 2018, arXiv:1804.02767. [Online]. Available: https://arxiv.org/abs/1804.02767
- [31] Simonyan K and Zisserman A, "Very deep convolutional networks for large-scale image recognition," September 2014, arXiv:1409.1556. [Online]. Available: https://arxiv.org/abs/1409.1556
- [32] Szegedy C, Vanhoucke V, Ioffe S, Shlens J, and Wojna Z, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), June 2016, pp. 2818–2826.
- [33] Chen T, Li M, Li Y, Lin M, Wang N, Wang M, Xiao T, Xu B, Zhang C, and Zhang Z, "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems," December 2015, arXiv:1512.01274. [Online]. Available: https://arxiv.org/abs/1512.01274
- [34] Bernal J, Sánchez J, and Vilariño F, "Towards automatic polyp detection with a polyp appearance model," Pattern Recognit., vol. 45, no. 9, pp. 3166–3182, 2012.
- [35] Bernal J, Sánchez FJ, Fernández-Esparrach G, Gil D, Rodríguez C, and Vilariño F, "WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians," Computerized Med. Imag. Graph., vol. 43, pp. 99–111, July 2015.
- [36] Silva J, Histace A, Romain O, Dray X, and Granado B, "Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer," Int. J. Comput. Assist. Radiol. Surgery, vol. 9, no. 2, pp. 283–293, 2014.
- [37] Everingham M, Eslami SMA, Van Gool L, Williams CKI, Winn J, and Zisserman A, "The Pascal visual object classes challenge: A retrospective," Int. J. Comput. Vis., vol. 111, no. 1, pp. 98–136, January 2015.
