Investigative and Clinical Urology. 2024 Aug 30;65(6):551–558. doi: 10.4111/icu.20240159

Application of deep learning for semantic segmentation in robotic prostatectomy: Comparison of convolutional neural networks and visual transformers

Sahyun Pak 1, Sung Gon Park 1, Jeonghyun Park 2, Hong Rock Choi 2, Jun Ho Lee 2, Wonchul Lee 3, Sung Tae Cho 1, Young Goo Lee 1, Hanjong Ahn 4
PMCID: PMC11543645  PMID: 39505514

Abstract

Purpose

Semantic segmentation is a fundamental part of the surgical application of deep learning. Traditionally, segmentation in vision tasks has been performed using convolutional neural networks (CNNs), but the transformer architecture has recently been introduced and widely investigated. We aimed to investigate the performance of deep learning models in segmentation in robot-assisted radical prostatectomy (RARP) and identify which of the architectures is superior for segmentation in robotic surgery.

Materials and Methods

Intraoperative images during RARP were obtained. The dataset was randomly split into training and validation data. Segmentation of the surgical instruments, bladder, prostate, vas and seminal vesicle was performed using three CNN models (DeepLabv3, MANet, and U-Net++) and three transformers (SegFormer, BEiT, and DPT), and their performances were analyzed.

Results

The overall segmentation performance during RARP varied across different model architectures. For the CNN models, DeepLabV3 achieved a mean Dice score of 0.938, MANet scored 0.944, and U-Net++ reached 0.930. For the transformer architectures, SegFormer attained a mean Dice score of 0.919, BEiT scored 0.916, and DPT achieved 0.940. The performance of CNN models was superior to that of transformer models in segmenting the prostate, vas, and seminal vesicle.

Conclusions

Deep learning models provided accurate segmentation of the surgical instruments and anatomical structures observed during RARP. Both CNN and transformer models showed reliable predictions in the segmentation task; however, CNN models may be more suitable than transformer models for organ segmentation and may be more applicable in unusual cases. Further research with large datasets is needed.

Keywords: Artificial intelligence, Computer vision systems, Deep learning, Prostatectomy

Graphical Abstract


INTRODUCTION

In the past decade, artificial intelligence (AI) and deep learning have provided major advances in computer vision and natural language processing [1]. In particular, computer vision using deep neural networks has emerged as a very actively researched part of the machine learning field [1,2,3]. Semantic segmentation refers to the ability to segment an unknown image into different parts and objects, and is one of the most complex tasks in computer vision, alongside classification or object recognition [2]. As deep learning algorithms have developed, numerous neural networks for semantic segmentation have emerged, and as such, a new “state-of-the-art” model is continuously being introduced.

Since the introduction of deep neural networks, convolutional neural networks (CNNs) have dominated the computer vision field. However, an important current issue in computer vision is the emergence of transformers. After the success of transformer architectures in natural language processing, their application to computer vision began. A landmark paper in 2020 showed that the transformer architecture can be used successfully for image recognition [4], with the vision transformer demonstrating that a pure transformer applied directly to a sequence of image patches can perform very well on classification tasks. Following this initial study, several further models demonstrated that transformers can be successfully trained on typical-scale image datasets, such as ImageNet-1K [5], and that transformers can also be effective in medical image segmentation tasks [6].

Image segmentation has been extensively studied in the medical domain, especially in applications for AI-aided diagnosis in radiology [7]. In addition, semantic segmentation has also been explored in vision system-based surgical procedures. However, the utility of semantic segmentation in robotic surgery remains largely unknown.

Prostate cancer is one of the most commonly diagnosed cancers worldwide [8]. Radical prostatectomy is considered a key treatment in the management of non-metastatic prostate cancer [9,10], and the robot-assisted laparoscopic approach has recently become the main approach [11].

This study investigated the performance of semantic segmentation in robotic prostate surgery and compared various CNN and transformer architectures for segmentation during robotic surgery.

MATERIALS AND METHODS

1. Study population

The records of 150 men with prostate cancer who underwent robot-assisted radical prostatectomy (RARP) using a Da Vinci robotic platform at Hallym University Medical Center between January 2022 and September 2023 were reviewed. Patient data, including demographic characteristics, clinicopathologic features, and intraoperative videos, were analyzed retrospectively. The study protocol was approved by the Institutional Review Board (IRB) of Hallym University Kangnam Sacred Heart Hospital (approval number: 2022-11-017). The requirement for written informed consent was waived by the IRB due to the retrospective nature of the study.

2. Dataset

RARP was performed with an anterior approach, and intraoperative videos were recorded. The videos were converted into MPEG-4 format at 720-pixel resolution and 30 frames per second. For each surgical case, 20 to 30 images were extracted. All images extracted from each video were manually annotated at a per-pixel level by two expert urologists (Fig. 1). Annotation was performed for each of the classes of surgical instruments, the bladder, the prostate, and the vas deferens/seminal vesicle. Interrater agreement between the two urologists was excellent, with a Dice coefficient of 0.85. The dataset, organized into separate folders for each patient, was randomly split into training and validation sets at a 4:1 ratio using a per-patient approach. This patient-level split ensured that images from the same patient were not divided between the training and validation sets, preventing potential data leakage between them.
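For clarity, a minimal sketch of the patient-level 4:1 split described above is shown below. The folder layout, file extension, and random seed are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a patient-level 4:1 split: whole patient folders are
# assigned to one set, so no patient's images appear in both splits.
import random
from pathlib import Path

def split_by_patient(root_dir: str, val_fraction: float = 0.2, seed: int = 42):
    """Assign entire patient folders to training or validation."""
    patient_dirs = sorted(p for p in Path(root_dir).iterdir() if p.is_dir())
    rng = random.Random(seed)
    rng.shuffle(patient_dirs)

    n_val = int(len(patient_dirs) * val_fraction)
    val_patients, train_patients = patient_dirs[:n_val], patient_dirs[n_val:]

    # Collect image paths per split; every image from one patient stays together.
    train_images = [img for p in train_patients for img in p.glob("*.png")]
    val_images = [img for p in val_patients for img in p.glob("*.png")]
    return train_images, val_images
```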

Fig. 1. Prostatectomy image annotation.


3. Preprocessing

Before proceeding with training, preprocessing steps were applied to the dataset to standardize the inputs and manage the computational workload. This preprocessing consisted of three stages: normalization, image resizing, and label conversion. Normalization was performed to center the distribution of the pixel values of the input image. Subsequently, input images of varying sizes were resized to a consistent size of 640×640 pixels using linear interpolation to minimize pixel value loss. Finally, label conversion was performed. Since the label data were provided in RGB (red, green, and blue) format, they underwent a mapping and conversion process to produce class indices: each RGB value was mapped to a pre-defined class index, converting the labels to a format suitable for training. In this study, to address a potential overfitting problem due to the limited dataset configuration, we introduced dynamic transformation techniques instead of traditional data augmentation [12]. This method consisted of three stages and was implemented with the intent of more accurately detecting complex-shaped, non-uniform objects. The methods used for this dynamic transformation are described in the Supplementary Material and Supplementary Table 1.
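The preprocessing pipeline can be sketched as follows. Only the 640×640 target size and the linear interpolation of the input images come from the text above; the normalization statistics, the color-to-class palette, the use of OpenCV, and the nearest-neighbor resizing of the masks are assumptions for illustration.

```python
# Illustrative preprocessing sketch (not the authors' code).
import cv2
import numpy as np

TARGET_SIZE = (640, 640)
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)  # assumed ImageNet statistics
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

# Hypothetical RGB palette -> class index mapping (background, instrument,
# bladder, prostate, vas/seminal vesicle); the actual palette is not reported.
PALETTE = {
    (0, 0, 0): 0,
    (255, 0, 0): 1,
    (0, 255, 0): 2,
    (0, 0, 255): 3,
    (255, 255, 0): 4,
}

def preprocess_image(image_bgr: np.ndarray) -> np.ndarray:
    """Resize with linear interpolation, scale to [0, 1], and normalize."""
    image = cv2.resize(image_bgr, TARGET_SIZE, interpolation=cv2.INTER_LINEAR)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    return (image - MEAN) / STD

def rgb_label_to_index(label_rgb: np.ndarray) -> np.ndarray:
    """Map each RGB pixel of the annotation to its class index."""
    # Nearest-neighbor resizing keeps the label colors exact (an assumption;
    # the paper specifies linear interpolation only for the input images).
    label = cv2.resize(label_rgb, TARGET_SIZE, interpolation=cv2.INTER_NEAREST)
    index_map = np.zeros(label.shape[:2], dtype=np.int64)
    for color, class_idx in PALETTE.items():
        index_map[np.all(label == color, axis=-1)] = class_idx
    return index_map
```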

4. Network architecture and training

In this research, we evaluated three representative CNN models: DeepLabv3 [13], MANet [14], and U-Net++ [15]; and three transformer models: SegFormer [16], BEiT [17], and DPT [18].
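The paper does not report which implementations were used. One plausible setup, sketched below, instantiates the CNN models from the segmentation_models_pytorch package and the transformer models from the Hugging Face transformers library; the encoder backbones, the pretrained checkpoints, and in particular the mapping of the cited MANet to the MAnet class in segmentation_models_pytorch are assumptions.

```python
# Sketch of one possible way to instantiate the six evaluated architectures.
import segmentation_models_pytorch as smp
from transformers import (
    SegformerForSemanticSegmentation,
    BeitForSemanticSegmentation,
    DPTForSemanticSegmentation,
)

NUM_CLASSES = 5  # background, instrument, bladder, prostate, vas/seminal vesicle

cnn_models = {
    "DeepLabV3": smp.DeepLabV3(encoder_name="resnet50", classes=NUM_CLASSES),
    # smp.MAnet is used here as a stand-in; the cited MANet is a multi-level
    # aggregation network, so this mapping is an assumption.
    "MANet": smp.MAnet(encoder_name="resnet50", classes=NUM_CLASSES),
    "U-Net++": smp.UnetPlusPlus(encoder_name="resnet50", classes=NUM_CLASSES),
}

transformer_models = {
    # Checkpoints below are illustrative; heads are re-initialized for 5 classes.
    "SegFormer": SegformerForSemanticSegmentation.from_pretrained(
        "nvidia/segformer-b2-finetuned-ade-512-512",
        num_labels=NUM_CLASSES, ignore_mismatched_sizes=True,
    ),
    "BEiT": BeitForSemanticSegmentation.from_pretrained(
        "microsoft/beit-base-finetuned-ade-640-640",
        num_labels=NUM_CLASSES, ignore_mismatched_sizes=True,
    ),
    "DPT": DPTForSemanticSegmentation.from_pretrained(
        "Intel/dpt-large-ade",
        num_labels=NUM_CLASSES, ignore_mismatched_sizes=True,
    ),
}
```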

The Adam optimization algorithm was used in the training, with a learning rate of 0.0001. The entire training was carried out over 40 epochs. Although an early stop feature was implemented to halt training if there was no improvement in validation loss over a certain period, this feature was not activated in this research.

Training and validation losses were obtained at every epoch, and the progress of training was monitored through visualization. The model's weights were saved whenever the validation loss improved, and intermediate model weights were also saved every 10 epochs.
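A minimal training-loop sketch consistent with the settings above (Adam, learning rate 0.0001, 40 epochs, best-model and 10-epoch checkpoints, early stopping) is shown below; the loss function, patience value, and data-loader construction are assumptions.

```python
# Sketch of the described training procedure; not the authors' code.
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, device="cuda",
          epochs=40, lr=1e-4, patience=10):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()  # assumed loss function
    best_val_loss, epochs_without_improvement = float("inf"), 0

    for epoch in range(1, epochs + 1):
        model.train()
        for images, masks in train_loader:
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            # Assumes model(images) returns raw logits of shape (N, C, H, W), as
            # segmentation_models_pytorch models do; Hugging Face segmentation
            # models return an output object whose .logits must be upsampled first.
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()

        # Validation loss drives checkpointing and early stopping.
        model.eval()
        val_loss, n_batches = 0.0, 0
        with torch.no_grad():
            for images, masks in val_loader:
                images, masks = images.to(device), masks.to(device)
                val_loss += criterion(model(images), masks).item()
                n_batches += 1
        val_loss /= max(n_batches, 1)

        if val_loss < best_val_loss:
            best_val_loss, epochs_without_improvement = val_loss, 0
            torch.save(model.state_dict(), "best_model.pt")
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # early stop (reported as not triggered in this study)

        if epoch % 10 == 0:
            torch.save(model.state_dict(), f"model_epoch_{epoch}.pt")
```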

5. Evaluation metrics and statistical analysis

The Dice coefficient and intersection over union (IoU) were calculated for each image to evaluate the performance of the CNN and transformer models used for the segmentation task. A confusion matrix method was used to measure the performance of the segmentation models. The following formulas were used for these metrics:

  • Dice coefficient=(2×TP)/[(TP+FN)+(TP+FP)];

  • IoU=TP/(TP+FN+FP);

  • Accuracy=(TP+TN)/(TP+FN+FP+TN);

where TP=true-positive, FN=false-negative, FP=false-positive, and TN=true-negative.
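A short NumPy sketch of these per-class, per-image metrics, computed directly from the confusion-matrix counts defined above (class averaging and any ignore rules are left out for brevity):

```python
# Per-class Dice, IoU, and accuracy for one image, from predicted and ground-truth
# class-index maps of the same shape.
import numpy as np

def per_class_metrics(pred: np.ndarray, target: np.ndarray, class_idx: int):
    p, t = (pred == class_idx), (target == class_idx)
    tp = np.logical_and(p, t).sum()
    fp = np.logical_and(p, ~t).sum()
    fn = np.logical_and(~p, t).sum()
    tn = np.logical_and(~p, ~t).sum()

    dice = (2 * tp) / ((tp + fn) + (tp + fp) + 1e-8)
    iou = tp / (tp + fn + fp + 1e-8)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return dice, iou, accuracy
```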

To compare the performance of deep learning segmentation models, we utilized bootstrap confidence intervals for key performance metrics. Bootstrap sampling is a non-parametric method that allows estimation of the distribution of a statistic by resampling with replacement from the observed data. Bootstrap sampling was performed for each evaluation metric, and 200 bootstrap samples were generated from the observed data by sampling with replacement. The performance metric was calculated for each bootstrap sample. The 95% confidence intervals were calculated by determining the 2.5th and 97.5th percentiles of the bootstrap sample means, and t-tests were performed to calculate p-values. Statistical analysis was performed using Python 3.9 and SciPy 1.1. A p-value <0.05 was considered statistically significant.
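The bootstrap procedure can be sketched as follows; the resample count (200), the percentile 95% confidence interval, and the t-test follow the description above, while the array names and random seed are illustrative.

```python
# Bootstrap confidence interval of a mean per-image score, plus a t-test
# between two groups of scores (SciPy).
import numpy as np
from scipy import stats

def bootstrap_ci(scores, n_boot=200, alpha=0.05, seed=0):
    """95% percentile CI of the mean of per-image scores."""
    scores = np.asarray(scores)
    rng = np.random.default_rng(seed)
    boot_means = [
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_boot)
    ]
    lower = np.percentile(boot_means, 100 * (alpha / 2))
    upper = np.percentile(boot_means, 100 * (1 - alpha / 2))
    return float(np.mean(boot_means)), (float(lower), float(upper))

# Hypothetical usage with pooled per-image Dice scores of the two groups:
# cnn_dice, transformer_dice = np.array([...]), np.array([...])
# mean, ci = bootstrap_ci(cnn_dice)
# t_stat, p_value = stats.ttest_ind(cnn_dice, transformer_dice)
```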

RESULTS

In the 150 patients with prostate cancer who underwent RARP, the mean age was 66.2 years, the mean prostate-specific antigen level was 9.0 ng/mL, and the mean prostate volume was 32.6 mL. Fifty patients (33.3%) had locally-advanced prostate cancer (pathologic T stage 3 or higher) and 25 patients (16.7%) had high-grade (Gleason score 8 or higher) disease. The clinicopathological characteristics of the enrolled patients are summarized in Table 1.

Table 1. Baseline demographic and clinical characteristics of the study patients (n=150).

Characteristic                         Value
Age (y)                                66.2
Body mass index (kg/m2)                25.1
Prostate-specific antigen (ng/mL)      9.0
Prostate volume (mL)                   32.6
Extracapsular extension                50 (33.3)
Seminal vesicle invasion               17 (11.3)
Pathologic Gleason score
  6                                    21 (14.0)
  3+4                                  63 (42.0)
  4+3                                  41 (27.3)
  8                                    7 (4.7)
  9–10                                 18 (12.0)

Values are presented as mean only or number (%).

Our dataset consisted of 3,000 ground truth images extracted from the videos. After preprocessing, the images were split into training and validation sets at a ratio of 4:1: a total of 2,400 images were used for training and 600 for validation.

Fig. 2 illustrates the training and validation loss curves for the Dice coefficients of all six models examined in this study, while Table 2 provides a comprehensive overview of the Dice scores. The mean overall Dice coefficients for the CNN-based architectures—DeepLabV3, MANet, and U-Net++—were 0.938, 0.944, and 0.930, respectively. For the transformer-based models—SegFormer, BEiT, and DPT—the mean overall Dice coefficients were 0.919, 0.916, and 0.940, respectively. In terms of mean IoU, the CNN-based models DeepLabv3, MANet, and U-Net++ achieved scores of 0.856, 0.876, and 0.835, respectively. The transformer-based architectures SegFormer, BEiT, and DPT demonstrated mean IoU scores of 0.761, 0.690, and 0.750, respectively.

Fig. 2. Training and validation loss curves.


Table 2. Comparative results (mean Dice scores) of the different models on the validation dataset.

Model                       Overall   Surgical instruments   Bladder   Prostate   Vas     Seminal vesicle
CNN models
  DeepLabV3                 0.938     0.970                  0.960     0.910      0.857   0.912
  MANet                     0.944     0.970                  0.963     0.928      0.901   0.915
  U-Net++                   0.930     0.968                  0.957     0.878      0.850   0.868
Transformer-based models
  SegFormer                 0.919     0.957                  0.943     0.825      0.689   0.691
  BEiT                      0.916     0.956                  0.941     0.753      0.547   0.456
  DPT                       0.940     0.972                  0.957     0.842      0.644   0.459

CNN, convolutional neural network.

The mean Dice scores used for the identification of classes of surgical instruments, bladder, prostate, vas deferens, and seminal vesicle are shown in Table 2. These results showed that both the CNN and transformer models were accurate in making predictions in the segmentation task. However, when the results were analyzed by organ, the CNN models were superior to the transformer models in segmenting the prostate, vas deferens, and seminal vesicle.

Based on the bootstrap 95% confidence intervals, there were significant differences in the performance metrics between CNN and transformer models (Supplementary Material, Supplementary Table 1). The average performance metrics obtained from bootstrap sampling appeared to be better than the metrics obtained from the original validation set. Notably, CNN models demonstrated non-overlapping and higher confidence intervals than those of transformer-based models, and they particularly excelled in IoU comparisons and organ segmentation performance. To assess the overall difference between the two architectural approaches, t-tests were conducted for both overall IoU and Dice scores, encompassing all three models from each category simultaneously. This analysis yielded p-values of less than 0.001 for both Dice scores and IoU, indicating statistically significant differences in overall performance between the two sets of architectural models. These results further support the superior performance of CNN-based models in this specific application, demonstrating their advantage in terms of both similarity (Dice) and overlap (IoU) measures.

DISCUSSION

In the present study, we explored the application of segmentation models in robot-assisted laparoscopic surgery. In robotic surgery, semantic segmentation could possibly play important roles in assessment of surgical technical skill, detection of anatomy, and navigation [19]. Robot-assisted prostatectomy is the main treatment for non-metastatic prostate cancer, but it has a steep learning curve [20] and the organs involved in the surgery, including the bladder, prostate, Denonvilliers’ fascia, vas deferens, and seminal vesicle, are often difficult to distinguish intraoperatively because of complex anatomy and similar tissue textures [21,22].

In deep learning for computer vision, semantic segmentation is an important area of research [23]. As deep learning algorithms have developed, numerous neural networks for semantic segmentation have been introduced, and as such, the “state-of-the-art” model is continuously being replaced.

In the past decade, most computer vision research focused on improving the design of CNNs, and U-Net has been the most representative model in the medical domain. U-Net was proposed on the basis of fully convolutional networks [24] and has seen considerable use in medical imaging analysis. U-Net consists of an encoder-decoder structure: the encoder repeatedly merges layers to extract features and gradually reduce the spatial dimension, while the decoder gradually re-establishes the spatial dimension and the target detail from the extracted features [24]. Differing from traditional CNNs, U-Net adopts a skip connection strategy that directly reuses shallow features: feature maps from the downsampling path of the encoder are passed to the corresponding upsampling stages of the decoder. To achieve a more refined reconstruction, this strategy is applied to shallow feature information at all scales. Furthermore, many studies exploring the potential of U-Net models in the medical domain have modified the U-Net architecture, including structural changes [6].
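A minimal PyTorch sketch of the skip-connection idea described above, in which an encoder feature map is concatenated with the upsampled decoder feature map before further convolution (channel counts and shapes are illustrative, not taken from any specific U-Net variant):

```python
# One U-Net-style decoder block with a skip connection.
import torch
import torch.nn as nn

class SkipDecoderBlock(nn.Module):
    def __init__(self, in_channels, skip_channels, out_channels):
        super().__init__()
        self.upsample = nn.ConvTranspose2d(in_channels, in_channels // 2,
                                           kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels // 2 + skip_channels, out_channels,
                      kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.upsample(x)             # re-establish spatial resolution
        x = torch.cat([x, skip], dim=1)  # skip connection: reuse shallow encoder features
        return self.conv(x)

# e.g. decoder = SkipDecoderBlock(256, 128, 128)
# out = decoder(torch.randn(1, 256, 40, 40), torch.randn(1, 128, 80, 80))
```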

However, although their convolutional layers capture image details well, CNN-based models have relative shortcomings in accessing global and long-range semantic information [6]. The transformer is one of the most widely used models in the natural language processing field [25], and Dosovitskiy et al. [4] proposed a pioneering model that applies a transformer directly to sequences of image patches, demonstrating that it can be successfully applied to image classification. With the development of the vision transformer [4], computer vision researchers began to actively apply transformers to medical image segmentation. Early on, TransUNet showed good segmentation performance on the abdominal organs and heart [26]. Following this, numerous transformer-based models have been introduced for medical image segmentation of body organs, including the abdominal organs, heart, and lung [6]. However, little is known about whether CNN or transformer models are more appropriate for surgical image segmentation.

In this study, we demonstrated the potential effectiveness of CNN- and transformer-based segmentation models in robotic surgery procedures. The performance of the models was similar between the two architectures, with overall Dice scores of over 90%. When analyzing the accuracy by class, segmentation of the surgical instruments and bladder had an accuracy of approximately 98% in both CNN- and transformer-based models. However, the CNN models had accuracies of 90%–95% for the prostate, vas, and seminal vesicle, whereas the transformer-based models had accuracies of 50%–90%.

Considering that segmentation in the surgical domain requires much more sophistication and precision than in other domains, our results suggest that further improvement in the segmentation of specific organs is needed and that CNNs may be superior to transformer-based models for surgical segmentation of the prostate, vas, and seminal vesicle. In addition, the results for the prostate, vas deferens, and seminal vesicle, which are the key organs in prostate surgery, indicate that segmentation accuracy decreases when these structures are adjacent to each other and their boundaries are difficult to distinguish clearly.

In surgery, unusual anatomy and unexpected events, such as anatomical variations and massive bleeding, often occur. Interestingly, we found that the performance of CNN-based models was better than that of transformer-based models when severe bleeding (n=7) or unusual prostate morphology (n=5) was present in our data (Fig. 3). Because the number of such cases was small, it is difficult to draw firm conclusions, but the CNN-based models recognized organs more accurately than the transformer-based models (Dice scores of 0.870 vs 0.712, respectively). Although the transformer architecture has recently received more attention from AI researchers, these results suggest that conventional CNNs may be more appropriate for surgical image segmentation. In addition, considering that the models ultimately need to be applied to real-time video, a significant advantage of the CNN architecture in surgical segmentation is its lower computational demand. This reduced need for computing power not only makes the model more efficient but also decreases transmission latency. As a result, a CNN-based approach may process and transmit segmented images more quickly, which is crucial for maintaining the real-time responsiveness required in surgical applications.

Fig. 3. Examples of segmentation results in cases with severe bleeding and a large prostate with median lobe adenoma.


There are several limitations to this study. The primary limitation is its relatively small sample size. Visual transformers, as documented since their inception, typically require substantially larger datasets to achieve their full performance potential [4,6]. Consequently, our comparison between transformer-based and CNN-based models, based on this limited dataset, carries inherent constraints, and any conclusions should be interpreted cautiously. To address this limitation and gain more definitive insights, future research should focus on utilizing significantly larger datasets. Such expanded studies would allow a more comprehensive and accurate assessment of the differences in performance between these architectural approaches in medical imaging tasks. However, it is important to acknowledge the considerable challenges of acquiring such extensive datasets, particularly in the medical domain. The most significant obstacle is likely to be the procurement of expertly annotated data: obtaining precise, high-quality annotations from medical specialists is both time-consuming and resource-intensive. The scarcity of qualified experts, combined with the meticulous nature of the annotation process, presents a substantial hurdle in scaling up datasets to the level required for a thorough evaluation of visual transformers in medical imaging applications. This limitation underscores the need for innovative approaches to data collection and annotation, as well as potential collaborations across medical institutions to pool resources and expertise for future large-scale studies. Second, this study used data from anterior-approach prostatectomy performed by four surgeons; therefore, whether the findings generalize to other surgical approaches was not investigated.

Despite these limitations, we believe that this study provides a valuable contribution because it is the first to investigate the effectiveness of segmentation of anatomical organs in robotic prostate surgery and to compare the performance of CNN and transformer models in semantic segmentation. While the direct clinical application of these particular segmentation models may be limited at present, our research demonstrates the significant potential of deep learning in surgical contexts. The performance and capabilities of the deep learning models revealed in our study serve as a proof of concept, highlighting the promise of AI in robotic surgical applications. In the long term, the deep learning techniques explored here could be utilized in various fields, such as surgeon skill assessment tools, advanced training programs, AI-enhanced surgical procedures, and even autonomous surgery. As the field progresses, the integration of these deep learning approaches with other emerging technologies such as robotics, augmented reality, and generative AI could potentially revolutionize surgical practices.

CONCLUSIONS

Deep learning models demonstrated relatively accurate segmentation of surgical instruments and anatomical structures imaged during robotic prostate surgery, with both CNN and transformer models showing strong performance. However, CNN-based models may be more robust than transformer-based models for segmenting the prostate, vas deferens, and seminal vesicle and for recognizing anatomical organs in unusual cases. Considering that visual transformers require large amounts of data, further research with larger datasets is needed.

Footnotes

CONFLICTS OF INTEREST: The authors have nothing to disclose.

FUNDING: This work was supported by the Technology Development Program (S3288928) funded by the Ministry of SMEs and Startups (MSS, Korea), the Korean Society for Urological Ultrasonography (KSUU 2023-01), and Hallym University Research Fund (HURF-2023-24).

AUTHORS’ CONTRIBUTIONS:
  • Research conception and design: Sahyun Pak.
  • Data acquisition: Sahyun Pak, Sung Gon Park, Wonchul Lee, Sung Tae Cho, Young Goo Lee, and Hanjong Ahn.
  • Statistical analysis: Sahyun Pak and Jeonghyun Park.
  • Data analysis and interpretation: Sahyun Pak and Jeonghyun Park.
  • Drafting of the manuscript: Sahyun Pak.
  • Critical revision of the manuscript: Sahyun Pak and Hanjong Ahn.
  • Obtaining funding: Sahyun Pak and Jun Ho Lee.
  • Administrative, technical, or material support: Sahyun Pak, Hong Rock Choi, and Jun Ho Lee.
  • Supervision: Sahyun Pak.
  • Approval of the final manuscript: all authors.

SUPPLEMENTARY MATERIALS

Supplementary materials can be found via https://doi.org/10.4111/icu.20240159.

Supplementary Material
icu-65-551-s001.pdf (37.9KB, pdf)
Supplementary Table 1

Bootstrapped confidence intervals of performance metrics

icu-65-551-s002.pdf (15.2KB, pdf)

References

  • 1. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–444. doi: 10.1038/nature14539.
  • 2. Guo Y, Liu Y, Georgiou T, Lew MS. A review of semantic segmentation using deep neural networks. Int J Multimed Info Retr. 2018;7:87–93.
  • 3. Chai J, Zeng H, Li A, Ngai EWT. Deep learning in computer vision: a critical review of emerging techniques and application scenarios. Mach Learn Appl. 2021;6:100134.
  • 4. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T. An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929 [Preprint]. 2020 [cited 2024 May 11].
  • 5. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2009. p. 248–255.
  • 6. Xiao H, Li L, Liu Q, Zhu X, Zhang Q. Transformers in medical image segmentation: a review. Biomed Signal Process Control. 2023;84:104791.
  • 7. Chen X, Wang X, Zhang K, Fung KM, Thai TC, Moore K, et al. Recent advances and clinical applications of deep learning in medical image analysis. Med Image Anal. 2022;79:102444. doi: 10.1016/j.media.2022.102444.
  • 8. Siegel RL, Miller KD, Wagle NS, Jemal A. Cancer statistics, 2023. CA Cancer J Clin. 2023;73:17–48. doi: 10.3322/caac.21763.
  • 9. Würnschimmel C, Wenzel M, Wang N, Tian Z, Karakiewicz PI, Graefen M, et al. Radical prostatectomy for localized prostate cancer: 20-year oncological outcomes from a German high-volume center. Urol Oncol. 2021;39:830.e17–830.e26. doi: 10.1016/j.urolonc.2021.04.031.
  • 10. Obek C, Doganca T, Argun OB, Kural AR. Management of prostate cancer patients during COVID-19 pandemic. Prostate Cancer Prostatic Dis. 2020;23:398–406. doi: 10.1038/s41391-020-0258-7.
  • 11. Kim SP, Karnes RJ, Gross CP, Meropol NJ, Van Houten H, Abouassaly R, et al. Contemporary national trends of prostate cancer screening among privately insured men in the United States. Urology. 2016;97:111–117. doi: 10.1016/j.urology.2016.06.067.
  • 12. Park SG, Park J, Choi HR, Lee JH, Cho ST, Lee YG, et al. Deep learning model for real-time semantic segmentation during intraoperative robotic prostatectomy. Eur Urol Open Sci. 2024;62:47–53. doi: 10.1016/j.euros.2024.02.005.
  • 13. Chen LC, Papandreou G, Schroff F, Adam H. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587 [Preprint]. 2017 [cited 2024 May 11].
  • 14. Chen B, Xia M, Qian M, Huang J. MANet: a multi-level aggregation network for semantic segmentation of high-resolution remote sensing images. Int J Remote Sens. 2022;43:5874–5894.
  • 15. Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J. UNet++: redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans Med Imaging. 2020;39:1856–1867. doi: 10.1109/TMI.2019.2959609.
  • 16. Xie E, Wang W, Yu Z, Anandkumar A, Alvarez JM, Luo P. SegFormer: simple and efficient design for semantic segmentation with transformers. In: Ranzato M, Beygelzimer A, Dauphin Y, Liang PS, Wortman Vaughan J, editors. Advances in Neural Information Processing Systems 34. NeurIPS; 2021.
  • 17. Bao H, Dong L, Piao S, Wei F. BEiT: BERT pre-training of image transformers. arXiv:2106.08254 [Preprint]. 2021 [cited 2024 May 11].
  • 18. Ranftl R, Bochkovskiy A, Koltun V. Vision transformers for dense prediction. arXiv:2103.13413 [Preprint]. 2021 [cited 2024 May 11].
  • 19. Chadebecq F, Vasconcelos F, Mazomenos E, Stoyanov D. Computer vision in the surgical operating room. Visc Med. 2020;36:456–462. doi: 10.1159/000511934.
  • 20. Takeshita N, Sakamoto S, Kitaguchi D, Takeshita N, Yajima S, Koike T, et al. Deep learning-based seminal vesicle and vas deferens recognition in the posterior approach of robot-assisted radical prostatectomy. Urology. 2023;173:98–103. doi: 10.1016/j.urology.2022.12.006.
  • 21. Bravi CA, Tin A, Vertosick E, Mazzone E, Martini A, Dell'Oglio P, et al. The impact of experience on the risk of surgical margins and biochemical recurrence after robot-assisted radical prostatectomy: a learning curve study. J Urol. 2019;202:108–113. doi: 10.1097/JU.0000000000000147.
  • 22. Brunckhorst O, Volpe A, van der Poel H, Mottrie A, Ahmed K. Training, simulation, the learning curve, and how to reduce complications in urology. Eur Urol Focus. 2016;2:10–18. doi: 10.1016/j.euf.2016.02.004.
  • 23. Cai L, Gao J, Zhao D. A review of the application of deep learning in medical image classification and segmentation. Ann Transl Med. 2020;8:713. doi: 10.21037/atm.2020.02.44.
  • 24. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells W, Frangi A, editors. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Springer; 2015.
  • 25. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, et al., editors. Advances in Neural Information Processing Systems 30. NeurIPS; 2017.
  • 26. Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, et al. TransUNet: transformers make strong encoders for medical image segmentation. arXiv:2102.04306 [Preprint]. 2021 [cited 2024 May 11].
