Abstract
Automated performance metrics (APMs) objectively measure surgeon performance during a robot-assisted radical prostatectomy (RARP). Machine learning (ML) analyses have shown that APMs, especially those recorded during the vesico-urethral anastomosis (VUA) of the RARP, are predictive of long-term outcomes such as continence recovery time. This study focuses on APMs during the VUA, specifically at the stitch versus sub-stitch level, to distinguish surgeon experience.
During the VUA, APMs recorded by a systems data recorder (Intuitive Surgical) were reported for each overall stitch (Ctotal) and its individual components: needle handling/targeting (C1), needle driving (C2), and suture cinching (C3) (Figure 1A). These metrics were organized into three datasets (GlobalSet [whole stitch], RowSet [independent sub-stitches], and ColumnSet [associated sub-stitches]) (Figure 1B) and applied to three ML models (AdaBoost, Gradient Boosting, and Random Forest) to solve two classification tasks: experts (≥100 cases) vs. novices (<100 cases); and ordinary-experts (≥100 and <2000 cases) vs. super-experts (≥2000 cases). Classification accuracies were compared using analysis of variance (ANOVA). Input features were evaluated using the Jaccard index.
From 68 VUAs, we analyzed 1,570 stitches broken down into 4,708 sub-stitches. For both classification tasks, ColumnSet best distinguished experts (n=8) vs. novices (n=9) and ordinary-experts (n=5) vs. super-experts (n=3), at accuracies of 0.774 and 0.844, respectively. Feature ranking highlighted EndoWrist® articulation and needle handling/targeting as most important in classification.
Surgeon performance measured by APMs at a granular sub-stitch level distinguishes expertise more accurately than summary APMs over whole stitches.
TOC Statement (20-AI-04)
Our study demonstrates that surgeon experience can be predicted via machine learning applied to detailed, granular sub-stitch data. This finding highlights future directions for surgical education as well as evaluation.
Introduction
Surgical skill and technique have been shown to correlate with post-operative clinical outcomes1–4. Effective surgical evaluation and instruction of surgical trainees are critical to achieving excellent healthcare outcomes. At the same time, adequate evaluation of surgical skill and expertise requires extensive supervision and remains a challenge in surgical education5,6.
For robotic surgery, various technical assessment tools can be used to measure surgical performance. Manual assessment, which includes general and procedure-specific evaluations, is limited by difficulties in scaling, the time required, and limited interrater reliability7–9. Automated assessment, a developing area of research, can measure and quantify surgical performance directly. Automated performance metrics (APMs) are one such tool, derived directly from kinematic data and robotic systems data10.
In our previous work, APMs summarized across a whole procedure or whole steps have been shown to differentiate surgeon expertise and case volume. During the vesico-urethral anastomosis (VUA) of a robot-assisted radical prostatectomy (RARP), for instance, experts (caseload ≥100 cases) and novices (<100 cases) demonstrated statistically different APM profiles11. Indeed, super-experts (≥2000 cases) demonstrated significantly different APMs from ordinary-experts (≥100 but ≤750 cases), along with statistically significantly improved perioperative outcomes12. Further application of machine learning (ML) has shown that APMs during the VUA are top features in deep learning models predicting clinical outcomes such as time to continence after RARP13.
As ML becomes more prevalent in medicine, optimizing datasets for improved analysis is paramount. Larger datasets have been associated with improved ML accuracy14,15. In addition, improved label granularity in supervised learning, that is, a more detailed label (e.g., Persian cat vs. cat), can improve classification accuracy16. We apply this principle of data granularity to the present analysis of surgical experience in RARP. Previous studies focused on APMs summated over an entire step (e.g., the VUA) or an entire procedure; we now break down a step of the RARP to the sub-stitch level.
Herein, we evaluate with ML algorithms whether more granular APMs (at the sub-stitch level) improve classification of surgeon experience. We report APMs during the VUA for each overall stitch (Ctotal) and its individual sub-stitch components: needle handling/targeting (C1), needle driving (C2), and suture cinching (C3) (Figure 1A).
Figure 1. A) Sub-stitch components.
Suturing can generally be broken into three components: a) "needle handling" with needle driver instruments, b) "needle driving" through tissue, and c) "suture cinching". B) Dataset organization. Automated performance metrics during the VUA are organized into three datasets that differ in granularity. GlobalSet contains the least amount of data, as whole stitches are analyzed. RowSet follows, with inclusion of sub-stitch components. ColumnSet contains the greatest information per data point, as sub-stitch components are evaluated in the context of the same stitch.
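To make the three levels of granularity concrete, below is a minimal sketch in Python (pandas) of how per-sub-stitch APM records could be arranged into GlobalSet, RowSet, and ColumnSet. The table schema, the feature names, and the use of summation for the whole-stitch summary are illustrative assumptions, not the study's actual pipeline.

```python
import pandas as pd

# Hypothetical per-sub-stitch APM table: one row per sub-stitch, tagged with
# its parent stitch and suturing phase (C1/C2/C3). Feature names are made up.
apms = pd.DataFrame({
    "stitch_id":   [0, 0, 0, 1, 1, 1],
    "phase":       ["C1", "C2", "C3", "C1", "C2", "C3"],
    "path_length": [12.1, 8.4, 5.0, 11.7, 9.2, 4.6],
    "velocity":    [1.3, 1.1, 0.9, 1.4, 1.0, 0.8],
})
features = ["path_length", "velocity"]

# GlobalSet: one row per whole stitch; sub-stitch detail is collapsed
# (summation here stands in for whatever whole-stitch summary is used).
global_set = apms.groupby("stitch_id")[features].sum()

# RowSet: one row per sub-stitch, each treated as an independent event.
row_set = apms[features]

# ColumnSet: sub-stitches of the same stitch placed side by side, so each
# row carries the C1, C2, and C3 features of one stitch together.
column_set = apms.pivot(index="stitch_id", columns="phase", values=features)
column_set.columns = [f"{feat}_{phase}" for feat, phase in column_set.columns]
```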
Materials and Methods
Study Design
Under an institutional review board-approved protocol, synchronized surgical video and systems events data during the VUA step of the RARP were recorded for consecutive cases from 2016 to 2017, directly from da Vinci Si and Xi systems, using a custom video and data recorder (Intuitive Surgical). RARP cases performed without the da Vinci system data recorder were excluded.
Participants
Participants in the present study were 17 faculty surgeons, fellows, and residents. Surgeons classified a priori as experts (n=8; ≥100 prior cases) were compared to novices (n=9; <100 cases). Experts were further subdivided into super-experts (n=3; ≥2000 cases) and ordinary-experts (n=5; ≥100 but ≤750 cases).
Data Collection
Previously developed and validated APMs were derived from kinematic data (e.g. instrument travel time, path length, velocity, EndoWrist® movements) and system events data (e.g. camera movements, third arm usage). Video review of each case synchronized the timestamp of each individual stitch and sub-stitch with corresponding systems data to derive the APMs.
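As an illustration of how such metrics follow from the kinematic stream, below is a minimal sketch assuming the recorder yields sampled 3D instrument-tip positions; the metric definitions and the sampling rate are assumptions, not the validated APM specifications.

```python
import numpy as np

def path_length(positions: np.ndarray) -> float:
    """Total distance traveled by the instrument tip over a (sub-)stitch.

    positions: (n, 3) array of x/y/z tip samples.
    """
    steps = np.diff(positions, axis=0)          # displacement between samples
    return float(np.linalg.norm(steps, axis=1).sum())

def mean_velocity(positions: np.ndarray, sample_rate_hz: float = 50.0) -> float:
    """Mean tip speed; the 50 Hz sampling rate is a placeholder assumption."""
    steps = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    return float(steps.mean() * sample_rate_hz)
```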
These metrics were organized into three datasets that differed in granularity: GlobalSet (whole stitches with no sub-stitch designation), RowSet (sub-stitches reported as independent events), and ColumnSet (sub-stitches associated with their parent stitch) (Figure 1B). Each dataset was applied to three ML models (AdaBoost, Gradient Boosting, and Random Forest) to solve two classification tasks: Comparison 1, experts versus novices; and Comparison 2, ordinary-experts versus super-experts. We utilized three ML models based on ensemble learning, a method that combines predictions from multiple simple classifiers to obtain a final prediction with better classification performance. Random Forest uses a bagging strategy, generating independent decision trees that contribute to a final majority vote. AdaBoost is a boosting model that updates the weights of data points, weighting each weak classifier in sequence according to its errors, with the final prediction being a weighted majority. Gradient Boosting is a boosting model that trains the current weak classifier to learn the residual error of the previous step, with the final prediction being the sum of all predictions. We randomly selected 80% of each dataset as the training set and used the remainder as the test set, maintaining the partition with a fixed random seed. Mean classification accuracies from the different model/dataset combinations for the two classification tasks were compared using analysis of variance (ANOVA). The stability of the feature importance ranking from each ML model on each classification task was evaluated using the Jaccard index and weighted to produce the final feature importance rank.
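The sketch below outlines this pipeline as described above: a fixed-seed 80/20 split, the three scikit-learn ensemble classifiers, ANOVA over accuracy distributions, and a top-k Jaccard index for ranking stability. Hyperparameters, the number of repeats, and the exact weighting of Jaccard scores into a final rank are assumptions.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

MODELS = {
    "AdaBoost": AdaBoostClassifier,
    "Gradient Boosting": GradientBoostingClassifier,
    "Random Forest": RandomForestClassifier,
}

def evaluate(X, y, n_repeats=10, seed=42):
    """Accuracies and feature-importance rankings for the three models."""
    # A fixed random seed keeps the 80/20 partition identical across runs.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.8, random_state=seed)
    results = {}
    for name, Model in MODELS.items():
        accs, rankings = [], []
        for rep in range(n_repeats):   # repeats yield an accuracy distribution
            clf = Model(random_state=rep).fit(X_tr, y_tr)
            accs.append(clf.score(X_te, y_te))
            # Feature indices sorted from most to least important.
            rankings.append(np.argsort(clf.feature_importances_)[::-1])
        results[name] = (accs, rankings)
    return results

def jaccard_top_k(rank_a, rank_b, k=10):
    """Jaccard index of two rankings' top-k feature sets (stability measure)."""
    a, b = set(rank_a[:k]), set(rank_b[:k])
    return len(a & b) / len(a | b)

# ANOVA across datasets for one model would then look like:
#   _, p = f_oneway(accs_globalset, accs_rowset, accs_columnset)
```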
Results
The median number of cases performed by a novice was 20 (IQR: 5–40); the median number of cases performed by an expert was 275 (IQR: 150–2,000). The ordinary-expert and super-expert subcategories featured median caseloads of 150 and 2,000, respectively. We analyzed 68 VUAs, which consisted of 1,570 stitches further divided into 4,708 sub-stitches. Of the sub-stitches, 1,571 were needle handling/targeting (C1), 1,568 were needle driving (C2), and 1,569 were suture cinching (C3). Thirty APMs were analyzed per sub-stitch to classify experts vs. novices and ordinary-experts vs. super-experts.
Comparison 1
When differentiating experts from novices, we found a hierarchy of accuracy across datasets that was consistent in each ML model. ColumnSet, which provides sub-stitch details along with their association to specific stitches, achieved the highest accuracy within each ML model (Table 1). RowSet followed in accuracy, and GlobalSet had the lowest accuracy; the one exception was GlobalSet outperforming RowSet under the Random Forest model. Overall, the best performing dataset/model combination was ColumnSet analyzed by Random Forest, with a prediction accuracy of 0.733 ± 0.005. The worst performing combination was GlobalSet analyzed by Gradient Boosting, with an accuracy of 0.672 ± 0.001.
Table 1.
Performance accuracy when distinguishing experts vs. novices (Comparison 1).
| Machine learning model | Dataset | Accuracy | Dataset | Accuracy | p value |
|---|---|---|---|---|---|
| AdaBoost | ColumnSet | 0.72747 ± 0.01641 | RowSet | 0.71218 ± 0.00863 | p = 0.003 |
| AdaBoost | RowSet | 0.71218 ± 0.00863 | GlobalSet | 0.69879 ± 0.01829 | p < 0.001 |
| Gradient Boosting | ColumnSet | 0.72675 ± 0.01033 | RowSet | 0.72094 ± 0.00619 | p = 0.13 |
| Gradient Boosting | RowSet | 0.72094 ± 0.00619 | GlobalSet | 0.67241 ± 0.00111 | p < 0.001 |
| Random Forest | ColumnSet | 0.73274 ± 0.00528 | RowSet | 0.71592 ± 0.00300 | p < 0.001 |
| Random Forest | RowSet | 0.71592 ± 0.00300 | GlobalSet | 0.72751 ± 0.00908 | p < 0.001 |
Comparison 2
In the second comparison, ColumnSet analyzed by AdaBoost labeled ordinary-experts and super-experts most accurately, with a prediction accuracy of 0.801 ± 0.014 (Table 2). The worst performing combination was GlobalSet analyzed by Gradient Boosting, with an accuracy of 0.759 ± 0.002.
Table 2.
Performance accuracy when distinguishing super-experts vs. ordinary-experts (Comparison 2).
| Machine learning model | Dataset | Accuracy | Dataset | Accuracy | p value |
|---|---|---|---|---|---|
| AdaBoost | ColumnSet | 0.801 ± 0.014 | RowSet | 0.772 ± 0.009 | p < 0.001 |
| AdaBoost | RowSet | 0.772 ± 0.009 | GlobalSet | 0.774 ± 0.009 | p = 0.14 |
| Gradient Boosting | ColumnSet | 0.770 ± 0.006 | RowSet | 0.784 ± 0.006 | p < 0.001 |
| Gradient Boosting | RowSet | 0.784 ± 0.006 | GlobalSet | 0.759 ± 0.002 | p < 0.001 |
| Random Forest | ColumnSet | 0.761 ± 0.007 | RowSet | 0.761 ± 0.004 | p = 0.959 |
| Random Forest | RowSet | 0.761 ± 0.004 | GlobalSet | 0.769 ± 0.009 | p < 0.001 |
We further found that the ML algorithms were more accurate in differentiating super-experts from ordinary-experts than in distinguishing experts from novices (p < 0.001).
Finally, feature ranking was performed on the ColumnSet data to identify which APMs contributed most to prediction accuracy. We noted two specific trends. Seven of the top 10 features differentiating experts and novices involved EndoWrist® articulation as opposed to other kinematic metrics (Figure 2). In comparing super-experts and ordinary-experts, seven of the top 10 features were APMs during the needle handling/targeting phase of suturing (Figure 3).
Figure 2. Top-ranked features distinguishing expert vs. novice (Comparison 1).
Wrist articulation metrics rank highly in differentiating novices and experts.
Figure 3. Top-ranked features distinguishing super-expert vs. ordinary-expert (Comparison 2).
Needle handling (C1) metrics rank highly in differentiating ordinary-experts and super-experts.
Discussion
The present study demonstrated that machine learning can accurately classify surgeon experience based on individual stitches and sub-stitches in the VUA of a RARP. We found differences not only between experts and novices but also between super-experts and ordinary-experts.
Interestingly, the ML models classified super-experts vs. ordinary-experts more accurately than experts vs. novices. This held true for every ML model constructed on each dataset (Table 2). Without overinterpreting this outcome, we believe the results confirm that the evolution of surgeon performance continues beyond the ordinary-expert level.
We found that the best performing models in Comparisons 1 and 2 used the ColumnSet dataset (individual sub-stitch data with further association to whole stitches), while the worst performing models used the GlobalSet dataset (no sub-stitch data). These results confirm our hypothesis that greater data granularity improves classification accuracy. GlobalSet provided the least information of the three datasets; compared to RowSet, ColumnSet provided additional context by grouping associated sub-stitches.
The feature ranking derived from Comparison 1 showed that EndoWrist® articulation metrics throughout C1, C2, and C3 were a major contributory factor in differentiating experts and novices. In contrast, the top APMs differentiating super-experts and ordinary-experts involved metrics during the needle handling phase of suturing. Interpretation of the top-ranking APMs may help guide education and learning. One possible takeaway is that novices should focus on instrument articulation throughout all aspects of suturing; as skills mature, attention should also be paid to refining needle handling at the start of every stitch.
The following study limitations should be acknowledged. Our study is based on the experience of a single center, with surgeons who may share a similar surgical style; external validation with an outside dataset is required. In addition, while the large datasets were analyzed by ML algorithms, the data were derived from manual segmentation of the sub-stitch phases, and scaling this task to analyze every surgeon's VUA remains time-consuming at present.
Future directions include automatic segmentation of stitches into sub-stitches with the assistance of ML and advances in computer vision. We also aim to correlate sub-stitch metrics with clinical outcomes (e.g., anastomotic leak).
In summary, our study demonstrates that surgeon performance measured by APMs, when reported at a detailed and granular sub-stitch level, more accurately distinguishes surgeon experience than summary APMs over whole stitches. Further investigation is warranted to translate these findings into formative feedback for surgeons and surgical trainees.
Highlights.
| Topic | Application of machine learning algorithms to predict surgeon experience |
|---|---|
| Purpose | To differentiate experts (≥100 cases) from novices (<100 cases), as well as super-experts (≥2000 cases) from ordinary-experts (≥100 and <2000 cases) |
| State-of-the-Art | Utilizing automated performance metrics (APMs; robotic kinematic and system events data) at stitch/sub-stitch levels |
| Knowledge Gaps | Explore the value of detailed APMs during suturing sub-stitch maneuvers, in contrast with previous APMs reported over specific steps of a procedure |
| Technology Gaps | Compare the performance of different machine learning models when presented with datasets of increasing granularity |
| Future Directions | This is foundational work to provide meaningful feedback to surgeons and learners in training. |
Acknowledgements
We would like to acknowledge Anthony Jarc (Intuitive Surgical Inc., Clinical Research, Norcross, United States) for processing of automated performance metrics.
Funding/Support: This study is supported in part by the National Institute of Biomedical Imaging and Bioengineering of the National Institutes of Health under Award Number K23EB026493 and by an Intuitive Surgical Clinical Research Grant.
Footnotes
COI/Disclosure: Andrew J. Hung has financial disclosures with Quantgene, Inc. (consultant), Mimic Technologies, Inc. (consultant), and Johnson & Johnson (consultant).
References
1. Birkmeyer JD, Finks JF, O'Reilly A, et al. Surgical skill and complication rates after bariatric surgery. N Engl J Med. 2013;369(15):1434–1442.
2. Goldenberg MG, Goldenberg L, Grantcharov TP. Surgeon performance predicts early continence after robot-assisted radical prostatectomy. Journal of Endourology. 2017;31(9):858–63.
3. Hogg ME, Zenati M, Novak S, et al. Grading of surgeon technical performance predicts postoperative pancreatic fistula for pancreaticoduodenectomy independent of patient-related variables. Annals of Surgery. 2016;264(3):482–91.
4. Fecso AB, Szasz P, Kerezov G, et al. The effect of technical performance on patient outcomes in surgery. Annals of Surgery. 2017;265(3):492–501.
5. Scott DJ, Rege RV, Bergen PC, et al. Measuring operative performance after laparoscopic skills training: edited videotape versus direct observation. Journal of Laparoendoscopic & Advanced Surgical Techniques. 2000;10(4):183–90.
6. Deal SB, Lendvay TS, Haque MI, et al. Crowd-sourced assessment of technical skills: an opportunity for improvement in the assessment of laparoscopic surgical skills. The American Journal of Surgery. 2016;211(2):398–404.
7. Raza SJ, Field E, Jay C, et al. Surgical competency for urethrovesical anastomosis during robot-assisted radical prostatectomy: development and validation of the robotic anastomosis competency evaluation. Urology. 2015;85(1):27–32.
8. Goh AC, Goldfarb DW, Sander JC, et al. Global evaluative assessment of robotic skills: validation of a clinical assessment tool to measure robotic surgical skills. The Journal of Urology. 2012;187(1):247–52.
9. Prebay ZJ, Peabody JO, Miller DC, et al. Video review for measuring and improving skill in urological surgery. Nature Reviews Urology. 2019;16(4):261–7.
10. Chen J, Cheng N, Cacciamani G, et al. Objective assessment of robotic surgical technical skill: a systematic review. The Journal of Urology. 2019;201(3):461–9.
11. Chen J, Oh PJ, Cheng N, et al. Use of automated performance metrics to measure surgeon performance during robotic vesicourethral anastomosis and methodical development of a training tutorial. The Journal of Urology. 2018;200(4):895–902.
12. Hung AJ, Oh PJ, Chen J, et al. Experts vs super-experts: differences in automated performance metrics and clinical outcomes for robot-assisted radical prostatectomy. BJU International. 2019;123(5):861–8.
13. Hung AJ, Chen J, Ghodoussipour S, et al. A deep-learning model using automated performance metrics and clinical features to predict urinary continence recovery after robot-assisted radical prostatectomy. BJU International. 2019;124(3):487–95.
14. Banko M, Brill E. Scaling to very large corpora for natural language disambiguation. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics; 2001. p. 26–33.
15. Halevy A, Norvig P, Pereira F. The unreasonable effectiveness of data. IEEE Intelligent Systems. 2009;24:8–12.
16. Chen Z, Ding R, Chin TW, et al. Understanding the impact of label granularity on CNN-based image classification. In: 2018 IEEE International Conference on Data Mining Workshops (ICDMW); 2018. p. 895–904.