Author manuscript; available in PMC: 2020 Sep 1.
Published in final edited form as: BJU Int. 2019 Mar 20;124(3):487–495. doi: 10.1111/bju.14735

Deep learning on automated performance metrics and clinical features to predict urinary continence recovery after robot-assisted radical prostatectomy

Andrew J Hung 1, Jian Chen 1, Saum Ghodoussipour 1, Paul J Oh 1, Zequn Liu 2, Jessica Nguyen 1, Sanjay Purushotham 3, Inderbir S Gill 1, Yan Liu 4
PMCID: PMC6706286  NIHMSID: NIHMS1014603  PMID: 30811828

Abstract

Objectives

To predict urinary continence recovery after robot-assisted radical prostatectomy (RARP) utilizing a deep learning (DL) model, which was then used to evaluate surgeons’ historical patient outcomes.

Subjects and Methods

Robotic surgical automated performance metrics (APMs) during RARPs, and patient clinicopathological and continence data, were captured prospectively from 100 contemporary RARPs. We trained a DL model (DeepSurv) to predict post-operative urinary continence. Model features were ranked based on their importance in prediction. We stratified eight surgeons based on the five top-ranked features. The top four surgeons were categorized in “Group 1/APMs” and the remaining four in “Group 2/APMs”. A separate historical cohort of RARPs (January 2015 to August 2016) performed by these two surgeon groups was then compared. Concordance Index (CI) and Mean Absolute Error (MAE) were used to measure the model’s prediction performance. Outcomes of historical cases were compared using the Kruskal-Wallis test, Chi-squared and Fisher’s exact test.

Results

Continence was attained in 79 patients (79%) after a median of 126 days. The DL model achieved CI of 0.6, MAE of 85.9 in predicting continence. APMs were ranked higher by the model than clinicopathological features. In the historical cohort, “Group 1/APM” patients had superior rates of urinary continence at three and six months postoperatively (47.5 vs 36.7%, p=0.034, and 68.3 vs 59.2%, p=0.047, respectively).

Conclusion

Utilizing APMs and clinicopathological data, the DeepSurv model was able to predict continence after RARP. In this feasibility study, surgeons with more efficient APMs had higher continence rates at three and six months after RARP.

Keywords: robotic surgical procedures, prostatectomy, artificial intelligence, urinary incontinence, quality of life

Introduction

Urinary continence is one of the key functional outcomes after radical prostatectomy [1]. Factors affecting postoperative urinary continence have been extensively investigated, including clinicopathological characteristics and nerve-sparing and reconstruction techniques [1]. Although recognized as influential, the impact of surgeon experience and skill on continence has not been well-established [1]. Goldenberg et al. found that during select steps of robot-assisted radical prostatectomy (RARP), surgical skills impact continence recovery [2]. Elsewhere in surgery, mounting evidence suggests that surgical technique impacts clinical outcomes [3].

To date, the gold standard method of weighing surgeon performance has been through determining prior surgical experience (caseload) or manual surgical evaluation by peer surgeons [4]. However, peer evaluation, typically utilizing validated assessment tools with pre-set objective criteria, is limited by subjectivity and high inter-observer variability [5, 6].

A fledgling alternative to manual assessment is automated performance metrics (APMs). These instrument motion tracking and events metrics are derived from computer-based data recording devices. Preliminary studies have shown that APMs can differentiate expert and novice surgeon performance in both lab and clinical settings [7–10]. Our group has demonstrated that APMs differentiate surgeon experience during steps of the RARP [7, 8]. Furthermore, we have demonstrated that APMs, with the aid of machine learning (ML) algorithms, can predict perioperative short-term clinical outcomes after RARP [11, 12].

In this study, we applied conventional regression analysis, ML algorithms, and deep learning (DL) techniques to APMs collected during RARPs and patient clinicopathological features to predict postoperative urinary continence. Based on high importance features selected through DL algorithms, we categorized and ranked our institution’s faculty surgeons by both prior RARP experience and efficiency of APMs. We then sought to determine if experience and APMs correlated with clinical outcomes congruently or independently of each other.

Subjects and Methods

Step 1 Urinary continence recovery prediction modeling and feature ranking

Following an institutional review board (IRB) approved protocol, we prospectively collected robot instrument motion tracking (e.g., moving time, distance traveled, wrist angulation) and system events (e.g., camera movement, energy use) data from eight surgeons during 161 RARPs (September 2016 to October 2017) utilizing a custom recording device attached to the da Vinci Surgical System (Intuitive Surgical). From this, we derived a set of 41 previously validated APMs (Table 1) [7, 8]. Each RARP was segmented into 12 steps based on our institution’s structured learning curriculum [13]. APMs were then reported for each step.

Table 1.

Automated performance metrics and patient clinicopathological features

Automated performance metrics
Time related metrics
 Time to complete the task
 Moving time of the right instrument
 Moving time of the left instrument
 Moving time of third instrument
 Moving time of the camera
 Time of no instrument or camera movement
 Time of the right instrument not moving during the task
 Time of the left instrument not moving during the task
 Time of the third instrument not moving during the task
 Time of the camera not moving during the task
Instrument kinematic metrics
 Path length of the right instrument
 Moving velocity of the right instrument
 Path length of the left instrument
 Moving velocity of left instrument
 Path length of the third instrument
 Path length of all three instruments
 Ratio of path length of right and left instruments
Camera movement metrics
 Path length of the camera
 Moving velocity of the camera
 Number of camera adjustments during the task
 Frequency of camera adjustment
 Mean of time of each camera movement
 Mean path length of each camera movement
 Mean of straight path length of each camera movement
System event metrics
 Master clutch usage during task
 Third arm swap during task
 Energy usage during task
 Frequency of master clutch usage
 Frequency of third arm swap
 Frequency of energy application
 Number of times surgeon’s head out of the console
EndoWrist® articulation metrics
 The total radians of the right instrument shaft rotation during the task
 The total radians of the right instrument wrist movement during the task
 The total radians of the right instrument jaw opening during the task
 The total radians of the left instrument shaft rotation during the task
 The total radians of the left instrument wrist movement during the task
 The total radians of the left instrument jaw opening during the task
 Right instrument articulation during the task
 Left instrument articulation during the task
 Angular velocity of the right instrument articulation
 Angular velocity of the left instrument articulation
Clinicopathological features

 Age
 BMI
 Pre-operative PSA
 Pre-operative biopsy Gleason score
 ASA
 Surgery time
 Lymph node dissection template (standard vs extended)
 Urethropexy
 Nerve sparing
 Prostatic median lobe
 Final pathology Gleason score
 Pathological stage
 Extracapsular extension
 Prostate volume
 Positive margins
 Radiation received

BMI: Body mass index; PSA: Prostate specific antigen; ASA: American Society of Anesthesiologists

From the original cohort of 161 cases, we included 100 patients with complete and comprehensive follow-up at our institution. We excluded patients who were followed up at outside facilities. Continence was assessed at each follow-up visit using the self-administered EPIC questionnaire [14]. Throughout this study, continence was defined as the use of no pads. Patients with artificial urinary sphincters were considered incontinent. We utilized three data sets to predict continence recovery: 1) a set of 16 clinicopathological features (Table 1); 2) 492 APMs (41 APMs during each of 12 RARP steps); 3) a combined set of 16 clinicopathological features and 492 APMs (total 508 parameters).
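The sizes of the three data sets follow directly from the counts stated above. The short sketch below only restates that arithmetic; the variable names are illustrative and do not come from the study code.

```python
# Feature counts taken from the text; the names are illustrative only.
N_APMS_PER_STEP = 41   # APMs derived per surgical step (Table 1)
N_STEPS = 12           # RARP steps in the institutional curriculum
N_CLINICAL = 16        # clinicopathological features (Table 1)

n_apm_features = N_APMS_PER_STEP * N_STEPS    # data set 2
n_combined = n_apm_features + N_CLINICAL      # data set 3

print(n_apm_features, n_combined)
```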

We utilized three prediction models to predict urinary continence after RARP: Cox Proportional Hazards (CPH) [15], Random Survival Forests (RSF) [16], and DL-based survival analysis (DeepSurv) [17]. Preprocessing of the data included imputing missing values of any feature by its median value, and five-fold splitting for cross validation. We used our dataset of 100 cases to train, validate and test the model. The dataset was split into five folds, with 20 cases in each fold. We also stratified each fold based on patient continence status to make sure the continence rates in each fold were similar. All models were trained on three folds, validated on the fourth fold, and tested on the remaining held-out fifth fold. Concordance Index (CI) and Mean Absolute Error (MAE) measured prediction performance. The statistical significance of the CI relative to chance performance was evaluated utilizing hypothesis testing; the p-value was obtained using a one-sample t-test. Based on these parameters, we selected the model with the highest CI and lowest MAE. From that top-performing model, we feature-ranked the parameters (APMs and clinicopathological) based on importance in predicting urinary continence (Figure 1).
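The preprocessing described above (median imputation, then a stratified five-fold split with three training folds, one validation fold and one test fold) could be sketched as follows. This is a minimal standard-library illustration, not the study's code: the function names, the random seed, and the choice of which fold serves as the validation fold in each rotation are assumptions. Only the 79/21 continence split and the fold size of 20 come from the text.

```python
import random
from statistics import median

def impute_median(rows):
    """Replace missing values (None) in each column with the column median."""
    n_cols = len(rows[0])
    medians = [median(r[c] for r in rows if r[c] is not None)
               for c in range(n_cols)]
    return [[medians[c] if r[c] is None else r[c] for c in range(n_cols)]
            for r in rows]

def stratified_folds(status, k=5, seed=0):
    """Deal case indices into k folds so each fold has a similar status mix."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    counter = 0  # running counter keeps the fold sizes equal across labels
    for label in sorted(set(status)):
        idx = [i for i, s in enumerate(status) if s == label]
        rng.shuffle(idx)
        for i in idx:
            folds[counter % k].append(i)
            counter += 1
    return folds

def rotations(folds):
    """Yield (train, validation, test) index lists, one per held-out fold."""
    k = len(folds)
    for t in range(k):
        v = (t + 1) % k  # assumption: the next fold is used for validation
        train = [i for f in range(k) if f not in (t, v) for i in folds[f]]
        yield train, folds[v], folds[t]

# 79 continent and 21 incontinent patients, as in the Step 1 cohort.
status = [1] * 79 + [0] * 21
folds = stratified_folds(status)
for train, val, test in rotations(folds):
    print(len(train), len(val), len(test))
```

With these inputs every fold holds 20 cases, of which 15 or 16 are continent, so the continence rate is similar across folds.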

Figure 1.

Study design of Step 1 and Step 2

Step 2 Faculty surgeon grouping and historical cases comparison

Using the model with the highest CI in Step 1 (DeepSurv), features were ranked and assigned a weighted score based on their importance in predicting continence as defined above (Figure 2). We then used the top five ranked features to differentiate the eight faculty surgeons into two groups. For each of the five features, the four surgeons with more efficient APMs were provided with a score equal to the weight of that particular feature. For example, if a surgeon had an average “Time of the third instrument not moving during anterior VUA (#1 ranked feature)” value in the top half of the eight surgeons, he/she would receive a score of 0.041. If a surgeon had the feature value in the lower half of surgeons, he/she would receive a score of zero. A final score was generated for each surgeon by summing the five scores across each feature (Table 2). The four surgeons with the highest scores as predicted by the DeepSurv model were categorized in “Group 1/APMs” (more efficient APMs) versus “Group 2/APMs” (less efficient APMs). Separately, surgeons were grouped according to their self-reported RARP caseload: “Group 1/Experience” (more experience) and “Group 2/Experience” (less experience).

Figure 2.

Top-10 feature ranking by DeepSurv utilizing Data set 3 (automated performance metrics and clinicopathological features) with weighting

VUA: Vesicourethral anastomosis

Table 2.

Surgeon grouping based on top five ranked features (APMs) by deep learning algorithm

Features and weighted scores
Rank Feature 1 (0.041) Feature 2 (0.029) Feature 3 (0.022) Feature 4 (0.016) Feature 5 (0.014)
1 H E F E A
2 E G A F F
3 D D C D H
4 G A E C D
5 C C B H C
6 B B G B E
7 A F D A B
8 F H H G G

APMs: Automated performance metrics

Final score calculation

A: 0.029+0.022+0.014=0.064

B: 0.000

C: 0.022+0.016=0.038

D: 0.041+0.029+0.016+0.014=0.099

E: 0.041+0.029+0.022+0.016=0.107

F: 0.022+0.016+0.014=0.051

G: 0.041+0.029=0.069

H: 0.041+0.014=0.054

The four surgeons with more efficient APMs were provided with a score equal to the weight of that particular feature. A final score was generated for each surgeon by summing the five scores across each feature. The four surgeons with the highest total were categorized in “Group 1/APMs” (E, D, G, A) versus “Group 2/APMs” (H, F, C, B).
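The scoring rule can be reproduced from Table 2 with a few lines of code. The sketch below is illustrative only (the function and variable names are not from the study). It uses the rounded weights printed in Table 2, so its totals differ from the printed totals by about 0.001, presumably because the published totals were computed from unrounded weights; that explanation is an assumption. The resulting grouping is unaffected.

```python
# Feature weights and per-feature surgeon rankings transcribed from Table 2.
WEIGHTS = [0.041, 0.029, 0.022, 0.016, 0.014]
RANKINGS = [
    "HEDGCBAF",  # feature 1, rank 1 (left) to rank 8 (right)
    "EGDACBFH",  # feature 2
    "FACEBGDH",  # feature 3
    "EFDCHBAG",  # feature 4
    "AFHDCEBG",  # feature 5
]

def surgeon_scores(weights, rankings, top_n=4):
    """Give each surgeon the feature weight whenever ranked in the top half."""
    scores = {surgeon: 0.0 for surgeon in rankings[0]}
    for weight, order in zip(weights, rankings):
        for surgeon in order[:top_n]:
            scores[surgeon] += weight
    return scores

scores = surgeon_scores(WEIGHTS, RANKINGS)
ordered = sorted(scores, key=scores.get, reverse=True)
group1, group2 = ordered[:4], ordered[4:]
print(group1, group2)
```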

Continence outcomes from a separate set of consecutive historical RARPs (January 2015 to August 2016) from these eight surgeons were then compared according to ranking by APMs (“Group 1/APMs” vs “Group 2/APMs”), and separately according to experience (“Group 1/Experience” vs “Group 2/Experience”). Of note, no APMs were available for analysis in this cohort of historical cases.

Data collection and analysis

Both prospective RARP (model training cohort) and historical patient data (model testing cohort) were captured based on a separate IRB-approved protocol. Clinical parameters included baseline characteristics, surgical sub-procedures performed (nerve-sparing, standard/extended lymph node dissection, urethropexy), intraoperative outcomes, perioperative outcomes, final pathologic results, and long-term oncologic and functional outcomes. Postoperative complications were graded by the Clavien-Dindo classification system [18]. Prostate-specific antigen (PSA) was checked at three, six, nine, and 12 months, and then every six months thereafter. Biochemical recurrence was defined as two consecutive PSA levels of >0.2 ng/ml [19]. Potency was evaluated at each follow-up visit using the Sexual Health Inventory for Men (SHIM) questionnaire [20]. Potency was defined as the ability to achieve and maintain satisfactory erections for sexual intercourse in >50% of attempts, with or without the use of phosphodiesterase 5 inhibitors. If patients required a vacuum erection device, penile injections, or transurethral alprostadil for intercourse, they were considered not potent. The number of cases by each surgeon available for analysis in Steps 1 and 2 is shown in Table 3.

Table 3.

Contribution of cases from surgeons to the DL algorithm training cohort and testing cohort

Step 1 Step 2
DL algorithm training cohort (n=100) DL algorithm testing cohort (n=493)
Surgeon Cases Surgeon Cases
A 35 A 177
B 5 B 18
C 20 C 151
D 10 D 25
E 15 E 73
F 3 F 15
G 10 G 28
H 2 H 6

CPH was performed with the lifelines Python package (https://lifelines.readthedocs.io/en/latest/). RSF and DeepSurv were trained with scikit-learn 0.19.0 package (http://scikit-learn.org/stable/). For comparison of historical case outcomes, continuous variables were evaluated using the Kruskal-Wallis test. Categorical variables were compared using Chi-squared and Fisher’s exact test. The rates of complications and readmission were analyzed by the Chi-squared test. Data were analyzed using SPSS 21.0 software.
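For orientation, DeepSurv as described in its original publication [17] replaces the linear predictor of the Cox model with the output of a neural network and is trained by minimizing the negative log Cox partial likelihood. The sketch below computes that objective for given log-risk scores using only the standard library. It is not the study's training code: the function name and the toy inputs are hypothetical, and tied event times are handled naively. Here the "event" is recovery of continence, and patients who had not recovered at last follow-up are censored.

```python
import math

def neg_log_partial_likelihood(times, events, log_risk):
    """Average negative log Cox partial likelihood over observed events."""
    total = 0.0
    n_events = 0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored patients only contribute through risk sets
        # Risk set: everyone who has not yet had the event before time t_i.
        denom = sum(math.exp(log_risk[j])
                    for j, t_j in enumerate(times) if t_j >= t_i)
        total += log_risk[i] - math.log(denom)
        n_events += 1
    return -total / n_events

# Hypothetical toy data: two patients, both recover, at days 1 and 2.
good = neg_log_partial_likelihood([1, 2], [1, 1], [1.0, 0.0])
bad = neg_log_partial_likelihood([1, 2], [1, 1], [0.0, 1.0])
print(good < bad)  # scores that rank the earlier recovery higher lose less
```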

Results

Step 1 Urinary continence recovery prediction modeling and feature ranking

Of the 100 RARPs included in Step 1, continence was attained in 79/100 (79%) patients with median time to urinary continence 126 days (16–553 days). Among the three models, DeepSurv constructed on data set 3 (clinicopathological features and APMs) had the highest CI and lowest MAE, thus the highest prediction accuracy (Table 4). Incidentally, in the feature ranking by the DeepSurv model, only APMs (no clinicopathological features) ranked in the top 10 (Figure 2). Three of the top-ranked features were APMs measured during the vesico-urethral anastomosis and one was during the prostatic apical dissection.

Table 4.

Continence recovery prediction by three models

Data set 1 (Clinicopathological features only)

Models CI (Mean, SD) p-value MAE (Mean, SD)
CPH 0.583 (0.05) 0.048 137.58 (17.25)
RSF 0.579 (0.06) 0.094 104.56 (21.17)
DeepSurv 0.562 (0.03) 0.019 104.63 (28.28)
Data set 2 (APMs only)

Models CI (Mean, SD) p-value MAE (Mean, SD)

CPH * * *
RSF 0.551 (0.05) 0.154 98.39 (23.03)
DeepSurv 0.578 (0.07) 0.133 97.51 (27.94)
Data set 3 (APMs + clinicopathological features)

Models CI (Mean, SD) p-value MAE (Mean, SD)

CPH 0.544 * 134.73
RSF 0.580 (0.07) 0.127 101.22 (23.34)
DeepSurv 0.599 (0.06) 0.049 85.9 (24.53)

CPH: Cox proportional hazards; RSF: Random survival forests; DeepSurv: Deep learning models based survival analysis; CI: Concordance Index; MAE: Mean Absolute Error; SD: Standard deviation; APMs: Automated performance metrics

* Convergence failed during the estimation of the coefficients in the Cox proportional hazards model due to the use of the Newton-Raphson algorithm. (Source: https://lifelines.readthedocs.io/en/latest/Examples.html#problems-with-convergence-in-the-cox-proportional-hazard-model (accessed 12/10/2018))

P value reflects statistical significance relative to chance performance
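The two metrics reported in Table 4 can be illustrated with a small sketch. The concordance index below follows Harrell's definition: among comparable patient pairs, it is the fraction in which the patient who recovered earlier was assigned the higher predicted risk, with ties in predicted risk counted as one half. How MAE was handled for censored patients is not stated in the text, so restricting it to patients who actually recovered is an assumption. All data in the example are hypothetical.

```python
def concordance_index(times, events, risk):
    """Harrell's C: share of comparable pairs ordered correctly by risk."""
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

def mean_absolute_error(observed_days, predicted_days, events):
    """MAE in days over patients with an observed recovery (assumption)."""
    errors = [abs(o - p)
              for o, p, e in zip(observed_days, predicted_days, events) if e]
    return sum(errors) / len(errors)

# Hypothetical example: days to continence, recovery flag, model outputs.
days = [30, 126, 200, 400]
recovered = [1, 1, 0, 1]
predicted_risk = [0.9, 0.2, 0.5, 0.1]
predicted_days = [50, 100, 300, 380]

ci = concordance_index(days, recovered, predicted_risk)
mae = mean_absolute_error(days, predicted_days, recovered)
print(ci, mae)
```

A CI of 0.5 corresponds to chance-level ordering and 1.0 to perfect ordering, which is the scale on which the values in Table 4 should be read.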

Step 2 Faculty surgeon stratification and historical cases comparison

In Step 2 of the study we included a total of 493 historical RARPs from the eight surgeons. Each surgeon had a median of 26.5 (6–177) cases available for analysis and outcomes comparison with median follow-up 18 months (4 to 24 months).
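The cohort summary above can be checked against the per-surgeon counts in Table 3. The following sketch is illustrative only (variable names are not from the study); it recomputes the total, the median number of cases per surgeon, and the number of historical cases falling into each APM-based group defined in Table 2.

```python
from statistics import median

# Historical (Step 2) case counts per surgeon, transcribed from Table 3.
HISTORICAL_CASES = {"A": 177, "B": 18, "C": 151, "D": 25,
                    "E": 73, "F": 15, "G": 28, "H": 6}

total_cases = sum(HISTORICAL_CASES.values())
median_cases = median(HISTORICAL_CASES.values())
case_range = (min(HISTORICAL_CASES.values()), max(HISTORICAL_CASES.values()))

# Group membership by APM score as listed under Table 2.
group1_cases = sum(HISTORICAL_CASES[s] for s in "EDGA")
group2_cases = sum(HISTORICAL_CASES[s] for s in "HFCB")

print(total_cases, median_cases, case_range, group1_cases, group2_cases)
```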

Surgeon grouping by APMs and prior RARP experience.

Four surgeons with the top summative scores in the DL model were assigned to “Group 1/APMs,” and the rest to “Group 2/APMs” (Table 2). The “Group 1/APMs” median weighted score was 0.084 (0.064–0.11) and the “Group 2/APMs” median score was 0.044 (0–0.054) (p=0.029).

Re-grouping of surgeons according to prior RARP experience revealed a re-ranked order of surgeons (Table 5). Notably, two surgeons moved from one group to the other. “Group 1/Experience” had a median 3000 (2100–3500) prior RARP cases. “Group 2/Experience” had a median 350 (100–500) cases (Table 5).

Table 5.

Surgeon grouping by APMs and by prior RARP experience

Grouping by APMs
Grouping by experience
Group 1 / APMs Group 2 / APMs Group 1 / Experience Group 2 / Experience
Rank Surgeon (APMs score) Rank Surgeon (APMs score) Rank Surgeon (Caseload) Rank Surgeon (Caseload)
1 E (0.107) 5 H (0.054) 1 E (3500) 5 D (500)
2 D (0.099) 6 F (0.051) 2 A (3000) 6 G (450)
3 G (0.069) 7 C (0.038) 3 C (3000) 7 B (250)
4 A (0.064) 8 B (0.000) 4 H (2100) 8 F (100)

APMs: Automated performance metrics; RARP: Robot-assisted radical prostatectomy

The eight surgeons were randomly labeled with an alphabetic character: A through H. These letters are not indicative of any ranking. Surgeons ranked in “Group 1 / APMs” are bold-faced to track their group assignment by experience.
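The disagreement between the two groupings in Table 5 can be made explicit with a short sketch. The values are transcribed from Table 5; the helper name is illustrative and not from the study. It splits the surgeons at the top half of each measure, reports the median caseload per experience group, and lists the surgeons whose group changes between the two schemes.

```python
from statistics import median

# APM scores and self-reported prior caseloads transcribed from Table 5.
APM_SCORE = {"E": 0.107, "D": 0.099, "G": 0.069, "A": 0.064,
             "H": 0.054, "F": 0.051, "C": 0.038, "B": 0.000}
CASELOAD = {"E": 3500, "A": 3000, "C": 3000, "H": 2100,
            "D": 500, "G": 450, "B": 250, "F": 100}

def top_half(metric):
    """Return the surgeons in the upper half when ranked by the metric."""
    ranked = sorted(metric, key=metric.get, reverse=True)
    return set(ranked[: len(ranked) // 2])

group1_apm = top_half(APM_SCORE)
group1_exp = top_half(CASELOAD)

median_exp_g1 = median(CASELOAD[s] for s in group1_exp)
median_exp_g2 = median(CASELOAD[s] for s in CASELOAD if s not in group1_exp)

# Less experienced surgeons with more efficient APMs, and the reverse.
efficient_but_less_experienced = sorted(group1_apm - group1_exp)
experienced_but_less_efficient = sorted(group1_exp - group1_apm)

print(median_exp_g1, median_exp_g2)
print(efficient_but_less_experienced, experienced_but_less_efficient)
```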

Historical cases (testing cohort) comparison between “Group 1/APMs” and “Group 2/APMs”

Of the 493 historical patients, 303 had their surgery performed by surgeons in “Group 1/APMs”, and 190 patients had their surgery performed by surgeons in “Group 2/APMs”. Baseline characteristics were similar amongst historical patients of surgeons categorized by APMs (Table 6). There was no significant difference in final pathologic features (all p>0.05). There was no significant difference in length of stay (p=0.063), readmissions (p=0.736) or complications (p=0.664). “Group 1/APMs” cases had superior rates of urinary continence at 3 and 6 months postoperatively (47.5 vs 36.7%, p=0.034, and 68.3 vs 59.2%, p=0.047, respectively). Continence rates were similar for both groups at 12 months (75.1 vs 70.0%, p=0.298). No difference in PSA recurrence or erectile function was found (p>0.05).
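As an illustration of the categorical comparison, the sketch below applies an uncorrected Pearson chi-squared test to the 3-month continence counts reported in Table 6 (116/244 vs 55/150). It is a standard-library approximation, not the SPSS procedure used in the study, and the function name is hypothetical. For one degree of freedom the p-value can be written with the complementary error function.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# 3-month continence: continent vs not continent in each APM group (Table 6).
chi2, p_value = chi2_2x2(116, 244 - 116, 55, 150 - 55)
print(round(chi2, 2), round(p_value, 3))
```

On these counts the statistic is about 4.47 and the p-value falls between 0.03 and 0.04, in line with the reported p=0.034.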

Table 6.

Patient demographics, surgical data, surgical outcomes, functional and oncological outcomes comparison between surgeons grouped by APMs

Group1/APMs Group2/APMs
N=303 N=190
Patient demographics Median (IQR) Median (IQR) p value
Age 66 (60–71) 65 (60–70) 0.295
BMI 27.4 (24.9–30.8) 27.5 (25.2–31.3) 0.564
CCI 4 (4–5) 4 (4–5) 0.742
Pre-operative IPSS 5 (3–13) 5 (3–10) 0.237
Pre-operative ESI (%) 36.3 (110/303) 29.5 (56/190) 0.118
Prostate volume (g) 50 (40–59) 49 (39–65) 0.511
Pre-biopsy PSA 6.7 (4.9–10) 6.9 (4.9–10.4) 0.805
D'Amico risk classification (%) 0.340
 1 18.8 (56/298) 14.8 (28/190)
 2 46.0 (137/298) 52.6 (100/190)
 3 35.2 (105/298) 32.6 (62/190)
Pre-operative Gleason score 7 (7–8) 7 (7–8) 0.730
Pre-operative ADT (%) 6.4 (19/299) 4.7 (9/190) 0.347
Surgery data Median (IQR)/ % (n/N) Median (IQR)/ % (n/N)
Surgery time 238 (209–272) 240 (204–274) 0.924
EBL 100 (75–150) 100 (100–150) 0.901
Intra-operative water tight leak (%) 0.4 (1/284) 0 0.419
Intra-operative complication (%) 1 (3/299) 0 0.168
Bladder neck reconstruction (%) 12.9 (39/301) 10.0 (19/190) 0.199
Nerve sparing (%) 82.2 (244/297) 79.7 (149/187) 0.497
Pelvic lymph node dissection (%) 0.537
  Standard 58.1 (169/291) 55.2 (101/183)
  Extended 41.9 (122/291) 44.8 (82/183)
Rocco stitch (%) 87 (262/301) 84.2 (160/190) 0.453
Pathologic data Median (IQR)/ % (n/N) Median (IQR)/ % (n/N)
Final pathology Gleason score 7 (7–7) 7 (7–7) 0.827
Extra prostatic extension (%) 49.2 (147/299) 53.4 (101/189) 0.357
Positive surgical margin (%) 18.1 (54/298) 20.6 (39/189) 0.492
Lymph node yield 16 (10–25) 15 (11–23) 0.748
Positive lymph node density (%) 14.5 (43/296) 10.8 (20/185) 0.150
Surgical outcomes Median (IQR)/ % (n/N) Median (IQR)/ % (n/N)
Length of hospital stay (day) 1 (1–2) 1 (1–2) 0.063
Readmission within 90 days (%) 6.8 (20/296) 7.6 (14/185) 0.736
Foley catheter duration (day) 8 (7–9) 8 (7–10) 0.250
Pelvic drainage tube duration (day) 2 (1–7) 2 (1–7) 0.554
Post-operative complications (%) 0.664
  Clavien-Dindo (I, II) 7.8 (23/296) 8.1 (15/186)
  Clavien-Dindo (III-V) 2.4 (7/296) 3.8 (7/186)
Post-operative ADT (%) 21.0 (48/229) 24.3 (36/148) 0.434
Post-operative XRT (%) 20.1 (46/229) 23.3 (34/146) 0.461
Functional and oncological outcomes
Continence % (n/N) % (n/N)
3 months (%) 47.5 (116/244) 36.7 (55/150) 0.034
6 months (%) 68.3 (157/230) 59.2 (84/142) 0.047
12 months (%) 75.1 (163/217) 70.0 (91/130) 0.298
18 months (%) 77.4 (151/195) 75.0 (90/120) 0.414
24 months (%) 85.5 (106/124) 81.0 (68/84) 0.386
ESI (adjusted for pre-operative ESI) % (n/N) % (n/N)
3 months (%) 12.4 (12/97) 10.9 (5/46) 0.7966
6 months (%) 22.3 (21/94) 18.2 (8/44) 0.576
12 months (%) 27.0 (24/89) 38.1 (16/42) 0.197
18 months (%) 36.9 (31/84) 42.5 (17/40) 0.550
24 months (%) 51.8 (29/56) 60.7 (17/28) 0.438
Biochemical recurrence % (n/N) % (n/N)
3 months (%) 9.1 (21/231) 9.7 (14/144) 0.838
6 months (%) 12.8 (27/210) 10.9 (15/137) 0.408
12 months (%) 15.1 (31/205) 11.1 (14/126) 0.301
18 months (%) 17.9 (31/173) 16.7 (19/114) 0.784
24 months (%) 19.4 (22/113) 19.2 (14/73) 0.961

APMs: Automated performance metrics; IQR: Interquartile range; BMI: Body mass index; CCI: Charlson comorbidity index; IPSS: International Prostate Symptom Score; ESI: Erection sufficient for intercourse; PSA: Prostate specific antigen; ADT: Androgen deprivation therapy; EBL: Estimated blood loss; LND: Lymph node dissection; XRT: External beam radiation

Historical cases comparison between “Group 1/Experience” and “Group 2/Experience”

Of the 493 patients, 429 had surgeries performed by surgeons in “Group 1/Experience” and 64 by surgeons in “Group 2/Experience”. Baseline differences between patients in “Group 1/Experience” and “Group 2/Experience” included preoperative IPSS (5 vs 4, p=0.004) and prostate volume (50 vs 44 ml, p=0.011) (Table 7). “Group 1/Experience” had a higher proportion of nerve sparing procedures (83.4 vs 66.7%, p=0.002) and extended pelvic lymph node dissections (46.5 vs 19.7%, p<0.001) as reported by the surgeon. Post-operatively, patients in “Group 1/Experience” had a slightly but significantly shorter pelvic drainage tube duration (2 (IQR 1–7) vs 2 (IQR 1–9) days, p=0.049), and a lower rate of low-grade (Clavien grade I-II) postoperative complications (6.7 vs 15.9%, p=0.038). There was no significant difference in continence, erectile function, or biochemical recurrence at three, six, 12, 18, or 24 months post-operatively (p>0.05).

Table 7.

Patient demographics, surgical data, surgical outcomes, functional and oncological outcomes comparison between surgeons grouped by surgical experience

Group1/Experience Group2/Experience
N=429 N=64
Patient demographics Median (IQR) Median (IQR) p value
Age 66 (60–71) 66 (60–70) 0.604
BMI 27.4 (25.1–30.9) 27.5 (24.9–30.9) 0.984
CCI 4 (4–5) 4 (4–5) 0.187
Pre-operative IPSS 5 (3–12) 4 (2–6) 0.004
Pre-operative ESI (%) 34.7 (149/429) 26.6 (17/64) 0.197
Prostate volume (g) 50 (40–63) 44 (35–58) 0.011
Pre-biopsy PSA 6.7 (4.8–10) 7.9 (5.4–12.1) 0.064
D’Amico risk classification (%) 0.569
 1 18.2 (77/425) 11.1 (7/63)
 2 48.2 (205/425) 50.8 (32/63)
 3 33.6 (143/425) 38.1 (24/63)
Pre-operative Gleason score 7 (7–8) 7 (7–8) 0.325
Pre-operative ADT (%) 5.9 (25/425) 4.9 (3/64) 0.860
Surgical data Median (IQR)/ % (n/N) Median (IQR)/ % (n/N)
Surgery time 238 (206–272) 247 (214–283) 0.168
EBL 100 (75–150) 100 (95–150) 0.311
Intra-operative water tight leak (%) 0.2 (1/413) 0 0.712
Intra-operative complication (%) 1.7 (7/424) 1.6 (1/63) 0.970
Bladder neck reconstruction (%) 10.8 (46/427) 18.8 (12/64) 0.065
Nerve sparing (%) 83.4 (351/421) 66.7 (42/63) 0.002
Pelvic lymph node dissection (%) <0.001
  Standard 53.5 (221/413) 80.3 (49/61)
  Extended 46.5 (192/413) 19.7 (12/61)
Rocco stitch (%) 86.9 (371/427) 79.7 (51/64) 0.259
Pathologic data Median (IQR)/ % (n/N) Median (IQR)/ % (n/N)
Final pathology Gleason score 7 (7–7) 7 (7–7) 0.888
Extra prostatic extension (%) 50.9 (216/424) 50 (32/64) 0.135
Positive surgical margin (%) 18.4 (78/423) 23.4 (15/64) 0.343
Lymph node yield 16 (11–24) 16 (9–23) 0.267
Lymph node positive (%) 12.6 (53/419) 16.1 (10/62) 0.448
Surgical outcomes Median (IQR)/ % (n/N) Median (IQR)/ % (n/N)
Length of hospital stay (day) 1 (1–2) 2 (1–2) 0.117
Readmission within 90 days (%) 6.4 (27/418) 11.1 (7/63) 0.179
Foley catheter duration 8 (7–9) 8 (7–11) 0.199
Pelvic drainage tube duration (day) 2 (1–7) 2 (1–9) 0.049
Post-operative complications (%) 0.036
  Clavien-Dindo (I, II) 6.7 (28/419) 15.9 (10/63)
  Clavien-Dindo (III-V) 3.1 (13/419) 1.6 (1/63)
Post-operative ADT (%) 21.5 (69/321) 26.8 (15/56) 0.380
Post-operative XRT (%) 20.7 (66/319) 25.0 (14/56) 0.468
Functional and oncological outcomes
Continence % (n/N) % (n/N)
3 months (%) 42.5 (143/336) 48.3 (28/58) 0.417
6 months (%) 64.2 (203/316) 67.9 (38/56) 0.602
12 months (%) 72.4 (213/294) 77.4 (41/53) 0.458
18 months (%) 76.4 (204/267) 77.1 (37/48) 0.912
24 months (%) 83.1 (148/178) 86.7 (26/30) 0.630
ESI (adjusted for pre-operative ESI) % (n/N) % (n/N)
3 months (%) 11.7 (15/128) 13.3 (2/15) 0.855
6 months (%) 21.1 (26/123) 20 (3/15) 0.919
12 months (%) 29.3 (34/116) 40 (6/15) 0.398
18 months (%) 38.2 (42/110) 42.8 (6/14) 0.735
24 months (%) 55.4 (39/71) 50.0 (5/10) 0.747
Biochemical recurrence % (n/N) % (n/N)
3 months (%) 9.7 (31/320) 7.3 (4/55) 0.684
6 months (%) 12.7 (37/292) 9.1 (5/55) 0.520
12 months (%) 14.0 (39/278) 11.3 (6/53) 0.598
18 months (%) 16.7 (40/239) 20.8 (10/48) 0.495
24 months (%) 18.4 (29/158) 25.0 (7/28) 0.412

APMs: Automated performance metrics; IQR: Interquartile range; BMI: Body mass index; CCI: Charlson comorbidity index; IPSS: International Prostate Symptom Score; ESI: Erection sufficient for intercourse; PSA: Prostate specific antigen; ADT: Androgen deprivation therapy; EBL: Estimated blood loss; LND: Lymph node dissection; XRT: External beam radiation

Analysis for outliers

An additional analysis was performed to examine the homogeneity of clinical outcomes within each group (“Group 1/APMs”, “Group 2/APMs”, “Group 1/Experience”, “Group 2/Experience”). No differences between patients of each surgeon were noted for urinary continence and biochemical recurrence within each group. However, significant differences were noted for ESI outcomes within “Group 1/APMs” and “Group 1/Experience”; one surgeon in “Group 1/APMs” had inferior outcomes compared to his peers (p<0.014) and one surgeon in “Group 1/Experience” had significantly superior outcomes compared to his peers (p<0.005).

Discussion

The present study underscores the potential practical application of APMs in conjunction with DL to predict key functional outcomes (urinary continence) after RARP. This approach allows for efficient processing of a complex dataset while minimizing human bias, with implications in how robotic surgeons may be assessed and trained in the future.

The combination of APMs and clinicopathological data appeared to be a better dataset than clinicopathological data or APMs alone in predicting postoperative continence. This highlights the potential effect of surgical skills on postoperative continence recovery, and the value of APMs for technical skills assessment. It also demonstrates the advantage of DL algorithms in processing large datasets. With more parameters, DL can provide more accurate predictions.

In contrast to manual assessment by content experts, key advantages of APMs include true objectivity, with minimal human processing, automaticity in data capture, and the sustainability of assessing large volumes of surgical procedures. Potential disadvantages include its current limitation to largely efficiency-based metrics [7]. In their present form APMs may merely be an indirect measure of surgical skills that may drive clinical outcomes. As a result, current APMs may not easily direct teaching (for example, a training surgeon is unlikely to find low instrument velocity as helpful feedback). Future APMs may better capture discrete surgical skills.

The fact that the ML (RSF) and DL (DeepSurv) approaches performed better than the conventional Cox regression model in our dataset is consistent with much of the literature comparing conventional and self-learning artificial intelligence approaches [21–24]. ML and DL models have the ability to process large datasets with numerous variables simultaneously [12]. As a subset of ML, DL has the added ability to self-modulate to optimize predictive performance [25]. Furthermore, traditional ML models require a data scientist to manually identify features and hand-code them into the appropriate domains and data types, thus potentially introducing bias; in contrast, DL algorithms attempt to learn from the data directly and organize features automatically [25]. Results from ML models may be more easily interpreted, whereas the output of DL algorithms is harder to interpret [25].

The feature rank derived from the combined clinicopathological and APM dataset with the DeepSurv model interestingly shows APMs at the top, suggesting that surgical technique in this instance may contribute more than patient factors to postoperative urinary continence. Furthermore, many of these APMs involve wrist articulation during two specific steps of the RARP that logically impact continence: prostatic apical dissection and vesico-urethral anastomosis. Although many prior studies have investigated patient and procedural factors that could influence urinary continence after RARP, most of these studies did not control for surgical skill in determining outcomes of the surgery [1]. We objectively demonstrated that surgical skills are associated with patients’ urinary continence recovery.

Surgeon classification by APMs showed differences in key historical clinical outcomes, such as 3- and 6-month urinary continence recovery. This is consistent with our prior work that showed APMs highly ranked by machine learning algorithms could distinguish surgeons with superior perioperative outcomes [26]. However, in that prior work, we did not address the differential weight each APM may contribute to predicting a clinical outcome. In this study we intentionally utilized a weighted score to rank surgeons. Furthermore, given that the DL model highly ranked APMs that specifically measure aspects of the apical dissection and vesicourethral anastomosis as most predictive of urinary continence, it seems confirmatory that using them to group surgeons by performance led to observed differences in early continence outcomes in our historical case analysis. Although there was no significant difference in continence at 12 months, the earlier return to continence among patients of surgeons with superior APMs allows for a clinically important faster improvement in quality of life, return to work, and adjuvant therapy if needed.

Interestingly, surgeon grouping by APMs versus experience did not agree. The ranking sequence was rearranged with two lesser-experienced surgeons (D, G) showing more efficient APMs than two more experienced surgeons (C, H). Surgical experience has been utilized extensively as the gold standard measure of surgical expertise – although imperfect, most validation studies for skills assessment tools have consistently used prior caseload as the benchmark [1, 4]. Re-grouping our surgeons by experience instead of APMs did not show differences in continence rates as seen by APMs grouping.

Our study has a few limitations. Although the combination of APMs and ML/DL is meant to maximize objectivity, it still requires limited human handling that may introduce bias. Most of the human handling is in the data collection, organization, and pre-processing rather than the analyses themselves. Such handling is minimized to the extent possible, and the approach certainly represents progress in eliminating human bias in surgical assessment. While the DL approach utilizing APMs and clinicopathological data yielded the greatest prediction performance, the CI was 0.599, so there is certainly room for improvement. Other factors predictive of outcomes may not be captured in our model, but we expect improvement with the accumulation of more cases, refinement of DL modeling processes, and enrichment of APMs and other parameters. We were also limited by sample size and an unbalanced number of cases performed by each surgeon. Still, we were able to demonstrate the feasibility of APMs and DL in the evaluation of robotic surgical skill and patient outcome prediction. Finally, this is a single-institution study, reflecting the surgical expertise and practice of a group of surgeons that may inherently share some common surgical style and postoperative management patterns. External validation with an outside dataset is necessary; a multi-institutional study paralleling the present one is in progress.
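
To put a CI of 0.599 in context, the concordance index is the fraction of comparable patient pairs that a model orders correctly, so 0.5 corresponds to chance and 1.0 to perfect ordering. The Python sketch below is a simplified version of Harrell's C, not the implementation used in this study. It assumes that the event is continence recovery, that a higher predicted score means earlier expected recovery, and that pairs with tied observed times can simply be skipped. The function name and arguments are hypothetical.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C for right-censored data.

    times: observed follow-up time for each patient (e.g. days)
    events: 1 if continence recovery was observed, 0 if censored
    risk_scores: model output; higher means earlier predicted recovery
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the patient with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Skip tied times (a simplification) and pairs whose earlier
        # patient was censored, since their true order is unknown.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

Under this definition, a model that assigns the same score to every patient scores 0.5, which is why values near 0.6 indicate modest discrimination.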

Our study adds to the growing body of evidence that surgical skills impact clinical outcomes. It may be prudent for future robotic surgeon credentialing to include a component of truly objective surgical skill evaluation, such as APMs. As our study suggests, merely relying on case experience as a surrogate for surgical expertise may not be sufficient. The near-term focus of our group will include analyses of APMs and clinicopathological features predictive of biochemical failure and erectile function recovery after RARP. As a series, these studies would comprehensively evaluate the “trifecta” outcomes after RARP.

Acknowledgements

Devin Stuart (1), Daphne Remulla (1), Tiffany Chu (1), Ryan Lee (1), Kartik Aron (1): data collection. Jie Cai (1): statistical analyses. Anthony Jarc (2), Liheng Guo (2): processing of automated performance metrics.

1. Center for Robotic Simulation & Education, USC Institute of Urology, Keck School of Medicine, University of Southern California, Los Angeles, United States.

2. Intuitive Surgical Inc. Clinical Research, Norcross, United States.

Disclosure: Andrew J. Hung has a financial disclosure with Ethicon Inc. (consultant). Research reported in this publication was supported in part by the National Institute Of Biomedical Imaging And Bioengineering of the National Institutes of Health under Award Number K23EB026493 and an Intuitive Surgical Clinical Research Grant.

Contributor Information

Jian Chen, Email: jian.chen@med.usc.edu.

Saum Ghodoussipour, Email: saum.ghodoussipour@med.usc.edu.

Paul J. Oh, Email: paul.oh@med.usc.edu.

Zequn Liu, Email: tslzq1997@pku.edu.cn.

Jessica Nguyen, Email: jessica.nguyen@med.usc.edu.

Sanjay Purushotham, Email: sanjayp2005@gmail.com.

Inderbir S. Gill, Email: igill@med.usc.edu.

Yan Liu, Email: yanliu.cs@usc.edu.

References

  • 1. Ficarra V, Novara G, Rosen RC, et al. Systematic review and meta-analysis of studies reporting urinary continence recovery after robot-assisted radical prostatectomy. Eur Urol 2012;62:405–17.
  • 2. Goldenberg MG, Goldenberg L, Grantcharov TP. Surgeon performance predicts early continence after robot-assisted radical prostatectomy. J Endourol 2017;31:858–63.
  • 3. Birkmeyer JD, Finks JF, O’Reilly A, et al. Surgical skill and complication rates after bariatric surgery. N Engl J Med 2013;369:1434–42.
  • 4. Chen J, Cheng N, Cacciamani G, et al. Objective assessment of robotic surgical technical skill: A systemic review. J Urol 2018. July 24 pii: S0022–5347(18)43589–6.
  • 5. Lendvay TS, White L, Kowalewski T. Crowdsourcing to assess surgical skill. JAMA Surg 2015;150:1086–7.
  • 6. Ghani KR, Miller DC, Linsell S, et al. Measuring to improve: peer and crowd-sourced assessments of technical skill with robot-assisted radical prostatectomy. Eur Urol 2016;69:547–50.
  • 7. Hung AJ, Chen J, Jarc A, Hatcher D, Djaladat H, Gill IS. Development and validation of objective performance metrics for robot-assisted radical prostatectomy: A pilot study. J Urol 2018;199:296–304.
  • 8. Chen J, Oh PJ, Cheng N, et al. Use of automated performance metrics to measure surgeon performance during robotic vesicourethral anastomosis and methodical development of a training tutorial. J Urol 2018;200:895–902.
  • 9. Judkins TN, Oleynikov D, Stergiou N. Objective evaluation of expert performance during human robotic surgical procedures. J Robot Surg 2008;1:307–12.
  • 10. Judkins TN, Oleynikov D, Stergiou N. Objective evaluation of expert and novice performance during robotic surgical training tasks. Surg Endosc 2009;23:590–7.
  • 11. Hung AJ, Chen J, Che Z, et al. Utilizing machine learning and automated performance metrics to evaluate robot-assisted radical prostatectomy performance and predict outcomes. J Endourol 2018;32:438–44.
  • 12. Hung AJ, Chen J, Gill IS. Automated performance metrics and machine learning algorithms to measure surgeon performance and anticipate clinical outcomes in robotic surgery. JAMA Surg 2018;153:770–1.
  • 13. Hung AJ, Bottyan T, Clifford TG, et al. Structured learning for robotic surgery utilizing a proficiency score: a pilot study. World J Urol 2017;35:27–34.
  • 14. Wei JT, Dunn RL, Litwin MS, Sandler HM, Sanda MG. Development and validation of the expanded prostate cancer index composite (EPIC) for comprehensive assessment of health-related quality of life in men with prostate cancer. Urology 2000;56:899–905.
  • 15. Lin DY, Wei LJ. The robust inference for the Cox proportional hazards model. Journal of the American statistical Association 1989;84:1074–8.
  • 16. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. The annals of applied statistics 2008;2:841–60.
  • 17. Katzman JL, Shaham U, Cloninger A, Bates J, Jiang T, Kluger Y. Deep survival: A deep cox proportional hazards network. Stat 2016;1050:2.
  • 18. Dindo D, Demartines N, Clavien P. Classification of surgical complications. Ann Surg 2004;240:205–13.
  • 19. Heidenreich A, Aus G, Bolla M, et al. EAU guidelines on prostate cancer [in Spanish]. Actas Urol Esp 2009;33:113–26.
  • 20. Cappelleri JC, Rosen RC. The Sexual Health Inventory for Men (SHIM): a 5-year review of research and clinical experience. Int J Impot Res 2005;17:307–19.
  • 21. Catto JW, Abbod MF, Wild PJ, et al. The application of artificial intelligence to microarray data: identification of a novel gene signature to identify bladder cancer progression. Eur Urol 2010;57:398–406.
  • 22. Catto JW, Abbod MF, Linkens DA, et al. Neuro-fuzzy modeling: an accurate and interpretable method for predicting bladder cancer progression. J Urol 2006;175:474–9.
  • 23. Bassi P, Sacco E, De Marco V, et al. Prognostic accuracy of an artificial neural network in patients undergoing radical cystectomy for bladder cancer: a comparison with logistic regression analysis. BJU Int 2007;99:1007–12.
  • 24. Poulakis V, Witzsch U, de Vries R, et al. Preoperative neural network using combined magnetic resonance imaging variables, prostate-specific antigen, and Gleason score for predicting prostate cancer biochemical recurrence after radical prostatectomy. Urology 2004;64:1165–70.
  • 25. Beam AL, Kohane IS. Big data and machine learning in health care. JAMA 2018;319:1317–8.
  • 26. Hung AJ, Chen J, Oh PJ, et al. PD38–03 Automated performance metrics during robotic-assisted radical prostatectomy can differentiate clinical outcome. J Urol 2018; 199(4s), pp. e736–e737
