Abstract
Objectives
To predict urinary continence recovery after robot-assisted radical prostatectomy (RARP) utilizing a deep learning (DL) model, which was then used to evaluate surgeons’ historical patient outcomes.
Subjects and Methods
Robotic surgical automated performance metrics (APMs) recorded during RARPs, together with patient clinicopathological and continence data, were captured prospectively from 100 contemporary RARPs. We trained a DL model (DeepSurv) to predict post-operative urinary continence. Model features were ranked based on their importance in prediction. We stratified eight surgeons based on the five top-ranked features. The top four surgeons were categorized as “Group 1/APMs” and the remaining four as “Group 2/APMs”. Outcomes from a separate historical cohort of RARPs (January 2015 to August 2016) performed by these two surgeon groups were then compared. Concordance Index (CI) and Mean Absolute Error (MAE) were used to measure the model’s prediction performance. Outcomes of historical cases were compared using the Kruskal-Wallis, Chi-squared and Fisher’s exact tests.
Results
Continence was attained in 79 patients (79%) after a median of 126 days. The DL model achieved a CI of 0.6 and an MAE of 85.9 in predicting continence. APMs were ranked higher by the model than clinicopathological features. In the historical cohort, “Group 1/APMs” patients had superior rates of urinary continence at three and six months postoperatively (47.5 vs 36.7%, p=0.034, and 68.3 vs 59.2%, p=0.047, respectively).
Conclusion
Utilizing APMs and clinicopathological data, the DeepSurv model was able to predict continence after RARP. In this feasibility study, surgeons with more efficient APMs had higher continence rates at three and six months after RARP.
Keywords: robotic surgical procedures, prostatectomy, artificial intelligence, urinary incontinence, quality of life
Introduction
Urinary continence is one of the key functional outcomes after radical prostatectomy [1]. Factors affecting postoperative urinary continence have been extensively investigated, including clinicopathological characteristics and nerve-sparing and reconstruction techniques [1]. Although recognized as influential, the impact of surgeon experience and skill on continence has not been well-established [1]. Goldenberg et al. found that during select steps of robot-assisted radical prostatectomy (RARP), surgical skills impact continence recovery [2]. Elsewhere in surgery, mounting evidence suggests that surgical technique impacts clinical outcomes [3].
To date, the gold standard method of weighing surgeon performance has been through determining prior surgical experience (caseload) or manual surgical evaluation by peer surgeons [4]. However, peer evaluation, typically utilizing validated assessment tools with pre-set objective criteria, is limited by subjectivity and high inter-observer variability [5, 6].
A fledgling alternative to manual assessment is automated performance metrics (APMs). These instrument motion tracking and system event metrics are derived from computer-based data recording devices. Preliminary studies have shown that APMs can differentiate expert and novice surgeon performance in both lab and clinical settings [7–10]. Our group has demonstrated that APMs differentiate surgeon experience during steps of the RARP [7, 8]. Furthermore, we have demonstrated that APMs, with the aid of machine learning (ML) algorithms, can predict perioperative short-term clinical outcomes after RARP [11, 12].
In this study, we applied conventional regression analysis, ML algorithms, and deep learning (DL) techniques to APMs collected during RARPs and patient clinicopathological features to predict postoperative urinary continence. Based on high importance features selected through DL algorithms, we categorized and ranked our institution’s faculty surgeons by both prior RARP experience and efficiency of APMs. We then sought to determine if experience and APMs correlated with clinical outcomes congruently or independently of each other.
Subjects and Methods
Step 1 Urinary continence recovery prediction modeling and feature ranking
Following an institutional review board (IRB) approved protocol, we prospectively collected robot instrument motion tracking (e.g., moving time, distance traveled, wrist angulation) and system events (e.g., camera movement, energy use) data from eight surgeons during 161 RARPs (September 2016 to October 2017) utilizing a custom recording device attached to the da Vinci Surgical System (Intuitive Surgical). From this, we derived a set of 41 previously validated APMs (Table 1) [7, 8]. Each RARP was segmented into 12 steps based on our institution’s structured learning curriculum [13]. APMs were then reported for each step.
Table 1.
Automated performance metrics and patient clinicopathological features
| Automated performance metrics |
|---|
| Time related metrics |
| Time to complete the task |
| Moving time of the right instrument |
| Moving time of the left instrument |
| Moving time of the third instrument |
| Moving time of the camera |
| Time of no instrument or camera movement |
| Time of the right instrument not moving during the task |
| Time of the left instrument not moving during the task |
| Time of the third instrument not moving during the task |
| Time of the camera not moving during the task |
| Instrument kinematic metrics |
| Path length of the right instrument |
| Moving velocity of the right instrument |
| Path length of the left instrument |
| Moving velocity of the left instrument |
| Path length of the third instrument |
| Path length of all three instruments |
| Ratio of path length of right and left instruments |
| Camera movement metrics |
| Path length of the camera |
| Moving velocity of the camera |
| Number of camera adjustments during the task |
| Frequency of camera adjustment |
| Mean of time of each camera movement |
| Mean path length of each camera movement |
| Mean of straight path length of each camera movement |
| System event metrics |
| Master clutch usage during task |
| Third arm swap during task |
| Energy usage during task |
| Frequency of master clutch usage |
| Frequency of third arm swap |
| Frequency of energy application |
| Number of times surgeon’s head out of the console |
| EndoWrist® articulation metrics |
| The total radians of the right instrument shaft rotation during the task |
| The total radians of the right instrument wrist movement during the task |
| The total radians of the right instrument jaw opening during the task |
| The total radians of the left instrument shaft rotation during the task |
| The total radians of the left instrument wrist movement during the task |
| The total radians of the left instrument jaw opening during the task |
| Right instrument articulation during the task |
| Left instrument articulation during the task |
| Angular velocity of the right instrument articulation |
| Angular velocity of the left instrument articulation |
| Clinicopathological features |
| Age |
| BMI |
| Pre-operative PSA |
| Pre-operative biopsy Gleason score |
| ASA |
| Surgery time |
| Lymph node dissection template (standard vs extended) |
| Urethropexy |
| Nerve sparing |
| Prostatic median lobe |
| Final pathology Gleason score |
| Pathological stage |
| Extracapsular extension |
| Prostate volume |
| Positive margins |
| Radiation received |
BMI: Body mass index; PSA: Prostate specific antigen; ASA: American Society of Anesthesiologists
From the original cohort of 161 cases, we included 100 patients with complete and comprehensive follow-up at our institution. We excluded patients who were followed up at outside facilities. Continence was assessed at each follow-up visit using the self-administered EPIC questionnaire [14]. Throughout this study, continence was defined as the use of no pads. Patients with artificial urinary sphincters were considered incontinent. We utilized three data sets to predict continence recovery: 1) a set of 16 clinicopathological features (Table 1); 2) 492 APMs (41 APMs during each of 12 RARP steps); 3) a combined set of 16 clinicopathological features and 492 APMs (total 508 parameters).
We utilized three prediction models to predict urinary continence after RARP: Cox Proportional Hazards (CPH) [15], Random Survival Forests (RSF) [16], and DL model-based survival analysis (DeepSurv) [17]. Preprocessing of the data included imputing missing values of any feature by its median value, and five-fold splitting for cross validation. We used our dataset of 100 cases to train, validate and test the models. The dataset was split into five folds, with 20 cases in each fold. We also stratified each fold based on patient continence status to make sure the continence rates in each fold were similar. All models were trained on three folds, validated on the fourth fold, and tested on the remaining held-out fifth fold. Concordance Index (CI) and Mean Absolute Error (MAE) measured prediction performance. The statistical significance of the CI relative to chance performance was evaluated utilizing hypothesis testing; the p-value was obtained using a one-sample t-test. Based on these parameters, we selected the model with the highest CI and lowest MAE. From that top-performing model, we feature-ranked the parameters (APMs and clinicopathological) based on importance in predicting urinary continence (Figure 1).
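The preprocessing and fold rotation described above can be sketched as follows. This is a minimal illustration and not the authors' code: the helper names `impute_median`, `stratified_folds`, and `rotate_splits` are hypothetical, the labels are synthetic stand-ins sized to match the continence rate reported in the Results, and the rule that the validation fold is the one following the test fold is an assumption, since the text states only that three folds were used for training, one for validation, and one for testing.

```python
import random
from statistics import median


def impute_median(column):
    """Fill missing entries (None) in one feature column with the observed median."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


def stratified_folds(labels, k=5, seed=0):
    """Deal case indices into k folds so each fold has a similar label mix."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0  # running counter keeps fold sizes balanced across classes
    for value in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == value]
        rng.shuffle(indices)
        for i in indices:
            folds[position % k].append(i)
            position += 1
    return folds


def rotate_splits(folds):
    """Yield (train, validation, test) index lists, one per held-out test fold."""
    k = len(folds)
    for t in range(k):
        v = (t + 1) % k  # assumed choice of validation fold (not stated in the text)
        train = [i for f in range(k) if f not in (t, v) for i in folds[f]]
        yield train, folds[v], folds[t]


# Synthetic labels: 79 continent (True) and 21 not continent (False).
labels = [True] * 79 + [False] * 21
folds = stratified_folds(labels)
splits = list(rotate_splits(folds))
fold_sizes = [len(f) for f in folds]
continent_per_fold = [sum(labels[i] for i in f) for f in folds]
```

With 100 cases this yields five folds of 20 cases, each containing 15 or 16 continent patients, and five train/validation/test splits of 60/20/20 cases.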
Figure 1.

Study design of Step 1 and Step 2
Step 2 Faculty surgeon grouping and historical cases comparison
Using the model with the highest CI in Step 1 (DeepSurv), features were ranked and assigned a weighted score based on their importance in predicting continence as defined above (Figure 2). We then used the top five ranked features to differentiate the eight faculty surgeons into two groups. For each of the five features, the four surgeons with more efficient APMs were given a score equal to the weight of that particular feature. For example, if a surgeon had an average “Time of the third instrument not moving during anterior VUA (#1 ranked feature)” value in the top half of the eight surgeons, he/she would receive a score of 0.041. If a surgeon had the feature value in the lower half of surgeons, he/she would receive a score of zero. A final score was generated for each surgeon by summing the five scores across each feature (Table 2). The four surgeons with the highest scores as predicted by the DeepSurv model were categorized in “Group 1/APMs” (more efficient APMs) versus “Group 2/APMs” (less efficient APMs). Separately, surgeons were grouped according to their self-reported RARP caseload: “Group 1/Experience” (more experience) and “Group 2/Experience” (less experience).
Figure 2.
Top-10 feature ranking by DeepSurv utilizing Data set 3 (automated performance metrics and clinicopathological features) with weighting
VUA: Vesicourethral anastomosis
Table 2.
Surgeon grouping based on top five ranked features (APMs) by deep learning algorithm
| Surgeon rank ╲ Feature | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Weighted score | 0.041 | 0.029 | 0.022 | 0.016 | 0.014 |
| 1 | H | E | F | E | A |
| 2 | E | G | A | F | F |
| 3 | D | D | C | D | H |
| 4 | G | A | E | C | D |
| 5 | C | C | B | H | C |
| 6 | B | B | G | B | E |
| 7 | A | F | D | A | B |
| 8 | F | H | H | G | G |
APMs: Automated performance metrics
Final score calculation
A: 0.029+0.022+0.014=0.064
B: 0.000
C: 0.022+0.016=0.038
D: 0.041+0.029+0.016+0.014=0.099
E: 0.041+0.029+0.022+0.016=0.107
F: 0.022+0.016+0.014=0.051
G: 0.041+0.029=0.069
H: 0.041+0.014=0.054
The four surgeons with more efficient APMs were provided with a score equal to the weight of that particular feature. A final score was generated for each surgeon by summing the five scores across each feature. The four surgeons with the highest total were categorized in “Group 1/APMs” (E, D, G, A) versus “Group 2/APMs” (H, F, C, B).
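The scoring rule in Table 2 can be sketched in a few lines. The rankings and weights below are transcribed from Table 2; the function name `score_surgeons` is hypothetical. The displayed weights appear to be rounded, so totals computed from them can differ from the published totals by about 0.001 (an assumption about the source of the small discrepancy), but the resulting group assignment is the same.

```python
# Weights of the five top-ranked features (Table 2, as displayed).
weights = [0.041, 0.029, 0.022, 0.016, 0.014]

# Surgeon order per feature, from rank 1 (most efficient) to rank 8.
rankings = [
    "HEDGCBAF",  # feature 1
    "EGDACBFH",  # feature 2
    "FACEBGDH",  # feature 3
    "EFDCHBAG",  # feature 4
    "AFHDCEBG",  # feature 5
]


def score_surgeons(weights, rankings, top_n=4):
    """Give each of the top_n surgeons on a feature that feature's weight, then sum."""
    scores = {surgeon: 0.0 for surgeon in rankings[0]}
    for weight, order in zip(weights, rankings):
        for surgeon in order[:top_n]:
            scores[surgeon] += weight
    return scores


scores = score_surgeons(weights, rankings)
ordered = sorted(scores, key=scores.get, reverse=True)
group1, group2 = ordered[:4], ordered[4:]  # "Group 1/APMs" and "Group 2/APMs"
```

Running this places surgeons E, D, G, and A in the higher-scoring group and H, F, C, and B in the lower-scoring group, matching the grouping reported above.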
Continence outcomes from a separate set of consecutive historical RARPs (January 2015 to August 2016) from these eight surgeons were then compared according to ranking by APMs (“Group 1/APMs” vs “Group 2/APMs”), and separately according to experience (“Group 1/Experience” vs “Group 2/Experience”). Of note, no APMs were available for analysis in this cohort of historical cases.
Data collection and analysis
Both prospective RARP (model training cohort) and historical patient data (model testing cohort) were captured based on a separate IRB-approved protocol. Clinical parameters included baseline characteristics, surgical sub-procedures performed (nerve-sparing, standard/extended lymph node dissection, urethropexy), intraoperative outcomes, perioperative outcomes, final pathologic results, and long-term oncologic and functional outcomes. Postoperative complications were graded by the Clavien-Dindo classification system [18]. Prostate-specific antigen (PSA) was checked at three, six, nine, and 12 months, and then every six months thereafter. Biochemical recurrence was defined as two consecutive PSA levels of >0.2 ng/ml [19]. Potency was evaluated at each follow-up visit using the Sexual Health Inventory for Men (SHIM) questionnaire [20]. Potency was defined as the ability to achieve and maintain satisfactory erections for sexual intercourse in >50% of attempts, with or without the use of phosphodiesterase-5 inhibitors. If patients required a vacuum erection device, penile injections, or transurethral alprostadil for intercourse, they were considered not potent. The number of cases by each surgeon available for analysis in Steps 1 and 2 is shown in Table 3.
Table 3.
Contribution of cases from surgeons to the DL algorithm training cohort and testing cohort
| Step 1: DL algorithm training cohort (n=100) | | Step 2: DL algorithm testing cohort (n=493) | |
|---|---|---|---|
| Surgeon | Cases | Surgeon | Cases |
| A | 35 | A | 177 |
| B | 5 | B | 18 |
| C | 20 | C | 151 |
| D | 10 | D | 25 |
| E | 15 | E | 73 |
| F | 3 | F | 15 |
| G | 10 | G | 28 |
| H | 2 | H | 6 |
CPH was performed with the Lifelines Python package (https://lifelines.readthedocs.io/en/latest/). RSF and DeepSurv were trained with the scikit-learn 0.19.0 package (http://scikit-learn.org/stable/). For comparison of historical case outcomes, continuous variables were evaluated using the Kruskal-Wallis test. Categorical variables were compared using Chi-squared and Fisher’s exact tests. The rates of complications and readmission were analyzed by Chi-squared test. Data were analyzed using SPSS 21.0 software.
Results
Step 1 Urinary continence recovery prediction modeling and feature ranking
Of the 100 RARPs included in Step 1, continence was attained in 79/100 (79%) patients, with a median time to urinary continence of 126 days (16–553 days). Among the three models, DeepSurv constructed on data set 3 (clinicopathological features and APMs) had the highest CI and lowest MAE, and thus the highest prediction accuracy (Table 4). Incidentally, in the feature ranking by the DeepSurv model, only APMs (no clinicopathological features) ranked in the top 10 (Figure 2). Three of the top-ranked features were APMs measured during the vesico-urethral anastomosis and one was measured during the prostatic apical dissection.
Table 4.
Continence recovery prediction by three models
| Data set 1 (Clinicopathological features only) | |||
|---|---|---|---|
| Models | CI (Mean, SD) | p-value | MAE (Mean, SD) |
| CPH | 0.583 (0.05) | 0.048 | 137.58 (17.25) |
| RSF | 0.579 (0.06) | 0.094 | 104.56 (21.17) |
| DeepSurv | 0.562 (0.03) | 0.019 | 104.63 (28.28) |
| Data set 2 (APMs only) | |||
| Models | CI (Mean, SD) | p-value | MAE (Mean, SD) |
| CPH | * | * | * |
| RSF | 0.551 (0.05) | 0.154 | 98.39 (23.03) |
| DeepSurv | 0.578 (0.07) | 0.133 | 97.51 (27.94) |
| Data set 3 (APMs + clinicopathological features) | |||
| Models | CI (Mean, SD) | p-value | MAE (Mean, SD) |
| CPH | 0.544 | * | 134.73 |
| RSF | 0.580 (0.07) | 0.127 | 101.22 (23.34) |
| DeepSurv | 0.599 (0.06) | 0.049 | 85.9 (24.53) |
CPH: Cox proportional hazards; RSF: Random survival forests; DeepSurv: Deep learning models based survival analysis; CI: Concordance Index; MAE: Mean Absolute Error; SD: Standard deviation; APMs: Automated performance metrics
* Convergence failed during estimation of the coefficients in the Cox proportional hazards model, which is fit with the Newton-Raphson algorithm. (Source: https://lifelines.readthedocs.io/en/latest/Examples.html#problems-with-convergence-in-the-cox-proportional-hazard-model (accessed 12/10/2018))
P value reflects statistical significance relative to chance performance
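For readers unfamiliar with the two metrics in Table 4, the following is a minimal sketch of how a concordance index and a mean absolute error can be computed for right-censored time-to-continence data. It is not the authors' implementation. The function names, the toy data, the handling of tied predictions (counted as 0.5), the skipping of pairs with tied observed times, and the restriction of MAE to patients whose recovery was observed are all assumptions made for illustration; the source does not specify these details.

```python
def concordance_index(times, predicted_times, events):
    """Harrell-style C: share of comparable pairs ordered correctly by the prediction.

    times            observed days to continence or to last follow-up
    predicted_times  model-predicted days to continence
    events           True if continence was observed, False if censored
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored case cannot be the earlier member of a pair
        for j in range(n):
            # Comparable only if case i recovered strictly before case j's time.
            if i != j and times[i] < times[j]:
                comparable += 1
                if predicted_times[i] < predicted_times[j]:
                    concordant += 1.0
                elif predicted_times[i] == predicted_times[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def mae_observed(times, predicted_times, events):
    """Mean absolute error in days over cases with an observed recovery (assumed rule)."""
    errors = [abs(p - t) for t, p, e in zip(times, predicted_times, events) if e]
    return sum(errors) / len(errors)


# Toy example: five hypothetical patients, the fourth censored at 200 days.
times = [30, 90, 126, 200, 365]
events = [True, True, True, False, True]
predicted = [40, 80, 150, 120, 300]

c_index = concordance_index(times, predicted, events)
mae = mae_observed(times, predicted, events)
```

A CI of 0.5 corresponds to chance ordering and 1.0 to perfect ordering, which is the scale on which the values in Table 4 should be read.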
Step 2 Faculty surgeon stratification and historical cases comparison
In Step 2 of the study we included a total of 493 historical RARPs from the eight surgeons. Each surgeon had a median of 26.5 (6–177) cases available for analysis and outcomes comparison, with a median follow-up of 18 months (4 to 24 months).
Surgeon grouping by APMs and prior RARP experience.
Four surgeons with the top summative scores in the DL model were assigned to “Group 1/APMs,” and the rest to “Group 2/APMs” (Table 2). The “Group 1/APMs” median weighted score was 0.084 (0.064–0.11) and the “Group 2/APMs” median score was 0.044 (0–0.054) (p=0.029).
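As a check on the group comparison above, the sketch below runs an exact two-sided rank-sum permutation test on the eight surgeon scores from Table 2. The choice of this test is an assumption: the source names the Kruskal-Wallis test for continuous variables in the historical cohort but does not state which test produced this particular p-value. The function name `exact_rank_sum_p` is hypothetical, and the ranking step assumes no tied scores, which holds for these data.

```python
from itertools import combinations

# Final APM scores per surgeon, as published in Table 2.
group1_scores = [0.107, 0.099, 0.069, 0.064]  # E, D, G, A
group2_scores = [0.054, 0.051, 0.038, 0.000]  # H, F, C, B


def exact_rank_sum_p(values_a, values_b):
    """Two-sided exact p-value for the rank sum of group A (assumes no ties)."""
    pooled = sorted(values_a + values_b)
    rank = {value: r for r, value in enumerate(pooled, start=1)}
    n_a = len(values_a)
    expected = n_a * (len(pooled) + 1) / 2  # rank sum expected under the null
    observed_gap = abs(sum(rank[v] for v in values_a) - expected)
    extreme = 0
    total = 0
    # Enumerate every way of assigning n_a of the pooled values to group A.
    for combo in combinations(pooled, n_a):
        total += 1
        if abs(sum(rank[v] for v in combo) - expected) >= observed_gap:
            extreme += 1
    return extreme / total


p_value = exact_rank_sum_p(group1_scores, group2_scores)
```

Because the two groups of four are completely separated, only 2 of the 70 possible assignments are as extreme as the observed one, giving p = 2/70, which rounds to the reported 0.029.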
Re-grouping of surgeons according to prior RARP experience revealed a re-ranked order of surgeons (Table 5). Notably, two surgeons moved from one group to the other. “Group 1/Experience” had a median 3000 (2100–3500) prior RARP cases. “Group 2/Experience” had a median 350 (100–500) cases (Table 5).
Table 5.
Surgeon grouping by APMs and by prior RARP experience
| Grouping by APMs | | | | Grouping by experience | | | |
|---|---|---|---|---|---|---|---|
| Group 1 / APMs | | Group 2 / APMs | | Group 1 / Experience | | Group 2 / Experience | |
| Rank | Surgeon (APMs score) | Rank | Surgeon (APMs score) | Rank | Surgeon (Caseload) | Rank | Surgeon (Caseload) |
| 1 | **E (0.107)** | 5 | H (0.054) | 1 | **E (3500)** | 5 | **D (500)** |
| 2 | **D (0.099)** | 6 | F (0.051) | 2 | **A (3000)** | 6 | **G (450)** |
| 3 | **G (0.069)** | 7 | C (0.038) | 3 | C (3000) | 7 | B (250) |
| 4 | **A (0.064)** | 8 | B (0.000) | 4 | H (2100) | 8 | F (100) |
APMs: Automated performance metrics; RARP: Robot-assisted radical prostatectomy
The eight surgeons were randomly labeled with an alphabetic character: A through H. These letters are not indicative of any ranking. Surgeons ranked in “Group 1 / APMs” are bold-faced to track their group assignment by experience.
Historical cases (testing cohort) comparison between “Group 1/APMs” and “Group 2/APMs”
Of the 493 historical patients: 303 had their surgery performed by surgeons in “Group 1/APMs”, and 190 patients had their surgery performed by surgeons in “Group 2/APMs”. Baseline characteristics were similar amongst historical patients of surgeons categorized by APMs (Table 6). There was no significant difference in final pathologic features (all p>0.05). There was no significant difference in length of stay (p=0.063), readmissions (p=0.736) or complications (p=0.664). “Group 1/APMs” cases had superior rates of urinary continence at 3 and 6 months postoperatively (47.5 vs 36.7%, p=0.034, and 68.3 vs 59.2%, p=0.047, respectively). Continence rates were similar for both groups at 12 months (75.1 vs 70.0%, p=0.298). No difference in PSA recurrence or erectile function was found (p>0.05).
Table 6.
Patient demographics, surgical data, surgical outcomes, functional and oncological outcomes comparison between surgeons grouped by APMs
| | Group 1/APMs | Group 2/APMs | |
|---|---|---|---|
| | N=303 | N=190 | |
| Patient demographics | Median (IQR) | Median (IQR) | p value |
| Age | 66 (60–71) | 65 (60–70) | 0.295 |
| BMI | 27.4 (24.9–30.8) | 27.5 (25.2–31.3) | 0.564 |
| CCI | 4 (4–5) | 4 (4–5) | 0.742 |
| Pre-operative IPSS | 5 (3–13) | 5 (3–10) | 0.237 |
| Pre-operative ESI (%) | 36.3 (110/303) | 29.5 (56/190) | 0.118 |
| Prostate volume (g) | 50 (40–59) | 49 (39–65) | 0.511 |
| Pre-biopsy PSA | 6.7 (4.9–10) | 6.9 (4.9–10.4) | 0.805 |
| D'Amico risk classification (%) | 0.340 | ||
| 1 | 18.8 (56/298) | 14.8 (28/190) | |
| 2 | 46.0 (137/298) | 52.6 (100/190) | |
| 3 | 35.2 (105/298) | 32.6 (62/190) | |
| Pre-operative Gleason score | 7 (7–8) | 7 (7–8) | 0.730 |
| Pre-operative ADT (%) | 6.4 (19/299) | 4.7 (9/190) | 0.347 |
| Surgery data | Median (IQR)/ % (n/N) | Median (IQR)/ % (n/N) | |
| Surgery time | 238 (209–272) | 240 (204–274) | 0.924 |
| EBL | 100 (75–150) | 100 (100–150) | 0.901 |
| Intra-operative water tight leak (%) | 0.4 (1/284) | 0 | 0.419 |
| Intra-operative complication (%) | 1 (3/299) | 0 | 0.168 |
| Bladder neck reconstruction (%) | 12.9 (39/301) | 10.0 (19/190) | 0.199 |
| Nerve sparing (%) | 82.2 (244/297) | 79.7 (149/187) | 0.497 |
| Pelvic lymph node dissection (%) | 0.537 | ||
| Standard | 58.1 (169/291) | 55.2 (101/183) | |
| Extended | 41.9 (122/291) | 44.8 (82/183) | |
| Rocco stitch (%) | 87 (262/301) | 84.2 (160/190) | 0.453 |
| Pathologic data | Median (IQR)/ % (n/N) | Median (IQR)/ % (n/N) | |
| Final pathology Gleason score | 7 (7–7) | 7 (7–7) | 0.827 |
| Extra prostatic extension (%) | 49.2 (147/299) | 53.4 (101/189) | 0.357 |
| Positive surgical margin (%) | 18.1 (54/298) | 20.6 (39/189) | 0.492 |
| Lymph node yield | 16 (10–25) | 15 (11–23) | 0.748 |
| Positive lymph node density (%) | 14.5 (43/296) | 10.8 (20/185) | 0.150 |
| Surgical outcomes | Median (IQR)/ % (n/N) | Median (IQR)/ % (n/N) | |
| Length of hospital stay (day) | 1 (1–2) | 1 (1–2) | 0.063 |
| Readmission within 90 days (%) | 6.8 (20/296) | 7.6 (14/185) | 0.736 |
| Foley catheter duration (day) | 8 (7–9) | 8 (7–10) | 0.250 |
| Pelvic drainage tube duration (day) | 2 (1–7) | 2 (1–7) | 0.554 |
| Post-operative complications (%) | 0.664 | ||
| Clavien-Dindo (I, II) | 7.8 (23/296) | 8.1 (15/186) | |
| Clavien-Dindo (III-V) | 2.4 (7/296) | 3.8 (7/186) | |
| Post-operative ADT (%) | 21.0 (48/229) | 24.3 (36/148) | 0.434 |
| Post-operative XRT (%) | 20.1 (46/229) | 23.3 (34/146) | 0.461 |
| Functional and oncological outcomes | |||
| Continence | % (n/N) | % (n/N) | |
| 3 months (%) | 47.5 (116/244) | 36.7 (55/150) | 0.034 |
| 6 months (%) | 68.3 (157/230) | 59.2 (84/142) | 0.047 |
| 12 months (%) | 75.1 (163/217) | 70.0 (91/130) | 0.298 |
| 18 months (%) | 77.4 (151/195) | 75.0 (90/120) | 0.414 |
| 24 months (%) | 85.5 (106/124) | 81.0 (68/84) | 0.386 |
| ESI (adjusted for pre-operative ESI) | % (n/N) | % (n/N) | |
| 3 months (%) | 12.4 (12/97) | 10.9 (5/46) | 0.7966 |
| 6 months (%) | 22.3 (21/94) | 18.2 (8/44) | 0.576 |
| 12 months (%) | 27.0 (24/89) | 38.1 (16/42) | 0.197 |
| 18 months (%) | 36.9 (31/84) | 42.5 (17/40) | 0.550 |
| 24 months (%) | 51.8 (29/56) | 60.7 (17/28) | 0.438 |
| Biochemical recurrence | % (n/N) | % (n/N) | |
| 3 months (%) | 9.1 (21/231) | 9.7 (14/144) | 0.838 |
| 6 months (%) | 12.8 (27/210) | 10.9 (15/137) | 0.408 |
| 12 months (%) | 15.1 (31/205) | 11.1 (14/126) | 0.301 |
| 18 months (%) | 17.9 (31/173) | 16.7 (19/114) | 0.784 |
| 24 months (%) | 19.4 (22/113) | 19.2 (14/73) | 0.961 |
APMs: Automated performance metrics; IQR: Interquartile range; BMI: Body mass index; CCI: Charlson comorbidity index; IPSS: International Prostate Symptom Score; ESI: Erection sufficient for intercourse; PSA: Prostate specific antigen; ADT: Androgen deprivation therapy; EBL: Estimated blood loss; LND: Lymph node dissection; XRT: External beam radiation
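The 3-month continence comparison in Table 6 can be reproduced approximately with a Pearson chi-squared test on the 2×2 table of counts. The sketch below is an illustration under assumptions: the source states only that categorical variables were compared with Chi-squared and Fisher’s exact tests in SPSS, and does not say whether a continuity correction was applied, so none is used here. The function name `chi2_2x2` is hypothetical.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared (no continuity correction) for the table [[a, b], [c, d]].

    Returns the test statistic and the two-sided p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p


# 3-month continence counts from Table 6: continent vs not continent.
g1_continent, g1_total = 116, 244  # Group 1/APMs
g2_continent, g2_total = 55, 150   # Group 2/APMs

stat, p = chi2_2x2(
    g1_continent, g1_total - g1_continent,
    g2_continent, g2_total - g2_continent,
)
```

For these counts the statistic is about 4.47 and the p-value about 0.034, in line with the value reported for 3 months in Table 6.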
Historical cases comparison between “Group 1/Experience” and “Group 2/Experience”
Of the 493 patients, 429 had surgeries performed by surgeons in “Group 1/Experience” and 64 by surgeons in “Group 2/Experience”. Baseline differences between patients in “Group 1/Experience” and “Group 2/Experience” included preoperative IPSS (5 vs 4, p=0.004) and prostate volume (50 vs 44 ml, p=0.011) (Table 7). “Group 1/Experience” had a higher proportion of nerve sparing procedures (83.4 vs 66.7%, p=0.002) and extended pelvic lymph node dissections (46.5 vs 19.7%, p<0.001) as reported by the surgeon. Post-operatively, patients in “Group 1/Experience” had a slightly but significantly shorter pelvic drainage tube duration (2 (IQR 1–7) vs 2 (IQR 1–9) days, p=0.049), and a lower rate of low-grade (Clavien grade I-II) postoperative complications (6.7 vs 15.9%, p=0.038). There was no significant difference in continence, erectile function, or biochemical recurrence at three, six, 12, 18, or 24 months post-operatively (p>0.05).
Table 7.
Patient demographics, surgical data, surgical outcomes, functional and oncological outcomes comparison between surgeons grouped by surgical experience
| | Group 1/Experience | Group 2/Experience | |
|---|---|---|---|
| | N=429 | N=64 | |
| Patient demographics | Median (IQR) | Median (IQR) | p value |
| Age | 66 (60–71) | 66 (60–70) | 0.604 |
| BMI | 27.4 (25.1–30.9) | 27.5 (24.9–30.9) | 0.984 |
| CCI | 4 (4–5) | 4 (4–5) | 0.187 |
| Pre-operative IPSS | 5 (3–12) | 4 (2–6) | 0.004 |
| Pre-operative ESI (%) | 34.7 (149/429) | 26.6 (17/64) | 0.197 |
| Prostate volume (g) | 50 (40–63) | 44 (35–58) | 0.011 |
| Pre-biopsy PSA | 6.7 (4.8–10) | 7.9 (5.4–12.1) | 0.064 |
| D’Amico risk classification (%) | 0.569 | ||
| 1 | 18.2 (77/425) | 11.1 (7/63) | |
| 2 | 48.2 (205/425) | 50.8 (32/63) | |
| 3 | 33.6 (143/425) | 38.1 (24/63) | |
| Pre-operative Gleason score | 7 (7–8) | 7 (7–8) | 0.325 |
| Pre-operative ADT (%) | 5.9 (25/425) | 4.9 (3/64) | 0.860 |
| Surgical data | Median (IQR)/ % (n/N) | Median (IQR)/ % (n/N) | |
| Surgery time | 238 (206–272) | 247 (214–283) | 0.168 |
| EBL | 100 (75–150) | 100 (95–150) | 0.311 |
| Intra-operative water tight leak (%) | 0.2 (1/413) | 0 | 0.712 |
| Intra-operative complication (%) | 1.7 (7/424) | 1.6 (1/63) | 0.970 |
| Bladder neck reconstruction (%) | 10.8 (46/427) | 18.8 (12/64) | 0.065 |
| Nerve sparing (%) | 83.4 (351/421) | 66.7 (42/63) | 0.002 |
| Pelvic lymph node dissection (%) | <0.001 | ||
| Standard | 53.5 (221/413) | 80.3 (49/61) | |
| Extended | 46.5 (192/413) | 19.7 (12/61) | |
| Rocco stitch (%) | 86.9 (371/427) | 79.7 (51/64) | 0.259 |
| Pathologic data | Median (IQR)/ % (n/N) | Median (IQR)/ % (n/N) | |
| Final pathology Gleason score | 7 (7–7) | 7 (7–7) | 0.888 |
| Extra prostatic extension (%) | 50.9 (216/424) | 50 (32/64) | 0.135 |
| Positive surgical margin (%) | 18.4 (78/423) | 23.4 (15/64) | 0.343 |
| Lymph node yield | 16 (11–24) | 16 (9–23) | 0.267 |
| Lymph node positive (%) | 12.6 (53/419) | 16.1 (10/62) | 0.448 |
| Surgical outcomes | Median (IQR)/ % (n/N) | Median (IQR)/ % (n/N) | |
| Length of hospital stay (day) | 1 (1–2) | 2 (1–2) | 0.117 |
| Readmission within 90 days (%) | 6.4 (27/418) | 11.1 (7/63) | 0.179 |
| Foley catheter duration (day) | 8 (7–9) | 8 (7–11) | 0.199 |
| Pelvic drainage tube duration (day) | 2 (1–7) | 2 (1–9) | 0.049 |
| Post-operative complications (%) | 0.036 | ||
| Clavien-Dindo (I, II) | 6.7 (28/419) | 15.9 (10/63) | |
| Clavien-Dindo (III-V) | 3.1 (13/419) | 1.6 (1/63) | |
| Post-operative ADT (%) | 21.5 (69/321) | 26.8 (15/56) | 0.380 |
| Post-operative XRT (%) | 20.7 (66/319) | 25.0 (14/56) | 0.468 |
| Functional and oncological outcomes | |||
| Continence | % (n/N) | % (n/N) | |
| 3 months (%) | 42.5 (143/336) | 48.3 (28/58) | 0.417 |
| 6 months (%) | 64.2 (203/316) | 67.9 (38/56) | 0.602 |
| 12 months (%) | 72.4 (213/294) | 77.4 (41/53) | 0.458 |
| 18 months (%) | 76.4 (204/267) | 77.1 (37/48) | 0.912 |
| 24 months (%) | 83.1 (148/178) | 86.7 (26/30) | 0.630 |
| ESI (adjusted for pre-operative ESI) | % (n/N) | % (n/N) | |
| 3 months (%) | 11.7 (15/128) | 13.3 (2/15) | 0.855 |
| 6 months (%) | 21.1 (26/123) | 20 (3/15) | 0.919 |
| 12 months (%) | 29.3 (34/116) | 40 (6/15) | 0.398 |
| 18 months (%) | 38.2 (42/110) | 42.8 (6/14) | 0.735 |
| 24 months (%) | 55.4 (39/71) | 50.0 (5/10) | 0.747 |
| Biochemical recurrence | % (n/N) | % (n/N) | |
| 3 months (%) | 9.7 (31/320) | 7.3 (4/55) | 0.684 |
| 6 months (%) | 12.7 (37/292) | 9.1 (5/55) | 0.520 |
| 12 months (%) | 14.0 (39/278) | 11.3 (6/53) | 0.598 |
| 18 months (%) | 16.7 (40/239) | 20.8 (10/48) | 0.495 |
| 24 months (%) | 18.4 (29/158) | 25.0 (7/28) | 0.412 |
APMs: Automated performance metrics; IQR: Interquartile range; BMI: Body mass index; CCI: Charlson comorbidity index; IPSS: International Prostate Symptom Score; ESI: Erection sufficient for intercourse; PSA: Prostate specific antigen; ADT: Androgen deprivation therapy; EBL: Estimated blood loss; LND: Lymph node dissection; XRT: External beam radiation
Analysis for outliers
An additional analysis was performed to examine the homogeneity of clinical outcomes within each group (“Group 1/APMs”, “Group 2/APMs”, “Group 1/Experience”, “Group 2/Experience”). No differences between patients of each surgeon were noted for urinary continence and biochemical recurrence within each group. However, significant differences were noted for ESI outcomes within “Group 1/APMs” and “Group 1/Experience”; one surgeon in “Group 1/APMs” had inferior outcomes compared to peers (p<0.014) and one surgeon in “Group 1/Experience” had significantly superior outcomes compared to peers (p<0.005).
Discussion
The present study underscores the potential practical application of APMs in conjunction with DL to predict key functional outcomes (urinary continence) after RARP. This approach allows for efficient processing of a complex dataset while minimizing human bias, with implications in how robotic surgeons may be assessed and trained in the future.
The combination of APMs and clinicopathological data appeared to be a better dataset than clinicopathological data or APMs alone in predicting postoperative continence. This highlights the potential effect of surgical skills on postoperative continence recovery, and the value of APMs for technical skills assessment. It also suggests an advantage of DL algorithms in processing large datasets: in our data, DeepSurv was most accurate when given the largest parameter set (Table 4).
In contrast to manual assessment by content experts, key advantages of APMs include true objectivity, with minimal human processing, automaticity in data capture, and the sustainability of assessing large volumes of surgical procedures. Potential disadvantages include its current limitation to largely efficiency-based metrics [7]. In their present form APMs may merely be an indirect measure of surgical skills that may drive clinical outcomes. As a result, current APMs may not easily direct teaching (for example, a training surgeon is unlikely to find low instrument velocity as helpful feedback). Future APMs may better capture discrete surgical skills.
The fact that the ML (RSF) and DL (DeepSurv) approaches outperformed the conventional Cox regression model on the combined APM and clinicopathological dataset is consistent with much of the literature comparing conventional and self-learning artificial intelligence approaches [21–24]. ML and DL models have the ability to process large datasets with numerous variables simultaneously [12]. As a subset of ML, DL has the added ability to self-modulate to optimize predictive performance [25]. Furthermore, traditional ML models require a data scientist to manually identify features and hand-code them into the appropriate domains and data types, thus potentially introducing bias; in contrast, DL algorithms attempt to learn from the data directly and organize features automatically [25]. Conversely, results from traditional ML models are generally easier to interpret than the output of DL algorithms [25].
The feature rank derived from the combined clinicopathological and APM dataset with the DeepSurv model interestingly shows APMs at the top, suggesting that surgical technique in this instance may contribute more than patient factors to postoperative urinary continence. Furthermore, many of these APMs incidentally involve wrist articulation during two specific steps of the RARP that logically impact continence: prostatic apical dissection and vesico-urethral anastomosis. Although many prior studies have investigated patient and procedural factors that could influence urinary continence after RARP, most of these studies did not control for surgical skill in determining outcomes of the surgery [1]. We objectively demonstrated that surgical skills are associated with patients’ urinary continence recovery.
Surgeon classification by APMs showed differences in key historical clinical outcomes, such as 3- and 6-month urinary continence recovery. This is consistent with our prior work, which showed that APMs highly ranked by machine learning algorithms could distinguish surgeons with superior perioperative outcomes [26]. However, in that prior work, we did not address the differential weight each APM may contribute to predicting a clinical outcome. In this study we intentionally utilized a weighted score to rank surgeons. Furthermore, given that the DL model highly ranked APMs that specifically measure aspects of the apical dissection and vesicourethral anastomosis as most predictive of urinary continence, it seems confirmatory that using them to group surgeons by performance led to observed differences in early continence outcomes in our historical case analysis. Although there was no significant difference in continence at 12 months, the earlier return to continence among patients of surgeons with superior APMs allows for a clinically important faster improvement in quality of life, return to work, and adjuvant therapy if needed.
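The idea of a weighted score can be illustrated with a minimal sketch. It assumes, purely for illustration, that each top-ranked feature has already been scaled so that higher values are better, and that the model's feature importances serve as weights; the function name, feature names, and all numbers below are hypothetical and are not the values or the exact formula used in this study.

```python
def rank_surgeons(weights, surgeon_scores):
    """Rank surgeons by an importance-weighted mean of per-feature scores.

    weights: dict mapping feature name -> importance weight (hypothetical).
    surgeon_scores: dict mapping surgeon -> dict of feature -> score,
        with every score pre-scaled so that higher is better (assumption).
    Returns a list of (surgeon, composite_score) tuples, best first.
    """
    total_weight = sum(weights.values())
    composite = {
        surgeon: sum(weights[f] * scores[f] for f in weights) / total_weight
        for surgeon, scores in surgeon_scores.items()
    }
    return sorted(composite.items(), key=lambda item: item[1], reverse=True)


# Hypothetical example: two features, three surgeons.
example_weights = {"feature_1": 2.0, "feature_2": 1.0}
example_scores = {
    "A": {"feature_1": 0.9, "feature_2": 0.2},
    "B": {"feature_1": 0.4, "feature_2": 0.8},
    "C": {"feature_1": 0.1, "feature_2": 0.3},
}
ranking = rank_surgeons(example_weights, example_scores)
```

Under this scheme a surgeon who scores well on a heavily weighted feature outranks one who scores well only on a lightly weighted feature, which is the distinction an unweighted ranking would miss.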
Interestingly, surgeon grouping by APMs did not agree with grouping by experience. The ranking sequence was rearranged, with two lesser-experienced surgeons (D, G) showing more efficient APMs than two more experienced surgeons (C, H). Surgical experience has been utilized extensively as the gold standard measure of surgical expertise; although imperfect, most validation studies for skills assessment tools have consistently used prior caseload as the benchmark [1, 4]. Re-grouping our surgeons by experience instead of APMs did not show the differences in continence rates seen with APM grouping.
Our study has a few limitations. Although the combination of APMs and ML/DL is meant to maximize objectivity, it still requires some human handling that may introduce bias. Most of the human handling is in the data collection, organization, and pre-processing rather than the analyses themselves. Such handling is minimized to the extent possible, and it certainly represents progress in eliminating human bias in surgical assessment. While the DL approach utilizing APMs and clinicopathological data yielded the greatest prediction performance, the CI was 0.599. Certainly, there is room for improvement. There may remain other factors predictive of outcomes that are not captured in our model, but we expect improvement with the accumulation of more cases, refinement of DL modeling processes, and enrichment of APMs and other parameters. We were also limited by sample size and an unbalanced number of cases performed by each surgeon. Still, we were able to demonstrate the feasibility of APMs and DL in the evaluation of robotic surgical skill and patient outcome prediction. Finally, this is a single-institution study, reflecting the surgical expertise and practice of a group of surgeons who may inherently share some common surgical style and postoperative management patterns. External validation with an outside dataset is necessary; a multi-institutional study paralleling the present study is in progress.
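To put the CI in context, a concordance index of 0.5 corresponds to chance-level ordering and 1.0 to perfect ordering of patients by predicted risk. The following is a minimal sketch of Harrell's concordance index for right-censored data; it is an illustration only, not the evaluation code used in this study. It assumes that a higher risk score means an earlier expected event (here, earlier continence recovery), and it simply skips pairs with tied observed times; the function and variable names are hypothetical.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ordered correctly by risk.

    times: observed follow-up times (e.g. days to continence or censoring).
    events: 1/True if the event was observed, 0/False if censored.
    risk_scores: model output; higher means the event is expected sooner.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Skip tied times (simplification) and pairs where the earlier
        # subject was censored, since their true ordering is unknown.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

For example, with hypothetical times `[30, 90, 200, 365]`, events `[1, 1, 1, 0]`, and risk scores that decrease with time, every comparable pair is ordered correctly and the function returns 1.0; identical risk scores for all patients return 0.5.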
Our study adds to the growing chorus of evidence that surgical skills impact clinical outcomes. It may be prudent for future robotic surgeon credentialing to include a component of truly objective surgical skill evaluation, such as APMs. As our study suggests, merely relying on case experience as a surrogate for surgical expertise may not be sufficient. The near-term focus of our group will include the analyses of APMs and clinicopathological features predictive of biochemical failure and erectile function recovery after RARP. As a series, these studies would comprehensively evaluate the “trifecta” outcomes after RARP.
Acknowledgements
Devin Stuart (1), Daphne Remulla (1), Tiffany Chu (1), Ryan Lee (1), Kartik Aron (1): data collection. Jie Cai (1): statistical analyses. Anthony Jarc (2), Liheng Guo (2): processing of automated performance metrics.
1. Center for Robotic Simulation & Education, USC Institute of Urology, Keck School of Medicine, University of Southern California, Los Angeles, United States.
2. Intuitive Surgical Inc. Clinical Research, Norcross, United States.
Disclosure: Andrew J. Hung has a financial disclosure with Ethicon Inc. (consultant). Research reported in this publication was supported in part by the National Institute of Biomedical Imaging and Bioengineering of the National Institutes of Health under Award Number K23EB026493 and an Intuitive Surgical Clinical Research Grant.
Contributor Information
Jian Chen, Email: jian.chen@med.usc.edu.
Saum Ghodoussipour, Email: saum.ghodoussipour@med.usc.edu.
Paul J. Oh, Email: paul.oh@med.usc.edu.
Zequn Liu, Email: tslzq1997@pku.edu.cn.
Jessica Nguyen, Email: jessica.nguyen@med.usc.edu.
Sanjay Purushotham, Email: sanjayp2005@gmail.com.
Inderbir S. Gill, Email: igill@med.usc.edu.
Yan Liu, Email: yanliu.cs@usc.edu.
References
- 1. Ficarra V, Novara G, Rosen RC, et al. Systematic review and meta-analysis of studies reporting urinary continence recovery after robot-assisted radical prostatectomy. Eur Urol 2012;62:405–17.
- 2. Goldenberg MG, Goldenberg L, Grantcharov TP. Surgeon performance predicts early continence after robot-assisted radical prostatectomy. J Endourol 2017;31:858–63.
- 3. Birkmeyer JD, Finks JF, O’Reilly A, et al. Surgical skill and complication rates after bariatric surgery. N Engl J Med 2013;369:1434–42.
- 4. Chen J, Cheng N, Cacciamani G, et al. Objective assessment of robotic surgical technical skill: A systematic review. J Urol 2018. July 24 pii: S0022–5347(18)43589–6.
- 5. Lendvay TS, White L, Kowalewski T. Crowdsourcing to assess surgical skill. JAMA Surg 2015;150:1086–7.
- 6. Ghani KR, Miller DC, Linsell S, et al. Measuring to improve: peer and crowd-sourced assessments of technical skill with robot-assisted radical prostatectomy. Eur Urol 2016;69:547–50.
- 7. Hung AJ, Chen J, Jarc A, Hatcher D, Djaladat H, Gill IS. Development and validation of objective performance metrics for robot-assisted radical prostatectomy: A pilot study. J Urol 2018;199:296–304.
- 8. Chen J, Oh PJ, Cheng N, et al. Use of automated performance metrics to measure surgeon performance during robotic vesicourethral anastomosis and methodical development of a training tutorial. J Urol 2018;200:895–902.
- 9. Judkins TN, Oleynikov D, Stergiou N. Objective evaluation of expert performance during human robotic surgical procedures. J Robot Surg 2008;1:307–12.
- 10. Judkins TN, Oleynikov D, Stergiou N. Objective evaluation of expert and novice performance during robotic surgical training tasks. Surg Endosc 2009;23:590–7.
- 11. Hung AJ, Chen J, Che Z, et al. Utilizing machine learning and automated performance metrics to evaluate robot-assisted radical prostatectomy performance and predict outcomes. J Endourol 2018;32:438–44.
- 12. Hung AJ, Chen J, Gill IS. Automated performance metrics and machine learning algorithms to measure surgeon performance and anticipate clinical outcomes in robotic surgery. JAMA Surg 2018;153:770–1.
- 13. Hung AJ, Bottyan T, Clifford TG, et al. Structured learning for robotic surgery utilizing a proficiency score: a pilot study. World J Urol 2017;35:27–34.
- 14. Wei JT, Dunn RL, Litwin MS, Sandler HM, Sanda MG. Development and validation of the expanded prostate cancer index composite (EPIC) for comprehensive assessment of health-related quality of life in men with prostate cancer. Urology 2000;56:899–905.
- 15. Lin DY, Wei LJ. The robust inference for the Cox proportional hazards model. Journal of the American Statistical Association 1989;84:1074–8.
- 16. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. The Annals of Applied Statistics 2008;2:841–60.
- 17. Katzman JL, Shaham U, Cloninger A, Bates J, Jiang T, Kluger Y. Deep survival: A deep Cox proportional hazards network. Stat 2016;1050:2.
- 18. Dindo D, Demartines N, Clavien P. Classification of surgical complications. Ann Surg 2004;240:205–13.
- 19. Heidenreich A, Aus G, Bolla M, et al. EAU guidelines on prostate cancer [in Spanish]. Actas Urol Esp 2009;33:113–26.
- 20. Cappelleri JC, Rosen RC. The Sexual Health Inventory for Men (SHIM): a 5-year review of research and clinical experience. Int J Impot Res 2005;17:307–19.
- 21. Catto JW, Abbod MF, Wild PJ, et al. The application of artificial intelligence to microarray data: identification of a novel gene signature to identify bladder cancer progression. Eur Urol 2010;57:398–406.
- 22. Catto JW, Abbod MF, Linkens DA, et al. Neuro-fuzzy modeling: an accurate and interpretable method for predicting bladder cancer progression. J Urol 2006;175:474–9.
- 23. Bassi P, Sacco E, De Marco V, et al. Prognostic accuracy of an artificial neural network in patients undergoing radical cystectomy for bladder cancer: a comparison with logistic regression analysis. BJU Int 2007;99:1007–12.
- 24. Poulakis V, Witzsch U, de Vries R, et al. Preoperative neural network using combined magnetic resonance imaging variables, prostate-specific antigen, and Gleason score for predicting prostate cancer biochemical recurrence after radical prostatectomy. Urology 2004;64:1165–70.
- 25. Beam AL, Kohane IS. Big data and machine learning in health care. JAMA 2018;319:1317–8.
- 26. Hung AJ, Chen J, Oh PJ, et al. PD38–03 Automated performance metrics during robotic-assisted radical prostatectomy can differentiate clinical outcome. J Urol 2018;199(4S):e736–e737.

