Table 2.
Overall and subgroup analysis of ChatGPT-4 on urological cancer treatment recommendations.
ChatGPT-4 measure, mean (SD) | All | Prostate Ca | Kidney Ca | Bladder Ca | p-value (cancer type) | Localized | Systemic | Recurrent | p-value (disease status)
---|---|---|---|---|---|---|---|---|---
Query prompts (n) | 108 | 36 | 34 | 38 | | 56 | 40 | 6 |
ChatGPT total RECs a | 6.0 (1.92) | 6.4 (2.10) | 5.8 (1.64) | 5.8 (1.97) | 0.336 | 5.7 (1.55) | 5.6 (2.04) | 7.8 (1.85) | 0.017*
Rater-approved ChatGPT REC ratio (%) b | 88.5 (14.8) | 90.1 (12.8) | 83.4 (19.0) | 91.4 (11.0) | 0.048* | 85.7 (16.0) | 93.3 (11.1) | 73.8 (16.6) | 0.002*
NCCN-aligned ChatGPT REC ratio (%) b | 86.7 (16.1) | 85.0 (16.1) | 82.9 (19.2) | 91.7 (11.8) | 0.050 | 83.7 (17.8) | 91.4 (12.0) | 73.2 (17.9) | 0.009*
Rater-disagreed ChatGPT REC ratio (%) b | 9.5 (13.7) | 7.9 (12.4) | 13.9 (18.1) | 7.1 (9.0) | 0.072 | 12.4 (15.4) | 4.3 (7.8) | 24.5 (17.1) | <0.001**
NCCN total RECs | 6.0 (2.18) | 5.9 (1.11) | 6.1 (2.89) | 6.0 (2.29) | 0.961 | 5.7 (1.89) | 6.1 (2.40) | 7.6 (2.33) | 0.131
ChatGPT REC/NCCN REC ratio (%) c | 100.0 (40.5) | 102.7 (35.5) | 103.4 (56.7) | 95.2 (28.5) | 0.728 | 108.5 (45.1) | 86.6 (30.3) | 113.8 (41.3) | 0.057
NCCN-aligned ChatGPT REC/NCCN REC ratio (%) c | 81.0 (20.6) | 80.8 (20.8) | 78.0 (20.1) | 83.6 (21.3) | 0.638 | 84.6 (20.0) | 77.3 (21.8) | 77.3 (17.2) | 0.314
Correctness (range 1–5) | 4.5 (0.65) | 4.4 (0.69) | 4.4 (0.75) | 4.6 (0.49) | 0.215 | 4.4 (0.67) | 4.7 (0.51) | 3.7 (0.79) | <0.001**
Comprehensiveness (range 1–5) | 4.4 (0.70) | 4.6 (0.69) | 4.3 (0.68) | 4.5 (0.73) | 0.388 | 4.5 (0.72) | 4.4 (0.67) | 4.1 (0.65) | 0.207
Specificity (range 1–5) | 4.0 (0.71) | 4.0 (0.67) | 3.8 (0.71) | 4.1 (0.75) | 0.278 | 3.9 (0.59) | 4.0 (0.80) | 3.7 (0.71) | 0.545
Appropriateness (range 1–5) | 4.4 (0.70) | 4.3 (0.73) | 4.3 (0.75) | 4.5 (0.62) | 0.432 | 4.3 (0.75) | 4.5 (0.52) | 3.6 (0.88) | 0.006*
a. RECs: recommendations.
b. Ratio calculated with ChatGPT total RECs as the denominator.
c. Ratio calculated with NCCN total RECs as the denominator.
*Significant at p < 0.05.
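The ratio rows in the table use two different denominators, as the footnotes state. The following is a minimal Python sketch of that arithmetic for a single query, assuming each ratio is a count divided by the stated denominator and expressed as a percentage. The function name `rec_ratios`, its parameter names, and the example counts are hypothetical and are not taken from the study data.

```python
def rec_ratios(chatgpt_total, rater_approved, nccn_aligned, nccn_total):
    """Return the table's percentage ratios for one query (illustrative only).

    All arguments are hypothetical recommendation (REC) counts for one query.
    """
    return {
        # Denominator: ChatGPT total RECs
        "rater_approved_pct": 100.0 * rater_approved / chatgpt_total,
        "nccn_aligned_pct": 100.0 * nccn_aligned / chatgpt_total,
        # Denominator: NCCN total RECs
        "chatgpt_to_nccn_pct": 100.0 * chatgpt_total / nccn_total,
        "nccn_aligned_to_nccn_pct": 100.0 * nccn_aligned / nccn_total,
    }


# Hypothetical query: ChatGPT gives 6 RECs, 5 approved by raters,
# 5 aligned with NCCN, while NCCN itself lists 5 RECs.
example = rec_ratios(chatgpt_total=6, rater_approved=5, nccn_aligned=5, nccn_total=5)
```

Because the second pair of ratios is divided by the NCCN count rather than the ChatGPT count, the ChatGPT REC/NCCN REC ratio can exceed 100% whenever ChatGPT lists more recommendations than the guideline, which is consistent with subgroup means above 100% in the table.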