Table 3.
Comparison of ChatGPT-4's performance using prompts with and without the specification “according to NCCN.”
ChatGPT-4 | All, mean (SD) | Non-specific prompt b, mean (SD) | NCCN-specified prompt c, mean (SD) | p-value |
---|---|---|---|---|
Query prompts (n) | 108 | 54 | 54 | |
ChatGPT total RECs a | 6.0 (1.92) | 6.9 (1.67) | 5.0 (1.68) | <0.001** |
Rater-approved cGPT REC ratio % d | 88.5 (14.8) | 85.8 (15.9) | 91.2 (13.3) | 0.011* |
NCCN-aligned cGPT REC ratio % d | 86.7 (16.1) | 83.5 (16.7) | 89.9 (15.0) | 0.006** |
Rater-disagreed cGPT REC ratio % d | 9.5 (13.7) | 11.7 (14.8) | 7.4 (12.3) | 0.020* |
NCCN total RECs | 6.0 (2.18) | 6.0 (2.18) | 6.0 (2.18) | |
ChatGPT REC/NCCN REC ratio % e | 100.0 (40.5) | 116.1 (39.6) | 84.1 (35.0) | <0.001** |
NCCN-aligned ChatGPT REC/NCCN REC ratio % e | 81.0 (20.6) | 89.7 (15.5) | 72.4 (21.6) | <0.001** |
Correctness (range 1–5) | 4.5 (0.65) | 4.4 (0.69) | 4.6 (0.59) | 0.017* |
Comprehensiveness (range 1–5) | 4.7 (0.41) | 4.7 (0.41) | 4.2 (0.81) | <0.001** |
Specificity (range 1–5) | 4.0 (0.71) | 4.2 (0.57) | 3.7 (0.73) | <0.001** |
Appropriateness (range 1–5) | 4.4 (0.70) | 4.4 (0.72) | 4.3 (0.68) | 0.640 |
a RECs: recommendations.
b Prompt without “according to NCCN”.
c Prompt with “according to NCCN”.
d ChatGPT total RECs as the denominator.
e NCCN total RECs as the denominator.
* Significant p < 0.05; ** Significant p < 0.01.
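The two denominators in footnotes d and e, and the significance markers, can be illustrated with a minimal Python sketch. The function names and the example counts are hypothetical and are not taken from the study; the sketch also assumes that ratios are computed per query and then averaged, which the table does not state.

```python
def rec_ratios(chatgpt_total, nccn_aligned, nccn_total):
    """Per-query ratios (in %) under the two denominators of footnotes d and e.

    All arguments are recommendation counts for a single query (hypothetical inputs).
    """
    return {
        # Footnote d: ChatGPT total RECs as the denominator
        "nccn_aligned_cgpt_ratio_pct": 100.0 * nccn_aligned / chatgpt_total,
        # Footnote e: NCCN total RECs as the denominator
        "chatgpt_per_nccn_ratio_pct": 100.0 * chatgpt_total / nccn_total,
        "nccn_aligned_per_nccn_ratio_pct": 100.0 * nccn_aligned / nccn_total,
    }


def significance_marker(p):
    """Marker convention from the table footnote: * for p < 0.05, ** for p < 0.01."""
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# Hypothetical query: ChatGPT gave 8 RECs, 6 of them NCCN-aligned; NCCN lists 6 RECs.
example = rec_ratios(chatgpt_total=8, nccn_aligned=6, nccn_total=6)
print(example)
print(repr(significance_marker(0.006)), repr(significance_marker(0.640)))
```

With these made-up counts, the footnote d ratio is 75% while the footnote e ratio for the same aligned recommendations is 100%, which shows why a response listing more recommendations can score lower on the d ratios and higher on the e ratios.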