Table 1.
Case ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Distribution of votes (%) for the Gray Zone clinical expert recommendations: | |||||||||||||||
Expert 1 | 61.54 | 20 | 5.56 | 7.14 | 71.43 | 60 | 40 | 8 | 29.41 | 60 | 37.5 | 62.5 | 0 | 25 | 16.67 |
Expert 2 | 15.38 | 26.67 | 0 | 57.14 | 0 | 40 | 20 | 0 | 52.94 | 30 | 25 | 12.5 | 100 | 50 | 50 |
Expert 3 | 0 | 33.33 | 55.56 | 35.71 | 14.29 | 0 | 40 | 32 | 5.88 | 10 | 25 | 12.5 | 0 | 25 | 33.33 |
Expert 4 | 0 | 20 | 38.89 | – | 0 | – | – | 20 | 11.76 | – | 12.5 | 12.5 | – | – | – |
Expert 5 | 23.08 | – | – | – | 14.29 | – | – | 40 | – | – | – | – | – | – | – |
GPT-4’s self-assessment: | |||||||||||||||
Closest | E3 | E2 | E1 | E1 | E4 | E2 | E1 | E3 | E3 | E1+E2 E3 | E1 | E2+E3 | E2 | E2+E3 | |
Favourite | E3 | E3 | E4 | E1 | E2 | E2 | E2 | E2 | E2 | E1+E2 E3 | E2 | E1 | E2 | E2 | |
Senior physician’s assessment: Initial recommendation | |||||||||||||||
Correctness | 4 | 4 | 3 | 4 | 4 | 4 | 3 | 2 | 4 | 3 | 4 | 3 | 4 | 3 | 4 |
Comprehensi. | 3 | 4 | 3 | 2 | 3 | 2 | 4 | 2 | 4 | 4 | 3 | 3 | 2 | 4 | 4 |
Novel aspects | Yes | Yes | No | Yes | No | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes |
Hallucination | No | No | No | No | No | No | No | Yes | No | Yes | No | No | No | No | No |
Revised recommendation | |||||||||||||||
Correctness | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
Comprehensi. | 3 | 4 | 3 | 4 | 4 | 4 | 3 | 4 | 4 | 4 | 3 | 4 | 3 | 4 | 4 |
Novel aspects | Yes | No | No | Yes | No | Yes | No | Yes | No | No | No | No | No | No | Yes |
Hallucination | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No |
Closest: the expert whose recommendation is closest to ChatGPT-4’s initial recommendation.
Favourite: the expert whose recommendation is the most appropriate for the patient.
Comprehensi.: Comprehensiveness.