2023 Sep 14;13:1265024. doi: 10.3389/fonc.2023.1265024

Table 1.

The performance of ChatGPT-4’s initial recommendations and revised recommendations on the Gray Zone cases.

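| Case ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Distribution of votes for the Gray Zone clinical expert recommendations (%): | | | | | | | | | | | | | | | |
| Expert 1 | 61.54 | 20 | 5.56 | 7.14 | 71.43 | 60 | 40 | 8 | 29.41 | 60 | 37.5 | 62.5 | 0 | 25 | 16.67 |
| Expert 2 | 15.38 | 26.67 | 0 | 57.14 | 0 | 40 | 20 | 0 | 52.94 | 30 | 25 | 12.5 | 100 | 50 | 50 |
| Expert 3 | 0 | 33.33 | 55.56 | 35.71 | 14.29 | 0 | 40 | 32 | 5.88 | 10 | 25 | 12.5 | 0 | 25 | 33.33 |
| Expert 4 | 0 | 20 | 38.89 | - | 0 | | | 20 | 11.76 | | 12.5 | 12.5 | | | |
| Expert 5 | 23.08 | | | | 14.29 | | | 40 | | | | | | | |
| GPT-4’s self-assessment: | | | | | | | | | | | | | | | |
| Closest | E3 | E2 | E1 | E1 | E4 | E2 | E1 | E3 | E3 | E1+E2 | E3 | E1 | E2+E3 | E2 | E2+E3 |
| Favourite | E3 | E3 | E4 | E1 | E2 | E2 | E2 | E2 | E2 | E1+E2 | E3 | E2 | E1 | E2 | E2 |
| Senior physician’s assessment: Initial recommendation | | | | | | | | | | | | | | | |
| Correctness | 4 | 4 | 3 | 4 | 4 | 4 | 3 | 2 | 4 | 3 | 4 | 3 | 4 | 3 | 4 |
| Comprehensi. | 3 | 4 | 3 | 2 | 3 | 2 | 4 | 2 | 4 | 4 | 3 | 3 | 2 | 4 | 4 |
| Novel aspects | Yes | Yes | No | Yes | No | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes |
| Hallucination | No | No | No | No | No | No | No | Yes | No | Yes | No | No | No | No | No |
| Senior physician’s assessment: Revised recommendation | | | | | | | | | | | | | | | |
| Correctness | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| Comprehensi. | 3 | 4 | 3 | 4 | 4 | 4 | 3 | 4 | 4 | 4 | 3 | 4 | 3 | 4 | 4 |
| Novel aspects | Yes | No | No | Yes | No | Yes | No | Yes | No | No | No | No | No | No | Yes |
| Hallucination | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No |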

Closest: the expert recommendation that ChatGPT-4’s initial recommendation most closely matches.

Favourite: the expert recommendation that ChatGPT-4 considered most appropriate for the patient.

Comprehensi., Comprehensiveness.
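
The senior physician's ratings in Table 1 can be aggregated per recommendation round with a few lines of Python. The following is a minimal sketch and not part of the original study: the names `initial`, `revised`, and `summarize` are hypothetical, and the lists are transcribed by hand from the Correctness, Comprehensiveness, and Hallucination rows for cases 1 to 15.

```python
# Senior physician's ratings transcribed from Table 1 (cases 1 to 15).
# Variable and function names are illustrative assumptions, not from the paper.
initial = {
    "correctness": [4, 4, 3, 4, 4, 4, 3, 2, 4, 3, 4, 3, 4, 3, 4],
    "comprehensiveness": [3, 4, 3, 2, 3, 2, 4, 2, 4, 4, 3, 3, 2, 4, 4],
    "hallucination": "No No No No No No No Yes No Yes No No No No No".split(),
}
revised = {
    "correctness": [4] * 15,
    "comprehensiveness": [3, 4, 3, 4, 4, 4, 3, 4, 4, 4, 3, 4, 3, 4, 4],
    "hallucination": ["No"] * 15,
}


def summarize(ratings):
    """Return mean scores (rounded to 2 decimals) and the hallucinated case IDs."""
    n = len(ratings["correctness"])
    return {
        "mean_correctness": round(sum(ratings["correctness"]) / n, 2),
        "mean_comprehensiveness": round(sum(ratings["comprehensiveness"]) / n, 2),
        # Case IDs are 1-based, matching the table header.
        "hallucinated_cases": [
            case_id
            for case_id, flag in enumerate(ratings["hallucination"], start=1)
            if flag == "Yes"
        ],
    }


if __name__ == "__main__":
    for label, ratings in (("initial", initial), ("revised", revised)):
        print(label, summarize(ratings))
```

Plain means are used here only as a descriptive summary; the table itself reports per-case scores and does not state how, or whether, the ratings were averaged.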