Table 2.
κ coefficient for interrater agreement between GPT-4 and the physicians’ evaluations for the differential diagnosis lists.
| Differential-diagnosis lists generator | Cohen κ coefficient (95% CI) | Strength of agreement [34] | Number of differential-diagnosis lists |
| All | 0.63 (0.56-0.69) | Fair to good | 1176 |
| GPT-4 | 0.47 (0.39-0.56) | Fair to good | 392 |
| Google Bard<sup>a</sup> | 0.67 (0.52-0.73) | Fair to good | 392 |
| LLaMA2 chatbot<sup>b</sup> | 0.63 (0.52-0.73) | Fair to good | 392 |
<sup>a</sup>Currently Google Gemini.
<sup>b</sup>LLaMA2: LLM Meta AI 2.
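
For readers unfamiliar with the statistic, the following is a minimal sketch of how a Cohen κ coefficient can be computed from paired ratings. It assumes each rater gives a binary judgment per differential-diagnosis list (the table itself does not state the rating scale). The function name `cohen_kappa` and the counts used are hypothetical, chosen for illustration only; they are not the study's data, and the CI calculation is not shown.

```python
def cohen_kappa(both_yes, a_yes_b_no, a_no_b_yes, both_no):
    """Cohen's kappa for two raters and a binary rating, from 2x2 counts.

    Hypothetical helper for illustration; not the authors' code.
    """
    n = both_yes + a_yes_b_no + a_no_b_yes + both_no
    if n == 0:
        raise ValueError("no rated items")

    # Observed agreement: proportion of items on which both raters agree.
    p_o = (both_yes + both_no) / n

    # Marginal proportion of "yes" ratings for each rater.
    a_yes = (both_yes + a_yes_b_no) / n
    b_yes = (both_yes + a_no_b_yes) / n

    # Agreement expected by chance if the raters were independent.
    p_e = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")

    return (p_o - p_e) / (1 - p_e)


# Illustrative (made-up) counts for 100 lists rated by two raters.
kappa = cohen_kappa(both_yes=40, a_yes_b_no=10, a_no_b_yes=10, both_no=40)
print(round(kappa, 2))
```

With these made-up counts the observed agreement is 0.80 and the chance agreement is 0.50, so κ = (0.80 - 0.50) / (1 - 0.50) = 0.60.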