. 2024 Jun 26;8:e59267. doi: 10.2196/59267

Table 2.

Cohen κ coefficients for interrater agreement between GPT-4’s and the physicians’ evaluations of the differential-diagnosis lists.

Differential-diagnosis lists generator	Cohen κ coefficient (95% CI)	Strength of agreement [34]	Number of differential-diagnosis lists
All	0.63 (0.56-0.69)	Fair to good	1176
GPT-4	0.47 (0.39-0.56)	Fair to good	392
Google Bard^a	0.67 (0.52-0.73)	Fair to good	392
LLaMA2 chatbot^b	0.63 (0.52-0.73)	Fair to good	392

^aCurrently Google Gemini.

^bLLaMA2: Large Language Model Meta AI 2.
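For readers unfamiliar with the statistic reported in Table 2, the following is a minimal sketch of how a Cohen κ coefficient can be computed for two raters, using κ = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name `cohen_kappa`, the binary scoring scheme (1 = final diagnosis judged present in the list, 0 = absent), and the toy scores are assumptions made for illustration; they are not the study's data or code, and the sketch does not reproduce the confidence intervals in the table.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: sum over labels of the product of the two raters'
    # marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Made-up toy scores (NOT from the study), one entry per differential-diagnosis list.
model_scores = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
physician_scores = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]

kappa = cohen_kappa(model_scores, physician_scores)
print(f"kappa = {kappa:.2f}")
```

In this toy example the raters agree on 8 of 10 lists (p_o = 0.80) and chance agreement is p_e = 0.52, so κ = 0.28 / 0.48, or about 0.58. Raw agreement alone would overstate concordance, which is why κ corrects for chance.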