2023 Nov 20;20:30. doi: 10.3352/jeehp.2023.20.30

Table 1. Agreement between the 3 attempts of each chatbot, calculated using the Fleiss kappa

Category                    GPT-4    Bing     GPT-3    Claude   Bard
Total                       0.647    0.668    0.700    0.714    0.574
Areas
  Surgery                   0.100    0.655    0.769    0.843    0.688
  Internal medicine         0.638    0.837    0.669    0.678    0.632
  Pediatrics                0.571    0.595    0.550    0.847    0.417
  Obstetrics & gynecology   0.745    0.396    0.733    0.699    0.697
  Public health             0.709    0.844    0.699    0.741    0.096
  Emergency medicine        1.000    0.111    0.832   -0.007    0.495
Type of item
  Recall                    0.533    0.782    0.665    0.623    0.321
  Application of knowledge  0.688    0.632    0.708    0.735    0.628
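
For readers unfamiliar with the statistic, the following is a minimal sketch of how a Fleiss kappa across repeated attempts could be computed. It is not the authors' code. The function name `fleiss_kappa`, the input layout (one list of answers per question, one answer per attempt), and the sample answers are all assumptions made for illustration; the source does not state whether agreement was computed on the chosen option or on correct/incorrect outcomes.

```python
import math
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss kappa for a list of items, each rated the same number of times.

    ratings: list of lists; ratings[i] holds the category chosen for item i
    on each attempt (hypothetical layout, e.g. ["A", "A", "C"]).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreement_sum += (
            sum(c * c for c in counts.values()) - n_raters
        ) / (n_raters * (n_raters - 1))

    p_observed = agreement_sum / n_items
    # Agreement expected by chance from the overall category proportions.
    p_expected = sum(
        (total / (n_items * n_raters)) ** 2
        for total in category_totals.values()
    )
    if p_expected == 1.0:
        # Only one category was ever used, so kappa is undefined.
        return math.nan
    return (p_observed - p_expected) / (1.0 - p_expected)


# Made-up answers for two questions, three attempts each (illustrative only).
perfect = [["A", "A", "A"], ["B", "B", "B"]]
mixed = [["A", "A", "B"], ["A", "B", "B"]]
print(fleiss_kappa(perfect))
print(fleiss_kappa(mixed))
```

Kappa equals 1 when all attempts agree on every item and falls below 0 when observed agreement is lower than chance-expected agreement, which is how a slightly negative value such as the -0.007 in the table can arise.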