Table 2.

| Author | No. of MCQs | Tested vs. Human | Medical Field | Questions Evaluated By | Performance Scores |
|---|---|---|---|---|---|
| Sevgi et al. | 3 | No | Neurosurgery | Evaluated by the author according to current literature | 2 (66.6%) of the questions were accurate |
| Biswas | 5 | No | General | N/A | N/A |
| Agarwal et al. | 320 | No | Medical Physiology | 2 physiologists | Validity: p < 0.001 (ChatGPT vs. Bing < 0.001; Bard vs. Bing < 0.001). Difficulty: p < 0.006 (ChatGPT vs. Bing 0.010; ChatGPT vs. Bard 0.003) |
| Ayub et al. | 40 | No | Dermatology | 2 board-certified dermatologists | 16 (40%) of questions valid for exams |
| Cheung et al. | 50 | Yes | Internal Medicine/Surgery | 5 international medical experts and educators | Overall performance: AI score 20 (40%) vs. human score 30 (60%); mean difference -0.80 ± 4.82; total time required: AI 20 min 25 s vs. human 211 min 33 s |
| Totlis et al. | 18 | No | Anatomy | N/A | N/A |
| Han et al. | 3 | No | Biochemistry | N/A | N/A |
| Klang et al. | 210 | No | Internal Medicine, Surgery, Obstetrics & Gynecology, Psychiatry, Pediatrics | 5 specialist physicians in the tested fields | Problematic questions by field: Surgery 30%, Gynecology 20%, Pediatrics 10%, Internal Medicine 10%, Psychiatry 0% |

Summary of key parameters investigated in each study, November 2023.