Table 3. Comparison between AI and human questions with the same reference.
Criterion | AI wins | Human wins | Equal | Mean difference (AI–human) ± SD |
---|---|---|---|---|
Appropriateness of the question | 18 (36%) | 27 (54%) | 5 (10%) | -0.11 ± 1.05 |
Clarity and specificity | 18 (36%) | 26 (52%) | 6 (12%) | -0.13 ± 1.08 |
Relevance | 18 (36%) | 27 (54%) | 5 (10%) | -0.32 ± 1.04 |
Quality of the alternatives & discriminative power | 21 (42%) | 26 (52%) | 3 (6%) | -0.10 ± 0.94 |
Suitability for graduate medical school exam | 22 (44%) | 28 (56%) | 2 (4%) | -0.14 ± 1.12 |
Total score | 20 (40%) | 30 (60%) | 0 (0%) | -0.80 ± 4.82 |
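
The percentages in Table 3 appear to be counts out of 50 paired comparisons (for example, 18 of 50 is 36%). The table does not state this denominator, so it is an assumption here. A minimal Python sketch under that assumption recomputes the percentages and checks that each row's counts sum to 50. The names `rows`, `N_PAIRS`, `percentages`, and `inconsistent_rows` are illustrative and do not come from the source.

```python
# Counts transcribed from Table 3 as printed: (AI wins, human wins, equal).
rows = {
    "Appropriateness of the question": (18, 27, 5),
    "Clarity and specificity": (18, 26, 6),
    "Relevance": (18, 27, 5),
    "Quality of the alternatives & discriminative power": (21, 26, 3),
    "Suitability for graduate medical school exam": (22, 28, 2),
    "Total score": (20, 30, 0),
}

# Assumption: the denominator is inferred from 18 -> 36%; the table does not state it.
N_PAIRS = 50


def percentages(counts, n=N_PAIRS):
    """Return each count as a whole-number percentage of n."""
    return tuple(round(100 * c / n) for c in counts)


def inconsistent_rows(table, n=N_PAIRS):
    """Return the names of rows whose counts do not sum to n."""
    return [name for name, counts in table.items() if sum(counts) != n]


for name, counts in rows.items():
    print(name, counts, percentages(counts), sum(counts))
print("Rows not summing to", N_PAIRS, ":", inconsistent_rows(rows))
```

Under this assumption, every recomputed percentage matches the printed one, but the counts in the "Suitability for graduate medical school exam" row sum to 52 (104%) as printed, so at least one value in that row may have been transcribed incorrectly.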