Table 1. Comparison of AI model performance by question difficulty.
Difficulty | Questions | ChatGPT-3.5 | GPT-4 | Bard | P-value (ChatGPT-3.5 vs GPT-4) | P-value (ChatGPT-3.5 vs Bard) | P-value (GPT-4 vs Bard) |
Easy | 360 | 251 (69.7%) | 333 (92.5%) | 275 (76.4%) | <0.001* | 0.053 | <0.001* |
Moderate | 353 | 190 (53.8%) | 291 (82.4%) | 223 (63.2%) | <0.001* | <0.05* | <0.001* |
Hard | 364 | 154 (42.3%) | 223 (61.3%) | 166 (45.6%) | <0.001* | 0.411 | <0.001* |
Overall | 1077 | 595 (55.3%) | 847 (78.7%) | 664 (61.7%) | <0.001* | <0.01* | <0.001* |