Table 4.
Accuracy across attempts and direction of answer changes between the first and third attempts for each artificial intelligence model. Accuracy values are the number and percentage of correct responses out of 119 questions per model. Transitions count the questions whose correctness changed between the first and third attempts: “incorrect to correct” indicates a beneficial revision, while “correct to incorrect” denotes a detrimental one.
| Model | Accuracy (first attempt), n (%) | Accuracy (third attempt), n (%) | Incorrect to correct, n | Correct to incorrect, n |
| --- | --- | --- | --- | --- |
| ChatGPT | 95 (79.8) | 96 (80.7) | 1 | 0 |
| Copilot | 101 (84.9) | 107 (89.9) | 6 | 0 |
| DeepSeek | 86 (72.3) | 86 (72.3) | 2 | 2 |
| Gemini | 100 (84) | 100 (84) | 1 | 1 |
| Grok | 109 (91.6) | 109 (91.6) | 0 | 0 |
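
The arithmetic behind the table can be checked with a short script. The sketch below is illustrative only and is not part of the study's methods: the counts are transcribed from Table 4, while the names `table4`, `pct`, and `net_change` are assumptions introduced here. It recomputes each percentage from the raw count and confirms that the third-attempt count equals the first-attempt count plus the net number of revised answers.

```python
# Counts transcribed from Table 4. The data layout and function names
# below are illustrative assumptions, not taken from the study.
TOTAL_QUESTIONS = 119

# model: (correct_first, correct_third, incorrect_to_correct, correct_to_incorrect)
table4 = {
    "ChatGPT": (95, 96, 1, 0),
    "Copilot": (101, 107, 6, 0),
    "DeepSeek": (86, 86, 2, 2),
    "Gemini": (100, 100, 1, 1),
    "Grok": (109, 109, 0, 0),
}


def pct(n_correct, total=TOTAL_QUESTIONS):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * n_correct / total, 1)


def net_change(gained, lost):
    """Net gain in correct answers between the first and third attempts."""
    return gained - lost


for model, (first, third, gained, lost) in table4.items():
    # Third-attempt count should equal first-attempt count plus net revisions.
    assert first + net_change(gained, lost) == third, model
    print(f"{model}: {pct(first)}% -> {pct(third)}% "
          f"(net {net_change(gained, lost):+d})")
```

Under these assumptions, every row passes the consistency check. Copilot is the only model with a net gain larger than one question, and the two revisions in each direction for DeepSeek cancel out.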