2026 Mar 9;5:e76928. doi: 10.2196/76928

Table 4.

Accuracy on the first and third attempts, and the direction of answer changes between them, for each artificial intelligence model. Accuracy is reported as the number and percentage of correct responses out of 119 questions per model. Transitions count answers that changed between the first and third attempts: “incorrect to correct” indicates a beneficial revision, and “correct to incorrect” a detrimental one.

| Model | Accuracy (first attempt), n (%) | Accuracy (third attempt), n (%) | Incorrect to correct, n | Correct to incorrect, n |
| --- | --- | --- | --- | --- |
| ChatGPT | 95 (79.8) | 96 (80.7) | 1 | 0 |
| Copilot | 101 (84.9) | 107 (89.9) | 6 | 0 |
| DeepSeek | 86 (72.3) | 86 (72.3) | 2 | 2 |
| Gemini | 100 (84.0) | 100 (84.0) | 1 | 1 |
| Grok | 109 (91.6) | 109 (91.6) | 0 | 0 |
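
The arithmetic behind the table can be checked with a short sketch. It assumes that each percentage is the correct count divided by 119 and rounded to one decimal place, and that the third-attempt count equals the first-attempt count plus the "incorrect to correct" transitions minus the "correct to incorrect" transitions. The names `TABLE_4`, `accuracy_pct`, and `row_is_consistent` are illustrative and do not come from the article.

```python
# Counts transcribed from Table 4; all identifiers here are hypothetical.
TOTAL_QUESTIONS = 119

# model: (first-attempt correct, third-attempt correct,
#         incorrect-to-correct, correct-to-incorrect)
TABLE_4 = {
    "ChatGPT": (95, 96, 1, 0),
    "Copilot": (101, 107, 6, 0),
    "DeepSeek": (86, 86, 2, 2),
    "Gemini": (100, 100, 1, 1),
    "Grok": (109, 109, 0, 0),
}


def accuracy_pct(correct, total=TOTAL_QUESTIONS):
    """Percentage of correct responses, rounded to one decimal place."""
    return round(100 * correct / total, 1)


def row_is_consistent(first, third, gained, lost):
    """Assumed identity: third = first + gained - lost."""
    return third == first + gained - lost


for model, (first, third, gained, lost) in TABLE_4.items():
    print(
        f"{model}: first {accuracy_pct(first)}%, third {accuracy_pct(third)}%, "
        f"net change {gained - lost:+d}, consistent: "
        f"{row_is_consistent(first, third, gained, lost)}"
    )
```

Under these assumptions every row satisfies the identity. Only ChatGPT and Copilot show a net gain, while the revisions for DeepSeek and Gemini cancel out.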