2026 Mar 9;5:e76928. doi: 10.2196/76928

Table 4.

Accuracy across attempts and direction of answer changes between the first and third attempts for each artificial intelligence model. Accuracy values give the number and percentage of correct responses out of 119 questions per model. Transitions reflect changes between the first and third attempts: “incorrect to correct” indicates a beneficial revision, while “correct to incorrect” indicates a detrimental one.

Model	Accuracy (first attempt), n (%)	Accuracy (third attempt), n (%)	Incorrect to correct, n	Correct to incorrect, n
ChatGPT	95 (79.8)	96 (80.7)	1	0
Copilot	101 (84.9)	107 (89.9)	6	0
DeepSeek	86 (72.3)	86 (72.3)	2	2
Gemini	100 (84)	100 (84)	1	1
Grok	109 (91.6)	109 (91.6)	0	0
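
The arithmetic behind Table 4 can be rechecked with a short script. The following is a minimal sketch, not part of the study's analysis: the names `TABLE_4`, `accuracy_pct`, and `net_change` are illustrative assumptions, and the counts are copied from the table above. It recomputes each percentage from the 119-question denominator and confirms that the change in correct answers between the first and third attempts equals the beneficial revisions minus the detrimental ones.

```python
TOTAL_QUESTIONS = 119  # questions per model, as stated in the table caption

# Counts copied from Table 4 (hypothetical variable name):
# model -> (first-attempt correct, third-attempt correct,
#           incorrect-to-correct, correct-to-incorrect)
TABLE_4 = {
    "ChatGPT": (95, 96, 1, 0),
    "Copilot": (101, 107, 6, 0),
    "DeepSeek": (86, 86, 2, 2),
    "Gemini": (100, 100, 1, 1),
    "Grok": (109, 109, 0, 0),
}


def accuracy_pct(correct, total=TOTAL_QUESTIONS):
    """Return accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


def net_change(gained, lost):
    """Net change in correct answers: beneficial minus detrimental revisions."""
    return gained - lost


for model, (first, third, gained, lost) in TABLE_4.items():
    # The difference in correct counts must match the net transitions.
    consistent = (third - first) == net_change(gained, lost)
    print(
        f"{model}: first {accuracy_pct(first)}%, third {accuracy_pct(third)}%, "
        f"net change {net_change(gained, lost):+d}, consistent={consistent}"
    )
```

Under these assumptions, every row is internally consistent. DeepSeek and Gemini show no net change in accuracy even though individual answers changed in both directions.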