Table 2.

Accuracy by cognitive level for AI models.

AI Model	Odds Ratio	95% Confidence Interval	P-value
ChatGPT-4o	1.166	[0.522, 2.605]	0.709
Llama 70B	0.358	[0.170, 0.754]	0.007^a
Llama 405B	1.379	[0.630, 3.018]	0.421

Only Llama 3.1 70B had significantly lower odds of correctly answering cognitive level 2 questions (n = 61) than cognitive level 1 questions (n = 193).

^a Statistically significant.
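The relationship between the odds ratios, confidence intervals, and P-values in Table 2 can be checked with a short sketch. It assumes the intervals are Wald intervals, symmetric on the log-odds scale, which the table itself does not state; the function names and the `TABLE_2` dictionary are illustrative and not part of the original analysis. Under that assumption, the standard error is recovered from the interval width and used to compute an approximate two-sided P-value, which should land close to the reported values.

```python
import math

# Values copied from Table 2: (odds ratio, CI lower, CI upper, reported P-value).
TABLE_2 = {
    "ChatGPT-4o": (1.166, 0.522, 2.605, 0.709),
    "Llama 70B": (0.358, 0.170, 0.754, 0.007),
    "Llama 405B": (1.379, 0.630, 3.018, 0.421),
}

Z_95 = 1.959964  # two-sided 95% critical value of the standard normal


def wald_p_from_ci(odds_ratio, lower, upper):
    """Approximate two-sided P-value implied by an odds ratio and its 95% CI.

    Assumption (not stated in the source): the CI is a Wald interval,
    i.e. log(OR) +/- Z_95 * SE on the log-odds scale.
    """
    se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    z = math.log(odds_ratio) / se
    # Two-sided normal tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def ci_excludes_one(lower, upper):
    """True when the interval does not contain OR = 1 (no difference)."""
    return not (lower <= 1.0 <= upper)


if __name__ == "__main__":
    for model, (or_, lo, hi, p_reported) in TABLE_2.items():
        p_approx = wald_p_from_ci(or_, lo, hi)
        print(
            f"{model}: approx P = {p_approx:.3f}, "
            f"reported P = {p_reported}, "
            f"CI excludes 1: {ci_excludes_one(lo, hi)}"
        )
```

An odds ratio below 1 means lower odds of a correct answer at cognitive level 2 than at level 1. An interval that excludes 1 corresponds to a P-value below 0.05, which in this table holds only for the Llama 70B row.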