2025 Jan 10;13:e63731. doi: 10.2196/63731

Table 3.

Accuracy of large language models for 4 question types.

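| Question type | GPT-3.5 | GPT-4.0 | GPT-4o | Copilot | ERNIE Bot-3.5 | SPARK | Qwen-2.5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2023 | | | | | | | |
| A1 | 0.526 | 0.684 | 0.789 | 0.763 | 0.763 | 0.719 | 0.860 |
| A2 | 0.443 | 0.807 | 0.830 | 0.784 | 0.807 | 0.705 | 0.909 |
| A3 | 0.647 | 0.794 | 0.794 | 0.824 | 0.824 | 0.765 | 0.853 |
| A4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 2022 | | | | | | | |
| A1 | 0.476 | 0.746 | 0.881 | 0.825 | 0.817 | 0.698 | 0.913 |
| A2 | 0.405 | 0.582 | 0.671 | 0.696 | 0.671 | 0.620 | 0.810 |
| A3 | 0.519 | 0.667 | 0.889 | 0.778 | 0.852 | 0.630 | 0.963 |
| A4 | 0.500 | 0.750 | 1.000 | 0.875 | 0.875 | 0.875 | 0.875 |
| 2021 | | | | | | | |
| A1 | 0.467 | 0.660 | 0.820 | 0.667 | 0.727 | 0.600 | 0.927 |
| A2 | 0.538 | 0.738 | 0.846 | 0.738 | 0.831 | 0.646 | 0.938 |
| A3 | 0.524 | 0.762 | 0.857 | 0.714 | 0.857 | 0.571 | 0.810 |
| A4 | 0.500 | 1.000 | 1.000 | 1.000 | 0.750 | 0.500 | 1.000 |
| 2020 | | | | | | | |
| A1 | 0.478 | 0.675 | 0.771 | 0.650 | 0.771 | 0.656 | 0.879 |
| A2 | 0.492 | 0.695 | 0.763 | 0.695 | 0.814 | 0.678 | 0.915 |
| A3 | 0.550 | 0.600 | 0.550 | 0.650 | 0.750 | 0.600 | 0.900 |
| A4 | 0.500 | 0.750 | 1.000 | 0.750 | 1.000 | 0.250 | 1.000 |
| 2019 | | | | | | | |
| A1 | 0.490 | 0.680 | 0.804 | 0.503 | 0.791 | 0.601 | 0.869 |
| A2 | 0.569 | 0.745 | 0.843 | 0.588 | 0.784 | 0.588 | 0.843 |
| A3 | 0.556 | 0.778 | 0.861 | 0.500 | 0.778 | 0.583 | 0.917 |
| Overall | 0.495 | 0.703 | 0.807 | 0.688 | 0.781 | 0.650 | 0.889 |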
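
As a reading aid, the overall accuracies in the last row of Table 3 can be ranked with a few lines of Python. This is only a minimal sketch: the values are copied from the "Overall" row, while the dictionary name `overall_accuracy` and the helper `rank_models` are hypothetical names introduced here for illustration and are not part of the study's methods.

```python
# Overall accuracy per model, copied from the "Overall" row of Table 3.
overall_accuracy = {
    "GPT-3.5": 0.495,
    "GPT-4.0": 0.703,
    "GPT-4o": 0.807,
    "Copilot": 0.688,
    "ERNIE Bot-3.5": 0.781,
    "SPARK": 0.650,
    "Qwen-2.5": 0.889,
}


def rank_models(scores):
    """Return (model, accuracy) pairs sorted from highest to lowest accuracy.

    Hypothetical helper for illustration; not taken from the paper.
    """
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_models(overall_accuracy)
for position, (model, accuracy) in enumerate(ranking, start=1):
    print(f"{position}. {model}: {accuracy:.3f}")
```

On these values, Qwen-2.5 ranks first and GPT-3.5 last. The same helper could be applied to any single row of the table, for example one year's A1 results, by passing a dictionary built from that row.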