Table 3.
Accuracy of large language models by question type (A1 to A4) and year.
| Question type | GPT-3.5 | GPT-4.0 | GPT-4o | Copilot | ERNIE Bot-3.5 | SPARK | Qwen-2.5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2023 | | | | | | | |
| A1 | 0.526 | 0.684 | 0.789 | 0.763 | 0.763 | 0.719 | 0.860 |
| A2 | 0.443 | 0.807 | 0.830 | 0.784 | 0.807 | 0.705 | 0.909 |
| A3 | 0.647 | 0.794 | 0.794 | 0.824 | 0.824 | 0.765 | 0.853 |
| A4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 2022 | | | | | | | |
| A1 | 0.476 | 0.746 | 0.881 | 0.825 | 0.817 | 0.698 | 0.913 |
| A2 | 0.405 | 0.582 | 0.671 | 0.696 | 0.671 | 0.620 | 0.810 |
| A3 | 0.519 | 0.667 | 0.889 | 0.778 | 0.852 | 0.630 | 0.963 |
| A4 | 0.500 | 0.750 | 1.000 | 0.875 | 0.875 | 0.875 | 0.875 |
| 2021 | | | | | | | |
| A1 | 0.467 | 0.660 | 0.820 | 0.667 | 0.727 | 0.600 | 0.927 |
| A2 | 0.538 | 0.738 | 0.846 | 0.738 | 0.831 | 0.646 | 0.938 |
| A3 | 0.524 | 0.762 | 0.857 | 0.714 | 0.857 | 0.571 | 0.810 |
| A4 | 0.500 | 1.000 | 1.000 | 1.000 | 0.750 | 0.500 | 1.000 |
| 2020 | | | | | | | |
| A1 | 0.478 | 0.675 | 0.771 | 0.650 | 0.771 | 0.656 | 0.879 |
| A2 | 0.492 | 0.695 | 0.763 | 0.695 | 0.814 | 0.678 | 0.915 |
| A3 | 0.550 | 0.600 | 0.550 | 0.650 | 0.750 | 0.600 | 0.900 |
| A4 | 0.500 | 0.750 | 1.000 | 0.750 | 1.000 | 0.250 | 1.000 |
| 2019 | | | | | | | |
| A1 | 0.490 | 0.680 | 0.804 | 0.503 | 0.791 | 0.601 | 0.869 |
| A2 | 0.569 | 0.745 | 0.843 | 0.588 | 0.784 | 0.588 | 0.843 |
| A3 | 0.556 | 0.778 | 0.861 | 0.500 | 0.778 | 0.583 | 0.917 |
| Overall | 0.495 | 0.703 | 0.807 | 0.688 | 0.781 | 0.650 | 0.889 |