Table 4. Scores by prompt mode: effect of prompt engineering on ChatGPT performance. For each model variant, the scores from 5 rounds, the mean (SD), and the P value of the paired comparison between the no-prompt (suffix N) and prompt-engineering (suffix P) conditions are listed, highlighting the variable benefit of structured prompts across model generations.
| Model and prompt mode | Round 1 | Round 2 | Round 3 | Round 4 | Round 5 | Mean (SD) | P value | |
|---|---|---|---|---|---|---|---|---|
| Midterm exams | ||||||||
| GPT-3.5N | 60 | 59 | 58 | 60 | 59 | 59.2 (0.84) | <.001 | |
| GPT-3.5P | 68 | 72 | 71 | 69 | 69 | 69.8 (1.64) | —<sup>a</sup> | |
| GPT-4N | 80 | 81 | 81 | 82 | 83 | 81.4 (1.14) | .002 | |
| GPT-4P | 83 | 84 | 85 | 85 | 86 | 84.6 (1.14) | — | |
| GPT-4oN | 88 | 89 | 88 | 90 | 88 | 88.6 (0.89) | .07 | |
| GPT-4oP | 89 | 90 | 90 | 89 | 90 | 89.6 (0.55) | — | |
| GPT-4o1-miniN | 91 | 90 | 92 | 90 | 91 | 90.8 (0.84) | .69 | |
| GPT-4o1-miniP | 91 | 91 | 91 | 92 | 90 | 91 (0.71) | — | |
| GPT-4o1N | 92 | 92 | 91 | 92 | 92 | 91.8 (0.45) | .55 | |
| GPT-4o1P | 91 | 92 | 92 | 92 | 91 | 91.6 (0.55) | — | |
| Final exams | ||||||||
| GPT-3.5N | 54 | 56 | 55 | 54 | 56 | 55 (1) | <.01 | |
| GPT-3.5P | 61 | 60 | 60 | 60 | 60 | 60.2 (0.45) | — | |
| GPT-4N | 85 | 84 | 84 | 85 | 83 | 84.2 (0.84) | <.01 | |
| GPT-4P | 89 | 87 | 87 | 88 | 88 | 87.8 (0.84) | — | |
| GPT-4oN | 89 | 90 | 90 | 90 | 90 | 89.8 (0.45) | .94 | |
| GPT-4oP | 90 | 91 | 90 | 91 | 90 | 90.4 (0.55) | — | |
| GPT-4o1-miniN | 91 | 92 | 92 | 91 | 91 | 91.4 (0.55) | .58 | |
| GPT-4o1-miniP | 92 | 91 | 92 | 92 | 91 | 91.6 (0.55) | — | |
| GPT-4o1N | 91 | 92 | 91 | 91 | 92 | 91.4 (0.55) | .24 | |
| GPT-4o1P | 91 | 92 | 92 | 92 | 92 | 91.8 (0.45) | — | |
<sup>a</sup>Not applicable.
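
The Mean (SD) column can be recomputed from the five round scores. The following is a minimal sketch, not the authors' analysis code: it assumes the SD is the sample standard deviation (n - 1 denominator), an assumption that reproduces the reported values for the two midterm rows used here. The `scores` dictionary and the `summarize` helper are hypothetical names introduced for illustration.

```python
from statistics import mean, stdev

# Round scores copied from the midterm GPT-3.5 rows of Table 4.
scores = {
    "GPT-3.5N": [60, 59, 58, 60, 59],
    "GPT-3.5P": [68, 72, 71, 69, 69],
}


def summarize(rounds):
    """Return (mean, sample SD) of the round scores, rounded as in the table.

    Assumption: SD uses the n - 1 denominator (statistics.stdev).
    """
    return round(mean(rounds), 1), round(stdev(rounds), 2)


for model, rounds in scores.items():
    m, sd = summarize(rounds)
    print(f"{model}: {m} ({sd})")
# prints:
# GPT-3.5N: 59.2 (0.84)
# GPT-3.5P: 69.8 (1.64)
```

The same helper can be applied to any other row to check its summary cell. The P values are not recomputed here because the table does not state which statistical test was used.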