2025 Oct 1;11:e78320. doi: 10.2196/78320

Table 4. Investigation of the scores in different prompt modes: effect of prompt engineering on ChatGPT performance. Paired comparisons of mean (SD) scores, with P values, are listed for each model variant under no-prompt (suffix N) versus prompt-engineering (suffix P) conditions, highlighting the variable benefit of structured prompts across model generations.

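Midterm exams

| Model | Round 1 | Round 2 | Round 3 | Round 4 | Round 5 | Mean (SD) | P value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5N | 60 | 59 | 58 | 60 | 59 | 59.2 (0.84) | <.001 |
| GPT-3.5P | 68 | 72 | 71 | 69 | 69 | 69.8 (1.64) | a |
| GPT-4N | 80 | 81 | 81 | 82 | 83 | 81.4 (1.14) | .002 |
| GPT-4P | 83 | 84 | 85 | 85 | 86 | 84.6 (1.14) | |
| GPT-4oN | 88 | 89 | 88 | 90 | 88 | 88.6 (0.89) | .07 |
| GPT-4oP | 89 | 90 | 90 | 89 | 90 | 89.6 (0.55) | |
| GPT-4o1-miniN | 91 | 90 | 92 | 90 | 91 | 90.8 (0.84) | .69 |
| GPT-4o1-miniP | 91 | 91 | 91 | 92 | 90 | 91.0 (0.71) | |
| GPT-4o1N | 92 | 92 | 91 | 92 | 92 | 91.8 (0.45) | .55 |
| GPT-4o1P | 91 | 92 | 92 | 92 | 91 | 91.6 (0.55) | |

Final exams

| Model | Round 1 | Round 2 | Round 3 | Round 4 | Round 5 | Mean (SD) | P value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5N | 54 | 56 | 55 | 54 | 56 | 55.0 (1.00) | <.01 |
| GPT-3.5P | 61 | 60 | 60 | 60 | 60 | 60.2 (0.45) | |
| GPT-4N | 85 | 84 | 84 | 85 | 83 | 84.2 (0.84) | <.01 |
| GPT-4P | 89 | 87 | 87 | 88 | 88 | 87.8 (0.84) | |
| GPT-4oN | 89 | 90 | 90 | 90 | 90 | 89.8 (0.45) | .94 |
| GPT-4oP | 90 | 91 | 90 | 91 | 90 | 90.4 (0.55) | |
| GPT-4o1-miniN | 91 | 92 | 92 | 91 | 91 | 91.4 (0.55) | .58 |
| GPT-4o1-miniP | 92 | 91 | 92 | 92 | 91 | 91.6 (0.55) | |
| GPT-4o1N | 91 | 92 | 91 | 91 | 92 | 91.4 (0.55) | .24 |
| GPT-4o1P | 91 | 92 | 92 | 92 | 92 | 91.8 (0.45) | |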
a Not applicable.
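
The Mean (SD) column can be rechecked directly from the five round scores. The following is a minimal Python sketch, not the authors' analysis code. It assumes the SD is the sample standard deviation (n - 1 denominator), which the table does not state but which reproduces the reported values for the two rows used here. The `summarize` helper and the `scores` dictionary are illustrative names.

```python
from statistics import mean, stdev

# Round scores copied from Table 4 (midterm exams, GPT-3.5 rows).
scores = {
    "GPT-3.5N": [60, 59, 58, 60, 59],
    "GPT-3.5P": [68, 72, 71, 69, 69],
}


def summarize(values):
    """Return (mean, sample SD), rounded as the table reports them.

    Assumption: SD uses the n - 1 denominator (statistics.stdev).
    """
    return round(mean(values), 1), round(stdev(values), 2)


summary = {name: summarize(vals) for name, vals in scores.items()}

for name, (m, sd) in summary.items():
    print(f"{name}: {m} ({sd})")
```

The same helper can be applied to any other row of the table to check its reported mean and SD. Reproducing the P values would additionally require knowing which statistical test was used, which this table does not specify.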