. 2025 Oct 1;11:e78320. doi: 10.2196/78320

Table 4. Investigation of the scores in different prompt modes: effect of prompt engineering on ChatGPT performance. Paired comparisons of mean (SD) scores, with P values, are listed for each model variant under no-prompt (suffix N) versus prompt-engineering (suffix P) conditions, highlighting the variable benefit of structured prompts across model generations.

Model	Round 1	Round 2	Round 3	Round 4	Round 5	Mean (SD)	P value
Midterm exams
GPT-3.5N	60	59	58	60	59	59.2 (0.84)	<.001
GPT-3.5P	68	72	71	69	69	69.8 (1.64)	—^a
GPT-4N	80	81	81	82	83	81.4 (1.14)	.002
GPT-4P	83	84	85	85	86	84.6 (1.14)	—
GPT-4oN	88	89	88	90	88	88.6 (0.89)	.07
GPT-4oP	89	90	90	89	90	89.6 (0.55)	—
GPT-4o1-miniN	91	90	92	90	91	90.8 (0.84)	.69
GPT-4o1-miniP	91	91	91	92	90	91.0 (0.71)	—
GPT-4o1N	92	92	91	92	92	91.8 (0.45)	.55
GPT-4o1P	91	92	92	92	91	91.6 (0.55)	—
Final exams
GPT-3.5N	54	56	55	54	56	55.0 (1.00)	<.01
GPT-3.5P	61	60	60	60	60	60.2 (0.45)	—
GPT-4N	85	84	84	85	83	84.2 (0.84)	<.01
GPT-4P	89	87	87	88	88	87.8 (0.84)	—
GPT-4oN	89	90	90	90	90	89.8 (0.45)	.94
GPT-4oP	90	91	90	91	90	90.4 (0.55)	—
GPT-4o1-miniN	91	92	92	91	91	91.4 (0.55)	.58
GPT-4o1-miniP	92	91	92	92	91	91.6 (0.55)	—
GPT-4o1N	91	92	91	91	92	91.4 (0.55)	.24
GPT-4o1P	91	92	92	92	92	91.8 (0.45)	—

^a Not applicable.
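
The Mean (SD) column can be recomputed from the five round scores in each row. The following is a minimal Python sketch of that arithmetic for the GPT-4 midterm rows, using only the standard library. The helper names `summarize` and `pooled_t_statistic` are illustrative and do not come from the article. The table does not name the statistical test behind the P values, so the pooled-variance two-sample t statistic shown here is an assumption, included only to illustrate how the two conditions might be compared.

```python
from math import sqrt
from statistics import mean, stdev

# Midterm round scores copied from Table 4 (GPT-4, no prompt vs prompt engineering).
gpt4_no_prompt = [80, 81, 81, 82, 83]
gpt4_prompt = [83, 84, 85, 85, 86]


def summarize(scores):
    """Return (mean, sample SD), rounded to 2 decimals like the Mean (SD) column."""
    return round(mean(scores), 2), round(stdev(scores), 2)


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with pooled variance.

    Assumption: the table does not state which test was used; this is one
    common choice for comparing two small groups of scores.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(b) - mean(a)) / sqrt(pooled_var * (1 / na + 1 / nb))


print(summarize(gpt4_no_prompt))  # (81.4, 1.14)
print(summarize(gpt4_prompt))     # (84.6, 1.14)
print(round(pooled_t_statistic(gpt4_no_prompt, gpt4_prompt), 2))  # 4.44
```

The summary values agree with the 81.4 (1.14) and 84.6 (1.14) entries in the table. Converting the t statistic to a P value requires the t distribution with 8 degrees of freedom, which the standard library does not provide; a statistics package would be needed for that step.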