2023 Oct 30;13:18617. doi: 10.1038/s41598-023-45644-9

Table 4.

P-values of the Wilcoxon signed-rank tests adjusted for multiple comparisons using the Holm-Bonferroni method. Effect sizes measured with Cohen’s d reported for significant results.

Criterion/Linguistic characteristic	Human vs. ChatGPT-3	Human vs. ChatGPT-4	ChatGPT-3 vs. ChatGPT-4
Topic and completeness	$< 0.001$ ($d = -0.77$)	$< 0.001$ ($d = -1.09$)	0.095
Logic and composition	$< 0.001$ ($d = -0.84$)	$< 0.001$ ($d = -1.20$)	$0.025$ ($d = -0.45$)
Expressiveness and compr.	$0.008$ ($d = -0.57$)	$< 0.001$ ($d = -0.88$)	0.055
Language mastery	$< 0.001$ ($d = -1.15$)	$< 0.001$ ($d = -1.43$)	0.105
Complexity	$0.025$ ($d = -0.52$)	$< 0.001$ ($d = -0.99$)	$0.025$ ($d = -0.48$)
Vocabulary and text linking	$< 0.001$ ($d = -0.76$)	$< 0.001$ ($d = -1.27$)	$0.012$ ($d = -0.50$)
Language constructs	$< 0.001$ ($d = -0.82$)	$< 0.001$ ($d = -1.15$)	0.105
Lexical diversity	$< 0.001$ ($d = 1.06$)	$0.001$ ($d = -0.60$)	$< 0.001$ ($d = -1.93$)
Syntactic complexity (depth)	$0.001$ ($d = -0.59$)	0.055	0.105
Syntactic complexity (clauses)	$< 0.001$ ($d = -0.93$)	$0.004$ ($d = -0.54$)	$0.024$ ($d = 0.49$)
Nominalizations	$< 0.001$ ($d = -0.88$)	$< 0.001$ ($d = -1.35$)	$0.020$ ($d = -0.29$)
Modals	$0.025$ ($d = 0.39$)	$< 0.001$ ($d = 1.08$)	$< 0.001$ ($d = 0.76$)
Epistemic markers	$< 0.001$ ($d = 1.01$)	$< 0.001$ ($d = 1.53$)	$0.005$ ($d = 0.65$)
Discourse markers	0.150	$< 0.001$ ($d = 0.98$)	$< 0.001$ ($d = 0.85$)
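
The two quantities reported in each cell can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The function names `holm_adjust` and `cohens_d` are hypothetical. The pooled-standard-deviation form of Cohen's d and the sign convention (first group minus second group) are assumptions, since the table does not state which variant was used. The input numbers are made up for illustration only.

```python
from math import sqrt
from statistics import mean, stdev


def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order.

    The k-th smallest raw p-value (k = 0, 1, ...) is multiplied by (m - k),
    capped at 1, and forced to be at least as large as the previous
    adjusted value so the adjusted sequence stays monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (assumed variant).

    A negative value means the first group has the lower mean.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


if __name__ == "__main__":
    # Hypothetical raw p-values from three pairwise tests.
    raw = [0.01, 0.04, 0.03]
    print(holm_adjust(raw))

    # Hypothetical ratings for two groups of essays.
    print(cohens_d([1, 2, 3], [2, 3, 4]))
```

Under this convention, a negative d in the "Human vs. ChatGPT-3" column would mean that the human-written essays had the lower mean on that row's measure.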