2023 Oct 30;13:18617. doi: 10.1038/s41598-023-45644-9

Table 4.

P-values of the Wilcoxon signed-rank tests, adjusted for multiple comparisons using the Holm-Bonferroni method. Effect sizes, measured with Cohen’s d, are reported for significant results only.

| Criterion/Linguistic characteristic | Human vs. ChatGPT-3 | Human vs. ChatGPT-4 | ChatGPT-3 vs. ChatGPT-4 |
|---|---|---|---|
| Topic and completeness | <0.001 (d=-0.77) | <0.001 (d=-1.09) | 0.095 |
| Logic and composition | <0.001 (d=-0.84) | <0.001 (d=-1.20) | 0.025 (d=-0.45) |
| Expressiveness and comprehensibility | 0.008 (d=-0.57) | <0.001 (d=-0.88) | 0.055 |
| Language mastery | <0.001 (d=-1.15) | <0.001 (d=-1.43) | 0.105 |
| Complexity | 0.025 (d=-0.52) | <0.001 (d=-0.99) | 0.025 (d=-0.48) |
| Vocabulary and text linking | <0.001 (d=-0.76) | <0.001 (d=-1.27) | 0.012 (d=-0.50) |
| Language constructs | <0.001 (d=-0.82) | <0.001 (d=-1.15) | 0.105 |
| Lexical diversity | <0.001 (d=1.06) | 0.001 (d=-0.60) | <0.001 (d=-1.93) |
| Syntactic complexity (depth) | 0.001 (d=-0.59) | 0.055 | 0.105 |
| Syntactic complexity (clauses) | <0.001 (d=-0.93) | 0.004 (d=-0.54) | 0.024 (d=0.49) |
| Nominalizations | <0.001 (d=-0.88) | <0.001 (d=-1.35) | 0.020 (d=-0.29) |
| Modals | 0.025 (d=0.39) | <0.001 (d=1.08) | <0.001 (d=0.76) |
| Epistemic markers | <0.001 (d=1.01) | <0.001 (d=1.53) | 0.005 (d=0.65) |
| Discourse markers | 0.150 | <0.001 (d=0.98) | <0.001 (d=0.85) |
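
For readers unfamiliar with the two quantities named in the caption, the following is a minimal sketch of a Holm-Bonferroni step-down adjustment and of one common Cohen's d variant (difference in means divided by the pooled standard deviation of two equally sized groups). It is not the authors' analysis code. The function names `holm_adjust` and `cohens_d` and the sample numbers are hypothetical, and the table does not state which Cohen's d formula was used, so the pooled-SD form is an assumption.

```python
import math
import statistics


def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the raw p-values, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p-value never drops below
        # the adjusted value of any smaller raw p-value.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def cohens_d(a, b):
    """Cohen's d with pooled SD for equal-sized groups (assumed variant)."""
    pooled_sd = math.sqrt((statistics.variance(a) + statistics.variance(b)) / 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


# Hypothetical raw p-values, e.g. from three pairwise tests.
raw = [0.01, 0.04, 0.03]
print([round(p, 3) for p in holm_adjust(raw)])  # [0.03, 0.06, 0.06]

# Hypothetical ratings for two groups.
print(cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))  # -1.0
```

Under this sketch, a negative d means the first group's mean is lower than the second's. Whether the signs in the table follow the same ordering is not stated here and should be checked against the article text.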