Table 4.
P-values of the Wilcoxon signed-rank tests, adjusted for multiple comparisons using the Holm-Bonferroni method. Effect sizes, measured with Cohen’s d, are reported for significant results.
Criterion/Linguistic characteristic | Human vs. ChatGPT-3 | Human vs. ChatGPT-4 | ChatGPT-3 vs. ChatGPT-4 |
---|---|---|---|
Topic and completeness | 0.095 | | |
Logic and composition | | | |
Expressiveness and compr. | 0.055 | | |
Language mastery | 0.105 | | |
Complexity | | | |
Vocabulary and text linking | | | |
Language constructs | 0.105 | | |
Lexical diversity | | | |
Syntactic complexity (depth) | 0.055 | 0.105 | |
Syntactic complexity (clauses) | | | |
Nominalizations | | | |
Modals | | | |
Epistemic markers | | | |
Discourse markers | 0.150 | | |
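
The Holm-Bonferroni adjustment named in the caption can be illustrated with a minimal sketch. The function name `holm_adjust` and the raw p-values below are hypothetical and are not taken from the study; in practice the raw p-values would come from a paired test such as `scipy.stats.wilcoxon`. The sketch only shows the step-down adjustment itself, assuming the standard formulation in which the i-th smallest of m p-values is multiplied by (m - i + 1), capped at 1, and forced to be non-decreasing.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order.

    Assumes the standard step-down procedure; this is an illustrative
    sketch, not the authors' implementation.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three comparisons (not data from the table).
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
for p, q in zip(raw, adjusted):
    print(f"raw={p:.3f}  adjusted={q:.3f}")
```

Because the adjusted value can never be smaller than the one before it in sorted order, several comparisons can end up sharing the same adjusted p-value, which may explain why values such as 0.055 and 0.105 repeat in the table.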