. 2022 Feb 21;2021:881–890.

Table 3:

Human judgement counts for sentence pairs from the test set, for all models and for the human reference sentences. S: the generated sentence was simpler; F: the original sentence was simpler; E: both sentences were of equal complexity; N: the annotator could not understand either sentence; U: the sentence was not changed by the model/human reference; SG: simplification gain as defined in Equation 2. Bold indicates the best model. Scores in SG are significant (p < 0.05).

Model	S	F	E	N	U	SG
Human	1 730	273	904	40	4 053	0.21
n-gram	1 452	1 004	1 732	110	2 702	0.06
GPT-1	1 404	747	1 736	117	2 996	0.09
GPT-2	1 372	1 077	1 661	118	2 772	0.04
NTS	587	855	1 022	98	4 438	-0.04
ClinicalNTS	1 483	1 597	404	93	3 423	-0.02
PhraseTable	2 425	2 759	269	98	1 449	-0.05
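
Equation 2 is not reproduced in this excerpt, so the exact definition of SG cannot be confirmed here. As a consistency check on the table, the sketch below assumes SG = (S - F) / (S + F + E + N + U), i.e. the net number of "simpler" judgements divided by all judgements for that system. The function name `simplification_gain` and the `COUNTS` dictionary are illustrative only. Under this assumption, every row sums to 7 000 judgements and the computed values match the reported SG column to two decimal places.

```python
# Counts copied from Table 3, in the order (S, F, E, N, U).
COUNTS = {
    "Human":       (1730, 273, 904, 40, 4053),
    "n-gram":      (1452, 1004, 1732, 110, 2702),
    "GPT-1":       (1404, 747, 1736, 117, 2996),
    "GPT-2":       (1372, 1077, 1661, 118, 2772),
    "NTS":         (587, 855, 1022, 98, 4438),
    "ClinicalNTS": (1483, 1597, 404, 93, 3423),
    "PhraseTable": (2425, 2759, 269, 98, 1449),
}


def simplification_gain(s, f, e, n, u):
    """Assumed form of SG (not taken from Equation 2 itself):
    (simpler - original simpler) / total judgements."""
    total = s + f + e + n + u
    return (s - f) / total


if __name__ == "__main__":
    for name, counts in COUNTS.items():
        # Print each system's total judgement count and its SG under the assumption.
        print(f"{name:12s} total={sum(counts)} SG={simplification_gain(*counts):+.2f}")
```

Read this way, a positive SG means annotators judged the generated sentence simpler more often than the original, and a negative SG (NTS, ClinicalNTS, PhraseTable) means the reverse.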