Table 4. Blinded guesses of the question writer (i.e., AI vs. human).
Assessor | AI (total = 50), correct guesses (n, %) | Human (total = 50), correct guesses (n, %) | Correlation | p |
---|---|---|---|---|
Assessor A | 24, 48% | 23, 46% | -0.14 to 0.26 | 0.55 |
Assessor B | 14, 28% | 41, 82% | -0.38 to 0.10 | 0.24 |
Assessor C | 33, 66% | 24, 48% | -0.35 to 0.06 | 0.16 |
Assessor D | 27, 53% | 26, 52% | -0.26 to 0.14 | 0.55 |
Assessor E | 26, 52% | 32, 64% | -0.36 to 0.04 | 0.11 |
GPT-2 Output Detector | 7, 14% | 45, 90% | -0.40 to 0.21 | 0.54 |