TABLE A4.
Detector | Comparison | AUROC | AUPRC | Brier Score |
---|---|---|---|---|
Originality.ai | GPT-3.5 v human-written | 1.000 | 1.000 | 0.013 |
Originality.ai | GPT-4 v human-written | 0.995 | 0.996 | 0.027 |
Originality.ai | Mixed human/GPT-3.5 v human-written | 0.782 | 0.775 | 0.400 |
Originality.ai | Mixed human/GPT-4 v human-written | 0.706 | 0.703 | 0.426 |
Originality.ai | Translated v human-written | 0.912 | 0.921 | 0.199 |
Sapling | GPT-3.5 v human-written | 0.991 | 0.990 | 0.024 |
Sapling | GPT-4 v human-written | 0.973 | 0.975 | 0.043 |
Sapling | Mixed human/GPT-3.5 v human-written | 0.617 | 0.671 | 0.338 |
Sapling | Mixed human/GPT-4 v human-written | 0.606 | 0.655 | 0.359 |
Sapling | Translated v human-written | 0.609 | 0.595 | 0.397 |
GPTZero | GPT-3.5 v human-written | 1.000 | 1.000 | 0.010 |
GPTZero | GPT-4 v human-written | 0.999 | 0.999 | 0.016 |
GPTZero | Mixed human/GPT-3.5 v human-written | 0.684 | 0.692 | 0.346 |
GPTZero | Mixed human/GPT-4 v human-written | 0.596 | 0.636 | 0.372 |
GPTZero | Translated v human-written | 0.585 | 0.586 | 0.385 |
Kashyap^a | GPT-3.5 v human-written | 0.970 | 0.970 | 0.477 |
Kashyap^a | GPT-4 v human-written | 0.880 | 0.826 | 0.498 |
Kashyap^a | Mixed human/GPT-3.5 v human-written | 0.670 | 0.701 | 0.499 |
Kashyap^a | Mixed human/GPT-4 v human-written | 0.640 | 0.702 | 0.499 |
NOTE. Each comparison reports performance metrics for distinguishing 100 human-written abstracts from 100 generated, mixed, or translated abstracts. The Kashyap detector, however, underwent only a preliminary assessment with 10 human-written versus 10 generated or mixed abstracts; its performance is listed, but because it performed worse than the other detectors, it was excluded from further analysis. Bold text indicates the model achieved the best performance for the specified comparison as measured by the listed metric (ie, highest AUROC or AUPRC, or lowest Brier Score).
Abbreviations: AI, artificial intelligence; AUPRC, area under the precision recall curve; AUROC, area under the receiver operating characteristic curve.
^a Detector was run on a preliminary cohort with 10 abstracts in each category, but further analysis was not performed because of lower accuracy compared with other detectors.
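For readers less familiar with the metrics in this table, the following minimal sketch shows one common way AUROC and the Brier score can be computed from binary labels and detector scores. It is an illustration only, not the authors' analysis code: the function names and the toy data (`labels`, `calibrated`, `saturated`) are hypothetical and are not values from the study.

```python
def auroc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of AUROC: the fraction of
    positive/negative pairs in which the positive gets the higher score,
    counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(labels, scores):
    """Mean squared difference between the predicted probability and the
    0/1 outcome; lower is better, 0 is perfect."""
    return sum((s - y) ** 2 for y, s in zip(labels, scores)) / len(labels)


# Hypothetical toy data: 1 = AI-generated, 0 = human-written.
labels = [1, 1, 1, 0, 0, 0]
# Scores that rank correctly and sit near the true 0/1 outcomes.
calibrated = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
# Scores that rank correctly but are all pushed toward 1.
saturated = [0.99, 0.98, 0.97, 0.96, 0.95, 0.94]

for name, scores in [("calibrated", calibrated), ("saturated", saturated)]:
    print(name, round(auroc(labels, scores), 3),
          round(brier_score(labels, scores), 3))
```

In this toy example both score sets rank every generated abstract above every human-written one, so both reach an AUROC of 1.0, yet the Brier score is roughly 0.05 for the first set and roughly 0.45 for the second. This is one way a detector can combine a high AUROC with a Brier score near 0.5: AUROC depends only on the ordering of the scores, whereas the Brier score also penalizes probabilities that sit far from the true outcome. Whether that explains any particular row of the table is not something the table itself establishes.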