Table 1.
Fleiss' kappa by model and prompt type, with 95% confidence intervals
Model | Prompt | Fleiss' kappa | 95% CI lower | 95% CI upper |
---|---|---|---|---|
gpt-4-Web | IO | 0.525 | 0.523 | 0.527 |
gpt-4-Web | 0-COT | 0.450 | 0.448 | 0.452 |
gpt-4-Web | P-COT | 0.334 | 0.332 | 0.337 |
gpt-4-Web | ROT | 0.467 | 0.465 | 0.470 |
gpt-4-API | IO | 0.288 | 0.286 | 0.290 |
gpt-4-API | 0-COT | 0.067 | 0.065 | 0.069 |
gpt-4-API | P-COT | 0.331 | 0.330 | 0.333 |
gpt-4-API | ROT | 0.205 | 0.203 | 0.206 |
gpt-4-API-0 | IO | 0.525 | 0.523 | 0.526 |
gpt-4-API-0 | 0-COT | 0.285 | 0.283 | 0.287 |
gpt-4-API-0 | P-COT | 0.660 | 0.658 | 0.661 |
gpt-4-API-0 | ROT | 0.451 | 0.449 | 0.453 |
Bard | IO | 0.374 | 0.372 | 0.376 |
Bard | 0-COT | 0.355 | 0.353 | 0.357 |
Bard | P-COT | 0.323 | 0.321 | 0.326 |
Bard | ROT | 0.180 | 0.178 | 0.182 |
gpt-3.5-Web | IO | 0.409 | 0.407 | 0.411 |
gpt-3.5-Web | 0-COT | −0.002 | −0.004 | 0.000 |
gpt-3.5-Web | P-COT | 0.276 | 0.274 | 0.278 |
gpt-3.5-Web | ROT | 0.016 | 0.014 | 0.018 |
gpt-3.5-API | IO | 0.188 | 0.186 | 0.190 |
gpt-3.5-API | 0-COT | 0.004 | 0.002 | 0.006 |
gpt-3.5-API | P-COT | 0.031 | 0.029 | 0.033 |
gpt-3.5-API | ROT | 0.014 | 0.012 | 0.016 |
gpt-3.5-API-0 | IO | 0.984 | 0.983 | 0.986 |
gpt-3.5-API-0 | 0-COT | 0.461 | 0.459 | 0.464 |
gpt-3.5-API-0 | P-COT | 0.533 | 0.531 | 0.535 |
gpt-3.5-API-0 | ROT | 0.581 | 0.578 | 0.583 |
gpt-3.5-ft | IO | 0.162 | 0.160 | 0.164 |
gpt-3.5-ft | 0-COT | 0.021 | 0.020 | 0.023 |
gpt-3.5-ft | P-COT | 0.065 | 0.063 | 0.067 |
gpt-3.5-ft | ROT | 0.033 | 0.032 | 0.035 |
gpt-3.5-ft-0 | IO | 0.982 | 0.980 | 0.984 |
gpt-3.5-ft-0 | 0-COT | 0.412 | 0.410 | 0.414 |
gpt-3.5-ft-0 | P-COT | 0.355 | 0.353 | 0.356 |
gpt-3.5-ft-0 | ROT | 0.398 | 0.396 | 0.400 |
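
For readers unfamiliar with the statistic in Table 1, the following is a minimal sketch of the standard Fleiss' kappa computation. It is not the authors' code. The function name `fleiss_kappa` and the input layout (one row per item, one column per answer category, each cell holding how many raters chose that category) are illustrative assumptions. Treating each repeated model response as a "rater" is also an assumption about how the statistic might be applied here. The confidence-interval calculation is not shown.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of rating counts (hypothetical helper).

    counts[i][j] is the number of raters that assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("at least one item is required")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two ratings per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Observed agreement per item: fraction of rater pairs that agree.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of ratings in each category.
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_cat = [t / (n_items * n_raters) for t in totals]
    p_e = sum(p * p for p in p_cat)

    if p_e == 1.0:
        # All ratings fall in a single category; kappa is undefined.
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)
```

A value near 1 indicates that the ratings for each item are almost always identical, a value near 0 indicates agreement no better than chance, and slightly negative values (as in the gpt-3.5-Web 0-COT row) indicate agreement marginally below chance.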