Table 4.
Runtime and accuracy of model predictions.
| Run | Runtime (seconds) | Exact match accuracy (95% CI) | Loose match accuracy | Q1^a accuracy | Q2^b accuracy | Q3^c accuracy | Q4^d accuracy |
|---|---|---|---|---|---|---|---|
| Llama 3.3-70B | | | | | | | |
| Run 1 | 1331.919 | 0.275 (0.227-0.297) | 0.755 | 0.16 | 0.86 | 0.08 | 0 |
| Run 2 | 1318.557 | 0.25 (0.227-0.297) | 0.77 | 0.22 | 0.76 | 0.02 | 0 |
| Run 3 | 1326.938 | 0.255 (0.227-0.297) | 0.765 | 0.26 | 0.74 | 0.02 | 0 |
| Mistral-7B | | | | | | | |
| Run 1 | 1245.114 | 0.295 (0.284-0.358) | 0.785 | 0.46 | 0.52 | 0.2 | 0 |
| Run 2 | 1249.270 | *0.35* (0.284-0.358)^e | 0.775 | 0.52 | 0.64 | 0.24 | 0 |
| Run 3 | 1244.751 | 0.315 (0.284-0.358) | 0.76 | 0.38 | 0.64 | 0.24 | 0 |
| Gemma 2-9B | | | | | | | |
| Run 1 | 1250.046 | 0.255 (0.257-0.329) | 0.77 | 0.06 | 0.26 | 0.7 | 0 |
| Run 2 | 1439.940 | 0.315 (0.257-0.329) | 0.82 | 0 | 0.46 | 0.76 | 0.04 |
| Run 3 | 1229.739 | 0.305 (0.257-0.329) | 0.795 | 0 | 0.42 | 0.8 | 0 |
| DeepSeek r1–distill Qwen-14B | | | | | | | |
| Run 1 | 1317.195 | 0.28 (0.252-0.324) | 0.81 | 0 | 0.06 | 0.82 | 0.24 |
| Run 2 | 1309.082 | 0.27 (0.252-0.324) | 0.815 | 0 | 0.14 | 0.9 | 0.04 |
| Run 3 | 1257.635 | 0.31 (0.252-0.324) | 0.8 | 0 | 0.22 | 0.94 | 0.08 |
| Qwen 2.5-7B | | | | | | | |
| Run 1 | 7211.855 | 0.296 (0.270-0.343) | 0.835 | 0 | 0.48 | 0.68 | 0.02 |
| Run 2 | 7302.680 | 0.315 (0.270-0.343) | 0.84 | 0 | 0.44 | 0.74 | 0.08 |
| Run 3 | 7231.687 | 0.305 (0.270-0.343) | 0.825 | 0 | 0.56 | 0.64 | 0.02 |

^a Q1: quartile 1.
^b Q2: quartile 2.
^c Q3: quartile 3.
^d Q4: quartile 4.
^e Italicization indicates the highest accuracy.
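
The relationship between the per-quartile columns and the exact match column in Table 4 can be checked with a short script. The sketch below is illustrative only: it assumes, without confirmation from the table itself, that the four quartiles contain equal numbers of items, so that overall exact match accuracy would equal the unweighted mean of the Q1 to Q4 accuracies. The names `TABLE_4`, `mean_quartile_accuracy`, and `find_mismatches` are hypothetical helpers introduced for this check.

```python
# Values copied from Table 4: (Q1, Q2, Q3, Q4) accuracies and the
# reported exact match accuracy, keyed by (model, run number).
TABLE_4 = {
    ("Llama 3.3-70B", 1): ((0.16, 0.86, 0.08, 0.0), 0.275),
    ("Llama 3.3-70B", 2): ((0.22, 0.76, 0.02, 0.0), 0.25),
    ("Llama 3.3-70B", 3): ((0.26, 0.74, 0.02, 0.0), 0.255),
    ("Mistral-7B", 1): ((0.46, 0.52, 0.2, 0.0), 0.295),
    ("Mistral-7B", 2): ((0.52, 0.64, 0.24, 0.0), 0.35),
    ("Mistral-7B", 3): ((0.38, 0.64, 0.24, 0.0), 0.315),
    ("Gemma 2-9B", 1): ((0.06, 0.26, 0.7, 0.0), 0.255),
    ("Gemma 2-9B", 2): ((0.0, 0.46, 0.76, 0.04), 0.315),
    ("Gemma 2-9B", 3): ((0.0, 0.42, 0.8, 0.0), 0.305),
    ("DeepSeek r1-distill Qwen-14B", 1): ((0.0, 0.06, 0.82, 0.24), 0.28),
    ("DeepSeek r1-distill Qwen-14B", 2): ((0.0, 0.14, 0.9, 0.04), 0.27),
    ("DeepSeek r1-distill Qwen-14B", 3): ((0.0, 0.22, 0.94, 0.08), 0.31),
    ("Qwen 2.5-7B", 1): ((0.0, 0.48, 0.68, 0.02), 0.296),
    ("Qwen 2.5-7B", 2): ((0.0, 0.44, 0.74, 0.08), 0.315),
    ("Qwen 2.5-7B", 3): ((0.0, 0.56, 0.64, 0.02), 0.305),
}


def mean_quartile_accuracy(quartiles):
    """Unweighted mean of the quartile accuracies.

    ASSUMPTION: every quartile holds the same number of items.
    """
    return sum(quartiles) / len(quartiles)


def find_mismatches(table, tol=5e-4):
    """Return the (model, run) keys whose reported exact match accuracy
    differs from the quartile mean by more than `tol`."""
    return [
        key
        for key, (quartiles, reported) in table.items()
        if abs(mean_quartile_accuracy(quartiles) - reported) > tol
    ]


if __name__ == "__main__":
    for model, run in find_mismatches(TABLE_4):
        quartiles, reported = TABLE_4[(model, run)]
        print(model, f"Run {run}", mean_quartile_accuracy(quartiles), reported)
```

Under the equal-size assumption, 14 of the 15 runs reproduce the reported exact match accuracy. The only run flagged is Qwen 2.5-7B Run 1, where the quartile mean is 0.295 and the reported value is 0.296; this may reflect rounding or unequal quartile sizes, which the table alone cannot settle.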