2026 Feb 11;5:e84322. doi: 10.2196/84322

Table 4.

Runtime and accuracy of model predictions.

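| Model | Run | Runtime (seconds) | Exact match accuracy (95% CI) | Loose match accuracy | Q1 accuracy [a] | Q2 accuracy [b] | Q3 accuracy [c] | Q4 accuracy [d] |
|---|---|---|---|---|---|---|---|---|
| Llama 3.3-70B | Run 1 | 1331.919 | 0.275 (0.227-0.297) | 0.755 | 0.16 | 0.86 | 0.08 | 0 |
| Llama 3.3-70B | Run 2 | 1318.557 | 0.25 (0.227-0.297) | 0.77 | 0.22 | 0.76 | 0.02 | 0 |
| Llama 3.3-70B | Run 3 | 1326.938 | 0.255 (0.227-0.297) | 0.765 | 0.26 | 0.74 | 0.02 | 0 |
| Mistral-7B | Run 1 | 1245.114 | 0.295 (0.284-0.358) | 0.785 | 0.46 | 0.52 | 0.2 | 0 |
| Mistral-7B | Run 2 | 1249.270 | *0.35* (0.284-0.358) [e] | 0.775 | 0.52 | 0.64 | 0.24 | 0 |
| Mistral-7B | Run 3 | 1244.751 | 0.315 (0.284-0.358) | 0.76 | 0.38 | 0.64 | 0.24 | 0 |
| Gemma 2-9B | Run 1 | 1250.046 | 0.255 (0.257-0.329) | 0.77 | 0.06 | 0.26 | 0.7 | 0 |
| Gemma 2-9B | Run 2 | 1439.940 | 0.315 (0.257-0.329) | 0.82 | 0 | 0.46 | 0.76 | 0.04 |
| Gemma 2-9B | Run 3 | 1229.739 | 0.305 (0.257-0.329) | 0.795 | 0 | 0.42 | 0.8 | 0 |
| DeepSeek r1–distill Qwen-14B | Run 1 | 1317.195 | 0.28 (0.252-0.324) | 0.81 | 0 | 0.06 | 0.82 | 0.24 |
| DeepSeek r1–distill Qwen-14B | Run 2 | 1309.082 | 0.27 (0.252-0.324) | 0.815 | 0 | 0.14 | 0.9 | 0.04 |
| DeepSeek r1–distill Qwen-14B | Run 3 | 1257.635 | 0.31 (0.252-0.324) | 0.8 | 0 | 0.22 | 0.94 | 0.08 |
| Qwen 2.5-7B | Run 1 | 7211.855 | 0.296 (0.270-0.343) | 0.835 | 0 | 0.48 | 0.68 | 0.02 |
| Qwen 2.5-7B | Run 2 | 7302.680 | 0.315 (0.270-0.343) | 0.84 | 0 | 0.44 | 0.74 | 0.08 |
| Qwen 2.5-7B | Run 3 | 7231.687 | 0.305 (0.270-0.343) | 0.825 | 0 | 0.56 | 0.64 | 0.02 |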

[a] Q1: quartile 1.

[b] Q2: quartile 2.

[c] Q3: quartile 3.

[d] Q4: quartile 4.

[e] Italicization indicates the highest accuracy.
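
For readers who want to compare models at a glance, the exact match accuracies in Table 4 can be summarized with a few lines of Python. This is only an illustrative sketch: the per-run values are copied from the table, but averaging the three runs per model is an assumption made here for convenience and is not described as the authors' method. The names `exact_match`, `summarize`, and `best_run` are hypothetical.

```python
from statistics import mean

# Exact match accuracy for runs 1-3, copied from Table 4.
exact_match = {
    "Llama 3.3-70B": [0.275, 0.25, 0.255],
    "Mistral-7B": [0.295, 0.35, 0.315],
    "Gemma 2-9B": [0.255, 0.315, 0.305],
    "DeepSeek r1-distill Qwen-14B": [0.28, 0.27, 0.31],
    "Qwen 2.5-7B": [0.296, 0.315, 0.305],
}


def summarize(runs_by_model):
    """Return {model: mean accuracy across its runs} (an assumed summary)."""
    return {model: mean(runs) for model, runs in runs_by_model.items()}


def best_run(runs_by_model):
    """Return (model, run_number, accuracy) for the single highest-scoring run."""
    candidates = (
        (model, index + 1, accuracy)
        for model, runs in runs_by_model.items()
        for index, accuracy in enumerate(runs)
    )
    return max(candidates, key=lambda item: item[2])


mean_accuracy = summarize(exact_match)
top_model, top_run, top_accuracy = best_run(exact_match)

for model, value in mean_accuracy.items():
    print(f"{model}: mean exact match accuracy = {value:.3f}")
print(f"Highest single run: {top_model}, run {top_run} ({top_accuracy})")
```

The highest single run found this way matches the italicized entry in the table (Mistral-7B, run 2). The same pattern could be applied to the loose match or per-quartile columns by swapping in those values.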