[Preprint]. 2025 Feb 12:arXiv:2502.09659v1. [Version 1]

Table 3. Comparative Results on VAC dataset.

Direct comparison of the GPT-4o, Llama-3.2-1B, and Llama-3.2-3B models, highlighting differences in Precision (P), Recall (R), and F1-score with and without interventions.

                                                 Automated Validation      Manual Validation
Models                                    Shots   P (%)   R (%)  F1 (%)     P (%)   R (%)  F1 (%)
GPT-4o Without Substances                 0      100.00   34.05   50.80    100.00   48.28   65.08
                                          1      100.00   38.30   55.39    100.00   54.40   70.47
                                          2       99.80   40.03   57.14     99.86   56.14   71.87
                                          3      100.00   45.48   62.52    100.00   60.44   75.34
                                          4      100.00   46.69   63.66    100.00   60.94   75.73
GPT-4o With Substances                    0      100.00   33.07   49.69    100.00   50.84   67.40
                                          1      100.00   38.62   55.72    100.00   56.53   72.22
                                          2      100.00   39.73   56.87    100.00   57.26   72.82
                                          3      100.00   46.27   63.27    100.00   61.75   76.35
                                          4      100.00   47.05   63.99    100.00   63.03   77.32
Llama-3.2-3B-Instruct Without Substances  0       98.00   24.21   38.58     99.01   39.09   56.05
                                          1      100.00   32.81   49.41    100.00   47.80   64.68
                                          2      100.00   33.37   50.04    100.00   50.00   66.67
                                          3       99.52   35.75   52.60     99.67   51.98   68.33
                                          4      100.00   34.58   51.39    100.00   49.82   66.50
Llama-3.2-3B-Instruct With Substances     0       99.16   25.05   40.00     99.09   46.12   62.94
                                          1      100.00   33.75   50.47    100.00   56.64   72.32
                                          2      100.00   35.59   52.83    100.00   57.43   72.96
                                          3       99.50   39.54   56.59     99.68   61.19   75.83
                                          4      100.00   35.35   55.44    100.00   60.32   75.25
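
The F1 columns above can be related to the P and R columns with a short calculation. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; this excerpt does not state the formula the authors used, and the function name `f1_score` is illustrative, not taken from the paper.

```python
def f1_score(precision_pct: float, recall_pct: float) -> float:
    """Return F1 in percent, assuming the standard harmonic-mean definition.

    Both inputs are percentages, as in the table (e.g. 100.00 and 34.05).
    """
    total = precision_pct + recall_pct
    if total == 0:
        # F1 is conventionally taken as 0 when both precision and recall are 0.
        return 0.0
    return 2 * precision_pct * recall_pct / total


# Two rows from the table: GPT-4o Without Substances, automated validation.
zero_shot = f1_score(100.00, 34.05)  # 0-shot row
two_shot = f1_score(99.80, 40.03)    # 2-shot row
print(f"{zero_shot:.2f} {two_shot:.2f}")  # prints "50.80 57.14"
```

Both values match the reported F1 for those rows. Because the published P and R are themselves rounded to two decimals, a recomputed F1 can differ slightly from a reported one in the last digits.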