Table 3. Comparative Results on the VAC dataset.
Direct comparison of the GPT-4o, Llama-3.2-1B, and Llama-3.2-3B models, highlighting the differences in Precision (P), Recall (R), and F1-score with and without interventions.
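The F1 values reported here are consistent with the standard harmonic mean of precision and recall. The snippet below is a minimal sketch under that assumption; the function name `f1_score` is illustrative and does not come from the original work. It can be used to sanity-check individual rows.

```python
def f1_score(precision_pct: float, recall_pct: float) -> float:
    """Return the harmonic mean of precision and recall (all values in percent).

    Assumes the standard F1 definition: F1 = 2 * P * R / (P + R).
    """
    if precision_pct + recall_pct == 0:
        # Avoid division by zero when both metrics are zero.
        return 0.0
    return 2 * precision_pct * recall_pct / (precision_pct + recall_pct)


# GPT-4o without substances, 0 shots, automated validation: P = 100.00, R = 34.05
print(round(f1_score(100.00, 34.05), 2))  # 50.8
```

Because precision is at or near 100% in every row, F1 is driven almost entirely by recall, which is why the recall gains from additional shots carry over directly into the F1 column.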
| Models | Shots | Automated P (%) | Automated R (%) | Automated F1 (%) | Manual P (%) | Manual R (%) | Manual F1 (%) |
|---|---|---|---|---|---|---|---|
| GPT-4o Without Substances | 0 | 100.00 | 34.05 | 50.80 | 100.00 | 48.28 | 65.08 |
| | 1 | 100.00 | 38.30 | 55.39 | 100.00 | 54.40 | 70.47 |
| | 2 | 99.80 | 40.03 | 57.14 | 99.86 | 56.14 | 71.87 |
| | 3 | 100.00 | 45.48 | 62.52 | 100.00 | 60.44 | 75.34 |
| | 4 | 100.00 | 46.69 | 63.66 | 100.00 | 60.94 | 75.73 |
| GPT-4o With Substances | 0 | 100.00 | 33.07 | 49.69 | 100.00 | 50.84 | 67.40 |
| | 1 | 100.00 | 38.62 | 55.72 | 100.00 | 56.53 | 72.22 |
| | 2 | 100.00 | 39.73 | 56.87 | 100.00 | 57.26 | 72.82 |
| | 3 | 100.00 | 46.27 | 63.27 | 100.00 | 61.75 | 76.35 |
| | 4 | 100.00 | 47.05 | 63.99 | 100.00 | 63.03 | 77.32 |
| Llama-3.2-3B-Instruct Without Substances | 0 | 98.00 | 24.21 | 38.58 | 99.01 | 39.09 | 56.05 |
| | 1 | 100.00 | 32.81 | 49.41 | 100.00 | 47.80 | 64.68 |
| | 2 | 100.00 | 33.37 | 50.04 | 100.00 | 50.00 | 66.67 |
| | 3 | 99.52 | 35.75 | 52.60 | 99.67 | 51.98 | 68.33 |
| | 4 | 100.00 | 34.58 | 51.39 | 100.00 | 49.82 | 66.50 |
| Llama-3.2-3B-Instruct With Substances | 0 | 99.16 | 25.05 | 40.00 | 99.09 | 46.12 | 62.94 |
| | 1 | 100.00 | 33.75 | 50.47 | 100.00 | 56.64 | 72.32 |
| | 2 | 100.00 | 35.59 | 52.83 | 100.00 | 57.43 | 72.96 |
| | 3 | 99.50 | 39.54 | 56.59 | 99.68 | 61.19 | 75.83 |
| | 4 | 100.00 | 35.35 | 55.44 | 100.00 | 60.32 | 75.25 |