Table 4. Comparative results on the AdjuvareDB dataset, annotated by the AdjuvareDB team.
Direct comparison of the GPT-4o, Llama-3.2-1B, and Llama-3.2-3B models, showing how precision (P), recall (R), and F1-score differ with and without interventions. Includes statistical significance of observed improvements.
| Automated Validation | | | | |
| Models | Shots | P (%) | R (%) | F1 (%) |
| GPT-4o (Without Interventions) | 0 | 97.73 | 50.58 | 66.65 |
| | 1 | 100.00 | 60.75 | 75.58 |
| | 2 | 100.00 | 63.90 | 77.98 |
| | 3 | 100.00 | 62.63 | 77.02 |
| | 4 | 100.00 | 63.48 | 77.66 |
| GPT-4o (With Interventions) | 0 | 100.00 | 62.31 | 76.78 |
| | 1 | 100.00 | 64.62 | 78.51 |
| | 2 | 100.00 | 67.50 | 80.59 |
| | 3 | 100.00 | 69.02 | 81.67 |
| | 4 | 100.00 | 66.67 | 80.00 |
| Llama-3.2-3B-Instruct (Without Interventions) | 0 | 100.00 | 25.97 | 41.11 |
| | 1 | 100.00 | 27.09 | 42.63 |
| | 2 | 99.19 | 29.57 | 45.55 |
| | 3 | 100.00 | 37.46 | 54.50 |
| | 4 | 100.00 | 34.63 | 51.44 |
| Llama-3.2-3B-Instruct (With Interventions) | 0 | 97.66 | 37.32 | 54.00 |
| | 1 | 98.55 | 46.73 | 63.39 |
| | 2 | 99.54 | 47.41 | 64.22 |
| | 3 | 100.00 | 48.48 | 65.31 |
| | 4 | 100.00 | 48.84 | 65.62 |
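The F1 column can be related to the P and R columns through the standard definition of F1 as the harmonic mean of precision and recall. The sketch below is illustrative only: the helper name `f1_score` and the choice of sample rows are assumptions made here, and the source does not state how its F1 values were aggregated, so this only checks that a few rows are consistent with the standard formula to within rounding.

```python
def f1_score(precision_pct: float, recall_pct: float) -> float:
    """Harmonic mean of precision and recall, both given in percent."""
    total = precision_pct + recall_pct
    if total == 0:
        return 0.0  # convention: F1 is 0 when both P and R are 0
    return 2 * precision_pct * recall_pct / total


# (label, P %, R %, reported F1 %) copied from a few rows of Table 4.
sample_rows = [
    ("GPT-4o, without interventions, 1-shot", 100.00, 60.75, 75.58),
    ("GPT-4o, with interventions, 3-shot", 100.00, 69.02, 81.67),
    ("Llama-3.2-3B-Instruct, without interventions, 1-shot", 100.00, 27.09, 42.63),
    ("Llama-3.2-3B-Instruct, with interventions, 4-shot", 100.00, 48.84, 65.62),
]

for label, p, r, reported in sample_rows:
    computed = f1_score(p, r)
    # Print the recomputed value next to the reported one for comparison.
    print(f"{label}: computed F1 = {computed:.2f}, reported F1 = {reported:.2f}")
```

Because precision is at or near 100% in almost every row, F1 in this table is driven almost entirely by recall, which is where the with-intervention and without-intervention settings differ.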