[Preprint]. 2025 Feb 12:arXiv:2502.09659v1. [Version 1]

Table 4. Comparative Results on AdjuvareDB dataset annotated by AdjuvareDB team.

Direct comparison of the GPT-4o, Llama-3.2-1B, and Llama-3.2-3B models, showing precision (P), recall (R), and F1-score with and without interventions, together with the statistical significance of the observed improvements.

Automated Validation
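Model                                          Shots   P (%)   R (%)  F1 (%)
GPT-4o (Without Interventions)                     0   97.73   50.58   66.65
GPT-4o (Without Interventions)                     1  100.00   60.75   75.58
GPT-4o (Without Interventions)                     2  100.00   63.90   77.98
GPT-4o (Without Interventions)                     3  100.00   62.63   77.02
GPT-4o (Without Interventions)                     4  100.00   63.48   77.66
GPT-4o (With Interventions)                        0  100.00   62.31   76.78
GPT-4o (With Interventions)                        1  100.00   64.62   78.51
GPT-4o (With Interventions)                        2  100.00   67.50   80.59
GPT-4o (With Interventions)                        3  100.00   69.02   81.67
GPT-4o (With Interventions)                        4  100.00   66.67   80.00
Llama-3.2-3B-Instruct (Without Interventions)      0  100.00   25.97   41.11
Llama-3.2-3B-Instruct (Without Interventions)      1  100.00   27.09   42.63
Llama-3.2-3B-Instruct (Without Interventions)      2   99.19   29.57   45.55
Llama-3.2-3B-Instruct (Without Interventions)      3  100.00   37.46   54.50
Llama-3.2-3B-Instruct (Without Interventions)      4  100.00   34.63   51.44
Llama-3.2-3B-Instruct (With Interventions)         0   97.66   37.32   54.00
Llama-3.2-3B-Instruct (With Interventions)         1   98.55   46.73   63.39
Llama-3.2-3B-Instruct (With Interventions)         2   99.54   47.41   64.22
Llama-3.2-3B-Instruct (With Interventions)         3  100.00   48.48   65.31
Llama-3.2-3B-Instruct (With Interventions)         4  100.00   48.84   65.62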
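
For readers who want to work with these numbers, the following is a minimal Python sketch, not the authors' evaluation code. It assumes the F1 column is the usual harmonic mean of P and R, and it computes the per-shot F1 difference between the "With Interventions" and "Without Interventions" rows. The names `f1_score`, `f1_without`, `f1_with`, and `gains` are illustrative and do not come from the paper. Recomputed F1 values may differ slightly from the reported ones, since the table entries are rounded and may have been aggregated differently.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; output has the same units as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# F1 (%) for 0 to 4 shots, copied from the table above.
f1_without = {
    "GPT-4o": [66.65, 75.58, 77.98, 77.02, 77.66],
    "Llama-3.2-3B-Instruct": [41.11, 42.63, 45.55, 54.50, 51.44],
}
f1_with = {
    "GPT-4o": [76.78, 78.51, 80.59, 81.67, 80.00],
    "Llama-3.2-3B-Instruct": [54.00, 63.39, 64.22, 65.31, 65.62],
}

# Absolute F1 difference (percentage points), with minus without interventions,
# for each shot count.
gains = {
    model: [w - wo for w, wo in zip(f1_with[model], f1_without[model])]
    for model in f1_without
}

if __name__ == "__main__":
    # Spot check: recompute F1 for GPT-4o with interventions at 3 shots
    # from its P and R values in the table.
    print("recomputed F1:", round(f1_score(100.00, 69.02), 2))
    for model, deltas in gains.items():
        print(model, [round(d, 2) for d in deltas])
```

The differences are plain subtractions of the reported values. They say nothing on their own about statistical significance, which requires the per-example results behind the table.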