[Preprint]. 2025 Feb 12:arXiv:2502.09659v1. [Version 1]

Table 4. Comparative results on the AdjuvareDB dataset, annotated by the AdjuvareDB team.

Direct comparison of the GPT-4o, Llama-3.2-1B, and Llama-3.2-3B models, reporting precision (P), recall (R), and F1-score with and without interventions at each shot count. Includes statistical significance of observed improvements.
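As a reading aid, the F1 column can be related to the P and R columns through the standard harmonic-mean definition of F1. The sketch below is illustrative only and is not the authors' evaluation code; the function name `f1_score` is hypothetical. Because P and R are reported rounded to two decimals, recomputed values may differ slightly from the tabulated F1.

```python
def f1_score(precision_pct: float, recall_pct: float) -> float:
    """Harmonic mean of precision and recall, both given in percent.

    Hypothetical helper for illustration; not taken from the paper.
    """
    total = precision_pct + recall_pct
    if total == 0:
        # F1 is conventionally defined as 0 when both P and R are 0.
        return 0.0
    return 2 * precision_pct * recall_pct / total


# GPT-4o without interventions, 1 shot (P and R taken from the table).
print(round(f1_score(100.00, 60.75), 2))  # 75.58
```

With precision at or near 100% in almost every row, F1 in this table is driven almost entirely by recall.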

		Automated Validation
Models	Shots	P (%)	R (%)	F1 (%)
GPT-4o (Without Interventions)	0	97.73	50.58	66.65
	1	100.00	60.75	75.58
	2	100.00	63.90	77.98
	3	100.00	62.63	77.02
	4	100.00	63.48	77.66
GPT-4o (With Interventions)	0	100.00	62.31	76.78
	1	100.00	64.62	78.51
	2	100.00	67.50	80.59
	3	100.00	69.02	81.67
	4	100.00	66.67	80.00
Llama-3.2-3B-Instruct (Without Interventions)	0	100.00	25.97	41.11
	1	100.00	27.09	42.63
	2	99.19	29.57	45.55
	3	100.00	37.46	54.50
	4	100.00	34.63	51.44
Llama-3.2-3B-Instruct (With Interventions)	0	97.66	37.32	54.00
	1	98.55	46.73	63.39
	2	99.54	47.41	64.22
	3	100.00	48.48	65.31
	4	100.00	48.84	65.62