[Preprint]. 2025 Feb 12:arXiv:2502.09659v1. [Version 1]

Table 3. Comparative Results on VAC dataset.

Direct comparison of the GPT-4o, Llama-3.2-1B, and Llama-3.2-3B models, highlighting the differences in performance metrics (Precision, Recall, and F1-score) with and without interventions.
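
As a reading aid for the three metrics, F1 is conventionally the harmonic mean of precision and recall. The sketch below assumes that standard definition applied to a single precision/recall pair expressed in percent; the function name `f1_score` is illustrative and is not taken from the paper. Reported F1 values can differ slightly from this calculation if scores were averaged over samples or runs, which this section does not specify.

```python
def f1_score(precision_pct: float, recall_pct: float) -> float:
    """Harmonic mean of precision and recall, all values in percent.

    Assumes the standard F1 definition; returns 0.0 when both inputs are 0
    to avoid dividing by zero.
    """
    total = precision_pct + recall_pct
    if total == 0:
        return 0.0
    return 2 * precision_pct * recall_pct / total


# GPT-4o Without Substances, 0 shots, automated validation: P = 100.00, R = 34.05
print(f"{f1_score(100.00, 34.05):.2f}")  # prints 50.80
```

With precision at or near 100% in almost every row, F1 is driven almost entirely by recall, which is why the F1 columns rise and fall with the R columns.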

		Automated Validation			Manual Validation
Models	Shots	P (%)	R (%)	F1 (%)	P (%)	R (%)	F1 (%)
GPT-4o Without Substances	0	100.00	34.05	50.80	100.00	48.28	65.08
	1	100.00	38.30	55.39	100.00	54.40	70.47
	2	99.80	40.03	57.14	99.86	56.14	71.87
	3	100.00	45.48	62.52	100.00	60.44	75.34
	4	100.00	46.69	63.66	100.00	60.94	75.73
GPT-4o With Substances	0	100.00	33.07	49.69	100.00	50.84	67.40
	1	100.00	38.62	55.72	100.00	56.53	72.22
	2	100.00	39.73	56.87	100.00	57.26	72.82
	3	100.00	46.27	63.27	100.00	61.75	76.35
	4	100.00	47.05	63.99	100.00	63.03	77.32
Llama-3.2-3B-Instruct Without Substances	0	98.00	24.21	38.58	99.01	39.09	56.05
	1	100.00	32.81	49.41	100.00	47.80	64.68
	2	100.00	33.37	50.04	100.00	50.00	66.67
	3	99.52	35.75	52.60	99.67	51.98	68.33
	4	100.00	34.58	51.39	100.00	49.82	66.50
Llama-3.2-3B-Instruct With Substances	0	99.16	25.05	40.00	99.09	46.12	62.94
	1	100.00	33.75	50.47	100.00	56.64	72.32
	2	100.00	35.59	52.83	100.00	57.43	72.96
	3	99.50	39.54	56.59	99.68	61.19	75.83
	4	100.00	35.35	55.44	100.00	60.32	75.25