Author manuscript; available in PMC: 2025 Sep 30.
Published in final edited form as: Proc Mach Learn Res. 2025 Jun;287:527–542.

Table 3:

Performance of models under different prompting strategies and data integration methods on traditional NLP metrics. ↑ indicates that higher is better; ↓ indicates that lower is better.

Model BLEU-1 Avg ↑ BLEU-4 Avg ↑ ROUGE-L Avg ↑ Hallucination (%) ↓

qwen2:7b_FS_UGP 0.011 0.000 0.008 92.860
qwen2:7b_ZS_UGP 0.017 0.000 0.011 92.860
qwen2:7b_ZS_UCP 0.280 0.016 0.252 64.740
qwen2:7b_ZS_FCSP 0.281 0.016 0.252 64.680
qwen2:7b_FS_UCP 0.324 0.012 0.299 57.510
qwen2:7b_FS_FCSP 0.326 0.012 0.299 57.320

qwen2.5:7b_ZS_UGP 0.018 0.000 0.012 92.860
qwen2.5:7b_FS_UGP 0.047 0.004 0.038 88.240
qwen2.5:7b_ZS_UCP 0.420 0.025 0.392 44.110
qwen2.5:7b_ZS_FCSP 0.421 0.026 0.391 44.210
qwen2.5:7b_FS_UCP 0.462 0.024 0.427 38.550
qwen2.5:7b_FS_FCSP 0.463 0.025 0.426 38.550

qwen2.5_32b_FS_UGP 0.002 0.000 0.001 92.860
qwen2.5_32b_ZS_UGP 0.018 0.000 0.012 92.860
qwen2.5_32b_ZS_UCP 0.371 0.030 0.343 55.410
qwen2.5_32b_ZSCOT_UCP 0.371 0.031 0.343 55.570
qwen2.5_32b_ZS_FCSP 0.376 0.031 0.346 54.960
qwen2.5_32b_ZSCOT_FCSP 0.376 0.031 0.346 54.960
qwen2.5_32b_FS_UCP 0.383 0.033 0.348 54.330
qwen2.5_32b_FS_FCSP 0.386 0.033 0.350 53.870

llama3:8b_FS_UGP 0.010 0.000 0.007 92.860
llama3:8b_ZS_UGP 0.018 0.000 0.012 92.860
llama3:8b_ZS_UCP 0.338 0.020 0.303 61.160
llama3:8b_ZS_FCSP 0.339 0.021 0.303 61.110
llama3:8b_FS_UCP 0.365 0.024 0.329 53.230
llama3:8b_FS_FCSP 0.367 0.024 0.331 52.970

gpt4o_ZS_UGP 0.003 0.000 0.001 92.860
gpt4o_FS_UGP 0.004 0.000 0.001 92.860
gpt4o_ZS_FCSP 0.091 0.007 0.077 85.160
gpt4o_FS_FCSP 0.092 0.007 0.078 85.160
gpt4o_ZS_UCP 0.093 0.007 0.079 84.880
gpt4o_FS_UCP 0.093 0.007 0.080 84.880
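
For readers unfamiliar with the metrics in Table 3, the following is a minimal sketch of how a per-example BLEU-1 and ROUGE-L score can be computed from their standard definitions. It is not the scoring code used for the table: the whitespace tokenization, the single-reference setup, the use of the ROUGE-L F1 variant, and the function names `bleu1`, `lcs_length`, and `rouge_l_f1` are all assumptions made for illustration.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """BLEU-1 for one reference: clipped unigram precision times brevity penalty.

    Assumption: tokens are obtained by whitespace splitting.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Each candidate token is credited at most as often as it occurs in the reference.
    clipped = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    precision = clipped / len(cand)
    # The brevity penalty discounts candidates that are not longer than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall (assumed variant)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    hypothesis = "the cat sat on the mat"
    target = "the cat is on the mat"
    print(f"BLEU-1:  {bleu1(hypothesis, target):.3f}")
    print(f"ROUGE-L: {rouge_l_f1(hypothesis, target):.3f}")
```

Under these assumptions, averaging such per-example scores over a test set would give figures of the same kind as the "Avg" columns, though the exact values depend on tokenization and smoothing choices not specified here.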