Table 3:
Performance of models under different prompting strategies and data integration methods on traditional NLP metrics. ↑ indicates that higher is better; ↓ indicates that lower is better.
| Model | BLEU-1 Avg ↑ | BLEU-4 Avg ↑ | ROUGE-L Avg ↑ | Hallucination (%) ↓ |
|---|---|---|---|---|
| qwen2:7b_FS_UGP | 0.011 | 0.000 | 0.008 | 92.860 |
| qwen2:7b_ZS_UGP | 0.017 | 0.000 | 0.011 | 92.860 |
| qwen2:7b_ZS_UCP | 0.280 | 0.016 | 0.252 | 64.740 |
| qwen2:7b_ZS_FCSP | 0.281 | 0.016 | 0.252 | 64.680 |
| qwen2:7b_FS_UCP | 0.324 | 0.012 | 0.299 | 57.510 |
| qwen2:7b_FS_FCSP | 0.326 | 0.012 | 0.299 | 57.320 |
| qwen2.5:7b_ZS_UGP | 0.018 | 0.000 | 0.012 | 92.860 |
| qwen2.5:7b_FS_UGP | 0.047 | 0.004 | 0.038 | 88.240 |
| qwen2.5:7b_ZS_UCP | 0.420 | 0.025 | 0.392 | 44.110 |
| qwen2.5:7b_ZS_FCSP | 0.421 | 0.026 | 0.391 | 44.210 |
| qwen2.5:7b_FS_UCP | 0.462 | 0.024 | 0.427 | 38.550 |
| qwen2.5:7b_FS_FCSP | 0.463 | 0.025 | 0.426 | 38.550 |
| qwen2.5_32b_FS_UGP | 0.002 | 0.000 | 0.001 | 92.860 |
| qwen2.5_32b_ZS_UGP | 0.018 | 0.000 | 0.012 | 92.860 |
| qwen2.5_32b_ZS_UCP | 0.371 | 0.030 | 0.343 | 55.410 |
| qwen2.5_32b_ZSCOT_UCP | 0.371 | 0.031 | 0.343 | 55.570 |
| qwen2.5_32b_ZS_FCSP | 0.376 | 0.031 | 0.346 | 54.960 |
| qwen2.5_32b_ZSCOT_FCSP | 0.376 | 0.031 | 0.346 | 54.960 |
| qwen2.5_32b_FS_UCP | 0.383 | 0.033 | 0.348 | 54.330 |
| qwen2.5_32b_FS_FCSP | 0.386 | 0.033 | 0.350 | 53.870 |
| llama3:8b_FS_UGP | 0.010 | 0.000 | 0.007 | 92.860 |
| llama3:8b_ZS_UGP | 0.018 | 0.000 | 0.012 | 92.860 |
| llama3:8b_ZS_UCP | 0.338 | 0.020 | 0.303 | 61.160 |
| llama3:8b_ZS_FCSP | 0.339 | 0.021 | 0.303 | 61.110 |
| llama3:8b_FS_UCP | 0.365 | 0.024 | 0.329 | 53.230 |
| llama3:8b_FS_FCSP | 0.367 | 0.024 | 0.331 | 52.970 |
| gpt4o_ZS_UGP | 0.003 | 0.000 | 0.001 | 92.860 |
| gpt4o_FS_UGP | 0.004 | 0.000 | 0.001 | 92.860 |
| gpt4o_ZS_FCSP | 0.091 | 0.007 | 0.077 | 85.160 |
| gpt4o_FS_FCSP | 0.092 | 0.007 | 0.078 | 85.160 |
| gpt4o_ZS_UCP | 0.093 | 0.007 | 0.079 | 84.880 |
| gpt4o_FS_UCP | 0.093 | 0.007 | 0.080 | 84.880 |
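
For readers less familiar with the overlap metrics reported in Table 3, the following is a minimal sketch of how BLEU-1 and ROUGE-L can be computed for a single candidate/reference pair. It is not the evaluation code behind the table: whitespace tokenization, a single reference per sample, no smoothing, and the F1 form of ROUGE-L are all assumptions made here for illustration, and the function names `bleu1`, `lcs_length`, and `rouge_l` are hypothetical. The "Avg" columns would then correspond to the mean of such per-sample scores.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision times the brevity penalty (one reference)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Clip each candidate token's count by its count in the reference.
    matches = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    precision = matches / len(cand)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall (assumed beta = 1)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Both scores lie in [0, 1], which matches the scale of the BLEU and ROUGE-L columns above. Library implementations may differ in tokenization, smoothing, and the weighting of precision against recall, so absolute values from this sketch are not directly comparable with the reported numbers.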