Table 2.
For the evaluation of the question-answering task, we compared ROUGE-L, BERTScore, MoverScore, and BLEURT scores between the zero-shot setting and the retrieval-augmented generation framework.
| Data set and setting | LLaMa2-13b: ROUGE-L | LLaMa2-13b: BERTScore | LLaMa2-13b: MoverScore | LLaMa2-13b: BLEURT | GPT-4: ROUGE-L | GPT-4: BERTScore | GPT-4: MoverScore | GPT-4: BLEURT |
|---|---|---|---|---|---|---|---|---|
| LiveQAa | | | | | | | | |
| Z.Sb | 17.73 | 81.93 | 53.37 | 40.45 | 18.89 | 82.50 | 54.02 | 39.84 |
| RAGc | 18.83d | 82.79d | 53.79d | 40.59d | 19.44d | 83.01d | 54.11d | 40.55d |
| ExpertQAa-Bio | | | | | | | | |
| Z.S | 23.26 | 84.38 | 55.58 | 44.65 | 23.00 | 84.50 | 56.15 | 44.53 |
| RAG | 25.79d | 85.18d | 56.17d | 45.20d | 27.20d | 85.83d | 57.11d | 45.91d |
| ExpertQAa-Med | | | | | | | | |
| Z.S | 24.86 | 84.89 | 55.74 | 46.32 | 25.45 | 85.11 | 56.50 | 45.98 |
| RAG | 27.49d | 85.80d | 56.58d | 46.47d | 28.08d | 86.30d | 57.32d | 47.00d |
| MedicationQAa | | | | | | | | |
| Z.S | 13.30 | 81.81 | 51.96 | 38.30 | 14.41 | 82.55 | 52.62 | 37.41 |
| RAG | 14.71d | 82.79d | 52.59d | 38.42d | 16.19d | 83.59d | 53.30d | 37.91d |
aQA: question-answering.
bZ.S: zero-shot.
cRAG: retrieval-augmented generation framework.
dThe superior score within the same data set.
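
To make the ROUGE-L column easier to interpret, the following is a minimal sketch of how a ROUGE-L score can be computed from the longest common subsequence (LCS) between a generated answer and a reference answer. It rests on assumptions not stated in the table: whitespace tokenization, lowercasing, the balanced F-measure, and a 0 to 100 scale. The function names `lcs_length` and `rouge_l_f1` are hypothetical, and the exact ROUGE-L variant and implementation used for Table 2 may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Tokens match: extend the LCS found for both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # No match: carry over the better of the two neighboring cells.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L balanced F-measure on a 0-100 scale (assumed variant)."""
    # Assumption: simple lowercased whitespace tokenization.
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 100.0 * 2 * precision * recall / (precision + recall)
```

Under these assumptions, a data set score such as those in the table would be the mean of `rouge_l_f1` over all question-answer pairs in that data set.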