2024 Oct 3;26:e60601. doi: 10.2196/60601

Table 2. Comparison of ROUGE-L, BERTScore, MoverScore, and BLEURT on the question-answering task for LLaMa2-13b and GPT-4 under the zero-shot and retrieval-augmented generation settings.


| Data set | Setting | LLaMa2-13b: ROUGE-L | BERTScore | MoverScore | BLEURT | GPT-4: ROUGE-L | BERTScore | MoverScore | BLEURT |
|---|---|---|---|---|---|---|---|---|---|
| LiveQA^a | Z.S^b | 17.73 | 81.93 | 53.37 | 40.45 | 18.89 | 82.50 | 54.02 | 39.84 |
| LiveQA | RAG^c | 18.83^d | 82.79^d | 53.79^d | 40.59^d | 19.44^d | 83.01^d | 54.11^d | 40.55^d |
| ExpertQA^a-Bio | Z.S | 23.26 | 84.38 | 55.58 | 44.65 | 23.00 | 84.50 | 56.15 | 44.53 |
| ExpertQA-Bio | RAG | 25.79^d | 85.18^d | 56.17^d | 45.20^d | 27.20^d | 85.83^d | 57.11^d | 45.91^d |
| ExpertQA^a-Med | Z.S | 24.86 | 84.89 | 55.74 | 46.32 | 25.45 | 85.11 | 56.50 | 45.98 |
| ExpertQA-Med | RAG | 27.49^d | 85.80^d | 56.58^d | 46.47^d | 28.08^d | 86.30^d | 57.32^d | 47.00^d |
| MedicationQA^a | Z.S | 13.30 | 81.81 | 51.96 | 38.30 | 14.41 | 82.55 | 52.62 | 37.41 |
| MedicationQA | RAG | 14.71^d | 82.79^d | 52.59^d | 38.42^d | 16.19^d | 83.59^d | 53.30^d | 37.91^d |

^a QA: question answering.

^b Z.S: zero-shot.

^c RAG: retrieval-augmented generation.

^d The superior score within the same data set.
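Of the four metrics reported above, ROUGE-L is the simplest to reproduce: it is the F-measure over the longest common subsequence (LCS) of tokens between a generated answer and the reference. A minimal pure-Python sketch (whitespace tokenization only, no stemming, so scores may differ slightly from those produced by standard packages such as `rouge-score`):

```python
def rouge_l(reference: str, candidate: str) -> float:
    """ROUGE-L F1: 2*P*R / (P + R), where P and R are the LCS length
    divided by the candidate and reference lengths, respectively."""
    ref, cand = reference.split(), candidate.split()
    m, n = len(ref), len(cand)
    if m == 0 or n == 0:
        return 0.0
    # Dynamic-programming table for LCS length.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if ref[i] == cand[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / n, lcs / m
    return 2 * precision * recall / (precision + recall)
```

Note that the table reports ROUGE-L as a percentage (e.g., 17.73 corresponds to an F1 of about 0.177); BERTScore, MoverScore, and BLEURT instead require pretrained models and are typically computed with their reference implementations.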