2024 Oct 3;26:e60601. doi: 10.2196/60601

Table 2. Comparison of ROUGE-L, BERTScore, MoverScore, and BLEURT on the question-answering task for LLaMa2-13b and GPT-4 under the zero-shot and retrieval-augmented generation settings.


| Data set | Setting | LLaMa2-13b: ROUGE-L | BERTScore | MoverScore | BLEURT | GPT-4: ROUGE-L | BERTScore | MoverScore | BLEURT |
|---|---|---|---|---|---|---|---|---|---|
| LiveQA^a | Z.S^b | 17.73 | 81.93 | 53.37 | 40.45 | 18.89 | 82.50 | 54.02 | 39.84 |
| LiveQA | RAG^c | 18.83^d | 82.79^d | 53.79^d | 40.59^d | 19.44^d | 83.01^d | 54.11^d | 40.55^d |
| ExpertQA^a-Bio | Z.S | 23.26 | 84.38 | 55.58 | 44.65 | 23.00 | 84.50 | 56.15 | 44.53 |
| ExpertQA-Bio | RAG | 25.79^d | 85.18^d | 56.17^d | 45.20^d | 27.20^d | 85.83^d | 57.11^d | 45.91^d |
| ExpertQA^a-Med | Z.S | 24.86 | 84.89 | 55.74 | 46.32 | 25.45 | 85.11 | 56.50 | 45.98 |
| ExpertQA-Med | RAG | 27.49^d | 85.80^d | 56.58^d | 46.47^d | 28.08^d | 86.30^d | 57.32^d | 47.00^d |
| MedicationQA^a | Z.S | 13.30 | 81.81 | 51.96 | 38.30 | 14.41 | 82.55 | 52.62 | 37.41 |
| MedicationQA | RAG | 14.71^d | 82.79^d | 52.59^d | 38.42^d | 16.19^d | 83.59^d | 53.30^d | 37.91^d |

^a QA: question answering.

^b Z.S: zero-shot.

^c RAG: retrieval-augmented generation.

^d The superior score within the same data set.
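Of the four metrics reported above, ROUGE-L is the simplest to reproduce: it is the F-measure over the longest common subsequence (LCS) of tokens between a generated answer and the reference. A minimal pure-Python sketch (whitespace tokenization only, no stemming, so scores may differ slightly from those produced by standard packages such as `rouge-score`):

```python
def rouge_l(reference: str, candidate: str) -> float:
    """ROUGE-L F1: 2*P*R / (P + R), where P and R are the LCS length
    divided by the candidate and reference lengths, respectively."""
    ref, cand = reference.split(), candidate.split()
    m, n = len(ref), len(cand)
    if m == 0 or n == 0:
        return 0.0
    # Dynamic-programming table for LCS length.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if ref[i] == cand[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / n, lcs / m
    return 2 * precision * recall / (precision + recall)
```

Note that the table reports ROUGE-L as a percentage (e.g., 17.73 corresponds to an F1 of about 0.177); BERTScore, MoverScore, and BLEURT instead require pretrained models and are typically computed with their reference implementations.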