
Table 1.

Evaluation framework used for the systematic data extraction of studies evaluating the performance of LLMs in medQA

| Section | Item | Description |
|---|---|---|
| Source of prompts | 1 | Website, exam question bank, FAQs (Google Trends©), guidelines, patient information forum |
| | 2 | Number of questions (n) |
| Assessed large language model | 3 | Which LLM (e.g. GPT-3.5, Gemini)? |
| | 4 | Standard model, or fine-tuning with specific data applied? |
| Questioning procedure | 5 | Topic (e.g. cancer entity) |
| | 6 | Source data of prompts and answers provided? |
| | 7 | Prompt engineering used? |
| | 8 | Enquiry conducted once or repeated? |
| | 9 | Enquiry conducted independently (i.e. "new question = new chat")? |
| | 10 | Standalone questions or multiple continuous questions (i.e. "zero-shot" vs. "fire-side" enquiry)? |
| | 11 | Language |
| Output evaluation | 12 | Rater (who is evaluating the LLM output, and what is their level of experience?) |
| | 13 | Is the rater blinded? |
| | 14 | Number of raters |
| | 15 | If multiple raters, is inter-rater agreement reported? |
| | 16 | Is an endpoint reported as a metric (e.g. accuracy, readability)? |
| | 17 | Is grading of LLM output reported (e.g. binary yes/no, Likert scale, multidimensional)? |
| | 18 | Is a control group reported (e.g. "We compared performance of GPT-3.5 with GPT-4")? |
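
The 18 items above form a structured data-extraction schema. As a purely illustrative sketch (not code from the study), the framework could be captured as a single record per included study; all field names and types below are assumptions chosen to mirror the table.

```python
"""Hypothetical sketch: the Table 1 extraction framework as a structured record.

This is not part of the original study; field names and types are
illustrative assumptions mapping one-to-one onto the 18 framework items.
"""
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MedQAStudyExtraction:
    # Source of prompts (items 1-2)
    prompt_source: str                    # e.g. "exam question bank", "patient information forum"
    n_questions: int

    # Assessed large language model (items 3-4)
    llm: str                              # e.g. "GPT-3.5", "Gemini"
    fine_tuned: bool                      # standard model vs. fine-tuned with specific data

    # Questioning procedure (items 5-11)
    topic: str                            # e.g. cancer entity
    source_data_provided: bool            # prompts and answers available?
    prompt_engineering: bool
    repeated_enquiry: bool                # asked once or repeated?
    independent_enquiry: bool             # "new question = new chat"?
    standalone_questions: bool            # zero-shot vs. continuous ("fire-side") enquiry
    language: str

    # Output evaluation (items 12-18)
    rater: str                            # who evaluates the output, and their level of experience
    rater_blinded: bool
    n_raters: int
    inter_rater_agreement: Optional[str] = None        # reported only if multiple raters
    endpoint_metrics: list[str] = field(default_factory=list)   # e.g. ["accuracy", "readability"]
    output_grading: Optional[str] = None                # e.g. "binary", "Likert scale", "multidimensional"
    control_group: Optional[str] = None                 # e.g. "GPT-3.5 vs. GPT-4"
```

Such a record is one possible way to keep the extracted variables consistent across studies before tabulation; it is not prescribed by the framework itself.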