| Category | No. | Item |
|---|---|---|
| Source of prompts | 1 | Website, exam question bank, FAQs (Google Trends©), guidelines, patient information forum |
| | 2 | Number of questions (n) |
| Assessed large language model | 3 | Which LLM (e.g. GPT-3.5, Gemini)? |
| | 4 | Standard model, or fine-tuning with specific data applied? |
| Questioning procedure | 5 | Topic (e.g. cancer entity) |
| | 6 | Source data of prompts and answers provided? |
| | 7 | Prompt engineering used? |
| | 8 | Enquiry conducted once or repeated? |
| | 9 | Enquiry conducted independently (i.e. “new question = new chat”)? |
| | 10 | Standalone questions or multiple continuous questions (i.e. “zero-shot” vs. “fire-side” enquiry)? |
| | 11 | Language |
| Output evaluation | 12 | Rater (who evaluates the LLM output, and what is their level of experience?) |
| | 13 | Is the rater blinded? |
| | 14 | Number of raters |
| | 15 | If multiple raters, is inter-rater agreement reported? |
| | 16 | Is an endpoint reported as a metric? (e.g. accuracy, readability) |
| | 17 | Is grading of LLM output reported? (e.g. binary yes/no, Likert scale, multidimensional) |
| | 18 | Is a control group reported? (e.g. “We compared the performance of GPT-3.5 with GPT-4”) |
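As an illustration of item 15, the sketch below (not taken from the article) shows one common way inter-rater agreement could be quantified: Cohen's kappa between two hypothetical raters who graded a set of LLM answers on a binary correct/incorrect scale. The rater labels and ratings are invented for demonstration only.

```python
# A minimal sketch, assuming two hypothetical raters and binary grading
# (1 = correct, 0 = incorrect) of ten LLM answers.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # hypothetical ratings, rater A
rater_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]  # hypothetical ratings, rater B

# Cohen's kappa corrects raw percent agreement for agreement expected by chance.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

For more than two raters, an analogous multi-rater statistic (e.g. Fleiss' kappa) would be reported instead.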