Table 2.
Generative LLM performance domains, criteria, and grading scale
| Domain | Criteria | Grading (1–100 scale) |
|---|---|---|
| Accuracy | Alignment with current clinical guidelines and standard practices | 1 = Poor alignment, 100 = Excellent alignment |
| Hallucination | Frequency of unverifiable or false information | 1 = Frequent hallucinations, 100 = No hallucinations |
| Specificity and relevance | Detail in response, relevance for patient’s own situation | 1 = Vague/irrelevant, 100 = Highly specific/relevant |
| Empathy and understandability | Tone sensitivity to patients’ possible emotional state | 1 = Insensitive/confusing, 100 = Empathetic/clear |
| Actionability | Practical steps and clarity in recommendations | 1 = Unclear/no steps, 100 = Clear/practical steps |
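As an illustration only, the rubric in Table 2 could be encoded so that a rater's scores are checked for completeness and range before analysis. This is a minimal sketch, not the procedure used in the study: the names `Domain`, `RUBRIC`, and `validate_scores` are hypothetical, and the restriction to integer scores is an assumption, since the table specifies only a 1–100 scale.

```python
from dataclasses import dataclass

# Scale bounds taken from Table 2 (1-100).
SCALE_MIN, SCALE_MAX = 1, 100


@dataclass(frozen=True)
class Domain:
    name: str
    low_anchor: str   # meaning of a score of 1
    high_anchor: str  # meaning of a score of 100


# Domains and anchors copied from Table 2.
RUBRIC = (
    Domain("Accuracy", "Poor alignment", "Excellent alignment"),
    Domain("Hallucination", "Frequent hallucinations", "No hallucinations"),
    Domain("Specificity and relevance", "Vague/irrelevant", "Highly specific/relevant"),
    Domain("Empathy and understandability", "Insensitive/confusing", "Empathetic/clear"),
    Domain("Actionability", "Unclear/no steps", "Clear/practical steps"),
)


def validate_scores(scores: dict[str, int]) -> dict[str, int]:
    """Check that each rubric domain has exactly one score within the scale."""
    expected = {d.name for d in RUBRIC}
    given = set(scores)
    if given != expected:
        raise ValueError(
            f"missing domains: {sorted(expected - given)}, "
            f"unknown domains: {sorted(given - expected)}"
        )
    for name, value in scores.items():
        # Assumption: scores are whole numbers; bool is rejected explicitly
        # because it is a subclass of int in Python.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name}: score must be an integer, got {value!r}")
        if not SCALE_MIN <= value <= SCALE_MAX:
            raise ValueError(f"{name}: score {value} is outside {SCALE_MIN}-{SCALE_MAX}")
    return dict(scores)
```

Note that every domain is oriented so that a higher score is better, including Hallucination, where 100 means no hallucinations; a check like this only enforces the range and leaves that interpretation to the rater.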