. 2025;33(1):4–10. doi: 10.5455/aim.2024.33.4-10

Table 2. PEM metrics evaluated by the included studies. An asterisk (*) indicates a technical metric.

Metric	Frequency	References	Definition
Citation Support*	1	[41].	Whether the LLM includes and properly references credible sources.
Patient Satisfaction	1	[40].	The extent to which a patient would find the answer helpful and comforting.
Similarity*	1	[64].	How closely the responses of one LLM align with those of another.
Bias*	2	[41], [59].	The presence of unfair or prejudiced assumptions in the text.
Hallucinations*	2	[19], [41].	Fabricating information or presenting false details as fact.
Reasoning*	2	[22], [59].	The clarity and logical soundness of the argument or explanation.
Response Length*	2	[7], [43].	The conciseness or verbosity of the answer.
Responsiveness*	2	[7], [43].	The time the LLM takes to complete its response.
Reproducibility*	5	[1], [18], [19], [23], [56].	Consistency of the answer when asked multiple times.
Safety	5	[22], [37], [41], [59], [69].	Avoidance of harmful, unethical, or disallowed content.
Clarity	6	[5], [9], [19], [39], [41], [60].	How easily the text can be understood.
Actionability	8	[2], [6], [13], [21], [30], [38], [54], [55].	Whether the response provides usable advice or next steps.
Tone	11	[5], [9], [10], [20], [30], [39]–[41], [47], [51], [60].	The emotional or stylistic manner of the answer.
Appropriateness	13	[9], [19], [22], [23], [27], [33], [39], [41], [45], [46], [50], [60], [69].	Suitability of the response for the context and audience.
Understandability	13	[2], [6], [9], [13], [21], [30], [38], [51], [53]–[55], [59], [60].	How straightforward and comprehensible the language is.
Reliability*	15	[10], [15]–[17], [20], [26]–[28], [31], [43], [47], [49], [57], [61], [64].	Trustworthiness and factual correctness of the content.
Quality*	19	[5], [11], [15]–[17], [25], [26], [28], [37], [46], [49], [50], [53], [54], [57], [61], [65], [66], [69].	Overall caliber and usefulness of the response.
Comprehensiveness	24	[1], [3], [4], [7], [10], [11], [18], [19], [22], [32]–[34], [36], [40], [44], [46]–[48], [50], [51], [56], [60], [67], [68].	The degree to which the answer covers all relevant points.
Readability*	51	[1]–[3], [6]–[11], [13]–[17], [20], [21], [23]–[26], [28], [30], [31], [34], [35], [38], [41], [42], [44]–[57], [59]–[67].	The ease with which the text can be read and parsed.
Accuracy*	54	[1]–[5], [7], [10]–[13], [16], [18]–[23], [25]–[34], [36], [38]–[44], [46]–[51], [53]–[56], [58]–[60], [62], [63], [66]–[69].	Correctness and precision of the information provided.
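
The Frequency column in Table 2 corresponds to the number of studies listed in the References column, with ranges such as [39]–[41] counted as three studies. The following is a minimal sketch of how that count could be checked for a given row; the function name `count_references` and the parsing approach are illustrative assumptions, not part of the review's methodology.

```python
import re


def count_references(cell: str) -> int:
    """Count distinct reference numbers in a cell such as '[39]–[41], [47].'

    Hypothetical helper for illustration only.
    """
    cited = set()
    # Match a single reference "[n]" or a range "[a]–[b]" (en dash or hyphen).
    for match in re.finditer(r"\[(\d+)\](?:\s*[–-]\s*\[(\d+)\])?", cell):
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        cited.update(range(start, end + 1))
    return len(cited)


# References cell of the Tone row, copied from Table 2.
tone = "[5], [9], [10], [20], [30], [39]–[41], [47], [51], [60]."
print(count_references(tone))  # 11, matching the Frequency column
```

Collecting the numbers in a set means a reference that appeared twice in one cell would be counted once.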