Author manuscript; available in PMC: 2025 Sep 30.
Published in final edited form as: Proc Mach Learn Res. 2025 Jun;287:527–542.

Table 4:

Consensus clinical evaluations of dense information extraction outputs from FS-FCSP by Llama3:8b. ↓ indicates that lower scores are better, and ↑ indicates that higher scores are better.

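| Evaluation Criteria | Sub-criterion | Physician 1 | Physician 2 | Average |
|---|---|---|---|---|
| 1. Clinical Relevance ↑ | Accuracy of Key Information | 4 | 4 | 4 |
| 1. Clinical Relevance ↑ | Critical Omissions | 4 | 4 | 4 |
| 2. Comprehensibility ↑ | Readability | 4 | 5 | 4.5 |
| 2. Comprehensibility ↑ | Conciseness | 5 | 5 | 5 |
| 3. Clinical Usability ↑ | Practicality | 4 | 4 | 4 |
| 3. Clinical Usability ↑ | Actionability | 3 | 3 | 3 |
| 4. Error Impact Assessment ↓ | Severity of Errors | 2 | 2 | 2 |
| 4. Error Impact Assessment ↓ | Tolerance for Hallucination | 5 | 4 | 4.5 |
| 5. Alignment with Clinical Judgment ↑ | Trustworthiness | 4 | 4 | 4 |
| 5. Alignment with Clinical Judgment ↑ | Contextual Appropriateness | 4 | 4 | 4 |
| 6. Consistency ↑ | Model Consistency Across Cases | 4 | 3 | 3.5 |
| 7. Preference Scoring ↑ | Overall Performance | 4 | 4 | 4 |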
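
In every row of Table 4, the Average value matches the arithmetic mean of the two physicians' scores. The source does not state the formula, so the following is only a minimal Python sketch under that assumption. The names `scores` and `average_scores` are illustrative and do not come from the paper; the numbers are transcribed from the table.

```python
# Scores transcribed from Table 4: (sub-criterion, Physician 1, Physician 2).
scores = [
    ("Accuracy of Key Information", 4, 4),
    ("Critical Omissions", 4, 4),
    ("Readability", 4, 5),
    ("Conciseness", 5, 5),
    ("Practicality", 4, 4),
    ("Actionability", 3, 3),
    ("Severity of Errors", 2, 2),
    ("Tolerance for Hallucination", 5, 4),
    ("Trustworthiness", 4, 4),
    ("Contextual Appropriateness", 4, 4),
    ("Model Consistency Across Cases", 4, 3),
    ("Overall Performance", 4, 4),
]


def average_scores(rows):
    """Return {sub-criterion: mean of the two physician scores}.

    Assumption: the table's Average column is a plain arithmetic mean.
    """
    return {name: (p1 + p2) / 2 for name, p1, p2 in rows}


averages = average_scores(scores)

if __name__ == "__main__":
    for name, avg in averages.items():
        # ":g" drops a trailing ".0" so whole numbers print as in the table.
        print(f"{name}: {avg:g}")
```

The two physicians disagree on only three sub-criteria (Readability, Tolerance for Hallucination, and Model Consistency Across Cases), which are the only rows with a non-integer average.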