Table 3.
**General Metrics**

| Evaluation Metric | Evaluation Purpose | Best Reported Performance | Clinical Task | Clinical Specialty |
|---|---|---|---|---|
| Confusion Matrix-Based Scores | Correctness | 100% | Diagnosing glaucoma based on specific clinical case descriptions43 | Ophthalmology |
| | | | Generating radiology reports from concise imaging findings100 | Radiology |
| | | | Accelerating review of historic echocardiogram reports77 | Internal Medicine |
| | | | Interpreting symptoms and management of common cardiac conditions79 | Internal Medicine |
| | | | Diagnosing and managing bacterial tonsillitis54 | Otolaryngology |
| | | | Classifying margin status for lung cancer57 | Oncology |
| Average Word Count Reduction Percentage + Recall | Balance between conciseness and completeness | Average word count reduction of 47% at 90% recall | Summarizing radiology reports into a structured format44 | Radiology |
| Self-Designed Human Evaluation (e.g., Likert Scale) | Correctness | 89.6% | Generating concise and accurate layperson summaries of musculoskeletal radiology reports101 | Radiology |
| | Completeness | 94.1% | Generating concise and accurate layperson summaries of musculoskeletal radiology reports101 | Radiology |
| | Conciseness | 12% | Summarizing patient questions and progress notes36 | No specific specialty |
| | Harmfulness* | 2% | Proposing a comprehensive management plan (suspected/confirmed diagnosis, workup, antibiotic therapy, source control, follow-up) for patients with positive blood cultures48 | Internal Medicine |
| | Readability | 80% | Generating radiology reports from concise imaging findings100 | Radiology |
| | Quality | 89% | Impression generation for whole-body PET reports35 | Radiology |
| | Appropriateness | 58.5% | Diagnosing and suggesting examinations/treatments for urology patients (best-performing subgroup: non-oncology)42 | Urology |
| | Satisfaction | 80% | Proposing a comprehensive management plan (suspected/confirmed diagnosis, workup, antibiotic therapy, source control, follow-up) for patients with positive blood cultures48 | Internal Medicine |
| | Reliability/Stability | 70% | Predicting treatments for patients with aortic stenosis83 | Internal Medicine |
| | Preference over Human | 81% | Summarizing clinical text36 | No specific specialty |
| | Level of Empathy | 61.4% | Generating high-quality responses to patient-submitted questions in the patient portal64 | Dermatology |
| | Hallucination Rate* | 4% | Improving the readability of foot and ankle orthopedic radiology reports106 | Radiology |
| | Utility | 81.6% | Impression generation for whole-body PET reports35 | Radiology |
| | Relevancy | 40% | Simplifying radiological MRI findings of the knee joint105 | Radiology |
| Artificial Intelligence Performance Instrument (AIPI) | Other Performance | 15.1/20.0 | Managing cases in otolaryngology–head and neck surgery40 | Otolaryngology |
| QAMAI Tool | Other Performance | 18.4/30.0 | Providing triage for maxillofacial trauma cases | Surgery |
| Ottawa Clinic Assessment Tool | Other Performance | 3.88/5.00 | Recommending differential diagnoses for laryngology and head and neck cases47 | Otolaryngology |
| DISCERN | Quality | 15/35 | Diagnosing and suggesting examinations/treatments for urology patients (best-performing subgroups: oncology, emergency, and male)42 | Urology |
| Root Mean Square Error | Error | 2.96 | Measuring the angle of correction for high tibial osteotomy95 | Orthopedics |
| Flesch Reading Ease | Readability | 72.7 | Improving the readability of foot and ankle orthopedic radiology reports63 | Radiology |
| Flesch–Kincaid Grade Level* | Readability | 6.2 | Summarizing discharge summaries53 | No specific specialty |
| Average of Gunning Fog, Flesch–Kincaid Grade Level, Automated Readability, and Coleman–Liau Indices* | Readability | 7.5 | Summarizing X-ray reports103 | Radiology |
| Patient Education Materials Assessment Tool | Understandability | 81% | Summarizing discharge summaries53 | No specific specialty |
| Cohen’s Kappa | Reliability/Stability | 1.0 | Head and neck oncology board decisions: deciding on neoadjuvant chemotherapy and chemoradiotherapy treatment | Oncology |
| | Agreement with Expert or Ground Truth | 0.727 | Predicting the dichotomized modified Rankin Scale (mRS) score at 3 months post-thrombectomy84 | Neurology |
| Cronbach’s α | Agreement with Expert or Ground Truth | 0.754 | Managing otolaryngology cases96 | Otolaryngology |
| Mann-Whitney U Test | Agreement with Expert or Ground Truth | 0.770 | Providing the number of additional examinations when managing otolaryngology cases40 | Otolaryngology |
| Spearman’s Coefficient | Reliability/Stability | 0.999 | Considering the patient’s symptoms and physical findings reported by practitioners when managing otolaryngology cases96 | Otolaryngology |
| Percentage of Identical Responses to Identical Queries | Reliability/Stability | 100% | Predicting hemoglobinopathies from a patient’s laboratory results (CBC and ferritin values)82 | Internal Medicine |
| Agreement Percentage | Agreement with Expert or Ground Truth | 80% | Determining disease severity for acute ulcerative colitis presentations in an emergency department setting75 | Gastroenterology |
| Global Quality Scale | Quality | 4.2 | Analyzing retinal detachment cases and suggesting the best possible surgical planning92 | Ophthalmology |
| Fleiss’ Kappa | Reliability/Stability | 0.786 | Colonoscopy recommendations for colorectal cancer rescreening and surveillance74 | Gastroenterology |
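Several of the agreement and readability rows above are closed-form computations rather than model-based scores. The sketch below is a minimal illustration in plain Python (no external dependencies; the sample labels and sentences are hypothetical): Cohen's kappa implemented from its definition, and the Flesch–Kincaid Grade Level formula with a rough vowel-group syllable heuristic.

```python
import re
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement rate and p_e is the chance agreement implied by each rater's
    marginal label frequencies."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b.get(label, 0) for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

def _syllables(word):
    # Rough heuristic: each run of consecutive vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

# Hypothetical labels: two raters who agree on every item give kappa = 1.0,
# matching the tumor-board row above.
print(cohens_kappa(["chemo", "chemoradio", "chemo"],
                   ["chemo", "chemoradio", "chemo"]))          # 1.0
print(round(flesch_kincaid_grade(
    "The scan shows no fracture. Follow up in two weeks."), 1))  # low grade level
```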
**Similarity Measurements for Generative NLP Models**

Where reported, the correlation between each metric and a human evaluation is measured by Spearman's coefficient.

| Evaluation Metric | Reported Coefficient | Correlated Evaluation and Task | Clinical Specialty | Best Reported Value | Task | Clinical Specialty |
|---|---|---|---|---|---|---|
| BLEU | 0.412 | Quality score of impression generation for whole-body PET reports35 | Radiology | 24.7 | Impression generation for whole-body PET reports35 | Radiology |
| | 0.225 | Completeness of clinical text summarization36 | No specific specialty | | | |
| | 0.125 | Correctness of clinical text summarization36 | | | | |
| | 0.075 | Conciseness of clinical text summarization36 | | | | |
| BLEU-2 | | | | 74.5 | Generating a comprehensive and coherent medical report of a given medical image from COVID-19 data39 | Internal Medicine |
| BLEU-3 | | | | 67.8 | | |
| BLEU-4 | | | | 63.2 | | |
| ROUGE-1 | 0.402 | Quality score of impression generation for whole-body PET reports35 | Radiology | 57.29 | Clinical notes summarization30 | No specific specialty |
| ROUGE-2 | 0.379 | Quality score of impression generation for whole-body PET reports35 | Radiology | 44.32 | Clinical notes summarization30 | No specific specialty |
| ROUGE-L | 0.22 | Completeness of clinical text summarization36 | No specific specialty | 68.5 | Generating a comprehensive and coherent medical report of a given medical image from COVID-19 data39 | Internal Medicine |
| | 0.16 | Correctness of clinical text summarization36 | | | | |
| | 0.19 | Conciseness of clinical text summarization36 | | | | |
| | 0.398 | Quality score of impression generation for whole-body PET reports35 | Radiology | | | |
| BERTScore-Precision | | | | 86.57 | Clinical notes summarization30 | No specific specialty |
| BERTScore-Recall | | | | 87.14 | Clinical notes summarization30 | No specific specialty |
| BERTScore-F1 | 0.18 | Completeness of clinical text summarization36 | No specific specialty | 89.4 | Summarizing longitudinal aneurysm reports102 | Radiology |
| | 0.18 | Correctness of clinical text summarization36 | | | | |
| | 0.24 | Conciseness of clinical text summarization36 | | | | |
| | 0.407 | Quality score of impression generation for whole-body PET reports35 | Radiology | | | |
| MEDCON | 0.125 | Completeness of clinical text summarization36 | No specific specialty | 64.9 | Clinical text summarization36 | No specific specialty |
| | 0.175 | Correctness of clinical text summarization36 | | | | |
| | 0.15 | Conciseness of clinical text summarization36 | | | | |
| CIDEr | 0.194 | Quality score of impression generation for whole-body PET reports35 | Radiology | 97.5 | Generating a comprehensive and coherent medical report of a given medical image from COVID-19 data39 | Internal Medicine |
| BARTScore+PET | 0.568 | Quality score of impression generation for whole-body PET reports35 | Radiology | −1.46 | Impression generation for whole-body PET reports35 | Radiology |
| PEGASUSScore+PET | 0.563 | | | −1.44 | | |
| T5Score+PET | 0.542 | | | −1.41 | | |
| UniEval | 0.501 | | | 0.78 | | |
| BARTScore | 0.474 | | | −3.05 | | |
| CHRF | 0.433 | | | 42.2 | | |
| MoverScore | 0.420 | | | 0.607 | | |
| ROUGE-WE-1 | 0.403 | | | 54.8 | | |
| ROUGE-LSUM | 0.397 | | | 50.8 | | |
| ROUGE-WE-2 | 0.396 | | | 40.7 | | |
| METEOR | 0.388 | | | 0.279 | | |
| ROUGE-WE-3 | 0.385 | | | 42.5 | | |
| RadGraph | 0.384 | | | 0.397 | | |
| PRISM | 0.369 | | | −3.24 | | |
| ROUGE-3 | 0.345 | | | 20.5 | | |
| S3-pyr | 0.302 | | | 0.71 | | |
| S3-resp | 0.301 | | | 0.79 | | |
| Stats-novel trigram | 0.292 | | | 0.99 | | |
| Stats-density | 0.280 | | | 6.51 | | |
| BLANC | 0.165 | | | 0.131 | | |
| Stats-compression | 0.145 | | | 8.36 | | |
| SUPERT | 0.082 | | | 0.557 | | |
| Stats-coverage | 0.078 | | | 8.36 | | |
| SummaQA | 0.075 | | | 0.180 | | |
\*Lower value represents better performance.
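Most of the lexical-overlap metrics in this table (BLEU, the ROUGE family, CHRF) reduce to n-gram overlap between a model output and a reference text. The sketch below is a minimal illustration of ROUGE-N, assuming lowercased whitespace tokenization; the example report pair is hypothetical, and published evaluations typically rely on established packages such as rouge-score or sacrebleu rather than hand-rolled implementations.

```python
from collections import Counter

def _ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """ROUGE-N: clipped n-gram overlap between candidate and reference,
    reported as recall (the classic ROUGE-N), precision, and F1."""
    cand = _ngrams(candidate.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # counts clipped to the reference
    recall = overlap / max(1, sum(ref.values()))
    precision = overlap / max(1, sum(cand.values()))
    f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical impression pair; real studies score model output against
# a clinician-written reference.
print(rouge_n("no acute intracranial abnormality", "no acute abnormality", n=1))
```

Embedding-based metrics such as BERTScore and learned metrics such as BARTScore replace this exact n-gram match with model-derived token similarity, which is one reason their correlations with human quality judgments differ from the lexical metrics in the coefficients above.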