[Preprint]. 2024 Aug 19:2024.08.11.24311828. [Version 2] doi: 10.1101/2024.08.11.24311828

Table 3.

Summary of Evaluation Metrics.

General Metrics

| Evaluation Metric | Evaluation Purpose | Best Reported Performance | Clinical Task | Clinical Specialty |
|---|---|---|---|---|
| Confusion Matrix-Based Scores | Correctness | 100% | Diagnosing glaucoma based on specific clinical case descriptions[43] | Ophthalmology |
| | | | Generating radiology reports from concise imaging findings[100] | Radiology |
| | | | Accelerating review of historic echocardiogram reports[77] | Internal Medicine |
| | | | Interpreting symptoms and management of common cardiac conditions[79] | Internal Medicine |
| | | | Diagnosing and managing bacterial tonsillitis[54] | Otolaryngology |
| | | | Classifying margin status for lung cancer[57] | Oncology |
| Average Word Count Reduction Percentage + Recall | Balance between Conciseness and Completeness | 47% average word count reduction at 90% recall | Summarizing radiology reports into a structured format[44] | Radiology |
| Self-Designed Human Evaluation (e.g., Likert Scale) | Correctness | 89.6% | Generating concise and accurate layperson summaries of musculoskeletal radiology reports[101] | Radiology |
| | Completeness | 94.1% | Generating concise and accurate layperson summaries of musculoskeletal radiology reports[101] | Radiology |
| | Conciseness | 12% | Summarizing patient questions and progress notes[36] | No specific specialty |
| | Harmfulness* | 2% | Proposing a comprehensive management plan (suspected/confirmed diagnosis, workup, antibiotic therapy, source control, follow-up) for patients with positive blood cultures[48] | Internal Medicine |
| | Readability | 80% | Generating radiology reports from concise imaging findings[100] | Radiology |
| | Quality | 89% | Impression generation for whole-body PET reports[35] | Radiology |
| | Appropriateness | 58.5% | Diagnosing and suggesting examinations/treatments for urology patients (best-performing subgroup: non-oncology)[42] | Urology |
| | Satisfaction | 80% | Proposing a comprehensive management plan (suspected/confirmed diagnosis, workup, antibiotic therapy, source control, follow-up) for patients with positive blood cultures[48] | Internal Medicine |
| | Reliability/Stability | 70% | Predicting treatments for patients with aortic stenosis[83] | Internal Medicine |
| | Preference over Human | 81% | Summarizing clinical text[36] | No specific specialty |
| | Level of Empathy | 61.4% | Generating high-quality responses to patient-submitted questions in the patient portal[64] | Dermatology |
| | Hallucination Rate* | 4% | Improving the readability of foot and ankle orthopedic radiology reports[106] | Radiology |
| | Utility | 81.6% | Impression generation for whole-body PET reports[35] | Radiology |
| | Relevancy | 40% | Simplifying radiological MRI findings of the knee joint[105] | Radiology |
| Artificial Intelligence Performance Instrument (AIPI) | Other Performance | 15.1/20.0 | Managing cases in otolaryngology-head and neck surgery[40] | Otolaryngology |
| QAMAI Tool | Other Performance | 18.4/30 | Providing triage for maxillofacial trauma cases | Surgery |
| Ottawa Clinic Assessment Tool | Other Performance | 3.88/5.00 | Recommending differential diagnoses for laryngology and head and neck cases[47] | Otolaryngology |
| DISCERN | Quality | 15/35 | Diagnosing and suggesting examinations/treatments for urology patients (best-performing subgroups: oncology, emergency, and male)[42] | Urology |
| Root Mean Square Error | Error | 2.96 | Measuring the angle of correction for high tibial osteotomy[95] | Orthopedics |
| Flesch Reading Ease | Readability | 72.7% | Improving the readability of foot and ankle orthopedic radiology reports[63] | Radiology |
| Flesch-Kincaid Grade Level* | Readability | 6.2 | Summarizing discharge summaries[53] | No specific specialty |
| Average of Gunning Fog, Flesch-Kincaid Grade Level, Automated Readability Index, and Coleman-Liau* | Readability | 7.5 | Summarizing X-ray reports[103] | Radiology |
| Patient Education Materials Assessment Tool | Understandability | 81% | Summarizing discharge summaries[53] | No specific specialty |
| Cohen's Kappa | Reliability/Stability | 1.0 | Head and neck oncology board decisions: deciding on neoadjuvant chemotherapy and chemoradiotherapy treatment | Oncology |
| | Agreement with Expert or Ground Truth | 0.727 | Predicting the dichotomized modified Rankin Scale (mRS) score at 3 months post-thrombectomy[84] | Neurology |
| Cronbach's α | Agreement with Expert or Ground Truth | 0.754 | Managing otolaryngology cases[96] | Otolaryngology |
| Mann-Whitney U Test | Agreement with Expert or Ground Truth | 0.770 | Providing the number of additional examinations when managing otolaryngology cases[40] | Otolaryngology |
| Spearman's Coefficient | Reliability/Stability | 0.999 | Considering the patient's symptoms and physical findings reported by practitioners when managing otolaryngology cases[96] | Otolaryngology |
| Percentage of Identical Responses to Identical Queries | Reliability/Stability | 100% | Predicting hemoglobinopathies from a patient's laboratory results (CBC and ferritin values)[82] | Internal Medicine |
| Agreement Percentage | Agreement with Expert or Ground Truth | 80% | Determining disease severity for acute ulcerative colitis presentations in an emergency department setting[75] | Gastroenterology |
| Global Quality Scale | Quality | 4.2 | Analyzing retinal detachment cases and suggesting the best possible surgical planning[92] | Ophthalmology |
| Fleiss' Kappa | Reliability/Stability | 0.786 | Colonoscopy recommendations for colorectal cancer rescreening and surveillance[74] | Gastroenterology |
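Several of the scalar metrics above have closed-form definitions. As an illustration only (a minimal plain-Python sketch, not code from any of the reviewed studies), Cohen's kappa and root mean square error can be computed as:

```python
import math
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

def rmse(predicted, actual):
    """Root mean square error between paired numeric predictions and truth."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)
    )

# Two raters who always agree yield kappa = 1.0, the ceiling value
# reported for reliability in the table above.
print(cohens_kappa(["yes", "no", "yes", "no"], ["yes", "no", "yes", "no"]))
```

Note that kappa is undefined when expected agreement is exactly 1 (both raters always emit one identical label); production implementations guard that case.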
Similarity Measurements for Generative NLP Models

| Evaluation Metric | Spearman's ρ | Correlated Evaluation and Task (if reported) | Clinical Specialty | Best Reported Value | Task | Clinical Specialty |
|---|---|---|---|---|---|---|
| BLEU | 0.412 | Quality score of impression generation for whole-body PET reports[35] | Radiology | 24.7 | Impression generation for whole-body PET reports[35] | Radiology |
| | 0.225 | Completeness of clinical text summarization[36] | No specific specialty | | | |
| | 0.125 | Correctness of clinical text summarization[36] | | | | |
| | 0.075 | Conciseness of clinical text summarization[36] | | | | |
| BLEU-2 | | | | 74.5 | Generating a comprehensive and coherent medical report of a given medical image from COVID-19 data[39] | Internal Medicine |
| BLEU-3 | | | | 67.8 | | |
| BLEU-4 | | | | 63.2 | | |
| ROUGE-1 | 0.402 | Quality score of impression generation for whole-body PET reports[35] | Radiology | 57.29 | Clinical notes summarization[30] | No specific specialty |
| ROUGE-2 | 0.379 | Quality score of impression generation for whole-body PET reports[35] | Radiology | 44.32 | Clinical notes summarization[30] | No specific specialty |
| ROUGE-L | 0.22 | Completeness of clinical text summarization[36] | No specific specialty | 68.5 | Generating a comprehensive and coherent medical report of a given medical image from COVID-19 data[39] | Internal Medicine |
| | 0.16 | Correctness of clinical text summarization[36] | | | | |
| | 0.19 | Conciseness of clinical text summarization[36] | | | | |
| | 0.398 | Quality score of impression generation for whole-body PET reports[35] | Radiology | | | |
| BERTScore-Precision | | | | 86.57 | Clinical notes summarization[30] | No specific specialty |
| BERTScore-Recall | | | | 87.14 | | |
| BERTScore-F1 | 0.18 | Completeness of clinical text summarization[36] | No specific specialty | 89.4 | Summarizing longitudinal aneurysm reports[102] | Radiology |
| | 0.18 | Correctness of clinical text summarization[36] | | | | |
| | 0.24 | Conciseness of clinical text summarization[36] | | | | |
| | 0.407 | Quality score of impression generation for whole-body PET reports[35] | Radiology | | | |
| MEDCON | 0.125 | Completeness of clinical text summarization[36] | No specific specialty | 64.9 | Clinical text summarization[36] | No specific specialty |
| | 0.175 | Correctness of clinical text summarization[36] | | | | |
| | 0.15 | Conciseness of clinical text summarization[36] | | | | |
| CIDEr | 0.194 | Quality score of impression generation for whole-body PET reports[35] | Radiology | 97.5 | Generating a comprehensive and coherent medical report of a given medical image from COVID-19 data[39] | Internal Medicine |
| BARTScore+PET | 0.568 | | | −1.46 | Impression generation for whole-body PET reports[35] | Radiology |
| PEGASUSScore+PET | 0.563 | | | −1.44 | | |
| T5Score+PET | 0.542 | | | −1.41 | | |
| UniEval | 0.501 | | | 0.78 | | |
| BARTScore | 0.474 | | | −3.05 | | |
| CHRF | 0.433 | | | 42.2 | | |
| MoverScore | 0.420 | | | 0.607 | | |
| ROUGE-WE-1 | 0.403 | | | 54.8 | | |
| ROUGE-LSUM | 0.397 | | | 50.8 | | |
| ROUGE-WE-2 | 0.396 | | | 40.7 | | |
| METEOR | 0.388 | | | 0.279 | | |
| ROUGE-WE-3 | 0.385 | | | 42.5 | | |
| RadGraph | 0.384 | | | 0.397 | | |
| PRISM | 0.369 | | | −3.24 | | |
| ROUGE-3 | 0.345 | | | 20.5 | | |
| S3-pyr | 0.302 | | | 0.71 | | |
| S3-resp | 0.301 | | | 0.79 | | |
| Stats-novel trigram | 0.292 | | | 0.99 | | |
| Stats-density | 0.280 | | | 6.51 | | |
| BLANC | 0.165 | | | 0.131 | | |
| Stats-compression | 0.145 | | | 8.36 | | |
| SUPERT | 0.082 | | | 0.557 | | |
| Stats-coverage | 0.078 | | | 8.36 | | |
| SummaQA | 0.075 | | | 0.180 | | |
* Lower value represents better performance.
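The n-gram overlap scores in the second table (BLEU and the ROUGE family) all reduce to counting shared n-grams between a generated text and a reference. A minimal ROUGE-N F1 sketch (whitespace tokenization only; official ROUGE implementations add stemming and other preprocessing options):

```python
from collections import Counter

def rouge_n_f1(candidate, reference, n=1):
    """Simplified ROUGE-N F1: n-gram overlap of candidate vs. reference."""
    def ngram_counts(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngram_counts(candidate), ngram_counts(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Identical texts score 1.0; texts sharing no n-grams score 0.0.
print(rouge_n_f1("impression: no acute findings", "impression: no acute findings"))
```

Reported ROUGE values are conventionally scaled by 100 (e.g., the 57.29 ROUGE-1 above corresponds to an F1 of 0.5729).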