[Preprint]. 2024 Aug 19:2024.08.11.24311828. [Version 2] doi: 10.1101/2024.08.11.24311828

Table 3.

Summary of Evaluation Metrics.

General Metrics

| Evaluation Metric | Evaluation Purpose | Best Reported Performance | Clinical Task | Clinical Specialty |
|---|---|---|---|---|
| Confusion Matrix-Based Scores | Correctness | 100% | Diagnosing glaucoma based on specific clinical case descriptions[43] | Ophthalmology |
| | | | Generating radiology reports from concise imaging findings[100] | Radiology |
| | | | Accelerating review of historic echocardiogram reports[77] | Internal Medicine |
| | | | Interpreting symptoms and management of common cardiac conditions[79] | Internal Medicine |
| | | | Diagnosing and managing bacterial tonsillitis[54] | Otolaryngology |
| | | | Classifying margin status for lung cancer[57] | Oncology |
| Average Word Count Reduction Percentage + Recall | Balance between Conciseness and Completeness | 47% average word count reduction at 90% recall | Summarizing radiology reports into a structured format[44] | Radiology |
| Self-Designed Human Evaluation (e.g., Likert Scale) | Correctness | 89.6% | Generating concise and accurate layperson summaries of musculoskeletal radiology reports[101] | Radiology |
| | Completeness | 94.1% | Generating concise and accurate layperson summaries of musculoskeletal radiology reports[101] | Radiology |
| | Conciseness | 12% | Summarizing patient questions and progress notes[36] | No specific specialty |
| | Harmfulness* | 2% | Proposing a comprehensive management plan (suspected/confirmed diagnosis, workup, antibiotic therapy, source control, follow-up) for patients with positive blood cultures[48] | Internal Medicine |
| | Readability | 80% | Generating radiology reports from concise imaging findings[100] | Radiology |
| | Quality | 89% | Impression generation for whole-body PET reports[35] | Radiology |
| | Appropriateness | 58.5% | Diagnosing and suggesting examinations/treatments for urology patients (best-performing subgroup: non-oncology)[42] | Urology |
| | Satisfaction | 80% | Proposing a comprehensive management plan (suspected/confirmed diagnosis, workup, antibiotic therapy, source control, follow-up) for patients with positive blood cultures[48] | Internal Medicine |
| | Reliability/Stability | 70% | Predicting treatments for patients with aortic stenosis[83] | Internal Medicine |
| | Preference over Human | 81% | Summarizing clinical text[36] | No specific specialty |
| | Level of Empathy | 61.4% | Generating high-quality responses to patient-submitted questions in the patient portal[64] | Dermatology |
| | Hallucination Rate* | 4% | Improving the readability of foot and ankle orthopedic radiology reports[106] | Radiology |
| | Utility | 81.6% | Impression generation for whole-body PET reports[35] | Radiology |
| | Relevancy | 40% | Simplifying radiological MRI findings of the knee joint[105] | Radiology |
| Artificial Intelligence Performance Instrument (AIPI) | Other Performance | 15.1/20.0 | Managing cases in otolaryngology-head and neck surgery[40] | Otolaryngology |
| QAMAI Tool | Other Performance | 18.4/30 | Providing triage for maxillofacial trauma cases | Surgery |
| Ottawa Clinic Assessment Tool | Other Performance | 3.88/5.00 | Recommending differential diagnoses for laryngology and head and neck cases[47] | Otolaryngology |
| DISCERN | Quality | 15/35 | Diagnosing and suggesting examinations/treatments for urology patients (best-performing subgroups: oncology, emergency, and male)[42] | Urology |
| Root Mean Square Error | Error | 2.96 | Measuring the angle of correction for high tibial osteotomy[95] | Orthopedics |
| Flesch Reading Ease | Readability | 72.7% | Improving the readability of foot and ankle orthopedic radiology reports[63] | Radiology |
| Flesch-Kincaid Grade Level* | Readability | 6.2 | Summarizing discharge summaries[53] | No specific specialty |
| Average of Gunning Fog, Flesch-Kincaid Grade Level, Automated Readability Index, and Coleman-Liau* | Readability | 7.5 | Summarizing X-ray reports[103] | Radiology |
| Patient Education Materials Assessment Tool | Understandability | 81% | Summarizing discharge summaries[53] | No specific specialty |
| Cohen's Kappa | Reliability/Stability | 1.0 | Head and neck oncology board decisions: deciding on neoadjuvant chemotherapy and chemoradiotherapy treatment | Oncology |
| | Agreement with Expert or Ground Truth | 0.727 | Predicting the dichotomized modified Rankin Scale (mRS) score at 3 months post-thrombectomy[84] | Neurology |
| Cronbach's α | Agreement with Expert or Ground Truth | 0.754 | Managing otolaryngology cases[96] | Otolaryngology |
| Mann-Whitney U Test | Agreement with Expert or Ground Truth | 0.770 | Providing the number of additional examinations when managing otolaryngology cases[40] | Otolaryngology |
| Spearman's Coefficient | Reliability/Stability | 0.999 | Considering the patient's symptoms and physical findings reported by practitioners when managing otolaryngology cases[96] | Otolaryngology |
| Percentage of Identical Responses to Identical Queries | Reliability/Stability | 100% | Predicting hemoglobinopathies from a patient's laboratory results (CBC and ferritin values)[82] | Internal Medicine |
| Agreement Percentage | Agreement with Expert or Ground Truth | 80% | Determining disease severity for acute ulcerative colitis presentations in an emergency department setting[75] | Gastroenterology |
| Global Quality Scale | Quality | 4.2 | Analyzing retinal detachment cases and suggesting the best possible surgical planning[92] | Ophthalmology |
| Fleiss' Kappa | Reliability/Stability | 0.786 | Colonoscopy recommendations for colorectal cancer rescreening and surveillance[74] | Gastroenterology |
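Several of the scalar metrics above have closed-form definitions. As an illustration only (a minimal plain-Python sketch, not code from any of the reviewed studies), Cohen's kappa and root mean square error can be computed as:

```python
import math
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

def rmse(predicted, actual):
    """Root mean square error between paired numeric predictions and truth."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)
    )

# Two raters who always agree yield kappa = 1.0, the ceiling value
# reported for reliability in the table above.
print(cohens_kappa(["yes", "no", "yes", "no"], ["yes", "no", "yes", "no"]))
```

Note that kappa is undefined when expected agreement is exactly 1 (both raters always emit one identical label); production implementations guard that case.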
Similarity Measurements for Generative NLP Models

| Evaluation Metric | Spearman's ρ | Correlated Evaluation and Task (if reported) | Clinical Specialty | Best Reported Value | Task | Clinical Specialty |
|---|---|---|---|---|---|---|
| BLEU | 0.412 | Quality score of impression generation for whole-body PET reports[35] | Radiology | 24.7 | Impression generation for whole-body PET reports[35] | Radiology |
| | 0.225 | Completeness of clinical text summarization[36] | No specific specialty | | | |
| | 0.125 | Correctness of clinical text summarization[36] | | | | |
| | 0.075 | Conciseness of clinical text summarization[36] | | | | |
| BLEU-2 | | | | 74.5 | Generating a comprehensive and coherent medical report of a given medical image from COVID-19 data[39] | Internal Medicine |
| BLEU-3 | | | | 67.8 | | |
| BLEU-4 | | | | 63.2 | | |
| ROUGE-1 | 0.402 | Quality score of impression generation for whole-body PET reports[35] | Radiology | 57.29 | Clinical notes summarization[30] | No specific specialty |
| ROUGE-2 | 0.379 | Quality score of impression generation for whole-body PET reports[35] | Radiology | 44.32 | Clinical notes summarization[30] | No specific specialty |
| ROUGE-L | 0.22 | Completeness of clinical text summarization[36] | No specific specialty | 68.5 | Generating a comprehensive and coherent medical report of a given medical image from COVID-19 data[39] | Internal Medicine |
| | 0.16 | Correctness of clinical text summarization[36] | | | | |
| | 0.19 | Conciseness of clinical text summarization[36] | | | | |
| | 0.398 | Quality score of impression generation for whole-body PET reports[35] | Radiology | | | |
| BERTScore-Precision | | | | 86.57 | Clinical notes summarization[30] | No specific specialty |
| BERTScore-Recall | | | | 87.14 | | |
| BERTScore-F1 | 0.18 | Completeness of clinical text summarization[36] | No specific specialty | 89.4 | Summarizing longitudinal aneurysm reports[102] | Radiology |
| | 0.18 | Correctness of clinical text summarization[36] | | | | |
| | 0.24 | Conciseness of clinical text summarization[36] | | | | |
| | 0.407 | Quality score of impression generation for whole-body PET reports[35] | Radiology | | | |
| MEDCON | 0.125 | Completeness of clinical text summarization[36] | No specific specialty | 64.9 | Clinical text summarization[36] | No specific specialty |
| | 0.175 | Correctness of clinical text summarization[36] | | | | |
| | 0.15 | Conciseness of clinical text summarization[36] | | | | |
| CIDEr | 0.194 | Quality score of impression generation for whole-body PET reports[35] | Radiology | 97.5 | Generating a comprehensive and coherent medical report of a given medical image from COVID-19 data[39] | Internal Medicine |
| BARTScore+PET | 0.568 | | | −1.46 | Impression generation for whole-body PET reports[35] | Radiology |
| PEGASUSScore+PET | 0.563 | | | −1.44 | | |
| T5Score+PET | 0.542 | | | −1.41 | | |
| UniEval | 0.501 | | | 0.78 | | |
| BARTScore | 0.474 | | | −3.05 | | |
| CHRF | 0.433 | | | 42.2 | | |
| MoverScore | 0.420 | | | 0.607 | | |
| ROUGE-WE-1 | 0.403 | | | 54.8 | | |
| ROUGE-LSUM | 0.397 | | | 50.8 | | |
| ROUGE-WE-2 | 0.396 | | | 40.7 | | |
| METEOR | 0.388 | | | 0.279 | | |
| ROUGE-WE-3 | 0.385 | | | 42.5 | | |
| RadGraph | 0.384 | | | 0.397 | | |
| PRISM | 0.369 | | | −3.24 | | |
| ROUGE-3 | 0.345 | | | 20.5 | | |
| S3-pyr | 0.302 | | | 0.71 | | |
| S3-resp | 0.301 | | | 0.79 | | |
| Stats-novel trigram | 0.292 | | | 0.99 | | |
| Stats-density | 0.280 | | | 6.51 | | |
| BLANC | 0.165 | | | 0.131 | | |
| Stats-compression | 0.145 | | | 8.36 | | |
| SUPERT | 0.082 | | | 0.557 | | |
| Stats-coverage | 0.078 | | | 8.36 | | |
| SummaQA | 0.075 | | | 0.180 | | |
* Lower value represents better performance.
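The n-gram overlap scores in the second table (BLEU and the ROUGE family) all reduce to counting shared n-grams between a generated text and a reference. A minimal ROUGE-N F1 sketch (whitespace tokenization only; official ROUGE implementations add stemming and other preprocessing options):

```python
from collections import Counter

def rouge_n_f1(candidate, reference, n=1):
    """Simplified ROUGE-N F1: n-gram overlap of candidate vs. reference."""
    def ngram_counts(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngram_counts(candidate), ngram_counts(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Identical texts score 1.0; texts sharing no n-grams score 0.0.
print(rouge_n_f1("impression: no acute findings", "impression: no acute findings"))
```

Reported ROUGE values are conventionally scaled by 100 (e.g., the 57.29 ROUGE-1 above corresponds to an F1 of 0.5729).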