iScience. 2024 May 17;27(7):110021. doi: 10.1016/j.isci.2024.110021

Table 2.

Model performance in the automatic assessment (57,601 QA pairs of 2,028 participants in the test set) and manual assessment (2,692 QA pairs of 100 participants sampled from the test set)

A. Automatic assessment: language-based metrics
BLEU_1   BLEU_2   BLEU_3   BLEU_4   CIDEr   ROUGE-L
0.42     0.35     0.30     0.26     0.23    0.30
B. Automatic assessment: disease-based metrics
                                   Specificity   Accuracy   Precision   Sensitivity   F1 score

Answer classification
  Binary choice all^a                 0.56         0.60       0.75         0.31         0.44
  Binary choice 100^b before          0.81         0.70       0.80         0.61         0.69
  Binary choice 100^b after           0.85         0.76       0.84         0.67         0.74
  Multiple choice all^a               0.79         0.68       0.37         0.36         0.35
  Multiple choice 100^b before        0.39         0.39       0.39         0.55         0.46
  Multiple choice 100^b after         0.41         0.43       0.42         0.58         0.49

Condition classification
  Microaneurysm                       0.96         0.94       0.89         0.86         0.87
  Diabetic retinopathy                0.98         0.94       0.79         0.90         0.84
  Arteriosclerosis                    0.94         0.87       0.68         0.77         0.72
C. Manual assessment: error types of QA assessed by ophthalmologists
Error type                  Rater 1, N (%)   Rater 2, N (%)   Kappa
Unrelated information       94 (3.5%)        81 (3.0%)        0.746
Factual error               474 (17.6%)      475 (17.6%)      0.835
Omission                    109 (4.0%)       80 (3.0%)        0.743
Insufficient information    86 (3.2%)        77 (2.9%)        0.741

BLEU, bilingual evaluation understudy; CIDEr, consensus-based image description evaluation; ROUGE-L, recall-oriented understudy for gisting evaluation-longest common subsequence.

^a The metrics were calculated using all the data from the test set.

^b The metrics were calculated using a randomly sampled subset of the test set, comprising 100 participants.
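For readers who want to see how the summary statistics in panels A–C can be derived from per-question outputs, the snippet below is a minimal sketch using NLTK and scikit-learn. It is not the authors' evaluation pipeline: the reference/generated strings and the label arrays are hypothetical placeholders, and CIDEr and ROUGE-L are omitted because they require corpus-level tooling.

```python
# Minimal sketch of the Table 2 metrics; toy data only, not study data.
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score)

# --- Panel A: BLEU_1..BLEU_4 for one QA pair (hypothetical strings) ---
reference = "mild non-proliferative diabetic retinopathy in the right eye".split()
generated = "non-proliferative diabetic retinopathy in the right eye".split()
smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))          # uniform 1..n-gram weights
    score = sentence_bleu([reference], generated,
                          weights=weights, smoothing_function=smooth)
    print(f"BLEU_{n}: {score:.2f}")

# --- Panel B: disease-based metrics from binary answer labels (toy arrays) ---
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # ground-truth answer class
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # model answer class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Specificity:", tn / (tn + fp))
print("Accuracy:   ", accuracy_score(y_true, y_pred))
print("Precision:  ", precision_score(y_true, y_pred))
print("Sensitivity:", recall_score(y_true, y_pred))
print("F1 score:   ", f1_score(y_true, y_pred))

# --- Panel C: inter-rater agreement between two raters (Cohen's kappa) ---
rater1 = [1, 0, 0, 1, 0, 1, 0, 0]             # rater 1 error labels (toy)
rater2 = [1, 0, 0, 1, 0, 0, 0, 0]             # rater 2 error labels (toy)
print("Kappa:      ", cohen_kappa_score(rater1, rater2))
```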