iScience. 2024 May 17;27(7):110021. doi: 10.1016/j.isci.2024.110021

Table 2.

Model performance in the automatic assessment (57,601 QA pairs of 2,028 participants in the test set) and manual assessment (2,692 QA pairs of 100 participants sampled from the test set)

A. Automatic assessment: language-based metrics
BLEU_1   BLEU_2   BLEU_3   BLEU_4   CIDEr   ROUGE-L
0.42     0.35     0.30     0.26     0.23    0.30
B. Automatic assessment: disease-based metrics
                                   Specificity   Accuracy   Precision   Sensitivity   F1 score

Answer classification
  Binary choice all^a                 0.56         0.60       0.75         0.31         0.44
  Binary choice 100^b before          0.81         0.70       0.80         0.61         0.69
  Binary choice 100^b after           0.85         0.76       0.84         0.67         0.74
  Multiple choice all^a               0.79         0.68       0.37         0.36         0.35
  Multiple choice 100^b before        0.39         0.39       0.39         0.55         0.46
  Multiple choice 100^b after         0.41         0.43       0.42         0.58         0.49

Condition classification
  Microaneurysm                       0.96         0.94       0.89         0.86         0.87
  Diabetic retinopathy                0.98         0.94       0.79         0.90         0.84
  Arteriosclerosis                    0.94         0.87       0.68         0.77         0.72
C. Manual assessment: error types of QA assessed by ophthalmologists
Error type                  Rater 1, N (%)   Rater 2, N (%)   Kappa
Unrelated information       94 (3.5%)        81 (3.0%)        0.746
Factual error               474 (17.6%)      475 (17.6%)      0.835
Omission                    109 (4.0%)       80 (3.0%)        0.743
Insufficient information    86 (3.2%)        77 (2.9%)        0.741

BLEU, bilingual evaluation understudy; CIDEr, consensus-based image description evaluation; ROUGE-L, recall-oriented understudy for gisting evaluation-longest common subsequence.

^a The metrics were calculated using all the data from the test set.

^b The metrics were calculated using a randomly sampled subset of the test set, comprising 100 participants.
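For readers who want to see how the summary statistics in panels A–C can be derived from per-question outputs, the snippet below is a minimal sketch using NLTK and scikit-learn. It is not the authors' evaluation pipeline: the reference/generated strings and the label arrays are hypothetical placeholders, and CIDEr and ROUGE-L are omitted because they require corpus-level tooling.

```python
# Minimal sketch of the Table 2 metrics; toy data only, not study data.
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score)

# --- Panel A: BLEU_1..BLEU_4 for one QA pair (hypothetical strings) ---
reference = "mild non-proliferative diabetic retinopathy in the right eye".split()
generated = "non-proliferative diabetic retinopathy in the right eye".split()
smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))          # uniform 1..n-gram weights
    score = sentence_bleu([reference], generated,
                          weights=weights, smoothing_function=smooth)
    print(f"BLEU_{n}: {score:.2f}")

# --- Panel B: disease-based metrics from binary answer labels (toy arrays) ---
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # ground-truth answer class
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # model answer class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Specificity:", tn / (tn + fp))
print("Accuracy:   ", accuracy_score(y_true, y_pred))
print("Precision:  ", precision_score(y_true, y_pred))
print("Sensitivity:", recall_score(y_true, y_pred))
print("F1 score:   ", f1_score(y_true, y_pred))

# --- Panel C: inter-rater agreement between two raters (Cohen's kappa) ---
rater1 = [1, 0, 0, 1, 0, 1, 0, 0]             # rater 1 error labels (toy)
rater2 = [1, 0, 0, 1, 0, 0, 0, 0]             # rater 2 error labels (toy)
print("Kappa:      ", cohen_kappa_score(rater1, rater2))
```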