**Table 2.** Model performance in the automatic assessment (57,601 QA pairs from 2,028 participants in the test set) and the manual assessment (2,692 QA pairs from 100 participants sampled from the test set)
**A. Automatic assessment: language-based metrics**

| BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | ROUGE-L |
|---|---|---|---|---|---|
| 0.42 | 0.35 | 0.30 | 0.26 | 0.23 | 0.30 |
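As a reference for how the Part A scores are typically obtained, the sketch below computes BLEU-1 through BLEU-4 and ROUGE-L for a single hypothetical QA pair using `nltk` and `rouge-score`; the example answers, whitespace tokenization, and smoothing choice are assumptions, not the authors' exact pipeline. CIDEr additionally requires corpus-level TF-IDF statistics (e.g., via `pycocoevalcap`) and is omitted here.

```python
# Minimal sketch: per-pair BLEU-1..4 and ROUGE-L, as reported in Part A.
# The reference/candidate answers are hypothetical; corpus-level averaging
# over all 57,601 QA pairs is not shown.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "mild diabetic retinopathy with scattered microaneurysms"
candidate = "mild diabetic retinopathy with microaneurysms"

smooth = SmoothingFunction().method1  # avoids zero scores on short answers
for n in range(1, 5):
    weights = tuple([1.0 / n] * n + [0.0] * (4 - n))  # uniform up to n-grams
    bleu = sentence_bleu([reference.split()], candidate.split(),
                         weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {bleu:.2f}")

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure
print(f"ROUGE-L: {rouge_l:.2f}")
```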
**B. Automatic assessment: disease-based metrics**

| | Specificity | Accuracy | Precision | Sensitivity | F1 score |
|---|---|---|---|---|---|
| *Answer classification* | | | | | |
| Binary choice, all^a | 0.56 | 0.60 | 0.75 | 0.31 | 0.44 |
| Binary choice, 100^b (before) | 0.81 | 0.70 | 0.80 | 0.61 | 0.69 |
| Binary choice, 100^b (after) | 0.85 | 0.76 | 0.84 | 0.67 | 0.74 |
| Multiple choice, all^a | 0.79 | 0.68 | 0.37 | 0.36 | 0.35 |
| Multiple choice, 100^b (before) | 0.39 | 0.39 | 0.39 | 0.55 | 0.46 |
| Multiple choice, 100^b (after) | 0.41 | 0.43 | 0.42 | 0.58 | 0.49 |
| *Condition classification* | | | | | |
| Microaneurysm | 0.96 | 0.94 | 0.89 | 0.86 | 0.87 |
| Diabetic retinopathy | 0.98 | 0.94 | 0.79 | 0.90 | 0.84 |
| Arteriosclerosis | 0.94 | 0.87 | 0.68 | 0.77 | 0.72 |
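The five disease-based metrics in Part B all derive from a confusion matrix over the extracted answer labels. The sketch below shows the binary case with hypothetical labels; the label-extraction step from generated text is assumed, and the multiple-choice and condition rows would be computed analogously (e.g., one-vs-rest per class).

```python
# Minimal sketch: specificity, accuracy, precision, sensitivity, and F1
# for a binary classification task, matching the Part B column definitions.
# y_true/y_pred are hypothetical stand-ins for gold vs. model-derived labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # gold labels (1 = condition present)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # labels parsed from generated answers

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
sensitivity = tp / (tp + fn)  # also called recall
f1 = 2 * precision * sensitivity / (precision + sensitivity)
print(f"specificity={specificity:.2f} accuracy={accuracy:.2f} "
      f"precision={precision:.2f} sensitivity={sensitivity:.2f} f1={f1:.2f}")
```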
**C. Manual assessment: error types of QA pairs assessed by ophthalmologists**

| Error type | Rater 1, N (%) | Rater 2, N (%) | Kappa |
|---|---|---|---|
| Unrelated information | 94 (3.5%) | 81 (3.0%) | 0.746 |
| Factual error | 474 (17.6%) | 475 (17.6%) | 0.835 |
| Omission | 109 (4.0%) | 80 (3.0%) | 0.743 |
| Insufficient information | 86 (3.2%) | 77 (2.9%) | 0.741 |
BLEU, bilingual evaluation understudy; CIDEr, consensus-based image description evaluation; ROUGE-L, recall-oriented understudy for gisting evaluation-longest common subsequence.

^a The metrics were calculated using all the data from the test set.

^b The metrics were calculated using a randomly sampled subset of the test set, comprising 100 participants.
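For the inter-rater agreement reported in Part C, Cohen's kappa can be computed from the two ophthalmologists' per-QA-pair annotations. The sketch below uses hypothetical binary labels (error flagged or not for each QA pair) via scikit-learn; the actual annotation scheme is not reproduced here.

```python
# Minimal sketch: Cohen's kappa between two raters' error annotations,
# as reported in the Kappa column of Part C. Labels are hypothetical.
from sklearn.metrics import cohen_kappa_score

rater1 = [1, 0, 0, 1, 0, 0, 1, 0]  # 1 = error flagged for that QA pair
rater2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(f"kappa: {cohen_kappa_score(rater1, rater2):.3f}")
```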