Table 8.
Average F1-Box Score of each Specialist Versus the Consensus (Ground Truth)
| Specialist | F1-Box Score | | |
|---|---|---|---|
| | Mild | Moderate | Severe |
| A | 0.65 | 0.58 | 0.53 |
| B | 0.44 | 0.33 | 0.23 |
| C | 0.63 | 0.55 | 0.41 |
| D | 0.51 | 0.37 | 0.38 |
| E | 0.62 | 0.52 | 0.49 |
| Mean | 0.57 | 0.47 | 0.41 |
To make the results comparable to those of the models, we computed the F1-box score of each specialist against the ground truth in every validation split and then obtained the mean and SD. The “none” category (i.e., the healthy images) has been omitted because these images were not annotated, and agreement on this category does not need to be computed because it is always 1.
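The procedure described in this note (an F1-box score per validation split, then the mean and SD across splits) can be illustrated with a minimal sketch. The details here are assumptions made for illustration only, not the authors' stated method: boxes are `(x1, y1, x2, y2)` tuples, a specialist box counts as a true positive when its IoU with a still-unmatched consensus box is at least 0.5, matching is greedy, counts are pooled over all images in a split, and the SD is the sample standard deviation. The function names `iou`, `match_counts`, `f1_box_split`, and `summarize` are hypothetical.

```python
from statistics import mean, stdev


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_counts(specialist_boxes, consensus_boxes, iou_thr=0.5):
    """Greedily match specialist boxes to consensus boxes; return (tp, fp, fn).

    The 0.5 threshold and the greedy strategy are assumptions of this sketch.
    """
    unmatched = list(consensus_boxes)
    tp = 0
    for box in specialist_boxes:
        best = max(unmatched, key=lambda g: iou(box, g), default=None)
        if best is not None and iou(box, best) >= iou_thr:
            unmatched.remove(best)  # each consensus box is matched at most once
            tp += 1
    fp = len(specialist_boxes) - tp
    fn = len(consensus_boxes) - tp
    return tp, fp, fn


def f1_box_split(image_pairs, iou_thr=0.5):
    """F1 for one split from (specialist_boxes, consensus_boxes) pairs, one per image."""
    tp = fp = fn = 0
    for specialist_boxes, consensus_boxes in image_pairs:
        t, p, n = match_counts(specialist_boxes, consensus_boxes, iou_thr)
        tp, fp, fn = tp + t, fp + p, fn + n
    denom = 2 * tp + fp + fn
    # With no boxes on either side the annotations agree trivially.
    return 1.0 if denom == 0 else 2 * tp / denom


def summarize(split_scores):
    """Mean and sample SD of per-split scores (needs at least two splits)."""
    return mean(split_scores), stdev(split_scores)
```

With every image empty on both sides, `f1_box_split` returns 1.0, which mirrors the remark that agreement on the "none" category is always 1.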