Table 4. Final scores: Precision (P), recall (R), and F1 scores at initial and final model revisions aggregated over 15 participants.
Level | Initial P | Initial R | Initial F1 | Final P (range) | Final P (mean) | Final R (range) | Final R (mean) | Final F1 (range) | Final F1 (mean) |
---|---|---|---|---|---|---|---|---|---|
Reports | 0.90 | 0.19 | 0.31 | [0.67, 0.90] | 0.77 ± 0.06 | [0.62, 0.81] | 0.72 ± 0.05 | [0.70, 0.79] | 0.75 ± 0.03 |
Sections | 0.86 | 0.20 | 0.32 | [0.73, 0.86] | 0.79 ± 0.04 | [0.45, 0.68] | 0.60 ± 0.07 | [0.57, 0.73] | 0.68 ± 0.04 |
Sentences | 0.84 | 0.13 | 0.22 | [0.75, 0.88] | 0.80 ± 0.04 | [0.36, 0.62] | 0.48 ± 0.06 | [0.50, 0.68] | 0.60 ± 0.04 |
Note: The initial model was trained on the same six encounters to bootstrap the learning cycle.
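
As a quick consistency check on the initial-model columns, the sketch below recomputes F1 from the reported precision and recall. It assumes F1 is the standard harmonic mean of P and R, which the table does not state explicitly; the function name `f1_score` and the `initial` dictionary are illustrative, not taken from the source.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of F1).

    Returns 0.0 when both inputs are zero to avoid dividing by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Initial-model (P, R, reported F1) values copied from Table 4.
initial = {
    "Reports": (0.90, 0.19, 0.31),
    "Sections": (0.86, 0.20, 0.32),
    "Sentences": (0.84, 0.13, 0.22),
}

for level, (p, r, reported) in initial.items():
    computed = f1_score(p, r)
    print(f"{level}: computed F1 = {computed:.3f}, reported F1 = {reported:.2f}")
```

The recomputed values agree with the reported initial F1 scores to within 0.01. Small differences are expected because the tabulated P and R are themselves rounded to two decimals. The same check does not apply to the final-model means, since the mean of per-participant F1 scores is generally not the F1 of the mean precision and mean recall.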