Table 3: Fine-tuned T5 models on various clinical tasks, with 95% confidence intervals computed using the bootstrap method. A/P denotes the assessment-and-plan relational labeling task. Summarization is evaluated with ROUGE-L, A/P with macro-F1, and SOAP with accuracy. The first row reports the best scores from the DR.BENCH paper, and an asterisk (*) in the other rows marks scores reported in the DR.BENCH paper for the respective task (Gao et al., 2023).
Model | Training | Summarization | SOAP | A/P |
---|---|---|---|---|
Gao et al., 2023 | Single-task | 7.60 (5.31 – 9.89) | 60.12 (59.33 – 60.90) | 80.09 (79.32 – 83.23) |
T5 220M | Single-task | 26.35 (22.18 – 30.52) | 60.12 (59.33 – 60.90)* | 73.31 (71.34 – 77.65)* |
T5 220M | Multi-task | 24.84 (20.28 – 29.40) | 56.63 (55.83 – 57.42) | 43.25 (41.35 – 66.59) |
T5 770M | Single-task | 26.90 (22.58 – 31.23) | 55.57 (54.78 – 56.35)* | 77.96 (75.38 – 81.60)* |
T5 770M | Multi-task | 23.99 (19.86 – 28.13) | 51.10 (50.32 – 51.91) | 75.15 (71.93 – 78.19) |
SciFive 220M | Single-task | 25.31 (21.45 – 29.17) | 57.74 (56.95 – 58.53)* | 76.76 (74.81 – 80.92)* |
SciFive 220M | Multi-task | 24.38 (19.99 – 28.78) | 54.86 (54.06 – 55.65) | 68.87 (65.50 – 72.12) |
SciFive 770M | Single-task | 27.31 (23.09 – 31.53) | 47.65 (46.85 – 48.47)* | 75.11 (73.10 – 79.42)* |
SciFive 770M | Multi-task | 25.31 (21.45 – 29.17) | 44.51 (43.72 – 45.29) | 77.50 (74.45 – 80.37) |
Clinical-T5 220M | Single-task | 25.35 (21.19 – 29.51) | 55.30 (54.51 – 56.11) | 80.44 (77.47 – 83.35) |
Clinical-T5 220M | Multi-task | 26.21 (21.92 – 30.49) | 52.41 (51.62 – 53.20) | 65.49 (62.08 – 68.76) |
Clinical-T5 770M | Single-task | 28.28 (24.17 – 32.38) | 52.82 (52.03 – 53.61) | 78.79 (75.76 – 81.66) |
Clinical-T5 770M | Multi-task | 28.55 (24.29 – 32.80) | 54.00 (53.21 – 54.80) | 80.58 (77.57 – 83.38) |
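The confidence intervals in Table 3 come from bootstrapping, but the caption does not specify the resampling procedure. The snippet below is a minimal percentile-bootstrap sketch, assuming per-example metric scores (e.g. per-note ROUGE-L on a 0–100 scale), 1,000 resamples, and a fixed seed; these are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores.

    Resamples the examples with replacement, recomputes the mean on
    each resample, and takes the (alpha/2, 1 - alpha/2) percentiles
    of the resampled means as the interval bounds.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    lower, upper = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), lower, upper

# Hypothetical per-note ROUGE-L scores; a real evaluation would pass
# one score per test example for the task being measured.
point, lo, hi = bootstrap_ci([26.4, 30.1, 22.8, 28.5, 24.0, 27.9])
print(f"{point:.2f} ({lo:.2f} – {hi:.2f})")  # mean with 95% CI bounds
```

The same routine would apply to the A/P and SOAP columns by substituting per-example macro-F1 contributions or 0/1 accuracy indicators for the ROUGE-L scores.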