Table 4:
Finetuned T5 models on various clinical tasks, with 95% confidence intervals calculated using the bootstrapping method. All evaluation metrics reported here are accuracy scores. The first row gives the best scores reported in the DR.BENCH paper, and an asterisk (*) in the other rows marks scores for the respective task from the DR.BENCH paper (Gao et al., 2023).
| Model | Training | EmrQA | MedNLI | MedQA |
|---|---|---|---|---|
| Gao et al., 2023 | Single task | 39.20 (34.63 – 43.78) | 84.88 (82.98 – 86.64) | 24.59 (22.31 – 27.02) |
| T5 220M | Single task | 33.40 (29.27 – 37.61)* | 79.75 (78.62 – 82.70)* | 22.55 (20.01 – 25.69)* |
| T5 220M | Multi-task | 38.48 (37.24 – 39.79) | 72.57 (70.18 – 74.82) | 21.75 (19.48 – 24.12) |
| T5 770M | Single task | 38.05 (33.56 – 42.58)* | 84.04 (82.14 – 85.86)* | 20.97 (18.77 – 23.25)* |
| T5 770M | Multi-task | 41.42 (40.16 – 42.72) | 83.19 (81.22 – 85.09) | 23.25 (20.97 – 25.61) |
| SciFive 220M | Single task | 37.28 (32.84 – 42.11)* | 82.84 (80.87 – 84.74)* | 22.78 (20.50 – 25.14)* |
| SciFive 220M | Multi-task | 40.08 (38.82 – 41.39) | 78.83 (76.72 – 80.94) | 21.52 (19.32 – 23.80) |
| SciFive 770M | Single task | 41.21 (39.93 – 42.49) | 83.89 (82.00 – 85.79) | 23.09 (20.82 – 25.37) |
| SciFive 770M | Multi-task | 41.26 (39.98 – 42.56) | 84.35 (82.49 – 86.22) | 23.72 (21.37 – 26.08) |
| Clinical-T5 220M | Single task | 41.35 (40.07 – 42.65) | 84.32 (82.42 – 86.15) | 21.92 (19.64 – 24.19) |
| Clinical-T5 220M | Multi-task | 40.30 (39.02 – 41.62) | 71.23 (68.92 – 73.56) | 22.46 (20.19 – 24.74) |
| Clinical-T5 770M | Single task | 42.69 (41.39 – 43.95) | 85.86 (85.02 – 88.47) | 24.27 (21.92 – 26.63) |
| Clinical-T5 770M | Multi-task | 42.61 (41.34 – 43.92) | 86.14 (84.32 – 87.90) | 25.84 (23.41 – 28.28) |
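The caption states that the 95% confidence intervals were computed with the bootstrapping method. The exact procedure is not specified in this table; the sketch below shows one common variant, a percentile bootstrap over per-example correctness indicators. The function name, the resample count, and the use of per-example 0/1 outcomes are all assumptions for illustration, not the authors' implementation.

```python
import random

def bootstrap_accuracy_ci(correct, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for accuracy.

    correct: list of per-example 0/1 indicators (1 = model answered correctly).
    Resamples the examples with replacement, recomputes accuracy on each
    resample, and takes the alpha/2 and 1 - alpha/2 percentiles as the CI.
    (Illustrative sketch only; not the procedure used in the paper.)
    """
    rng = random.Random(seed)
    n = len(correct)
    # Accuracy on each of n_resamples bootstrap resamples, sorted ascending.
    scores = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[int((1 - alpha / 2) * n_resamples) - 1]
    point = sum(correct) / n
    return point, lo, hi
```

For example, 80 correct answers out of 100 gives a point estimate of 0.80 with a bootstrap interval roughly spanning 0.72 – 0.88, analogous to the parenthesized ranges in the table.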