
Table 4:

Fine-tuned T5 models on various clinical tasks, with 95% confidence intervals calculated using the bootstrap method. All evaluation metrics reported here are accuracy scores. The first row shows the best scores reported in the DR.BENCH paper, and an asterisk (*) in the other rows marks scores that were also reported for the respective task in the DR.BENCH paper (Gao et al., 2023).

| Model | Training | EmrQA | MedNLI | MedQA |
| --- | --- | --- | --- | --- |
| Gao et al., 2023 | Single task | 39.20 (34.63 – 43.78) | 84.88 (82.98 – 86.64) | 24.59 (22.31 – 27.02) |
| T5 220M | Single task | 33.40 (29.27 – 37.61)* | 79.75 (78.62 – 82.70)* | 22.55 (20.01 – 25.69)* |
| T5 220M | Multi-task | 38.48 (37.24 – 39.79) | 72.57 (70.18 – 74.82) | 21.75 (19.48 – 24.12) |
| T5 770M | Single task | 38.05 (33.56 – 42.58)* | 84.04 (82.14 – 85.86)* | 20.97 (18.77 – 23.25)* |
| T5 770M | Multi-task | 41.42 (40.16 – 42.72) | 83.19 (81.22 – 85.09) | 23.25 (20.97 – 25.61) |
| SciFive 220M | Single task | 37.28 (32.84 – 42.11)* | 82.84 (80.87 – 84.74)* | 22.78 (20.50 – 25.14)* |
| SciFive 220M | Multi-task | 40.08 (38.82 – 41.39) | 78.83 (76.72 – 80.94) | 21.52 (19.32 – 23.80) |
| SciFive 770M | Single task | 41.21 (39.93 – 42.49) | 83.89 (82.00 – 85.79) | 23.09 (20.82 – 25.37) |
| SciFive 770M | Multi-task | 41.26 (39.98 – 42.56) | 84.35 (82.49 – 86.22) | 23.72 (21.37 – 26.08) |
| Clinical-T5 220M | Single task | 41.35 (40.07 – 42.65) | 84.32 (82.42 – 86.15) | 21.92 (19.64 – 24.19) |
| Clinical-T5 220M | Multi-task | 40.30 (39.02 – 41.62) | 71.23 (68.92 – 73.56) | 22.46 (20.19 – 24.74) |
| Clinical-T5 770M | Single task | 42.69 (41.39 – 43.95) | 85.86 (85.02 – 88.47) | 24.27 (21.92 – 26.63) |
| Clinical-T5 770M | Multi-task | 42.61 (41.34 – 43.92) | 86.14 (84.32 – 87.90) | 25.84 (23.41 – 28.28) |
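The intervals above are described as 95% confidence intervals for accuracy obtained by bootstrapping. The table does not specify the exact bootstrap settings, so the sketch below illustrates one common variant (a percentile bootstrap over per-example correctness indicators, with a hypothetical resample count and seed):

```python
import numpy as np

def bootstrap_accuracy_ci(correct, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy.

    `correct` is a 0/1 array marking whether each test example was
    predicted correctly. The resample count, seed, and percentile
    method are illustrative assumptions, not the paper's exact setup.
    """
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct)
    n = len(correct)
    accuracies = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample test examples with replacement and recompute accuracy
        idx = rng.integers(0, n, size=n)
        accuracies[i] = correct[idx].mean()
    lower = np.percentile(accuracies, 100 * alpha / 2)
    upper = np.percentile(accuracies, 100 * (1 - alpha / 2))
    return correct.mean(), lower, upper

# Toy usage: point estimate and 95% CI formatted like the table entries
acc, lo, hi = bootstrap_accuracy_ci(np.random.binomial(1, 0.4, size=500))
print(f"{100 * acc:.2f} ({100 * lo:.2f} - {100 * hi:.2f})")
```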