Author manuscript; available in PMC: 2023 Jul 25.
Published in final edited form as: Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023(ClinicalNLP):78–85.

Table 3:

Fine-tuned T5 models on various clinical tasks, with 95% confidence intervals calculated using the bootstrapping method. A/P denotes the assessment and plan relational labeling task. Summarization uses ROUGE-L, A/P uses F1-macro, and SOAP uses accuracy as the evaluation metric. The first row reports the best scores from the DR.BENCH paper, and * in the other rows marks the scores for the respective task in the DR.BENCH paper (Gao et al., 2023).
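The intervals in the table come from bootstrap resampling of per-example scores. A minimal percentile-bootstrap sketch of how such a 95% CI can be computed (the function name, parameters, and resample count are illustrative, not taken from the paper):

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-example scores.

    Resamples the score list with replacement n_resamples times,
    computes the mean of each resample, and returns the empirical
    (alpha/2, 1 - alpha/2) percentiles of those means.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

With alpha = 0.05 this yields a 95% interval like those shown in the table; the interval width shrinks as the evaluation set grows, which is why the summarization CIs (small test set, high-variance ROUGE-L) are much wider than the SOAP ones.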

Model              Training     Summarization          SOAP                    A/P
Gao et al., 2023   Single task  7.60 (5.31 – 9.89)     60.12 (59.33 – 60.90)   80.09 (79.32 – 83.23)
T5 220M            Single task  26.35 (22.18 – 30.52)  60.12 (59.33 – 60.90)*  73.31 (71.34 – 77.65)*
                   Multi-task   24.84 (20.28 – 29.40)  56.63 (55.83 – 57.42)   43.25 (41.35 – 66.59)
T5 770M            Single task  26.90 (22.58 – 31.23)  55.57 (54.78 – 56.35)*  77.96 (75.38 – 81.60)*
                   Multi-task   23.99 (19.86 – 28.13)  51.10 (50.32 – 51.91)   75.15 (71.93 – 78.19)
SciFive 220M       Single task  25.31 (21.45 – 29.17)  57.74 (56.95 – 58.53)*  76.76 (74.81 – 80.92)*
                   Multi-task   24.38 (19.99 – 28.78)  54.86 (54.06 – 55.65)   68.87 (65.50 – 72.12)
SciFive 770M       Single task  27.31 (23.09 – 31.53)  47.65 (46.85 – 48.47)*  75.11 (73.10 – 79.42)*
                   Multi-task   25.31 (21.45 – 29.17)  44.51 (43.72 – 45.29)   77.50 (74.45 – 80.37)
Clinical-T5 220M   Single task  25.35 (21.19 – 29.51)  55.30 (54.51 – 56.11)   80.44 (77.47 – 83.35)
                   Multi-task   26.21 (21.92 – 30.49)  52.41 (51.62 – 53.20)   65.49 (62.08 – 68.76)
Clinical-T5 770M   Single task  28.28 (24.17 – 32.38)  52.82 (52.03 – 53.61)   78.79 (75.76 – 81.66)
                   Multi-task   28.55 (24.29 – 32.80)  54.00 (53.21 – 54.80)   80.58 (77.57 – 83.38)