Table 3: Fine-tuned T5 models on various clinical tasks, with 95% confidence intervals computed using the bootstrap method. A/P denotes the assessment-and-plan relational labeling task. Summarization is evaluated with ROUGE-L, A/P with macro-F1, and SOAP with accuracy. The first row reports the best scores from the DR.BENCH paper, and an asterisk (*) in the other rows marks scores reported in the DR.BENCH paper for the respective task (Gao et al., 2023).
Model | Training | Summarization | SOAP | A/P |
---|---|---|---|---|
Gao et al., 2023 | Single-task | 7.60 (5.31 – 9.89) | 60.12 (59.33 – 60.90) | 80.09 (79.32 – 83.23) |
T5 220M | Single-task | 26.35 (22.18 – 30.52) | 60.12 (59.33 – 60.90)* | 73.31 (71.34 – 77.65)* |
T5 220M | Multi-task | 24.84 (20.28 – 29.40) | 56.63 (55.83 – 57.42) | 43.25 (41.35 – 66.59) |
T5 770M | Single-task | 26.90 (22.58 – 31.23) | 55.57 (54.78 – 56.35)* | 77.96 (75.38 – 81.60)* |
T5 770M | Multi-task | 23.99 (19.86 – 28.13) | 51.10 (50.32 – 51.91) | 75.15 (71.93 – 78.19) |
SciFive 220M | Single-task | 25.31 (21.45 – 29.17) | 57.74 (56.95 – 58.53)* | 76.76 (74.81 – 80.92)* |
SciFive 220M | Multi-task | 24.38 (19.99 – 28.78) | 54.86 (54.06 – 55.65) | 68.87 (65.50 – 72.12) |
SciFive 770M | Single-task | 27.31 (23.09 – 31.53) | 47.65 (46.85 – 48.47)* | 75.11 (73.10 – 79.42)* |
SciFive 770M | Multi-task | 25.31 (21.45 – 29.17) | 44.51 (43.72 – 45.29) | 77.50 (74.45 – 80.37) |
Clinical-T5 220M | Single-task | 25.35 (21.19 – 29.51) | 55.30 (54.51 – 56.11) | 80.44 (77.47 – 83.35) |
Clinical-T5 220M | Multi-task | 26.21 (21.92 – 30.49) | 52.41 (51.62 – 53.20) | 65.49 (62.08 – 68.76) |
Clinical-T5 770M | Single-task | 28.28 (24.17 – 32.38) | 52.82 (52.03 – 53.61) | 78.79 (75.76 – 81.66) |
Clinical-T5 770M | Multi-task | 28.55 (24.29 – 32.80) | 54.00 (53.21 – 54.80) | 80.58 (77.57 – 83.38) |
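The confidence intervals in Table 3 come from bootstrapping, but the caption does not specify the resampling procedure. The snippet below is a minimal percentile-bootstrap sketch, assuming per-example metric scores (e.g. per-note ROUGE-L on a 0–100 scale), 1,000 resamples, and a fixed seed; these are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores.

    Resamples the examples with replacement, recomputes the mean on
    each resample, and takes the (alpha/2, 1 - alpha/2) percentiles
    of the resampled means as the interval bounds.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    lower, upper = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), lower, upper

# Hypothetical per-note ROUGE-L scores; a real evaluation would pass
# one score per test example for the task being measured.
point, lo, hi = bootstrap_ci([26.4, 30.1, 22.8, 28.5, 24.0, 27.9])
print(f"{point:.2f} ({lo:.2f} – {hi:.2f})")  # mean with 95% CI bounds
```

The same routine would apply to the A/P and SOAP columns by substituting per-example macro-F1 contributions or 0/1 accuracy indicators for the ROUGE-L scores.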