Table 10. Evaluation scores on the assessment_and_plan division.

Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | BLEURT | medcon | Average |
---|---|---|---|---|---|---|---|
Retrieval-based | |||||||
trainUMLS | 44.59 | 21.50 | 29.66 | 70.39 | 44.77 | 24.70 | 42.94 |
trainsent | 41.28 | 19.73 | 28.02 | 69.48 | 43.18 | 18.79 | 40.28 |
BART-based | |||||||
BART | 0.00 | 0.00 | 0.00 | 0.00 | 29.05 | 0.00 | 7.26 |
BART (Division) | 43.31 | 20.59 | 26.55 | 67.49 | 40.99 | 32.30 | 42.73 |
BART + FTSAMSum | 1.52 | 0.49 | 0.87 | 35.38 | 19.79 | 1.00 | 14.28 |
BART + FTSAMSum (Division) | 43.89 | 21.37 | 27.56 | 68.09 | 41.96 | 31.33 | 43.08 |
BioBART | 0.00 | 0.00 | 0.00 | 0.00 | 29.05 | 0.00 | 7.26 |
BioBART (Division) | 42.44 | 19.44 | 26.42 | 67.57 | 43.88 | 31.12 | 43.00 |
LED-based | |||||||
LED | 0.00 | 0.00 | 0.00 | 0.00 | 29.05 | 0.00 | 7.26 |
LED (Division) | 28.23 | 6.13 | 12.44 | 55.75 | 27.78 | 21.94 | 30.27 |
LED + FTpubMed | 0.00 | 0.00 | 0.00 | 0.00 | 29.05 | 0.00 | 7.26 |
LED + FTpubMed (Division) | 28.00 | 5.99 | 13.07 | 55.68 | 20.95 | 25.01 | 29.33 |
OpenAI (w/o FT) | |||||||
Text-Davinci-002 | 30.90 | 12.27 | 21.44 | 61.01 | 44.98 | 35.04 | 40.64 |
Text-Davinci-003 | 35.41 | 14.86 | 25.38 | 63.97 | 49.18 | 46.40 | 46.19 |
ChatGPT | 36.43 | 12.50 | 23.32 | 63.56 | 48.21 | 43.71 | 44.89 |
GPT-4 | 38.16 | 14.12 | 24.90 | 64.26 | 49.41 | 42.36 | 45.44 |
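The Average column is consistent with first averaging the three ROUGE variants into a single ROUGE score and then taking the mean of that value with BERTScore, BLEURT, and medcon. A minimal sketch of this aggregation, which is inferred from the reported numbers rather than stated in the table itself:

```python
# Assumed aggregation for the "Average" column: mean of (mean ROUGE,
# BERTScore, BLEURT, medcon). Inferred from the reported values, not
# documented in the table.
def overall_average(rouge1, rouge2, rougeL, bertscore, bleurt, medcon):
    rouge_mean = (rouge1 + rouge2 + rougeL) / 3
    return (rouge_mean + bertscore + bleurt + medcon) / 4

# The trainUMLS row reproduces the reported 42.94.
print(round(overall_average(44.59, 21.50, 29.66, 70.39, 44.77, 24.70), 2))  # 42.94
```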
As with objective_exam and objective_results, the BART and LED full-note generation models suffered a significant drop on the assessment_and_plan division. This may be attributable to this section's text appearing later in the sequence. The OpenAI models were in general the strongest performers, with the BART division-based models next best.
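A minimal sketch of the suspected length effect, assuming a BART-style tokenizer with a fixed maximum sequence length (the checkpoint name and the limit below are illustrative, not taken from the experiments):

```python
from transformers import AutoTokenizer

# Illustrative only: sections that appear last in a long clinical note are
# the first to be lost to a fixed maximum sequence length, whether the limit
# is hit on the source transcript or on the generated note.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")

note = (
    "SUBJECTIVE: ... "
    "OBJECTIVE EXAM: ... "
    "OBJECTIVE RESULTS: ... "
    "ASSESSMENT AND PLAN: ..."  # appears last in the note
)

# With a small max_length the trailing ASSESSMENT AND PLAN text is dropped,
# which would yield near-zero overlap scores for that division.
ids = tokenizer(note, truncation=True, max_length=16)["input_ids"]
print(tokenizer.decode(ids))
```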