Table 9. Evaluation scores on the objective_results division.

Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | BLEURT | MEDCON | Average
---|---|---|---|---|---|---|---
Retrieval-based | |||||||
train-UMLS | 30.26 | 14.89 | 29.87 | 66.24 | 37.25 | 8.91 | 34.35 |
train-sent | 40.52 | 18.21 | 38.87 | 73.33 | 45.79 | 12.45 | 41.03 |
BART-based | |||||||
BART | 0.00 | 0.00 | 0.00 | 0.00 | 5.45 | 0.00 | 1.36 |
BART (Division) | 30.48 | 19.16 | 27.80 | 66.64 | 43.07 | 21.56 | 39.27 |
BART + FT-SAMSum | 20.79 | 0.46 | 20.67 | 54.54 | 28.32 | 0.77 | 24.40 |
BART + FT-SAMSum (Division) | 29.45 | 18.01 | 26.63 | 66.43 | 40.75 | 20.17 | 38.01 |
BioBART | 17.50 | 0.00 | 17.50 | 52.44 | 25.33 | 0.00 | 22.36 |
BioBART (Division) | 35.38 | 14.33 | 32.79 | 68.40 | 47.63 | 15.69 | 39.81 |
LED-based | |||||||
LED | 0.00 | 0.00 | 0.00 | 0.00 | 5.45 | 0.00 | 1.36 |
LED (Division) | 14.04 | 4.97 | 11.08 | 48.86 | 9.61 | 7.86 | 19.09 |
LED + FT-PubMed | 0.00 | 0.00 | 0.00 | 0.00 | 5.45 | 0.00 | 1.36 |
LED + FT-PubMed (Division) | 10.48 | 3.64 | 8.32 | 42.43 | 7.13 | 8.86 | 16.48 |
OpenAI (w/o FT) | |||||||
Text-Davinci-002 | 41.48 | 20.12 | 39.95 | 70.61 | 50.79 | 24.42 | 44.92 |
Text-Davinci-003 | 44.92 | 25.21 | 43.84 | 72.35 | 55.87 | 29.37 | 48.90 |
ChatGPT | 34.50 | 17.75 | 30.84 | 66.68 | 48.51 | 22.28 | 41.29 |
GPT-4 | 37.65 | 19.94 | 35.73 | 68.33 | 48.50 | 26.73 | 43.67 |
Similar to objective_exam, the BART and LED full-note generation models suffered a significant drop on the objective_results division. This may be attributable to the higher sparsity of this division, its small amount of content (sometimes only 2-3 sentences), and the appearance of its text later in the input sequence. The OpenAI models were generally the strongest performers, with the BART division-based models next best.
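As a sanity check on the Average column, the reported values appear consistent with taking the mean of four components: the averaged ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L), BERTScore, BLEURT, and MEDCON. The sketch below illustrates this assumed aggregation; the `aggregate_score` helper is illustrative and inferred from the table values, not taken from the original evaluation code.

```python
def aggregate_score(rouge1: float, rouge2: float, rougeL: float,
                    bertscore: float, bleurt: float, medcon: float) -> float:
    """Assumed aggregation: mean of averaged ROUGE, BERTScore, BLEURT, and MEDCON.

    All inputs are scores on a 0-100 scale, as reported in Table 9.
    """
    rouge_mean = (rouge1 + rouge2 + rougeL) / 3
    return (rouge_mean + bertscore + bleurt + medcon) / 4

# Example: the Text-Davinci-003 row of Table 9.
avg = aggregate_score(44.92, 25.21, 43.84, 72.35, 55.87, 29.37)
print(f"{avg:.2f}")  # roughly 48.90, matching the reported Average
```

Under this reading, the Average weights the three ROUGE variants jointly as one component rather than counting each of the six metrics equally.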