Table 6. Fine-tuning hyperparameters of the language models, together with the sources of their implementations and pretrained weights.
Language model | Number of trainable parameters | Learning rate | Total batch size | Number of training epochs | Implementation and pretrained weights
---|---|---|---|---|---
PGN | 8.3 M | 1e-3 * | 25 * | 30 * | https://github.com/yuhaozhang/summarize-radiology-findings |
BERT2BERT | 301.7 M | 1e-4 | 32 | 15 | https://huggingface.co/yikuan8/Clinical-Longformer |
BART | 406.3 M | 5e-5 | 32 | 15 | https://huggingface.co/facebook/bart-large |
BioBART | 406.3 M | 5e-5 | 32 | 15 | https://huggingface.co/GanjinZero/biobart-large |
PEGASUS | 568.7 M | 2e-4 | 32 | 15 | https://huggingface.co/google/pegasus-large |
T5 | 783.2 M | 4e-4 | 32 | 15 | https://huggingface.co/google/t5-v1_1-large |
Clinical-T5 | 737.7 M | 4e-4 | 32 | 15 | https://huggingface.co/luqh/ClinicalT5-large |
FLAN-T5 | 783.2 M | 4e-4 | 32 | 15 | https://huggingface.co/google/flan-t5-large |
GPT2 | 1.5 B | 5e-5 | 32 | 15 | https://huggingface.co/gpt2-xl |
OPT | 1.3 B | 1e-4 | 32 | 15 | https://huggingface.co/facebook/opt-1.3b |
LLaMA-LoRA | 4.2 M | 2e-4 | 128 | 20 | available upon request |
Alpaca-LoRA | 4.2 M | 2e-4 | 128 | 20 | https://huggingface.co/tatsu-lab/alpaca-7b-wdiff |
Note that “*” denotes hyperparameters taken directly from the original paper. Total batch size = training batch size per device × number of GPU devices × gradient accumulation steps.
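For illustration, a total batch size of 32 can be realized, for example, as a per-device batch size of 8 on 4 GPUs with no gradient accumulation. The sketch below shows how this decomposition maps onto Hugging Face `TrainingArguments`; the particular split (8 × 4 × 1) and the output directory are assumptions, not the configuration reported in the table, and the learning rate and epoch count are taken from the BART row.

```python
# Minimal sketch: how the "Total batch size" column relates to Hugging Face
# TrainingArguments. The split 8 (per device) x 4 (GPUs) x 1 (accumulation) = 32
# is an illustrative assumption; only the product is fixed by the table.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bart-large-finetuned",  # hypothetical output path
    per_device_train_batch_size=8,      # batch size on each GPU (assumed)
    gradient_accumulation_steps=1,      # gradient accumulation steps (assumed)
    learning_rate=5e-5,                 # BART row of Table 6
    num_train_epochs=15,                # as in Table 6
)

n_gpus = 4                              # assumed number of GPU devices
# Total batch size = per-device batch size x number of GPUs x accumulation steps
total_batch_size = (
    args.per_device_train_batch_size * n_gpus * args.gradient_accumulation_steps
)
assert total_batch_size == 32
```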
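The roughly 4.2 M trainable parameters of the LoRA variants are consistent with rank-8 adapters on the query and value projections of a 7B LLaMA-class model (32 layers × 2 projections × 2 × 4096 × 8 = 4,194,304). The sketch below, using the `peft` library, is an assumed adapter configuration rather than the authors' exact setup; the model path and the `lora_alpha`, `lora_dropout`, and `target_modules` values are illustrative assumptions.

```python
# Minimal sketch (assumed configuration): rank-8 LoRA adapters on the query and
# value projections of a LLaMA-7B-class model yield about 4.2 M trainable weights.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")  # placeholder path

lora_config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # query/value projections (assumed)
    lora_dropout=0.05,                    # assumed
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
# Expected to report about 4.2 M trainable parameters:
# 32 layers x 2 projections x 2 x (4096 x 8) = 4,194,304
model.print_trainable_parameters()
```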