Table 2. Fine-tuned LLMs in the reviewed studies: fine-tuning algorithms, hardware, and key findings. Illustrative configuration sketches for the fine-tuning algorithms follow the table.
| LLMs That Were Fine-Tuned | Fine-Tuning Algorithm | Fine-Tuning Hardware | Summary of Findings Related to Fine-Tuning |
|---|---|---|---|
| Llama | Full Parameter - DeepSpeed | 2× NVIDIA A100 GPUs | Disease diagnosis: No comparison between the fine-tuned and original models was reported. Fine-tuned Llama achieved 34.9% accuracy, a 22.3% macro F1 score, and a 28.7% micro F1 score [38]. |
| Llama | PEFT-LoRA | 1× NVIDIA RTX A6000 GPU | Predicting diagnosis-related groups (DRGs) for hospitalized patients: No comparison between the fine-tuned and original models was reported. The DRG-LLaMA-7B model achieved a macro-averaged F1 score of 0.327, a top-1 prediction accuracy of 52.0%, and a macro-averaged area under the curve (AUC) of 0.986 [109]. 1) A larger base model led to better fine-tuned performance: the best diagnosis accuracy of fine-tuned Llama-13B was 54.6%, versus 53.9% for the fine-tuned 7B model. 2) A longer input context in the fine-tuning data led to better performance: for fine-tuned Llama-13B, the best diagnosis accuracy was 49.9% with a maximum input size of 340 tokens but rose to 54.6% with a maximum of 1,024 tokens. |
| Llama 2; FLAN-T5; FLAN-UL2; Vicuna; Alpaca | PEFT-QLoRA | 1× NVIDIA Quadro RTX 8000 | Clinical text summarization: Fine-tuned FLAN-T5 improved the MEDCON score from 5 to a range of 26–69 across four datasets [36]. 1) QLoRA-tuned FLAN-T5 was the best-performing fine-tuned open-source model, achieving MEDCON scores of 59 on Open-i, 38 on MIMIC-CXR, 26 on MIMIC-III, and 46 on patient questions. 2) QLoRA typically outperformed in-context learning (ICL) with the better models (FLAN-T5 and Llama 2); however, given enough in-context examples (from 1 to 64), every model surpassed even the best QLoRA fine-tuned model, FLAN-T5, on at least one dataset. 3) An LLM fine-tuned on domain-specific data performed worse than its original model: where Alpaca achieved a BLEU score of 30, Med-Alpaca reached only 20, highlighting the distinction between domain adaptation and task adaptation. |
| GPT-3; GPT-J; Falcon; Llama | PEFT-QLoRA | OpenAI's cloud resources | Phenotype recognition in clinical notes: No quantitative comparison between models before and after fine-tuning was reported. Fine-tuned GPT-3 performed best on one dataset (81.6% F1 score), and fine-tuned GPT-J performed best on the other (83.2% F1 score) [59]. |
| BART; PEGASUS; T5; FLAN-T5; BioBART; Clinical-T5; GPT-2; OPT; Llama; Alpaca | PEFT-LoRA for Llama and Alpaca; full parameter tuning for the other models | At least 2× NVIDIA A100 GPUs | Generating personalized impressions for whole-body PET reports: Biomedical-domain pretrained LLMs did not outperform their base models; the domain-specific fine-tuned BART model reduced accuracy from 75.3% to 73.9%. The authors attributed this to two factors: first, the large training set diminished the benefits of medical-domain adaptation; second, corpora such as MIMIC-III and PubMed likely contained limited PET-related content, making pretraining less effective for the task [35]. |
| Llama 2 | No mention | No mention | Predicting opioid use disorder (OUD), substance use disorder (SUD), and diabetes: Fine-tuned Llama 2 achieved AUROCs of 92%, 93%, 74%, and 88% on four datasets for predicting SUD; 95%, 72%, 73%, and 98% for predicting OUD; and 88%, 76%, 64%, and 94% for predicting diabetes [34]. 1) An instruction-variation experiment suggested that fine-tuning on the study's datasets might have induced catastrophic forgetting, particularly with large volumes of data. 2) Fine-tuned Llama 2 outperformed Llama 2 without fine-tuning on diabetes prediction (AUROC increased from 50% to 88%). |
| Llama 2-7B; BioGPT-Large | Full Parameter - DeepSpeed | 4× NVIDIA A40 GPUs | Differential diagnosis in PICU patients: Fine-tuned models outperformed their original models, and a smaller LM fine-tuned on domain-specific notes outperformed much larger models trained on general-domain data [64]. Specifically: 1) fine-tuned Llama-7B achieved an average quality score of 2.88, versus 2.65 for Llama-65B without fine-tuning; 2) fine-tuned BioGPT-Large achieved an average quality score of 2.78, versus 2.02 for BioGPT-Large without fine-tuning. |
| BART; GPT; MedLM | PEFT-LoRA | No mention | Early detection of gout flares from nurses' chief-complaint notes in the emergency department: No comparison between models before and after fine-tuning was reported. The fine-tuned BART model (BioBART) performed best, achieving F1 scores of 0.73 and 0.67 on the GOUT-CC-2019-CORPUS and GOUT-CC-2020-CORPUS datasets [71]. |
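For readers unfamiliar with the techniques in the Fine-Tuning Algorithm column, the sketches below illustrate how each is typically configured. First, PEFT-LoRA: a minimal sketch using the Hugging Face `peft` library, assuming a Llama-family causal LM. The model name, rank, and target modules are illustrative defaults, not the settings used in the cited studies.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the frozen base model (illustrative checkpoint, not a study's exact model).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor for the adapter output
    target_modules=["q_proj", "v_proj"],   # attach adapters to attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wrap the base model; only the small adapter matrices receive gradients.
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model's weights
```

Because only the adapters are trained, LoRA fits on a single GPU such as the RTX A6000 used in the DRG study.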
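PEFT-QLoRA extends LoRA by loading the frozen base model in 4-bit NF4 precision via `bitsandbytes` before attaching adapters, which is what makes single-GPU fine-tuning of 7B-scale models practical. A minimal sketch, again with illustrative settings rather than the cited studies' configurations:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while weights stay 4-bit
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

# Base weights are quantized and frozen; adapters train in higher precision on top.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)
base = prepare_model_for_kbit_training(base)  # cast norms etc. for stable k-bit training

model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```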
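Full-parameter fine-tuning with DeepSpeed, as used in the disease-diagnosis and PICU studies, updates every weight and relies on ZeRO sharding to spread optimizer states and gradients across GPUs. A minimal sketch through the Hugging Face `Trainer`; the ZeRO stage, hyperparameters, and the `model`/`train_ds` placeholders are assumptions for illustration, not the studies' configurations.

```python
from transformers import TrainingArguments, Trainer

# ZeRO stage 2 shards optimizer states and gradients across GPUs;
# "auto" lets the HF integration fill values in from TrainingArguments.
ds_config = {
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="llama-full-ft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    bf16=True,
    deepspeed=ds_config,  # accepts a dict or a path to an equivalent JSON file
)

# `model` and `train_ds` are hypothetical placeholders for a loaded model and a
# tokenized dataset; launch with the `deepspeed` CLI on the target multi-GPU node.
# trainer = Trainer(model=model, args=args, train_dataset=train_ds)
# trainer.train()
```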