Preprint (version 2), posted 2024 Aug 19. doi: 10.1101/2024.08.11.24311828

Table 2.

Eight Included Studies with LLM Fine-Tuning.

Each entry lists the LLM(s) that were fine-tuned, the fine-tuning algorithm, the fine-tuning hardware, and a summary of findings related to fine-tuning.

Study [38]
Fine-tuned LLM(s): Llama
Fine-tuning algorithm: Full-parameter tuning with DeepSpeed
Hardware: 2× NVIDIA A100 GPUs
Task and findings: Disease diagnosis. No comparison between the fine-tuned model and the original model was reported. Fine-tuned Llama achieved 34.9% accuracy, a 22.3% macro F1 score, and a 28.7% micro F1 score [38].
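
The macro and micro F1 scores in [38] diverge because diagnosis labels are imbalanced: macro F1 averages the per-class F1 scores with equal weight, while micro F1 pools all predictions, so frequent classes dominate. A minimal sketch of the distinction with scikit-learn, using hypothetical labels rather than data from the study:

```python
# Macro F1 weighs every class equally; micro F1 counts all true/false
# positives in one pool. The labels below are hypothetical placeholders.
from sklearn.metrics import f1_score

y_true = ["flu", "flu", "flu", "rare", "rare", "flu"]
y_pred = ["flu", "flu", "flu", "flu", "rare", "flu"]

print(f1_score(y_true, y_pred, average="macro"))  # ~0.78: "rare" counts as much as "flu"
print(f1_score(y_true, y_pred, average="micro"))  # ~0.83: equals accuracy for single-label tasks
```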

Study [109]
Fine-tuned LLM(s): Llama
Fine-tuning algorithm: PEFT with LoRA
Hardware: 1× NVIDIA RTX A6000 GPU
Task and findings: Predicting the diagnosis-related group (DRG) for hospitalized patients. No comparison between the fine-tuned model and the original model was reported. The DRG-LLaMA-7B model achieved a macro-averaged F1 score of 0.327, a top-1 prediction accuracy of 52.0%, and a macro-averaged area under the curve (AUC) of 0.986 [109]. Two further observations (a LoRA sketch follows this entry):
1) A larger base model led to better fine-tuned performance: the best diagnosis accuracy of fine-tuned Llama-13B was 54.6%, versus 53.9% for the fine-tuned 7B model.
2) A longer input context in the fine-tuning data led to better performance: for fine-tuned Llama-13B, the best diagnosis accuracy was 49.9% with a maximum input size of 340 tokens and rose to 54.6% at a maximum of 1,024 tokens.
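
LoRA freezes the base model and trains small low-rank adapter matrices injected into selected layers, which is how DRG-LLaMA fit on a single RTX A6000. Below is a minimal sketch using the Hugging Face peft library; the checkpoint name, rank, scaling, and target modules are illustrative assumptions, not the study's reported hyperparameters.

```python
# LoRA sketch with peft: freeze the base weights and train only the
# low-rank adapters added to the attention projections.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # assumed base checkpoint
    torch_dtype=torch.bfloat16,
)

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # adapter rank (illustrative)
    lora_alpha=16,                        # adapter scaling (illustrative)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # Llama attention projections
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```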

Study [36]
Fine-tuned LLM(s): Llama 2; FLAN-T5; FLAN-UL2; Vicuna; Alpaca
Fine-tuning algorithm: PEFT with QLoRA
Hardware: 1× NVIDIA Quadro RTX 8000
Task and findings: Clinical text summarization. Fine-tuning improved FLAN-T5's MEDCON score from 5 to a range of 26–69 across four datasets [36]. Three further observations (a QLoRA sketch follows this entry):
1) QLoRA-tuned FLAN-T5 was the best-performing fine-tuned open-source model, with MEDCON scores of 59 on Open-i, 38 on MIMIC-CXR, 26 on MIMIC-III, and 46 on patient questions.
2) QLoRA typically outperformed in-context learning (ICL) with the better models (FLAN-T5 and Llama 2); given a sufficient number of in-context examples (from 1 to 64), however, all models surpassed even the best QLoRA fine-tuned model, FLAN-T5, on at least one dataset.
3) An LLM fine-tuned with domain-specific data performed worse than the original model: Alpaca achieved a BLEU score of 30, while Med-Alpaca reached only 20. This highlights the distinction between domain adaptation and task adaptation.
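
QLoRA extends LoRA by quantizing the frozen base weights to 4 bits, which is what lets fine-tuning of this kind fit on a single Quadro RTX 8000. A minimal sketch assuming the transformers/peft/bitsandbytes stack; the FLAN-T5 checkpoint and adapter settings are illustrative, not the study's configuration.

```python
# QLoRA sketch: load the frozen base model in 4-bit NF4 and train LoRA
# adapters in higher precision on top of it.
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
)

base = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xl",  # assumed checkpoint
    quantization_config=bnb_cfg,
)
base = prepare_model_for_kbit_training(base)  # recast norms, enable input grads

lora_cfg = LoraConfig(
    task_type="SEQ_2_SEQ_LM",
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative values
    target_modules=["q", "v"],               # T5 attention projections
)

model = get_peft_model(base, lora_cfg)  # only the adapters are trainable
```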

Study [59]
Fine-tuned LLM(s): GPT-3; GPT-J; Falcon; Llama
Fine-tuning algorithm: PEFT with QLoRA
Hardware: OpenAI's cloud resources
Task and findings: Phenotype recognition in clinical notes. No quantitative comparison between the models before and after fine-tuning was reported. Fine-tuned GPT-3 achieved the best performance on one dataset (81.6% F1 score), and fine-tuned GPT-J performed best on the other (83.2% F1 score) [59].
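
Fine-tuning GPT-3 runs as a hosted job on OpenAI's side rather than on local GPUs, which is presumably why this row lists cloud resources as the hardware. The study predates the current SDK, so the sketch below only illustrates the general hosted workflow with today's OpenAI Python client; the file name, prompt format, and base model are assumptions.

```python
# Hypothetical sketch of hosted fine-tuning with the OpenAI Python SDK:
# upload a JSONL training file, then start a fine-tuning job against it.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Each JSONL line is one training example, e.g. (hypothetical):
# {"prompt": "Note: ... Phenotype:", "completion": " delayed motor milestones"}
train_file = client.files.create(
    file=open("phenotype_train.jsonl", "rb"),  # hypothetical file name
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="davinci-002",  # assumed completion-style base model
)
print(job.id, job.status)  # poll until the job finishes, then query the new model
```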

Study [35]
Fine-tuned LLM(s): BART; PEGASUS; T5; FLAN-T5; BioBART; Clinical-T5; GPT-2; OPT; Llama; Alpaca
Fine-tuning algorithm: PEFT with LoRA for Llama and Alpaca; full-parameter tuning for the other models
Hardware: At least 2× NVIDIA A100 GPUs
Task and findings: Generating personalized impressions for whole-body PET reports. Biomedical-domain pretrained LLMs did not outperform their base models; specifically, the domain-specific fine-tuned BART model reduced accuracy from 75.3% to 73.9%. The authors attribute this to two factors: first, the large training set diminished the benefits of medical-domain adaptation; second, corpora such as MIMIC-III and PubMed likely contained little PET-related content, making the domain pretraining less effective for this task [35].

Study [34]
Fine-tuned LLM(s): Llama 2
Fine-tuning algorithm: No mention
Hardware: No mention
Task and findings: Predicting opioid use disorder (OUD), substance use disorder (SUD), and diabetes. Across four datasets, fine-tuned Llama 2 achieved AUROC values of 92%, 93%, 74%, and 88% for predicting SUD; 95%, 72%, 73%, and 98% for predicting OUD; and 88%, 76%, 64%, and 94% for predicting diabetes [34]. Two further observations (an AUROC sketch follows this entry):
1) An experiment that varied the instructions suggests that fine-tuning on the study's datasets may have induced catastrophic forgetting, particularly when dealing with a large volume of data.
2) Fine-tuned Llama 2 outperformed Llama 2 without fine-tuning on diabetes prediction (AUROC increased from 50% to 88%).
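
AUROC measures how well the model's scores rank positive cases above negative ones: 50% is chance level (as in the un-fine-tuned diabetes result above), and 100% is a perfect ranking. A minimal sketch with scikit-learn, using hypothetical labels and scores rather than the study's data:

```python
# AUROC from gold labels and model scores; all values are hypothetical.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]               # 1 = diabetes (hypothetical labels)
y_score = [0.2, 0.4, 0.8, 0.7, 0.3, 0.9]  # model probability of the positive class

print(roc_auc_score(y_true, y_score))  # 1.0 here; 0.5 would be chance level
```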

Study [64]
Fine-tuned LLM(s): Llama 2-7B; BioGPT-Large
Fine-tuning algorithm: Full-parameter tuning with DeepSpeed
Hardware: 4× NVIDIA A40 GPUs
Task and findings: Generating differential diagnoses for PICU patients. The fine-tuned models outperformed the original models, and a smaller LM fine-tuned on domain-specific notes outperformed much larger models trained on general-domain data [64]. Specifically (a DeepSpeed sketch follows this entry):
1) Fine-tuned Llama-7B achieved an average quality score of 2.88, while Llama-65B without fine-tuning achieved 2.65.
2) Fine-tuned BioGPT-Large achieved an average quality score of 2.78, while BioGPT-Large without fine-tuning achieved 2.02.
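
Full-parameter tuning updates every weight, so studies [38] and [64] used DeepSpeed to shard optimizer state across multiple GPUs. A minimal sketch with the Hugging Face Trainer and an inline DeepSpeed ZeRO config; the ZeRO stage, checkpoint, and toy dataset are illustrative assumptions, since the studies' exact configurations are not reported.

```python
# Full-parameter fine-tuning via the Hugging Face Trainer with DeepSpeed.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy stand-in for a tokenized clinical-notes dataset.
train_ds = Dataset.from_dict({
    "input_ids": [[1, 2, 3, 4]],
    "attention_mask": [[1, 1, 1, 1]],
    "labels": [[1, 2, 3, 4]],
})

ds_config = {  # minimal ZeRO stage-2 config; all weights stay trainable
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="picu_ddx_ft",  # hypothetical output directory
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    bf16=True,
    deepspeed=ds_config,       # DeepSpeed shards optimizer state across GPUs
)

Trainer(model=model, args=args, train_dataset=train_ds).train()
# Launch across 4 GPUs with: deepspeed --num_gpus=4 train.py
```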

Study [71]
Fine-tuned LLM(s): BART; GPT; MedLM
Fine-tuning algorithm: PEFT with LoRA
Hardware: No mention
Task and findings: Early detection of gout flares from nurses' chief-complaint notes in the emergency department. No comparison between the models before and after fine-tuning was reported. The fine-tuned BART variant (BioBART) performed best, achieving F1 scores of 0.73 and 0.67 on the GOUT-CC-2019-CORPUS and GOUT-CC-2020-CORPUS datasets, respectively [71].