Abstract
The availability of biomedical abstracts in online databases could improve health literacy and drive more informed choices. However, the technical language of these documents makes them inaccessible to healthcare consumers, causing disengagement, frustration, and potential misuse. In this work, we explore adapting foundation language models to the Plain Language Adaptation of Biomedical Abstracts benchmark. This task is challenging because it requires sentence-by-sentence simplifications, yet entire abstracts must also be simplified cohesively. We present a sentence-wise autoregressive approach and report experiments with this technique in both zero-shot and fine-tuned settings, using both proprietary and open-source models. We also introduce a stochastic regularization technique to encourage recovery from source-copying during autoregressive inference. Our best-performing model achieves a 32-point increase in SARI and a 6-point increase in BERTscore over the reported state-of-the-art, and also surpasses recent open-domain and biomedical sentence simplification models on this task. Further, in manual evaluation, models achieve factual accuracy comparable to that of human experts, with simplicity approaching human levels. Abstracts simplified by these models could unlock a massive source of health information while retaining clear provenance for each statement to enhance trustworthiness.
Keywords: Text simplification, Foundation Language Models, Biomedical literature
1. Introduction
The inability of patients to fully understand available information about their health has a significant impact on outcomes [6]. While many consumer-facing knowledge bases exist, these are cumbersome and labor-intensive to update and thus typically do not include the latest medical knowledge from the literature. When deeper questions are not answered by these resources, consumers may read beyond their expertise, potentially leading to misunderstanding [3,22].
Neural biomedical text simplification efforts to date have largely either framed the task as document-level plain language summarization [8,9,15] or simplification at the level of discrete sentences [12,19,20]. In contrast, the Plain Language Adaptation of Biomedical Abstracts (PLABA) [2] benchmark task requires sentence-aligned simplification of whole documents. This has the added challenge that each simplified sentence is affected by the context of the entire simplified abstract (Fig. 1). For example, added background or parenthetical explanations of terms will only occur the first time a term or concept is introduced, and whether an expert concept is explained or omitted depends on its centrality to the abstract. Further, anaphora may need to be resolved, and replacements for names of diseases, drugs, or study groups must remain consistent throughout the abstract for readers to follow them. As opposed to plain language summarization, which seeks to distill several takeaways, the sentence-aligned adaptation approach ensures more complete preservation of information that consumers might want, which may prevent them from circumventing the summary and going to the source. It also provides clear provenance for each statement, which is crucial for building trust, especially given the tendency of neural language models to confabulate.
Fig. 1. Excerpt of a PLABA abstract and output from our best-performing model. Notable changes are colored. Note that abstracts must be adapted sentence-wise, but as a whole, e.g. only explaining terms once.
In this work, we explore the use of foundation language models for the PLABA task, both in zero-shot and supervised fine-tuned settings. We evaluate models using automatic metrics on the PLABA test set and with manual judgments of simplicity, completeness, accuracy, and fluency. We find that both zero-shot GPT-3.5 and fine-tuned Llama 2 can generate simplifications with human-level factual accuracy even as they provide near-human levels of simplification. To our knowledge, our work is the first to include document context while performing sentence-wise simplification of biomedical documents. Our contributions are: (1) detailing methods for prompting and fine-tuning foundation language models to create sentence-aligned plain language adaptations of biomedical abstracts, (2) providing trained models for further research and use, and (3) extensive manual evaluations showing how models can simplify better and identifying where they are factually inaccurate. Code, model weights, outputs, and evaluations are available at https://github.com/ondovb/plaba-ft.
2. Methods
We explore several foundation language models: instruction fine-tuned GPT-3.5, Falcon [1], and Llama 2 [21]. In order to train models on a single GPU, we focus on model sizes with 13B parameters or fewer.
2.1. Sentence-Wise Autoregressive Prompting
The core of our method lies in progressively building prompts using system outputs as prior examples (Fig. 2). This takes advantage of the fact that foundation language models have typically been trained to be good in-context learners, following patterns in prompts and incorporating prior information [7]. In this approach, the initial prompt consists of a general instruction (e.g. “Simplify:”), followed by the first source sentence prefixed with a label (e.g. “Original:”), and ends with a hanging label for completion (e.g. “Simple:”). The response is used to grow the prompt by filling in the first ‘simple’ sentence and appending the second source sentence with the same labeling scheme. This continues until a response is obtained for each source sentence. Note that a response can contain multiple sentences (essentially a split operation), but these can still be directly attributed to one source sentence.
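A minimal sketch of this loop in Python is given below; the `generate` function is a hypothetical stand-in for whichever model call is used (e.g. an API request or local inference), and the labels match those described above.

```python
def adapt_abstract(source_sentences, generate, instruction="Simplify:"):
    """Sentence-wise autoregressive prompting (sketch).

    `generate` is a hypothetical stand-in for a text-completion call
    (e.g. an API request or local model inference) that returns the
    text completing the prompt.
    """
    prompt = instruction + "\n\n"
    adaptations = []
    for sentence in source_sentences:
        # Add the next source sentence and a hanging label for completion.
        prompt += f"Original: {sentence}\nSimple:"
        completion = generate(prompt).strip()
        adaptations.append(completion)
        # Grow the prompt with the model's own output as prior context.
        prompt += f" {completion}\n\n"
    return adaptations
```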
Fig. 2. Sentence-wise autoregressive prompting strategy. An initial prompt is provided with a general instruction (e.g. “Simplify:”), the first source sentence, labeled “Original:”, and the label “Simple:”. Subsequent prompts include all prior sentence/completion pairs, providing context while ensuring sentence alignment.
2.2. Supervised Fine-Tuning with Teacher Forcing
Though GPT-3.5 performs the task well with our prompting strategy, smaller, open-source models are desirable for many reasons, including privacy, auditing, cost, and efficiency. Existing open-source foundation language models, however, lag far behind GPT-3.5 in zero-shot performance on this task. We thus sought to fine-tune such models. Since these models are purely causal, rather than sequence-to-sequence, supervised fine-tuning with gold outputs requires (1) constructing single inputs from training pairs, (2) inserting tags to mark the prompt and completion, and (3) masking tokens such that causal prediction and loss propagation are only performed for the section after the completion tag. Further, for efficient training, we use teacher forcing, as is common practice for autoregressive models. This means gold targets are used in training prompts where prior outputs would be inserted during inference (Fig. 6).
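The following is a rough sketch, assuming a HuggingFace-style tokenizer, of how a single training example might be assembled; token positions belonging to the prompt are given a label of -100 so that loss is only propagated for the completion.

```python
def build_training_example(tokenizer, prior_pairs, source, target):
    """Construct one causal-LM training example with teacher forcing (sketch).

    `prior_pairs` holds the gold (source, target) pairs for the earlier
    sentences of the abstract; the gold target of the current sentence
    serves as the supervised completion.
    """
    prompt = "Simplify:\n\n"
    for src, tgt in prior_pairs:
        prompt += f"Original: {src}\nSimple: {tgt}\n\n"
    prompt += f"Original: {source}\nSimple:"

    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    completion_ids = tokenizer(" " + target, add_special_tokens=False)["input_ids"]
    completion_ids.append(tokenizer.eos_token_id)

    input_ids = prompt_ids + completion_ids
    # Mask prompt tokens (-100) so loss is computed only on the completion.
    labels = [-100] * len(prompt_ids) + completion_ids
    return {"input_ids": input_ids, "labels": labels}
```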
2.3. Source-Copying Exposure Regularization
A drawback of teacher forcing is exposure bias, i.e. a mismatch between prior generated outputs and the gold labels the model was trained on, which may compound during autoregressive inference [5]. Further, pretrained language models are more likely to copy the source in machine translation settings [14]. In our case, even the gold training data contains targets that are similar or identical to the source, as annotators were instructed to leave simple language as-is. If a model leaves early sentences untouched during inference, in-context learning may be counterproductive, discouraging simplification of subsequent sentences. We thus introduce Source-Copying Exposure Regularization (SCER). For this method, rather than always using the gold label for teacher forcing, with some probability γ the source is copied instead (Fig. 3). We theorize this will gradually modify the model’s in-context learning behavior so that it can still produce a simple output when appropriate, even when prior outputs seen in the prompt are similar or identical to their sources.
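A minimal sketch of SCER as it might be applied when assembling the prior context of a training prompt is shown below; with probability γ the source replaces the gold simplification (function and variable names are illustrative).

```python
import random

def build_prior_context(prior_pairs, gamma):
    """Assemble prior prompt context with SCER (sketch).

    With probability `gamma`, the source sentence is copied in place of
    the gold simplification, simulating a model that left an earlier
    sentence unsimplified during autoregressive inference.
    """
    context = ""
    for src, tgt in prior_pairs:
        prior_output = src if random.random() < gamma else tgt
        context += f"Original: {src}\nSimple: {prior_output}\n\n"
    return context
```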
Fig. 3. Source-Copying Exposure Regularization (SCER). Rather than always using the gold standard for prior context, there is a chance that the source sentence is copied.
2.4. Baselines
Other than the baselines presented with the PLABA dataset, we know of no published systems specifically designed for the PLABA task. We thus use recent state-of-the-art biomedical and open-domain sentence simplification models as additional baselines.
T5-PLABA (Attal et al.) [2]: The best-performing baseline reported with the PLABA dataset.
MUSS (Martin et al.) [16]: To our knowledge, the state of the art for open domain sentence simplification.
BART-UL-ME (Flores et al.) [9]: Recent biomedical simplification method with strong results on several datasets. Since we require sentence-level outputs, we use the reported BART-XSum model fine-tuned on the Med-EASi corpus [4] (which is mostly single sentences) using Unlikelihood Loss, which was the best performing model for that dataset.
2.5. Implementation
All experiments were implemented in Python. For GPT-3.5, we use the OpenAI API. For fine-tuning open-source models, we use the HuggingFace transformers library [23] with Low-Rank Adapters (LoRA) [11], with r = 16 and α = 32. A single NVIDIA A100 GPU was used for training and inference. All models had a batch size of 1 and a maximum sequence length of 4,096. Llama-2-13B models were 8-bit quantized. General instructions in prompts were “Rewrite for a lay audience:” for GPT-3.5 and “Simplify:” for open-source models. For SCER, we experiment with γ ∈ {0.25, 0.5, 0.75}.
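As a sketch, the adapter setup might look as follows with the transformers and peft libraries; the checkpoint name and quantization details shown here are illustrative rather than a record of our exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-chat-hf"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # place the model on the available GPU
)

# Low-Rank Adapters with the hyperparameters stated above (r = 16, alpha = 32).
lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```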
3. Results
We evaluate GPT-3.5 zero-shot outputs in the following sections. Open-source models produce repetitive output with little simplification in the zero-shot setting but learn the task quickly with fine-tuning, with most needing only 100 examples for an initial large jump in validation BERTscore vs. zero-shot performance (Fig. 7). Continued training provides further improvement, with Llama 2 models performing better than Falcon, no benefit from larger Llama 2 models, and quicker training with SCER (Fig. 4). Note that 13B-parameter models have fewer steps because of longer computation time per step and equalized wall-clock training time. For SCER, all three values of γ have similar training trajectories; we choose γ = 0.5 for manual evaluation because it reaches the highest score.
Fig. 4. BERTscores of checkpoints against the PLABA validation set.
3.1. Automatic Metrics
For automatic evaluation, we compare outputs on the PLABA test set to the gold standard adaptations using various relevant metrics, including BLEU [18], SARI [24], BERTscore [25], and Rouge [13]. Results are shown in Table 1.
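As a sketch, these metrics can be computed with the HuggingFace evaluate library; the toy sentences below are illustrative.

```python
import evaluate

sources = ["Laboratory evaluation revealed panhypopituitarism."]
predictions = ["Lab tests showed a condition called panhypopituitarism."]
references = [["Lab tests showed she had a condition called panhypopituitarism."]]

sari = evaluate.load("sari").compute(
    sources=sources, predictions=predictions, references=references)
rouge = evaluate.load("rouge").compute(
    predictions=predictions, references=references)
bleu = evaluate.load("sacrebleu").compute(
    predictions=predictions, references=references)
bertscore = evaluate.load("bertscore").compute(
    predictions=predictions, references=[refs[0] for refs in references],
    lang="en")
```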
Table 1.
Performance of systems via automatic metrics on the PLABA test set.
| Model | Rouge1 | Rouge2 | RougeL | BLEU | BERTscore | SARI |
|---|---|---|---|---|---|---|
| T5-PLABA ‡ | 0.56 | 0.30 | 0.42 | 0.28 | 0.90 | 0.33 |
| BART-UL-ME † | 0.51 | 0.33 | 0.48 | 0.29 | 0.92 | 0.36 |
| MUSS † | 0.52 | 0.30 | 0.46 | 0.25 | 0.92 | 0.36 |
| GPT-3.5 * | 0.45 | 0.20 | 0.37 | 0.17 | 0.92 | 0.34 |
| Falcon-7B | 0.65 | 0.48 | 0.62 | 0.45 | 0.94 | 0.49 |
| Falcon-7B-instruct | 0.65 | 0.49 | 0.63 | 0.45 | 0.94 | 0.47 |
| Llama-2-7B | 0.71 | 0.56 | 0.68 | 0.53 | 0.95 | 0.58 |
| Llama-2-7B-chat | 0.71 | 0.56 | 0.68 | 0.53 | 0.95 | 0.58 |
| Llama-2-13B | 0.66 | 0.50 | 0.64 | 0.46 | 0.94 | 0.48 |
| Llama-2-13B-chat | 0.67 | 0.50 | 0.64 | 0.47 | 0.94 | 0.48 |
| Llama-2-7B-chat+SCER | 0.76 | 0.64 | 0.74 | 0.62 | 0.96 | 0.65 |
* Zero-shot.
† Cross-corpus.
‡ Previously reported results.
3.2. Manual Judgments
For manual evaluation, we chose an additional 40 abstracts using the same workflow as Attal et al. [2]. A pilot set of 3 abstracts was annotated by two annotators to compute inter-annotator agreement, which was generally high (Table 2); the rest were annotated by one annotator only. Following the four typical types of simplification judgments [17], each sentence was judged for completeness, fluency, simplicity, and accuracy, with simplicity and accuracy each judged at both the sentence and term levels. Due to their more in-depth nature, sentence accuracy and completeness judgments were performed only for the three sentences of each abstract judged by both annotators to be most relevant to the consumer question. Judgments were performed on a 3-point Likert scale and averages were linearly interpolated to a 0–100 scale. The two sub-axes (sentence and term) for both simplicity and accuracy were then averaged to create the final four axes. We manually evaluate (1) adaptations written by biomedical experts, as a human baseline, (2) GPT-3.5 zero-shot, for which automatic metrics are not a good measure, (3) the best-performing open-source model after fine-tuning (Llama-2-7B-chat), and (4) the latter with SCER, as an ablation experiment. Manual evaluations generally found simplifications to be of high quality (Fig. 5). The simplicity of Llama-2-7B-chat (78.80) increased to 83.53 with SCER, supporting the hypothesis that training specifically to recover from source copying prevents propagation of complex outputs through the autoregressive prompting process. All system outputs that received the lowest judgment for factual accuracy (−1) can be seen in Table 3.
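As a small sketch of this aggregation, assuming judgments are coded −1, 0, and 1 on the 3-point scale: the mean is linearly rescaled to 0–100 and the term- and sentence-level sub-axes are averaged.

```python
def rescale(judgments):
    """Map the mean of 3-point judgments (-1, 0, 1) onto a 0-100 scale."""
    mean = sum(judgments) / len(judgments)
    return (mean + 1) / 2 * 100

def axis_score(term_judgments, sentence_judgments):
    """Average the term- and sentence-level sub-axes into one axis score."""
    return (rescale(term_judgments) + rescale(sentence_judgments)) / 2
```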
Fig. 5. Manual evaluation results for chosen systems on 40 additional abstracts.
3.3. Performance
During inference on the test set, Falcon-based models took, on average, 26 s per abstract to create sentence-wise simplifications. Models based on Llama 2 took 31 s for 7B-parameter models and 68 s for 13B-parameter models.
4. Discussion
The models presented here show promise for making biomedical literature accessible to the general public. Yet, this work has several limitations. First, operating at the sentence level means that abstracts must be segmented first, which adds processing time and is error-prone. Additionally, inference must be run n times for an abstract with n sentences. However, a benefit is that the entire original abstract is not needed as context to start generating output. It is thus not clear whether the strategy is costlier than a document-level approach, and more performance experiments are warranted. A further limitation is the relatively narrow scope of evaluation. The PLABA test set contains only 110 abstracts with 1,009 sentences, and manual evaluations only covered 40 abstracts with 430 sentences due to the labor involved. As a proof of concept, and to test many variants, we fine-tuned open-source models for a relatively short amount of time (24 h) and did not yet see signs of overfitting. Future studies are needed to explore the limits of training epochs and minimal dataset sizes. We also did not rigorously explore the effect of γ, the SCER source-copying probability. This value could be scheduled, perhaps reaching 1, similar to Bengio et al. [5], which would obviate autoregression and allow prompts containing only the source sentences. Future work could also involve Reinforcement Learning from Human Feedback (RLHF), potentially using scores from our manual evaluations to train reward models.
5. Conclusion
In this work, we have shown that recent foundation language models are capable of simplifying biomedical abstracts sentence-by-sentence with factual accuracy similar to that of expert-written simplifications. Using a straightforward autoregressive prompting strategy, the proprietary GPT-3.5 model can perform this task zero-shot. While open-source models, which may be desirable for both privacy and efficiency, lag far behind in the zero-shot setting, we show that they can be efficiently fine-tuned on a relatively small amount of data. This is enabled both by supervised fine-tuning with teacher forcing and a novel stochastic regularization regime that prevents degeneration into source-copying during inference. Both proprietary and fine-tuned models, however, fail to reach human levels of simplicity, according to manual evaluation. In closing this gap, we should continue to take care that accuracy is not compromised. Such a “progressive caution” approach [10] will allow incremental progress in simplicity while providing the benefits of current gains to consumers with minimal potential harm.
Acknowledgements.
This work was supported by the NLM Intramural Research Program and utilized the computational resources of the HPC Biowulf cluster (http://hpc.nih.gov).
Appendix
Fig. 6. Using teacher forcing to fine-tune for sentence-level autoregressive prompting. A 3-sentence abstract will be used to generate 3 different training prompts with gold standard completions used for both prior system outputs and as supervised labels. Loss is only propagated for the completion. Note that, during training, these prompts would be pooled with those from other abstracts and shuffled.
Fig. 7. BERTscores of checkpoints against the PLABA validation set for the first 1000 training steps.
Table 2.
Inter-annotator agreement for the pilot manual evaluation, via Krippendorff’s α, at different levels of aggregation (sentence level being the original annotations). Agreement is generally high. Low agreement for fluency is due to over 98% of judgments being 1 (perfect), which breaks the distributional assumptions of α. Sentence simplicity appears to reflect genuine disagreement, with one annotator mostly judging 1 and the other giving a more even mix.
| Aggregation | Comp. | Flue. | Simp-term | Simp-sent | Acc-term | Acc-sent |
|---|---|---|---|---|---|---|
| Sentence | 0.8833 | −0.0255 | 0.5279 | 0.1748 | 0.5845 | 0.8148 |
| Abstract | 0.9684 | 0.3241 | 0.8540 | −0.1311 | 0.6757 | 0.8823 |
| System | 0.9853 | 0.0225 | 0.8077 | 0.0769 | 0.6026 | 0.7543 |
Table 3.
System outputs manually judged to have poor accuracy (score of −1) at either the sentence or term level. Inaccurate sections of outputs are italicized.
| GPT-3.5, PMID 25864273, sentence 9 |
| Source: The disturbance of daily life due to NSP in postpartum women with past history of PMS and Hiesho were significant higher than that for women without those. |
| Output: Postpartum women who had a history of premenstrual syndrome (PMS) and anemia during pregnancy experienced more disruption in their daily lives due to neck and shoulder pain compared to women without these conditions. |
| GPT-3.5, PMID 7920896, sentence 3 |
| Source: Laboratory evaluation revealed panhypopituitarism and pituitary cell antibodies were positive. |
| Output: The lab tests showed that she had a condition called panhypopituitarism, and antibodies were found in her pituitary cells. |
| Llama-2-7B-chat, PMID 7920896, sentence 4 |
| Source: Both CT scans and MR images showed empty sella. |
| Output: Both brain scans showed the pituitary gland was located in the bottom of the skull instead of in the middle of the brain. |
| Llama-2-7B-chat, PMID 27935031, sentence 1 |
| Source: Objectives: Interictal epileptiform discharges (IEDs) have been linked to memory impairment, but the spatial and temporal dynamics of this relationship remain elusive. |
| Output: Interictal epileptiform discharges (IEDs) are events that can lead to seizures. IEDs have been linked to memory impairment, but the spatial and temporal dynamics of this relationship remain elusive. |
| Llama-2-7B-chat+SCER, PMID 7920896, sentence 4 |
| Source: Both CT scans and MR images showed empty sella. |
| Output: T tests showed she had empty pituitary gland. |
| Llama-2-7B-chat+SCER, PMID 32202633, sentence 6 |
| Source: Exclusion criteria were qualitative studies; child age older than 12 years; and language assessment preverbal. |
| Output: Studies that were not quantitative (e.g. did not involve numerical data) were excluded. Additionally, studies on children older than 12 years and language assessment (measurement) of preverbal (non-syllabic) speech were excluded. |
| Llama-2-7B-chat+SCER, PMID 33479851, sentence 9 |
| Source: Other treatments in clinical development include fenfluramine in late phase III, perampanel, soticlestat-OV953/TAK-953, carisbamate and ganaxolone. |
| Output: Other treatments in clinical development include fenfluramine (anti-depressant), perampanel, soticlestat-OV953/TAK-953 (AED), carisbamate and ganaxolone (AEDs). |
| Llama-2-7B-chat+SCER, PMID 15359197, sentence 10 |
| Source: At the three-cross cutoff point for ketonuria and at the 5 mmol/l cutoff point for ketonemia, the two tests had the same negative likelihood ratio (0.1), but the positive likelihood ratio of 3-beta-hydroxybutyrate (infinite) was higher than that of ketonuria. |
| Output: At a higher cutoff point for ketonuria and at the highest cutoff point for 3-beta-hydroxybutyrate, the two tests had the same lack of accuracy (0.1). However, the accuracy of 3-beta-hydroxybutyrate was higher than that of ketonuria (infinite). |
References
1. Almazrouei E, et al.: The Falcon series of open language models. arXiv preprint arXiv:2311.16867 (2023)
2. Attal K, Ondov B, Demner-Fushman D: A dataset for plain language adaptation of biomedical abstracts. Sci. Data 10(1), 8 (2023)
3. Aydın GÖ, Kaya N, Turan N: The role of health literacy in access to online health information. Procedia Soc. Behav. Sci. 195, 1683–1687 (2015)
4. Basu C, Vasu R, Yasunaga M, Yang Q: Med-EASi: finely annotated dataset and models for controllable simplification of medical texts. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 14093–14101 (2023)
5. Bengio S, Vinyals O, Jaitly N, Shazeer N: Scheduled sampling for sequence prediction with recurrent neural networks. Adv. Neural Inf. Process. Syst. 28 (2015)
6. Berkman ND, Sheridan SL, Donahue KE, Halpern DJ, Crotty K: Low health literacy and health outcomes: an updated systematic review. Ann. Intern. Med. 155(2), 97–107 (2011)
7. Brown T, et al.: Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020)
8. Devaraj A, Marshall I, Wallace BC, Li JJ: Paragraph-level simplification of medical texts. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4972–4984 (2021)
9. Flores LJ, Huang H, Shi K, Chheang S, Cohan A: Medical text simplification: optimizing for readability with unlikelihood training and reranked beam search decoding. In: Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 4859–4873 (2023)
10. Goodman KW, Miller RA: Ethics in biomedical and health informatics: users, standards, and outcomes. In: Shortliffe EH, Cimino JJ (eds.) Biomedical Informatics, pp. 391–423. Springer, Cham (2021). 10.1007/978-3-030-58721-512
11. Hu EJ, et al.: LoRA: low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
12. Kew T, et al.: BLESS: benchmarking large language models on sentence simplification. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 13291–13309 (2023)
13. Lin CY: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81 (2004)
14. Liu X, et al.: On the copying behaviors of pre-training for neural machine translation. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 4265–4275 (2021)
15. Lu J, Li J, Wallace BC, He Y, Pergola G: NapSS: paragraph-level medical text simplification via narrative prompting and sentence-matching summarization. In: Findings of the Association for Computational Linguistics: EACL 2023, pp. 1079–1091 (2023)
16. Martin L, Fan A, De La Clergerie ÉV, Bordes A, Sagot B: MUSS: multilingual unsupervised sentence simplification by mining paraphrases. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 1651–1664 (2022)
17. Ondov B, Attal K, Demner-Fushman D: A survey of automated methods for biomedical text simplification. J. Am. Med. Inform. Assoc. 29(11), 1976–1988 (2022)
18. Papineni K, Roukos S, Ward T, Zhu WJ: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
19. Pattisapu N, Prabhu N, Bhati S, Varma V: Leveraging social media for medical text simplification. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 851–860 (2020)
20. Shardlow M, Alva-Manchego F: Simple TICO-19: a dataset for joint translation and simplification of COVID-19 texts. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 3093–3102 (2022)
21. Touvron H, et al.: Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
22. White RW, Horvitz E: Cyberchondria: studies of the escalation of medical concerns in web search. ACM Trans. Inf. Syst. (TOIS) 27(4), 1–37 (2009)
23. Wolf T, et al.: HuggingFace’s transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)
24. Xu W, Napoles C, Pavlick E, Chen Q, Callison-Burch C: Optimizing statistical machine translation for text simplification. Trans. Assoc. Comput. Linguist. 4, 401–415 (2016)
25. Zhang T, Kishore V, Wu F, Weinberger KQ, Artzi Y: BERTScore: evaluating text generation with BERT. arXiv preprint arXiv:1904.09675 (2019)
