Abstract
Large language models (LLMs) hold great promise in summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs. Using proprietary LLMs introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short compared to proprietary ones. In this study, we investigated to what extent fine-tuning open-source LLMs can further improve their performance in summarizing medical evidence. Utilizing a benchmark dataset, MedReview, consisting of 8,161 pairs of systematic reviews and summaries, we fine-tuned three broadly-used, open-sourced LLMs, namely PRIMERA, LongT5, and Llama-2. Overall, the fine-tuned LLMs obtained an increase of 9.89 in ROUGE-L (95% confidence interval: 8.94–10.81), 13.21 in METEOR score (95% confidence interval: 12.05–14.37), and 15.82 in CHRF score (95% confidence interval: 13.89–16.44). The performance of fine-tuned LongT5 is close to GPT-3.5 with zero-shot settings. Furthermore, smaller fine-tuned models sometimes even demonstrated superior performance compared to larger zero-shot models. The above trends of improvement were also manifested in both human and GPT4-simulated evaluations. Our results can be applied to guide model selection for tasks demanding particular domain knowledge, such as medical evidence summarization.
1. Introduction
Medical evidence plays a critical role in healthcare decision-making. Particularly, systematic reviews and meta-analyses of randomized controlled trials (RCTs) are considered the gold standard for generating robust medical evidence.1,2 However, systematically reviewing multiple RCTs is laborious and time-consuming.3 It requires retrieving relevant studies, appraising the evidence quality, and synthesizing findings. Meanwhile, systematic reviews frequently become obsolete upon publication, primarily due to protracted review processes. The delay is exacerbated by the exponential increase in scientific discoveries, exemplified by over 133,000 new clinical trials registered at ClinicalTrials.gov since 2020.4 As such, it is imperative to establish an efficient, reliable, and scalable automated system to streamline and accelerate systematic reviews.
Typically, systematic reviews include quantitative and qualitative reports,5 the former being a statistical meta-analysis of relevant clinical trials, and the latter being a concise narrative explanation of the quantitative results.6,7 Language generation technologies could potentially be employed to auto-generate such narratives, yet they have not been widely applied for medical evidence summarization.7 Text summarization has attracted research attention for decades. Earlier techniques relied on extracting key phrases and sentences through rules or statistical heuristics like word frequency and sentence placement.8–12 These methods, however, struggled with comprehending context and generating cohesive summaries. A significant shift occurred with the adoption of neural network-based methods, enhanced by attention mechanisms.13–19 These mechanisms enable the model to concentrate on various input segments and understand long-range connections between text elements. This advancement allows for a deeper grasp of context, leading to smoother and more precise summaries.
Recent advancements in generative Artificial Intelligence (AI), notably large language models (LLMs), have shown tremendous potential in comprehending and generating natural language.1,20 While these generalist models perform well across diverse tasks, they fail to capture in-depth domain-specific knowledge, particularly in biomedicine.21,22 Furthermore, despite the relevant superior performances compared to open-source alternatives,7,23–26 the disadvantages of closed-source models were rarely discussed or even mentioned.27,28 The lack of transparency of closed-source models makes it challenging to understand the model behavior and troubleshoot customized variants. Moreover, reliance on closed-source models raises the risks associated with changes in service terms or discontinuation of services, which pose a critical threat to long-term projects. Open-source models provide a promising solution to mitigate the above risks.
While multiple open-source model architectures were proposed for either general-purpose foundation models or specifically for text summarization, few have been optimized to synthesize medical evidence. Gutierrez et al. compared the in-context learning performance of GPT-3 with fine-tuned BERT models on Named-Entity Recognition tasks in the biomedical domain.29 Tang et al. assessed zero-shot GPT-3.5 and Chat-GPT on summarizing Cochrane review abstracts.7 The above related studies, however, did not focus on fine-tuning open-source models for medical evidence summarization. It’s also unclear to what degree the optimization strategies, e.g., few-shot learning or fine-tuning, can help bridge the performance gap between open-source models and cutting-edge closed-source alternatives. To quantitively assess fine-tuning technologies to enhance open-source LLMs for medical evidence summarization, we experimented with three broadly used open-sourced LLMs: PRIMERA,15 LongT5,14 and Llama-2,25 including both architectures designed specifically for text summarization and architectures of generalist foundation models. Fine-tuning these models is challenging due to the substantial requirement for computational resources and the risk of catastrophic forgetting, a phenomenon in machine learning where the performance degrades on tasks where the LLM initially performed well.30 To address these issues, we employed Low-Rank Adaptation (LoRA),31 which is a parameter-efficient fine-tuning method focusing on updating only a minimal amount of model parameters during the fine-tuning process.
To facilitate future studies on leveraging LLMs for medical evidence summarization, we present a benchmark dataset, MedReview, consisting of 8,161 pairs of meta-analysis results and narrative summaries from the Cochrane Library,32 published on 37 topics between April 1996 and June 2023. The dataset consists of training, validation, and test sets (see details in the Method section and Supplement Table 1). This collection is an extension of our previous study of evaluating LLMs for evidence summarization7 and covers a wider range of specialties and writing styles, which highlight common text summarization challenges (Figure 1a).
2. Results
2.1. Comparison of different LLMs in automatic evaluations
First, we fine-tuned PRIMERA, LongT5, and Llama-2 using the LoRA method (Figure 1b). We observed that fine-tuning considerably improved the performance of most models (p < 0.01, Figure 2). Specifically, LongT5 models benefited the most from fine-tuning, which led to an increase from 14.72 to 24.61, 15.06 to 28.27 in METEOR, 15.15 to 38.81 in CHRF, and 36.05 to 51.43 in PICO-F1 (Supplementary Table 2). In contrast, PRIMERA demonstrated a relatively moderate improvement with a ROUGE-L increase from 18.90 to 20.48, METEOR increase from 25.15 to 26.50, and PICO-F1 increase from 43.22 to 49.47. However, there was a slight decrease in the CHRF from 39.25 to 37.84. Overall, the fine-tuned LLMs improved the ROUGE-L score by an absolute of 9.89% (95% confidence interval of improvement: 8.94–10.81), the METEOR score by 13.21 (95% confidence interval of improvement: 12.05–14.37), and the CHRF score by 15.82 (95% confidence interval of improvement: 13.89–16.44). These recently released models all outperformed the fine-tuned variant of BART, the previous SoTA. The fine-tuned BART achieved 17.74, 27.49, and 40.54 in ROUGE-L, METEOR, and CHRF, respectively. We compared the fine-tuned models with GPT-3.5-turbo, one of the most widely known and cutting-edge closed-source LLMs. Zero-shot GPT-3.5-turbo achieved 23.15 in ROUGE-L, 28.83 in METEOR, and 39.74 in CHRF scores. The performance gaps between the open-source models and GPT-3.5-turbo were reduced after fine-tuning. The fine-tuned LongT5 achieved similar results as GPT-3.5-turbo (Supplement Table 2). We also conducted a pilot study using GPT-4. The summaries generated by GPT-3.5-turbo and GPT-4 are not significantly different; as such, we only use GPT-3.5-turbo for comparison.
Furthermore, while few-shot learning helped reduce the performance gap, LLMs fine-tuned with the full training data still demonstrated better performance. We constructed two sets of few-shot learning baselines, one based on few-shot prompting and the other based on few-shot fine-tuning (see details in Methods). Under 1-, 2-, and 5-shot prompting, Mixtral-8×7B achieved 24.87/24.53/24.99 in Rouge-L, 27.82/25.78/27.59 in METEOR, 37.63/35.61/37.42 in CHRF respectively. After fine-tuning using 100 randomly selected samples, PRIMERA achieved 19.11 in ROUGE-L, 24.98 in METEOR, and 39.01 in CHRF; LongT5 was moderately improved with few-shot fine-tuning, resulting in 15.06 in ROUGE-L, 16.17 in METEOR, and 24.81 in CHRF.
We also calculated Pearson correlation coefficients among different automatic metrics using the entire test set (Supplementary Figure 2). Most metrics are strongly positively correlated with each other; with a few exceptions, most coefficients are larger than 0.7. The only exceptional metric is the PICO coverage, which is only weakly positively correlated to the others, with coefficients ranging between 0.33 and 0.46. As discussed above, zero-shot models tend to verbatim replicate the input, which is already abundant in PICO concepts, since we selected only the objective and main results section of the review abstracts.
2.2. Comparison between zero-shot LongT5-xl and fine-tuned LongT5-base
Next, we investigated whether fine-tuned smaller models have the capability to outperform zero-shot larger models. To this end, we fine-tuned the LongT5-base model, which has 10% fewer parameters than LongT5-xl. Figure 3 shows that fine-tuned LongT5-base outperforms zero-shot LongT5-xl. We also found that this observed trend holds across different LLM architectures. For instance, the performance of fine-tuned PRIMERA and LongT5 exceeded that of zero-shot Llama-2 (Supplementary Table 2), even though the latter model comprises at least 20 times more parameters.
2.3. Qualitative evaluation
Finally, we conducted a comprehensive human evaluation and a GPT-4 simulated26 evaluation of machine-generated summaries. Our baseline, zero-shot Llama-2, is one of the latest and largest open-sourced LLMs. In both evaluations, we requested clinical experts or GPT-4 to select the better summary from a pair of candidates - one generated using the baseline and the other generated using our fine-tuned models. The win-rate is the ratio of machine-generated summaries evaluated as better than the baseline. The baseline win-rate is 50%, given that (1) zero-shot Llama-2 is being compared to itself, (2) all pairs of summaries to be compared are identical, and (3) ties are broken randomly.24
According to the human evaluation, fine-tuned Llama-2 was preferred to zero-shot, with the win-rate increased from 50% to 59.20% (Figure 4a, Supplementary Table 3). Fine-tuned PRIMERA and LongT5 models also achieved 54.47% and 59.68% win-rate against the zero-shot Llama-2.
We further asked the evaluators to share the rationale behind their preference for the chosen summaries. The fine-tuned models were compared with zero-shot Llama-2 on multiple dimensions that have been established as desired properties of summaries.7,33 Figure 5 shows the number of cases where zero-shot LLama-2 generated better summaries (left/red), in contrast to the cases where the fine-tuned models generated better summaries (right/blue). With few exceptions, LLMs were improved in all aspects after fine-tuning (Supplementary Table 5). By manually comparing the summaries generated by zero-shot and fine-tuned models, we found that zero-shot models tend to present a detailed background of the summarized studies but do not provide any findings or conclusions, i.e., they present a high resemblance of leading sentences in the paragraphs. Recall that word embeddings are combined with positional embeddings to represent each token in a document in transformer architecture. In general summarization tasks, the key information is typically presented in leading or concluding sentences. The ordering of key information and other redundant information can impact the positional embeddings during pre-training. The zero-shot open-source summarization models mostly extract the leading sentences instead of key information. This indicates that positional embeddings significantly impacted the summary more than word embeddings in zero-shot open-source models. On the other hand, fine-tuned models align more closely with ground-truth summaries, which can provide supportive evidence or identify the lack of sufficient evidence for intervention outcomes.
The GPT-4 simulated evaluation also indicates a significant improvement in all models after fine-tuning (Figure 4b). In addition, 257 out of 378 simulated evaluation results concord with the judgment of human experts (68% accuracy).
We also evaluated all models on two distinct test sets of review articles (Supplementary Table 4). One group, denoted as “after cutoff,” was published after the latest knowledge cutoff date of every model, i.e., no articles in this group were used in pre-training the foundation models. The other group, denoted as “before cutoff”, was published before the knowledge cutoff date; the articles in this group may be used for pre-training purposes. On both the “after cutoff” and the “before cutoff” test sets, fine-tuned models demonstrated improved performances. Recap that all articles used for fine-tuning were published “before cutoff”. This demonstrates the generalizability of fine-tuned models “before cutoff” data on “after cutoff” test data.
3. Discussion
In this study, we focused on comparing open-source and closed-source LLMs in medical evidence summarization. While closed-source models, exemplified by GPT families and others, demonstrated superior performance as compared to open-source alternatives, the risks associated with using closed-source models are not negligible, which include lack of transparency, reliance on external dependency as a single-point-of-failure, potentially high-cost in migration to other vendors. Furthermore, it’s still unclear whether patients would consent to have their information utilized in LLMs, especially when they were unaware of and unable to understand what the LLMs will be used. However, such disadvantages were not paid enough attention. To mitigate the above risks, we investigated recently released open-source LLMs for medical evidence summarization. To our best knowledge, open-source models still fall behind in natural language understanding, as compared to closed-source ones. Based on the observation that fine-tuning can enhance, we validate principles of model optimization within evidence summarization and quantitatively measure the performance boost via fine-tuning.
To facilitate future studies in this direction, we first introduced MedReview, a collection of meta-analysis and summary pairs to assist in fine-tuning LLMs for medical evidence summarization. We further showed that open-source LLMs, including LongT5, PRIMERA, and Llama-2, demonstrated improved summarization performance after fine-tuning with MedReview, even close to GPT-3.5-turbo. This observation also confirms that these generalist models perform well across diverse tasks but need more in-depth domain-specific knowledge, e.g., in biomedicine. While few-shot learning has been effective in other NLP tasks, the applicability is still limited by the context window within lengthy document summarization. Fine-tuning is a robust approach to bridge this performance gap between open-source models and closed-source alternatives, while maintaining the advantages of transparency, easy customization, maintenance, and migration. However, the manual evaluation results show that fine-tuning does not guarantee truthful and accurate summaries. This issue was also reported in the previous evaluation of zero-shot GPT models.7 This highlights that trustworthy summarization of medical evidence remains challenging and unresolved.
In addition, the usage of large models is limited by the high demand for computing resources. Our experiments indicated that smaller models, when fine-tuned, can outperform zero-shot larger models on specific tasks. This is observed in two experiments. First, the fine-tuned LongT5-base performs better than the zero-shot LongT5-xl (Figure 3). Second, PRIMERA and LongT5-xl, despite being smaller in size than Llama-2 (70B), demonstrated enhanced performance than the latter one upon fine-tuning (Figure 5, Supplement Table 2). In the cases of limited computing resources, PRIMERA demonstrated robust performance in medical evidence summarization in the few-shot fine-tuning setup, which confirms the findings of the original reports of the PRIMERA paper. However, few-shot fine-tuned models underperformed as compared to those fully fine-tuned on the entire training data.
The evaluation of LLM-generated summaries presents another challenge in this task. Automatic metrics frequently used for text summarization can, at best, measure the similarity between word distributions of references and LLM-generated summaries. These metrics do not correlate strongly with properties of desired summaries, such as factual comprehensiveness and consistency. As such, human evaluation, especially those from clinical experts, is still critical but not scalable. Therefore, we used GPT-4 as a simulated evaluator and found that 68% of GPT4-simulated evaluation results aligned with human judgments. These findings suggest that GPT-4 holds the promise to not only facilitate summary evaluation, but also to provide feedback to align the summarization model with simulated expert feedback. The simulated evaluation results further confirmed that fine-tuned smaller models can surpass larger zero-shot models (Figure 4b). This observation is consistent with the improvement discussed in the automatic evaluation.
Summarization systems have diverse applications, particularly benefiting researchers and healthcare professionals. For researchers engaged in systematic reviews, these systems can analyze clinical trial reports efficiently, pinpointing relevant studies and distilling essential findings without requiring exhaustive document review. This significantly speeds up the research process. Healthcare professionals and policymakers, on the other hand, can utilize these systems to efficiently grasp the latest clinical trials’ outcomes and implications, which is vital for informed decision-making concerning patient care, treatment guidelines, and healthcare policies, especially under the urgent pandemic conditions. Furthermore, by condensing the latest clinical trial information, summarization systems provide concise, current data to clinical decision support systems, assisting healthcare providers in making evidence-based treatment decisions customized to their patients’ specific needs.
This study has certain limitations. Due to the restricted time and resources, we cannot extensively explore fine-tuning models such as Claude, GPT-3.5-turbo, and GPT-4. Instead, we mainly focused on open-sourced LLMs. This restriction, however, does not impact the applicability of our study in the clinical domain. We believe that open-sourced LLMs foster wider collaboration and transparency, thus forwarding research progress as other researchers can enhance, modify, and refine these models. Plus, by advocating for the democratization of technology, these LLMs encourage a more extensive use and a potentially higher rate of innovation. Another limitation is that the LLMs were not fine-tuned to summarize clinical trial publications but the manually curated “main results” of review abstracts. This study design was initially aimed at simplifying and expediting the development and testing of our summarization algorithms. A future direction is to deploy LLMs to directly synthesize information from clinical trials.
To summarize, our findings underscore the utility of fine-tuning as a robust technique for bridging the performance gap between open-source LLMs and closed-source ones, reinforcing its applicability across a spectrum of model architectures and sizes, and setting the stage for more nuanced investigations into the efficiency and effectiveness of model optimization strategies. However, additional research is warranted to fully uncover the potential of LLM in this context.
4. Methods
4.1. Data Collection
We collected 8,161 abstracts of systematic reviews from the Cochrane Library.32 Unlike abstracts of the biomedical literature, which are highly condensed summaries, systematic review abstracts provide a structured overview that enables readers to quickly determine the validity and applicability of the review. These abstracts typically follow a common structure, detailing preferred reporting items.5 The Cochrane review abstracts present background, objectives, search methods, selection criteria, data collection and analysis, main results, and authors’ conclusions. Within such a self-contained structure, the authors’ conclusion presents a narrative summary of the most salient details of the included clinical studies.7 This section is one of the first to consult when healthcare providers seek answers to clinical questions. Given the meta-analysis results as input, we aim to automatically reproduce this narrative summary. The collected reviews cover a wide range of topics, including but not limited to neurology, gastroenterology, rheumatology, nephrology, and radiology. The publication dates of the reviews range from April 1996 to June 2023.
We split the dataset into distinct training (91.56%), validation (4.83%), and test (3.61%) sets, ensuring that all of the samples appear in one set (Supplementary Table 1). All LLMs were prepared with a large extent of public textual data collected until a certain moment, known as the cutoff. To maintain a legitimate comparison between LLMs, we put all articles published after September 2022 as test data (because most LLMs studied in this study used the data up to September 2022). Articles published prior to this are primarily used for training and validation. The division of training and validation was stratified according to the time of publication.
Few-shot Baselines
Few-shot learning has been proven an effective and sample-efficient strategy for optimizing task-specific LLMs. We construct two few-shot learning baselines, one based on prompting and the other on fine-tuning.
For few-shot prompting, we use Mixtral-7×8B for the few-shot prompting foundation model. Due to the limit on token numbers, other open-source models cannot fit demonstrations of long document summarization in the context windows. We randomly selected 1, 2, and 5 samples from the training set as demonstrations. For few-shot fine-tuning, we followed the findings of the PRIMERA report that LLMs can be reasonably adapted to domain-specific tasks with a limited number of labeled samples. We used the same setup as the few-shot experiments in the PRIMERA, where we randomly selected 100 samples from the training set and fine-tuned LLMs.
4.2. Fine-tuning LLMs
We investigated several LLM architectures that have recently surfaced for tasks related to summarization tasks or as foundation models. In this study, we only consider models that satisfy the following two conditions. First, models need to be publicly accessible and open-sourced to ensure the transparency and accountability of the models. Second, context windows need to be long enough to digest input without requiring condensation or truncation. Bearing all these factors in mind, we included PRIMERA, LongT5, and Llama-2 in our studies. PRIMERA deploys a pretraining strategy named Entity Pyramid to select and aggregate salient information focusing on document summarization.15 LongT5 is an extension of T5 architecture that adopts a summarization pretraining strategy to scale up the input length.14 Llama-2 is one of the recently released open-source, scalable foundation models with 7B, 13B, and 70B parameters.25 Since Mixtral-8×7B demonstrated similar benchmark performance as Llama-2, we did not fine-tune Mixtral-8×7B.23 Instead, we report the few-shot prompting performance of Mixtral-8×7B.
Following a previous work,7 we selected the objective and main results sections of a review abstract as input and used the authors’ conclusion as a reference to fine-tune models. We applied the LoRA method, which keeps the original parameters frozen and adjusts only a relatively small number of extra parameters via matrix decomposition.31 The exact number of parameters depends on the rank hyperparameter in the LoRA method. We refer the readers to the original LoRA paper for technical details.31
Our implementation uses the following libraries, transformers,34 torch,35 and PEFT.36 Most fine-tuning jobs were completed on AWS and our local lab servers. Llama-2 models were fine-tuned on SageMaker platform. All models were fine-tuned for 1 epoch, within which the validation loss had already stopped decreasing. We set the learning rates to 3e-5,15 1e-4,25 and 1e-314 for PRIMERA, LLama-2, and LongT5 models, respectively. We set the rank hyperparameter of LoRA to 8.31
4.3. Evaluation Metrics
We first use the natural language generation (NLG) metrics to evaluate the quality of the generated summary. These metrics include ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation) and METEOR (Metric for Evaluation of Translation with Explicit Ordering) scores. We also include the CHRF (CHaRacter-level F-score), which was reported to correlate highly with the readability of generated text.33 Their values range from 0.0 to 100.0, with a score of 100.0 indicating that the generated summaries are identical to the reference summary. The model performance on our test is approximately normally distributed and the performance of each model is independent of others. As such, we calculated the p-value using a paired t-test to determine the statistical significance of the difference between the two models.
4.3.1. PICO metrics
NLG metrics are known to be inadequate for evaluating factual completeness and consistency.33 We therefore propose to use a PICO (Participants, Interventions, Comparison, and Outcomes) extraction system to evaluate the accuracy of the generated summaries. More specifically, we fine-tuned a BERT-based13 model to extract PICO concepts and then score a generated summary by comparing the values of these PICO elements obtained from the reference. We consider a PICO element to be a true positive, if it satisfies two conditions: (1) the text in the reference overlaps with the text in the generation, and (2) the two entity types should have the same PICO category. The micro averages for precision, recall, and F1 scores are all computed over the PICO components.
4.3.2. Human evaluation
We conducted a review of the summary quality via human evaluation. The quality is measured from four aspects: consistency, comprehensiveness, specificity, and readability, which were established as essential factors for measuring machine summary quality. Consistency indicates whether the summary contradicts the input source. Comprehensiveness measures coverage of key information of input. Specificity measures the preciseness and conciseness of the summary. Readability indicates a machine summary is fluent and free of grammatical errors that hinder understanding.
To evaluate the machine summaries, we invited 7 clinical experts, each specializing in one or two of the following specialties, including Gastroenterology, General Surgery, Internal Medicine, Nephrology, Neurology, Radiology, and Rheumatology. All experts have obtained MD training and currently provide direct patient care. Following a recent LLM study,24 we request the experts to compare the quality of different machine summaries (interface shown in Supplementary Figure 1). Specifically, according to their domain knowledge, each expert was assigned review abstracts along with three summaries: (1) the Authors’ conclusion section, (2) zero-shot Llama-2, and (3) one of the fine-tuned models. From the zero-shot baseline and the fine-tuned models, the experts will select which one generates a better summary. To reduce the potential order-related bias, the order of summaries generated by zero-shot Llama-2 and fine-tuned summaries was randomized. We further asked the experts to choose the reasons for their choices.
4.3.3. GPT-4 evaluation
Like many other annotation scenarios, collecting experts’ feedback is not scalable with respect to the samples to be labeled. In addition to our manual review, we explored the use of GPT-426 as a simulated expert to answer the same questions as those assigned to human experts. Instead of selecting a sample of test data as in the manual review, we used the model summaries for all test articles for GPT-4 evaluation. We also analyzed the percentage of questions that human judgments agree with the GPT-4 evaluation.
Supplementary Material
Acknowledgements
This project was sponsored by the National Library of Medicine grant R01LM009886, R01LM014344, National Human Genome Research Institute grant R01HG012655, and the National Center for Advancing Clinical and Translational Science awards UL1TR001873 and UL1TR002384. Q.J. and Z.L. are supported by the NIH Intramural Research Program, National Library of Medicine. We also want to express our gratitude to Amazon Web Services (AWS) for providing the computational resources used in our research. The funder had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Footnotes
Competing Interests
The authors declare no competing interests.
Ethics declarations
Not applicable.
Code Availability
The code can be accessed at https://github.com/ebmlab/MedReview.
Data availability
The data can be accessed at https://github.com/ebmlab/MedReview.
References
- [1].Peng Y, Rousseau JF, Shortliffe EH, Weng C. AI-generated text may have a role in evidence-based medicine. Nat Med. 2023. Jul;29(7):1593–1594. doi: 10.1038/s41591-023-02366-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [2].Concato J, Shah N, Horwitz RI. Randomized, controlled trials, observational studies, and the hierarchy of research designs. N Engl J Med. 2000. Jun;342(25):1887–1892. doi: 10.1056/NEJM200006223422507. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [3].Borah R, Brown AW, Capers PL, Kaiser KA. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open. 2017. Feb;7(2):e012545. doi: 10.1136/bmjopen-2016-012545. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [4].NLM. ClinicalTrial.gov (NLM), Accessed on Nov 18, 2023. Accessed: 2023-11-18. [Google Scholar]
- [5].Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE, Chou R, Glanville J, Grimshaw JM, Hróbjartsson A, Lalu MM, Li T, Loder EW, Mayo-Wilson E, McDonald S, McGuinness LA, Stewart LA, Thomas J, Tricco AC, Welch VA, Whiting P, Moher D. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Rev Esp Cardiol. 2021. Sep;74(9):790–799. doi: 10.1016/j.rec.2021.07.010. [DOI] [PubMed] [Google Scholar]
- [6].Wallace BC, Saha S, Soboczenski F, Marshall IJ. Generating (Factual?) Narrative Summaries of RCTs: Experiments with Neural Multi-Document Summarization. AMIA Jt Summits Transl Sci Proc. 2021. May;2021:605–614. [PMC free article] [PubMed] [Google Scholar]
- [7].Tang L, Sun Z, Idnay B, Nestor JG, Soroush A, Elias PA, Xu Z, Ding Y, Durrett G, Rousseau JF, Weng C, Peng Y. Evaluating large language models on medical evidence summarization. NPJ Digit Med. 2023. Aug;6(1):158. doi: 10.1038/s41746-023-00896-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [8].Barzilay R, Elhadad N. Inferring Strategies for Sentence Ordering in Multidocument News Summarization. jair. 2002. Aug;17:35–55. doi: 10.1613/jair.991. [DOI] [Google Scholar]
- [9].Pivovarov R, Elhadad N. Automated methods for the summarization of electronic health records. J Am Med Inform Assoc. 2015. Sep;22(5):938–947. doi: 10.1093/jamia/ocv032. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [10].Zweigenbaum P, Demner-Fushman D, Yu H, Cohen KB. Frontiers of biomedical text mining: current progress. Brief Bioinform. 2007. Sep;8(5):358–375. doi: 10.1093/bib/bbm045. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].Li F, Han C, Huang M, Zhu X, Xia YJ, Zhang S, Yu H. Structure-Aware Review Mining and Summarization. In: Huang CR, Jurafsky D, editors. Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010). Beijing, China: Coling 2010 Organizing Committee; 2010. p. 653–661. [Google Scholar]
- [12].Demner-Fushman D, Lin JJ. Answering clinical questions with knowledge-based and statistical techniques. CL. 2007. Mar;33:63–103. doi: 10.1162/coli.2007.33.1.63. [DOI] [Google Scholar]
- [13].Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, Naumann T, Gao J, Poon H. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans Comput Healthcare. 2021. Oct;3(1):1–23. doi: 10.1145/3458754. [DOI] [Google Scholar]
- [14].Guo M, Ainslie J, Uthus D, Ontanon S, Ni J, Sung YH, Yang Y. LongT5: Efficient text-to-text transformer for long sequences. In: Findings of the Association for Computational Linguistics: NAACL 2022. Stroudsburg, PA, USA: Association for Computational Linguistics; 2022. p. 724–736. doi: 10.18653/v1/2022.findings-naacl.55. [DOI] [Google Scholar]
- [15].Xiao W, Beltagy I, Carenini G, Cohan A. PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics; 2022. p. 5245–5263. doi: 10.18653/v1/2022.acl-long.360. [DOI] [Google Scholar]
- [16].Zhang J, Zhao Y, Saleh M, Liu P. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. In: Iii HD, Singh A, editors. Proceedings of the 37th International Conference on Machine Learning. vol. 119 of Proceedings of Machine Learning Research. PMLR; 2020. p. 11328–11339. [Google Scholar]
- [17].Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, Stoyanov V, Zettlemoyer L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics; 2020. p. 7871–7880. doi: 10.18653/v1/2020.acl-main.703. [DOI] [Google Scholar]
- [18].Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics; 2019. p. 4171–4186. doi: 10.18653/v1/N19-1423. [DOI] [Google Scholar]
- [19].Mrabet Y, Demner-Fushman D. HOLMS: Alternative Summary Evaluation with Large Language Models. In: Scott D, Bel N, Zong C, editors. Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics; 2020. p. 5679–5688. doi: 10.18653/v1/2020.coling-main.498. [DOI] [Google Scholar]
- [20].Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, Payne P, Seneviratne M, Gamble P, Kelly C, Babiker A, Schärli N, Chowdhery A, Mansfield P, Demner-Fushman D, Agüera Y Arcas B, Webster D, Corrado GS, Matias Y, Chou K, Gottweis J, Tomasev N, Liu Y, Rajkomar A, Barral J, Semturs C, Karthikesalingam A, Natarajan V. Large language models encode clinical knowledge. Nature. 2023. Aug;620(7972):172–180. doi: 10.1038/s41586-023-06291-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [21].Zack T, Lehman E, Suzgun M, Rodriguez JA, Celi LA, Gichoya J, Jurafsky D, Szolovits P, Bates DW, Abdulnour REE, Butte AJ, Alsentzer E. Coding inequity: Assessing GPT-4’s potential for perpetuating racial and gender biases in healthcare. bioRxiv. 2023. Jul. doi: 10.1101/2023.07.13.23292577. [DOI] [PubMed] [Google Scholar]
- [22].Jin Q, Yang Y, Chen Q, Lu Z. GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information. ArXiv. 2023. May. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [23].Jiang AQ, Sablayrolles A, Roux A, Mensch A, Savary B, Bamford C, Chaplot DS, Casas Ddl, Hanna EB, Bressand F, Lengyel G, Bour G, Lample G, Lavaud LR, Saulnier L, Lachaux MA, Stock P, Subramanian S, Yang S, Antoniak S, Scao TL, Gervet T, Lavril T, Wang T, Lacroix T, Sayed WE. Mixtral of Experts. arXiv. 2024. Jan. [Google Scholar]
- [24].Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, Schulman J, Hilton J, Kelton F, Miller L, Simens M, Askell A, Welinder P, Christiano PF, Leike J, Lowe R. Training language models to follow instructions with human feedback. In: Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, editors. Advances in Neural Information Processing Systems. vol. 35. Curran Associates, Inc.; 2022. p. 27730–27744. [Google Scholar]
- [25].Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, Bashlykov N, Batra S, Bhargava P, Bhosale S, Bikel D, Blecher L, Ferrer CC, Chen M, Cucurull G, Esiobu D, Fernandes J, Fu J, Fu W, Fuller B, Gao C, Goswami V, Goyal N, Hartshorn A, Hosseini S, Hou R, Inan H, Kardas M, Kerkez V, Khabsa M, Kloumann I, Korenev A, Koura PS, Lachaux MA, Lavril T, Lee J, Liskovich D, Lu Y, Mao Y, Martinet X, Mihaylov T, Mishra P, Molybog I, Nie Y, Poulton A, Reizenstein J, Rungta R, Saladi K, Schelten A, Silva R, Smith EM, Subramanian R, Tan XE, Tang B, Taylor R, Williams A, Kuan JX, Xu P, Yan Z, Zarov I, Zhang Y, Fan A, Kambadur M, Narang S, Rodriguez A, Stojnic R, Edunov S, Scialom T. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv. 2023. Jul. [Google Scholar]
- [26].OpenAI R. Gpt-4 technical report. arxiv 2303.08774. View in Article. 2023. [Google Scholar]
- [27].Nosek BA, Alter G, Banks GC, Borsboom D, Bowman SD, Breckler SJ, Buck S, Chambers CD, Chin G, Christensen G, Contestabile M, Dafoe A, Eich E, Freese J, Glennerster R, Goroff D, Green DP, Hesse B, Humphreys M, Ishiyama J, Karlan D, Kraut A, Lupia A, Mabry P, Madon T, Malhotra N, Mayo-Wilson E, McNutt M, Miguel E, Paluck EL, Simonsohn U, Soderberg C, Spellman BA, Turitto J, VandenBos G, Vazire S, Wagenmakers EJ, Wilson R, Yarkoni T. Promoting an open research culture. Science. 2015;348(6242):1422–1425. doi: 10.1126/science.aab2374. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [28].Zhang G, Jin Q, Jered McInerney D, Chen Y, Wang F, Cole CL, Yang Q, Wang Y, Malin BA, Peleg M, Wallace BC, Lu Z, Weng C, Peng Y. Leveraging generative AI for clinical evidence synthesis needs to ensure trustworthiness. J Biomed Inform. 2024. May;153:104640. doi: 10.1016/j.jbi.2024.104640. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [29].Jimenez Gutierrez B, McNeal N, Washington C, Chen Y, Li L, Sun H, Su Y. Thinking about GPT-3 in-context learning for biomedical IE? Think again. In: Findings of the Association for Computational Linguistics: EMNLP 2022. Stroudsburg, PA, USA: Association for Computational Linguistics; 2022. p. 4497–4512. doi: 10.18653/v1/2022.findings-emnlp.329. [DOI] [Google Scholar]
- [30].Tadros T, Krishnan GP, Ramyaa R, Bazhenov M. Sleep-like unsupervised replay reduces catastrophic forgetting in artificial neural networks. Nat Commun. 2022. Dec;13(1):7742. doi: 10.1038/s41467-022-34938-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [31].Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, Wang L, Chen W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv. 2021. Jun. [Google Scholar]
- [32].Cochrane. The Cochrane Library. Accessed: 2023-9-30. [Google Scholar]
- [33].Fabbri AR, Kryściński W, McCann B, Xiong C, others. Summeval: Re-evaluating summarization evaluation. Transactions of the. 2021. [Google Scholar]
- [34].Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Davison J, Shleifer S, von Platen P, Ma C, Jernite Y, Plu J, Xu C, Le Scao T, Gugger S, Drame M, Lhoest Q, Rush A. Transformers: State-of-the-Art Natural Language Processing. In: Liu Q, Schlangen D, editors. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics; 2020. p. 38–45. doi: 10.18653/v1/2020.emnlp-demos.6. [DOI] [Google Scholar]
- [35].Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A. Automatic differentiation in PyTorch. In: Neural Information Processing Systems (NIPS 2017); 2017. p. 1–4. [Google Scholar]
- [36].Mangrulkar S, Gugger S, Debut L, Belkada Y, Paul S, Bossan B. PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The data can be accessed at https://github.com/ebmlab/MedReview.