Abstract
Evidence-based medicine promises to improve the quality of healthcare by grounding clinical decisions and practices in the best available evidence. The rapid growth of clinical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing evidential information. Recent advancements in generative AI, exemplified by large language models, hold promise in facilitating this arduous task. However, developing accountable, fair, and inclusive models remains a complicated undertaking. In this perspective, we discuss the trustworthiness of generative AI in the context of automated clinical evidence synthesis.
Keywords: evidence-based medicine, trustworthy generative AI, large language models, clinical evidence synthesis
Evidence-based Medicine (EBM) aims to improve healthcare decisions by integrating the best available evidence, clinical expertise, and patient values. One widely used method for generating high-quality, accessible evidence is the systematic review of Randomized Controlled Trials (RCTs), which involves identifying, appraising, and synthesizing evidence from relevant studies addressing a clinical question. While RCTs are established as a reliable method for obtaining high-quality evidence on the safety and efficacy of medical products, their practical applicability is often curbed by cost constraints, ethical dilemmas, and logistical challenges. Consequently, researchers have increasingly turned to real-world evidence, derived from observational data about real-world patients, such as electronic health records, insurance billing databases, and disease registries [1], for estimating treatment effects [2,3].
Clinical evidence synthesis can be defined as the fine-grained analysis of evidence across publications to collect the evidence most relevant to a target study [4]. However, given the rapid growth of the biomedical literature and clinical data, clinical evidence synthesis is struggling to keep pace. In the meantime, the proliferation of misinformation and of biased or incomplete summaries resulting from unreliable or contradictory evidence erodes the public’s trust in biomedical research [5,6]. Given such concerns, we need to develop new and trustworthy methods to reform systematic reviews.
In recent years, dramatic advances in Generative AI, particularly Large Language Models (LLMs), have demonstrated a remarkable potential for assisting in systematic reviews [7]. Recently, LLMs have been explored to summarize meta-analyses [8] and clinical trials [9,10]. Before the era of LLMs, AI methods were also deployed to extract evidential information [11–16] and retrieve publications on a given topic [17–19]. However, achieving trustworthy AI for clinical evidence synthesis remains a complex undertaking. In this perspective, we discuss the trustworthiness of generative AI and the associated challenges and recommendations in the context of fully and semi-automated clinical evidence synthesis (Figure 1).
Figure 1.
Challenges and recommendations to achieve trustworthy evidence synthesis.
Accountability
In the context of summarizing clinical evidence, the accountability of an LLM refers to the model’s ability to faithfully summarize high-quality clinical trials and the corresponding meta-analyses. In the context of health and human services, faithful or factual AI refers to systems that generate content that is factually accurate [20]. When using the term trustworthiness, we emphasize not only factualness or faithfulness but also the system’s reproducibility. While LLMs can generate semantically meaningful and grammatically correct textual outputs, these models can also yield factually incorrect outcomes [8]. These discrepancies can be classified as intrinsic or extrinsic hallucinations [21]. Intrinsic hallucinations refer to cases where “the generated output contradicts the input source.” By contrast, extrinsic hallucinations occur when the generated output can “neither be supported nor be contradicted by the source.”
The problem of factually incorrect outputs partly stems from discrepancies between the sources of evidence and the resulting summary [21]. In evidence synthesis in particular, the findings of the included clinical trials may not inherently align with the outcomes of the meta-analysis. Systematic reviews are designed to offer a statistical synthesis of the results of eligible clinical studies, rather than merely replicating them verbatim [22]. It is also plausible that crucial information from an included study might be omitted from the synthesis if perceived as offering low-quality evidence. Given the difficulty end-to-end synthesis/summarization models have in aggregating information from contradictory sources [23], automatically generated summaries based solely on the included clinical trials, without considering the meta-analysis, may not be reliable. Since the process of systematic review consists of multiple steps, directly utilizing LLMs as an end-to-end pipeline may increase the risk of incorrect outcomes due to errors in upstream tasks, such as searching, screening, and appraisal. An LLM used as an end-to-end system is also more difficult to interpret than a system dedicated to a single subtask of the process.
Another factor contributing to the disparity between input evidence and summary is the accessibility of the included clinical studies, which is constrained by the copyright policies of the journals in which the studies are published. For instance, if LLMs are instructed to summarize ten studies, yet only eight are publicly accessible and included as input, the models are compelled to generate speculative content to compensate for the missing sources of evidence. Therefore, comprehensive training data is critical for developing accountable models to summarize clinical evidence. All the information in the targeted summary should be referenced or derived from the included clinical trials, which are considered high-quality evidence. Additionally, a reliable implementation of the automatic meta-analysis workflow is critical to ensure the correctness of the statistically synthesized effect measures and their corresponding precision.
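To make concrete what such a workflow must get right, the following is a minimal sketch of fixed-effect, inverse-variance pooling of per-study log odds ratios. The study values are illustrative placeholders; a production workflow would also handle random-effects models, heterogeneity statistics, and data validation.

```python
import math

# Illustrative (hypothetical) per-study effect estimates extracted from
# included RCTs: log odds ratios and their standard errors.
studies = [
    {"log_or": -0.35, "se": 0.18},
    {"log_or": -0.22, "se": 0.25},
    {"log_or": -0.41, "se": 0.30},
]

# Fixed-effect inverse-variance pooling: weight each study by 1 / variance.
weights = [1.0 / (s["se"] ** 2) for s in studies]
pooled_log_or = sum(w * s["log_or"] for w, s in zip(weights, studies)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

# 95% confidence interval on the pooled log odds ratio.
ci_low = pooled_log_or - 1.96 * pooled_se
ci_high = pooled_log_or + 1.96 * pooled_se

print(f"Pooled OR: {math.exp(pooled_log_or):.2f} "
      f"(95% CI {math.exp(ci_low):.2f} to {math.exp(ci_high):.2f})")
```

An automatically generated summary that reports a pooled effect should be checkable against such a deterministic computation rather than left to free-form generation.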
Further, it is important to determine how LLMs should be evaluated in evidence synthesis. While multiple automatic evaluation metrics have been developed for assessing AI-generated summaries, they do not correlate strongly with human expert evaluations of factual consistency, relevance, coherence, and fluency [8–10,24]. Thus, there is a need for a comprehensive automatic evaluation protocol for evidence synthesis, as well as further research on evaluation metrics that better correlate with human judgments [25,26]. These automatic evaluation metrics should complement, not replace, manual evaluations by domain experts, especially considering the high-stakes nature of the task.
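As a minimal sketch of how such a metric-versus-expert comparison might look, the snippet below correlates hypothetical automatic metric scores with hypothetical expert ratings of factual consistency using Kendall's rank correlation; all numbers are invented for illustration.

```python
from scipy.stats import kendalltau

# Hypothetical scores for the same set of AI-generated evidence summaries:
# an automatic metric (e.g., a ROUGE-style score) versus expert ratings of
# factual consistency on a 1-5 scale.
metric_scores = [0.42, 0.37, 0.55, 0.61, 0.48, 0.33]
expert_ratings = [3, 4, 2, 5, 4, 2]

# Rank correlation measures how often the metric orders summaries the same
# way experts do; a low tau signals the metric is a poor proxy for judgment.
tau, p_value = kendalltau(metric_scores, expert_ratings)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```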
Beyond the above issues, parametric knowledge bias is a conspicuous problem [27]. This occurs when models depend on the intrinsic parametric knowledge acquired during training, rather than the information provided in the input source, when generating summaries. To alleviate this issue, retrieval-augmented generation [28], which retrieves references and incorporates portions of them into the generated output, can be employed to improve the accuracy and grounding of the generation. It enables healthcare providers to verify the evidence synthesis against the referenced studies and assess the quality of the references [29].
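The sketch below illustrates the retrieve-then-generate pattern under simplifying assumptions: the `embed` function is a placeholder for a real sentence-embedding model, the assembled prompt would be passed to an LLM of choice, and none of the names refer to a specific library API.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: deterministic pseudo-random vector per text.
    A real sentence-embedding model would be called here (assumption)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def retrieve(question: str, trial_abstracts: list[str], k: int = 3) -> list[str]:
    """Return the k abstracts most similar to the clinical question."""
    q = embed(question)
    scored = []
    for doc in trial_abstracts:
        d = embed(doc)
        scored.append((float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d))), doc))
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:k]]

def build_prompt(question: str, retrieved: list[str]) -> str:
    """Ground the summary request in explicitly numbered source trials so a
    reviewer can verify each claim against its reference."""
    sources = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(retrieved))
    return (
        "Summarize the evidence answering the question below, citing the "
        "numbered sources for every claim and stating when evidence is "
        f"insufficient.\n\nQuestion: {question}\n\nSources:\n{sources}"
    )
```

Because every claim in the completion is tied to a numbered source, the output remains auditable even when the model also draws on its parametric knowledge.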
Causality
Estimating treatment effects is important for informing clinical decisions. For decades, RCTs have been considered the gold standard: patients are randomly assigned to the treatment or control group so that observable confounding factors are evenly distributed between the two groups. However, RCTs have several limitations that must be recognized. For example, it is unethical to explore all possible treatment options when doing so risks harming participants. In addition, adequately powered RCTs are difficult to conduct for rare diseases. Thus, as a supplement to trial evidence, observational data have been explored for estimating treatment effects [2]. Yet, despite the promising abundance of records in observational studies, the treated and untreated groups may not be directly comparable because of confounding factors. Even more problematic, certain confounding factors, such as a patient’s familial situation or economic status, may not be observable. As such, causal inference remains a challenge in leveraging observational data.
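To ground the confounding problem, the following sketch contrasts a naive group comparison with inverse-probability-of-treatment weighting on a synthetic observational cohort. The data and the single observed confounder are invented for illustration, and no weighting scheme can correct for confounders that are never measured.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Synthetic cohort with one observed confounder (age) that drives both
# treatment assignment and outcome; the true treatment effect is 0.3.
age = rng.normal(60, 10, n)
treated = rng.binomial(1, 1 / (1 + np.exp(-(age - 60) / 10)))
outcome = 0.3 * treated + 0.02 * age + rng.normal(0, 1, n)

# Naive comparison of group means is confounded by age.
naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()

# Estimate propensity scores from the observed covariate, then apply
# inverse-probability-of-treatment weighting (Hajek-style estimator).
ps = LogisticRegression().fit(age.reshape(-1, 1), treated).predict_proba(
    age.reshape(-1, 1))[:, 1]
mean_treated = np.sum(treated * outcome / ps) / np.sum(treated / ps)
mean_control = np.sum((1 - treated) * outcome / (1 - ps)) / np.sum((1 - treated) / (1 - ps))

print(f"naive: {naive:.2f}, IPW: {mean_treated - mean_control:.2f} (true effect 0.3)")
# Unobserved confounders (e.g., economic status) would still bias both estimates.
```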
LLMs have demonstrated numerous breakthroughs in making inferences based on patterns (e.g., writing poems or source code [30]) and hold promise for assisting causal inference tasks by identifying confounding factors or generating descriptions of causal relationships. However, it is notable that LLMs may still “exhibit unpredictable failure modes [in causal inference]” [31]. Furthermore, there is ongoing debate regarding whether LLMs truly perform causal reasoning or merely mimic memorized responses. As such, research is needed into how best to characterize LLMs’ capacity for causality and understand their underlying mechanisms.
Transparency
In healthcare, it is critical that AI systems be transparent, given their proximity to human lives, and that patients understand how clinicians use AI-generated recommendations. In the context of LLMs, the challenges are unique compared with traditional AI approaches because of the complexity of the models, the unpredictability of their capabilities [32], and proprietary technology.
Modern LLMs are typically neural models consisting of multiple layers of interconnected neurons, and the relationship between input and output can be highly complex. It is thus challenging to understand how a specific summary is generated from the input information, which poses a challenge for building transparent evidence-synthesis systems. This complexity is exemplified by the exposure bias problem, which refers to the difference in decoding behavior between the training and inference phases. It is common practice to train models to generate the next token conditioned on ground-truth prefix sequences, a strategy known as teacher forcing. During inference, however, the model generates the next token conditioned on the sequences it has itself generated, so it can be biased to perform well only when conditioned on ground-truth prefixes. This discrepancy can trigger progressively more erroneous generation, particularly as the target sequence lengthens. While the problem was identified almost a decade ago [33], it is still unclear to what extent exposure bias affects the quality of model output [34].
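The toy sketch below, with a hard-coded lookup standing in for an autoregressive model, contrasts the two conditioning regimes; it is only meant to show how a single early error compounds under free-running decoding while teacher forcing masks it.

```python
def next_token(prefix: tuple[str, ...]) -> str:
    """Toy stand-in for an autoregressive model: looks up the next token for
    a known prefix and emits "<unk>" once the prefix drifts off-distribution.
    A real LLM would be called here."""
    table = {
        (): "the",
        ("the",): "study",                      # the model's one error ("trial" expected)
        ("the", "trial"): "showed",
        ("the", "trial", "showed"): "benefit",
    }
    return table.get(prefix, "<unk>")

reference = ["the", "trial", "showed", "benefit"]

# Teacher forcing (training): every step is conditioned on the ground-truth
# prefix, so the single error does not contaminate later predictions.
teacher_forced = [next_token(tuple(reference[:i])) for i in range(len(reference))]

# Free-running decoding (inference): every step is conditioned on the model's
# own outputs, so the early error compounds -- the exposure bias problem.
generated: list[str] = []
for _ in range(len(reference)):
    generated.append(next_token(tuple(generated)))

print("teacher forcing:", teacher_forced)   # ['the', 'study', 'showed', 'benefit']
print("free running  :", generated)         # ['the', 'study', '<unk>', '<unk>']
```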
In recent years, researchers in AI and human-computer interaction (HCI) have focused on developing and evaluating approaches to achieve transparency in clinical AI systems, such as interpretable model training techniques, standard protocols for documenting the data and the model training process [35–37], and techniques that convey the confidence of AI predictions in ways consistent with how clinicians weigh uncertainty in clinical decision making and explain their decisions to patients [38]. Additional work has focused on establishing community guidelines, informed by human-centered design principles, for applications that potentially influence health behaviors [39]. These prior studies establish a solid foundation for addressing the challenges generative AI brings and can therefore be valuable for guiding the development of AI-generated summaries.
With respect to EBM, we need to create teams of diverse stakeholders, including, at a minimum, patients, healthcare practitioners, and policymakers. Representing and supporting their needs is critical to ensuring that the generative AI research community develops meaningful and respectful technologies. For instance, clinical studies summarized for patients should prioritize readability and comprehensibility, whereas summaries intended for healthcare practitioners should provide sufficient detail to support trustworthy decision-making. Additionally, versions crafted for policymakers should highlight potential risks in the synthesis process and discuss their broader implications.
Finally, developing models with built-in structure is crucial to achieving transparency. For example, Saha et al. [40] use systematically organized binary trees that symbolically represent the sequential generative process of a summary from the source document and highlight the pieces of evidence that most influence the synthesis, and Ramprasad et al. [10] separate conditions, interventions, and outcomes in the input RCTs to yield an aspect-wise interpretable summary. This also means that transparency is not an afterthought but a central element that is diligently crafted and assessed with the expertise of domain professionals.
Fairness
LLMs offer significant benefits for addressing biases in clinical trials and enhancing research inclusivity. For example, LLMs can process and analyze vast amounts of data from diverse populations, which can help make findings more representative and applicable to a wider range of patient groups. Moreover, LLMs can process information in multiple languages, making clinical trial data and evidence synthesis accessible to researchers and practitioners worldwide. However, while some LLMs can generate documentation on par with clinicians, there is evidence that LLMs can also propagate biased results [41]. For example, LLMs may fail to capture the demographic diversity of clinical conditions and generate clinical vignettes that perpetuate stereotypes about how conditions present across demographic groups [42]. Biases present in an LLM’s training data can also be perpetuated or amplified in its analyses and summaries. In the healthcare domain, this is particularly concerning, as biased data can lead to unfair and harmful outcomes. Moreover, LLMs may not fully understand the cultural, social, and ethical contexts of clinical trials, potentially leading to oversimplified or inappropriate conclusions that are not fair or applicable to all patient groups. Therefore, the development, deployment, and use of generative AI should strive for fairness.
While quality assurance for clinical studies is orthogonal to model training and inference, bias in the underlying studies can be amplified in the evidence summaries synthesized from them. Meanwhile, the spread of misinformation and the fast publication of invalid evidence have dramatically undermined the public’s trust in biomedical research [43]. Even though evidence synthesis may be less prone to bias than other tasks in evidence-based medicine, it is still crucial to remain cautious. One possible mitigation is to construct prompts (i.e., instructions to LLMs) that explicitly request bias-aware behavior, as sketched below. However, such prompting alone is unlikely to be sufficient, because biases can influence the overall quality of model development and may be embedded more broadly and deeply within the LLMs themselves.
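A minimal sketch of such a bias-aware prompt follows; the wording is hypothetical, and whether instructions like these measurably reduce bias in generated syntheses is an open empirical question.

```python
def bias_aware_prompt(question: str, evidence: str) -> str:
    """Assemble a synthesis prompt that asks the model to surface, rather
    than smooth over, gaps in population coverage. The instructions are
    illustrative only; their effect on model behavior is not guaranteed."""
    return (
        "Synthesize the evidence below to answer the clinical question.\n"
        "- Report the demographics of the study populations.\n"
        "- State explicitly which groups are under-represented or absent.\n"
        "- Do not generalize findings beyond the populations studied.\n\n"
        f"Question: {question}\n\nEvidence:\n{evidence}"
    )
```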
In the context of real-world evidence, machine learning (ML) models have been found to perform poorly on under-represented groups [44,45]. Situations in which one patient group is under-tested compared to others are called disparate censorship. For training supervised learning models, patient outcomes are often labeled based on diagnostic test results. While medical records list the diseases patients have been diagnosed with, the absence of a diagnosis does not mean a patient does not have the disease. Assuming undiagnosed patients are healthy can lead to biased ML models that produce incorrect summaries, particularly for patient groups with limited access to healthcare [44,45]. Clinical decisions based on biased generations, e.g., evidence summaries, can further harm already under-tested populations [46]. The research community needs to focus on assessing LLMs to ensure they do not exhibit bias, discrimination, or stigmatization toward individuals or groups [42,47].
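The small simulation below, with invented prevalence and testing rates, illustrates how the "untested means healthy" labeling rule makes an under-tested group look healthier than it really is; any model trained on those labels inherits that distortion.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Two patient groups with identical true disease prevalence (10%) but very
# different rates of diagnostic testing (disparate censorship).
group = rng.binomial(1, 0.5, n)                 # 0 = well-tested, 1 = under-tested
disease = rng.binomial(1, 0.10, n)
tested = rng.binomial(1, np.where(group == 0, 0.9, 0.3))

# Common but flawed labeling rule: untested patients are labeled "healthy".
label = disease * tested

for g in (0, 1):
    print(f"group {g}: true prevalence {disease[group == g].mean():.3f}, "
          f"labeled prevalence {label[group == g].mean():.3f}")
# The under-tested group appears far healthier in the labels than in reality.
```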
Model Generalizability
Another crucial component of achieving trustworthiness for generative AI is generalizability. The models must behave reliably and reproducibly while minimizing unintentional and unexpected consequences, even when the operating environment differs from the training environment.
While most popular LLMs are trained on diverse text resources and have demonstrated considerable proficiency, they often lack a deep understanding of specialized fields, particularly the clinical domain. Domain-specific LLMs, for instance, may comprehend clinical knowledge, such as UMLS terminology, better than their generalist counterparts [48,49]. To this end, we should prioritize constructing domain-specific LLMs and benchmarks [50], as generic benchmarks are not of primary importance in evidence-based medicine. So far, multiple LLMs have been developed for medical applications and open-sourced, such as BioBERT, ClinicalBERT, PubMedBERT, BioMedLM, and GeneGPT [51–55]. In addition, many companies have adapted their closed-source models for medical applications, such as GPT-3.5 and GPT-4 by OpenAI and Med-PaLM by Google [56–58]. These LLMs were pre-trained on datasets that do not include the most recent knowledge in the medical domain, which is a fast-evolving field. As such, developing domain-specific models is an ongoing process and a high priority.
Another challenge for generalizability in evidence synthesis is the ability to process long inputs. Even though there is a trend toward increasing the maximum number of input tokens, the extended context window may still not be long enough to encompass all clinical trials or notes involved in an evidence summary. To combat this issue, several strategies have been proposed, such as chunking the document, hierarchical summarization, and iterative refinement by looping over multiple sections or documents [59,60]. A potential solution for summarizing multiple clinical trials that cannot fit in a single prompt is maintaining historical interaction records, which can be used to process clinical trials in batches. Additionally, hierarchical summarization mechanisms could enable synthesizing evidence from concise summaries of the clinical trials, rather than directly from the trials themselves, as sketched below.
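The following is a minimal sketch of the chunk-then-combine idea under simplifying assumptions: `summarize` is a placeholder for an LLM call, the character budget stands in for a real token limit, and batching is done greedily.

```python
def summarize(text: str, instruction: str) -> str:
    """Placeholder for a call to an LLM; any real API would be substituted
    here (assumption, no specific provider implied)."""
    return f"[summary of {len(text)} characters: {instruction}]"

def hierarchical_synthesis(trial_reports: list[str], batch_chars: int = 8000) -> str:
    """Summarize batches of trials first, then synthesize across the much
    shorter partial summaries, so no single prompt exceeds the context window."""
    # Stage 1: group trial reports into batches that respect a length budget.
    batches, current = [], ""
    for report in trial_reports:
        if current and len(current) + len(report) > batch_chars:
            batches.append(current)
            current = ""
        current += report + "\n\n"
    if current:
        batches.append(current)
    partial = [summarize(b, "extract PICO elements and effect estimates")
               for b in batches]

    # Stage 2: synthesize evidence across the per-batch summaries.
    return summarize("\n\n".join(partial),
                     "synthesize the evidence across all trials")
```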
Data Privacy and Governance
There are numerous concerns over how an LLM may share and consume patient information. It also remains unclear whether patients would consent to the use of information about them in an LLM, particularly if they are not informed about, or are unable to comprehend, what the LLM would be used for. In this respect, if patient information is to be relied upon, there need to be clear descriptions of anticipated uses, even if such communication acknowledges that it is not yet known what those uses may be.
Even if patients consent to transfer information about them to an LLM, they may not wish to have potentially identifying or stigmatizing information disclosed to or retained by the LLM. While LLMs are designed to create probabilistic representations of the relationships between features, if not designed appropriately, they may memorize the training instances supplied to them [61]. As such, when the LLM is queried, it may be possible for an end user to recover training data, which could have legal implications [62]. Even if individual-level records cannot be recovered from an LLM, it may still be possible to detect that a patient contributed to the training data [63]. Such membership inferences could be problematic if, for instance, the LLM has been fine-tuned on patients diagnosed with a sensitive clinical phenomenon, such as a sexually transmitted disease. There have been various investigations into incorporating formal privacy principles, such as differential privacy [62], into the construction and use of LLMs; however, it is currently unclear how such noise-based mechanisms influence clinical trial synthesis capabilities. In this respect, it is critical to continue research into how best to train and apply LLMs in a manner that respects the privacy of the patients upon whom the technology is based.
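For intuition about the noise-based mechanisms mentioned above, the sketch below applies the basic Laplace mechanism to a single counting query. This illustrates the principle only; applying differential privacy to LLM training (e.g., via DP-SGD) is far more involved, and, as noted, its effect on synthesis quality remains unclear.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float) -> float:
    """Laplace mechanism for a counting query: a count has sensitivity 1, so
    adding Laplace noise with scale 1/epsilon yields epsilon-differential
    privacy for the released statistic."""
    rng = np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: releasing how many cohort patients carry a sensitive diagnosis
# (the count 137 is purely illustrative).
print(laplace_count(true_count=137, epsilon=1.0))
```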
Patient Safety
Safety also remains a pressing concern when utilizing clinical evidence summarized by LLMs, as any inaccuracies could have far-reaching implications for high-stakes healthcare decisions. To mitigate these risks, generative AI should come equipped with protective measures that provide a backup plan in case of complications. As AI continues to advance, the goal is for LLMs to evolve from simple tools into collaborative partners operating in sync with human experts. While using LLMs to quickly analyze and synthesize large amounts of evidence could expedite the evidence synthesis process, a secure and reliable architecture is essential so that expert meta-analysts can trust these AI-generated summaries.
One possible solution revolves around a synergy between humans and generative AI, with a focus on their mutual safety. This collaborative framework would involve human experts regularly reviewing and offering feedback on summaries produced by LLMs, leading to iterative improvements. In this way, humans and generative AI can develop trust in each other and evolve together, enhancing their teamwork. We believe this integration will extend capabilities beyond what humans alone can achieve, introduce new capacities, and, crucially, embody “high confidence.” Accordingly, such systems should exhibit reliable and predictable performance, tolerate environmental and configuration faults, and remain resilient against adversarial attacks.
Lawfulness and Regulation
Finally, generative AI should unequivocally adhere to all pertinent laws and regulations, including international humanitarian and human rights laws, a cardinal principle that carries particular significance when applying generative AI for evidence synthesis. This entails recognizing the nuanced legal landscape governing AI deployment, which may diverge from one state or country to another. Thus, it becomes imperative to construct a comprehensive legal framework addressing the accountability associated with actions taken and recommendations made by AI systems.
Under the General Data Protection Regulation (Article 9), collecting and processing sensitive personal data is subject to strict regulation [64]. The North Atlantic Treaty Organization (NATO) has established guidelines promoting the responsible use of AI, including a principle that AI applications will be developed and employed in compliance with both domestic and international legal frameworks [65]. Recently, the United States Congress has held a series of hearings, with testimony from AI companies and researchers, on matters of AI regulation and governance [66,67]. The European Union has also initiated the process of creating regulations on the development and use of generative AI [68]. In alignment with these movements, LLMs for evidence synthesis and beyond must be developed with these legal considerations in mind to safeguard patients, clinicians, and AI developers from unintended repercussions.
Conclusion
Generative AI for clinical evidence synthesis has already made a positive impact and is set to continue doing so in ways we cannot yet imagine. However, it is equally vital that risks and other adverse effects are properly mitigated. To this end, building generative AI systems that are genuinely trustworthy is crucial. This perspective outlines several directions that efforts to achieve trustworthy generative AI could take; ideally, these efforts would work in concert, as they overlap in scope.
While this perspective primarily targets evidence synthesis, some of the principles could prove beneficial in other domains. The present moment calls for building a culture of Trustworthy AI within the JBI community, enabling the benefits of AI to be fully realized within our healthcare system.
Acknowledgments
This research was sponsored by the National Library of Medicine grants R01LM009886, R01LM014344, and R01LM014306, and by the National Center for Advancing Translational Sciences award UL1TR001873. It was also supported by the NIH Intramural Research Program, National Library of Medicine.
Footnotes
Conflict of Interest
None.
References
- 1.Sherman RE, Anderson SA, Dal Pan GJ, Gray GW, Gross T, Hunter NL, et al. Real-World Evidence - What Is It and What Can It Tell Us? N Engl J Med. 2016;375: 2293–2297. [DOI] [PubMed] [Google Scholar]
- 2.Schuemie MJ, Hripcsak G, Ryan PB, Madigan D, Suchard MA. Empirical confidence interval calibration for population-level effect estimation studies in observational healthcare data. Proc Natl Acad Sci U S A. 2018;115: 2571–2577. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Gershman B, Guo DP, Dahabreh IJ. Using observational data for personalized medicine when clinical trial evidence is limited. Fertil Steril. 2018;109: 946–951. [DOI] [PubMed] [Google Scholar]
- 4.Wallace BC, Dahabreh IJ, Schmid CH, Lau J, Trikalinos TA. Chapter 12 - Modernizing Evidence Synthesis for Evidence-Based Medicine. In: Greenes RA, editor. Clinical Decision Support (Second Edition). Oxford: Academic Press; 2014. pp. 339–361. [Google Scholar]
- 5.Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021;76: 472–479. [DOI] [PubMed] [Google Scholar]
- 6.Van Noorden R. Medicine is plagued by untrustworthy clinical trials. How many studies are faked or flawed? Nature. 2023;619: 454–458. [Google Scholar]
- 7.Peng Y, Rousseau JF, Shortliffe EH, Weng C. AI-generated text may have a role in evidence-based medicine. Nat Med. 2023;29: 1593–1594. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Tang L, Sun Z, Idnay B, Nestor JG, Soroush A, Elias PA, et al. Evaluating large language models on medical evidence summarization. NPJ Digit Med. 2023;6: 158. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Wallace BC, Saha S, Soboczenski F, Marshall IJ. Generating (Factual?) Narrative Summaries of RCTs: Experiments with Neural Multi-Document Summarization. AMIA Jt Summits Transl Sci Proc. 2021;2021: 605–614. [PMC free article] [PubMed] [Google Scholar]
- 10.Ramprasad S, Marshall IJ, McInerney DJ, Wallace BC. Automatically Summarizing Evidence from Clinical Trials: A Prototype Highlighting Current Challenges. Proc Conf Assoc Comput Linguist Meet. 2023;2023: 236–247. [PMC free article] [PubMed] [Google Scholar]
- 11.Zhang G, Roychowdhury D, Li P, Wu H-Y, Zhang S, Li L, et al. Identifying Experimental Evidence in Biomedical Abstracts Relevant to Drug-Drug Interactions. Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. New York, NY, USA: Association for Computing Machinery; 2018. pp. 414–418. [Google Scholar]
- 12.Zhang S, Wu H, Wang L, Zhang G, Rocha LM, Shatkay H, et al. Translational drug–interaction corpus. Database. 2022;2022: baac031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Kang T, Sun Y, Kim JH, Ta C, Perotte A, Schiffer K, et al. EvidenceMap: a three-level knowledge representation for medical evidence computation and comprehension. J Am Med Inform Assoc. 2023;30: 1022–1031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Turfah A, Liu H, Stewart LA, Kang T, Weng C. Extending PICO with Observation Normalization for Evidence Computing. Stud Health Technol Inform. 2022;290: 268–272. [DOI] [PubMed] [Google Scholar]
- 15.Chen Z, Liu H, Liao S, Bernard M, Kang T, Stewart LA, et al. Representation and Normalization of Complex Interventions for Evidence Computing. Stud Health Technol Inform. 2022;290: 592–596. [DOI] [PubMed] [Google Scholar]
- 16.Kang T, Turfah A, Kim J, Perotte A, Weng C. A neuro-symbolic method for understanding free-text medical evidence. J Am Med Inform Assoc. 2021;28: 1703–1711. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Zhang G, Bhattacharya M, Wu HY, Li P. Identifying articles relevant to drug-drug interaction: Addressing class imbalance. 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2017. Available: https://ieeexplore.ieee.org/abstract/document/8217818/ [Google Scholar]
- 18.Jiang X, Ringwald M, Blake JA, Arighi C, Zhang G, Shatkay H. An effective biomedical document classification scheme in support of biocuration: addressing class imbalance. Database. 2019;2019. doi: 10.1093/database/baz045 [DOI] [Google Scholar]
- 19.Li P, Jiang X, Zhang G, Trabucco JT, Raciti D, Smith C, et al. Corrigendum to: Utilizing image and caption information for biomedical document classification. Bioinformatics. 2021;37: 3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.U.S. DEPARTMENT OF HEALTH & HUMAN SERVICES Trustworthy AI Playbook. 2021. Available: https://www.hhs.gov/sites/default/files/hhs-trustworthy-ai-playbook.pdf [Google Scholar]
- 21.Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of Hallucination in Natural Language Generation. ACM Comput Surv. 2023;55: 1–38. [Google Scholar]
- 22.Savović J, Jones HE, Altman DG, Harris RJ, Jüni P, Pildal J, et al. Influence of reported study design characteristics on intervention effect estimates from randomized, controlled trials. Ann Intern Med. 2012;157: 429–438. [DOI] [PubMed] [Google Scholar]
- 23.DeYoung J, Martinez SC, Marshall IJ, Wallace BC. Do Multi-Document Summarization Models Synthesize? arXiv [cs.CL]. 2023. Available: http://arxiv.org/abs/2301.13844 [Google Scholar]
- 24.Shaib C, Li M, Joseph S, Marshall I, Li JJ, Wallace B. Summarizing, Simplifying, and Synthesizing Medical Evidence using GPT-3 (with Varying Success). Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Toronto, Canada: Association for Computational Linguistics; 2023. pp. 1387–1407. [Google Scholar]
- 25.Fabbri AR, Kryściński W, McCann B, Xiong C, Socher R, Radev D. SummEval: Re-evaluating summarization evaluation. Trans Assoc Comput Linguist. 2021;9: 391–409. [Google Scholar]
- 26.Wang LL, Otmakhova Y, DeYoung J, Truong TH, Kuehl B, Bransom E, et al. Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Toronto, Canada: Association for Computational Linguistics; 2023. pp. 9871–9889. [Google Scholar]
- 27.Longpre S, Perisetla K, Chen A, Ramesh N, DuBois C, Singh S. Entity-Based Knowledge Conflicts in Question Answering. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics; 2021. pp. 7052–7063. [Google Scholar]
- 28.Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc.; 2020. pp. 9459–9474. [Google Scholar]
- 29.Jin Q, Leaman R, Lu Z. Retrieve, Summarize, and Verify: How Will ChatGPT Affect Information Seeking from the Medical Literature? J Am Soc Nephrol. 2023;34: 1302–1304. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Chen M, Tworek J, Jun H, Yuan Q, de Oliveira Pinto HP, Kaplan J, et al. Evaluating Large Language Models Trained on Code. arXiv [cs.LG]. 2021. Available: http://arxiv.org/abs/2107.03374 [Google Scholar]
- 31.Kıcıman E, Ness R, Sharma A, Tan C. Causal Reasoning and Large Language Models: Opening a New Frontier for Causality. arXiv [cs.AI]. 2023. Available: http://arxiv.org/abs/2305.00050 [Google Scholar]
- 32.Ganguli D, Hernandez D, Lovitt L, Askell A, Bai Y, Chen A, et al. Predictability and Surprise in Large Generative Models. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. New York, NY, USA: Association for Computing Machinery; 2022. pp. 1747–1764. [Google Scholar]
- 33.Bengio S, Vinyals O, Jaitly N, Shazeer N. Scheduled sampling for sequence prediction with recurrent Neural networks. Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1. Cambridge, MA, USA: MIT Press; 2015. pp. 1171–1179. [Google Scholar]
- 34.He T, Zhang J, Zhou Z, Glass J. Exposure Bias versus Self-Recovery: Are Distortions Really Incremental for Autoregressive Text Generation? Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics; 2021. pp. 5087–5102. [Google Scholar]
- 35.Doshi-Velez F, Kim B. Towards A Rigorous Science of Interpretable Machine Learning. arXiv [stat.ML]. 2017. Available: http://arxiv.org/abs/1702.08608 [Google Scholar]
- 36.Du M, Liu N, Hu X. Techniques for interpretable machine learning. Commun ACM. 2019;63: 68–77. [Google Scholar]
- 37.Zhao H, Chen H, Yang F, Liu N, Deng H, Cai H, et al. Explainability for Large Language Models: A Survey. arXiv [cs.CL]. 2023. Available: http://arxiv.org/abs/2309.01029 [Google Scholar]
- 38.Basu C, Vasu R, Yasunaga M, Yang Q. Med-EASi: Finely Annotated Dataset and Models for Controllable Simplification of Medical Texts. arXiv [cs.CL]. 2023. Available: http://arxiv.org/abs/2302.09155 [Google Scholar]
- 39.Levy M, Pauzner M, Rosenblum S, Peleg M. Achieving trust in health-behavior-change artificial intelligence apps (HBC-AIApp) development: A multi-perspective guide. J Biomed Inform. 2023;143: 104414. [DOI] [PubMed] [Google Scholar]
- 40.Saha S, Zhang S, Hase P, Bansal M. Summarization Programs: Interpretable Abstractive Summarization with Neural Modular Trees . arXiv [cs.CL]. 2022. Available: http://arxiv.org/abs/2209.10492 [Google Scholar]
- 41.Vera Liao Q, Vaughan JW. AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap. arXiv [cs.HC]. 2023. Available: http://arxiv.org/abs/2306.01941 [Google Scholar]
- 42.Zack T, Lehman E, Suzgun M, Rodriguez JA, Celi LA, Gichoya J, et al. Coding inequity: Assessing GPT-4’s potential for perpetuating racial and gender biases in healthcare. bioRxiv. 2023. doi: 10.1101/2023.07.13.23292577 [DOI] [Google Scholar]
- 43.Bromme R, Mede NG, Thomm E, Kremer B, Ziegler R. An anchor in troubled times: Trust in science before and within the COVID-19 pandemic. PLoS One. 2022;17: e0262823. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Buolamwini J, Gebru T. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In: Friedler SA, Wilson C, editors. Proceedings of the 1st Conference on Fairness, Accountability and Transparency. PMLR; 23–24 Feb 2018. pp. 77–91. [Google Scholar]
- 45.Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366: 447–453. [DOI] [PubMed] [Google Scholar]
- 46.Chang T, Sjoding MW, Wiens J. Disparate Censorship & Undertesting: A Source of Label Bias in Clinical Machine Learning. Proc Mach Learn Res. 2022;182: 343–390. [PMC free article] [PubMed] [Google Scholar]
- 47.Poulain R, Bin Tarek MF, Beheshti R. Improving Fairness in AI Models on Electronic Health Records: The Case for Federated Learning Methods. Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency. New York, NY, USA: Association for Computing Machinery; 2023. pp. 1599–1608. [Google Scholar]
- 48.Lehman E, Hernandez E, Mahajan D, Wulff J, Smith MJ, Ziegler Z, et al. Do We Still Need Clinical Language Models? arXiv [cs.CL]. 2023. Available: http://arxiv.org/abs/2302.08091 [Google Scholar]
- 49.Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Publisher Correction: Large language models encode clinical knowledge. Nature. 2023;620: E19. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Tian S, Jin Q, Yeganova L, Lai P-T, Zhu Q, Chen X, et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief Bioinform. 2023;25. doi: 10.1093/bib/bbad493 [DOI] [Google Scholar]
- 51.Huang K, Altosaar J, Ranganath R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv [cs.CL]. 2019. Available: http://arxiv.org/abs/1904.05342 [Google Scholar]
- 52.Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, et al. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans Comput Healthcare. 2021;3: 1–23. [Google Scholar]
- 53.Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36: 1234–1240. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Jin Q, Yang Y, Chen Q, Lu Z. GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information. ArXiv. 2023. Available: https://www.ncbi.nlm.nih.gov/pubmed/37131884 [Google Scholar]
- 55.Venigalla A, Frankle J, Carbin M. Biomedlm: a domain-specific large language model for biomedical text. MosaicML Accessed: Dec. 2022. [Google Scholar]
- 56.Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33: 1877–1901. [Google Scholar]
- 57.OpenAI. GPT-4 technical report. arXiv:2303.08774. 2023. [Google Scholar]
- 58.Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023;620: 172–180. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Topsakal O, Akinci TC. Creating large language model applications utilizing langchain: A primer on developing llm apps fast. of the International Conference on Applied …. 2023. Available: https://www.researchgate.net/profile/Oguzhan-Topsakal/publication/372669736_Creating_Large_Language_Model_Applications_Utilizing_LangChain_A_Primer_on_Developing_LLM_Apps_Fast/links/64d114a840a524707ba4a419/Creating-Large-Language-Model-Applications-Utilizing-LangChain-A-Primer-on-Developing-LLM-Apps-Fast.pdf [Google Scholar]
- 60.Ma C, Zhang WE, Guo M, Wang H, Sheng QZ. Multi-document Summarization via Deep Learning Techniques: A Survey. ACM Comput Surv. 2022;55: 1–37. [Google Scholar]
- 61.Biderman S, Prashanth US, Sutawika L, Schoelkopf H, Anthony Q, Purohit S, et al. Emergent and Predictable Memorization in Large Language Models. arXiv [cs.CL]. 2023. Available: http://arxiv.org/abs/2304.11158 [Google Scholar]
- 62.Bender EM, Gebru T, McMillan-Major A, Shmitchell S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. New York, NY, USA: Association for Computing Machinery; 2021. pp. 610–623. [Google Scholar]
- 63.Duan H, Dziedzic A, Yaghini M, Papernot N, Boenisch F. On the privacy risk of in-context learning. [cited 22 Sep 2023]. Available: https://trustnlpworkshop.github.io/papers/13.pdf [Google Scholar]
- 64.Art. 9 GDPR – Processing of special categories of personal data - General Data Protection Regulation (GDPR). In: General Data Protection Regulation (GDPR) [Internet]. [cited 22 Jan 2024]. Available: https://gdpr-info.eu/art-9-gdpr/ [Google Scholar]
- 65.Stanley-Lockman Z, Christie EH. An artificial intelligence strategy for NATO. NATO Review. 2021;25. Available: https://www.nato.int/docu/review/articles/2021/10/25/an-artificial-intelligence-strategy-fornato/index.html [Google Scholar]
- 66.Oversight of A.I.: Principles for regulation. 25 Jul 2023. [cited 22 Sep 2023]. Available: https://www.judiciary.senate.gov/committee-activity/hearings/oversight-of-ai-principles-for-regulation [Google Scholar]
- 67.Governing AI through acquisition and procurement. In: Committee on Homeland Security & Governmental Affairs [Internet]. U.S. Senate Committee on Homeland Security and Governmental Affairs Committee; 6 Sep 2023. [cited 22 Sep 2023]. Available: https://www.hsgac.senate.gov/hearings/governing-ai-through-acquisition-and-procurement-2/ [Google Scholar]
- 68.EU AI Act: first regulation on artificial intelligence. 6 Aug 2023. [cited 26 Sep 2023]. Available: https://www.europarl.europa.eu/news/en/headlines/society/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence [Google Scholar]

