. Author manuscript; available in PMC: 2025 May 1.
Published in final edited form as: J Biomed Inform. 2024 Apr 10;153:104640. doi: 10.1016/j.jbi.2024.104640

Leveraging generative AI for clinical evidence synthesis needs to ensure trustworthiness

Gongbo Zhang a, Qiao Jin b, Denis Jered McInerney c, Yong Chen d, Fei Wang e,f, Curtis L Cole e,g, Qian Yang h, Yanshan Wang i, Bradley A Malin j,k,l, Mor Peleg m, Byron C Wallace c, Zhiyong Lu b, Chunhua Weng a,*, Yifan Peng e,*
PMCID: PMC11217921  NIHMSID: NIHMS2002077  PMID: 38608915

Abstract

Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, hold promise in facilitating the arduous task. However, developing accountable, fair, and inclusive models remains a complicated undertaking. In this perspective, we discuss the trustworthiness of generative AI in the context of automated summarization of medical evidence.

Keywords: Evidence-based medicine, Trustworthy generative AI, Large language models, Medical evidence summarization

1. Introduction

Evidence-based Medicine (EBM) aims to improve healthcare decisions by integrating the best available evidence, clinical proficiency, and patient values. One widely utilized method for generating optimally accessible evidence is the systematic review of Randomized Controlled Trials (RCTs), which involves identifying, appraising, and synthesizing evidence from relevant studies addressing a clinical question. In addition, in the current era of big data, there is a growing recognition of the need to broaden our focus beyond RCTs and embrace diverse evidence sources. For example, Real World Data (RWD), such as electronic health records, insurance billing databases, and disease registries [1], generates vital Real World Evidence, which offers unique and current insights into treatment effectiveness in broader patient populations, as well as long-term effects and rare side effects often overlooked in RCTs [2,3]. Given the sheer volume of RCTs, RWD, and other data sources, there is a critical need to summarize the evidence into a more digestible and comprehensive format to facilitate the accessibility and applicability of the findings.

Clinical evidence summarization can be defined as the act of collecting, extracting, appraising, and synthesizing relevant and reliable evidence to answer a specific clinical question. However, due to the rapid growth of the biomedical literature and clinical data, clinical evidence summarization is struggling to keep up. In the meantime, the proliferation of misinformation and of biased or incomplete summaries resulting from unreliable or contradictory evidence erodes the public’s trust in biomedical research [4,5]. Given such concerns, we need to develop new and trustworthy methods to reform systematic reviews.

In recent years, dramatic advances in Generative AI, particularly Large Language Models (LLMs), have demonstrated a remarkable potential for assisting in systematic reviews [6]. Recently, LLMs have been explored to summarize meta-analyses [7] and clinical trials [8,9]. Before the era of LLMs, AI methods were also deployed to extract evidential information [10–15] and retrieve publications on a given topic [16–18]. However, achieving trustworthy AI for clinical evidence summarization remains a complex undertaking. In this perspective, we discuss the trustworthiness of generative AI and the associated challenges and recommendations in the context of fully and semi-automated medical evidence summarization (Fig. 1).

Fig. 1.

Challenges and recommendations to achieve trustworthy [19] evidence summarization.

2. Accountability

In the context of summarizing medical evidence, the accountability of an LLM refers to the model’s ability to faithfully summarize high-quality clinical trials and the corresponding meta-analyses. In the context of health and human services, faithful or factual AI refers to systems that generate content that is factually accurate [20]. When using the term trustworthiness, we emphasize not only factualness or faithfulness but also the system’s reproducibility. While LLMs can generate semantically meaningful and grammatically correct textual outputs, these models can potentially yield factually incorrect outcomes [7]. These discrepancies can be classified as intrinsic or extrinsic hallucinations [21]. Intrinsic hallucinations refer to cases where “the generated output contradicts the input source.” By contrast, extrinsic hallucinations occur when the generated output can “neither be supported nor be contradicted by the source.”

The problem of factually incorrect outputs is partially due to differences between the sources of evidence and the resulting summary [21]. Particularly in medical evidence summarization, the findings of the included clinical trials may not inherently align with the outcomes of the meta-analysis. Systematic reviews and meta-analyses are designed to offer a statistical synthesis of the results of eligible clinical studies, rather than merely replicating them verbatim [22]. It is also plausible that crucial information from an included study might be omitted from the synthesis if perceived as offering low-quality evidence. Given the problems that end-to-end summarization models have with aggregating information from contradictory sources [23], automatically generated summaries based solely on included clinical trials, without considering the meta-analysis, may not be reliable. Since the process of systematic review consists of multiple steps, directly utilizing LLMs as an end-to-end pipeline may increase the risk of incorrect outcomes due to errors in upstream tasks, such as searching, screening, and appraisal. An LLM used as an end-to-end system is also more challenging to understand than one dedicated to a single subtask of the process.

Another factor contributing to the disparity between input evidence and summary is the accessibility of the included clinical studies, which is constrained by the copyrights of the specific journals in which the studies are published. For instance, if LLMs are instructed to summarize ten studies, yet only eight are publicly accessible and included as input, the models are compelled to generate speculative content to compensate for the missing sources of evidence. Therefore, comprehensive training data is critical for developing accountable models to summarize medical evidence. All the information in the targeted summary should be referenced or derived from the included clinical trials, which are considered high-quality evidence. Additionally, a reliable implementation of the automatic meta-analysis workflow is critical to ensure the correctness of the statistically synthesized effect measures and their associated uncertainty estimates.
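To make the statistical synthesis step concrete, a fixed-effect meta-analysis pools study-level estimates by inverse-variance weighting. The sketch below is illustrative only; the effect sizes and standard errors are invented, not drawn from any real trial:

```python
import math

def fixed_effect_meta(effects, std_errs):
    """Pool study-level effect estimates by inverse-variance weighting.

    Returns the pooled effect and its standard error.
    """
    weights = [1.0 / se ** 2 for se in std_errs]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Illustrative log-odds-ratio estimates from three hypothetical trials.
effects = [-0.30, -0.10, -0.25]
std_errs = [0.15, 0.10, 0.20]
pooled, se = fixed_effect_meta(effects, std_errs)
ci = (pooled - 1.96 * se, pooled + 1.96 * se)
print(f"pooled effect {pooled:.3f}, 95% CI [{ci[0]:.3f}, {ci[1]:.3f}]")
```

In practice, a trustworthy workflow would also assess heterogeneity (e.g., with a random-effects model) before relying on a pooled estimate.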

Further, it is important to determine how LLMs should be evaluated in evidence summarization. While multiple automatic evaluation metrics have been developed for assessing AI-generated summaries, they do not correlate strongly with human expert evaluations concerning factual consistency, relevance, coherence, and fluency [7–9,24]. Thus, there is a need for a comprehensive automatic evaluation protocol for evidence summarization, as well as further research into evaluation metrics that better correlate with human judgments [25,26]. These automatic evaluation metrics should complement manual evaluations from domain experts, not replace them, especially considering the high-stakes nature of the task.
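One way to quantify the gap between automatic metrics and expert judgment is to correlate metric scores with human ratings over a set of summaries. A minimal sketch with invented scores follows; a real protocol would use many more summaries and multiple raters:

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation between two lists of scores."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented scores for five summaries: an automatic metric (0-1) vs. an
# expert's factual-consistency rating (1-5).
metric = [0.42, 0.35, 0.51, 0.28, 0.47]
human = [4, 2, 5, 3, 4]
r = pearson(metric, human)
print(f"metric-human correlation: r = {r:.2f}")
```

A high correlation alone does not validate a metric; in this high-stakes setting, agreement on specific error types (e.g., fabricated numbers) matters more than an aggregate coefficient.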

Beyond the above issues, parametric knowledge bias is a conspicuous problem [27]. This occurs when models depend on their intrinsic parametric knowledge, which is built up during training, rather than the information provided in the input source when generating summaries. To alleviate this issue, retrieval-augmented generation [28], which has been proposed to retrieve references and incorporate portions of these references into the completion, can be employed to enhance the accuracy and coherence of prompt completion. It enables healthcare providers to verify the evidence synthesis against referenced studies and assess the quality of the references [29].
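The retrieval-augmented pattern can be sketched minimally as follows. A naive word-overlap scorer stands in for a real dense or BM25 retriever, and the passages are invented; numbering the sources in the prompt is what lets a reader trace each claim back to its reference:

```python
def retrieve(query, corpus, k=2):
    """Rank passages by naive word overlap with the query (a toy
    stand-in for a real dense or BM25 retriever)."""
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, passages):
    """Assemble a grounded prompt; numbered sources let a downstream
    verifier check each claim against its reference."""
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (f"Answer using ONLY the numbered sources, citing them inline.\n"
            f"Sources:\n{sources}\n\nQuestion: {query}\nAnswer:")

corpus = [
    "Trial A reported reduced mortality with early anticoagulation.",
    "Trial B found no significant difference in readmission rates.",
    "An unrelated study examined imaging protocols.",
]
prompt = build_prompt("Does early anticoagulation reduce mortality?",
                      retrieve("early anticoagulation mortality", corpus))
print(prompt)
```

The assembled prompt would then be passed to the LLM; constraining the model to the retrieved sources is precisely what mitigates reliance on parametric knowledge.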

3. Causality

Estimating treatment effects is important for informing clinical decisions. For decades, RCTs have been considered the gold standard: patients are randomly assigned to the treatment or control group so that observable confounding factors are evenly distributed across the two groups. However, RCTs have several limitations that should be recognized. For example, it is unethical to explore all possible outcomes if a treatment risks harming a patient. Also, statistically meaningful RCTs are challenging to administer for rare diseases. Thus, as a supplement to such evidence, observational data has been explored for estimating treatment effects [2]. Yet, despite the promising abundance of records in observational studies, the treated and untreated groups may not be directly comparable because of confounding factors. Even more problematic, certain confounding factors, such as a patient’s familial situation or economic status, may not be observable. As such, causal inference remains a challenge in leveraging observational data.
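To illustrate how an unadjusted comparison of treated and untreated groups can mislead, the hypothetical stratified counts below (invented numbers) show a crude effect estimate shrinking after standardizing over a confounder such as age:

```python
# Hypothetical counts per age stratum:
# (n_treated, recovered_treated, n_untreated, recovered_untreated)
strata = {"young": (80, 72, 20, 16), "old": (20, 10, 80, 32)}

# Crude comparison: ignores that younger (healthier) patients
# were treated far more often than older ones.
nt = sum(s[0] for s in strata.values())
rt = sum(s[1] for s in strata.values())
nu = sum(s[2] for s in strata.values())
ru = sum(s[3] for s in strata.values())
crude_diff = rt / nt - ru / nu

# Standardized comparison: compute the risk difference within each
# stratum, then weight by the stratum's share of all patients.
n_all = sum(s[0] + s[2] for s in strata.values())
adjusted_diff = sum(
    (s[0] + s[2]) / n_all * (s[1] / s[0] - s[3] / s[2])
    for s in strata.values()
)
print(f"crude: {crude_diff:.2f}, age-adjusted: {adjusted_diff:.2f}")
```

Here the crude estimate (0.34) overstates the adjusted one (0.10) because age confounds the comparison; unobserved confounders, by definition, cannot be adjusted away like this.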

LLMs have demonstrated numerous breakthroughs in making inferences based on patterns (e.g., writing poems or source code [30]) and hold promise in assisting causal inference tasks by identifying confounding factors or generating descriptions of causal relationships. However, it is notable that LLMs may still “exhibit unpredictable failure modes [in causal inference]” [31]. Furthermore, there is ongoing discussion regarding whether LLMs truly perform causal reasoning or merely mimic memorized responses. As such, research is needed into how best to characterize LLMs’ capacity for causality and to understand their underlying mechanisms.

4. Transparency

In healthcare, it is critical that AI systems be transparent, given their proximity to human lives, and that patients understand how clinicians use the systems’ recommendations. In the context of LLMs, the challenges are unique compared with traditional AI approaches because of the complexity of the models, the unpredictability of their capabilities [32], and proprietary technology.

Modern LLMs are typically based on neural models, which consist of multiple layers of interconnected neurons, and the relationship between input and output can be highly complex. It is thus challenging to understand how a specific summary is generated from the input information, which poses a challenge for building transparent evidence summarization systems. This complexity is exemplified by the exposure bias problem, which refers to the difference in decoding behavior between the training and inference phases. It is common practice to train models to generate the next token conditioned on ground-truth prefix sequences, a strategy known as teacher forcing. During inference, however, the model generates the next token based on the sequences it has generated itself. The model can therefore be biased to perform well only when conditioned on ground-truth prefixes. This discrepancy can trigger progressively more erroneous generation, particularly as the target sequence lengthens. While the problem was identified almost a decade ago [33], it is still unclear to what extent exposure bias affects the quality of model output [34].
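The cascade can be illustrated with a deliberately toy “model”, a lookup table of next tokens standing in for a trained network: conditioned on a correct prefix it reproduces the training sequence, but a single off-distribution token derails everything that follows:

```python
# A toy next-token table standing in for a trained language model.
bigram = {"the": "trial", "trial": "reported", "reported": "benefit"}

def decode(first, steps=3):
    """Free-running decoding: each step conditions on the model's own
    previous output, so one early error propagates to every later step."""
    out = [first]
    for _ in range(steps):
        out.append(bigram.get(out[-1], "<unk>"))
    return out

print(decode("the"))  # conditioned on a correct prefix: follows training data
print(decode("a"))    # one off-distribution token derails the rest
```

During training under teacher forcing, the model would only ever see the gold prefixes (the left column of `bigram`), so it never learns how to recover from its own mistakes.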

In recent years, researchers in AI and human–computer interaction (HCI) have focused on developing and evaluating approaches to achieve transparency in clinical AI systems. These include model training techniques, standard protocols for documenting the data and the model training process [35–37], and techniques that explain the confidence level of AI predictions in ways consistent with how clinicians weigh uncertainty in clinical decision making and explain their decisions to patients [38]. Additional work has focused on establishing community guidelines, informed by human-centered design principles, for applications that may influence health behaviors [39]. These prior studies establish a solid foundation for addressing the challenges generative AI brings and can therefore guide the development of AI-generated summaries.

Regarding EBM, we need to create teams of diverse stakeholders, including, at a minimum, patients, healthcare practitioners, and policymakers. Representing and supporting their needs is critical to ensuring that the generative AI research community develops meaningful and respectful technologies. For instance, clinical studies summarized for patients should prioritize readability and comprehensibility. In contrast, summaries intended for healthcare practitioners should provide sufficient detail to support trustworthy decision-making. Additionally, the versions crafted for policymakers should highlight potential risks to the summarization process and discuss their broader implications.

Finally, developing models with baked-in structures is crucial to achieving transparency. For example, Saha et al. [40] use binary trees to symbolically represent the sequential generative process of a summary from the source document and to highlight the pieces of evidence that most influence the summarization. Ramprasad et al. [9] separate conditions, interventions, and outcomes in input RCTs to yield an aspect-wise interpretable summary. In this view, transparency is not an afterthought but a core element that is diligently crafted and assessed with the expertise of domain professionals.

5. Fairness

LLMs offer significant benefits for addressing biases in synthesizing clinical trials and enhancing research inclusivity. For example, LLMs can process and analyze vast amounts of data from diverse populations. This ensures that the findings are more representative and applicable to a wider range of patient groups. Moreover, LLMs can process information in multiple languages, facilitating accessibility to clinical trial data and evidence synthesis for researchers and practitioners worldwide. However, it is important to note that while some LLMs can generate documentation on par with clinicians, there is evidence that LLMs can also propagate biased results [41]. For example, there may be a need for LLMs to better capture the demographic diversity of medical conditions to prevent generating clinical vignettes that perpetuate stereotypes about demographic presentations [42]. The presence of biases in the LLM’s training data can also perpetuate or amplify these biases in its analysis and summaries. In the healthcare domain, this is particularly concerning, as biased data can lead to unfair and harmful outcomes. Moreover, LLMs may not fully understand the cultural, social, and ethical contexts of clinical trials, potentially leading to oversimplified or inappropriate conclusions that might not be fair or applicable to all patient groups. Therefore, the development, deployment, and use of generative AI should strive for fairness.

While quality assurance for clinical studies is orthogonal to model training and inference, underlying biases in the clinical studies can be amplified in the synthesized evidence summaries. The spread of misinformation and the rapid publication of invalid evidence have already undermined the public’s trust in biomedical research [43]. Even though the summarization task might be less prone to biases than other steps in evidence-based medicine, it is still crucial to remain cautious. One possible mitigation is to construct prompts (i.e., instructions to LLMs) that explicitly request bias-free output. However, such prompts are unlikely to suffice on their own, because biases can influence the overall quality of model development and may be embedded more broadly and deeply within the LLMs.

In the context of real-world evidence, it has been found that machine learning (ML) models may perform poorly on under-represented groups [44,45]. Such situations where a patient group is under-tested compared to others are called disparate censorship. For training supervised learning models, patient outcomes are labeled based on diagnostic test results. While medical records list diseases that patients have been diagnosed with, the absence of a diagnosis does not mean the patients do not have such a disease. Assuming undiagnosed patients are healthy can lead to the development of biased ML models that produce incorrect summaries, particularly for patient groups that have limited access to healthcare [44,45]. Clinical decisions based on biased generations, e.g., evidence summaries, can further harm the already under-tested population [46]. The research community needs to focus on assessing the LLMs to ensure they do not exhibit biases, discrimination, and stigmatization towards individuals or groups [42,47].
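A back-of-the-envelope calculation shows how labeling untested patients as healthy distorts apparent prevalence (the rates below are invented): two groups with identical true disease rates look very different when their testing rates differ:

```python
def observed_prevalence(n, true_prev, test_rate):
    """Label = diagnosed; untested sick patients are (wrongly) counted
    as healthy, so labels systematically miss disease in under-tested
    groups."""
    diseased = n * true_prev
    diagnosed = diseased * test_rate  # only tested patients can be labeled sick
    return diagnosed / n

# Both groups truly have 10% prevalence; only testing access differs.
print(observed_prevalence(1000, 0.10, 0.8))  # well-tested group
print(observed_prevalence(1000, 0.10, 0.3))  # under-tested group
```

A model trained on such labels would learn that the under-tested group is at lower risk, reinforcing exactly the disparity that produced the missing labels.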

6. Model generalizability

Another crucial component of achieving trustworthiness for generative AI is generalizability. The models must behave reliably and reproducibly while minimizing unintentional and unexpected consequences, even when the operating environment differs from the training environment.

While most popular LLMs are trained on diverse text resources and have demonstrated considerable proficiency, they often lack a deep understanding of specialized fields, particularly the medical domain. For instance, domain-specific LLMs may comprehend medical knowledge, such as the UMLS terminology, better than their generalist counterparts [48,49]. To this end, we should prioritize constructing domain-specific LLMs and benchmarks [50], as generic benchmarks are no longer of primary importance in evidence-based medicine. So far, multiple LLMs have been developed for medical applications and open-sourced, such as BioBERT, ClinicalBERT, PubMedBERT, BioMedLM, and GeneGPT [51–55]. In addition, many companies have adapted their closed-source models for medical applications, such as GPT-3.5 and GPT-4 by OpenAI and MedPaLM by Google [56–58]. These LLMs were pre-trained on datasets that do not include the most recent knowledge in the medical domain, which is a fast-evolving field. As such, developing domain-specific models is an ongoing process and a high priority.

Another challenge for generalizability in evidence summarization is the ability to process long inputs. Even though there is a trend toward increasing the maximum input length, the extended context window may still not be long enough to encompass all clinical trials or notes involved in an evidence summary. To combat this issue, several strategies have been proposed, such as chunking the document, hierarchical summarization, and iterative refinement by looping over multiple sections or documents [59,60]. A potential solution for summarizing multiple clinical trials that cannot fit in a single prompt is to maintain historical interaction records, which can be used to process clinical trials in batches. Additionally, hierarchical summarization mechanisms could synthesize evidence from concise summaries of the clinical trials rather than directly from the trials themselves.
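The chunk-then-summarize strategy can be sketched as a map-reduce over batches of trials. Here a toy truncation function stands in for the LLM summarization call, which is an assumption for illustration, not a real model:

```python
def summarize(text, limit=12):
    """Toy stand-in for an LLM summarization call: keep the first words."""
    words = text.split()
    return " ".join(words[:limit]) + (" ..." if len(words) > limit else "")

def chunk(docs, max_docs=2):
    """Batch documents so each batch fits a (hypothetical) context window."""
    return [docs[i:i + max_docs] for i in range(0, len(docs), max_docs)]

def hierarchical_summary(trials):
    # Map: summarize each batch of trials independently.
    partials = [summarize(" ".join(batch)) for batch in chunk(trials)]
    # Reduce: synthesize the evidence from the partial summaries.
    return summarize(" ".join(partials), limit=20)

trials = [
    "Trial A enrolled 120 patients and reported reduced mortality.",
    "Trial B enrolled 80 patients and found no significant effect.",
    "Trial C enrolled 200 patients and reported moderate benefit.",
]
print(hierarchical_summary(trials))
```

A caveat worth noting: each level of the hierarchy can discard detail, so errors in the intermediate summaries propagate silently into the final synthesis.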

7. Data privacy and governance

There are numerous concerns over how an LLM may share and consume patient information. It remains unclear whether patients would consent to the use of information about them in an LLM, particularly if they are not informed about, or unable to comprehend, what the LLM would be used for. In this respect, if patient information is to be relied upon, there need to be clear descriptions of anticipated uses, even if such a communication indicates that the eventual services are not yet known.

Even if patients consent to transfer information about them to an LLM, they may not wish to have potentially identifying or stigmatizing information about them disclosed to or retained by the LLM. While LLMs are designed to create probabilistic representations of the relationships between features, if not designed appropriately, they may memorize training instances supplied to them [61]. As such, when the LLM is queried, it may be possible for the end user to determine the training data, which could have legal implications [62]. Even if individual-level records cannot be recovered from an LLM, it may also be possible that a patient can be detected as a contributor to the training data [63]. Such membership inferences could be problematic if, for instance, the LLM has been fine-tuned on patients diagnosed with a sensitive clinical phenomenon, such as a sexually transmitted disease. There have been various investigations into incorporating formal privacy principles, such as differential privacy [62], into the construction and use of LLMs; however, it is currently unclear how such noise-based mechanisms influence clinical trial summarization capabilities. In this respect, it is critical to continue research into how best to train and apply LLMs in a manner that respects the privacy of the patients upon whom the technology is based.
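As one example of a noise-based mechanism, DP-SGD-style training clips each per-example gradient and adds Gaussian noise calibrated to the clipping norm, bounding any single patient's influence on the model. The sketch below shows only that per-step mechanism with invented values; a full treatment would also track the cumulative privacy budget:

```python
import math
import random

def privatize_gradients(grads, clip_norm=1.0, noise_mult=1.1, rng=None):
    """DP-SGD-style step: clip each per-example gradient to bound its
    influence, sum, then add Gaussian noise scaled to the clip norm."""
    rng = rng or random.Random(0)  # fixed seed for a reproducible sketch
    dim = len(grads[0])
    total = [0.0] * dim
    for g in grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / norm)  # shrink large gradients
        for i, x in enumerate(g):
            total[i] += x * scale
    sigma = noise_mult * clip_norm
    return [t + rng.gauss(0.0, sigma) for t in total]

# Two per-example gradients; the first exceeds the clip norm and is shrunk.
grads = [[3.0, 4.0], [0.5, 0.0]]
print(privatize_gradients(grads))
```

Production systems (e.g., the Opacus library for PyTorch) implement this mechanism together with privacy accounting; how much such noise degrades clinical trial summarization quality remains, as noted above, an open question.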

8. Patient safety

Safety also remains a pressing concern when utilizing clinical evidence summarized by LLMs, as any inaccuracies could have far-reaching implications for high-stakes healthcare decisions. To mitigate these risks, generative AI should come equipped with protective measures that facilitate a backup plan in case of any complications. As we continue to advance in AI, the goal is for LLMs to transform from simple tools to collaborative partners operating in sync with human experts. While using LLMs to quickly analyze and synthesize large amounts of evidence could expedite the evidence summarization process, a secure and reliable architecture is essential to ensure that expert meta-analyzers can trust these AI-generated summaries.

One possible solution revolves around a synergy between humans and generative AI, with a focus on their mutual safety. This collaborative framework would involve human experts frequently reviewing and offering feedback on summaries produced by LLMs, thereby resulting in iterative improvements. In this way, humans and generative AI can develop trust in each other and continue to evolve concurrently, enhancing their teamwork. We believe this integration will enhance capabilities beyond what humans alone can achieve, introduce new capacities, and, crucially, embody “high confidence.” Accordingly, they should exhibit reliable or predictable performance, demonstrate tolerance towards environmental and configuration faults, and maintain resilience against adversarial attacks.

9. Lawfulness and regulation

Finally, generative AI should unequivocally adhere to all pertinent laws and regulations, including international humanitarian and human rights laws, a cardinal principle that carries particular significance when applying generative AI for evidence summarization. This entails recognizing the nuanced legal landscape governing AI deployment, which may diverge from one state or country to another. Thus, it becomes imperative to construct a comprehensive legal framework addressing the accountability associated with actions taken and recommendations made by AI systems.

Under the General Data Protection Regulation (Article 9), collecting and processing sensitive personal data is subject to strict regulations [64]. The North Atlantic Treaty Organization (NATO) has established guidelines promoting the responsible utilization of AI, including a principle emphasizing that AI applications will be created and employed in compliance with both domestic and international legal frameworks [65]. Recently, the United States Congress has conducted a series of testimonies and hearings involving AI companies and AI researchers to address matters pertaining to AI regulation and governance [66,67]. The European Union has also initiated the process of creating regulations regarding developing and using generative AI [68]. In alignment with these movements, LLMs for evidence summarization and beyond must also be developed with these legal challenges in mind to safeguard patients, clinicians, and AI developers from any unintended repercussions.

10. Conclusion

Generative AI for medical evidence summarization has already made a positive impact and is set to continue doing so in ways we cannot yet imagine. However, it is equally vital that risks and other adverse effects be properly mitigated. To this end, building generative AI systems that are genuinely trustworthy is crucial. This document outlines several directions that efforts could take to achieve trustworthy generative AI, ideally working in unity and overlapping in their functionality.

While this document primarily targets evidence summarization, some principles could prove beneficial in other domains. The present moment necessitates the creation of a culture of Trustworthy AI within the JBI community, enabling the benefits of AI to be fully realized within our healthcare system.

Acknowledgments

This research was sponsored by the National Library of Medicine grants R01LM009886, R01LM014344, R01LM014306, 2R01LM012086, and R01LM013772; the National Center for Advancing Clinical and Translational Science award UL1TR001873; and the National Science Foundation (NSF) grant 1750978. It was also supported by the NIH Intramural Research Program, National Library of Medicine.

Footnotes

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Gongbo Zhang: Writing – review & editing, Writing – original draft. Qiao Jin: Writing – review & editing, Writing – original draft. Denis Jered McInerney: Writing – review & editing. Yong Chen: Writing – review & editing. Fei Wang: Writing – review & editing. Curtis L. Cole:. Qian Yang: Writing – review & editing. Yanshan Wang: Writing – review & editing. Bradley A Malin: Writing – review & editing. Mor Peleg: Writing – review & editing, Conceptualization. Byron C. Wallace: Writing – review & editing. Zhiyong Lu: Writing – review & editing, Writing – original draft, Funding acquisition. Chunhua Weng: Writing – review & editing, Writing – original draft, Supervision, Project administration, Investigation, Funding acquisition, Conceptualization. Yifan Peng: Writing – review & editing, Writing – original draft, Visualization, Supervision, Project administration, Investigation, Funding acquisition, Conceptualization.

References

  • [1].Sherman RE, Anderson SA, Dal Pan GJ, Gray GW, Gross T, Hunter NL, et al. , Real-world evidence - what is it and what can it tell us? N Engl. J. Med. 375 (2016) 2293–2297. [DOI] [PubMed] [Google Scholar]
  • [2].Schuemie MJ, Hripcsak G, Ryan PB, Madigan D, Suchard MA, Empirical confidence interval calibration for population-level effect estimation studies in observational healthcare data, Proc. Natl. Acad. Sci. USA 115 (2018) 2571–2577. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [3].Gershman B, Guo DP, Dahabreh IJ, Using observational data for personalized medicine when clinical trial evidence is limited, Fertil Steril. 109 (2018) 946–951. [DOI] [PubMed] [Google Scholar]
  • [4].Carlisle JB, False individual patient data and zombie randomised controlled trials submitted to Anaesthesia, Anaesthesia. 76 (2021) 472–479. [DOI] [PubMed] [Google Scholar]
  • [5].Van Noorden R, Medicine is plagued by untrustworthy clinical trials. How many studies are faked or flawed? Nature 619 (2023) 454–458. [DOI] [PubMed] [Google Scholar]
  • [6].Peng Y, Rousseau JF, Shortliffe EH, Weng C, AI-generated text may have a role in evidence-based medicine, Nat. Med. 29 (2023) 1593–1594. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [7].Tang L, Sun Z, Idnay B, Nestor JG, Soroush A, Elias PA, et al. , Evaluating large language models on medical evidence summarization, NPJ Digit Med. 6 (2023) 158. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [8].Wallace BC, Saha S, Soboczenski F, Marshall IJ, Generating (factual?) Narrative summaries of RCTs: Experiments with neural multi-document summarization, AMIA Jt Summits Transl Sci Proc. 2021 (2021) 605–614. [PMC free article] [PubMed] [Google Scholar]
  • [9].Ramprasad S, Marshall IJ, McInerney DJ, Wallace BC, Automatically summarizing evidence from clinical trials: A prototype highlighting current challenges, Proc Conf Assoc Comput Linguist Meet. 2023 (2023) 236–247. [PMC free article] [PubMed] [Google Scholar]
  • [10].Zhang G, Roychowdhury D, Li P, Wu H-Y, Zhang S, Li L, et al. Identifying experimental evidence in biomedical abstracts relevant to drug-drug interactions. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. New York, NY, USA: Association for Computing Machinery; 2018. pp. 414–418. [Google Scholar]
  • [11].Zhang S, Wu H, Wang L, Zhang G, Rocha LM, Shatkay H, et al. Translational drug–interaction corpus. Database. 2022;2022: baac031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [12].Kang T, Sun Y, Kim JH, Ta C, Perotte A, Schiffer K, et al. , EvidenceMap: a three-level knowledge representation for medical evidence computation and comprehension, J. Am. Med. Inform. Assoc. 30 (2023) 1022–1031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [13].Turfah A, Liu H, Stewart LA, Kang T, Weng C, Extending PICO with observation normalization for evidence computing, Stud Health Technol Inform. 290 (2022) 268–272. [DOI] [PubMed] [Google Scholar]
  • [14].Chen Z, Liu H, Liao S, Bernard M, Kang T, Stewart LA, et al. , Representation and normalization of complex interventions for evidence computing, Stud Health Technol Inform. 290 (2022) 592–596. [DOI] [PubMed] [Google Scholar]
  • [15].Kang T, Turfah A, Kim J, Perotte A, Weng C, A neuro-symbolic method for understanding free-text medical evidence, J. Am. Med. Inform. Assoc. 28 (2021) 1703–1711. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [16].Zhang G, Bhattacharya M, Wu HY, Li P, Identifying articles relevant to drug-drug interaction: addressing class imbalance, in: 2017 IEEE Inter Conf on Bioinfo and Biomed (BIBM), 2017. Available: https://ieeexplore.ieee.org/abstract/document/8217818/. [Google Scholar]
  • [17].Jiang X, Ringwald M, Blake JA, Arighi C, Zhang G, Shatkay H, An effective biomedical document classification scheme in support of biocuration: addressing class imbalance, Database (2019), 10.1093/database/baz045. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [18].Li P, Jiang X, Zhang G, Trabucco JT, Raciti D, Smith C, et al. , Corrigendum to: utilizing image and caption information for biomedical document classification, Bioinformatics 37 (2021) 3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [19].Mayer RC, Davis JH, Schoorman FD, An integrative model of organizational trust, Acad. Manage. Rev. 20 (1995) 709–734. [Google Scholar]
  • [20].U.S. DEPARTMENT OF HEALTH & HUMAN SERVICES Trustworthy AI Playbook. 2021. Available: https://www.hhs.gov/sites/default/files/hhs-trustworthy-ai-playbook.pdf.
  • [21].Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. , Survey of hallucination in natural language generation, ACM Comput. Surv. 55 (2023) 1–38. [Google Scholar]
  • [22].Savović J, Jones HE, Altman DG, Harris RJ, Jüni P, Pildal J, et al. , Influence of reported study design characteristics on intervention effect estimates from randomized, controlled trials, Ann. Intern. Med. 157 (2012) 429–438. [DOI] [PubMed] [Google Scholar]
  • [23]. DeYoung J, Martinez SC, Marshall IJ, Wallace BC, Do multi-document summarization models synthesize? arXiv [cs.CL]. 2023. Available: http://arxiv.org/abs/2301.13844.
  • [24]. Shaib C, Li M, Joseph S, Marshall I, Li JJ, Wallace B, Summarizing, simplifying, and synthesizing medical evidence using GPT-3 (with varying success), in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Toronto, Canada: Association for Computational Linguistics; 2023. pp. 1387–1407.
  • [25]. Fabbri AR, Kryściński W, McCann B, Xiong C, Socher R, Radev D, SummEval: re-evaluating summarization evaluation, Trans Assoc Comput Linguist. 9 (2021) 391–409.
  • [26]. Wang LL, Otmakhova Y, DeYoung J, Truong TH, Kuehl B, Bransom E, et al., Automated metrics for medical multi-document summarization disagree with human evaluations, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Toronto, Canada: Association for Computational Linguistics; 2023. pp. 9871–9889.
  • [27]. Longpre S, Perisetla K, Chen A, Ramesh N, DuBois C, Singh S, Entity-based knowledge conflicts in question answering, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics; 2021. pp. 7052–7063.
  • [28]. Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, in: Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc.; 2020. pp. 9459–9474.
  • [29]. Jin Q, Leaman R, Lu Z, Retrieve, summarize, and verify: how will ChatGPT affect information seeking from the medical literature? J. Am. Soc. Nephrol. 34 (2023) 1302–1304.
  • [30]. Chen M, Tworek J, Jun H, Yuan Q, de Oliveira Pinto HP, Kaplan J, et al., Evaluating large language models trained on code. arXiv [cs.LG]. 2021. Available: http://arxiv.org/abs/2107.03374.
  • [31]. Kıcıman E, Ness R, Sharma A, Tan C, Causal reasoning and large language models: opening a new frontier for causality. arXiv [cs.AI]. 2023. Available: http://arxiv.org/abs/2305.00050.
  • [32]. Ganguli D, Hernandez D, Lovitt L, Askell A, Bai Y, Chen A, et al., Predictability and surprise in large generative models, in: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. New York, NY, USA: Association for Computing Machinery; 2022. pp. 1747–1764.
  • [33]. Bengio S, Vinyals O, Jaitly N, Shazeer N, Scheduled sampling for sequence prediction with recurrent neural networks, in: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1. Cambridge, MA, USA: MIT Press; 2015. pp. 1171–1179.
  • [34]. He T, Zhang J, Zhou Z, Glass J, Exposure bias versus self-recovery: are distortions really incremental for autoregressive text generation? in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics; 2021. pp. 5087–5102.
  • [35]. Doshi-Velez F, Kim B, Towards a rigorous science of interpretable machine learning. arXiv [stat.ML]. 2017. Available: http://arxiv.org/abs/1702.08608.
  • [36]. Du M, Liu N, Hu X, Techniques for interpretable machine learning, Commun ACM. 63 (2019) 68–77.
  • [37]. Zhao H, Chen H, Yang F, Liu N, Deng H, Cai H, et al., Explainability for large language models: a survey. arXiv [cs.CL]. 2023. Available: http://arxiv.org/abs/2309.01029.
  • [38]. Basu C, Vasu R, Yasunaga M, Yang Q, Med-EASi: finely annotated dataset and models for controllable simplification of medical texts. arXiv [cs.CL]. 2023. Available: http://arxiv.org/abs/2302.09155.
  • [39]. Levy M, Pauzner M, Rosenblum S, Peleg M, Achieving trust in health-behavior-change artificial intelligence apps (HBC-AIApp) development: a multi-perspective guide, J. Biomed. Inform. 143 (2023) 104414.
  • [40]. Saha S, Zhang S, Hase P, Bansal M, Summarization programs: interpretable abstractive summarization with neural modular trees. arXiv [cs.CL]. 2022. Available: http://arxiv.org/abs/2209.10492.
  • [41]. Vera Liao Q, Vaughan JW, AI transparency in the age of LLMs: a human-centered research roadmap. arXiv [cs.HC]. 2023. Available: http://arxiv.org/abs/2306.01941.
  • [42]. Zack T, Lehman E, Suzgun M, Rodriguez JA, Celi LA, Gichoya J, et al., Coding inequity: assessing GPT-4's potential for perpetuating racial and gender biases in healthcare, bioRxiv (2023), 10.1101/2023.07.13.23292577.
  • [43]. Bromme R, Mede NG, Thomm E, Kremer B, Ziegler R, An anchor in troubled times: trust in science before and within the COVID-19 pandemic, PLoS One. 17 (2022) e0262823.
  • [44]. Buolamwini J, Gebru T, Gender shades: intersectional accuracy disparities in commercial gender classification, in: Friedler SA, Wilson C (Eds.), Proceedings of the 1st Conference on Fairness, Accountability and Transparency. PMLR; 2018. pp. 77–91.
  • [45]. Obermeyer Z, Powers B, Vogeli C, Mullainathan S, Dissecting racial bias in an algorithm used to manage the health of populations, Science 366 (2019) 447–453.
  • [46]. Chang T, Sjoding MW, Wiens J, Disparate censorship & undertesting: a source of label bias in clinical machine learning, Proc Mach Learn Res. 182 (2022) 343–390.
  • [47]. Poulain R, Bin Tarek MF, Beheshti R, Improving fairness in AI models on electronic health records: the case for federated learning methods, in: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency. New York, NY, USA: Association for Computing Machinery; 2023. pp. 1599–1608.
  • [48]. Lehman E, Hernandez E, Mahajan D, Wulff J, Smith MJ, Ziegler Z, et al., Do we still need clinical language models? arXiv [cs.CL]. 2023. Available: http://arxiv.org/abs/2302.08091.
  • [49]. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al., Publisher correction: large language models encode clinical knowledge, Nature 620 (2023) E19.
  • [50]. Tian S, Jin Q, Yeganova L, Lai P-T, Zhu Q, Chen X, et al., Opportunities and challenges for ChatGPT and large language models in biomedicine and health, Brief Bioinform. 25 (2023), 10.1093/bib/bbad493.
  • [51]. Huang K, Altosaar J, Ranganath R, ClinicalBERT: modeling clinical notes and predicting hospital readmission. arXiv [cs.CL]. 2019. Available: http://arxiv.org/abs/1904.05342.
  • [52]. Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, et al., Domain-specific language model pretraining for biomedical natural language processing, ACM Trans Comput Healthcare. 3 (2021) 1–23.
  • [53]. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al., BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2020) 1234–1240.
  • [54]. Jin Q, Yang Y, Chen Q, Lu Z, GeneGPT: augmenting large language models with domain tools for improved access to biomedical information, arXiv (2023). Available: https://www.ncbi.nlm.nih.gov/pubmed/37131884.
  • [55]. Venigalla A, Frankle J, Carbin M, BioMedLM: a domain-specific large language model for biomedical text, MosaicML; 2022.
  • [56]. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al., Language models are few-shot learners, Adv Neural Inf Process Syst. 33 (2020) 1877–1901.
  • [57]. OpenAI, GPT-4 technical report. arXiv:2303.08774. 2023.
  • [58]. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al., Large language models encode clinical knowledge, Nature 620 (2023) 172–180.
  • [59]. Topsakal O, Akinci TC, Creating large language model applications utilizing LangChain: a primer on developing LLM apps fast, in: Proceedings of the International Conference on Applied … 2023. Available: https://www.researchgate.net/profile/Oguzhan-Topsakal/publication/372669736_Creating_Large_Language_Model_Applications_Utilizing_LangChain_A_Primer_on_Developing_LLM_Apps_Fast/links/64d114a840a524707ba4a419/Creating-Large-Language-Model-Applications-Utilizing-LangChain-A-Primer-on-Developing-LLM-Apps-Fast.pdf.
  • [60]. Ma C, Zhang WE, Guo M, Wang H, Sheng QZ, Multi-document summarization via deep learning techniques: a survey, ACM Comput. Surv. 55 (2022) 1–37.
  • [61]. Biderman S, Prashanth US, Sutawika L, Schoelkopf H, Anthony Q, Purohit S, et al., Emergent and predictable memorization in large language models. arXiv [cs.CL]. 2023. Available: http://arxiv.org/abs/2304.11158.
  • [62]. Bender EM, Gebru T, McMillan-Major A, Shmitchell S, On the dangers of stochastic parrots: can language models be too big? in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. New York, NY, USA: Association for Computing Machinery; 2021. pp. 610–623.
  • [63]. Duan H, Dziedzic A, Yaghini M, Papernot N, Boenisch F, On the privacy risk of in-context learning. [cited 22 Sep 2023]. Available: https://trustnlpworkshop.github.io/papers/13.pdf.
  • [64]. Art. 9 GDPR – Processing of special categories of personal data, General Data Protection Regulation (GDPR). [cited 22 Jan 2024]. Available: https://gdpr-info.eu/art-9-gdpr/.
  • [65]. Stanley-Lockman Z, Christie EH, An artificial intelligence strategy for NATO, NATO Review. 2021. Available: https://www.nato.int/docu/review/articles/2021/10/25/an-artificial-intelligence-strategy-for-nato/index.html.
  • [66]. Oversight of AI: principles for regulation, U.S. Senate Committee on the Judiciary; 25 Jul 2023. [cited 22 Sep 2023]. Available: https://www.judiciary.senate.gov/committee-activity/hearings/oversight-of-ai-principles-for-regulation.
  • [67]. Governing AI through acquisition and procurement, U.S. Senate Committee on Homeland Security and Governmental Affairs; 6 Sep 2023. [cited 22 Sep 2023]. Available: https://www.hsgac.senate.gov/hearings/governing-ai-through-acquisition-and-procurement-2/.
  • [68]. EU AI Act: first regulation on artificial intelligence, European Parliament; 6 Aug 2023. [cited 26 Sep 2023]. Available: https://www.europarl.europa.eu/news/en/headlines/society/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence.
