For decades, surgeons have relied on a well-established hierarchy of evidence to guide clinical decision-making, ranging from randomised controlled trials (RCTs) and systematic reviews with meta-analyses at the top down to expert opinion at the bottom. The evidence pyramid - such as the one provided by the Oxford Centre for Evidence-Based Medicine (OCEBM) - has served as a reliable framework for evaluating the strength of evidence [1].
With the rapid advancement of artificial intelligence (AI), particularly in the form of large language models and machine learning systems, this traditional structure is being challenged. AI is now capable of synthesizing results and, above all, drafting well-written summaries and recommendations - tasks once reserved for human researchers.
It is important to recognize that automation in evidence generation is not new. The Cochrane Handbook explicitly permits the use of automated tools in systematic reviews - provided the process is transparent, reproducible, and properly cited [2]. Consequently, many AI-powered automation tools already exist for the evidence synthesis workflow [3]. The rise of large language models and chatbots [4] may thus, at first glance, seem to promise a revolution in how we create and consume evidence. But it also raises a fundamental question: Who do we trust?
However, trust in evidence depends not just on what is produced, but also on how it is produced. Commonly, levels of evidence rank study designs from Level 1 (high-quality systematic reviews with meta-analyses and RCTs) to Level 5 (expert opinion without explicit critical appraisal) [1]. So where does AI-generated output belong?
Some have suggested evolving the traditional two-dimensional pyramid into a more nuanced structure - for example, a three-dimensional evidence pyramid in which the third axis represents the effort required to generate the evidence, complemented by a fourth dimension reflecting volume and clinical impact [5]. Such a model acknowledges the real-world complexity of modern evidence production.
Still, we must be clear: we must not compromise our scientific rigour because of AI. Artificial intelligence should help us streamline this process - by generating high-quality evidence faster and more reliably - but it must also follow the fundamental rules of science, especially reproducibility and transparency.
A chatbot that answers a clinical question instantly, without transparency about data sources or methodology, may seem helpful - but it belongs at the base of the pyramid, on par with expert opinion (Level 5), or even below as a new “Level 6”. Without transparency and traceability, AI-generated outputs cannot be considered valid evidence - no matter how fast, confident, or convincing they may appear. Furthermore, without proper input, hallucination remains a major problem [4].
On the other hand, when AI is used transparently and reproducibly within a robust scientific framework - supported by proper documentation, critical appraisal, and human oversight - it becomes a powerful accelerator of high-quality research. In such cases, AI can help produce Level 1a evidence, such as systematic reviews with meta-analyses of RCTs, more efficiently. This AI-enhanced evidence deserves its place at the top of the pyramid - not because it bypasses the proven scientific methods, but because it upholds them.
In surgery, where clinical decisions carry immediate and often irreversible consequences, we need evidence that is both timely and trustworthy. AI cannot replace the clinical trial, nor the surgeon’s judgment. But if used wisely, it can help us reach sound conclusions faster, and with greater precision.
Ultimately, this is not about human versus machine - it’s about how we integrate novel tools into our daily practice as clinicians and/or researchers. And so, in the era of AI, the essential question remains: Who do you trust? The trial or the algorithm? We argue, not the trial or the algorithm alone - but the human expert who can evaluate and integrate both!
Declarations
Conflict of interest
The authors declare no other conflict of interest.
Disclosure
PP is a co-founder of the company EVIglance Inc. (www.EVIglance.com), providing software services for researchers to establish living AI-enhanced evidence maps. MW was funded by the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft) as part of Germany’s Excellence Strategy - EXC 2050/1 - Project ID 390696704 - Cluster of Excellence “Centre for Tactile Internet with Human-in-the-Loop” (CeTI) of Technische Universität Dresden.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Oxford Centre for Evidence-Based Medicine. OCEBM Levels of Evidence. https://www.cebm.ox.ac.uk/resources/levels-of-evidence/ocebm-levels-of-evidence
- 2. Higgins JPT, Thomas J, Chandler J et al (2019) Cochrane handbook for systematic reviews of interventions, 2nd edn. Wiley. https://www.wiley.com/en-us/Cochrane+Handbook+for+Systematic+Reviews+of+Interventions%2C+2nd+Edition-p-9781119536628
- 3. King’s College London Library Guides. Systematic Reviews: AI and Automation. https://libguides.kcl.ac.uk/systematicreview/ai
- 4. Truhn D, Reis-Filho JS, Kather JN (2023) Large language models should be used as scientific reasoning engines, not knowledge databases. Nat Med 29(12):2983–2984. 10.1038/s41591-023-02594-z
- 5. Bellini V, Ori E, Coccolini F et al (2023) Evidence pyramid and artificial intelligence: a metamorphosis of clinical research. Discover Health Syst 2:40. 10.1007/s44250-023-00050-w
