Abstract
Clinicians face an ever-increasing volume of medical literature and are turning to large language models and “deep research” systems to retrieve, organize, and synthesize biomedical evidence. In our use of these tools, we have found them useful in producing coherent and comprehensive summaries and proposing testable hypotheses. However, the outputs of these models are prone to flattening evidentiary hierarchies, overgeneralizing across heterogeneous populations and comparators, and occasionally propagating hallucinated citations. These failure modes risk automation bias and erosion of transparency if introduced into clinical pathways without guardrails. Here, we propose a “clinician-in-the-loop” approach in which clinicians remain the gatekeepers for artificial intelligence-assisted synthesis. We outline three core duties for clinicians: (1) evidence weighting that privileges randomized trials, high-quality meta-analyses, and absolute risk communication; (2) contextual integration across pathophysiology, existing evidence, and patient populations; and (3) provenance and bias auditing through source verification, uncertainty reporting, and counter-summaries. We also explore how healthcare institutions, medical educators, policymakers, and publishers of medical literature can promote literacy and transparency regarding the use of “deep research” tools, including the implementation of reporting standards, provenance disclosures, and equity surveillance.
Keywords: Deep research, artificial intelligence, large language models, evidence synthesis, automation bias
As physicians-in-training, we are acutely aware of the explosive growth of medical information. Medical knowledge is forecasted to double approximately every 3 months, 1 PubMed alone added nearly 1.6 million new citations in 2023, 2 and a physician would need an unrealistic 20+ hours per day just to skim all relevant publications to stay current. 3 Faced with this overwhelming deluge, we find ourselves increasingly turning to large language models (LLMs)—particularly the latest large reasoning models (LRMs). These advanced LLMs go beyond mere summarization; they undertake structured, multi-step reasoning tasks to help us organize and synthesize complex medical literature. We are the first generation of clinicians to come of age alongside these technologies, and their so-called “deep research” capabilities offer a tantalizing vision: instant access to organized, synthesized medical knowledge.
Nonetheless, our enthusiasm recently met a sobering test during a literature review of the latest glaucoma therapies (a topic aligned with our specialization as ophthalmology residents). We tasked an LRM with summarizing the newest clinical trials and came to an unsettling realization: the generated summary was fluent and authoritative, yet oddly hollow. It effortlessly aggregated relevant studies but failed to highlight meaningful patterns or provide a deeper analytical context. It was a well-organized collage, not a thoughtful synthesis.
Our experience illustrates a core truth about the current state of artificial intelligence (AI)-driven “deep research.” Tools such as ChatGPT and specialized platforms such as Elicit can analyze, summarize, and even draft manuscripts from vast troves of literature. Yet they remain incapable of genuine interpretative insights or conceptual breakthroughs. Still, with over a tenth of physicians now regularly using LLMs to generate summaries of medical research and standards of care, 4 it is paramount that physicians grasp the utility and limitations of “deep research.”
What is deep research?
Deep research refers to LRM-enabled workflows that extend beyond the single-pass text generation typical of earlier LLMs. Traditional LLMs produce responses in a single pass, relying on patterns learned during training rather than engaging with new information. While they can generate fluent text, they cannot consult new sources or verify their own reasoning. In contrast, deep research systems are designed to retrieve and critically reason over the most current biomedical literature and clinical guidelines. Two features are central to their approach: (1) retrieval-augmented generation, which grounds outputs in published sources and tracks provenance through citations, and (2) chain-of-thought reasoning, which breaks complex questions down into smaller, sequential steps. 5 The latter means that these models can iteratively check the coherence of their intermediate reasoning steps and revise outputs if inconsistencies are detected. In short, these systems are designed to produce evidence-linked syntheses that have been double-checked and corrected along the way, with the aim of enhancing transparency, reproducibility, and clinical applicability.
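As a rough illustration of the retrieval and provenance step described above, the following minimal sketch ranks a small, hypothetical corpus by word overlap with a query and returns each passage together with its source identifier. The corpus entries, the identifiers, and the `retrieve` function are assumptions made for illustration. Real deep research systems rely on far more sophisticated search and on a language model to draft the synthesis, neither of which is shown here.

```python
import re

# Hypothetical mini-corpus: identifiers and text are placeholders, not real studies.
CORPUS = {
    "source-A": "Randomized trial of drug X lowering intraocular pressure in glaucoma",
    "source-B": "Case series describing surgical outcomes in cataract patients",
    "source-C": "Meta-analysis of intraocular pressure lowering therapies in glaucoma",
}


def tokenize(text):
    """Lowercase a string and return its alphabetic words as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, corpus, top_k=2):
    """Return up to top_k (source_id, passage) pairs ranked by word overlap."""
    query_words = tokenize(query)
    scored = []
    for source_id, passage in corpus.items():
        overlap = len(query_words & tokenize(passage))
        if overlap > 0:
            scored.append((overlap, source_id, passage))
    # Highest overlap first; ties are broken alphabetically by source id.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(source_id, passage) for _, source_id, passage in scored[:top_k]]


hits = retrieve("glaucoma intraocular pressure therapies", CORPUS)
for source_id, passage in hits:
    # Every passage carries its identifier, so each claim can be traced back.
    print(f"[{source_id}] {passage}")
```

The point of the sketch is only that each retrieved passage stays attached to its source, which is what makes later spot-checking possible.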
Opportunities for clinicians
The practical benefits of deep research are evident. In minutes, these systems can ostensibly produce comprehensive summaries and literature overviews that once demanded weeks of human effort. Elicit, for example, is a deep research tool that can create lists of citations based on a research query and allows for easy searching through these articles. 6 By highlighting key findings from a diverse range of papers, LRMs become a springboard for creative thinking and hypothesis generation. In institutions without digital libraries or subscription databases, the technology can potentially level the playing field: instead of struggling to access paywalled articles, clinicians can use deep research tools to retrieve and distill the essential evidence. Many LLMs can also generate tables and graphs and perform basic statistical analyses, further extending their utility in clinical and research settings.
Fluent answers missing insight
On closer inspection, the apparent speed and well-organized nature of deep research-generated content belie critical shortcomings. First, deep research often flattens the hierarchy of evidence. Through the filter of LRMs, case series, randomized controlled trials, observational studies, editorials from subject matter experts, and meta-analyses from small and large journals alike all emerge as equally compelling findings, stitched together with limited consideration of differences in sample size, study design rigor, and bias risk.
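To make the idea of an evidence hierarchy concrete, here is a minimal sketch of the kind of ordering a reader might impose on a flat list of AI-returned studies. The tier values, the `Study` fields, and the example entries are assumptions chosen for illustration. The ordering is deliberately simplified and does not assess risk of bias or study quality, which still require human appraisal.

```python
from dataclasses import dataclass

# Simplified tiers assumed for this example; a lower number means stronger design.
DESIGN_TIER = {
    "randomized controlled trial": 1,
    "meta-analysis": 1,
    "observational study": 2,
    "case series": 3,
    "editorial": 4,
}
# Designs not listed above are placed after every known tier.
UNKNOWN_TIER = max(DESIGN_TIER.values()) + 1


@dataclass
class Study:
    label: str
    design: str
    sample_size: int


def order_by_evidence(studies):
    """Sort by design tier first, then by larger sample size within a tier."""
    return sorted(
        studies,
        key=lambda s: (DESIGN_TIER.get(s.design, UNKNOWN_TIER), -s.sample_size),
    )


# Hypothetical studies, listed in the flat order an AI summary might present them.
studies = [
    Study("Study 1", "case series", 12),
    Study("Study 2", "randomized controlled trial", 400),
    Study("Study 3", "editorial", 0),
    Study("Study 4", "observational study", 5000),
    Study("Study 5", "randomized controlled trial", 90),
]

ranked = order_by_evidence(studies)
for study in ranked:
    print(f"{study.label}: {study.design} (n={study.sample_size})")
```

Note that the large observational study still ranks below the smaller randomized trials: in this sketch, sample size only breaks ties within a design tier.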
Studies have illustrated the limitations of LLMs in summarizing medical evidence: human reviewers identify critical omissions and misinterpretations in system-generated summaries across multiple clinical domains. 7 Beyond medicine, controlled tests of scientific summarization demonstrate a systematic tendency to over-generalize conclusions relative to source texts, reinforcing the risk that fluent language can mask distorted takeaways. 8
Moreover, the snippets recombined from existing literature by LRMs are based purely on statistical co-occurrence, without any internal grasp of pathophysiology or clinical nuance. They cannot distinguish between breakthroughs and incremental progress, nor can they determine when a hypothesis is outdated or contradicted. They do not challenge prevailing assumptions but merely rephrase what already exists. In short, these systems simulate understanding without performing conceptual synthesis.
The fundamental limitation was rigorously quantified in Apple's recent study, “The Illusion of Thinking,” which examined leading LRMs across carefully controlled puzzle environments. 9 The researchers identified a counterintuitive phenomenon termed a “scaling limit”: as task complexity increases, models initially devote more reasoning effort, but soon reach a threshold beyond which their effort and accuracy dramatically decline despite having ample remaining computing resources. Notably, these failures occurred even with straightforward logic puzzles, which lack the inherent ambiguity, complexity, and contextual noise found in clinical medicine. It is not hard to imagine how these same models might falter when faced with the considerably messier problems of clinical decision-making and medical research. In other words, as problems become increasingly complex, these models demonstrate behaviors fundamentally misaligned with human cognition, scientific inquiry, and patient-centered medical care.
Then, even with the most advanced LRMs, there are hallucinations: confidently fabricated PubMed identifiers, nonexistent DOIs, invented trial results, and even spurious “expert quotations” that cannot be found when the cited sources are scrutinized. One of the most frustrating aspects of our use of deep research is chasing down AI-generated citations that lead to empty archives and dead ends, convincingly embedded within otherwise accurate information. This is not merely anecdotal: in medical contexts, published studies have documented high rates of fabricated or inaccurate references in chatbot-generated outputs. 10
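A first, purely mechanical screen for such citations can be sketched as follows. This minimal example checks only whether an identifier has a plausible format; a well-formed DOI or PMID can still be fabricated, so every citation must still be looked up by hand. The regular expressions, the `screen_citation` function, and the sample records are assumptions made for illustration.

```python
import re

# Format checks only: a well-formed identifier can still point to nothing.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
PMID_PATTERN = re.compile(r"^\d{1,8}$")


def screen_citation(citation):
    """Return a list of format problems found in one citation record."""
    problems = []
    doi = citation.get("doi")
    pmid = citation.get("pmid")
    if not doi and not pmid:
        problems.append("no DOI or PMID to look up")
    if doi and not DOI_PATTERN.match(doi):
        problems.append(f"malformed DOI: {doi}")
    if pmid and not PMID_PATTERN.match(pmid):
        problems.append(f"malformed PMID: {pmid}")
    return problems


# Placeholder records invented for this example; none refers to a real paper.
citations = [
    {"title": "Example A", "doi": "10.1234/example.5678", "pmid": "12345678"},
    {"title": "Example B", "doi": "doi:10/abc", "pmid": None},
    {"title": "Example C", "doi": None, "pmid": None},
]

for citation in citations:
    problems = screen_citation(citation)
    # An empty problem list means "plausible format", not "verified source".
    status = "needs manual lookup" if not problems else "; ".join(problems)
    print(f"{citation['title']}: {status}")
```

A record that passes this screen has earned nothing more than a manual lookup against the full text.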
Bias in AI outputs presents another challenge. Trained on a skewed archive of published work, these models perpetuate the same blind spots that are in the medical work they ingest: underrepresented patient groups remain less visible, minority perspectives go unheard, and dominant narratives grow more entrenched. 11
The clinician as gatekeeper
What our experience tells us is that deep research, at least in its current form, does not replace the clinician's judgment. The real promise lies in a “clinician-in-the-loop” approach: AI may perform a first pass at gathering relevant evidence and organizing raw material, but human experts must interpret the findings and decide which leads warrant further exploration.
Being a “clinician-in-the-loop” entails three specific responsibilities: (1) Evidence weighting: explicitly privileging stronger study designs (e.g. randomized trials and high-quality meta-analyses) over other sources; (2) Contextual integration: aligning aggregated claims with pathophysiology, population differences, practice standards, and trends over time; and (3) Bias and safety checks: screening for hallucinations and for algorithmic bias inherited from historical data.11–13 These duties may be familiar to clinicians, but AI's surface fluency means we must perform them more deliberately and transparently.
To help with this, today's clinician must also gain a high-level familiarity with the language of algorithms. It starts with a curiosity about how these tools learn: what datasets shape their outputs, and how their model architecture makes them confident yet prone to inventing facts. We can then subject AI-generated summaries to the same rigorous critique we apply to any peer-reviewed paper, asking whether the evidence comes from a randomized trial or a case series, whether conflicts of interest exist, and whether key subgroups are represented.
And suppose we brought AI into our existing culture of inquiry. What if, at the next journal club, alongside a presentation of the latest article from the American Journal of Ophthalmology, we also said, “Here's what my deep research model found; let's critique it together”? By sharpening our clinical judgment against the generated information, we won’t just keep pace with the growing flood of knowledge; we’ll shape it into meaningful insights.
The mandate as gatekeepers extends beyond the individual clinician. Medical educators should incorporate critical appraisal of AI outputs into their curricula. Journals and funding bodies should demand complete transparency whenever manuscripts lean on AI, whether it helped mine the literature or draft prose, and hold authors accountable for any undisclosed reliance. Peer reviewers should demand clear source citations and verify dubious claims. At the health system level, leadership should vet a “toolkit” of deep research services and establish protocols for monitoring AI-generated evidence.
Evidence from the literature shows that a thorough understanding of how AI reasons is essential to sound clinical decision-making. A randomized controlled trial found that providing hospital-based clinicians with systematically biased AI-diagnostic tools significantly decreased the clinicians’ diagnostic performance. 14 Systematic reviews have demonstrated how clinicians can become overconfident in machine outputs and less vigilant in verifying the validity of outputs.12,13 What this suggests is that AI may contribute to de-skilling when clinicians stop practicing core appraisal skills. Table 1 provides a checklist for the responsible use of AI-assisted research.
Table 1.
Gatekeeper checklist for artificial intelligence (AI)-assisted research.
| 1. Provenance: Can each key claim be traced to a peer-reviewed source? Routinely spot-check citations against full texts. |
| 2. Design weighting: Does the output distinguish randomized controlled trials, meta-analyses, observational studies, and expert opinion—and weight them accordingly? |
| 3. Numeracy: In cited studies, do effect sizes, confidence intervals, and absolute risks support the claim generated by the large language model? Flag “confidence without numbers.” |
| 4. Comparators and context: Does the AI-generated output respect clinical indications, study populations, and study variables in its synthesis? |
| 5. Contradictions: Are discordant trials acknowledged? If not, prompt for conflicting evidence and reassess. |
| 6. Bias and safety checks: Guard against automation bias; request a counter-summary (“argue the opposite”) and compare claims with sources. |
| 7. Equity: Ask explicitly about underrepresented subgroups and calibration across demographics; note data gaps. |
| 8. Actionability: Translate outputs into clinically verifiable next steps. |
| 9. Disclosure and accountability: If AI shaped your synthesis, disclose this to colleagues or patients and retain human responsibility for conclusions. |
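The numeracy item in the checklist can be illustrated with a short worked example. The sketch below computes absolute risk reduction, relative risk reduction, and number needed to treat from two-arm event counts. The `risk_summary` function and the counts are hypothetical, chosen for easy arithmetic rather than drawn from any trial.

```python
def risk_summary(events_treated, n_treated, events_control, n_control):
    """Compare two trial arms using absolute and relative measures."""
    if n_treated <= 0 or n_control <= 0:
        raise ValueError("each arm needs at least one participant")
    risk_treated = events_treated / n_treated
    risk_control = events_control / n_control
    arr = risk_control - risk_treated  # absolute risk reduction
    # Relative risk reduction is undefined when the control arm has no events.
    rrr = arr / risk_control if risk_control > 0 else None
    # Number needed to treat is only meaningful when treatment lowers risk.
    nnt = 1 / arr if arr > 0 else None
    return {
        "risk_treated": risk_treated,
        "risk_control": risk_control,
        "arr": arr,
        "rrr": rrr,
        "nnt": nnt,
    }


# Hypothetical counts: 10 of 1000 treated and 20 of 1000 control patients have the event.
summary = risk_summary(events_treated=10, n_treated=1000,
                       events_control=20, n_control=1000)
print(f"Relative risk reduction: {summary['rrr']:.0%}")
print(f"Absolute risk reduction: {summary['arr']:.1%}")
print(f"Number needed to treat: {summary['nnt']:.0f}")
```

With these made-up counts, a summary claiming that treatment "halves the risk" corresponds to an absolute reduction of one percentage point, or about 100 patients treated to prevent one event. A fluent summary that reports only the relative figure is an instance of the "confidence without numbers" problem flagged above.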
Two other aspects of being gatekeepers should be mentioned: patient trust and infrastructure equity. Patients’ perceptions of AI-influenced recommendations are heterogeneous, with common concerns over accuracy and accountability, and a general preference for clinician supervision.15,16 A pragmatic norm would be to disclose AI involvement in plain language and to confirm understanding with a brief teach-back. A study shows that transparency in the use of AI tools can improve patient trust without eliminating appropriate reliance. 17 Without attention to access, literacy, and representative data, AI tools can reinforce existing inequities. 11 Health systems and training programs should allocate resources for equitable access to vetted tools, offer faculty development in AI literacy, and audit for disparate impact. 18
Deep research is a powerful and exciting tool, but it is far from being an arbiter of truth. As we navigate the influx of medical literature and the emergence of LRMs, our human clinical judgment, paradoxically, becomes all the more crucial. We are aware of the accelerating evolution in the capabilities of LLMs, from simple text generation to multi-step analysis within three years. These advancements suggest that AI-driven reasoning itself will likely become significantly more sophisticated in the near future. Still, by rigorously scrutinizing these emerging tools, we not only preserve the compassionate, critical inquiry at the heart of our profession, but also translate the promise of deep research into better care for our patients.
Footnotes
ORCID iD: Henry Bair https://orcid.org/0000-0002-3422-0373
Funding: The authors received no financial support for the research, authorship, and/or publication of this article.
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
AI use statement: AI tools (ChatGPT, OpenAI) were used to improve grammar and phrasing only. The content, references, analysis, and conclusions were entirely written and verified by the authors. The authors take full responsibility for the integrity and accuracy of the work. No text or citations were accepted without manual verification.
References
- 1.Densen P. Challenges and opportunities facing medical education. Trans Am Clin Climatol Assoc 2011; 122: 48.
- 2.MEDLINE PubMed Production Statistics, https://www.nlm.nih.gov/bsd/medline_pubmed_production_stats.html?utm_source=chatgpt.com (2023, accessed 11 June 2025).
- 3.Porter J, Boyd C, Skandari MR, et al. Revisiting the time needed to provide adult primary care. J Gen Intern Med 2022; 38: 147–155.
- 4.Henry TA. 2 in 3 physicians are using health AI—up 78% from 2023. American Medical Association, https://www.ama-assn.org/practice-management/digital-health/2-3-physicians-are-using-health-ai-78-2023?utm_source=chatgpt.com (2025, accessed 11 June 2025).
- 5.Wei J, Wang X, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models. Adv Neural Inf Process Syst 2022; 35: 24824–24837.
- 6.Elicit Help Center. Elicit.com, https://support.elicit.com/en/categories/146369 (2025).
- 7.Tang L, Sun Z, Idnay B, et al. Evaluating large language models on medical evidence summarization. npj Digit Med 2023; 6: 58.
- 8.Peters U, Chin-Yee B. Generalization bias in large language model summarization of scientific research. R Soc Open Sci 2025; 12: 241776.
- 9.Apple Machine Learning Research. The illusion of thinking. Apple.com, https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf (2025).
- 10.Chen A, Chen DO. Accuracy of chatbots in citing journal articles. JAMA Netw Open 2023; 6: e2327647.
- 11.Obermeyer Z, Powers B, Vogeli C, et al. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019; 366: 447–453.
- 12.Lyell D, Coiera E. Automation bias and verification complexity: A systematic review. J Am Med Inf Assoc 2017; 24: 423–431.
- 13.Goddard K, Roudsari A, Wyatt JC. Automation bias: A systematic review of frequency, effect mediators, and mitigators. J Am Med Inf Assoc 2012; 19: 121–127.
- 14.Jabbour S, Fouhey D, Shepard S, et al. Measuring the impact of AI in the diagnosis of hospitalized patients. JAMA 2023; 330: 2275.
- 15.Esmaeilzadeh P, Mirzaei T, Dharanikota S. Patients’ perceptions toward human–artificial intelligence interaction in health care: Experimental study. J Med Internet Res 2021; 23: e25856.
- 16.Moy S, Irannejad M, Manning SJ, et al. Patient perspectives on the use of artificial intelligence in health care: A scoping review. J Patient Cent Res Rev 2024; 11: 51–62.
- 17.Sakamoto T, Harada Y, Shimizu T. Facilitating trust calibration in artificial-intelligence-driven differential diagnoses list for physicians’ diagnostic accuracy: A quasi-experimental study. JMIR Form Res 2024; 8: e58666.
- 18.Cross JL, Choma MA, Onofrey JA. Bias in medical AI: Implications for clinical decision-making. PLoS Digit Health 2024; 3: e0000651.
