We read with great interest the article by Hewitt et al., ‘Large language models as a diagnostic support tool in neuropathology’ [1]. The authors effectively applied large language models (LLMs) to interpreting the WHO classification of central nervous system tumors; however, we wish to address a technical aspect of their study that warrants clarification.
The authors described their approach as retrieval‐augmented generation (RAG). Based on the methods described, however, the study involved attaching a Word document containing the WHO diagnostic criteria to the prompt to guide the model's responses. We believe that this approach is more accurately described as document‐grounded generation than as true RAG. Document‐grounded generation refers to methods in which the model generates outputs explicitly based on a pre‐provided document that serves as a static reference [2]. Unlike RAG, which retrieves information dynamically from external sources [3], document‐grounded generation relies entirely on data embedded in the input prompt at the time of execution. In this study, the WHO criteria were supplied with the prompt, allowing the model to use this information without real‐time retrieval. This method is a form of in‐context learning, relying on curated contextual data embedded in the input [4].
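The distinction can be illustrated with a minimal sketch. All function names and passages below are hypothetical, and the retrieval step is a toy keyword‐overlap ranker standing in for the vector search a real RAG system would use; the point is only where the reference text enters the prompt.

```python
# Toy contrast between document-grounded generation and RAG.
# Passages, helper names, and the ranking heuristic are illustrative only.

CORPUS = [
    "Glioblastoma, IDH-wildtype, corresponds to CNS WHO grade 4.",
    "Astrocytoma, IDH-mutant, spans CNS WHO grades 2-4.",
    "Medulloblastoma is an embryonal tumour of the cerebellum.",
]

def grounded_prompt(document: str, question: str) -> str:
    """Document-grounded generation: the entire reference document is
    embedded statically in the prompt at execution time."""
    return f"Reference document:\n{document}\n\nQuestion: {question}"

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """RAG-style retrieval (toy): rank passages by word overlap with the
    query and keep the top k. A real system would query a vector index."""
    q = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def rag_prompt(question: str, corpus: list[str]) -> str:
    """RAG: only the dynamically retrieved passages enter the prompt."""
    passages = retrieve(question, corpus)
    return "Retrieved context:\n" + "\n".join(passages) + f"\n\nQuestion: {question}"

question = "What CNS WHO grade is glioblastoma, IDH-wildtype?"
top = retrieve(question, CORPUS)
print(top[0])
```

In the grounded case the model sees the whole Word document every time; in the RAG case the prompt content changes with each query, which is what makes the retrieval "dynamic".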
Our own work provides an example of in‐context learning in a different domain, namely image classification. We evaluated GPT‐4 Vision (GPT‐4V) for classifying histopathological images stained with tau immunohistochemistry, including neuritic plaques, astrocytic plaques, and tufted astrocytes [5]. Although GPT‐4V initially struggled, few‐shot learning with annotated examples, which is a specific application of in‐context learning, significantly improved its accuracy, matching that of a convolutional neural network model trained on a larger dataset. These findings demonstrate the utility of in‐context learning for both text‐based and image‐based tasks, with the latter presenting unique challenges for LLMs [6].
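The few‐shot setup can be sketched as follows. This is not the protocol of [5]; it only shows how annotated demonstrations precede the query in an OpenAI‐style chat/vision message list, with placeholder URLs and labels.

```python
# Sketch of a few-shot (in-context learning) prompt for image
# classification in an OpenAI-style chat/vision message format.
# URLs and labels are placeholders, not data from the cited study.

LABELS = ["neuritic plaque", "astrocytic plaque", "tufted astrocyte"]

def few_shot_messages(examples: list[tuple[str, str]],
                      query_url: str) -> list[dict]:
    """Build a message list in which each annotated example image is
    shown to the model before the unlabeled query image."""
    messages = [{
        "role": "system",
        "content": "Classify each tau-stained image as one of: "
                   + ", ".join(LABELS),
    }]
    for url, label in examples:  # the few-shot demonstrations
        messages.append({"role": "user",
                         "content": [{"type": "image_url",
                                      "image_url": {"url": url}}]})
        messages.append({"role": "assistant", "content": label})
    # finally, the image we actually want classified
    messages.append({"role": "user",
                     "content": [{"type": "image_url",
                                  "image_url": {"url": query_url}}]})
    return messages

demos = [("https://example.org/np.png", "neuritic plaque"),
         ("https://example.org/ap.png", "astrocytic plaque")]
msgs = few_shot_messages(demos, "https://example.org/query.png")
print(len(msgs))  # 1 system + 2 demos x 2 turns + 1 query = 6
```

Because the demonstrations live entirely inside the prompt, this is in‐context learning: no model weights change, yet classification behavior improves with the examples supplied.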
Although in‐context learning is an effective approach, it has limitations worth considering. Because this method uses static data preloaded into the prompt, errors can occur if the information is outdated or inaccurate. In‐context learning may also lead to overfitting to the given context, limiting the model's ability to generalize to other scenarios. If the contextual data are overly complex, the model may misinterpret the information or fail to generate accurate outputs [7]. To ensure reliability, it is important to select input data carefully, update them regularly, and keep these limitations in mind when designing tasks.
In summary, clarifying the differences between RAG, document‐grounded generation, and in‐context learning is essential, especially for readers less familiar with these concepts. Nonetheless, we support the authors' conclusion that incorporating external data improves diagnostic performance. Their study, interpreted as an example of document‐grounded generation, demonstrates how LLMs can effectively assist in medical tasks when supported by well‐curated contextual data.
Author contributions
SK: conceptualization, drafting the manuscript. DO: reviewing and editing the manuscript. AO: reviewing and editing the manuscript.
Conflict of interest statement
No conflicts of interest were declared.
Acknowledgements
None.
References
- 1. Hewitt KJ, Wiest IC, Carrero ZI, et al. Large language models as a diagnostic support tool in neuropathology. J Pathol Clin Res 2024; 10: e70009.
- 2. Zhou K, Prabhumoye S, Black AW. A dataset for document grounded conversations. arXiv 2018. 10.48550/arXiv.1809.07358
- 3. Lewis P, Perez E, Piktus A, et al. Retrieval‐augmented generation for knowledge‐intensive NLP tasks. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Curran Associates: Vancouver, 2020; 793.
- 4. Brown TB, Mann B, Ryder N, et al. Language models are few‐shot learners. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. Curran Associates: Vancouver, 2020; 159.
- 5. Ono D, Dickson DW, Koga S. Evaluating the efficacy of few‐shot learning for GPT‐4Vision in neurodegenerative disease histopathology: a comparative analysis with convolutional neural network model. Neuropathol Appl Neurobiol 2024; 50: e12997.
- 6. Koga S, Du W. From text to image: challenges in integrating vision into ChatGPT for medical image interpretation. Neural Regen Res 2025; 20: 487–488.
- 7. Peng H, Wang X, Chen J, et al. When does in‐context learning fall short and why? A study on specification‐heavy tasks. arXiv 2023. 10.48550/arXiv.2311.08993
