International Journal of Nursing Sciences. 2025 Oct 19;12(6):516–523. doi: 10.1016/j.ijnss.2025.10.005

Nursing Retrieval-Augmented Generation: Retrieval augmented generation for nursing question answering with large language models

Liping Xiong a, Qiqiao Zeng a, Weixiang Luo b, Ronghui Liu c
PMCID: PMC12684753  PMID: 41367595

Abstract

Objective

This study aimed to develop a Nursing Retrieval-Augmented Generation (NurRAG) system based on large language models (LLMs) and to evaluate its accuracy and clinical applicability in nursing question answering.

Methods

A multidisciplinary team consisting of nursing experts, artificial intelligence researchers, and information engineers collaboratively designed the NurRAG framework following the principles of retrieval-augmented generation. The system included four functional modules: 1) construction of a nursing knowledge base through document normalization, embedding, and vector indexing; 2) nursing question filtering using a supervised classifier; 3) semantic retrieval and re-ranking for evidence selection; and 4) evidence-conditioned language model generation to produce citation-based nursing answers. The system was securely deployed on hospital intranet servers using Docker containers. Performance evaluation was conducted with 1,000 expert-verified nursing question–answer pairs. Semantic fidelity was assessed using Recall Oriented Understudy for Gisting Evaluation – Longest Common Subsequence (ROUGE-L), and clinical correctness was measured using Accuracy.

Results

The NurRAG system achieved significant improvements in both semantic fidelity and answer accuracy compared with conventional large language models. For ChatGLM2-6B, ROUGE-L increased from (30.73 ± 1.48) % to (64.27 ± 0.27) %, and accuracy increased from (49.08 ± 0.92) % to (75.83 ± 0.35) %. For LLaMA2-7B, ROUGE-L increased from (28.76 ± 0.89) % to (60.33 ± 0.21) %, and accuracy increased from (43.27 ± 0.83) % to (73.29 ± 0.33) %. All differences were statistically significant (P < 0.001). A qualitative case analysis further demonstrated that NurRAG effectively reduced hallucinated outputs and generated evidence-based, guideline-concordant nursing responses.

Conclusion

The NurRAG system integrates domain-specific retrieval with LLM generation to provide accurate, reliable, and traceable evidence-based nursing answers. The findings demonstrate the system’s feasibility and potential to improve the accuracy of clinical knowledge access, support evidence-based nursing decision-making, and promote the safe application of artificial intelligence in nursing practice.

Keywords: Evidence-based nursing, Large language models, Nursing knowledge base, Question-answering system, Retrieval-augmented generation

What is known?

  • Nurses increasingly require timely, accurate, evidence-based clinical answers, yet traditional keyword-based retrieval systems lack sufficient semantic understanding and precision.

  • Although large language models (LLMs) show promise for question answering, their clinical utility remains limited due to risks of inaccurate or hallucinated responses, especially within specialized nursing domains.

What is new?

  • We developed Nursing Retrieval-Augmented Generation (NurRAG), a retrieval-augmented generation framework that integrates a nursing-specific classifier and a curated knowledge base to significantly improve the accuracy and reliability of LLM-generated nursing responses.

  • Empirical evaluation confirmed that NurRAG substantially outperforms standard LLM approaches, offering a practical and clinically relevant method to support nursing decision-making and laying the groundwork for future intelligent nursing systems.

1. Introduction

The rapid advancement of medical technology and the increasingly complex demands of patient care have imposed unprecedented information management challenges on healthcare professionals [1,2]. Globally, nurses constitute the largest group in the health workforce, and escalating workloads intensify the need for timely, evidence-based knowledge access at the point of care [3]. As the first responders to dynamic clinical changes, nurses require rapid and accurate access to up-to-date guidance across diverse scenarios, such as triage, patient education, care planning, and interprofessional coordination [4,5]. However, conventional question-answering (QA) systems, which rely on database- and knowledge graph-based keyword matching, struggle to balance recall and precision when processing large-scale data and lack robust semantic comprehension [6,7].

Recent progress in large language models (LLMs) has opened up new opportunities for accessing clinical knowledge [8]. State-of-the-art general-purpose LLMs include GPT-4o [9], Google’s Gemini 1.5 [10], Meta’s Llama 3 [11], DeepSeek-V3 [12], and Qwen2.5 [13]. In the medical domain, these general models have been adapted and specialized into medical large language models (Med-LLMs) through domain-specific pretraining or fine-tuning on clinical texts. Notable examples include Medical Pathways Language Model (Med-PaLM) [14], a fine-tuned variant of PaLM that achieves expert-level performance in clinical QA, and Biomedical Generative Pre-trained Transformer (BioGPT) [15], designed for biomedical text generation using PubMed data. Open-source efforts such as PubMed Central Large Language Model Meta AI (PMC-LLaMA) [16] and Clinical-Camel [17] leverage large-scale medical corpora to enhance diagnostic reasoning and patient note understanding. Additionally, task-specific models, such as PsyChat [18] for mental health counseling and PneumoLLM [19] for pneumoconiosis diagnosis, demonstrate the trend toward specialization. Multimodal extensions further highlight the evolving capabilities of Med-LLMs in real-world clinical workflows: PathChat [20] integrates histopathology imaging with textual reports, and SlideChat [21] targets whole-slide pathology image understanding. However, foundational reviews caution that, despite strong benchmark scores, LLMs remain limited for autonomous clinical QA due to hallucinations, brittleness to input quantity, calibration issues, bias, and inadequate evidence grading [22–24]. These constraints are particularly consequential in nursing contexts that require traceability, guideline concordance, and assurances of patient safety.

Against this background, retrieval-augmented generation (RAG) has emerged as a pragmatic way to enhance LLMs by grounding generation in evidence retrieved from external sources [25,26]. The RAG framework significantly increases the accuracy and credibility of generated content by integrating indexing, retrieval, and generation processes, especially for knowledge-intensive tasks [27,28]. Recent literature has increasingly focused on modular RAG architectures to improve flexibility and scalability by incorporating various strategies to enhance their components, such as integrating search modules for similarity searches and fine-tuning retrievers [29]. Typical approaches include reorganizing RAG modules [30] and rearranging RAG pipelines [31]. The trend towards modular RAG methods is becoming more widespread, as these techniques facilitate sequential processing across components and enable end-to-end integrated training. Examples of such methods include FlashRAG [32], Telco-RAG [33], and AutoRAG [34]. Corrective RAG scores retrieved documents with lightweight evaluators and expands recall via large-scale web search to improve robustness against noisy evidence [35]. Query-focused strategies explicitly rewrite, decompose, or disambiguate questions to improve multi-hop reasoning and retrieval alignment [36]; dynamic-relevance frameworks further boost recall and accuracy through two-stage retrieval and compact classifiers [37]. Nevertheless, nursing QA specifically requires traceable, guideline-concordant responses grounded in verified cognitive knowledge bases rather than free-form conversational output [8].

Motivated by these observations, we develop NurRAG, a nursing-oriented, citation-first RAG framework that operationalizes evidence-grounded QA. NurRAG comprises four modules: 1) a domain knowledge base construction module that curates and versions guidelines, institutional protocols, evidence summaries, and other authoritative sources; 2) a nursing question filtering module that screens out non-nursing queries to ensure domain relevance; 3) a nursing knowledge retrieval module that identifies and re-ranks semantically aligned passages for precise grounding; and 4) an LLM generation module that conditions on the retrieved evidence to produce citation-controlled answers with embedded safety mechanisms. The objective of this study is to enhance nurses’ ability to access timely, accurate, and trustworthy clinical knowledge under high workload conditions, thereby advancing the development of intelligent, safe, and evidence-based decision support systems in nursing practice. Ultimately, this work aspires to provide a validated and scalable QA system that supports the responsible integration of artificial intelligence into busy nursing workflows.

2. Methods

2.1. Study design

This study adopted a two-phase methodological design to develop and preliminarily apply a retrieval-augmented QA system for nursing practice. The research was conducted in a large tertiary hospital. The first phase focused on system construction, including team formation, establishment of the theoretical framework, and development of the NurRAG system, which integrates knowledge base construction, question classification, knowledge retrieval, and evidence-conditioned answer generation. The second phase involved preliminary application and evaluation of the system, aimed at assessing its performance and feasibility in supporting nursing decision-making.

2.2. Ethical consideration

This study was conducted in accordance with the ethical principles for research involving human data, with strict adherence to data privacy, institutional governance, and responsible use of artificial intelligence in clinical contexts. All nursing documents used for system development, including clinical records, guidelines, and standard operating procedures (SOPs), were fully de-identified before analysis, with any personal identifiers such as names, codes, or timestamps removed in compliance with institutional and international data protection standards. The NurRAG system was deployed on secure local hospital servers within the internal network to prevent any external data transmission or online model training using patient information. Because the study relied solely on de-identified institutional data and involved no direct patient contact or intervention, informed consent was waived with approval from the Ethics Committee of Shenzhen People’s Hospital (Approval No. LL-KY-2024508-01). All AI-generated responses are intended to assist, rather than replace, professional nursing judgment; users are reminded that licensed practitioners must make the final clinical decisions. To ensure safety and reliability, expert reviews are conducted periodically, and a human-in-the-loop mechanism remains active throughout system updates and deployments.

2.3. Development of the system

2.3.1. Establishment of the team

The development of NurRAG was completed by a six-member multidisciplinary team integrating clinical and technical expertise. The clinical group, led by L. Xiong, included two nurses responsible for collecting, cleaning, and processing nursing data. The technical group, headed by R. Liu, also comprised two members, focusing on algorithmic development, system design, and secure deployment. Q. Zeng, serving as the head nurse, coordinated clinical validation and ensured the system’s alignment with nursing workflows, while W. Luo, the director of the nursing department, provided overall supervision and ensured compliance with institutional governance and ethical standards.

2.3.2. The sources of nursing data

The nursing data used in this study were derived from a combination of structured and unstructured sources, ensuring both clinical relevance and evidence-based reliability [1]. The nursing data were collected and processed by the clinical team. Structured data included de-identified electronic nursing records and institutional SOPs. Unstructured data consisted of nursing guidelines, expert consensus statements, peer-reviewed journal articles, and nursing textbooks. Clinical data were retrieved from the hospital’s internal databases, obtained with the approval of the ethics committee. External literature sources, including nursing journals, guidelines, and expert consensus documents, were collected through recognized academic databases such as PubMed, CNKI, and Wanfang Data. Representative examples include the Chinese Clinical Nursing Practice Guidelines (2024 Edition) and textbook chapters from Fundamentals of Nursing (FA Davis Company, 2016).

All materials were collected between January and March 2025, resulting in a total of 25,600 nursing documents, including 15,300 de-identified nursing records, 3,200 clinical guidelines, 2,000 peer-reviewed articles, and other validated reference materials such as care manuals and medication safety protocols. All data were subsequently digitized, cleaned, and standardized for integration into the NurRAG framework.

2.3.3. NurRAG framework

2.3.3.1. Nursing knowledge base construction

The nursing knowledge base was built by L. Xiong’s clinical team and independently reviewed by Q. Zeng and W. Luo to ensure clinical validity and consistency with evidence-based practice. As shown in Fig. 1(a), the construction followed a standardized three-step pipeline. First, in source acquisition and normalization, institutional protocols, SOPs, de-identified clinical records, nursing guidelines, expert consensus documents, journal articles, and textbooks were collected and converted to UTF-8 plain text, using optical character recognition where needed; headers, tables of contents, figures, and duplicate passages were then removed with rule-based filters and regular expressions. Second, in semantic-preserving segmentation, texts were split into coherent chunks with a target length of about 256 characters and an overlap of about 32 characters to retain local context. Third, metadata annotation recorded document provenance, department, publication or approval date, version identifier, and evidence level to support traceable retrieval.
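The segmentation step can be illustrated with a minimal sketch. It assumes a plain sliding window over characters with the stated target length (256) and overlap (32). The function name `chunk_text` is hypothetical, and the sketch does not attempt the sentence-aware, semantic-preserving splitting described above.

```python
def chunk_text(text: str, size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into character windows of at most `size` that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each new window starts this many characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # the window already reached the end of the text
            break
    return chunks


# Illustrative input: 600 characters of filler text.
sample = "".join(chr(ord("a") + i % 26) for i in range(600))
pieces = chunk_text(sample)
```

With these defaults, consecutive chunks share their boundary 32 characters, so a sentence cut at a chunk edge remains partly visible in the neighbouring chunk.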

Fig. 1. Architecture of the NurRAG framework. (a) Constructing a nursing knowledge base via document embedding and vector storage. (b) Filtering non-nursing questions using a classifier. (c) Retrieving relevant nursing knowledge through vector matching. (d) Generating evidence-based answers using LLMs. LLMs = large language models.

2.3.3.2. Nursing question classifier

To ensure that the system responds only to clinically appropriate and nursing-relevant queries, a question classifier was developed to act as an intelligent triage mechanism, aligning with how triage nurses prioritize incoming concerns based on clinical scope. This module plays a critical role in filtering out questions that fall outside the scope of nursing practice, thereby preventing LLMs from generating responses that may be inaccurate, misleading, or irrelevant in clinical settings. LLMs such as ChatGPT [38] often struggle with questions from highly specialized or interdisciplinary fields, particularly when such queries extend beyond the models’ training data. To address this, we define the complete set of input questions as $Q_a$, the subset of questions relevant to nursing as $R$, and the subgroup that the model can answer professionally as $D$, where $D \subseteq R \subseteq Q_a$. The classifier operates as a filter $N_t(\cdot)$ that extracts nursing-relevant questions from the complete input, $R = N_t(Q_a)$, where $N_t(\cdot)$ determines whether a user query falls within the nursing domain, as shown in Fig. 1(b). Irrelevant questions are excluded from further processing to reduce risks.

To implement this filter, we employ a bidirectional encoder representations from transformers (BERT)-based binary classification model trained on labeled examples of nursing and non-nursing questions. The classifier encodes the input question and predicts a binary label indicating relevance. This decision mechanism effectively ensures that only domain-appropriate queries proceed to the retrieval and generation stages.

As illustrated in Appendix A, the question text is first transformed into contextual embeddings using the BERT network. The representation corresponding to the [CLS] token is then passed through a fully connected neural network to produce a binary decision: 1 for nursing-relevant and 0 for irrelevant. This filtering step is crucial for ensuring the safety, specificity, and clinical reliability of the system, thereby effectively preventing the generation of responses to ambiguous or out-of-domain queries. In our implementation, the BERT encoder outputs a 768-dimensional vector for each question. This vector is passed through a fully connected classifier consisting of two hidden layers with dimensions 384 and 768, followed by a final output layer of size two corresponding to the binary classification decision. This module was developed and optimized by the technical team led by R. Liu.
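The gating role of the classifier can be sketched as follows. This is an illustration only: `filter_nursing` and `toy_classifier` are hypothetical names, and the keyword rule merely stands in for the trained BERT classifier, which is not reproduced here.

```python
from typing import Callable, Iterable


def filter_nursing(questions: Iterable[str], classify: Callable[[str], int]) -> list[str]:
    """Keep only the questions that the classifier labels 1 (nursing-relevant)."""
    return [q for q in questions if classify(q) == 1]


# Hypothetical stand-in for the trained classifier (assumption, illustration only).
NURSING_TERMS = ("nursing", "patient", "wound", "intraocular")


def toy_classifier(question: str) -> int:
    """Return 1 if any illustrative nursing term occurs in the question, else 0."""
    text = question.lower()
    return int(any(term in text for term in NURSING_TERMS))


questions = [
    "What nursing interventions reduce intraocular pressure?",
    "How do I reset my email password?",
]
relevant = filter_nursing(questions, toy_classifier)
```

Only the questions in `relevant` would proceed to retrieval and generation; the rest are dropped before any LLM call is made.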

2.3.3.3. Nursing knowledge retrieval

Effective knowledge retrieval is essential for ensuring that the system provides accurate, evidence-based answers grounded in real-world nursing practice. In this step, as shown in Fig. 1(c), both the user’s question and the knowledge base content are mapped into a continuous vector space using a domain-adapted embedding model. This allows the system to identify semantically similar text fragments, rather than relying on superficial keyword matching, thereby aligning more closely with how nurses search for relevant clinical information in practice. The initial retrieval is performed by computing similarity scores between the query vector and each vectorized knowledge chunk. Formally, this process can be expressed as $K_Q = \{K_j \mid j := \arg\max_j \, \mathrm{sim}(q, K_j)\}$, where $\mathrm{sim}(\cdot)$ denotes a similarity function such as cosine similarity or FAISS-based L2 distance, $q$ represents a nursing-relevant question, and $K_j$ denotes the $j$-th knowledge chunk in the database. $K_Q$ is the resulting set of top-matched knowledge fragments most semantically aligned with the input question. To enable efficient retrieval over large-scale nursing corpora, a dense passage retrieval framework with a dual-encoder architecture is employed.

In our implementation, the general text embedding model (GTE-large-zh [39]) replaces the traditional BERT encoder, generating 1024-dimensional dense vectors capable of capturing nuanced semantic relationships across both Chinese and English nursing texts. The similarity between the query and knowledge vectors is computed using the FAISS library [40], which supports rapid approximate nearest-neighbor search based on distance. This enables real-time access to relevant nursing content even across thousands of knowledge chunks.
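The top-match selection can be sketched with an exhaustive cosine-similarity search. This is a simplification under stated assumptions: the toy vectors are 3-dimensional rather than 1024-dimensional, the search is brute force rather than an approximate FAISS index, and the names `cosine` and `retrieve` are illustrative.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return the indices of the k chunks most similar to the query, best first."""
    order = sorted(
        range(len(chunk_vecs)),
        key=lambda j: cosine(query_vec, chunk_vecs[j]),
        reverse=True,
    )
    return order[:k]


# Toy embeddings (assumption): chunks 0 and 1 point roughly the same way as the query.
chunk_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.05, 0.0]
top = retrieve(query_vec, chunk_vecs, k=2)
```

An approximate nearest-neighbor index trades a small loss in exactness for speed, which matters once the number of chunks grows well beyond what a linear scan can handle interactively.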

Given the variability in retrieval precision, the top-matched results may still contain noise or suboptimal matches. To address this, we introduce a re-ranking stage that reorders the initially retrieved content based on finer-grained semantic relevance. This can be formulated as $K_Q^r = \{K_Q^i \mid i := \operatorname{argsort}_i \, \mathrm{Scores}(q, K_Q^i)\}$, where $K_Q^r$ represents the re-ranked set of knowledge chunks and $\mathrm{Scores}(\cdot)$ measures the contextual alignment between the question $q$ and each initially retrieved fragment $K_Q^i$. The indices $i$ correspond to the sorted order of these scores, ranking the retrieved chunks by semantic relevance. We utilize the BGE-reranker-large [43] model for this purpose, which is based on the XLM-RoBERTa architecture [44] and trained on multilingual QA datasets.
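The re-ranking step amounts to rescoring each retrieved passage against the question and sorting by the new score. The sketch below assumes a pluggable scoring function; `overlap_score` is a hypothetical word-overlap stand-in, not the BGE-reranker-large model, and the example passages are invented for illustration.

```python
from typing import Callable


def rerank(
    question: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    keep: int = 2,
) -> list[tuple[float, str]]:
    """Score every candidate against the question and return the best `keep`, highest first."""
    scored = [(score(question, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:keep]


def overlap_score(question: str, passage: str) -> float:
    """Hypothetical stand-in scorer: fraction of question words that occur in the passage."""
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


question = "nursing care for elevated intraocular pressure"
candidates = [
    "Hand hygiene is required before wound care",
    "Elevated intraocular pressure requires prompt nursing assessment",
    "Visiting hours end at eight",
]
ranked = rerank(question, candidates, overlap_score)
```

A cross-encoder reranker plays the role of `score` in a real pipeline: it reads the question and passage together, which is slower than vector search but usually more discriminating on a short candidate list.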

2.3.3.4. Nursing question answering

Following the retrieval of question-relevant knowledge, the system enters the answer generation stage, in which, as shown in Fig. 1(d), retrieved knowledge is used to construct context-enhanced prompts for LLMs. This step bridges structured nursing knowledge with fluent natural language generation, enabling the system to deliver responses that are both evidence-informed and clinically meaningful.

In practice, the top-ranked knowledge chunks $K_Q^r$ obtained through the semantic retrieval and re-ranking process are formatted alongside the user’s question into a structured prompt. This prompt serves as a guide for the LLM to generate a response grounded in validated nursing sources, much like a nurse would consult clinical guidelines or hospital protocols when answering a colleague’s question. Rather than generating answers based solely on general language patterns, the model incorporates retrieved knowledge through a weighted input mechanism. Chunks with higher similarity scores are given greater influence, allowing the model to focus on the most relevant information. The process can be formally described as $A_k = \mathrm{LLMs}(W \cdot K_Q^r)$, $W \propto \mathrm{Scores}$, where $A_k$ represents the final answer generated by the LLMs, $K_Q^r$ is the set of ranked knowledge chunks, and $W$ is a virtual weight matrix proportional to the relevance scores between each chunk and the input question. This weighting ensures that the response reflects the most contextually appropriate and clinically accurate content. The entire nursing QA module was developed and optimized by the technical team led by R. Liu.
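One way to picture the structured prompt is a template that lists the ranked evidence before the question. The sketch below is an assumption throughout: the source does not give the prompt wording or say how the weights are applied, so here relevance only determines the order of the evidence, and `build_prompt` and the instruction text are hypothetical.

```python
def build_prompt(question: str, ranked_chunks: list[tuple[float, str, str]]) -> str:
    """Assemble a prompt from (score, source, text) tuples, most relevant evidence first."""
    ordered = sorted(ranked_chunks, key=lambda item: item[0], reverse=True)
    lines = [
        "Answer the nursing question using only the evidence below.",
        "Cite each piece of evidence by its bracketed number.",
        "",
    ]
    for number, (_score, source, text) in enumerate(ordered, start=1):
        lines.append(f"[{number}] ({source}) {text}")  # numbered so the answer can cite it
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


# Illustrative evidence snippets (assumption, not taken from the knowledge base).
prompt = build_prompt(
    "How should the patient be positioned?",
    [
        (0.2, "SOP", "Document the position in the nursing record."),
        (0.9, "Guideline", "Maintain a semi-Fowler's position."),
    ],
)
```

Numbering the evidence in the prompt is what allows the generated answer to carry citations that a nurse can trace back to a specific source document.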

2.4. Evaluation of the system

2.4.1. Evaluation dataset and LLMs

The evaluation of the NurRAG system was conducted on April 20, 2025, using a benchmark dataset of 1,000 nursing QA pairs. These QA pairs were primarily derived from real-world clinical practice and supplemented by standardized diagnostic frameworks. The design of the questions followed the taxonomy of the North American Nursing Diagnosis Association (NANDA), specifically its Human Response Patterns, which define 128 standardized nursing diagnoses [41]. For instance, one representative question was: “For the nursing diagnosis of pain related to elevated intraocular pressure, what are the appropriate nursing interventions?” Nursing researchers initially drafted the queries, which were then cross-referenced with institutional SOPs and evidence-based guidelines. All QA pairs were reviewed and validated by a panel of experienced clinical nursing experts, who assessed clarity, consistency, and alignment with practice standards. Ambiguities such as variations in symptom interpretation or differences in departmental practices were resolved through expert consensus. This evaluation dataset, therefore, reflects both practical nursing knowledge and standardized diagnostic reasoning, forming a robust basis for assessing the performance of the proposed RAG-based system.

For model evaluation, two representative open-source LLMs were employed as foundational baselines: ChatGLM2-6B [45] and LLaMA2-7B [46]. ChatGLM2-6B, developed by Tsinghua University and Zhipu AI, is optimized for bilingual (Chinese–English) dialogue and efficient inference on limited hardware resources. LLaMA2-7B, released by Meta AI, is an autoregressive transformer model with strong generalization ability and wide adaptability to downstream tasks. Both models were selected for their open accessibility, balanced performance, and feasibility for local deployment in secure hospital environments.

2.4.2. Evaluation metrics

To assess the quality and clinical reliability of the system-generated nursing responses, two widely adopted evaluation metrics were employed: Recall Oriented Understudy for Gisting Evaluation – Longest Common Subsequence (ROUGE-L) [42] and Accuracy (ACC). These metrics offer complementary perspectives—one focused on semantic similarity to expert references, the other on overall clinical correctness—ensuring a balanced and rigorous evaluation.

ROUGE-L measures how well the key phrases and sentence structures in the generated answer align with the expert-approved reference response. It is sensitive to word order and is widely used to evaluate natural language generation tasks where maintaining the integrity of meaning is critical. ROUGE-L calculates recall and precision based on the length of the Longest Common Subsequence (LCS) shared between the generated and reference answers. Given a reference answer $X$ and a generated answer $Y$, with respective lengths $m$ and $n$, the recall is defined as $R_A = \mathrm{LCS}(X, Y)/m$, and the precision as $P_A = \mathrm{LCS}(X, Y)/n$. The final ROUGE-L score is computed as $\text{ROUGE-L} = (1+\beta^2) \times R_A \times P_A / (R_A + \beta^2 \times P_A)$, where $\beta$ is a weighting parameter controlling the relative importance of recall versus precision. In this study, ROUGE-L serves as a proxy for semantic alignment, indicating how closely the system-generated answer captures essential information from the reference.
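The ROUGE-L definition above can be sketched directly from the LCS recurrence. Two assumptions are made here: tokens are whitespace-separated words (Chinese text would need a different tokenization), and $\beta$ defaults to 1 because its value is not stated in the text. The function names are illustrative.

```python
def lcs_length(x: list[str], y: list[str]) -> int:
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(y) + 1)
    for x_tok in x:
        curr = [0]
        for j, y_tok in enumerate(y, start=1):
            if x_tok == y_tok:
                curr.append(prev[j - 1] + 1)  # extend the subsequence ending before both tokens
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip one token from x or from y
        prev = curr
    return prev[-1]


def rouge_l(reference: list[str], generated: list[str], beta: float = 1.0) -> float:
    """ROUGE-L F-score from LCS-based recall and precision (beta=1.0 is an assumption)."""
    if not reference or not generated:
        return 0.0
    lcs = lcs_length(reference, generated)
    if lcs == 0:
        return 0.0
    recall = lcs / len(reference)
    precision = lcs / len(generated)
    return (1 + beta**2) * recall * precision / (recall + beta**2 * precision)


reference = "monitor visual acuity and pupil size".split()
generated = "monitor pupil size and visual acuity".split()
score = rouge_l(reference, generated)  # LCS is 3 of 6 tokens, so recall = precision = 0.5
```

The example shows the sensitivity to word order: both answers contain the same six words, yet the score is only 0.5 because the reordering shortens the common subsequence.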

ACC, by contrast, is a stricter, clinically oriented metric. It measures the percentage of answers that are deemed entirely correct by expert reviewers, defined as responses that fully capture the intent and content of the reference answer without omission or distortion. This inclusive matching strategy ensures that clinical appropriateness is maintained. Mathematically, accuracy is calculated as $\mathrm{ACC} = \mathrm{CR}/\mathrm{Num}$, where $\mathrm{CR}$ is the number of correctly generated answers, and $\mathrm{Num}$ is the total number of evaluated instances. A higher ACC indicates a higher rate of clinically acceptable responses, reflecting the model’s practical utility in real-world nursing scenarios.

In addition to quantitative evaluation, a case-based qualitative study was conducted to further assess clinical interpretability and safety. Representative nursing questions were selected, and system-generated answers were compared with baseline LLM outputs. Two senior nursing experts independently reviewed each case to determine adherence to clinical guidelines and correction of hallucinated or unsafe content.

2.4.3. Data analysis

All statistical analyses were performed to examine whether the improvements achieved by the NurRAG-based framework were statistically significant compared with the LLM-based framework. Descriptive statistics, including mean and standard deviation, were first calculated to summarize the ROUGE-L and Accuracy scores across five repeated experiments. Subsequently, an independent samples t-test was conducted to compare the two frameworks for each LLM. A two-tailed significance level of P<0.05 was adopted to determine statistical significance. All analyses were performed using SPSS version 23.0.

3. Results

3.1. Description of the system

3.1.1. System deployment and operating environment

The nursing QA system was hosted on a single computing server. The hardware configuration consisted of two NVIDIA GeForce RTX 3090 GPUs and a 104-core Intel(R) Xeon(R) Gold 6230R CPU, running at 2.10 GHz, which provided sufficient computational power for real-time inference and retrieval tasks. The software environment consisted of Ubuntu 20.04 LTS as the operating system, PyTorch 2.1 as the primary deep learning framework, and CUDA 12.1 for GPU acceleration.

For deployment, the nursing QA system was containerized using Docker to ensure portability, isolation, and easy maintenance. The application was deployed within the hospital’s secure local network environment, with strict access control and data encryption to comply with institutional privacy and security regulations. Authorized users can access the system flexibly through any browser within the intranet by entering the assigned IP address and port number (e.g., 192.168.63.122:6066), enabling convenient use across multiple computers without compromising data security.

3.1.2. Interactive system interface

As illustrated in Appendix B, the interactive visual interface of the nursing QA system is designed to provide both flexibility and usability for frontline nurses. The top-left section contains two main functions: “Dialogue” and “Knowledge Base Management.” The “Dialogue” mode enables real-time QA through direct interaction, while the “Knowledge Base Management” mode allows users to upload files online to establish and maintain a private knowledge base.

In the middle-left part of the interface, the “Select Dialogue Mode” panel offers several options, including “LLM-based Dialogue,” “Knowledge Base + LLM Dialogue,” “Search Engine Dialogue,” and “Custom Agent Question Answering,” thereby accommodating different knowledge integration strategies. At the bottom center, the “Select LLM Model” panel offers model choices, including LLaMA2-7B, ChatGLM2-6B, and Baichuan2-7B. LLaMA2-7B is primarily suited for English-language queries, whereas ChatGLM2-6B and Baichuan2-7B are optimized for Chinese-language understanding and answering. On the right side, the real-time dialogue area enables nurses to input their clinical questions, select a dialogue mode and model, and receive immediate evidence-informed decision support. This design ensures the system can flexibly adapt to different clinical scenarios while maintaining ease of use for nurses.

3.2. Evaluation of the system

Before presenting the experimental results, it is essential to outline the evaluation procedure used to assess the effectiveness and reliability of the NurRAG framework. The evaluation was designed to comprehensively examine the system from two complementary perspectives: large-scale comparative experiments using standardized nursing QA pairs, and fine-grained qualitative analysis of representative clinical cases.

3.2.1. Comparative study

To evaluate the effectiveness of the proposed NurRAG framework, comparative experiments were conducted between the baseline LLM-only approach and the NurRAG-enhanced approach. In the baseline condition, questions were directly input into the LLMs (LLaMA2-7B and ChatGLM2-6B) without retrieval support. In contrast, in the NurRAG condition, the same questions were processed through classification and retrieval modules before being used for evidence-conditioned generation. Each experiment was repeated five times to ensure statistical reliability.

Independent samples t-tests were conducted to determine whether the observed performance differences between the two frameworks were statistically significant. Results showed that NurRAG achieved substantial improvements in both semantic fidelity and clinical correctness across models. For ChatGLM2-6B, the mean ROUGE-L score increased from (30.73 ± 1.48) % to (64.27 ± 0.27) % (t = −49.92, P < 0.001), while Accuracy rose from (49.08 ± 0.92) % to (75.83 ± 0.35) % (t = −60.84, P < 0.001). Similarly, for LLaMA2-7B, ROUGE-L improved from (28.76 ± 0.89) % to (60.33 ± 0.21) % (t = −76.81, P < 0.001), and Accuracy increased from (43.27 ± 0.83) % to (73.29 ± 0.33) % (t = −75.16, P < 0.001). These results indicate that the NurRAG framework significantly enhances both linguistic quality and clinical accuracy compared with standard LLM-based QA. Furthermore, the markedly smaller standard deviations under the NurRAG condition demonstrate greater output stability and consistency across repeated trials.
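The reported t statistics can be approximately reproduced from the summary statistics alone. The sketch below assumes the standard equal-variance independent-samples formula with n = 5 runs per condition; the function name `t_from_summary` is illustrative, and values computed from the unrounded per-run data will differ slightly.

```python
import math


def t_from_summary(mean_a: float, sd_a: float, mean_b: float, sd_b: float, n: int) -> float:
    """Independent-samples t statistic for two groups of equal size n (pooled variance)."""
    pooled_var = (sd_a**2 + sd_b**2) / 2  # equal group sizes, so a simple average
    standard_error = math.sqrt(pooled_var * 2 / n)
    return (mean_a - mean_b) / standard_error


# ChatGLM2-6B ROUGE-L: baseline (30.73 ± 1.48) % versus NurRAG (64.27 ± 0.27) %.
t_rouge = t_from_summary(30.73, 1.48, 64.27, 0.27, n=5)
degrees_of_freedom = 2 * 5 - 2
```

For this comparison the sketch gives t ≈ −49.85 with 8 degrees of freedom, close to the reported −49.92; the small gap is presumably due to rounding in the published means and standard deviations.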

3.2.2. Qualitative study

To further demonstrate the clinical value and practical effectiveness of the proposed NurRAG framework, we conducted a case-based qualitative study. This section contrasts the answer generated by a vanilla LLM with the response generated under the NurRAG framework for a representative clinical nursing query. The aim is to illustrate how the integration of evidence retrieval and domain-specific knowledge can mitigate hallucinations in LLMs and improve answer reliability in real-world scenarios.

Question: “What are the recommended nursing interventions for a patient diagnosed with acute glaucoma experiencing elevated intraocular pressure?”

LLM response: “To reduce intraocular pressure, administer 5 mg of oral timolol and advise the patient to lie flat with eyes closed. Avoid fluid intake for 24 h.”

This response contains hallucinations and clinically unsafe suggestions. Timolol is typically administered in ophthalmic (eye drop) form rather than orally, and “5 mg oral” is not a standard dose. Furthermore, advising a patient to avoid all fluid intake for 24 h lacks supporting evidence and could be harmful.

NurRAG response (evidence-enhanced via guideline retrieval): Recommended nursing interventions include: 1) administering topical timolol maleate 0.5 % as prescribed to lower intraocular pressure; 2) maintaining the patient in a semi-Fowler’s position to promote aqueous humor drainage; 3) monitoring visual acuity and pupil size regularly; 4) instructing the patient to avoid activities that increase intraocular pressure, such as straining or coughing. Evidence sourced from: 1) Chinese Glaucoma Guidelines (2020); 2) Expert Consensus on the Diagnosis and Treatment of Acute Attacks of Primary Angle-Closure Glaucoma in Zhejiang Province (2024); 3) European Glaucoma Society Guidelines (5th Edition); 4) Standard Operating Procedure for Ophthalmic Nursing Techniques.

The NurRAG-generated answer avoids unsafe recommendations and adheres to evidence-based practice. It references accurate medication routes, avoids incorrect dosages, and includes actionable, guideline-supported nursing care measures. This case study demonstrates that LLMs, when used without domain knowledge, are prone to fabricating clinical content that may appear plausible but is factually incorrect or unsafe. In contrast, the NurRAG framework substantially improves response reliability by integrating authoritative clinical references through retrieval-augmented generation. The evidence-grounded response not only reflects standard nursing practices but also ensures the safety, interpretability, and trustworthiness required for clinical application. This highlights the practical advantage of NurRAG in bridging LLMs with validated nursing knowledge.

4. Discussion

This study introduces NurRAG, a domain-specific Retrieval-Augmented Generation framework tailored for nursing QA, which addresses a critical gap in the safe and accurate application of LLMs in clinical nursing settings. The evaluation demonstrated that integrating domain-grounded retrieval substantially improved both the semantic quality and correctness of model-generated nursing responses. Independent samples t-tests confirmed that these improvements were highly significant (P < 0.001), which is consistent with reduced hallucination and closer alignment with nursing guidelines. Compared with standard LLM outputs, NurRAG provided more coherent, evidence-based answers, and its lower standard deviations across repeated trials indicate more consistent and robust output. By embedding guideline-based knowledge into the response generation process, NurRAG enhances the interpretability and clinical relevance of LLM outputs.

In addition to the quantitative evaluation, a case-based analysis further highlighted the strengths of NurRAG in mitigating hallucinations. When presented with a clinically relevant nursing question on acute glaucoma, the baseline LLM produced an unsafe and inaccurate recommendation, including incorrect drug dosage and harmful advice regarding fluid intake. In contrast, the NurRAG-enhanced system generated guideline-concordant, evidence-based interventions that aligned with international and regional clinical standards. This comparison not only demonstrates NurRAG’s capacity to correct misleading or hazardous outputs but also emphasizes its role in delivering actionable, trustworthy, and safety-aware responses. Such findings suggest that beyond improving numerical performance metrics, NurRAG provides tangible clinical value by ensuring that generated answers adhere to best practices and reduce the risk of misinformation at the bedside.

The proposed framework holds considerable practical value for real-world nursing decision support. Unlike general-purpose LLMs, which often generate hallucinated or non-domain-specific content, NurRAG explicitly restricts the model’s generation to validated nursing knowledge sources, such as clinical guidelines, SOPs, and expert-reviewed patient care records. This domain-grounded mechanism ensures that responses are not only fluent but also clinically appropriate and effective. Moreover, NurRAG shows strong adaptability to clinical environments. Its modular design—comprising knowledge base construction, question classification, vector retrieval, and answer generation—facilitates system maintenance and future expansion. Furthermore, the use of two publicly available LLMs—LLaMA2-7B for English and ChatGLM2-6B for Chinese—enables the framework to handle bilingual nursing queries effectively while maintaining a lightweight computational footprint suitable for local deployment. This makes NurRAG an accessible and scalable solution for mid-sized hospitals or clinical research institutions lacking access to high-performance computing infrastructure.
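The modular design described above can be sketched as four small, independently replaceable functions. This is a toy illustration under stated assumptions, not the NurRAG implementation: a bag-of-words count vector stands in for the embedding model, a keyword rule stands in for the supervised question classifier, a template string stands in for the LLM, and every function name, document title, and knowledge-base entry below is hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def embed(text):
    # Stand-in for a learned embedding model: a bag-of-words count vector.
    return Counter(tokenize(text))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Module 1: knowledge base construction (embed and index each document).
def build_knowledge_base(documents):
    return [(title, text, embed(text)) for title, text in documents]

# Module 2: question filtering (a keyword rule replaces the trained classifier).
NURSING_TERMS = {"nursing", "patient", "intervention", "interventions", "care"}

def is_nursing_question(question):
    return bool(NURSING_TERMS & set(tokenize(question)))

# Module 3: semantic retrieval (rank documents by similarity to the question).
def retrieve(question, kb, top_k=1):
    query = embed(question)
    ranked = sorted(kb, key=lambda entry: cosine(query, entry[2]), reverse=True)
    return ranked[:top_k]

# Module 4: evidence-conditioned generation (a template replaces the LLM).
def generate_answer(question, evidence):
    cited = "; ".join(f"{text} [source: {title}]" for title, text, _ in evidence)
    return f"Question: {question}\nAnswer based on retrieved evidence: {cited}"

def answer(question, kb):
    if not is_nursing_question(question):
        return None  # out-of-scope questions are filtered before retrieval
    return generate_answer(question, retrieve(question, kb))

# Placeholder knowledge base; titles and texts are illustrative only.
KB = build_knowledge_base([
    ("Hypothetical glaucoma SOP",
     "For elevated intraocular pressure in acute glaucoma, administer topical "
     "timolol maleate as prescribed."),
    ("Hypothetical hygiene SOP",
     "Placeholder entry: hand hygiene procedure text would go here."),
])

print(answer("What are the nursing interventions for a patient with acute "
             "glaucoma and elevated intraocular pressure?", KB))
```

Because each module only exchanges plain data with the next, any one of them (for example, the retrieval step) could be swapped for a stronger component without changing the others, which is the maintenance benefit the modular design is meant to provide.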

This research builds upon and extends prior studies in the RAG field. Existing RAG frameworks, such as FlashRAG [32], CRAG [35], RQ-RAG [36], and AutoRAG [34], have proven effective in knowledge-intensive tasks by improving generation quality through document retrieval and indexing. However, they often encounter limitations such as retrieval of irrelevant content, hallucinations, and difficulty handling complex queries. While prior research introduced innovations such as modular RAG pipelines [34], reranking strategies [47], and multi-hop reasoning [36], few studies have explored their applicability in highly specialized domains such as nursing. By tailoring the knowledge retrieval process to the structure and semantics of clinical documents and introducing domain-aware filtering, NurRAG addresses these limitations and narrows the gap between generic RAG methods and specialized nursing applications.

The study provides valuable insights for both future research and real-world clinical applications. From a technical perspective, NurRAG demonstrates the feasibility and effectiveness of adapting modular RAG architectures to high-stakes domains, such as nursing, where accuracy, traceability, and clinical safety are of paramount importance. From a clinical perspective, the system can serve as a decision support tool for frontline nurses, enabling rapid access to standardized procedures, care protocols, and best-practice guidelines. In nursing education, NurRAG also offers the potential for interactive learning systems where students can ask questions and receive evidence-based responses grounded in actual clinical documentation.

While NurRAG exhibits strong performance in nursing QA, two notable limitations should be acknowledged. First, the underlying knowledge base is static and sourced exclusively from a single tertiary hospital. This restricts the breadth and diversity of included clinical guidelines and care protocols. The presence of outdated guidelines may also reduce the accuracy and reliability of generated responses, particularly in areas of clinical practice that evolve rapidly. Second, the QA dataset, although rigorously constructed and reviewed by clinical experts, may reflect a bias toward commonly encountered nursing scenarios. As a result, specific subspecialties, such as psychiatric nursing, neonatal intensive care, or palliative care, may be underrepresented in the dataset. Addressing these limitations will be crucial to enhancing the system’s generalizability and ensuring its applicability across a broader range of nursing domains.

Future research should focus on enhancing the adaptability and specialization of the NurRAG framework. This includes updating the nursing knowledge base on a regular schedule, for example every three months, to incorporate the latest guidelines and clinical data so that the system remains current and clinically relevant. Additionally, creating small, domain-specific knowledge bases for underrepresented nursing specialties can help reduce bias and improve the precision of answers for niche clinical queries. These improvements will support the broader deployment of NurRAG in diverse nursing contexts and enhance its value as both a clinical decision-support tool and an educational resource.

5. Conclusion

In clinical nursing practice, the increasing volume and complexity of information pose significant challenges for frontline nurses seeking timely and accurate knowledge support. Conventional information retrieval systems often fall short in delivering precise, context-relevant answers, while general-purpose LLMs are prone to hallucinated responses due to limited exposure to domain-specific knowledge. To address these limitations, this study introduced NurRAG, a retrieval-augmented generation framework specifically designed for nursing question answering. By integrating knowledge base construction, question classification, knowledge retrieval, and answer generation with LLMs, NurRAG enhances the factual accuracy and contextual relevance of generated responses. Comparative experiments demonstrate that NurRAG outperforms vanilla LLM-based methods in both ROUGE-L and accuracy metrics. A case study further illustrated the system’s ability to reduce hallucinations by anchoring answers in verified nursing knowledge. These findings suggest that NurRAG holds substantial promise for enhancing clinical decision support, promoting evidence-based nursing practice, and reducing the cognitive burden on nurses in fast-paced environments.

CRediT authorship contribution statement

Liping Xiong: Conceptualization, Methodology, Validation, Formal analysis, Investigation, Data curation, Writing – original draft, Writing – review & editing. Qiqiao Zeng: Conceptualization, Methodology, Investigation, Resources, Writing – review & editing, Supervision, Project administration. Weixiang Luo: Conceptualization, Methodology, Funding acquisition, Writing – review & editing, Supervision, Project administration. Ronghui Liu: Conceptualization, Methodology, Validation, Formal analysis, Investigation, Data Curation, Software, Writing – review & editing, Visualization.

Data availability statement

The datasets generated during and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Funding

This study was supported by the Young and Middle-aged Research Fund Project of Shenzhen People’s Hospital (Grant No. SYHL2024-N0010) and the Shenzhen Basic Research Program (General Program, Grant No. JCYJ20240813104409013). The funding organizations had no role in the design, implementation, data collection, analysis, or interpretation of the study.

Declaration of competing interest

The authors declare there is no conflict of interest.

Acknowledgments

The authors thank all the contributing participants for their support in completing the study.

Footnotes

Peer review under responsibility of Chinese Nursing Association.

Appendices

Supplementary data to this article can be found online at https://doi.org/10.1016/j.ijnss.2025.10.005.

Appendices. Supplementary data

The following are the Supplementary data to this article:

Multimedia component 1
mmc1.docx (16.4KB, docx)
Multimedia component 2
mmc2.docx (187.9KB, docx)

References

1. Fu M.R., Kurnat-Thoma E., Starkweather A., Henderson W.A., Cashion A.K., Williams J.K., et al. Precision health: a nursing perspective. Int J Nurs Sci. 2020;7(1):5–12. doi: 10.1016/j.ijnss.2019.12.008.
2. Sivarajkumar S., Mohammad H.A., Oniani D., Roberts K., Hersh W., Liu H.F., et al. Clinical information retrieval: a literature review. J Healthc Inform Res. 2024;8(2):313–352. doi: 10.1007/s41666-024-00159-4.
3. Agyeman-Manu K., Ghebreyesus T.A., Maait M., Rafila A., Tom L., Lima N.T., et al. Prioritizing the health and care workforce shortage: protect, invest, together. Lancet Glob Health. 2023;11(8):e1162–e1164. doi: 10.1016/s2214-109x(23)00224-3.
4. Adhikari R., Smith P. Global nursing workforce challenges: time for a paradigm shift. Nurse Educ Pract. 2023;69. doi: 10.1016/j.nepr.2023.103627.
5. Tong L., Niu Y.R., Zhou L.X., Jin S., Wang Y.L., Xiao Q. Perceptions and experiences of generative artificial intelligence training to support research for Chinese nurses: a qualitative focus group study. Int J Nurs Sci. 2025;12(3):210–217. doi: 10.1016/j.ijnss.2025.04.010.
6. Calijorne Soares M.A., Parreiras F.S. A literature review on question answering techniques, paradigms and systems. J King Saud Univ Comput Inf Sci. 2020;32(6):635–646. doi: 10.1016/j.jksuci.2018.08.005.
7. Liu J.L., Liu F., Fang J.B., Liu S.R. The application of chat generative Pre-trained transformer in nursing education. Nurs Outlook. 2023;71(6). doi: 10.1016/j.outlook.2023.102064.
8. Ni Z.X., Peng R., Zheng X.F., Xie P. Embracing the future: integrating ChatGPT into China’s nursing education system. Int J Nurs Sci. 2024;11(2):295–299. doi: 10.1016/j.ijnss.2024.03.006.
9. Hurst A., Lerer A., Goucher A.P., Perelman A., Ramesh A., Clark A., et al. Gpt-4o system card. arXiv Preprint arXiv:2410.21276 2024. https://arxiv.org/abs/2410.21276. [Accessed 18 July 2025].
10. Team G., Georgiev P., Lei V.I., Burnell R., Bai L., Gulati A., et al. Gemini 1.5: unlocking multimodal understanding across millions of tokens of context. arXiv Preprint arXiv:2403.05530 2024. https://arxiv.org/abs/2403.05530. [Accessed 18 July 2025].
11. Dubey A., Jauhri A., Pandey A., Kadian A., Al-Dahle A., Letman A., et al. The llama 3 herd of models. arXiv Preprint arXiv:2407.21783 2024. https://arxiv.org/abs/2407.21783. [Accessed 18 July 2025].
12. Liu A., Feng B., Xue B., Wang B., Wu B., Lu C., et al. Deepseek-v3 technical report. arXiv Preprint arXiv:2412.19437 2024. https://arxiv.org/abs/2412.19437. [Accessed 18 July 2025].
13. Yang A., Li A., Yang B., Zhang B., Hui B., Zheng B., et al. Qwen3 technical report. arXiv Preprint arXiv:2505.09388 2025. https://arxiv.org/abs/2505.09388. [Accessed 18 July 2025].
14. Singhal K., Azizi S., Tu T., Mahdavi S.S., Wei J., Chung H.W., et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172–180. doi: 10.1038/s41586-023-06291-2.
15. Luo R.Q., Sun L.A., Xia Y.C., Qin T., Zhang S., Poon H., et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform. 2022;23(6). doi: 10.1093/bib/bbac409.
16. Wu C.Y., Lin W.X., Zhang X.M., Zhang Y., Xie W.D., Wang Y.F. PMC-LLaMA: toward building open-source language models for medicine. J Am Med Inform Assoc. 2024;31(9):1833–1843. doi: 10.1093/jamia/ocae045.
17. Hager P., Jungmann F., Holland R., Bhagat K., Hubrecht I., Knauer M., et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med. 2024;30(9):2613–2622. doi: 10.1038/s41591-024-03097-1.
18. Qiu H.C., Li A.Q., Ma L.Z., Lan Z.Z. 2024 27th international conference on computer supported cooperative work in design (CSCWD). IEEE; Tianjin, China: 2024. PsyChat: a client-centric dialogue system for mental health support; pp. 2979–2984.
19. Song M.Y., Wang J.R., Yu Z.H., Wang J.X., Yang L., Lu Y.T., et al. PneumoLLM: harnessing the power of large language model for pneumoconiosis diagnosis. Med Image Anal. 2024;97. doi: 10.1016/j.media.2024.103248.
20. Lu M.Y., Chen B.W., Williamson D.F.K., Chen R.J., Zhao M., Chow A.K., et al. A multimodal generative AI copilot for human pathology. Nature. 2024;634(8033):466–473. doi: 10.1038/s41586-024-07618-3.
21. Chen Y., Wang G., Ji Y., Li Y., Ye J., Li T., et al. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR). 2025. SlideChat: a large vision-language assistant for whole-slide pathology image understanding; pp. 5134–5143.
22. Hadi A.L., Tran E., Nagarajan B., Kirpalani A. Evaluation of ChatGPT as a diagnostic tool for medical learners and clinicians. PLoS One. 2024;19(7). doi: 10.1371/journal.pone.0307383.
23. Williams C.Y.K., Miao B.Y., Kornblith A.E., Butte A.J. Evaluating the use of large language models to provide clinical recommendations in the emergency department. Nat Commun. 2024;15:8236. doi: 10.1038/s41467-024-52415-1.
24. Bedi S., Liu Y.T., Orr-Ewing L., Dash D., Koyejo S., Callahan A., et al. Testing and evaluation of health care applications of large language models: a systematic review. JAMA. 2025;333(4):319. doi: 10.1001/jama.2024.21700.
25. Fan W.Q., Ding Y.J., Ning L.B., Wang S.J., Li H.Y., Yin D.W., et al. Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining. ACM; Barcelona Spain: 2024. A survey on RAG meeting LLMs: towards retrieval-augmented large language models; pp. 6491–6501.
26. Ke Y.H., Jin L.Y., Elangovan K., Abdullah H.R., Liu N., Sia A.T.H., et al. Retrieval augmented generation for 10 large language models and its generalizability in assessing medical fitness. npj Digit Med. 2025;8:187. doi: 10.1038/s41746-025-01519-z.
27. Gao Y., Xiong Y., Gao X., Jia K., Pan J., Bi Y., et al. Retrieval-augmented generation for large language models: a survey. arXiv Preprint arXiv:2312.10997. 2023;2. doi: 10.1109/aixset62544.2024.00030.
28. Ma X., Gong Y., He P., Zhao H., Duan N. Query rewriting for retrieval-augmented large language models. arXiv Preprint arXiv:2305.14283 2023. https://arxiv.org/abs/2305.14283. [Accessed 25 March 2025].
29. Muludi K., Fitria K.M., Triloka J., Sutedi. Retrieval-augmented generation approach: document question answering using large language model. Int J Adv Comput Sci Appl. 2024;15(3):776–785. doi: 10.14569/ijacsa.2024.0150379.
30. Yu W., Iter D., Wang S., Xu Y., Ju M., Sanyal S., et al. Generate rather than retrieve: large language models are strong context generators. arXiv Preprint arXiv:2209.10063 2022. https://arxiv.org/abs/2209.10063. [Accessed 25 March 2025].
31. Shao Z., Gong Y., Shen Y., Huang M., Duan N., Chen W. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv Preprint arXiv:2305.15294 2023. https://arxiv.org/abs/2305.15294. [Accessed 25 March 2025].
32. Jin J.J., Zhu Y.T., Dou Z.C., Dong G.T., Yang X.Y., Zhang C.H., et al. Companion proceedings of the ACM on web conference 2025. ACM; Sydney NSW Australia: 2025. FlashRAG: a modular toolkit for efficient retrieval-augmented generation research; pp. 737–740.
33. Bornea A.-L., Ayed F., De Domenico A., Piovesan N., Maatouk A. Telco-RAG: navigating the challenges of retrieval-augmented language models for telecommunications. arXiv Preprint arXiv:2404.15939 2024. https://arxiv.org/abs/2404.15939. [Accessed 25 March 2025].
34. Kim D., Kim B., Han D., Eibich M. AutoRAG: automated framework for optimization of retrieval augmented generation pipeline. arXiv Preprint arXiv:2410.20878 2024. https://arxiv.org/abs/2410.20878. [Accessed 25 March 2025].
35. Yan S.-Q., Gu J.-C., Zhu Y., Ling Z.-H. Corrective retrieval augmented generation. arXiv Preprint arXiv:2401.15884 2024. https://arxiv.org/abs/2401.15884. [Accessed 25 March 2025].
36. Chan C.-M., Xu C., Yuan R., Luo H., Xue W., Guo Y., et al. Rq-rag: learning to refine queries for retrieval augmented generation. arXiv Preprint arXiv:2404.00610 2024. https://arxiv.org/abs/2404.00610. [Accessed 25 March 2025].
37. Hei Z., Liu W., Ou W., Qiao J., Jiao J., Song G., et al. Dr-rag: applying dynamic document relevance to retrieval-augmented generation for question-answering. arXiv Preprint arXiv:2406.07348 2024. https://arxiv.org/abs/2406.07348. [Accessed 25 March 2025].
38. Brown T., Mann B., Ryder N., Subbiah M., Kaplan J.D., Dhariwal P., et al. Language models are few-shot learners. NeurIPS. 2020;33:1877–1901. doi: 10.48550/arXiv.2005.14165.
39. Li Z., Zhang X., Zhang Y., Long D., Xie P., Zhang M. Towards general text embeddings with multi-stage contrastive learning. arXiv Preprint arXiv:2308.03281 2023. https://arxiv.org/abs/2308.03281. [Accessed 25 March 2025].
40. Johnson J., Douze M., Jegou H. Billion-scale similarity search with GPUs. IEEE Trans Big Data. 2021;7(3):535–547. doi: 10.1109/tbdata.2019.2921572.
41. NANDA International. John Wiley & Sons; 2014. Nursing diagnoses 2012-14: definitions and classification.
42. Lin C.-Y. Rouge: a package for automatic evaluation of summaries. Text summarization branches out. 2004:74–81. https://aclanthology.org/W04-1013/. [Accessed 25 March 2025].
43. Xiao S., Liu Z., Zhang P., Muennighoff N. Proceedings of the 47th international ACM SIGIR conference on research and development in information retrieval. 2024. C-pack: packaged resources to advance general chinese embedding; pp. 641–649.
44. Conneau A., Khandelwal K., Goyal N., Chaudhary V., Wenzek G., Guzmán F., et al. Proceedings of the 58th annual meeting of the association for computational linguistics. ACL; Online. Stroudsburg, PA, USA: 2020. Unsupervised cross-lingual representation learning at scale; pp. 8440–8451.
45. Touvron H., Lavril T., Izacard G., Martinet X., Lachaux M.-A., Lacroix T., et al. Llama: open and efficient foundation language models. arXiv Preprint arXiv:2302.13971 2023. https://arxiv.org/abs/2302.13971. [Accessed 18 July 2025].
46. Glm T., Zeng A., Xu B., Wang B., Zhang C., Yin D., et al. ChatGLM: a family of large language models from GLM-130B to GLM-4 all tools. arXiv Preprint arXiv:2406.12793 2024. https://arxiv.org/abs/2406.12793. [Accessed 25 March 2025].
47. Eibich M., Nagpal S., Fred-Ojala A. ARAGOG: advanced RAG output grading. arXiv Preprint arXiv:2404.01037 2024. https://arxiv.org/abs/2404.01037. [Accessed 18 July 2025].

Articles from International Journal of Nursing Sciences are provided here courtesy of Chinese Nursing Association
