Skip to main content
Journal of Pathology Informatics logoLink to Journal of Pathology Informatics
. 2025 Sep 29;19:100520. doi: 10.1016/j.jpi.2025.100520

Retrieval-augmented generation for interpreting clinical laboratory regulations using large language models

Suparna Nanua a, Raven Steward c, Benjamin Neely b, Michael Datto b, Kenneth Youens a,c,
PMCID: PMC12616094  PMID: 41244595

Abstract

Large language models (LLMs) have demonstrated strong performance on general knowledge tasks, but they have important limitations as standalone tools for question answering in specialized domains where accuracy and consistency are critical. Retrieval-augmented generation (RAG) is a strategy in which LLM outputs are grounded in dynamically retrieved source documents, offering advantages in accuracy, explainability, and maintainability. We developed and evaluated a custom RAG system called Raven, designed to answer laboratory regulatory questions using the part of the Code of Federal Regulations (CFR) pertaining to laboratory (42 CFR Part 493) as an authoritative source. Raven employed a vector search pipeline and a LLM to generate grounded responses via a chatbot–style interface. The system was tested using 103 synthetic laboratory regulatory questions, 88 of which were explicitly addressed in the CFR. Compared to answers generated manually by a board-certified pathologist, Raven's responses were judged to be totally complete and correct in 92.0% of those 88 cases, with little irrelevant content and a low potential for regulatory or medical error. Performance declined significantly on questions not addressed in the CFR, confirming the system's grounding in the source documents. Most suboptimal responses were attributable to faulty source document retrieval rather than model hallucination or misinterpretation. These findings demonstrate that a basic RAG system can produce useful, accurate, and verifiable answers to complex regulatory questions. With appropriate safeguards and with thoughtful integration into user workflows, tools like Raven may serve as valuable decision-support systems in laboratory medicine and other knowledge-intensive healthcare domains.

Keywords: Large language model, Retrieval-augmented generation, Clinical decision support, Laboratory administration

Graphical abstract

Unlabelled Image

Background and significance

Large language models (LLMs) like those powering ChatGPT (OpenAI, San Francisco, California, United States) and Gemini (Google, Mountain View, California, United States) use a machine learning architecture called a transformer to generate textual output. A key innovation of transformer architecture is the self-attention mechanism, which permits the model to weight the relative importance of words in a sequence, thereby improving contextual understanding.1 When trained on vast datasets, the transformer architecture enables LLMs to display remarkable proficiency in comprehending language and generating text that closely mimics human expression.

LLMs excel at tasks such as document summarization, sentiment analysis, text classification, translation, text generation, and question answering.2 The potential for intelligent, interactive, and authoritative LLMs to serve as always-available consultants in the medical setting has generated some compelling speculation.3 Indeed, the most capable models demonstrate remarkable accuracy on general knowledge and common-sense type questions,4 as well as good performance on some specialized knowledge benchmarks such as United States Medical Licensing Exam-style medical questions.5

Despite these successes, LLMs suffer from important limitations as standalone sources for question answering when accuracy and consistency are paramount. First, a model's proficiency at question answering is linked to its training data. LLMs perform best when tasks reflect information within their training corpora, but can falter when faced with questions requiring precise and highly technical information not included in the training set.6 Second, LLM output can be non-deterministic, which means that due to the probabilistic strategies used by the models during text generation, the same input can produce different outputs across multiple trials.7,8 Third, LLMs are susceptible to “hallucinations”, instances in which they produce plausible-sounding but factually incorrect content. Whereas some hallucinations originate from inaccuracies in training data, they more fundamentally arise because LLMs generate responses that are statistically likely according to patterns in their training data. Practically speaking, this approach optimizes for responses that “sound right” rather than specifically for accuracy. For these and other reasons, studies done so far to evaluate the potential of LLMs in for knowledge-critical medical tasks such as clinical decision-support have shown mixed results, with some limitations evident in the models' ability to handle highly specialized decision-making.9, 10, 11, 12, 13

Various strategies have been applied to improve the performance of LLMs for authoritative question answering in knowledge-intensive domains. One widely used approach is called retrieval-augmented generation (RAG). With RAG, LLM responses are “grounded” by dynamically retrieving relevant information from an authoritative external data source, and then providing that information to the LLM at the time of response generation.14 This technique has been successfully applied in healthcare applications, and the approach can mitigate some of the limitations of LLMs discussed above.15, 16, 17, 18 RAG can compensate for outdated training data and can reduce (but probably not eliminate) hallucinations when the LLM is confronted with queries about highly specialized knowledge.19 Critically, unlike LLMs alone, RAG permits tracing and verification of the exact provenance of source information, which can be a requirement in domains like healthcare where explainability and the ability to verify accuracy are critical requirements.20 RAG systems are also more amenable to answering questions over rapidly evolving knowledge sets than are LLMs alone, as it is generally easier, faster, and cheaper to update an external data source than to retrain an LLM.

Laboratory medicine is a knowledge-intensive medical specialty with a vast corpus of evolving technical literature that must be understood, synthesized, and applied. An important subdomain of laboratory management is regulatory compliance. In the United States, the Centers for Medicare and Medicaid Services (CMS) enforce the Clinical Laboratory Improvement Amendments (CLIA), which establish quality standards for all laboratory testing performed for diagnosis, prevention, or treatment of disease in humans.21 These standards are codified in the Code of Federal Regulation (CFR), specifically 42 CFR Part 493.22 Entities such as the College of American Pathologists (CAP), the Commission on Office Laboratory Accreditation (COLA), the Joint Commission (TJC), and the AABB are granted “deemed” status by CMS, and can then certify that laboratories meet CLIA standards. These entities develop inspection processes and publish evolving sets of requirements that laboratories must follow to obtain and maintain accreditation.23 Practically, ensuring regulatory compliance is a major task in any well-run laboratory, requiring regular monitoring of and adaptation to a set of highly complex and evolving standards. Questions often arise about the correct interpretation and implementation of specific regulations, and laboratorians consult regulatory texts, accreditation checklists, and human experts to ensure ongoing compliance. The ongoing need for authoritative, up-to-date, and readily accessible regulatory information raises the question of whether an LLM could relieve some of this cognitive and administrative burden.24

In this article, we describe the architecture and performance of a proof-of-concept RAG application built to answer laboratory regulatory questions using the contents of 42 CFR Part 493 as an authoritative external data source.

Material and methods

Software architecture

Raven is an application designed and implemented by the study investigators that consists of a vector database (a specialized type of database that stores lists of numbers, or vectors, representing stored items, which allows retrieval of items with similar meaning rather than matching exact attributes or keywords) containing the contents of 42 CFR Part 493, orchestration software to perform database queries and to send and receive relevant information to and from an LLM, and a web-based graphical user interface. The application uses a chatbot-like presentation that permits users to ask laboratory regulatory questions, with the LLM providing answers based directly on relevant parts of 42 CFR 493 provided to the model as reference material. Key steps included are listed below.

Source text extraction and segmentation

The entirety of 42 CFR Part 493 was downloaded as a single HTML file from the ecfr.gov website.22 The HTML document was converted into Markdown format using the html2text Python library.25 Regular expressions were used to remove internal links and irrelevant document metadata. Document sections were identified and segmented based on section heading tags using customized regex-based parsing. Each resulting document segment represented one section of the CFR, e.g., 42 CFR § 493.15. The set of all segments was restructured into a comma-separated value document format with two columns: “Source”, containing the document section heading, and “Content”, containing the document section text.

Source document storage

To facilitate retrieval and display of the original 42 CFR Part 493 text to users alongside the LLM-generated responses, the segmented but otherwise unaltered source documents were saved in a SQLite database.26

Vector embedding storage

Each source document segment was further divided into chunks of up to 2000 tokens (about 1500 words of English prose) using the langchain text_splitter library. Nearly all (97.3%) of the source document segments each fit within a single chunk. Each chunk was then embedded using the OpenAI text-embedding-ada-002 model.27,28 The resulting embeddings and source document chunks were stored in a Qdrant vector database collection.29

Once the source and vector databases were populated with the 42 CFR Part 493 content, a langchain-based pipeline was used to orchestrate query processing, database retrieval, and response generation by the LLM.30 The key steps include:

Query embedding and retrieval

Each user query, i.e., laboratory regulatory question, was vector-embedded using the OpenAI text-embedding-ada-002 model. The Qdrant database then used cosine similarity for nearest-neighbor search to identify and retrieve the five source document chunks most relevant to the user query.

LLM response generation

The retrieved source document chunks and the user query were combined into a unified LLM query that was sent to the GPT-4 LLM (snapshot gpt-4-0314) via the OpenAI API, along with a system prompt as follows:

You are an AI assistant for answering questions about laboratory regulatory matters. You are given the following extracted text from a list of regulations and a question. Provide a professional and complete answer. Base your answer solely on the information provided in the prompts. Do not make up answers or provide answers from sources other than the extracted text. Provide a reference for each assertion you make. If you don't know the answer, say ‘I am not sure.’ If the question is not about laboratory policies or regulations, inform them that you are tuned to only answer questions about laboratory regulations.

<Relevant source document chunks>

<User query>

Model temperature was set to 0. The result was an LLM-generated response grounded in the relevant source document content.

User interface

A web-based chatbot-like user interface was developed using Streamlit.31 Alongside the LLM-generated answer to each user query, the original text of the relevant source documents was displayed after retrieval from the SQLite database (Fig. 1).

Fig. 1.

Fig. 1

Raven user interface. A sample question and answer are displayed. Source document segments corresponding to the retrieved chunks of the CFR are shown in the left-hand pane; clicking the down arrow displays the complete text of a segment.

Source code for the application is available at https://github.com/kyouens/raven_public.

Performance testing

To test Raven, the investigators manually generated 103 laboratory regulatory questions (Supplemental Data Table 1). The questions were authored to reflect a range of regulatory inquiries that might realistically be encountered in practice, and the question set was designed to span a wide variety of sections across 42 CFR Part 493. By design, 88 of the questions were items explicitly addressed in the text of 42 CFR Part 493 and 15 questions were items not explicitly addressed in the text of 42 CFR Part 493. Each question was then answered two times, once by Raven, and once manually by a board-certified clinical pathologist (SN) with extensive laboratory medical director experience using the 42 CFR Part 493 website for reference.22 Each of Raven's answers was then graded against the manual answer by the same investigator (SN) using a set of descriptive measures (Table 1) meant to address the accuracy, completeness, and usability of the automated answer, the potential for significant medical or regulatory error, and the efficiency of the automated method.

Table 1.

Descriptive performance measures.

How complete is the Raven-generated response?
  • 1.

    Totally complete

  • 2.

    Mostly complete

  • 3.

    Mostly incomplete

  • 4.

    Totally incomplete

How correct is the Raven-generated response?
  • 1.

    Totally correct

  • 2.

    Mostly correct

  • 3.

    Mostly incorrect

  • 4.

    Totally incorrect

Did Raven include irrelevant information from 42 CFR Part 493 in the answer?
  • 0.

    No

  • 1.

    Yes, a minor amount

  • 2.

    Yes, a major amount

To what extent would Raven's response create the potential for a significant regulatory or medical error?
  • 0.

    No potential

  • 1.

    Minor potential

  • 2.

    Major potential

How much time would you have to spend to edit Raven's response to be suitable for distribution to a laboratorian?
  • 0.

    No time (acceptable as-is)

  • 1.

    Less than 5 min

  • 2.

    5–10 min

  • 3.

    More than 10 min

Data analysis

Descriptive statistics were prepared for the following variables on the 88 questions explicitly addressed by 42 CFR 493: Raven's response completeness and correctness, the amount of irrelevant information Raven generated, the potential for Raven's response to cause a significant regulatory or medical error, the time required to manually edit Raven's response before it would be suitable for distribution to a laboratorian, and the time required for both Raven and the human reviewer to answer each question. Levene's test was used to compare the variability of response time between Raven and the human reviewer. For questions in which Raven's response was not optimal, responses were reviewed, and each error was classified as having most likely occurred due to faulty information retrieval, to misinterpretation of prompt information by the LLM (such as misunderstanding the question or misinterpreting the provided source document chunks), or to hallucination. Cases were classified as faulty information retrieval when the relevant source text was present in the database but was not provided to the LLM in the prompt.

Spearman correlation coefficients were calculated for the following variable pairs: Raven's answer correctness vs. answer completeness; question length vs. Raven's answer completeness and answer correctness; human reviewer response time vs. Raven's answer correctness and completeness; Raven's response time vs. its own answer correctness and completeness; and the potential for a significant regulatory or medical error vs. Raven's answer correctness and completeness.

Raven's performance on questions explicitly addressed by 42 CFR 493 was compared on a subset of metrics to its performance on the 15 questions not addressed by the regulations. Descriptive statistics for answer completeness and correctness were prepared for this subset, and Mann–Whitney U tests were performed to determine whether answer correctness and completeness differed based on whether the question content was explicitly addressed by the 42 CFR 493. An alpha level of 0.05 was used to define statistical significance.

Results

Descriptive statistics

Raven's responses were judged to be totally complete in 92.0% (81 of 88), mostly complete in 3.4% (3 of 88), mostly incomplete in 2.3% (2 of 88), and totally incomplete in 2.3% (2 of 88) of cases (Fig. 2A). Raven's responses were judged to be totally correct in 92.0% (81 of 88), mostly correct in 2.3% (2 of 88), mostly incorrect in 3.4% (3 of 88), and totally incorrect in 2.3% (2 of 88) of cases (Fig. 2B). Raven did not include any irrelevant information in 96.6% (85 of 88) of cases, included a minor amount in 2.3% (2 of 88), and a major amount in 1.1% (1 of 88) of cases (Fig. 2C). Raven's responses posed no potential for medical or regulatory harm in 95.4% (84 of 88) of cases, minor potential in 2.3% (2 of 88), and major potential in 2.3% (2 of 88) of cases (Fig. 2D). Raven's responses were judged acceptable as-is, requiring no editing before use by a laboratorian, in 90.1% (80 of 88) of cases. In 4.5% (4 of 88) of cases, minor edits requiring less than 5 min would be needed and in 4.5% (4 of 88) of cases, moderate edits requiring 5–10 min would be needed (Fig. 2E).

Fig. 2.

Fig. 2

Descriptive statistics.

On average, Raven answered each question in 14.5 s (SD = 9.3). Human response times averaged 134.3 s per question (SD = 200.4). Raven was, on average, 119.8 s faster than the human investigator, and human response times were much more variable (SD = 200.4 vs. 9.3), as confirmed by a significant difference in variances (Levene's test: F = 24.14, p < 0.001).

Correlations

There was a strong positive correlation between Raven's answer correctness and completeness (Spearman's ρ = 0.859, p < 0.001). No significant correlation was observed between question length and either Raven's response completeness (Spearman's ρ = −0.116, p = 0.28) or correctness (Spearman's ρ = −0.165, p = 0.12). There was no statistically significant correlation between human response time and Raven's correctness (Spearman's ρ = 0.054 p = 0.61) or completeness (ρ = 0.132, p = 0.22). Similarly, no significant correlation was observed between Raven's response time and either its correctness (ρ = 0.078, p = 0.47) or completeness (ρ = 0.052, p = 0.63). There was no statistically significant correlation between human response time and Raven's response time (Spearman's ρ = 0.167, p = 0.12). There was a statistically significant correlation between Raven's answer completeness and the potential for a significant regulatory or medical error (Spearman's ρ = 0.57, p < 0.001). Similarly, Raven's answer correctness was also strongly correlated with the potential for significant regulatory or medical error (ρ = 0.74, p < 0.001).

Performance on questions not explicitly addressed by 42 CFR 493

For questions covered by the regulations, 92.0% (81 of 88) responses were rated as totally complete, 3.4% (3 of 88) as mostly complete, 2.3% (2 of 88) as mostly incomplete, and 2.3% (2 of 88) as totally incomplete. For questions not addressed by 42 CFR 493, 46.7% (7 of 15) were rated as totally complete, 13.3% (2 of 15) as mostly complete, 33.3% (5 of 15) as mostly incomplete, and 6.7% (1 of 15) as totally incomplete.

For questions covered by the regulations 92.0% (81 of 88) responses were rated as totally correct, 2.3% (2 of 88) as mostly correct, 3.4% (3 of 88) as mostly incorrect, and 2.3% (2 of 88) as totally incorrect. For questions not addressed by 42 CFR 493, 40.0% (6 of 15) were rated as totally correct, 20.0% (3 of 15) as mostly correct, 20.0% (3 of 15) as mostly incorrect, and 20.0% (3 of 15) as totally incorrect.

Additionally, note that the system prompt directed the model to base its answer solely on information provided in the prompt, and to refuse an answer if the source document text provided in the prompt was insufficient. Raven successfully followed this instruction in 53.3% (8 of 15 cases). In the remaining 46.7% (7 of 15), Raven provided an answer that was irrelevant or only tangentially relevant to the question, but that was appropriately grounded in the content provided in the prompt.

In summary, Raven performed better on questions explicitly addressed by 42 CFR 493 compared to those that were not (Fig. 3). Completeness and correctness were significantly worse for questions not addressed by 42 CFR 493, as assessed by Mann–Whitney U tests (completeness: U = 960.5, p < 0.0001; correctness: U = 1003.5, p < 0.0001).

Fig. 3.

Fig. 3

Fig. 3

Relative performance on questions not addressed by 42 CFR 493.

Analysis of non-optimal answers

In total, 9.1% (8 of 88) responses were deemed suboptimal in one or more categories. For Raven's responses that were not totally complete, 71.4% (5 of 7) were due to faulty information retrieval and 28.6% (2 of 7) to misinterpretation. Likewise, for Raven's responses that were not totally correct, 71.4% (5 of 7) were due to faulty information retrieval and 28.6% (2 of 7) to misinterpretation. For Raven's responses that included irrelevant information, 66.7% (2 of 3) were due to faulty information retrieval and 33.3% (1 of 3) to misinterpretation. For Raven's responses that posed potential for medical or regulatory harm, 50.0% (2 of 4) were primarily due to faulty information retrieval and 50.0% (2 of 4) to misinterpretation. For Raven's responses that would require editing before use by a laboratorian, 62.5% (5 of 8) were due to faulty information retrieval and 37.5% (3 of 8) to misinterpretation. In aggregate, 62.5% (5 of 8) suboptimal responses were due to faulty information retrieval and 37.5% (3 of 8) to misinterpretation. For all five examples of faulty information retrieval, the document chunks required to answer the question were clearly not provided to the LLM in the prompt. For the three examples of misinterpretation, 33.3% (1 of 3) appeared to be caused by the LLM misunderstanding the questioner's intent, and 66.6% (2 of 3) were due to actual misinterpretation of the source document content. None of the suboptimal responses were due to hallucination.

Discussion

Overview of study findings

This study demonstrates that a RAG system can generate accurate, complete, and useful answers to most laboratory regulatory questions more quickly than a human expert. Raven's responses were judged to be totally complete and correct in 92.0% of cases when the content of the question was addressed in the source documents. Performance declined significantly when Raven was asked to answer questions that were not addressed in 42 CFR 493, confirming the system's grounding in the source documents. Importantly, for every response, Raven provided the original text of the source document alongside the LLM-generated answer, which permits users both to understand the provenance of the LLM-generated response and to independently confirm its accuracy.20

Technical challenges

The analysis of suboptimal answers generated by Raven showed that the dominant failure mode was faulty information retrieval rather than reasoning failure or hallucination by the LLM itself. In other words, Raven sometimes failed to retrieve the most relevant source document chunks from the vector database to provide to the LLM. Faulty retrieval is a well-known challenge with RAG systems, and has many potential causes.32, 33, 34

An intuitive strategy to address this challenge is simply to increase the number or size of the source document chunks provided to the LLM to increase the likelihood that the relevant information is included in the prompt. Indeed, some newer LLMs support extremely long context windows of well over 1 million tokens (about 750,000 words, or the length of several full-length novels), raising the question of whether the entire source document corpus could simply be passed to the model in a single prompt. However, there are important tradeoffs. The long context approach can be expensive and slow.35 Importantly, LLMs also tend to perform better when only relevant information is provided, rather than searching for a “needle in a haystack” among a large amount of irrelevant information.36

Therefore, an important problem with RAG systems is how best to optimize source document chunking.37 If source documents are too minutely fragmented, important contextual information can be lost. If source document chunks are too long, the “needle in a haystack” problem discussed above can occur; important information is buried within large, mostly irrelevant chunks of content, making it harder for the LLM to extract what matters. Fortunately for this project, the CFR is a highly structured document divided into a neatly labeled hierarchy of chapters, parts, subparts, and sections. This type of document is highly amenable to RAG because the source document can be readily parsed into short, semantically meaningful segments that readily fit within a single vector-embedded chunk, minimizing loss of context before vector embedding, storage, and retrieval. By contrast, document segmentation and chunking can be much more challenging to optimize when source document structure is not consistent.

To summarize, factors negatively impacting information retrieval such as suboptimal source document chunking or ineffective vector similarity searches will degrade performance, and are probably more important to the overall performance of a RAG system than is the performance of the underlying LLM.32 Raven is a basic (or a so-called “naïve”) RAG application, but various advanced RAG techniques have been developed to address these limitations, such as combining multiple search methods or breaking the retrieval phase into multiple steps and using LLMs to reason over and refine search results before passing the relevant information to the LLM for answer generation.37

A less common failure mode in the present study was the LLM misunderstanding the questioner's intent or misunderstanding relevant source document contents. One practical mitigation would be to improve the clarity of the questions themselves, perhaps using a structured template to guide user input or by refining the system prompt to encourage the model to ask clarifying follow-up questions when needed.38 Another potential solution would be to change the underlying LLM model, either by selecting a more capable model or by fine-tuning a model on a corpus of domain-specific questions and answers, though fine-tuning would dramatically increase development complexity.39

In the present study, we did not identify any instances of hallucination by the LLM. This is most likely a combined result of the clear prompt instructions, e.g., “Do not make up answers or provide answers from sources other than the extracted text”, the source document-grounded nature of the responses, and the low model temperature setting, which minimizes generative variability.

Practical challenges

In a developmental setting, building a working custom RAG application using open-source software libraries and public LLM application programming interfaces (APIs) is relatively straightforward. Implementing such a system in a real-world environment involves several practical challenges.

In the clinical laboratory setting, the stakes for regulatory compliance are high. Faulty interpretation of regulatory requirements can result in citations by inspectors, revoked laboratory accreditation, or even in direct patient harm. A similar point could be made for almost any important knowledge-based task in a clinical setting. It is not safe to assume that RAG applications will perform consistently when the underlying LLMs, vector search techniques, document segmentation and chunking techniques, prompt contents, or other technical parameters are modified, or when the nature of the information retrieval and processing task is changed. Thus, understanding the limitations and performance characteristics of a specific application technology stack on a specific, targeted knowledge-based task is critical before implementation.15

Developers and users should be aware of automation bias, a cognitive bias in which people tend to excessively trust automated systems.40 It is important to understand that even a generally well-performing application such as Raven can be significantly and unpredictably wrong. It can be very difficult, particularly for a user lacking domain expertise, to detect such errors in real time. Thus, for the time being, we believe the best role for such a tool in the clinical setting is as an adjunct to—not a replacement for—the expertise and judgment of a qualified professional. In a high-stakes clinical environment, the output of tools like Raven should be viewed with some degree of skepticism and directly cross-checked against source document contents. To help guard against automation bias, a production system should incorporate appropriate warnings about the tool's limitations in the user interface, and institutions should teach users about the strengths and weaknesses of the tool before implementation.

In essence, tools like Raven are clinical decision-support tools, and real-world implementation should adhere to established principles for good clinical decision-support: delivering the right information, to the right person, in the right format, through the right channel, at the right time.41 It is unclear whether a website with a chatbot-like user interface is a good channel for laboratory professionals; it may be preferable to embed the underlying technology into existing technologies or to integrate with locally ubiquitous communication platforms like Microsoft Teams. Future efforts should include refinement of the application using principles of user-centered design.42

The viability of custom LLM-based applications in healthcare institutions also depends heavily on local information technology policies and support. Most healthcare organizations will not permit the use of public APIs for sensitive data, even if the data do not include protected health information. This means that LLMs, databases, and other components of applications like Raven may need to be locally hosted within the institution's protected internal network or in a secure cloud environment. These and other considerations create barriers to adoption, including the need for infrastructure procurement, data governance policies, internal compliance reviews, and the like.

Finally, whereas 42 CFR 493 provides a regulatory foundation for laboratory quality and compliance, note that laboratory administrators more often use proprietary copyrighted materials like CAP Accreditation Checklists or Joint Commission Accreditation Manuals for real-world guidance. The CFR was selected for this project because it is freely and publicly available, but usage of copyrighted materials may require permission by the rights holders. At the time this study was conducted, it was unclear whether relevant rights holders would allow their property to be adapted for use in a RAG application, even as a proof-of-concept.

Study limitations

One major limitation of the study is that it benchmarked Raven's performance against the judgment and performance of a single human expert, which may have introduced bias in grading the system's performance. To address this limitation, future study should include an external reference standard against which both the human expert and RAG system can be compared. Another limitation of the study is that the regulatory questions used in the evaluation were contrived. Though designed to mimic real-world questions, performance data would be more valuable if performance were measured across a set of actual, real-world laboratory regulatory questions.

Conclusions

This study demonstrates that a RAG system can deliver accurate, complete, and usable answers to domain-specific regulatory questions in laboratory medicine. RAG systems offer a useful combination of flexibility, maintainability, and explainability. With appropriate safeguards, such systems have broad potential as decision-support tools across laboratory medicine and in other healthcare domains.

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work the authors used ChatGPT in order to check for coherence and overall fluency of the content. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the published article.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Declaration of competing interest

The authors have nothing to declare.

Footnotes

Appendix A

Supplementary data to this article can be found online at https://doi.org/10.1016/j.jpi.2025.100520.

Appendix A. Supplementary data

Supplementary material 1

Supplemental table 1. Regulatory question set with Raven's answers.

mmc1.xlsx (46.2KB, xlsx)

References

  • 1.Vaswani A., Shazeer N., Parmar N., et al. Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17. Curran Associates Inc; 2017. Attention is all you need; pp. 6000–6010. [Google Scholar]
  • 2.Janumpally R., Nanua S., Ngo A., Youens K. Generative artificial intelligence in graduate medical education. Front Med. 2025:11. doi: 10.3389/fmed.2024.1525604. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Lee P., Goldberg C., Kohane I. Pearson Education; 2023. The AI Revolution in Medicine: GPT-4 and Beyond. [Google Scholar]
  • 4.Bian N., Han X., Sun L., et al. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) Calzolari N., Kan M.Y., Hoste V., Lenci A., Sakti S., Xue N., editors. ELRA and ICCL; 2024. ChatGPT is a knowledgeable but inexperienced solver: an investigation of commonsense problem in large language models; pp. 3098–3110.https://aclanthology.org/2024.lrec-main.276/ Accessed May 4, 2025. [Google Scholar]
  • 5.Gilson A., Safranek C.W., Huang T., et al. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9 doi: 10.2196/45312. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Ke Y.H., Jin L., Elangovan K., et al. Retrieval augmented generation for 10 large language models and its generalizability in assessing medical fitness. NPJ Digit Med. 2025;8:187. doi: 10.1038/s41746-025-01519-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Atil B., Aykent S., Chittams A., et al. April 2, 2025. Non-Determinism of “Deterministic” LLM Settings. Published online. [DOI] [Google Scholar]
  • 8.Ouyang S., Zhang J.M., Harman M., Wang M. An empirical study of the non-determinism of ChatGPT in code generation. ACM Trans Softw Eng Methodol. 2025;34(2):1–28. doi: 10.1145/3697010. [DOI] [Google Scholar]
  • 9.Ahmed W., Saturno M., Rajjoub R., et al. ChatGPT versus NASS clinical guidelines for degenerative spondylolisthesis: a comparative analysis. Eur Spine J Off Publ Eur Spine Soc Eur Spinal Deform Soc Eur Sect Cerv Spine Res Soc. 2024;33(11):4182–4203. doi: 10.1007/s00586-024-08198-6. [DOI] [PubMed] [Google Scholar]
  • 10.Nietsch K.S., Shrestha N., Mazudie Ndjonko L.C., et al. Can large language models (LLMs) predict the appropriate treatment of acute hip fractures in older adults? Comparing appropriate use criteria with recommendations from ChatGPT. J Am Acad Orthop Surg Glob Res Rev. 2024;8(8) doi: 10.5435/JAAOSGlobal-D-24-00206. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Sandmann S., Riepenhausen S., Plagwitz L., Varghese J. Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks. Nat Commun. 2024;15(1):2050. doi: 10.1038/s41467-024-46411-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Kao H.J., Chien T.W., Wang W.C., Chou W., Chow J.C. Assessing ChatGPT’s capacity for clinical decision support in pediatrics: a comparative study with pediatricians using KIDMAP of Rasch analysis. Medicine (Baltimore) 2023;102(25) doi: 10.1097/MD.0000000000034068. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Lahat A., Sharif K., Zoabi N., et al. Assessing generative pretrained transformers (GPT) in clinical decision-making: comparative analysis of GPT-3.5 and GPT-4. J Med Internet Res. 2024;26 doi: 10.2196/54571. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Lewis P., Perez E., Piktus A., et al. Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS ‘20. Curran Associates Inc; 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks; pp. 9459–9474. [Google Scholar]
  • 15.Liu S., McCoy A.B., Wright A. Improving large language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines. J Am Med Inform Assoc JAMIA. 2025;32(4):605–615. doi: 10.1093/jamia/ocaf008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Gargari O.K., Habibi G. Enhancing medical AI with retrieval-augmented generation: a mini narrative review. Digit Health. 2025;11 doi: 10.1177/20552076251337177. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Perkins G., Anderson N.W., Spies N.C. Retrieval-augmented generation salvages poor performance from large language models in answering microbiology-specific multiple-choice questions. McAdam AJ, ed. J Clin Microbiol. 2025;63(3) doi: 10.1128/jcm.01624-24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Hewitt K.J., Wiest I.C., Carrero Z.I., et al. Large language models as a diagnostic support tool in neuropathology. J Pathol Clin Res. 2024;10(6) doi: 10.1002/2056-4538.70009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Magesh V., Surani F., Dahl M., Suzgun M., Manning C., Ho D. Hallucination-free? Assessing the reliability of leading AI legal research tools. J Empir Leg Stud. April 23, 2025 doi: 10.1111/jels.12413. Published online. [DOI] [Google Scholar]
  • 20.Rajpurkar P., Chen E., Banerjee O., Topol E.J. AI in health and medicine. Nat Med. 2022;28(1):31–38. doi: 10.1038/s41591-021-01614-0. [DOI] [PubMed] [Google Scholar]
  • 21.Graden K.C., Bennett S.A., Delaney S.R., Gill H.E., Willrich M.A.V. A high-level overview of the regulations surrounding a clinical laboratory and upcoming regulatory challenges for laboratory developed tests. Lab Med. 2021;52(4):315–328. doi: 10.1093/labmed/lmaa086. [DOI] [PubMed] [Google Scholar]
  • 22.eCFR:: 42 CFR Part 493 -- Laboratory Requirements. March 1, 2025. https://www.ecfr.gov/current/title-42/chapter-IV/subchapter-G/part-493 Accessed.
  • 23.Accreditation Checklists. College of American Pathologists; April 19, 2025. https://www.cap.org/laboratory-improvement/accreditation/accreditation-checklists Accessed. [Google Scholar]
  • 24.Shean R.C., Genzen J.R., Spies N.C. Large language models lack sufficient performance to provide definitive regulatory guidance. J Appl Lab Med. April 18, 2025 doi: 10.1093/jalm/jfaf047. Published online. jfaf047. [DOI] [PubMed] [Google Scholar]
  • 25.Alir3z4/html2text: Convert HTML to Markdown-formatted text. March 1, 2025. https://github.com/Alir3z4/html2text Accessed.
  • 26.SQLite Home Page. March 1, 2025. https://www.sqlite.org/ Accessed.
  • 27.langchain-text-splitters: 0.3.6 —Image 1 LangChain documentation. March 1, 2025. https://python.langchain.com/api_reference/text_splitters/index.html Accessed.
  • 28.OpenAI Platform. March 1, 2025. https://platform.openai.com Accessed.
  • 29.Qdrant - Vector Database. May 2, 2025. https://qdrant.tech/ Accessed.
  • 30.Introduction | Image 2 LangChain. March 1, 2025. https://python.langchain.com/docs/introduction/ Accessed.
  • 31.Streamlit • A Faster Way to Build and Share Data Apps. March 1, 2025. https://streamlit.io/ Accessed.
  • 32.Lewis P., Perez E., Piktus A., et al. April 12, 2021. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Published online. [DOI] [Google Scholar]
  • 33.Salemi A., Zamani H. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM; 2024. Evaluating retrieval quality in retrieval-augmented generation; pp. 2395–2400. [DOI] [Google Scholar]
  • 34.Barnett S., Kurniawan S., Thudumu S., Brannelly Z., Abdelrazek M. Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI. CAIN ‘24. Association for Computing Machinery; 2024. Seven failure points when engineering a retrieval augmented generation system; pp. 194–199. [DOI] [Google Scholar]
  • 35.Long context | Gemini API. Google AI for Developers. Accessed May 2, 2025. https://ai.google.dev/gemini-api/docs/long-context
  • 36.Liu N.F., Lin K., Hewitt J., et al. Lost in the middle: how language models use long contexts. Trans Assoc Comput Linguist. 2024;12:157–173. doi: 10.1162/tacl_a_00638. [DOI] [Google Scholar]
  • 37.Gao Y., Xiong Y., Gao X., et al. March 27, 2024. Retrieval-Augmented Generation for Large Language Models: A Survey. Published online. [DOI] [Google Scholar]
  • 38.Xiong G., Jin Q., Wang X., Zhang M., Lu Z., Zhang A. Improving retrieval-augmented generation in medicine with iterative follow-up questions. Pac Symp Biocomput Pac Symp Biocomput. 2025;30:199–214. doi: 10.1142/9789819807024_0015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Anisuzzaman D.M., Malins J.G., Friedman P.A., Attia Z.I. Fine-tuning large language models for specialized use cases. Mayo Clin Proc Digit Health. 2025;3(1) doi: 10.1016/j.mcpdig.2024.11.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Goddard K., Roudsari A., Wyatt J.C. Automation bias: a systematic review of frequency, effect mediators, and mitigators. J Am Med Inform Assoc JAMIA. 2011;19(1):121. doi: 10.1136/amiajnl-2011-000089. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Sirajuddin A.M., Osheroff J.A., Sittig D.F., Chuo J., Velasco F., Collins D.A. Implementation pearls from a new guidebook on improving medication use and outcomes with clinical decision support: effective CDS is essential for addressing healthcare performance improvement imperatives. J Healthc Inf Manag. 2009;23(4):38. [PMC free article] [PubMed] [Google Scholar]
  • 42.Jung I.C., Schuler K., Zerlik M., Grummt S., Sedlmayr M., Sedlmayr B. Overview of basic design recommendations for user-centered explanation interfaces for AI-based clinical decision support systems: a scoping review. Digit Health. 2025;11 doi: 10.1177/20552076241308298. 20552076241308298. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary material 1

Supplemental table 1. Regulatory question set with Raven's answers.

mmc1.xlsx (46.2KB, xlsx)

Articles from Journal of Pathology Informatics are provided here courtesy of Elsevier

RESOURCES