AMIA Annual Symposium Proceedings
. 2026 Feb 14;2025:443–450.

Right Patient, Right Specialist, Right Time: Retrieval Augmented Generation for Specialty Referral Routing

Fateme Nateghi Haredasht 1,*, Ethan Goh 1,2,*, Vishnu Ravi 3,*, Pooya Ashtari 4, Yixing Jiang 5, Nodir Yuldashev 1, François Grolleau 1, Robert J Gallo 6,7, Aaryan Shah 5, Evelyn Hur 8, Kanav Chopra 9, Olivia Jee 10, Julie J Lee 10, Leah Rosengaus 11, Lena Giang 11, Kevin Schulman 2,12,13, Jason Hom 7, Arnold Milstein 2, Andrew Y Ng 8, Jonathan H Chen 1,2,13,14
PMCID: PMC12919621  PMID: 41726438

Abstract

We present an embedding-based retrieval system that automatically directs physician clinical questions to the most relevant specialist-curated question template, which is necessary for the specialist to provide a clinically relevant response. The system utilizes MPNet, a transformer-based model, to generate dense vector representations of both clinical queries and 24 predefined clinical templates. Given a clinical question, the system computes cosine similarity between the query and template embeddings to retrieve the most relevant matches. When validated against real-world, retrospective eConsults across five specialties, the system accurately identified the most relevant template in 87% of cases (success@1) and included it in the top three results 99% of the time (success@3). Automating specialty selection and clinical question referrals reduces the administrative burden on physicians, minimizes care delivery delays, and improves specialist responses by providing proper context.

Introduction

Specialty referral management remains a persistent challenge in healthcare systems worldwide. Studies indicate that approximately 7.8% to 22% of referrals are “clinically inappropriate,” equating to an average of 42 mismatched patients per specialist per year and totaling 19.7 million inappropriate referrals annually in the United States1-5. These misdirected referrals can lead to significant treatment delays, adversely affecting patient outcomes6. Such delays can also exacerbate medical conditions, leading to more severe health issues and increased healthcare costs.

Several studies have addressed this issue and proposed solutions to improve the referral process. For instance, Ramelson et al. demonstrated that implementing an enhanced referral management system within an electronic health record (EHR) can significantly improve the efficiency and completion rates of referrals, particularly for in-network referrals7. In pediatric care, O’Dwyer et al. examined the impact of eConsult technology on specialty referral efficiency. Their findings indicated that refining referral management systems and integrating eConsult solutions into existing clinical workflows could significantly improve the timeliness and quality of specialty care, particularly in under-resourced settings8.

Current referral management relies heavily on manually curated pathways and guidelines that quickly become outdated as clinical evidence and institutional practices evolve. A survival analysis of clinical guidelines in the Spanish National Health System found that 92% of recommendations remained valid after one year, decreasing to 77.8% after four years; in other words, approximately 22.1% of recommendations became outdated within four years9.

Early attempts to automate specialty selection relied on regular expressions (regex) to identify key terms in referral text10,11. While regex is effective for recognizing predefined patterns—such as structured medical codes or specific trigger phrases—it lacks the flexibility to understand clinical context, leading to high false-positive rates. Villena et al. evaluated a regex-based system for specialty triage in the Chilean public healthcare system and reported a Mean Average Precision (MAP) score of 0.63 at the subcategory level and 0.83 at the category level10. Natural Language Processing (NLP) models improve upon regular expressions by recognizing entities based on language patterns and context, making them more versatile in identifying data types12,13. Unlike regex-based approaches, which rely on rigid pattern-matching rules, NLP models leverage machine learning techniques to extract meaningful relationships from unstructured text. However, automating specialty selection using NLP presents significant challenges due to the inherent complexity and variability of clinical language, which makes it difficult for NLP models to accurately interpret and categorize information14,15. Traditional classification methods often fail to accurately connect clinical questions with the right medical specialties, leading to misclassifications that delay care and waste resources. Moreover, variations in documentation practices across healthcare institutions hinder the development of standardized NLP solutions for specialty selection, resulting in the limited effectiveness of these approaches in different clinical settings16.

Large Language Models (LLMs), such as GPT-based models, offer a way to overcome the limitations of rule-based and traditional NLP approaches: they are capable of deep semantic understanding and can generate context-aware predictions grounded in large-scale medical corpora17. One promising approach to further improve LLM-driven specialty selection is Retrieval-Augmented Generation (RAG)18,19. RAG combines the language understanding capabilities of LLMs with real-time retrieval of relevant institutional knowledge, allowing models to ground their decisions in up-to-date referral guidelines, specialty descriptions, and clinical best practices.

We hypothesized that an embedding-based retrieval system could more effectively encode subtle connections between medical concepts that determine specialty routing. This work was developed within the SAGE (Specialist AI for Guiding Experts) product at Stanford, which aims to provide specialist-level AI reasoning and actionable recommendations to providers in real-time. This study specifically addresses the foundational challenge of routing clinical questions to the appropriate specialty domain—a critical first step in providing accurate specialist responses to clinical questions.

Methods

Embedding Model

We employed MPNet (Masked and Permuted Pre-training for Language Understanding)20, specifically the all-mpnet-base-v2 variant from the Sentence-Transformers library, to generate dense vector representations of clinical queries and template documents. MPNet is a transformer-based model that integrates the strengths of BERT (Bidirectional Encoder Representations from Transformers)21 and XLNet (Generalized Autoregressive Pretraining for Language Understanding)22 by employing a permutation-based training objective, in which tokens are predicted in an arbitrary order while full position information is preserved. This allows the model to capture both local and global contextual relationships in text. The all-mpnet-base-v2 model was further fine-tuned with a contrastive learning objective on over 1 billion sentence pairs to improve sentence-level similarity estimation. The model produces 768-dimensional embeddings that effectively capture semantic similarities between clinical queries and eConsult templates. Because it was trained on a diverse corpus spanning general-domain and biomedical text, the model supports robust representation learning for clinical language. For embedding generation, we used the Hugging Face Sentence-Transformers implementation, which facilitates efficient vectorization of both clinical questions and predefined eConsult templates. The embedding-based retrieval pipeline is illustrated in Figure 1.

Figure 1.

Overview of the embedding-based retrieval pipeline. Clinical templates are preprocessed by splitting them into chunks before being embedded using the MPNet model and stored in a vector database. Similarly, clinical questions are transformed into vector representations. Cosine similarity is computed between the query embedding and all template embeddings, ranking templates by relevance and retrieving the top matches.

Text Processing and Embedding Generation

Each clinical template was preprocessed by splitting its content into overlapping text chunks, each containing 1,000 tokens with a 50-token overlap. This segmentation preserved contextual continuity across chunks, ensuring that no critical information was lost during the embedding process.
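The chunking step described above can be sketched as follows. This is a minimal illustration: the paper does not specify the tokenizer used for chunking, so whitespace-delimited tokens stand in for model tokens, and the function name is illustrative rather than the production implementation.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most `chunk_size` tokens,
    with `overlap` tokens shared between consecutive chunks.

    Whitespace tokenization stands in for the model tokenizer here.
    """
    tokens = text.split()
    if len(tokens) <= chunk_size:
        return [" ".join(tokens)]
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the final chunk already reaches the end of the text
    return chunks
```

With these parameters, the last 50 tokens of each chunk reappear at the start of the next, preserving contextual continuity across chunk boundaries.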

Each text chunk was then transformed into a 768-dimensional dense vector using MPNet (all-mpnet-base-v2). The MPNet model effectively captured both local and global contextual relationships within the text, enabling robust semantic representation. The resulting embeddings for all template chunks were stored in a vector database for efficient similarity search and retrieval.

Similarly, each clinical question was processed as a single input and embedded into a 768-dimensional vector using the same MPNet model. This ensured that both clinical queries and template embeddings were represented in the same latent space, facilitating accurate similarity comparisons.

Similarity Computation and Retrieval

To determine the most relevant eConsult templates for a given clinical question, we computed the cosine similarity between the query embedding and all template embeddings23,24. Cosine similarity measures the cosine of the angle between two vectors in a high-dimensional space, where a score closer to 1 indicates greater similarity. Since each template was split into multiple chunks, the highest similarity score across all chunks of a given template was used to rank the templates. We selected the top three templates most similar to the clinical query. In this way, the system focuses on the templates most relevant to the specific question asked.
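The max-over-chunks ranking can be illustrated with a minimal sketch. The toy 3-dimensional vectors below stand in for 768-dimensional MPNet embeddings, and `rank_templates` and the dictionary layout are illustrative names, not the production implementation.

```python
import numpy as np

def cosine_similarity(query: np.ndarray, chunks: np.ndarray) -> np.ndarray:
    # Cosine similarity between one query vector and a (n_chunks, dim) matrix.
    query = query / np.linalg.norm(query)
    chunks = chunks / np.linalg.norm(chunks, axis=1, keepdims=True)
    return chunks @ query

def rank_templates(query_vec: np.ndarray,
                   template_chunks: dict[str, np.ndarray],
                   top_k: int = 3) -> list[str]:
    """Score each template by the maximum cosine similarity over its chunks,
    then return the top_k template names in descending order of score."""
    scores = {
        name: float(np.max(cosine_similarity(query_vec, chunks)))
        for name, chunks in template_chunks.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Taking the maximum over a template's chunks means a template is ranked highly if any part of it closely matches the query, which is the behavior described above.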

Knowledge Base Development

The knowledge base comprised 24 specialist-curated templates spanning specialties such as infectious disease, hematology, endocrinology, cardiology, gastroenterology, and neurology, and containing a total of 1,240 clinical decision points. Each template was constructed through a structured Delphi process with a group of specialists at Stanford Medicine. Templates covered both common presentations (e.g., “elevated TSH management”) and specialty-specific scenarios (e.g., “neutropenia evaluation in oncology patients”).

These templates represented the standard referral guidance documents used within Stanford’s eConsult program. Each template was converted to plain text and embedded using the same sentence-transformer model, creating a searchable embedding database.

Technical Implementation

The embedding pipeline and retrieval system were implemented in Python, utilizing the sentence-transformers library for embedding generation, scikit-learn for cosine similarity computations, and LangChain for vector storage and retrieval. All evaluation code, including data handling, embedding generation, and performance computation, was run locally and reproducibly using open-source libraries, including NumPy, pandas, and FAISS for efficient vector operations.

Validation Dataset

For validation, we utilized all available historical eConsult records from Stanford Health Care’s clinical data warehouse that met our inclusion criteria, yielding 434 queries across five specialties (cardiology, urology, gynecology, pulmonology, and nephrology). During evaluation, however, the system could choose from all 24 available eConsult specialty templates (e.g., endocrinology, hematology, rheumatology, psychiatry, and gastroenterology). All validation eConsults were completed between July 2023 and April 2024. This dataset represents the real-world specialty routing of eConsult cases at our institution, providing an authentic test environment for the embedding-based routing system. All data were de-identified using the Safe Harbor method in accordance with National Institute of Standards and Technology (NIST) guidelines. Additionally, clinical text was anonymized using the TiDE algorithm to ensure compliance with privacy regulations25. Table 1 presents the distribution of medical specialties in the dataset, showing the number of cases for each specialty.

Table 1.

Distribution of clinical questions across five specialties in the validation dataset.


Outcome Measures and Analysis

The primary outcome was concordance between the system’s specialty prediction and the actual specialty that answered the clinical question in the retrospective eConsult data set (‘ground truth specialty’). This ground truth reflects real-world routing decisions made by clinicians and specialists within Stanford’s eConsult system. Since related templates like “Pulmonology” and “Interventional Pulmonology” exist, we grouped these together, counting both as correct matches to the same specialty category.

We evaluated system performance using two ranking-based metrics: success@1 and success@3. A prediction was considered a success@1 if the top-ranked template matched the ground-truth specialty assigned to the clinical question. Similarly, success@3 was defined as a correct match appearing within the top three retrieved templates.
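These ranking metrics are straightforward to compute; the sketch below shows one way, with the function name and input layout chosen for illustration (each query contributes a ranked list of retrieved template labels and a single ground-truth label).

```python
def success_at_k(ranked_predictions: list[list[str]],
                 ground_truth: list[str],
                 k: int) -> float:
    """Fraction of queries whose ground-truth specialty appears among the
    top-k retrieved templates."""
    hits = sum(
        truth in preds[:k]
        for preds, truth in zip(ranked_predictions, ground_truth)
    )
    return hits / len(ground_truth)
```

success@1 is simply the special case where only the top-ranked template counts as a hit.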

Results

Overall Performance

As shown in Table 2, the overall success@1 accuracy was 0.87, indicating that in 87% of cases, the highest-ranked template correctly matched the ground-truth specialty. The success@3 accuracy reached 0.99, meaning that in 99% of cases, the correct specialty appeared within the top three ranked templates.

Table 2.

System performance by specialty.


Performance varied across specialties. Cardiology achieved the highest success@1 accuracy at 0.96, while nephrology had the lowest at 0.73. Pulmonology, gynecology, and urology exhibited intermediate accuracy, ranging from 0.79 to 0.86. Even when the system’s top-ranked suggestion was wrong, the correct specialty was still included among the top three suggestions in 99% of cases overall.

For success@1 (87%), the 95% confidence interval (CI) was [83.8%, 90.2%]; for success@3 (99%), it was [97.6%, 99.7%]. These intervals were computed using bootstrap resampling (N = 1,000 resamples). These results demonstrate that the embedding-based retrieval approach is highly effective at routing clinical questions to the correct specialty and eConsult templates.
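A percentile-bootstrap interval of this kind can be sketched as follows. The resample count matches the N = 1,000 reported above; the seed and function name are illustrative, so the exact interval bounds will depend on the resampling.

```python
import numpy as np

def bootstrap_ci(outcomes, n_resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for a proportion.

    outcomes: binary array (1 = correct at rank k, 0 = incorrect).
    Returns (lower, upper) bounds of the (1 - alpha) interval.
    """
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes)
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    stats = np.array([
        rng.choice(outcomes, size=n, replace=True).mean()
        for _ in range(n_resamples)
    ])
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper
```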

Error Analysis

To better understand the system’s limitations, we conducted an error analysis on cases where the correct template did not appear at rank one (i.e., failed success@1). Out of 434 total cases, 57 (13.1%) were misclassified (Figure 2). The most common specific error patterns were nephrology being confused with endocrinology (10 cases, 2.3%) and pulmonology being misclassified as oncology (6 cases, 1.4%). When analyzing overall specialty-level performance, urology had the highest error rate (26.9% of all urology cases), followed by nephrology (21.3%) and gynecology (19.0%).

Figure 2.

Misclassification rates per specialty in success@1 predictions. Urology exhibited the highest misclassification rate (26.9%), followed by nephrology (21.3%) and gynecology (19.0%). The lower misclassification rate for cardiology (4.1%) suggests the model’s robustness in correctly identifying cardiovascular cases.

Discussion and Conclusions

This work demonstrates the feasibility and effectiveness of using semantic similarity methods to automate the routing of eConsults to the most relevant specialty templates. By leveraging dense text embeddings from MPNet, our system was able to match the nuanced content of free-text clinical questions to structured specialty templates without relying on rigid keyword-based rules or manual triage. Using real-world, retrospective specialty referral data from Stanford Health Care, the model achieved a success@1 of 87% and a success@3 of 99% across five specialties, indicating strong performance even when selecting from 24 total templates. With an estimated 19.7 million misdirected referrals occurring annually in the United States1,2, correctly routing clinical questions to appropriate specialties can reduce care delays that directly impact patient outcomes while simultaneously decreasing the administrative burden on healthcare professionals. Looking forward, the near-perfect success@3 accuracy offers an immediate implementation pathway in which the system presents highly relevant options while preserving physician autonomy in final specialty selection. This work contributes to the broader goal of optimizing electronic consultation workflows through AI-driven workflow assistance and represents a key component of the SAGE (Specialist AI for Guiding Experts) product at Stanford.

Further analysis of misclassification revealed distinct error patterns, with nephrology frequently confused with endocrinology (10 cases, 2.3%) and pulmonology misclassified as oncology (6 cases, 1.4%). Specialty-specific error rates were highest in urology (26.9%), nephrology (21.3%), and gynecology (19.0%). These misclassifications likely stem from semantic similarities in clinical language—for example, pulmonology and oncology share common diagnostic features (e.g., lung nodules and respiratory symptoms). Such errors reflect genuine clinical specialty overlap rather than model deficiencies and highlight opportunities to enhance accuracy by incorporating referring clinicians’ notes or developing specialty-specific embedding approaches. To mitigate misclassifications caused by specialty overlap, future iterations of the model could incorporate structured patient metadata (e.g., diagnosis codes, medications) or the full clinical note context. Additionally, using multi-hop retrieval—first identifying the specialty and then matching sub-templates—could help disambiguate cases with semantic proximity. Fine-tuning embedding models on clinical question-answer pairs may also enhance specificity.

While our system uses a standard dense retrieval architecture, its strength lies in its application to a real-world clinical workflow with carefully curated specialty templates. Future work will explore model fine-tuning on institutional clinical Q&A data and the integration of learning-to-rank objectives to further enhance relevance scoring.

Several limitations warrant consideration. First, the validation was conducted at a single academic medical center with specific specialty practice patterns that may not generalize to other settings. Second, the retrospective nature of the evaluation means that historical routing decisions were used as ground truth, but these decisions may themselves contain errors or reflect institutional idiosyncrasies rather than optimal clinical pathways. Additionally, the validation focused exclusively on electronic consultations, which may represent a subset of referral cases with characteristics different from general outpatient referrals or emergency referrals.


References

  • 1. New Report Reveals 19.7 Million Misdirected Physician Referrals in the U.S. Each Year. 2014. https://www.businesswire.com/news/home/20141110005119/en/New-Report-Reveals-19.7-Million-Misdirected-Physician-Referrals-in-the-U.S.-Each-Year
  • 2. 19.7M “Clinically inappropriate” Physician Referrals Occur Each Year. https://hitconsultant.net/2014/11/10/19-7m-clinically-inappropriate-physician-referrals-occur-each-year/
  • 3. Mehrotra A., Forrest C. B., Lin C. Y. Dropping the baton: specialty referrals in the United States. Milbank Q. 2011;89:39–68. doi:10.1111/j.1468-0009.2011.00619.x.
  • 4. Greenwood-Lee J., Jewett L., Woodhouse L., Marshall D. A. A categorisation of problems and solutions to improve patient referrals from primary to specialty care. BMC Health Serv Res. 2018;18:986. doi:10.1186/s12913-018-3745-y.
  • 5. Mariotti G., Meggio A., de Pretis G., Gentilini M. Improving the appropriateness of referrals and waiting times for endoscopic procedures. J Health Serv Res Policy. 2008;13:146–151. doi:10.1258/jhsrp.2008.007170.
  • 6. Puttick H. Revealed: Thousands dying on NHS waiting lists. 2025. https://www.thetimes.com/uk/scotland/article/patients-dying-while-on-waiting-lists-up-200-per-cent-since-2014-n7jgq5tmw
  • 7. Ramelson H., et al. Closing the loop with an enhanced referral management system. J Am Med Inform Assoc. 2018;25:715–721. doi:10.1093/jamia/ocy004.
  • 8. O’Dwyer B., Macaulay K., Murray J., Jaana M. Improving Access to Specialty Pediatric Care: Innovative Referral and eConsult Technology in a Specialized Acute Care Hospital. Telemed J E Health. 2024;30:1306–1316. doi:10.1089/tmj.2023.0444.
  • 9. García L. M., et al. The validity of recommendations from clinical guidelines: a survival analysis. CMAJ. 2014;186:1211–1219. doi:10.1503/cmaj.140547.
  • 10. Villena F., et al. Automatic Coding at Scale: Design and Deployment of a Nationwide System for Normalizing Referrals in the Chilean Public Healthcare System. Proceedings of the 5th Clinical Natural Language Processing Workshop. 2023:335–343. doi:10.18653/v1/2023.clinicalnlp-1.37.
  • 11. Dankovchik J., et al. Identification of Social Risk-Related Referrals in Discrete Primary Care Electronic Health Record Data: Lessons Learned From a Novel Methodology. Health Services Research. :e14443.
  • 12. Villena F., Bravo-Marquez F., Dunstan J. NLP modeling recommendations for restricted data availability in clinical settings. BMC Medical Informatics and Decision Making. 2025;25:116. doi:10.1186/s12911-025-02948-2.
  • 13. Singh A. K. B., Guntu M., Bhimireddy A. R., Gichoya J. W., Purkayastha S. Multi-label natural language processing to identify diagnosis and procedure codes from MIMIC-III inpatient notes. 2020. Preprint.
  • 14. Shelf. Challenges and Considerations in Natural Language Processing. 2024. https://shelf.io/blog/challenges-and-considerations-in-nlp/
  • 15. Leaman R., Khare R., Lu Z. Challenges in Clinical Natural Language Processing for Automated Disorder Normalization. J Biomed Inform. 2015;57:28–37. doi:10.1016/j.jbi.2015.07.010.
  • 16. Carrell D. S., et al. Challenges in adapting existing clinical natural language processing systems to multiple, diverse health care settings. J Am Med Inform Assoc. 2017;24:986–991. doi:10.1093/jamia/ocx039.
  • 17. Singhal K., et al. Large language models encode clinical knowledge. Nature. 2023;620:172–180. doi:10.1038/s41586-023-06291-2.
  • 18. Yang R., et al. Retrieval-augmented generation for generative artificial intelligence in health care. npj Health Syst. 2025;2:1–5.
  • 19. Lopez I., et al. Clinical entity augmented retrieval for clinical information extraction. npj Digit Med. 2025;8:1–11. doi:10.1038/s41746-024-01410-3.
  • 20. Song K., Tan X., Qin T., Lu J., Liu T.-Y. MPNet: Masked and Permuted Pre-training for Language Understanding. 2020. Preprint.
  • 21. Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019. Preprint.
  • 22. Yang Z., et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding. 2020. Preprint.
  • 23. Manning C. D., Raghavan P., Schütze H., editors. Introduction to Information Retrieval. Cambridge University Press; 2008. Evaluation in Information Retrieval.
  • 24. Huang P.-S., et al. Learning deep structured semantic models for web search using clickthrough data. Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. New York, NY, USA: Association for Computing Machinery; 2013:2333–2338. doi:10.1145/2505515.2505665.
  • 25. Datta S., et al. A new paradigm for accelerating clinical data science at Stanford Medicine. 2020. Preprint.

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association
