Graphical Abstract
Keywords: Digital twins, Large language models, Artificial intelligence, Provider-patient communication, Conversational agents
Abstract
Digital twins have emerged as a paradigm in precision and personalized medicine, enabling data-driven modeling of individuals to support tailored interventions. While most existing work focuses on patient-oriented twins, little attention has been given to modeling the provider’s role, particularly in clinical communication. In this study, we present GRACE (Generalized RAG-Enhanced Conversation Framework), a framework for constructing a provider digital twin (ProDT) that emulates key aspects of clinicians’ communicative and cognitive behavior. GRACE integrates three modules: a physician-informed dialog script generation and optimization module for provider-patterned conversation, a Retrieval-Augmented Generation (RAG) pipeline for factual grounding and timely knowledge updating, and an LLM-based conversational interface that enables interactive, context-aware exchanges. Using HPV vaccination counseling as a representative use case, GRACE was evaluated with HealthBench and a structured user study involving clinician feedback. The results demonstrate its feasibility, trustworthiness, and adaptability for proactive provider–patient communication, marking a conceptual step toward safe, scalable, and cognitively grounded digital twins in healthcare.
Highlights
- We model ProDT communication as dual mirroring: aligning what it says with clinical guidance and how it speaks with clinician-informed behavior.
- We propose GRACE, a model-agnostic RAG framework combining clinical knowledge grounding with physician-informed dialog for auditable care.
- An HPV vaccination counseling case study shows extensibility, with HealthBench and user feedback confirming improved accuracy and trust.
1. Introduction
Digital twins (DTs) [1], [2] are virtual representations of physical entities, processes, or systems that mirror their real-world counterparts through data integration, simulation, and machine learning. Originally adopted in engineering and manufacturing [3], [4], DTs have since expanded into healthcare [5], [6], smart cities [7], [8], and agriculture [9], [10]. In healthcare, DTs hold substantial promise: by continuously integrating patient-specific data, from genomic and physiological measurements to lifestyle and environmental factors, they can support highly personalized interventions [11], [12], [13].
Most healthcare DT implementations to date have centered on patients, leveraging IoT devices to collect health data [14], [15], [16], while comparatively little attention has been paid to provider digital twins (ProDTs). In contrast to patient DTs, a ProDT is grounded in the clinician’s professional competencies, domain knowledge, skills, and communicative behavior, and must act on behalf of the human clinician rather than replay static scripts. Although “provider” can encompass both human and organizational entities [11], this work focuses on the human clinician whose responsibilities span a broad spectrum, from communication and preventive counseling to diagnosis, treatment planning, and procedures. Realizing such a ProDT requires a digital embodiment that can maintain a robust coupling to its human counterpart and operate safely at scale.
Provider–patient communication [17], [18], [19], [20] is both central to clinical care and well-suited to a DT formulation. DTs typically comprise three components [11], [21], [22]: (i) a physical entity (the clinician), (ii) its digital counterpart, and (iii) a bidirectional linkage that synchronizes the two. For ProDTs, achieving this linkage in communication entails dual mirroring: a knowledge mirror that ensures what the ProDT says is aligned with up-to-date clinical guidance, and a behavior mirror that ensures how it communicates is aligned with clinician-informed conversational structure and tone. The ProDT must also adapt dialog patterns to heterogeneous patient needs without sacrificing safety, fidelity, or auditability. Motivated by these considerations, this paper focuses on the communication layer of a provider DT.
Recent advances in artificial intelligence, particularly large language models (LLMs) [23], [24], create new opportunities in this domain [25], [26]. While chat interfaces are now the predominant mode of human–LLM interaction [27], [28], [29], [30], a chatbot alone does not constitute a ProDT. Two coupled obstacles hinder the bidirectional linkage between the digital and physical provider: (i) knowledge currency, because base models have static training cutoffs and therefore require mechanisms to remain aligned with current guidance; and (ii) provider-patterned communication, because generic LLM dialogues are not explicitly constrained to clinician-informed structure or tone.
To address the first obstacle, knowledge currency, we adopt Retrieval-Augmented Generation (RAG) [31], [32], [33]. Rather than relying solely on static parameters, RAG retrieves relevant documents from an external, refreshable knowledge base and conditions generation on these sources. This enables temporal fidelity as recommendations evolve and provides verifiable provenance for audit and clinician oversight, reducing unsupported statements in domains where high-quality data are scarcer than general web corpora [34], [35], [36]. Several frameworks expose RAG workflows, including RAGFlow [37], Haystack [38], and LangChain [39]. However, their interactions are largely user-driven and assume that users can anticipate and articulate their information needs, an assumption often violated in healthcare, where uncertainty, low health literacy, or complex risk–benefit trade-offs are common.
To address the second obstacle, provider-patterned communication, we introduce a physician-informed dialog script generation and optimization module that encodes clinician-style conversational structure and adapts to user responses, supporting mixed-initiative and dynamic exchanges. Accurate information is necessary but insufficient in scenarios such as HPV vaccination counseling; natural, supportive, and structured interaction is also required.
Bringing these elements together, we propose GRACE (Generalized RAG-Enhanced Conversation Framework), a model-agnostic framework for communication-oriented ProDTs. GRACE comprises three components: (i) a physician-informed medical dialog script generation and optimization module that captures provider-style conversation flows, (ii) an interactive conversation interface for provider-patient communication, and (iii) a RAG pipeline that constructs and queries a domain-specific knowledge base to ensure factual grounding and continuous knowledge updating in synchrony with the provider’s evolving expertise. Together, these components allow the ProDT to mirror the clinician’s evolving expertise while delivering provider-like communication at scale.
To demonstrate GRACE, we apply it to HPV vaccination counseling as a representative provider–patient communication task. The system is evaluated using OpenAI’s HealthBench [40], indicating more accurate, trustworthy, and contextually informative responses relative to baseline LLM outputs. We also conduct a structured user feedback study with 21 participants to assess usability, clarity, and perceived trustworthiness; the analyses suggest consistently high satisfaction across user groups. In addition, illustrative RAG-enhanced examples show GRACE’s capacity to incorporate up-to-date medical knowledge and align responses with current clinical guidance. Although we focus on HPV counseling, the framework is applicable to other communication-centric tasks (e.g., preventive counseling and chronic disease follow-up), highlighting GRACE as a practical design for ProDTs that deliver reliable, proactive, and temporally grounded communication in healthcare settings.
2. Background and related work
2.1. Large language model (LLM)
LLMs have emerged as the cornerstone of modern natural language processing, enabling systems to generate fluent, coherent, and context-aware text across a wide range of tasks. These models are primarily built upon the Transformer architecture [41], which introduced the self-attention mechanism as a core innovation. The Transformer allows the model to weigh and aggregate information from all positions in the input sequence simultaneously, enabling efficient long-range dependency modeling and parallel computation. The self-attention mechanism computes contextual embeddings by assigning dynamic weights to each token relative to others, capturing semantic relevance and syntactic structure. Through this mechanism, LLMs are able to understand and generate natural language responses.
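The scaled dot-product self-attention described above can be sketched in a few lines of NumPy. The dimensions and random projection matrices here are purely illustrative, not those of any production LLM:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one sequence.

    X: (seq_len, d_model) token embeddings.
    Wq, Wk, Wv: (d_model, d_k) projection matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # token-to-token relevance
    weights = softmax(scores, axis=-1)       # each row is a distribution over tokens
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
```

Because every token attends to every other token in one matrix product, long-range dependencies are modeled without recurrence, which is what enables the parallel computation noted above.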
Despite their impressive generative capabilities, LLMs suffer from a critical limitation: they lack an inherent mechanism to verify the factual accuracy of their outputs. These models rely solely on statistical patterns learned from large-scale training data, and when encountering underrepresented or ambiguous topics, they may generate fabricated or inaccurate content, which is a phenomenon commonly referred to as hallucination. In high-stakes domains such as healthcare, such hallucinations can result in misleading or even harmful information, such as incorrect treatment suggestions. This limitation underscores the risk of relying exclusively on an LLM’s internal parameters for knowledge retrieval. To address this challenge, our proposed GRACE framework incorporates a RAG pipeline that grounds LLM outputs in authoritative external knowledge sources.
2.2. Digital twins (DTs)
The concept of DTs originated in the engineering and manufacturing domains, where virtual models are dynamically linked to their physical counterparts to enable real-time monitoring and simulation [1], [2]. In recent years, this paradigm has been extended to the biomedical field, giving rise to the notion of Human Digital Twins (HDTs) [15], [16]. Most existing HDT implementations are patient-centric for personalized medicine [14], [15], [16]. These systems rely on real-time physiological data to simulate and predict patient conditions, providing actionable insights for healthcare decision-making.
Despite the growing interest in HDTs, current efforts predominantly neglect the role of healthcare providers in the patient-care loop. However, in scenarios such as clinical counseling, the provider plays a central role in shaping patient outcomes [19], [20]. The concept of a ProDT seeks to address this gap by modeling the provider as an intelligent digital agent capable of delivering interactive and accurate support. LLMs have enabled chatbots to emulate natural provider-patient dialog. However, key challenges remain, including the mitigation of hallucinations in LLM outputs and the design of ProDT-driven dialog flows that can proactively guide patients through complex health topics. Our work addresses these gaps by introducing GRACE, a RAG-enhanced ProDT framework specifically tailored for provider-patient communication scenarios.
2.3. Retrieval augmented generation (RAG)
RAG [31], [32], [33] has emerged as a promising approach for improving the factual accuracy and domain relevance of LLM outputs. Rather than relying solely on pre-trained parameters, RAG augments LLMs with an external retrieval module that dynamically fetches relevant documents from a curated knowledge base. This hybrid architecture enables grounded, up-to-date content generation, which is particularly critical in high-stakes domains such as healthcare [42].
3. Material and methods
3.1. System overview
We present GRACE, a model-agnostic framework that instantiates a ProDT for healthcare communication. GRACE adopts a dual-mirroring design: a behavior mirror that governs how the ProDT communicates through clinician-informed dialog scripts, and a knowledge mirror that constrains what the ProDT conveys through RAG. In addition, the physical provider can learn from conversation histories and user feedback to refine both clinical knowledge and communication behavior, enabling the ProDT to evolve in parallel and continuously improve its interaction quality. The overall architecture is illustrated in Fig. 1.
Fig. 1.
Schematic diagram of GRACE, a provider digital twin framework for provider–patient communication.
Before interactive sessions, GRACE converts clinician-authored drafts into structured dialog scripts through a physician-informed script generation and optimization module (details in Section 3.2). During interaction, an LLM-powered agent runs a finite-state workflow with two alternating modes: (i) script delivery, in which the Conversation Script Navigator presents the script step-by-step; and (ii) interactive Q&A, entered whenever the user asks a question, where a RAG module retrieves evidence and generates source-attributed answers. A conversation memory maintains dialog state and user context (e.g., demographics, health literacy cues) to personalize tone and content. When the inquiry is resolved, control returns to script delivery until the session concludes.
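The two-mode finite-state workflow can be illustrated with a minimal controller. Class and parameter names (e.g., `ConversationController`, `answer_fn`) are hypothetical stand-ins for GRACE's Conversation Script Navigator and RAG module, and the question heuristic is deliberately simplistic:

```python
from enum import Enum, auto

class Mode(Enum):
    SCRIPT = auto()  # script delivery: present the script step-by-step
    QA = auto()      # interactive Q&A: retrieve evidence, answer, then resume

class ConversationController:
    """Minimal two-state workflow: deliver script nodes, detour to Q&A on questions."""

    def __init__(self, script_nodes, answer_fn):
        self.script = list(script_nodes)
        self.answer_fn = answer_fn   # hypothetical hook into a RAG pipeline
        self.mode = Mode.SCRIPT
        self.pos = 0
        self.memory = []             # conversation memory: (speaker, text) pairs

    def step(self, user_utterance=None):
        # A question (here, anything ending in '?') triggers the Q&A state.
        if user_utterance and user_utterance.strip().endswith("?"):
            self.mode = Mode.QA
            reply = self.answer_fn(user_utterance, self.memory)
            self.memory.append(("user", user_utterance))
            self.memory.append(("prodt", reply))
            self.mode = Mode.SCRIPT  # inquiry resolved: return to script delivery
            return reply
        if self.pos < len(self.script):
            node = self.script[self.pos]
            self.pos += 1
            self.memory.append(("prodt", node))
            return node
        return None                  # session concludes

controller = ConversationController(
    ["Welcome! Today we will talk about the HPV vaccine.",
     "The vaccine prevents several cancers."],
    answer_fn=lambda q, mem: f"[grounded answer with citation] re: {q}",
)
```

In the real system the Q&A branch would call the RAG pipeline of Section 3.4 and the memory would also hold user attributes for personalization.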
To keep responses factual and current, GRACE integrates a domain knowledge base queried by the RAG pipeline (Section 3.4). Finally, the system is deployed via Streamlit [43] for a lightweight, clinician-friendly interface to run ProDT sessions. We demonstrate the framework on HPV vaccination counseling as a representative, information-dense use case and discuss generalization to other settings in Section 5.
3.2. Physician-informed dialogue script generation and optimization
The Physician-Informed Dialog Script Generation and Optimization Module operationalizes the behavior mirror. It ingests preliminary medical dialog resources and produces machine-readable scripts composed of hierarchical nodes (speaker, intent, transitions), enabling consistent, auditable delivery by the ProDT. Quality control combines automatic checks (linguistic coherence, structural completeness) with domain review loops, followed by iterative edits to improve clarity, empathy, and fidelity to clinical intent. Fig. 2 summarizes the script guidelines co-designed with Mayo Clinic physicians.
Fig. 2.
Structured script guideline (nine modules) used to standardize and generate dialogues within GRACE.
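A hierarchical script node and the automatic structural checks might look as follows. The field names and check logic are an illustrative sketch, not GRACE's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class ScriptNode:
    node_id: str
    speaker: str    # e.g., "provider"
    intent: str     # e.g., "greeting", "risk_explanation"
    text: str
    transitions: list = field(default_factory=list)  # ids of possible next nodes

def check_structural_completeness(nodes):
    """Flag dangling transitions and unreachable nodes (simple automatic checks)."""
    ids = {n.node_id for n in nodes}
    by_id = {n.node_id: n for n in nodes}
    dangling = [(n.node_id, t) for n in nodes for t in n.transitions if t not in ids]
    # Breadth-first reachability from the first (entry) node.
    reachable = {nodes[0].node_id} if nodes else set()
    frontier = list(reachable)
    while frontier:
        for t in by_id[frontier.pop()].transitions:
            if t in by_id and t not in reachable:
                reachable.add(t)
                frontier.append(t)
    unreachable = ids - reachable
    return dangling, unreachable
```

Checks like these catch broken flows before domain review, leaving clinicians to focus on clarity, empathy, and clinical intent.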
To reduce the authoring burden, GRACE provides a Script Generation Interface in Streamlit. Clinicians upload initial materials; the system then detects missing sections against the guideline, proposes completions, and outputs a standardized script ready for deployment in the ProDT.
3.3. LLM-powered conversational agent and user interface layer
The conversational agent orchestrates scripted delivery and evidence-grounded Q&A (Fig. 3(b)). In the script-delivery state, the agent advances through nodes and periodically inserts brief teach-back prompts. When the user poses a question or signals uncertainty, the workflow transitions to interactive Q&A, where the RAG module retrieves relevant passages and the generator produces provenance-backed answers. A conversation memory tracks dialog history and optional user attributes to tailor explanations (e.g., plain-language summaries vs. evidence details). After the question is resolved, control returns to script delivery until completion.
Fig. 3.
GRACE interfaces for script authoring (a) and ProDT interaction (b).
3.4. Knowledge base and retrieval-augmented generation
The knowledge mirror is implemented as a RAG pipeline that injects authoritative context at inference time, avoiding fine-tuning while preserving provenance.
Sources and refresh. The knowledge base contains clinical guidelines, peer-reviewed literature, institutional protocols, and vetted web resources. It is refreshable so newly curated documents become immediately retrievable. Each passage stores metadata (source, timestamp) to support auditing.
Implementation. Documents are chunked into 800–1000 tokens with a 100-token overlap and embedded with all-mpnet-base-v2 [44]. Embeddings are stored in ChromaDB and queried via cosine similarity; by default the top-ranked passages are retrieved. At runtime, the user query plus retrieved context are passed to the generator (GPT-4o in our implementation) to produce responses with explicit citations. This division of labor (scripts for how to communicate, RAG for what to say) yields provider-like interactions that are both consistent and verifiable.
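The chunk-and-retrieve flow can be sketched as follows. For illustration, a toy bag-of-words embedding stands in for all-mpnet-base-v2 and plain cosine ranking stands in for ChromaDB; the chunking parameters mirror the 800–1000-token window with 100-token overlap:

```python
import math
from collections import Counter

def chunk(tokens, size=900, overlap=100):
    """Split a token list into overlapping chunks (paper setting: 800-1000 / 100)."""
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

def embed(text):
    # Toy bag-of-words vector; the real pipeline uses all-mpnet-base-v2 embeddings.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, passages, k=3):
    """Return the top-k passages by cosine similarity, keeping their metadata."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p["text"])), reverse=True)
    return ranked[:k]
```

Because each stored passage carries its metadata through retrieval, the generator can attach explicit source citations to every answer.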
4. Results and discussion
4.1. Implementation details and HPV vaccination use case
We use HPV vaccination education as a use case to illustrate the implementation and application of the GRACE system. GRACE is implemented in Python 3.12.9 with approximately 2700 lines of code. The conversational backbone is GPT-4o accessed via LangChain (v0.3.21), and the user interface is built with Streamlit.
For the HPV-specific RAG corpus, we curated relevant documents (HTML, PDF, JSON) and normalized them to JSON with essential metadata (e.g., source, timestamp) to support consistent ingestion and provenance. This standardized preprocessing enables reliable retrieval without modifying model parameters.
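The normalization step can be sketched as a function that wraps raw text in a JSON record carrying provenance metadata. The field names (e.g., `doc_id`, `retrieved_at`) are illustrative, not the exact schema used in GRACE:

```python
import hashlib
import json
from datetime import datetime, timezone

def normalize_document(raw_text, source, fmt):
    """Normalize a curated document into a JSON record with provenance metadata."""
    return {
        # Content hash as a stable, deduplicating identifier (hypothetical choice).
        "doc_id": hashlib.sha256(raw_text.encode()).hexdigest()[:12],
        "text": raw_text.strip(),
        "source": source,       # e.g., a guideline URL, for citation and audit
        "format": fmt,          # original format: html / pdf / json
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }

record = normalize_document(
    "HPV vaccination is recommended ...", "https://example.org/guideline", "html"
)
serialized = json.dumps(record)
```

Uniform records like this let the ingestion pipeline treat HTML, PDF, and JSON sources identically while preserving the source and timestamp needed for auditing.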
Because Streamlit reruns the script on each user interaction, we persist critical session data (dialog script, optional demographics, and dialog history) using Streamlit’s session state to avoid loss of context and maintain continuity across turns.
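The rerun-persistence pattern can be illustrated without Streamlit itself: below, a plain dict stands in for `st.session_state`, and each call to `run_app` plays the role of one script rerun. Key names are hypothetical:

```python
def run_app(session_state, user_turn=None):
    """One Streamlit-style rerun: locals vanish on every call, session_state persists.

    In the real app this body is the Streamlit script and `session_state` is
    `st.session_state`; a plain dict stands in here so the pattern is testable.
    """
    # Initialize persisted keys exactly once, analogous to the common
    # `if key not in st.session_state: ...` guard.
    session_state.setdefault("dialog_history", [])
    session_state.setdefault("script_position", 0)

    if user_turn is not None:
        session_state["dialog_history"].append(user_turn)
        session_state["script_position"] += 1

    return session_state["script_position"]

session = {}                  # survives across reruns, like st.session_state
run_app(session)              # initial render
run_app(session, "Hi, I have a question about HPV.")
run_app(session, "Is it safe for adults?")
```

Without this guard-and-persist pattern, every widget interaction would reset the dialog script position and erase the conversation history.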
Fig. 3(a) and (b) shows the workflow of GRACE in the HPV use case. Users can optionally provide demographic information through the sidebar; if it is not provided, GRACE delivers generalized content. The agent follows the clinician-informed script and switches to evidence-grounded question answering whenever the user raises a question. It cites relevant sources from the HPV corpus, resolves the inquiry, and then returns to the scripted dialog until completion. The same implementation can be readily applied to other health topics by replacing the dialog scripts and knowledge sources.
4.2. Evaluation on dual mirroring of GRACE
Provider–patient communication inherently involves both what is said and how it is said. Accordingly, GRACE’s ProDT design aims to achieve dual mirroring: a behavior mirror, which aligns the conversational flow and tone with clinician-informed dialog structures, and a knowledge mirror, which aligns the factual content with up-to-date and verifiable medical knowledge. This section evaluates these two complementary mirroring components through quantitative and qualitative experiments.
4.2.1. Enhancing patient-centered dialogue through script generation (behavior mirror)
To evaluate GRACE’s behavior mirroring capability, its ability to reproduce clinician-informed conversational structures and tones, we adopted the HealthBench framework [40], which provides a comprehensive benchmark grounded in expert physician consensus. HealthBench defines realistic clinical scenarios and offers standardized criteria for assessing the performance of LLMs in healthcare contexts. In addition, we performed a separate analysis to identify the specific contributions of GRACE’s healthcare-oriented dialog generation module to the overall response quality.
HealthBench establishes that an LLM can serve as a reliable grader for medical dialog evaluation. Accordingly, GPT-4o was employed as an automated evaluator in our experiments. Five evaluation metrics were used, including three directly adapted from HealthBench and two customized for this study. The detailed scoring rubrics are provided in Appendix A. Each dimension was rated on a 1–5 scale, with higher scores indicating more satisfactory performance.
1. Accuracy: The response should convey medically valid information. When the evidence base is limited, appropriate acknowledgment of uncertainty is expected.
2. Comprehensiveness: The response should contextualize the topic by covering causes, consequences, and relevant next steps to help users grasp the broader implications.
3. Completeness: The response should include all essential information required for safe and effective understanding. Omission of critical details may result in harm or confusion.
4. Communication: The response should be organized, concise, and phrased using terminology appropriate for the intended audience.
5. Dialog Facilitation: The response should guide the user constructively through the conversation and promote engagement or informed decision-making.
This experiment evaluates the quality of healthcare dialog scripts for provider–patient communication. To create baseline scripts, we used Gemini-2 [28], a commercial large language model, to generate conversation scripts for the HPV vaccination scenario. The baseline output was organized into seven thematic sections: Introduction, Mechanism of Action, Target Audience and Timing, Addressing Common Concerns and Misconceptions, Benefits Beyond Cancer, Call to Action and Resources, and Open Q&A. Although this structure is relatively comprehensive, it lacks several elements critical for building patient trust, most notably, the inclusion of real-world narratives and the seamless integration of dynamic question-and-answer exchanges within the dialog flow.
In contrast, our proposed script generator introduces these features to emulate clinician-style communication. It incorporates authentic patient stories and supports context-aware question handling throughout the conversation, allowing users to interject or seek clarification naturally. This design better reflects the practical needs of healthcare communication, where patient inquiries may arise at any point and require immediate, empathetic, and contextually appropriate responses.
To evaluate the effectiveness of the proposed script generator, Gemini-2 outputs were refined using GRACE’s physician-informed dialog framework, and both the original and enhanced versions were evaluated by GPT-4o under identical scoring rubrics. To mitigate potential evaluator bias stemming from GPT-4o serving as both the generator and the grader, Gemini-2 was selected as the backend LLM for script construction.
The comparison results are summarized in Table 1. Overall, Gemini-2 exhibited a fundamental ability to convey medically accurate information; however, its general-purpose design constrained its capacity to address diverse patient concerns and sustain a coherent conversational trajectory. In contrast, GRACE-enhanced scripts demonstrated consistent improvements across all five evaluation dimensions, particularly in Completeness and Dialog Facilitation, confirming that domain-specific, behavior-mirroring mechanisms significantly elevate the quality of healthcare dialog generation.
Table 1.
Evaluation summary of dialog script quality before and after refinement by GRACE (behavior mirror).
| Metric | Original (Gemini-2) | Improved by GRACE |
|---|---|---|
| Accuracy | 4.0 | 4.5 |
| Comprehensiveness | 3.5 | 4.0 |
| Completeness | 3.5 | 4.5 |
| Communication | 4.0 | 4.5 |
| Dialog Facilitation | 3.0 | 4.0 |
4.2.2. Knowledge mirroring via retrieval-augmented generation (RAG)
To further evaluate GRACE’s knowledge mirroring capability, its ability to acquire and maintain up-to-date medical knowledge through external retrieval, we conducted experiments using LLaMA-3 8B [45]. Two representative questions concerning HPV vaccination were selected to assess whether the inclusion of external retrieval improves the factual accuracy, timeliness, and traceability of the model’s responses.
This experiment demonstrates how RAG influences responses to queries that depend on temporal fidelity and verifiable sourcing, two essential attributes for a ProDT intended to reflect the evolving knowledge of real-world clinicians. Because LLMs inherently possess a static knowledge cutoff, our objective here is not to perform an exhaustive benchmark across a broad question set, but rather to illustrate the mechanism by which RAG enables GRACE to surface up-to-date medical guidance and cite authoritative, verifiable sources.
Specifically, we examined two patient-like questions requiring recent medical information: (1) adult catch-up HPV vaccination, and (2) updates in HPV vaccination guidelines. Each question was posed to the same base model with and without RAG. The comparison highlights two intended benefits of RAG for a ProDT: (i) replacing outdated or generic statements with current, evidence-based recommendations framed in the context of shared decision-making, and (ii) incorporating traceable references that support auditability and clinical transparency.
For the first question, the plain model cited legacy vaccine options (e.g., Cervarix), whereas the RAG-enhanced response aligned with the latest recommendations for adults over 27 years old and explicitly framed the discussion as a shared clinical decision between patient and provider. For the second, which was explicitly time-sensitive, the plain output reflected a historical version of the guideline, while the RAG-enhanced response incorporated recent adoption trends, together with a verifiable citation.
These examples serve as illustrative case studies demonstrating both temporal grounding and source attribution, core features of GRACE’s knowledge mirroring process that emulate how clinicians continually update their medical knowledge. While large-scale automated benchmarking remains beyond the present study’s scope, these case analyses provide mechanism-level evidence that RAG enhances factual alignment, temporal relevance, and traceability within GRACE’s responses. To ensure transparency and reproducibility, the retrieval pipeline, document selection criteria, and complete prompts used in these examples are included in the Supplementary Materials.
4.3. Analysis of user feedback on GRACE
To complement the quantitative evaluation of GRACE’s dual mirroring capabilities, we conducted a user study to obtain direct feedback from participants and to assess the overall experience of interacting with GRACE as a provider’s digital twin. The goal was to capture perceived usability, trustworthiness, and educational impact through a structured post-interaction survey.
We conducted an assessment involving twenty-one participants (12 females and 9 males; 11 with bachelor’s degrees and 10 with postgraduate degrees) using a structured feedback survey designed to evaluate perceptions of the GRACE chatbot across two key dimensions. The survey comprised six 1–5 Likert items grouped as: (1) General User Experience for Usability, covering overall satisfaction, ease of use, and the comfort or trust conveyed by GRACE; and (2) Scripted Flow and Knowledge Gain, assessing whether the conversation introduced new or helpful information, whether the guided dialog flow was effective, and whether it influenced health-related intentions.
Responses were scored so that higher values indicated greater satisfaction or perceived quality. Average scores across all participants are summarized in Table 2. As shown, GRACE received consistently high ratings across the user-experience spectrum. Usability (mean = 4.67) and vaccination intention (mean = 4.62) achieved the highest averages, followed by comfort/trust (4.29) and guided learning (4.24), suggesting a positive and engaging overall experience.
Table 2.
User feedback evaluation summary for GRACE.
| Evaluation category | Metric | Average score (mean ± SD, 1–5 scale) |
|---|---|---|
| General User Experience for Usability | Overall Satisfaction | 4.29 ± 0.96 |
| | Usage Easiness | 4.67 ± 0.48 |
| | Comfort and Trustworthiness | 4.29 ± 0.96 |
| Scripted Flow and Knowledge Gain | New or Helpful Information | 4.00 ± 1.34 |
| | Guided Learning Effectiveness | 4.24 ± 1.00 |
| | Likelihood to Recommend/Get Vaccine | 4.62 ± 0.50 |
To further assess robustness, we performed multiple quantitative analyses. Descriptive statistics revealed a pronounced ceiling effect: over 80% of ratings were 4 or higher, and several items (e.g., ease of use and recommendation likelihood) reached 100% at or above this threshold (Fig. 4). Internal consistency across all six rubrics was moderate (Cronbach’s α), indicating that the items capture complementary aspects rather than a single latent construct. Inter-item Spearman correlations were modest and partly mixed (Supplementary Table B.6).
Fig. 4.
Distribution of composite scores. A clear ceiling effect was observed, with most ratings at 4 or above.
We next examined potential subgroup effects using nonparametric tests with effect sizes. Mann–Whitney tests with Cliff’s δ effect sizes found no significant differences by gender or by education level (Bachelor vs. Postgraduate). These results suggest consistent user perceptions across demographic groups (Figs. 5(a)–(b); Supplementary Tables B.7 and B.8). Interestingly, female participants tended to show slightly higher engagement and awareness regarding HPV vaccination, which may explain small perceptual variations across items. However, internal consistency was comparable across genders (Table B.3). In contrast, postgraduate participants exhibited greater variance in their ratings, likely reflecting a more critical appraisal of GRACE’s responses given their higher medical or scientific literacy. This interpretation is supported by higher internal consistency among postgraduate participants compared to bachelor-level participants (Table B.4).
Fig. 5.
Composite score comparisons by gender and degree.
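Cliff’s δ, the effect size reported above, is straightforward to compute directly. The ratings below are illustrative, not the study’s raw data:

```python
def cliffs_delta(group_a, group_b):
    """Cliff's delta: P(a > b) - P(a < b) over all cross-group pairs.

    Ranges from -1 to 1; |delta| < 0.147 is conventionally read as negligible.
    """
    gt = sum(1 for a in group_a for b in group_b if a > b)
    lt = sum(1 for a in group_a for b in group_b if a < b)
    n = len(group_a) * len(group_b)
    return (gt - lt) / n

# Illustrative 1-5 composite ratings for two hypothetical subgroups.
female = [5, 4, 5, 4, 4, 5]
male = [4, 4, 5, 4, 5]
delta = cliffs_delta(female, male)
```

Being rank-based, δ pairs naturally with the Mann–Whitney test on ordinal Likert data, where mean differences can mislead.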
Subgroup reliability analysis showed Cronbach’s α values of 0.43 (female), 0.42 (male), 0.21 (bachelor), and 0.63 (postgraduate). The relatively low coefficients, particularly in smaller subgroups, can be attributed to the pronounced ceiling effect (limited response variance) and the complementary nature of the six rubrics, which were designed to capture distinct facets of user experience rather than a single latent construct. In such multidimensional perception measures, lower internal consistency does not necessarily indicate unreliability but reflects the diversity of evaluation dimensions (e.g., usability, trust, and learning).
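Cronbach’s α can be computed from a raters × items score matrix. The example data below are illustrative, not the study’s responses:

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha for a raters-by-items score matrix (list of row lists)."""
    k = len(item_scores[0])  # number of items

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance

    item_vars = [var([row[j] for row in item_scores]) for j in range(k)]
    total_var = var([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Illustrative 1-5 ratings: 4 raters x 3 items.
scores = [
    [4, 5, 4],
    [5, 5, 5],
    [3, 4, 4],
    [4, 4, 5],
]
alpha = cronbach_alpha(scores)
```

When items deliberately probe distinct facets (usability, trust, learning), inter-item covariance shrinks and α drops, which is why the moderate values above need not signal unreliability.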
Given the small sample size per group, these results should be interpreted as preliminary. A rater-level summary (composite z-scores and item profiles) is provided in Supplementary Table B.5 to facilitate inspection of potential rater leniency or severity; no extreme outliers were apparent.
Overall, participants rated GRACE as highly usable, trustworthy, and educationally supportive. The observed ceiling effect highlights broad user satisfaction, while reliability and subgroup analyses clarify that the six rubrics capture distinct yet complementary facets of user experience: usability, communication tone, and learning impact. Together, these results provide qualitative and quantitative evidence supporting GRACE’s effectiveness as a provider digital twin designed to enhance health communication and knowledge dissemination.
5. Limitations and future work
Despite the promising results, this study has several limitations. First, the current implementation of GRACE demonstrates only the communication functionality of a provider digital twin through the HPV vaccination counseling use case. In real-world clinical practice, medical providers perform a much broader range of tasks, ranging from patient communication and preventive counseling to diagnosis, treatment planning, and even procedural or surgical operations. Developing a full-scale provider digital twin therefore requires establishing a bidirectional mirroring mechanism between the digital representation and the real clinician, enabling continuous synchronization of knowledge, reasoning, and decision-making behavior.
The present work primarily aims to design and validate a framework that demonstrates the feasibility of constructing such a digital twin for communication-oriented tasks. By integrating physician-informed dialog scripts with RAG, we developed a functional prototype capable of emulating provider–patient interactions in educational and counseling contexts. Because the dialog-script component is adaptable and customizable, the framework can be readily extended to other healthcare scenarios where providers are required to convey medical information, respond to patient concerns, or guide shared decision-making.
GRACE is not an FDA-cleared medical device and is not intended for standalone diagnostic or therapeutic decision-making. Its deployment requires institutional governance, human oversight, and adherence to HIPAA and relevant local privacy regulations. Future work will also explore integration within HIPAA-compliant infrastructures and alignment with emerging FDA guidance on AI-enabled clinical decision support.
The realism of GRACE can be enhanced through multimodal digital avatars and by conducting larger-scale evaluations involving participants with diverse health literacy levels as well as clinician reviewers. We further aim to extend the framework to additional medical domains, such as chronic disease management, to assess its generalizability and broader impact.
6. Conclusion
This work presents GRACE, a model-agnostic framework for a communication-oriented provider digital twin (ProDT). GRACE operationalizes dual mirroring: a behavior mirror realized by physician-informed dialog scripts and a mixed-initiative conversation agent, and a knowledge mirror realized by a Retrieval-Augmented Generation (RAG) pipeline that provides provenance-backed, up-to-date evidence. Together, these components enable provider-like, auditable, and context-aware provider–patient communication. Using HPV vaccination counseling as a representative use case, we demonstrate that GRACE can alternate between structured, clinician-patterned script delivery and interactive Q&A, adapting to user profiles and conversation history to support comprehension.
CRediT authorship contribution statement
Pengze Li: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Resources, Methodology, Data curation, Conceptualization. Yutong Hu: Writing – review & editing, Writing – original draft, Validation, Investigation, Data curation. Jianfu Li: Writing – review & editing, Writing – original draft, Validation, Investigation, Data curation. Garit Gemeinhardt: Writing – review & editing, Conceptualization. Fang Li: Writing – review & editing, Writing – original draft, Supervision. Muhammad Amith: Writing – review & editing. Licong Cui: Writing – review & editing, Formal analysis. Antonio J. Forte: Writing – review & editing, Conceptualization. Cui Tao: Writing – review & editing, Writing – original draft, Validation, Supervision, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Conceptualization.
Code and Data Availability
The source code is available at https://github.com/Tao-AI-group/GRACE_CSBJ.
Declaration of generative AI and AI-assisted technologies in the writing process
During the preparation of this work, the author(s) used ChatGPT to check for errors and polish the manuscript. After using this tool, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors acknowledge support from the National Institutes of Health (NIH) under grants U24AI171008, R01AG084236, and R01AG083039.
Contributor Information
Pengze Li, Email: li.pengze@mayo.edu.
Yutong Hu, Email: Hu.Yutong@mayo.edu.
Jianfu Li, Email: li.jianfu@mayo.edu.
Garit Gemeinhardt, Email: Gemeinhardt.Garit@mayo.edu.
Fang Li, Email: li.fang@mayo.edu.
Muhammad Amith, Email: muamith@utmb.edu.
Licong Cui, Email: licong.cui@uth.tmc.edu.
Antonio J Forte, Email: forte.antonio@mayo.edu.
Cui Tao, Email: tao.cui@mayo.edu.
Appendix A. Rubrics grading criteria
The rubrics with detailed grading criteria are shown in the following list.
1. Accuracy:
• Score 5: All medical claims are factually correct; uncertainty is explicitly acknowledged where evidence is limited.
• Score 4: Minor factual imprecision that does not alter clinical meaning or safety; uncertainty is acknowledged in most relevant cases.
• Score 3: At least one moderate factual inaccuracy or outdated claim; may lack explicit acknowledgment of uncertainty.
• Score 2: Multiple factual errors or misleading statements; significant lack of alignment with current medical consensus.
• Score 1: Contains major misinformation that may cause harm or conflicts with core medical knowledge.
2. Comprehensiveness:
• Score 5: Fully addresses the topic with a clear explanation of causes, implications, and recommended next steps; integrates relevant context.
• Score 4: Covers most key dimensions of the topic, though one minor aspect (e.g., long-term outcome or risk) may be omitted.
• Score 3: Addresses the central issue but lacks depth on consequences or actionable follow-up.
• Score 2: Discussion is limited to isolated facts without adequate contextualization or user guidance.
• Score 1: The response is superficial or off-topic, missing most relevant dimensions.
3. Completeness:
• Score 5: Includes all necessary information for a user to understand and act safely and effectively; no critical omissions.
• Score 4: Minor omission of non-critical details; the main message remains intact and actionable.
• Score 3: One or two moderate omissions that may cause confusion or require further clarification.
• Score 2: Key steps or warnings are missing, limiting usability or introducing potential risks.
• Score 1: The response is fragmented or insufficient for safe user understanding.
4. Communication:
• Score 5: Response is clear, well-structured, concise, and uses language appropriate to the user’s presumed health literacy level.
• Score 4: Generally understandable with slight verbosity or minor jargon; structure is mostly logical.
• Score 3: Adequately phrased but contains confusing segments, poor structure, or an inconsistent tone.
• Score 2: Language is overly technical or informal; difficult to follow without prior knowledge.
• Score 1: Disorganized or unclear, with significant readability issues.
5. Dialog Facilitation:
• Score 5: Effectively encourages continued engagement, poses follow-up prompts, or leads the user to the next logical step in the conversation.
• Score 4: Offers implicit cues for continued dialog; mostly maintains conversational flow.
• Score 3: Neutral; maintains coherence but does not actively guide the interaction.
• Score 2: Abrupt or passive response that may discourage further user input.
• Score 1: Breaks conversational flow; fails to acknowledge or anticipate user needs.
Appendix B. User feedback survey criteria
All questions were rated on a 1–5 scale, with 1 representing the most negative evaluation and 5 representing the most positive. For questions using the full 1–5 scale, participants could select any integer between 1 and 5 to reflect their level of satisfaction. For questions with fixed response options, each choice was mapped to a corresponding score to capture varying degrees of agreement or experience. The detailed evaluation criteria for each question are outlined below, covering General User Experience for Usability as well as Scripted Flow and Knowledge Gain.
1. General User Experience for Usability
• Overall satisfaction with the chatbot: Score 5 = Very satisfied; Score 1 = Not satisfied.
• Ease of use and understanding: Score 5 = Very satisfied; Score 1 = Not satisfied.
• Comfort and trustworthiness of chatbot tone (RAG influence): Score 5 = Very satisfied; Score 1 = Not satisfied.
2. Scripted Flow and Knowledge Gain
• Learning something new from the conversation: Score 5 = Yes, a lot; Score 4 = Yes, a little; Score 3 = Neutral; Score 2 = Learned very little; Score 1 = Learned nothing.
• Effectiveness of guided learning process vs. self-navigation: Score 5 = Yes; Score 3 = Neutral; Score 1 = No.
• Likelihood to get (or recommend) the HPV vaccine after the chat: Score 5 = Very likely; Score 3 = Neutral; Score 1 = Not likely.
Table B.3.
Cronbach’s α by gender.
| Gender | Cronbach’s α |
|---|---|
| F | 0.429 |
| M | 0.419 |
Table B.4.
Cronbach’s α by education level (bachelor vs. postgraduate).
| Degree | Cronbach’s α |
|---|---|
| Bachelor | 0.214 |
| Postgraduate | 0.626 |
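The Cronbach’s α values above follow the standard formula α = k/(k−1) · (1 − Σ item variances / variance of the total score). A minimal, library-free Python sketch is given below; the toy responses and function name are illustrative assumptions, not the study data or the authors’ code:

```python
from statistics import variance  # sample variance (ddof = 1)

def cronbach_alpha(items):
    """Cronbach's alpha for internal consistency.

    items: list of per-item score lists (one inner list per rubric item,
    aligned across respondents).
    """
    k = len(items)
    item_vars = sum(variance(col) for col in items)
    totals = [sum(resp) for resp in zip(*items)]  # per-respondent total
    return (k / (k - 1)) * (1 - item_vars / variance(totals))

# Two perfectly consistent toy items yield alpha = 1.0:
print(cronbach_alpha([[1, 2, 3], [1, 2, 3]]))  # 1.0
```

Low α values, as in Table B.4’s bachelor subgroup, indicate that the items vary somewhat independently rather than tracking a single latent construct.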
Table B.5.
Rater-level composite scores and standardized (z) values. Higher z indicates relatively more lenient ratings; lower z indicates stricter ratings.
| Rater | Composite | Composite_z |
|---|---|---|
| R1 | 5.000 | 1.381 |
| R2 | 5.000 | 1.381 |
| R3 | 5.000 | 1.381 |
| R4 | 4.667 | 0.674 |
| R5 | 4.667 | 0.674 |
| R6 | 4.667 | 0.674 |
| R7 | 4.667 | 0.674 |
| R8 | 4.500 | 0.320 |
| R9 | 4.500 | 0.320 |
| R10 | 4.500 | 0.320 |
| R11 | 4.333 | −0.034 |
| R12 | 4.333 | −0.034 |
| R13 | 4.333 | −0.034 |
| R14 | 4.333 | −0.034 |
| R15 | 4.167 | −0.388 |
| R16 | 4.000 | −0.742 |
| R17 | 4.000 | −0.742 |
| R18 | 3.833 | −1.096 |
| R19 | 3.667 | −1.450 |
| R20 | 3.500 | −1.803 |
| R21 | 3.333 | −2.157 |
Table B.6.
Inter-item Spearman correlations among rubrics.
| | Overall | Ease | Tone | Learn | Guided | Likelihood |
|---|---|---|---|---|---|---|
| Overall satisfaction | 1.000 | 0.130 | 0.222 | 0.382 | 0.217 | 0.118 |
| Ease of use | 0.130 | 1.000 | 0.280 | −0.154 | −0.347 | 0.069 |
| Comfort/trust tone | 0.222 | 0.280 | 1.000 | 0.094 | 0.279 | 0.118 |
| Learning new | 0.382 | −0.154 | 0.094 | 1.000 | 0.349 | −0.374 |
| Guided learning | 0.217 | −0.347 | 0.279 | 0.349 | 1.000 | −0.212 |
| Likelihood to vaccinate | 0.118 | 0.069 | 0.118 | −0.374 | −0.212 | 1.000 |
Table B.7.
Gender comparisons using Mann–Whitney U and Cliff’s δ. No significant differences were found between groups (all p > 0.05).
| Metric | Female mean | Male mean | Diff (F–M) | U | p | Cliff’s δ |
|---|---|---|---|---|---|---|
| Composite | 4.417 | 4.259 | 0.157 | 63.500 | 0.518 | 0.176 |
| Learning something new | 4.167 | 3.778 | 0.389 | 62.000 | 0.538 | 0.148 |
| Guided learning effectiveness | 4.333 | 4.111 | 0.222 | 60.000 | 0.643 | 0.111 |
| Comfort/trust tone | 4.417 | 4.111 | 0.306 | 60.000 | 0.662 | 0.111 |
| Likelihood to vaccinate | 4.583 | 4.667 | −0.083 | 49.500 | 0.736 | −0.083 |
| Overall satisfaction | 4.333 | 4.222 | 0.111 | 56.000 | 0.905 | 0.037 |
Table B.8.
Education-level comparisons using Mann–Whitney U and Cliff’s δ. No significant differences were found between groups (all p > 0.05).
| Metric | Bachelor mean | Postgrad mean | Diff (B–P) | U | p | Cliff’s δ |
|---|---|---|---|---|---|---|
| Composite | 4.318 | 4.392 | −0.074 | 50.000 | 0.120 | 0.270 |
| Learning something new | 4.182 | 4.300 | −0.118 | 46.500 | 0.230 | 0.180 |
| Guided learning effectiveness | 4.182 | 4.300 | −0.118 | 51.500 | 0.200 | 0.220 |
| Comfort/trust tone | 4.182 | 4.400 | −0.218 | 47.000 | 0.210 | 0.230 |
| Likelihood to vaccinate | 4.545 | 4.700 | −0.155 | 52.000 | 0.190 | 0.200 |
| Overall satisfaction | 4.273 | 4.500 | −0.227 | 49.500 | 0.180 | 0.250 |
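Both quantities reported in Tables B.7 and B.8 derive from the same pairwise comparisons between groups: U counts pairs where one group beats the other (ties count 0.5), and Cliff’s δ is the normalized difference between wins and losses. The Python sketch below is an illustrative implementation on toy data, not the study responses or analysis code:

```python
def mann_whitney_u_and_cliffs_delta(x, y):
    """Return (U, delta) for two independent samples x and y.

    U: count of (x, y) pairs where x > y, plus half of the tied pairs.
    Cliff's delta: (wins - losses) / (n * m), bounded in [-1, 1].
    """
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    ties = len(x) * len(y) - greater - less
    u = greater + 0.5 * ties
    delta = (greater - less) / (len(x) * len(y))
    return u, delta

u, d = mann_whitney_u_and_cliffs_delta([3, 4, 5], [1, 2, 3])
print(u, round(d, 3))  # 8.5 0.889
```

The two statistics are linked by δ = 2U/(nm) − 1, which is why small δ values in the tables accompany U statistics near nm/2.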
References
- 1. Batty M. Digital twins. 2018.
- 2. Herwig C., Pörtner R., Möller J. Digital twins. Springer; 2021.
- 3. Juarez M.G., Botti V.J., Giret A.S. Digital twins: review and challenges. J Comput Inf Sci Eng. 2021;21(3).
- 4. Mihai S., Yaqoob M., Hung D.V., Davis W., Towakel P., Raza M., Karamanoglu M., Barn B., Shetve D., Prasad R.V., et al. Digital twins: a survey on enabling technologies, challenges, trends and future prospects. IEEE Commun Surv Tutorials. 2022;24(4):2255–2291.
- 5. Xames M.D., Topcu T.G. A systematic literature review of digital twin research for healthcare systems: research trends, gaps, and realization challenges. IEEE Access. 2024;12:4099–4126.
- 6. Okegbile S.D., Cai J., Niyato D., Yi C. Human digital twin for personalized healthcare: vision, architecture and future directions. IEEE Network. 2022;37(2):262–269.
- 7. White G., Zink A., Codecá L., Clarke S. A digital twin smart city for citizen feedback. Cities. 2021;110.
- 8. Jafari M., Kavousi-Fard A., Chen T., Karimi M. A review on digital twin technology in smart grid, transportation system and smart city: challenges and future. IEEE Access. 2023;11:17471–17484.
- 9. Pylianidis C., Osinga S., Athanasiadis I.N. Introducing digital twins to agriculture. Comput Electron Agric. 2021;184.
- 10. Purcell W., Neubauer T. Digital twins in agriculture: a state-of-the-art review. Smart Agric Technol. 2023;3.
- 11. Katsoulakis E., Wang Q., Wu H., Shahriyari L., Fletcher R., Liu J., Achenie L., Liu H., Jackson P., Xiao Y., et al. Digital twins for health: a scoping review. NPJ Digit Med. 2024;7(1):77. doi: 10.1038/s41746-024-01073-0.
- 12. Chen J., Yi C., Okegbile S.D., Cai J., Shen X. Networking architecture and key supporting technologies for human digital twin in personalized healthcare: a comprehensive survey. IEEE Commun Surv Tutorials. 2023;26(1):706–746.
- 13. Chen J., Shi Y., Yi C., Du H., Kang J., Niyato D. Generative AI-driven human digital twin in IoT-healthcare: a comprehensive survey. IEEE Internet Things J. 2024.
- 14. Liu Y., Zhang L., Yang Y., Zhou L., Ren L., Wang F., Liu R., Pang Z., Deen M.J. A novel cloud-based framework for the elderly healthcare services using digital twin. IEEE Access. 2019;7:49088–49101.
- 15. Elayan H., Aloqaily M., Guizani M. Digital twin for intelligent context-aware IoT healthcare systems. IEEE Internet Things J. 2021;8(23):16749–16757.
- 16. Mourtzis D., Angelopoulos J., Panopoulos N., Kardamakis D. A smart IoT platform for oncology patient diagnosis based on AI: towards the human digital twin. Procedia CIRP. 2021;104:1686–1691.
- 17. Sarella P.N.K., Mangam V.T. AI-driven natural language processing in healthcare: transforming patient-provider communication. Indian J Pharm Pract. 2024;17(1).
- 18. Otte S.V. Improved patient experience and outcomes: is patient–provider concordance the key? J Patient Exp. 2022;9. doi: 10.1177/23743735221103033.
- 19. Drossman D.A., Chang L., Deutsch J.K., Ford A.C., Halpert A., Kroenke K., Nurko S., Ruddy J., Snyder J., Sperber A. A review of the evidence and recommendations on communication skills and the patient–provider relationship: a Rome Foundation working team report. Gastroenterology. 2021;161(5):1670–1688. doi: 10.1053/j.gastro.2021.07.037.
- 20. Hannawa A.F., Wu A.W., Kolyada A., Potemkina A., Donaldson L.J. The aspects of healthcare quality that are important to health professionals and patients: a qualitative study. Patient Educ Couns. 2022;105(6):1561–1570. doi: 10.1016/j.pec.2021.10.016.
- 21. Tortora M., Pacchiano F., Ferraciolli S.F., Criscuolo S., Gagliardo C., Jaber K., Angelicchio M., Briganti F., Caranci F., Tortora F., et al. Medical digital twin: a review on technical principles and clinical applications. J Clin Med. 2025;14(2):324. doi: 10.3390/jcm14020324.
- 22. Tudor B.H., Shargo R., Gray G.M., Fierstein J.L., Kuo F.H., Burton R., Johnson J.T., Scully B.B., Asante-Korang A., Rehman M.A., et al. A scoping review of human digital twins in healthcare applications and usage patterns. NPJ Digit Med. 2025;8(1):587. doi: 10.1038/s41746-025-01910-w.
- 23. Zhao W.X., Zhou K., Li J., Tang T., Wang X., Hou Y., Min Y., Zhang B., Zhang J., Dong Z., et al. A survey of large language models. 2023. arXiv:2303.18223 [Preprint].
- 24. Naveed H., Khan A.U., Qiu S., Saqib M., Anwar S., Usman M., Akhtar N., Barnes N., Mian A. A comprehensive overview of large language models. 2023. arXiv:2307.06435 [Preprint].
- 25. Le N.Q.K. Leveraging transformers-based language models in proteome bioinformatics. Proteomics. 2023;23(23–24). doi: 10.1002/pmic.202300011.
- 26. Tran T.-O., Le N.Q.K. SA-TTCA: an SVM-based approach for tumor T-cell antigen classification using features extracted from biological sequencing and natural language processing. Comput Biol Med. 2024;174. doi: 10.1016/j.compbiomed.2024.108408.
- 27. Montagna S., Aguzzi G., Ferretti S., Pengo M.F., Klopfenstein L.C., Ungolo M., Magnini M. LLM-based solutions for healthcare chatbots: a comparative analysis. In: 2024 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops). IEEE; 2024. pp. 346–351.
- 28. Google DeepMind. Gemini: a family of highly capable multimodal models. 2023. arXiv:2312.11805 [Preprint].
- 29. Anthropic. Claude AI. 2023. https://claude.ai [accessed 28 March 2025].
- 30. OpenAI, et al. GPT-4 technical report. 2023. arXiv:2303.08774 [Preprint].
- 31. Jiang Z., Xu F.F., Gao L., Sun Z., Liu Q., Dwivedi-Yu J., Yang Y., Callan J., Neubig G. Active retrieval augmented generation. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. pp. 7969–7992.
- 32. Salemi A., Zamani H. Evaluating retrieval quality in retrieval-augmented generation. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2024. pp. 2395–2400.
- 33. Nian Y., Du J., Bu L., Li F., Hu X., Zhang Y., Tao C. Knowledge graph-based neurodegenerative diseases and diet relationship discovery. 2021. arXiv:2109.06123 [Preprint].
- 34. Huang L., Yu W., Ma W., Zhong W., Feng Z., Wang H., Chen Q., Peng W., Feng X., Qin B., et al. A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Trans Inf Syst. 2025;43(2):1–55.
- 35. Tonmoy S.M., Zaman S.M., Jain V., Rani A., Rawte V., Chadha A., Das A. A comprehensive survey of hallucination mitigation techniques in large language models. 2024. arXiv:2401.01313 [Preprint].
- 36. Chen Y., Fu Q., Yuan Y., Wen Z., Fan G., Liu D., Zhang D., Li Z., Xiao Y. Hallucination detection: robustly discerning reliable answers in large language models. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 2023. pp. 245–255.
- 37. Together AI. RAGFlow: an open source framework for building high-quality RAG pipelines. 2024. https://github.com/togethercomputer/ragflow
- 38. Deepset. Haystack: flexible open source NLP framework for question answering and RAG. 2023. https://haystack.deepset.ai
- 39. LangChain. LangChain: building applications with LLMs through composability. 2023. https://www.langchain.com
- 40. Arora R.K., Wei J., Hicks R.S., Bowman P., Quiñonero-Candela J., Tsimpourlas F., Sharman M., Shah M., Vallone A., Beutel A., Heidecke J., Singhal K. HealthBench: evaluating large language models towards improved human health. 2025. arXiv:2505.08775 [Preprint].
- 41. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł., Polosukhin I. Attention is all you need. Adv Neural Inf Process Syst. 2017;30.
- 42. Lewis P., Perez E., Piktus A., Petroni F., Karpukhin V., Goyal N., Küttler H., Lewis M., Yih W.-T., Rocktäschel T., et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv Neural Inf Process Syst. 2020;33:9459–9474.
- 43. Streamlit Inc. Streamlit: the fastest way to build and share data apps. 2023. https://streamlit.io
- 44. Reimers N., Gurevych I. all-mpnet-base-v2: Sentence-Transformers model. 2021. https://huggingface.co/sentence-transformers/all-mpnet-base-v2
- 45. Grattafiori A., Dubey A., Jauhri A., Pandey A., Kadian A., Al-Dahle A., Letman A., Mathur A., Schelten A., Vaughan A., et al. The Llama 3 herd of models. 2024. arXiv:2407.21783 [Preprint].