BMC Medical Education. 2026 Mar 3;26:586. doi: 10.1186/s12909-026-08819-2

Application of large language models in medical interview training: a study with medical students

Jarosław Hryszko 1, Artur Michałek 1, Adam Roman 1
PMCID: PMC13067494  PMID: 41772593

Abstract

This paper explores the application of Large Language Models (LLMs) in medical interview training. While medical interviews remain fundamental in healthcare, training methods often require human interaction, limiting practice opportunities. We investigate whether LLMs can effectively simulate patients for training purposes. We examined commercially available models and fine-tuned open-source LLMs using QLoRA techniques on a dataset of medical interviews. We developed a web application employing these models and conducted in-depth interviews with medical students to evaluate its effectiveness. Students found the application helpful, rating conversation quality as good and highlighting advantages over traditional training methods, particularly regarding availability and consistency in patient symptom presentation. While students emphasized that LLMs cannot replace real patient interactions, they recognized significant benefits for supplementary training. Our findings confirm that LLMs can be valuable tools in medical interview training, providing opportunities for skill development without dependency on peer availability or scheduled sessions.

Keywords: Large language models, Medical education, Virtual patients, Medical interviews, Healthcare simulation

Introduction

Healthcare delivery fundamentally relies on effective communication between medical professionals and patients, with the medical interview serving as a cornerstone of diagnostic and treatment processes. Despite technological advances in healthcare, the person-to-person interview between patient and doctor remains essential for accurate diagnosis and treatment recommendations [1]. Consequently, medical interview training is a critical component of medical education.

Traditional medical interview training typically involves role-playing exercises between students or interactions with standardized patients (actors trained to simulate specific conditions) [2]. While effective, these methods present limitations: they require coordination between multiple participants, are restricted to scheduled times, and may lack consistency in how symptoms and conditions are portrayed. Additionally, standardized patients can be costly to employ, further limiting practice opportunities [3].

Medical schools worldwide have sought to address these challenges through various innovative approaches. Some institutions have implemented dedicated simulation centers with recording capabilities for review and feedback [4]. Others have employed telehealth simulations or integrated electronic health record documentation into interview training [5]. However, these solutions typically still require significant infrastructure, scheduling coordination, and human resources.

Recent advances in artificial intelligence, particularly in the field of Large Language Models (LLMs), offer potential solutions to these limitations. LLMs are sophisticated neural network-based systems trained on vast datasets of text, enabling them to generate coherent, contextually appropriate responses [6]. Their ability to process and generate natural language makes them promising candidates for simulating patient responses in medical interview training scenarios.

The functioning of LLMs relies on transformer architecture, which enables them to process text by attending to different parts of input sequences simultaneously [7]. This architecture allows the models to capture complex relationships between words and concepts, facilitating more natural conversations than rule-based systems. Modern LLMs are trained on diverse corpora of text from various domains, including medical literature, potentially enabling them to simulate patient responses with appropriate medical terminology and symptom descriptions.

The application of virtual patients in medical education is not a novel concept. Research spanning decades has investigated various approaches, from computer-controlled manikins [8] to text-based interfaces [9] and more sophisticated virtual reality implementations [10]. Multiple systematic reviews have demonstrated benefits of virtual patient interactions compared to no simulation training at all [11–13].

What distinguishes the current technological landscape is the unprecedented capability of modern LLMs to engage in naturalistic, responsive conversations that adapt to user inputs without rigid pre-programming. This potentially addresses a key limitation of earlier virtual patient systems, which often offered limited response variability or followed prescriptive dialog trees. Earlier systems typically relied on branching scenarios (also called branching narratives)—structured decision trees where each possible student question leads to a predefined set of response options, with content creators pre-authoring all possible conversation paths [14]. While this approach ensures clinical accuracy, it requires extensive development time, limits conversation flexibility to anticipated questions, and can feel artificial when students deviate from expected paths. In contrast, LLMs can generate novel responses tailored to specific questions without requiring pre-authored branches, potentially creating more authentic, unpredictable, and adaptable interactions that better reflect the variability of real patient encounters.

The educational value of medical interview training extends beyond symptom elicitation to encompass key clinical communication skills such as empathy, information gathering, explanation and planning, and relationship building [15]. Following Cook et al. [16], we define effective communication in medical interviews as the ability to conduct structured information gathering (the intake interview), elicit comprehensive history of present illness (HPI), and facilitate shared decision-making between physician and patient. An effective patient simulation must therefore not only present clinically accurate symptoms but also emulate realistic patient attitudes, behaviors, and communication styles. The question remains whether LLMs can effectively simulate these nuanced aspects of patient communication.

This study addresses the following research questions:

  • RQ1: Can LLMs generate contextually appropriate and clinically coherent patient responses in simulated medical interviews?

  • RQ2: How do medical students perceive the utility, realism, and educational value of LLM-based patient simulations compared to traditional training methods?

  • RQ3: What are the technical and practical considerations for implementing LLM-based patient simulation systems in medical education?

To answer these questions, we structured our research into several key objectives:

  1. Examine the availability and quality of datasets suitable for training LLMs for medical interview simulation;

  2. Explore resource-efficient methods for training and fine-tuning LLMs;

  3. Evaluate both open-source and commercial LLMs for patient simulation capabilities;

  4. Develop a user-friendly application implementing these models;

  5. Assess the application’s utility through interviews with medical students.

Through this comprehensive approach, we aim to determine whether LLMs can serve as valuable tools in supplementing traditional medical interview training methods, potentially increasing training opportunities and enhancing skill development for medical students.

Methods

Data acquisition and analysis

For training and fine-tuning LLMs, we selected a dataset of 272 simulated medical conversations, primarily respiratory cases but also spanning musculoskeletal, gastrointestinal, dermatological, and cardiac conditions [17]. This dataset was developed by a team of doctors and senior medical students in Canada and included both audio recordings and textual transcriptions. For our purposes, we utilized only the transcriptions.

The dataset featured conversations with a relatively balanced gender distribution, with a 57% to 43% male-to-female ratio for doctors and a 55% to 45% ratio for patients. The average conversation duration was 11 minutes and 56 seconds. These demographic characteristics suggest reasonable representation of diverse patient-doctor interactions, though the predominance of respiratory cases (78.7% of conversations) represents a limitation in terms of clinical diversity.

The dataset was initially labeled with condition categories, though we identified that more precise labeling (e.g., specific illnesses) would benefit model training. We explored several labeling approaches, including manual labeling by medical professionals (highest quality but resource-intensive), using specialized LLMs for labeling (medium quality with moderate resource requirements), and using general LLMs (lowest quality but most accessible).

The advantages and disadvantages of each approach were carefully considered. Manual labeling by medical professionals would provide the most accurate and clinically relevant labels but would require significant time investment from specialists who are often resource-constrained. Using specialized LLMs trained specifically for medical labeling tasks could offer a compromise, potentially providing reasonable accuracy while reducing demands on human experts. General LLMs, while most accessible, would likely produce labels of lower clinical accuracy without specific medical training.

A comparative analysis of these methods revealed trade-offs in terms of label quality, implementation complexity, data security, and cost. We determined that for educational purposes, reasonably accurate but not necessarily perfect labeling would be sufficient, particularly for a proof-of-concept system. Given resource constraints, we ultimately utilized the dataset with its original labeling for this study.

Statistical analysis of the dataset revealed an average of 48 conversation turns per interview, with some approaching 80 turns. Conversations averaged 139 sentences and 7168 words (approximately 9557 tokens when tokenized using the Llama-2 tokenizer, a token-to-word ratio of approximately 1.33:1), with typical patient responses consisting of approximately 2.8 sentences. This indicated that trimming rather than expansion would be necessary to accommodate model context window limitations.

The relatively concise nature of patient responses (2.8 sentences on average) aligned well with our goal of generating realistic patient replies, as real patients typically provide brief rather than elaborate responses. The high token count of complete conversations (approximately 9557 tokens on average) presented a technical challenge, as most LLMs have context window limitations that would not accommodate entire conversations without truncation. Note that token count differs from word count due to subword tokenization, where words may be split into multiple tokens; medical terminology in particular often results in higher token-to-word ratios.
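
To make the tokenization point concrete, the sketch below computes per-conversation word and token counts with the Llama-2 tokenizer. It is a minimal illustration, assuming HuggingFace access to the gated Llama-2 checkpoint; the transcript format is not the study's exact preprocessing code.

```python
# Minimal sketch: word vs. token counts for a conversation transcript.
# Assumes access to the gated Llama-2 checkpoint on HuggingFace.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

def conversation_stats(transcript: str) -> dict:
    """Compute word count, token count, and their ratio for one transcript."""
    words = transcript.split()
    tokens = tokenizer.encode(transcript)
    return {
        "words": len(words),
        "tokens": len(tokens),
        "token_word_ratio": len(tokens) / max(len(words), 1),
    }
```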

Model selection and training methods

We investigated both open-source and commercial LLMs to evaluate their effectiveness in simulating patient responses. This dual approach allowed us to compare the benefits and limitations of each strategy in terms of performance, resource requirements, and practical implementation considerations.

Open-source model

For the open-source approach, we selected Llama-2-7B-chat-hf [18], a 7-billion parameter model from Meta specifically trained for conversational applications. The model employs a transformer-based architecture and was pre-trained on a diverse corpus of text before being fine-tuned using Reinforcement Learning from Human Feedback (RLHF) for conversation capabilities.

We employed fine-tuning methods designed to minimize computational requirements while preserving performance. This approach was motivated by the need to make such technology accessible to educational institutions with limited computational resources, as opposed to requiring enterprise-level hardware.

Given the memory-intensive nature of LLM training, we implemented several optimization techniques:

Context size management: Conversations exceeding 3300 tokens were truncated to prevent out-of-memory errors during training. This threshold was determined through empirical testing with available hardware, balancing the preservation of conversation content against technical limitations. Analysis of the dataset revealed that approximately 23% of conversations (62 out of 272) exceeded this limit, with an average truncation of 1847 tokens for affected conversations. To minimize information loss, we preserved the beginning portions of conversations, which typically contain the most critical diagnostic exchanges (chief complaint, history of present illness, and initial symptom descriptions). While this approach may have excluded some follow-up questions and diagnostic reasoning from longer interviews, we prioritized capturing the core patient presentation patterns. This limitation is discussed further in the Discussion section.
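
As an illustration of this head-preserving strategy, the following minimal sketch keeps whole turns from the beginning of a conversation until the 3300-token budget is spent; the turn representation is an assumption, not the study's exact code.

```python
# Sketch of head-preserving truncation to the 3300-token budget described
# above: keep the opening turns (chief complaint, HPI) and drop later ones.
MAX_TOKENS = 3300

def truncate_conversation(turns: list, tokenizer) -> list:
    """Keep whole turns from the start until the token budget is exhausted."""
    kept, used = [], 0
    for turn in turns:
        n = len(tokenizer.encode(turn, add_special_tokens=False))
        if used + n > MAX_TOKENS:
            break  # drop this turn and everything after it
        kept.append(turn)
        used += n
    return kept
```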

Quantization: We applied 4-bit quantization to reduce memory requirements while accepting a minor accuracy trade-off [19]. Quantization reduces the precision of model weights from 32-bit floating-point representation to 4-bit integers, substantially decreasing memory usage at the cost of some computational accuracy. According to Rokh et al. [19], this approach typically reduces memory requirements by approximately 87.5% compared to full-precision models.

QLoRA fine-tuning: We implemented QLoRA (Quantized Low-Rank Adaptation) [20], a parameter-efficient fine-tuning technique that significantly reduces VRAM requirements while maintaining performance comparable to full-precision training. Unlike full-parameter fine-tuning where all model weights are updated, QLoRA freezes the base model parameters and only trains small, low-rank adapter matrices inserted into the model architecture. Specifically, QLoRA works by decomposing weight update matrices into low-rank approximations, dramatically reducing the number of trainable parameters that need to be stored in GPU memory during training. According to Dettmers et al. [20], this method can reduce memory requirements by up to 67% compared to standard full-parameter fine-tuning while maintaining comparable performance.
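
For readers implementing a similar setup, the sketch below shows a typical QLoRA configuration using the transformers, bitsandbytes, and peft libraries. The adapter hyperparameters (rank, alpha, target modules) are illustrative choices under common defaults, not the exact values used in this study.

```python
# Illustrative QLoRA setup: 4-bit quantized base model with trainable
# low-rank adapters. Adapter hyperparameters here are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantized base weights
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # higher-precision compute dtype
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                   # low-rank adapter dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],    # attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # base frozen; only adapters train
```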

The combination of these parameter-efficient techniques enabled us to train on hardware that would otherwise be insufficient for full-parameter fine-tuning. Specifically, while standard full-parameter fine-tuning of a 7B parameter model would require approximately 126GB of VRAM (at 18 bytes per parameter), our optimized QLoRA approach reduced this requirement to under 48GB, making training feasible on a single GPU. This approach trades off some potential performance gains from full-parameter training for practical accessibility on limited hardware.

Training was performed on an NVIDIA A40 GPU with 48GB of VRAM over 3 epochs with a batch size of 4 and learning rate of 2e-4. We explored both supervised and unsupervised training approaches with distinct task formulations:

For supervised fine-tuning, we framed the task as instruction-following for next-utterance prediction. We created handcrafted system prompts for each illness category that instructed the model to behave as a patient with specific conditions (e.g., “You are a patient experiencing respiratory symptoms. Respond naturally to doctor’s questions as this patient would.”). Each training example consisted of: (1) the system prompt as instruction, (2) the conversation history up to the current turn as context, and (3) the patient’s actual response from the dataset as the target completion. This approach taught the model to generate contextually appropriate patient responses given explicit role instructions.

For unsupervised fine-tuning, we treated the task as pure language modeling on medical dialogue. Complete conversation transcripts were concatenated with turn-taking markers (e.g., “Doctor:”, “Patient:”) but without explicit role instructions. The model learned to predict each utterance given preceding context, inferring the patient role implicitly from the discourse structure. While this approach requires no manual prompt engineering, it provides less direct control over response behavior.
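
The difference between the two formulations comes down to how training strings are assembled. The sketch below is illustrative: the Llama-2 instruction markers and prompt wording are assumptions rather than the study's exact templates.

```python
# Hedged sketch of the two task formulations described above.
def supervised_example(system_prompt: str, history: list, patient_reply: str) -> str:
    """Instruction-following format: system prompt + history as input,
    the patient's actual utterance as the target completion."""
    context = "\n".join(f"{speaker}: {utt}" for speaker, utt in history)
    return f"[INST] {system_prompt}\n\n{context}\nPatient: [/INST] {patient_reply}"

def unsupervised_example(turns: list) -> str:
    """Plain language modeling: the whole transcript with turn markers,
    no explicit role instruction."""
    return "\n".join(f"{speaker}: {utt}" for speaker, utt in turns)
```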

To evaluate whether fine-tuning affected general model capabilities, we employed standardized benchmarks from the HuggingFace Open LLM Leaderboard, including AI2 Reasoning Challenge, Hellaswag, MMLU, TruthfulQA, Winogrande, and GSM8k. These benchmarks assess various capabilities including reasoning, common sense understanding, multi-task knowledge, truthfulness, and mathematical problem-solving.

Important limitation: These general-purpose benchmarks do not directly evaluate medical interview capability, conversational authenticity, or clinical realism—the primary objectives of our fine-tuning. Our intent in using them was narrowly scoped: to verify that specialized fine-tuning for medical dialogue did not degrade the model’s general reasoning and language understanding abilities. For assessing actual patient simulation quality, we relied on qualitative evaluation through student interviews (see the Results section), which provided domain-relevant feedback on conversational naturalness, symptom consistency, and educational utility. A comprehensive evaluation would benefit from medical communication-specific benchmarks (e.g., dialogue coherence metrics, clinical realism assessments by medical experts, or standardized medical communication evaluation frameworks), which represent important future work. We acknowledge this as a methodological limitation in the Discussion section.

Commercial models

We evaluated OpenAI’s GPT models (specifically GPT-3.5 and GPT-4) as commercially available solutions. These models were accessed through both web interfaces and API connections. Unlike open-source models, these did not require training infrastructure but incurred usage costs.

The decision to include commercial models was motivated by their widespread accessibility and reported state-of-the-art performance. According to OpenAI’s technical reports, GPT-4 demonstrates strong performance across various benchmarks, with scores of 0.864 on MMLU, 0.953 on Hellaswag, 0.963 on AI2 RC, and 0.875 on Winogrande [21]. These metrics suggest significantly stronger capabilities compared to the 7B parameter open-source model we evaluated.

We explored various approaches to optimize these models for medical interview simulation:

System prompt engineering. We developed and refined prompts instructing the model to simulate patients with specific conditions and exhibit appropriate symptoms. Through iterative testing, we identified key elements that improved simulation quality, including specific instructions about response length, formality level, and disclosure behaviors.

Custom GPT creation. We tested OpenAI’s “Create a GPT” functionality to develop purpose-built patient simulators. This approach allowed us to create persistent configurations with specific instructions and knowledge bases tailored to patient simulation.

Controlled condition selection. We implemented a system to select conditions and their associated symptoms programmatically rather than relying on the model’s internal knowledge, improving consistency and accuracy. This involved creating a database of conditions and their typical presentations based on medical literature, then injecting these details into prompts.

Parameter tuning. We adjusted generation parameters like temperature, frequency_penalty, and presence_penalty to influence output characteristics. Through systematic testing, we determined that a moderate temperature (0.7) balanced between predictable responses and creative variations, while a slightly elevated presence penalty (0.6) helped prevent symptom repetition.

For symptom definition, we compiled information for key conditions including Stable Angina, E. coli infection, and Asthma, detailing their characteristic symptoms. This structured approach provided the model with medically accurate information while still allowing for naturalistic variation in how symptoms were described. Our testing evaluated output quality, consistency in maintaining patient persona, and accuracy in representing appropriate symptoms.
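
Putting these elements together, a hedged sketch of the API-based pipeline might look as follows. The condition database excerpt and prompt wording are illustrative; the temperature (0.7) and presence penalty (0.6) are the values reported above, and the sketch uses the current openai Python SDK.

```python
# Illustrative pipeline: programmatic condition selection, symptom injection,
# and generation parameters. CONDITIONS entries are hypothetical excerpts.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CONDITIONS = {
    "stable_angina": ["pressure-like chest pain on exertion",
                      "pain radiating to the left arm",
                      "relief within a few minutes of rest"],
}

def build_system_prompt(condition: str, age: int, gender: str) -> str:
    symptoms = "; ".join(CONDITIONS[condition])
    return (
        f"You are a {age}-year-old {gender} patient. You have these symptoms: "
        f"{symptoms}. Describe them in everyday language, never medical terms. "
        "Do not reveal all symptoms at once; answer briefly (1-3 sentences)."
    )

def patient_reply(condition: str, age: int, gender: str, messages: list) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system",
                   "content": build_system_prompt(condition, age, gender)},
                  *messages],
        temperature=0.7,       # balance between predictability and variation
        presence_penalty=0.6,  # discourages repeating already-stated symptoms
    )
    return response.choices[0].message.content
```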

The commercial models were evaluated both through their web interfaces and API implementations. The web interface provided a convenient environment for rapid prototyping and qualitative assessment, while API integration enabled more controlled testing and eventual application development with standardized parameters.

Application development

Based on our model evaluations, we developed a web application to facilitate medical interview practice. The application followed a client-server architecture with a Python/FastAPI backend and React/Next.js frontend, communicating via HTTP for initial setup and WebSocket for real-time conversation (see Additional file 1 for detailed technical specifications).

The backend implemented a component-based design with modular elements for condition management, LLM interaction abstraction, patient demographic configuration, prompt generation, and conversation context maintenance. This modular architecture facilitates potential integration with different LLM implementations and external medical knowledge bases.
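
A minimal sketch of this architecture is shown below, with HTTP for session setup and WebSocket for the turn-by-turn conversation loop; the endpoint names and session handling are hypothetical simplifications, not the application's actual code.

```python
# Minimal sketch of the client-server split described above.
from fastapi import FastAPI, WebSocket

app = FastAPI()
sessions: dict = {}  # per-session conversation context

@app.post("/session")
def create_session(condition: str, age: int, gender: str) -> dict:
    """HTTP setup: configure the simulated patient, return a session id."""
    session_id = f"{condition}-{len(sessions)}"
    sessions[session_id] = []  # would also hold the generated system prompt
    return {"session_id": session_id}

@app.websocket("/chat/{session_id}")
async def chat(websocket: WebSocket, session_id: str):
    """Real-time loop: receive a doctor utterance, send back the patient reply."""
    await websocket.accept()
    history = sessions[session_id]
    while True:  # a production version would handle disconnects
        question = await websocket.receive_text()
        history.append({"role": "user", "content": question})
        reply = "..."  # call the LLM backend here (e.g., patient_reply above)
        history.append({"role": "assistant", "content": reply})
        await websocket.send_text(reply)
```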

The frontend prioritized intuitive interaction through a responsive design suitable for both desktop and mobile devices. Key features included:

  • Condition selector with random option for diagnostic challenge

  • Configurable patient demographics (age, gender)

  • Chat interface mimicking standard messaging applications

  • Diagnosis verification functionality

The application supported both pre-configured and randomized illness scenarios, allowing students to practice both structured interviewing (when condition is known) and diagnostic reasoning (when condition is withheld until requested). Patient demographics could be specified or randomly generated, providing variability in case presentations.
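
The randomized-scenario mode can be expressed compactly; in this sketch the helper names and demographic ranges are illustrative, and the condition list echoes the conditions named earlier.

```python
# Illustrative scenario configuration: fixed or randomized condition and
# demographics, plus the diagnosis-verification check. Names are hypothetical.
import random
from typing import Optional

CONDITIONS = ["stable_angina", "e_coli_infection", "asthma"]

def new_scenario(condition: Optional[str] = None,
                 age: Optional[int] = None,
                 gender: Optional[str] = None) -> dict:
    """Fill any unspecified field with a random but plausible value."""
    return {
        "condition": condition or random.choice(CONDITIONS),
        "age": age if age is not None else random.randint(18, 85),
        "gender": gender or random.choice(["male", "female"]),
    }

def verify_diagnosis(scenario: dict, guess: str) -> bool:
    """Compare the student's diagnosis against the hidden condition."""
    return guess.strip().lower().replace(" ", "_") == scenario["condition"]
```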

Statistical analysis and methods

For comparing model performance metrics across benchmark tests, we employed two-tailed Z-tests to determine the statistical significance of observed differences, with the significance threshold set at α = 0.05. All statistical analyses were performed using Python 3.9 with the SciPy (version 1.8.0) and NumPy (version 1.22.3) libraries. Detailed mathematical formulations of the Z-test methodology, including formulas and distributional assumptions, are provided in Additional file 2.
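
For reference, a minimal version of a two-proportion Z-test applied to benchmark accuracies is sketched below; the ARC-Challenge test size of 1172 items is used for illustration, and the methodological caveat that follows applies to this procedure as well.

```python
# Sketch of a two-tailed, two-proportion Z-test on benchmark accuracies.
# The example sample size (n = 1172, ARC-Challenge test set) is illustrative.
from math import sqrt
from scipy.stats import norm

def z_test_accuracy(p1: float, n1: int, p2: float, n2: int):
    """Two-tailed Z-test for the difference between two accuracy proportions."""
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)       # pooled proportion
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))           # two-tailed p-value
    return z, p_value

# e.g., base vs. supervised Llama-2 on AI2 RC: not significant at alpha = 0.05
z, p = z_test_accuracy(0.641, 1172, 0.650, 1172)
```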

Important methodological note: As noted by reviewers, our application of Z-tests to benchmark scores assumes distributional properties that may not hold for deterministic model outputs. This represents a limitation in our quantitative comparison approach, as model benchmark scores are not true random samples but deterministic outputs. Future work should employ more appropriate statistical methods for comparing deterministic model performance.

For qualitative data from student interviews, we employed thematic analysis to identify recurring patterns and key insights. Quantitative ratings from Likert scales were analyzed using descriptive statistics (means and frequencies) rather than inferential statistics due to the small sample size (n = 8). This mixed-methods approach allowed for both structured comparison and nuanced interpretation of student feedback.

Evaluation with medical students

To assess the effectiveness of the application, we conducted in-depth interviews with eight medical students (seven male, one female) from Jagiellonian University Medical College.

Sample size justification: For this exploratory qualitative study, we employed purposive sampling to recruit medical students with varying levels of familiarity with LLMs and medical interview training. While eight participants represents a modest sample size, it aligns with established guidelines for phenomenological qualitative research, where samples of 5–15 participants are considered sufficient when the research objective is to explore lived experiences and perceptions of a specific phenomenon [22]. Our primary aim was to gather rich, detailed feedback on user experience and perceived educational value rather than to achieve statistical generalizability. Data saturation—the point at which new interviews yield no substantially new themes—was assessed through iterative analysis. By the seventh and eighth interviews, we observed convergent feedback patterns with minimal emergence of novel themes, suggesting adequate information power for our research objectives. However, we acknowledge that this sample size limits generalizability and demographic diversity, as discussed in the Discussion section.

Interview guide development: The interview guide was developed iteratively to align with our exploratory research objectives, specifically addressing RQ2 (student perceptions of utility, realism, and educational value). Given the exploratory nature of this study, we adopted a semi-structured interview approach that allowed flexibility to pursue emergent themes while ensuring systematic coverage of key topics [23].

The guide development followed three stages: (1) Initial draft based on the research questions—we mapped each RQ to specific interview domains, with RQ2 informing the core perceptual and evaluative questions; (2) Pilot testing with two medical students (not included in the final sample) to assess question clarity, interview flow, and identify additional relevant topics; (3) Refinement based on pilot feedback, including rewording of ambiguous questions and addition of probes for deeper exploration of student experiences.

The interview structure was explicitly designed to capture student perceptions through a demonstration-centered approach: pre-demonstration questions established baseline experiences and expectations, the demonstration phase provided authentic interaction with the system, and post-demonstration questions elicited reflective perceptions of that experience. This design ensures that student perceptions—the central focus of RQ2—are grounded in actual system use rather than hypothetical scenarios. The complete interview guide is provided in Additional file 3.

The interviews consisted of three parts:

  1. Pre-demonstration questions about prior experience with LLMs and medical interview training methods.

  2. Application demonstration where students used the application to conduct simulated medical interviews.

  3. Post-demonstration questions evaluating the application’s effectiveness and comparing it to traditional training methods.

Questions included both quantitative (5-point Likert scale) and qualitative components, allowing students to elaborate on their responses. For quantitative questions, students rated items on scales from “Very bad” to “Very good” or “Not needed at all” to “Must have” depending on the context. The interviews were conducted in Polish and later translated into English for analysis. Average interview duration was 46 minutes.

The pre-demonstration questions assessed students’ familiarity with LLMs, including usage frequency, typical applications, and any previous attempts to use these tools for medical education. This established a baseline understanding of students’ technological literacy and preconceptions about AI-assisted learning.

During the application demonstration, students were given time to explore the interface independently and conduct at least one complete medical interview with a simulated patient. The interviewer remained available for technical support but otherwise did not intervene or guide the interaction. This approach allowed for authentic user experience evaluation without researcher influence.

Post-demonstration questions focused on usability, perceived educational value, and comparison with existing training methods. Students were asked to rate various aspects of the application on Likert scales while providing qualitative elaboration of their ratings. We also solicited feedback on potential features, with students rating the importance of specific functionalities such as random condition selection, patient demographic configuration, and simulation of physical examination.

Ethical considerations included voluntary participation, anonymity in reporting, and obtaining verbal consent before interviews. No personal identifiable information was collected. The study design focused on educational technology evaluation rather than assessment of student performance, minimizing potential stress or anxiety for participants.

Results

Sample conversation

To illustrate the quality and nature of the simulated medical interviews, Table 1 presents an excerpt from an actual doctor-patient conversation conducted with the GPT-4 model simulating a patient with stable angina. This conversation demonstrates both the strengths of the system (consistent symptom presentation, naturalistic language) and occasional limitations (sometimes overly detailed responses).

Table 1.

Sample conversation between medical student and simulated patient (Excerpt from a medical interview)

Participant | Message
Doctor | Hello, I’m Dr. Smith. What brings you in today?
Patient | Hi doctor. I’ve been having these chest pains recently. They come and go, but they’re pretty uncomfortable when they happen.
Doctor | I’m sorry to hear that. Can you tell me more about these pains? Where exactly do you feel them?
Patient | It’s right in the middle of my chest. Kind of feels like pressure or like someone’s sitting on my chest. Sometimes it spreads to my left shoulder and down my arm.
Doctor | How long has this been happening?
Patient | It started maybe 3 weeks ago. At first I thought it was just indigestion or something, but it keeps happening.
Doctor | What seems to trigger these episodes?
Patient | I notice it happens mostly when I’m walking up the stairs or when I’m rushing to catch the bus. When I’m just sitting around, I usually feel fine.
Doctor | How long do the episodes typically last?
Patient | Usually just a few minutes. Once I stop and rest, it goes away pretty quickly. Maybe 5 minutes at most.
Doctor | Have you tried taking anything for the pain?
Patient | I tried some antacids because I thought it might be heartburn, but they didn’t really help much.
Doctor | Do you have any other symptoms during these episodes like shortness of breath, dizziness, or nausea?
Patient | Yeah, sometimes I do feel a bit short of breath when it happens. And occasionally a little lightheaded. I haven’t really felt nauseous though.

Open-source model performance

Our evaluation of the fine-tuned Llama-2 model yielded mixed results. While supervised training produced models that better adhered to the expected conversational format, they occasionally generated overly long outputs, sometimes producing entire interviews rather than only patient responses. This was likely due to truncation of training examples, which may have affected the model’s understanding of response boundaries.

The supervised model demonstrated better comprehension of the patient role, maintaining appropriate symptom consistency throughout conversations. However, it exhibited a tendency toward verbose responses that occasionally included technical medical terminology unlikely to be used by typical patients. For example, when asked about pain characteristics, the model might respond with detailed descriptions of “radiating substernal discomfort” rather than more naturalistic descriptions like “pain that goes from my chest to my left arm.”

The unsupervised training approach produced more appropriately sized responses but demonstrated less consistency in conversation quality, occasionally generating unnaturally formal or verbose patient responses. The unsupervised model sometimes failed to maintain consistent symptoms throughout longer conversations, occasionally introducing contradictory information as the dialogue progressed. This inconsistency could potentially compromise the educational value of the simulation by presenting clinically implausible symptom combinations.

Interestingly, neither fine-tuned model showed a significant improvement in naturalness of responses compared to the base Llama-2 model with appropriate prompting. This suggests that for models of this size, prompt engineering may be as effective as fine-tuning for this specific application, at least when working with the constraints of a 7B parameter model.

Quantitative evaluation using standard benchmarks (AI2 RC, Hellaswag, MMLU, TruthfulQA, Winogrande, and GSM8k) showed no statistically significant differences between the base Llama-2 model and either fine-tuned version. This suggests that while fine-tuning influenced the models’ behavior in medical interview contexts, it did not compromise their general capabilities.

Statistical analysis using two-tailed Z-tests confirmed that observed differences in benchmark scores were not significant at a 95% confidence level. This finding is reassuring from an implementation perspective, as it indicates that specialized fine-tuning for medical interview simulation does not necessarily come at the cost of general model capabilities (Table 2).

Table 2.

Benchmark performance comparison across models (Accuracy scores from 0 to 1, with 1 representing perfect performance)

Benchmark | Llama-2 base | Llama-2 supervised | Llama-2 unsupervised | GPT-4
AI2 RC | 0.641 (±0.008) | 0.650 (±0.008) | 0.649 (±0.008) | 0.963
Hellaswag | 0.578 (±0.005) | 0.573 (±0.005) | 0.573 (±0.005) | 0.953
MMLU | 0.464 (±0.004) | 0.466 (±0.004) | 0.468 (±0.004) | 0.864
TruthfulQA | 0.377 (±0.011) | 0.381 (±0.011) | 0.380 (±0.011) | N/A
Winogrande | 0.664 (±0.013) | 0.672 (±0.013) | 0.672 (±0.013) | 0.875
GSM8k | 0.238 (±0.012) | 0.231 (±0.012) | 0.218 (±0.011) | N/A

Given computational constraints and the moderate results achieved with the 7B parameter model, we concluded that larger parameter models might be necessary for more convincing patient simulation using open-source approaches. This aligns with findings in the broader LLM literature suggesting that performance on specialized tasks often scales with model size [24].

Commercial model performance

Commercial models demonstrated stronger performance in patient simulation. GPT-4 consistently outperformed GPT-3.5 in generating natural-sounding patient responses. We found that fine-tuning these models through OpenAI’s API did not significantly improve output quality, likely because the models were already sufficiently trained on conversational tasks.

GPT-3.5 tended to produce responses that were technically accurate but often unnaturally formal or comprehensive for typical patient communication. For example, when asked about symptoms, GPT-3.5 might list all relevant symptoms in a single response rather than revealing them gradually through conversation as real patients typically would. By contrast, GPT-4 demonstrated more human-like communication patterns, including appropriate hesitation, gradual symptom disclosure, and more natural language usage (Table 3).

Table 3.

Comparison of open-source and commercial LLM performance

Aspect | Open-source model (Llama-2) | Commercial model (GPT-4)
Benchmark performance | Moderate (e.g., AI2 RC: 0.65) | Strong (e.g., AI2 RC: 0.96)
Response naturalness | Occasionally technical or verbose | More human-like conversation patterns
Symptom consistency | Good with supervised fine-tuning | Excellent with proper prompting
Memory requirements | High (48+ GB VRAM for training) | None (API-based)
Customization | High (full model fine-tuning) | Limited (prompt engineering only)
Implementation cost | High initial (hardware) | Low initial, ongoing usage fees
Response time | Fast (local inference) | Variable (API-dependent)

Prompt engineering proved highly effective in controlling model behavior. Our optimal approach involved (Table 4):

  1. Programmatically selecting a specific medical condition

  2. Providing explicit symptoms associated with the condition

  3. Instructing the model to generate its own natural descriptions of these symptoms

  4. Including demographic information to customize patient presentation

Table 4.

Examples of prompt engineering techniques for patient simulation

Technique | Example prompt element | Effect on model output
Symptom naturalization | “Generate your own descriptions of these symptoms using everyday language. Do not use medical terminology.” | “I feel a squeezing pain in my chest” instead of “I experience substernal pressure”
Gradual disclosure | “Do not reveal all symptoms at once. Only mention new symptoms when specifically asked about them or when discussing related areas.” | Patient reveals shortness of breath only when asked about exercise tolerance
Response length control | “Keep your responses brief and concise, typically 1–3 sentences. Avoid lengthy explanations unless pressed for details.” | More realistic patient communication patterns with appropriately brief answers
Emotional calibration | “You are mildly anxious about your symptoms but trying to remain calm. Express moderate concern but not panic.” | Introduces appropriate emotional context without overacting
Demographic alignment | “You are a 68-year-old retired construction worker with limited medical knowledge.” | Adjusts vocabulary and health literacy level to match patient profile

This approach reduced instances of the model providing technically accurate but unnaturally formal medical descriptions, instead favoring more authentic patient language. For example, rather than stating “I am experiencing dyspnea on exertion,” the model would respond with “I get really short of breath when I walk up stairs or try to hurry.”

We found that controlling symptom disclosure was particularly important for creating realistic interviews. Without specific instructions, models tended to be overly cooperative, immediately revealing all symptoms when asked general questions. By explicitly instructing the model to withhold certain information until specifically asked or until rapport was established, we created more challenging and realistic interview scenarios that better simulated actual patient behavior.

The commercial models occasionally exhibited undesired behaviors, such as revealing the simulated condition prematurely or attempting to switch roles and act as the doctor. However, these issues occurred infrequently and could generally be addressed through improved prompting. We observed that GPT-4 exhibited fewer instances of these problematic behaviors compared to GPT-3.5, suggesting that more sophisticated models may require less explicit control to maintain appropriate simulation parameters.

Parameter tuning experiments revealed that temperature settings between 0.6 and 0.8 provided the best balance between consistency and variability in responses. Lower temperature settings (below 0.6) resulted in overly predictable and sometimes repetitive answers, while higher settings (above 0.8) occasionally produced implausible or contradictory symptoms. The presence penalty parameter was particularly useful in preventing symptom repetition during longer conversations.

Student evaluation results

All eight interviewed medical students had prior experience with commercial LLMs, primarily using them for study assistance. They rated the quality of LLM outputs as average to good (mean rating between “average” and “good” on a 5-point scale) and found instructing the models to be relatively easy. Student ratings for specific application features are presented in Fig. 1.

Fig. 1. Student ratings of application features (rated on a 4-point scale from 1 – “Not needed” to 4 – “Must have”)

Students reported primarily using LLMs for examination preparation, finding answers to difficult questions, and seeking explanations of complex concepts. However, several noted concerns about the reliability of medical information provided by these models, with one student explicitly mentioning abandoning LLMs for exam preparation due to accuracy issues. This highlights an important consideration for medical education applications: students are aware of potential limitations in LLM accuracy and approach these tools with appropriate skepticism.

Half of the students had previously attempted to use LLMs to simulate patient interactions, but most reported difficulties in achieving natural conversations, with models often listing all symptoms in the initial response rather than revealing them gradually through conversation. Only one student reported successfully implementing a patient simulation independently, and even this student described the conversation quality as merely “mediocre.”

The frequency of LLM usage varied among students, with most reporting monthly usage and noting increased utilization during examination periods. This pattern suggests that medical students view these tools primarily as supplementary resources rather than core learning platforms, accessing them when specific needs arise rather than as part of routine study practices.

After using our application, students rated the conversation quality positively (mean rating of “good” on a 5-point scale). They noted several advantages compared to traditional peer role-playing:

  1. Availability: The ability to practice at any time without requiring a partner

  2. Consistency: The model maintained appropriate symptoms throughout the conversation without forgetting details

  3. Seriousness: Unlike peer role-playing which sometimes includes humorous elements, the simulated patients maintained a consistent, serious tone

  4. Diagnostic practice: The random condition mode allowed students to practice both interview skills and diagnostic reasoning

One student specifically commented: “When I practice with other students, they often forget what symptoms they’re supposed to have or add ridiculous symptoms as a joke. This system stays focused on realistic symptoms throughout the conversation, which is much better for practice.”

Students identified limitations as well:

  1. Lack of non-verbal communication: The text-based interface couldn’t simulate body language or physical symptoms

  2. Occasionally unnatural responses: The model sometimes provided more detailed or technical descriptions than typical patients would

  3. No feedback mechanism: Unlike instructor-led sessions, the application didn’t provide assessment of interview technique

Several students specifically mentioned that while the application was helpful for practicing the structure and content of medical interviews, it couldn’t replace the interpersonal aspects of patient communication that require face-to-face interaction. As one student explained: “Learning to read a patient’s body language and emotional cues is a huge part of being a good doctor, and you just can’t practice that with a text chat.”

When asked to rate potential features, students ranked “selecting a random condition” as most important (rated “must have” by 75% of students), followed by “withholding information until persuaded” and “simulation of physical examination.” The strong preference for randomized conditions highlights the value students place on diagnostic challenge within interview practice. The high rating for information withholding suggests students recognize that real patients often don’t immediately disclose all relevant information and want to practice elicitation techniques.

Regarding specific features, students provided detailed feedback:

  • For medical history generation, students suggested that while comprehensive history was valuable, it should remain proportional to the presenting complaint rather than overwhelmingly detailed for every case.

  • For emotion simulation, students expressed skepticism about conveying emotions effectively through text, suggesting this might be better addressed through voice or video interfaces in future iterations.

  • For physical examination simulation, students proposed implementing specific commands or structured inputs that would trigger descriptive outputs of examination findings.

All students agreed that while the application provided valuable supplementary training, it could not replace interactions with real patients. They predicted that such applications would be increasingly integrated into medical education but would serve as complements to rather than replacements for traditional methods. Several students envisioned the system as particularly valuable for first and second-year students developing initial interview skills before clinical rotations.

Discussion

Our study demonstrates that Large Language Models can effectively contribute to medical interview training when implemented appropriately. The positive reception from medical students suggests real potential for practical application in educational settings.

Model selection and implementation

Our findings indicate that commercially available models currently offer superior performance for patient simulation compared to smaller open-source models, even with fine-tuning. This aligns with research by Nori et al. [25], who found that model performance on complex tasks correlates strongly with parameter count. For educational institutions with limited computational resources, commercial APIs may represent a more practical implementation path despite ongoing costs.

The gap between commercial and open-source model performance observed in our study reflects the current state of AI development, where the most capable models remain behind commercial interfaces. However, the rapid advancement of open-source alternatives suggests this gap may narrow over time. The emergence of models like Mistral [26] and MPT [27] with performance approaching commercial counterparts at lower parameter counts indicates a trajectory toward more accessible high-performance models.

The critical factor in successful implementation was effective prompt engineering rather than model fine-tuning. This suggests that expertise in crafting appropriate prompts may be more valuable than technical capabilities in training custom models for this application. Our experience aligns with findings by Liu et al. [28], who demonstrated that well-designed prompts can elicit specific behaviors from language models without requiring additional training.

For organizations implementing such systems, this has important practical implications. Resources might be better directed toward developing expertise in prompt engineering and condition-symptom knowledge bases rather than investing in costly computational infrastructure for fine-tuning. This approach also offers greater flexibility, as prompts can be rapidly adjusted to address emerging issues or incorporate new medical knowledge without requiring model retraining.

Educational value and limitations

Important methodological note: Our evaluation primarily assessed student perceptions of educational value rather than measuring objective learning outcomes. While students reported positive experiences and identified specific advantages of the system, we must clearly distinguish between user satisfaction and demonstrated educational impact. Student enjoyment or perceived usefulness does not automatically translate to improved clinical skills or better patient care. A rigorous assessment of educational value would require controlled comparative studies with performance-based assessments (e.g., OSCE scores, real patient interaction evaluations, pre/post skill measurements) rather than relying solely on self-reported perceptions. Our findings should therefore be interpreted as preliminary evidence of perceived utility and user acceptance, which are necessary but not sufficient conditions for establishing educational effectiveness. We recommend this as a critical direction for future research.

Students’ evaluation of the application revealed a clear niche in medical education. The ability to practice independently, at any time, addresses a significant limitation of traditional training methods that require coordinated scheduling between students or with standardized patients. This aligns with findings from Kononowicz et al. [12], who identified flexibility and repeatability as key advantages of virtual patient systems.

The consistent maintenance of symptoms throughout conversations represents a particular advantage over peer role-playing. In traditional role-playing exercises, student actors may forget details or inadvertently introduce inconsistencies, potentially teaching inappropriate clinical reasoning patterns. LLM-based simulations maintain coherent symptom presentations, providing more reliable learning experiences for developing clinical reasoning skills.

However, students emphasized the irreplaceability of human interaction in medical training, particularly for developing non-verbal communication skills and handling emotional aspects of patient care. This limitation is consistent with previous research by Hubal et al. [10], who noted that text-based virtual patients cannot adequately simulate the full spectrum of patient-doctor communication. Medical interviewing encompasses both information gathering and relationship building [15], and our current implementation primarily addresses the former.

The identified advantages of consistency in symptom presentation and seriousness of interaction deserve particular attention. Traditional peer role-playing often suffers from inconsistency, with student actors forgetting details or treating the exercise lightheartedly. LLM-based simulations maintain consistency throughout the conversation, potentially offering more realistic practice for clinical reasoning skills.

Importantly, students recognized appropriate boundaries for technology-assisted learning rather than viewing it as a replacement for existing methods. This nuanced perspective suggests that medical students can effectively integrate such tools into their educational ecosystem when they clearly understand the strengths and limitations of each approach. Educational implementations should therefore explicitly position LLM-based training as complementary to rather than replacing traditional methods.

Comparison with previous approaches

Earlier virtual patient systems typically relied on branching scenarios with predetermined response paths [14]. The adaptive nature of modern LLMs represents a significant advancement, allowing for natural conversation flow without artificial constraints. Students specifically noted this advantage when comparing our application to previous computer-based training programs they had encountered.

Traditional branching-narrative virtual patients require significant development effort for each clinical scenario, with authors needing to anticipate potential questions and provide appropriate responses for each branch. This results in systems that, while potentially high quality, are limited in scope and costly to expand. By contrast, LLM-based approaches can generalize across various conditions with minimal condition-specific engineering, requiring only accurate symptom descriptions rather than comprehensive dialogue scripting.

Unlike previous systems that frequently required substantial development for each clinical scenario, our approach allows rapid deployment of new conditions by simply updating the symptom database. This scalability could facilitate coverage of rare conditions that might otherwise be difficult to practice with standardized patients.

The technical architecture of our implementation also differs from many previous virtual patient systems by separating the conversation engine (the LLM) from the medical knowledge base (the condition and symptom database). This separation of concerns allows independent updating of medical content without modifying the underlying conversation system, potentially facilitating maintenance and clinical accuracy over time.

Alignment with medical education competencies

To understand how LLM-based interview simulation fits within established medical education frameworks, we mapped our system capabilities to core clinical communication competencies as defined by the Association of American Medical Colleges (AAMC) and similar international organizations (Table 5).

Table 5.

Mapping of tool capabilities to medical education competencies

Clinical competency | Current system support | Educational value
Information gathering | Strong - Natural question-answer format | Allows practice of history-taking sequence and follow-up questioning
Building patient relationship | Partial - Text-based empathetic responses | Supports practicing rapport building language without non-verbal cues
Patient education | Partial - Can simulate patient questions | Provides opportunities to practice explanation skills
Shared decision making | Limited - Basic response to treatment options | Initial practice for presenting options and eliciting preferences
Cultural competence | Configurable - Patient demographics can be adjusted | Enables exposure to diverse patient backgrounds
Clinical reasoning | Strong - Random condition mode challenges diagnostic skills | Cultivates pattern recognition and hypothesis testing
Non-verbal communication | Not supported | Requires complementary training methods
Documentation | Not integrated | Potential for future development

This mapping demonstrates that while LLM-based simulation effectively supports some core competencies (particularly information gathering and clinical reasoning), it has limitations in areas requiring physical presence or non-verbal communication. This reinforces our finding that such tools should complement rather than replace traditional training methods involving human interaction.

Safety concerns and limitations of LLMs in medical education

Critical consideration: The application of LLMs in medical education raises significant safety and accuracy concerns that must be carefully addressed before widespread deployment.

Hallucination and medical misinformation: LLMs are prone to generating plausible-sounding but factually incorrect information—a phenomenon known as “hallucination” [29]. In medical contexts, this is particularly concerning as students may internalize incorrect symptom presentations, inappropriate clinical reasoning, or inaccurate medical knowledge. For example, an LLM might generate atypical symptom combinations that do not reflect actual clinical presentations, or suggest inappropriate temporal relationships between symptoms. To mitigate this risk in our implementation, we explicitly provided symptom lists and condition descriptions rather than relying solely on the model’s internal medical knowledge. However, even with this approach, models may occasionally generate responses inconsistent with provided information or medical reality. Educational implementations must include clear warnings about this limitation and emphasize that LLM-generated content should not be trusted as authoritative medical information.

Bias and representation: LLMs trained on internet-scale datasets may encode societal biases related to gender, race, socioeconomic status, and other demographic factors [30]. In patient simulation, this could manifest as stereotypical or inappropriate portrayals of certain patient populations (e.g., assuming certain conditions are more likely in specific demographic groups, or using language that reflects cultural biases). Our system allows demographic configuration, but the model’s underlying biases may influence response generation in subtle ways. Thorough evaluation by diverse medical educators and ongoing monitoring for inappropriate content is essential. Future work should specifically assess whether LLM-generated patient personas reinforce or challenge clinical biases.

Dense multi-turn interactions: Extended conversations increase the likelihood of error accumulation, where early inaccuracies compound through subsequent turns [31]. A model might initially generate a minor inconsistency (e.g., describing pain as “sharp” then later as “dull”), which can cascade into increasingly implausible symptom descriptions. While our implementation maintains conversation context to promote coherence, we did not implement explicit consistency checking mechanisms. Students using the system for extended practice sessions should be aware that longer conversations may exhibit degraded accuracy.

Lack of verified medical grounding: Unlike curated educational materials reviewed by medical experts, LLM outputs lack systematic verification. While our approach of explicit symptom definition provides some grounding, the model’s natural language generation can deviate from specified parameters. Implementing Retrieval-Augmented Generation (RAG) could help by retrieving verified medical information from trusted sources (e.g., medical textbooks, clinical guidelines) to ground model responses, though this adds implementation complexity and does not completely eliminate hallucination risk.
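
As a rough illustration of the RAG idea, the sketch below retrieves expert-reviewed symptom descriptions from a curated store and prepends them to the system prompt; the knowledge base and function names are hypothetical placeholders, not part of the study system.

```python
# Hedged sketch of retrieval-augmented grounding: verified, expert-reviewed
# facts are retrieved per condition and injected into the system prompt.
# VERIFIED_KB and all names here are hypothetical.
VERIFIED_KB = {
    "stable_angina": [
        "Chest discomfort is typically provoked by exertion and relieved by rest.",
        "Episodes usually last only a few minutes.",
    ],
}

def grounded_system_prompt(condition: str, base_prompt: str) -> str:
    """Prepend verified facts so generation stays consistent with them."""
    facts = VERIFIED_KB.get(condition, [])
    grounding = "\n".join(f"- {fact}" for fact in facts)
    return (
        f"{base_prompt}\n\n"
        f"Stay strictly consistent with these verified facts:\n{grounding}"
    )
```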

Pedagogical implications: Instructors implementing such systems should:

  • Explicitly teach students to critically evaluate AI-generated content

  • Emphasize that LLM simulations are supplementary tools, not authoritative medical references

  • Implement expert review of sample conversations to identify problematic patterns

  • Provide feedback mechanisms for students to report unusual or concerning responses

  • Consider implementing automated checks for medical plausibility (though this remains an open research challenge; a minimal sketch follows this list)
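As a starting point for the automated plausibility check mentioned in the last item, the following sketch flags responses that mention symptoms absent from the configured case. The symptom vocabulary is a hypothetical stand-in for a proper clinical ontology.

    # Minimal sketch: flag model responses that mention symptoms the
    # configured case does not define. Vocabularies are illustrative only.
    KNOWN_SYMPTOMS = {"cough", "fever", "chest pain", "shortness of breath"}
    CASE_SYMPTOMS = {"cough", "fever"}

    def flag_ungrounded_symptoms(response: str) -> set[str]:
        """Return symptoms the model mentioned that the case does not define."""
        text = response.lower()
        mentioned = {s for s in KNOWN_SYMPTOMS if s in text}
        return mentioned - CASE_SYMPTOMS

    # "chest pain" is flagged: the configured case defines only cough and fever.
    print(flag_ungrounded_symptoms("I have a cough and some chest pain."))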

Despite these limitations, we believe that with appropriate safeguards, critical user education, and positioning as supplementary rather than primary learning tools, LLM-based patient simulations can provide value while managing risks. The key is transparency about limitations and integration within educational contexts that emphasize critical thinking rather than rote acceptance of system outputs.

Future directions

Several promising directions for future development emerged from our study:

  • Multimodal integration: Incorporating visual elements to simulate physical examination findings and non-verbal communication would address a key limitation of the current text-only approach. Emerging multimodal LLMs capable of processing and generating both text and images [32] could enable systems that present visual examination findings in response to specific examination commands.

  • Automated feedback: Implementing assessment of interview technique based on clinical guidelines could provide valuable learning reinforcement without instructor involvement. This could involve analyzing conversation transcripts to identify whether students covered essential history elements, used appropriate questioning techniques, and followed recommended interview structures.

  • Emotional simulation: Enhancing the models’ ability to simulate patient emotions and reluctance to disclose sensitive information would increase realism. While current text-only interfaces limit emotional expression, more sophisticated prompt engineering could simulate patients with varying emotional states, communication styles, and disclosure behaviors.

  • Integration with medical records: Connecting the system with simulated electronic health records would allow practice with comprehensive documentation skills. This integration would bridge the gap between interview practice and documentation training, two closely related skills that are often taught separately in medical curricula.

  • Larger open-source models: Exploring larger parameter models (13B+ parameters) might yield better results for institutions preferring self-hosted solutions. The rapid advancement of efficient inference techniques and quantization methods could make larger models increasingly accessible to educational institutions with moderate computational resources.

  • Post-interview feedback mechanism: Implementing automated analysis of interview completeness, identifying potentially missed questions or areas for improvement based on clinical guidelines (see the sketch after this list).

  • Voice interaction: Adding text-to-speech and speech-to-text capabilities would enhance accessibility and provide a more natural interaction format, potentially addressing some limitations of text-only communication.

  • Difficulty levels: Creating graduated difficulty settings where patients at higher levels are less forthcoming with information, requiring more sophisticated communication techniques to elicit complete histories.
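To illustrate the post-interview feedback mechanism listed above, the sketch below checks a transcript of student questions against a checklist of essential history elements and reports what was missed. The checklist and cue phrases are illustrative assumptions, not a validated clinical guideline.

    # Minimal sketch: keyword-based completeness feedback on a transcript.
    # Checklist and cue phrases are illustrative assumptions.
    HISTORY_CHECKLIST = {
        "onset": ["when did", "how long", "since when"],
        "severity": ["how bad", "scale of", "severity"],
        "associated symptoms": ["anything else", "other symptoms"],
        "medications": ["medication", "taking anything"],
        "allergies": ["allerg"],
    }

    def completeness_report(student_turns: list[str]) -> dict[str, bool]:
        """Mark a checklist element as covered if any of its cue phrases
        appears anywhere in the student's questions."""
        transcript = " ".join(t.lower() for t in student_turns)
        return {element: any(cue in transcript for cue in cues)
                for element, cues in HISTORY_CHECKLIST.items()}

    missed = [e for e, covered in completeness_report(
        ["When did the cough start?", "Are you taking anything for it?"]
    ).items() if not covered]
    print("Consider asking about:", ", ".join(missed))

A deployed version would likely use an LLM or classifier rather than cue phrases, but even this simple form shows how guideline-based feedback can be generated without instructor involvement.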

Additionally, student feedback suggested aligning the tool with specific learning objectives in medical curricula. Mapping system capabilities to standard medical education competencies would help integrate this technology into formal educational programs, potentially creating a structured progression from text-based simulations to more complex scenarios as students advance through their training. A comprehensive research roadmap outlining these future directions is presented in Fig. 2.

Fig. 2. Future research roadmap for LLM-based medical interview training

These potential enhancements reflect an iterative approach to educational technology development, building on established strengths while addressing identified limitations. Importantly, any technical advancements should be guided by educational principles rather than implementing technology for its own sake.

Ethical considerations

The application of LLMs in medical education raises several important ethical considerations that we carefully addressed in our study design and implementation (Table 6):

Table 6. Ethical considerations in LLM-based medical education (concern: mitigation approach)

  • Accuracy of medical information: symptoms and conditions were explicitly defined rather than relying solely on model knowledge; expert review of system outputs during development
  • Student misunderstanding: clear communication about system limitations; emphasis on supplementary rather than replacement role
  • Participant privacy: no personally identifiable information collected; all feedback anonymized; verbal consent obtained
  • Educational inequality: designed for accessibility across devices; low technical requirements for student users
  • Replacement of human training: explicit positioning as complement to, not replacement for, human interaction; emphasis on unique benefits of traditional training
  • Over-reliance on technology: discussion of limitations with participants; emphasis on critical thinking when using AI tools
We emphasized to students that while LLM-based training offers unique advantages, it represents a complement to, not a replacement for, human interaction in medical education. By providing clear information about system capabilities and limitations, we encouraged appropriate usage within a broader educational context. All research participants provided informed verbal consent, and the study design was reviewed and approved by the Jagiellonian University Research Ethics Committee.

Limitations of the study

Our study has several important limitations that should be acknowledged and contextualized:

Sample size and generalizability: The relatively small sample of medical students (n = 8) from a single institution significantly limits generalizability. While appropriate for exploratory qualitative research (as discussed in the Methods section), this sample size is insufficient for statistical inference or broad conclusions about medical student populations. The pronounced gender imbalance (7 male, 1 female) may have introduced bias, though interview skills development is not known to be strongly gender-associated. Geographic and institutional specificity (all participants from Jagiellonian University Medical College in Poland) means findings may not transfer to different educational contexts, cultures, or healthcare systems.

Limited participant experience: All participants were first or second-year medical students with limited exposure to real patients. This creates a significant validity concern: students with minimal real-world clinical experience may not be well-positioned to judge the realism or educational value of simulated patients compared to actual patient encounters. Their feedback reflects novice perceptions rather than expert assessment. More experienced medical students or practicing clinicians might identify significant deficiencies that early-year students cannot recognize.

Single-exposure limitation: Participants used the system only once during the evaluation interview. This single-exposure design cannot assess skill development over time, learning curves, or sustained engagement—critical factors for educational technologies. We cannot determine whether repeated use leads to improved interview skills, diminishing returns, or user fatigue. Claims about “skill development opportunities” must be understood as potential rather than demonstrated outcomes.

Absence of comparative data: We did not conduct controlled comparisons between LLM-based training and traditional methods (peer role-play, standardized patients, no additional practice). All comparative claims rely on students’ subjective impressions rather than empirical data. A rigorous study would include randomized groups with objective skill assessments (e.g., standardized OSCE evaluations) measured before and after exposure to different training modalities.

Dataset limitations: The training dataset’s heavy skew toward respiratory cases (78.7% of conversations) represents a significant limitation for clinical diversity. Models trained predominantly on respiratory presentations may not generalize well to cardiovascular, neurological, gastrointestinal, or other specialties. This respiratory focus limits the system’s applicability for comprehensive medical interview training across clinical domains. Future work requires more balanced datasets spanning diverse medical conditions, though such datasets are challenging to obtain.

Technical constraints: We used relatively small open-source models (7B parameters) due to computational constraints. These represent a fraction of current state-of-the-art capabilities (models now exceed 100B parameters). Larger models might demonstrate significantly improved performance, but resource limitations prevented their evaluation. Additionally, conversation truncation to 3300 tokens (affecting 23% of training data) may have excluded important interaction patterns from longer interviews.

Benchmarking limitations: Our use of general-purpose LLM benchmarks (MMLU, Hellaswag, etc.) does not directly evaluate medical interview capability, conversational authenticity, or clinical realism. These benchmarks served only to verify that fine-tuning did not degrade general capabilities, not to validate patient simulation quality. Medical communication-specific benchmarks would be more appropriate but remain underdeveloped in the field.

Statistical methods: Our application of Z-tests to benchmark scores assumes distributional properties that may not hold for deterministic model outputs. Treating standard errors as sampling variability may not be statistically appropriate when each model is scored once on a fixed item set. This represents a methodological limitation in our quantitative model comparison approach.
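For context, a standard pooled two-proportion Z-test over benchmark accuracies takes the form below (shown as a sketch; our exact test statistic may have differed):

    \[
      z = \frac{\hat{p}_1 - \hat{p}_2}
               {\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}},
      \qquad
      \hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}
    \]

where \(\hat{p}_1, \hat{p}_2\) are the two models’ benchmark accuracies and \(n_1, n_2\) the item counts. The implicit assumption that accuracy over a fixed item set behaves like a random sample of independent Bernoulli trials is exactly what may fail when outputs are deterministic.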

Lack of expert evaluation: We did not conduct systematic expert review of LLM-generated conversations by medical professionals. Medical faculty evaluation of symptom accuracy, clinical plausibility, and pedagogical appropriateness would strengthen claims about educational value but was beyond the scope of this initial exploratory study.

Evaluation focus: Our evaluation prioritized student perceptions (user experience, perceived utility) rather than objective learning outcomes (skill acquisition, clinical performance). We cannot claim that using this system improves medical interview skills—only that students found it potentially useful. Controlled longitudinal studies with pre/post assessments of actual clinical performance would be necessary to establish educational effectiveness.

Prompt engineering approach: Our prompt development was iterative and heuristic rather than systematic. Alternative prompting strategies might yield superior performance with identical models. The field of prompt engineering lacks formalized methodologies, making it difficult to claim optimality for any particular approach. More systematic prompt optimization (e.g., using automated prompt search methods) could improve results.

Generalizability of findings: Our finding that prompt engineering outperformed fine-tuning for patient simulation may not generalize beyond small (7B parameter) models or beyond our specific dataset and task formulation. Larger models, different medical domains, or alternative fine-tuning approaches might yield different conclusions.

These limitations collectively indicate that our findings should be interpreted as preliminary evidence supporting the potential utility of LLM-based patient simulation for supplementary medical education, with substantial further research needed before recommending widespread adoption. We emphasize the exploratory nature of this work and the need for more rigorous follow-up studies addressing these methodological constraints.

Conclusion

Our exploratory study provides preliminary evidence that large language models may serve as useful supplementary tools in medical interview training, though significant limitations and unanswered questions remain. Our key findings include:

  1. Commercial LLMs like GPT-4 can generate contextually appropriate patient responses with careful prompt engineering, demonstrating potential for patient simulation. Smaller open-source models (7B parameters) showed more limited capabilities, though larger open-source alternatives remain unexplored.

  2. Medical students (n=8, early-year, single institution) reported positive perceptions of LLM-simulated interviews and identified potential advantages regarding availability, consistency, and seriousness of interactions. However, these represent subjective user perceptions rather than demonstrated improvements in clinical skills or learning outcomes.

  3. We developed a functional web-based application implementing LLM-based patient simulation, demonstrating technical feasibility for educational deployment. However, real-world educational effectiveness remains unvalidated.

  4. In our limited evaluation, prompt engineering appeared more effective than parameter-efficient fine-tuning (QLoRA) of small models for patient simulation quality. This finding may not generalize to larger models, different datasets, or alternative fine-tuning approaches.

  5. Student feedback highlighted potentially valuable features (random condition selection, gradual information disclosure), suggesting design directions for future systems, though optimal feature sets require systematic evaluation.

The positive student reception suggests potential utility in educational contexts, though this must be validated through controlled studies with objective outcome measures. The availability of on-demand practice opportunities could address scheduling constraints in medical education, but effectiveness compared to traditional methods (standardized patients, peer role-play) remains empirically undemonstrated.

Our findings highlight the importance of viewing LLM-based training as complementary to rather than replacing traditional approaches. Students consistently emphasized that while the technology offered valuable practice opportunities, it could not substitute for human interaction in developing the full spectrum of clinical communication skills. This balanced perspective should guide implementation in educational settings, positioning such tools as supplements within a comprehensive skills development program.

The application of LLMs in medical education represents a specific case of a broader trend toward AI-augmented learning environments. Unlike earlier computer-aided instruction that often followed rigid, pre-programmed paths, modern LLMs enable more flexible, responsive interactions that adapt to learner inputs. This increased naturalism potentially bridges the gap between technological convenience and educational authenticity that has challenged previous simulation efforts.

Critical future work: Before recommending broader adoption, the following research is essential:

  • Controlled efficacy studies: Randomized trials comparing LLM-based practice against traditional methods with objective skill assessments (OSCE scores, expert-rated real patient interactions)

  • Longitudinal skill development: Multi-session studies tracking whether repeated LLM practice translates to measurable clinical improvement

  • Expert validation: Systematic review of LLM-generated conversations by medical faculty for accuracy, clinical plausibility, and pedagogical appropriateness

  • Safety monitoring: Assessment of hallucination frequency, medical misinformation, and bias in patient representations

  • Diverse clinical domains: Evaluation beyond respiratory-focused training data to establish generalizability

  • Comparative cost-effectiveness: Economic analysis versus standardized patient programs

While LLM-based simulation shows promise as a supplementary practice tool, it cannot replace supervised interactions with real patients, which remain essential for developing clinical judgment, empathy, and professional identity. Our findings suggest a potential role for on-demand practice that complements—but does not substitute for—traditional training methods.

As LLM technology advances, we anticipate opportunities to address current limitations through multimodal capabilities, improved consistency, and better medical grounding. However, the fundamental challenge of ensuring medical accuracy and pedagogical soundness will require ongoing attention. The field of medical education should approach these technologies with cautious optimism: embracing their accessibility and flexibility while maintaining rigorous standards for educational evidence and patient safety.

Future researchers should prioritize objective learning outcome assessment over user satisfaction, employ diverse and representative samples, conduct long-term follow-up studies, and systematically address safety concerns. Only through such rigorous investigation can we determine whether LLM-based patient simulation truly enhances medical education or merely provides an appealing but educationally ineffective technological novelty.

Acknowledgements

The authors would like to thank the medical students who participated in the evaluation interviews for their valuable insights and feedback.

Materials availability

Application source code is available from the corresponding author upon reasonable request.

Clinical trial number

Not applicable.

Authors' contributions

All authors contributed to all aspects of this study. Jarosław Hryszko took the lead in data analysis and manuscript preparation, whereas Artur Michałek primarily focused on study design and data collection. All authors were involved in the review process and approved the final version of the manuscript.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Data availability

The dataset of medical conversations used in this study is publicly available from Fareez et al. [17]. Other data generated during the current study are available from the corresponding author on reasonable request.

Code availability

Code for model training and application development is available from the corresponding author upon reasonable request.

Declarations

Ethics approval and consent to participate

This study was conducted in accordance with the Declaration of Helsinki. Ethical approval was obtained from the Jagiellonian University Research Ethics Committee. All participants provided verbal informed consent before participation.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Ayn C, Robinson L, Nason A, Lovas J. Determining recommendations for improvement of communication skills training in dental education: a scoping review. J Dent Educ. 2017;81(4):479–88. [DOI] [PubMed] [Google Scholar]
  • 2.Barrows HS. An overview of the uses of standardized patients for teaching and evaluating clinical skills. Acad Med. 1993;68(6):443–51. [DOI] [PubMed] [Google Scholar]
  • 3.Johnson H. A critical review of standardized patient examinations as part of the USMLE. AMA J Ethics. 2003;5(12):572–6. [DOI] [PubMed] [Google Scholar]
  • 4.Nestel D, Groom J, Eikeland-Husebø S, O’Donnell JM. Simulation for learning and teaching procedural skills: the state of the science. Simul Healthc. 2011;6(7):S10-3. [DOI] [PubMed] [Google Scholar]
  • 5.Benedict N. Virtual patients and problem-based learning in advanced therapeutics. Am J Pharm Educ. 2010;74(8):143. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Brown T, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P. Language models are few-shot learners. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin HT, editors. Advances in Neural Information Processing Systems, vol. 33. Neural Information Processing Systems Foundation; 2020. pp. 1877–901.
  • 7.Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A, et al. Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, editors. Advances in Neural Information Processing Systems, vol. 30. Neural Information Processing Systems Foundation; 2017.
  • 8.Denson JS, Abrahamson S. A computer-controlled patient simulator. JAMA. 1969;208(3):504–8. [PubMed] [Google Scholar]
  • 9.Lewis JS. EPS: An Epidemiologic Patient Simulator. Proceedings of the Annual Symposium on Computer Application in Medical Care. Washington DC: American Medical Informatics Association; 1981. pp. 670–674. https://pmc.ncbi.nlm.nih.gov/articles/PMC2581172/pdf/procascamc00021-0693.pdf.
  • 10.Hubal RC, Kizakevich PN, Guinn CI, Merino KD, West SL. The virtual standardized patient-simulated patient-practitioner dialog for patient interview training. In: Westwood JD, Hoffman HM, Mogel GT, Stredney D, editors. Medicine Meets Virtual Reality. IOS Press; 2000. p. 133–8. [PubMed] [Google Scholar]
  • 11.Cook DA, Erwin PJ, Triola MM. Computerized virtual patients in health professions education: a systematic review and meta-analysis. Acad Med. 2010;85(10):1589–602. [DOI] [PubMed] [Google Scholar]
  • 12.Kononowicz AA, Woodham LA, Edelbring S, Stathakarou N, Davies D, Saxena N, et al. Virtual patients-what are we talking about? A framework to classify the meanings of the term in healthcare education. BMC Med Educ. 2019;19(1):1–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Consorti F, Mancuso R, Nocioni M, Piccolo A. Efficacy of virtual patients in medical education: a meta-analysis of randomized studies. Comput Educ. 2012;59(3):1001–8. [Google Scholar]
  • 14.Huang G, Reynolds R, Candler C. Virtual patient simulation at US and Canadian medical schools. Acad Med. 2007;82(5):446–51. [DOI] [PubMed] [Google Scholar]
  • 15.Silverman J, Kurtz S, Draper J. Skills for communicating with patients. CRC Press; 2016. [Google Scholar]
  • 16.Cook DA, Overgaard J, Pankratz VS, Del Fiol G, Aakre CA. Virtual patients using large language models: scalable, contextualized simulation of clinician-patient dialogue with feedback. J Med Internet Res. 2025;27:e68486. 10.2196/68486. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Fareez F, Parikh T, Wavell C, Shahab S, Chevalier M, Good S, et al. A dataset of simulated patient-physician medical interviews with a focus on respiratory cases. Scientific Data. 2022;9(1):313. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, et al. Llama 2: Open foundation and fine-tuned chat models. 2023. arXiv preprint arXiv:2307.09288.
  • 19.Lang J, Guo Z, Huang S. A comprehensive study on quantization techniques for large language models. In: IEEE Computer Society, editor. 2024 4th International Conference on Artificial Intelligence, Robotics, and Communication. IEEE; 2024. pp. 224–231.
  • 20.Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: efficient finetuning of quantized LLMs. Adv Neural Inf Process Syst. 2023;36:10088–115. [Google Scholar]
  • 21.Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, et al. GPT-4 technical report. 2023. arXiv preprint arXiv:2303.08774.
  • 22.Creswell JW, Poth CN. Qualitative inquiry and research design: Choosing among five approaches. 4th ed. SAGE Publications; 2018. [Google Scholar]
  • 23.Patton MQ. Qualitative research & evaluation methods: Integrating theory and practice. 4th ed. Thousand Oaks, CA: Sage Publications; 2014. [Google Scholar]
  • 24.Hoffmann J, Borgeaud S, Mensch A, Buchatskaya E, Cai T, Rutherford E, et al. Training compute-optimal large language models. 2022. arXiv preprint arXiv:2203.15556.
  • 25.Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of gpt-4 on medical challenge problems. 2023. arXiv preprint arXiv:2303.13375.
  • 26.Thakkar H, Manimaran A. Comprehensive Examination of Instruction-Based Language Models: A Comparative Analysis of Mistral-7B and Llama-2-7B. In: IEEE Computational Science Technical Committee, editor. 2023 International Conference on Emerging Research in Computational Science. IEEE; 2023. pp. 1–6.
  • 27.MosaicML NLP Team. Introducing MPT-7B: a new standard for open-source, commercially usable LLMs. 2023. https://www.databricks.com/blog/mpt-7b. Accessed: 15-02-2026.
  • 28.Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G. Prompt engineering: A comprehensive guide to effective prompting in language models. 2023. arXiv preprint arXiv:2311.12787.
  • 29.Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of hallucination in natural language generation. ACM Comput Surv. 2023;55(12):1–38. [Google Scholar]
  • 30.Bender EM, Gebru T, McMillan-Major A, Shmitchell S. On the dangers of stochastic parrots: Can language models be too big? In: ACM, editor. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. ACM; 2021. pp. 610–623.
  • 31.Dziri N, Lu X, Sclar M, Li XL, Jian L, Lin BY, et al. Faith and fate: Limits of transformers on compositionality. 2023. arXiv preprint arXiv:2305.18654.
  • 32.Li J, Li D, Savarese S, Hoi S. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: Krause A, Globerson A, Hadfield S, van der Schaar M, editors. International Conference on Machine Learning. PMLR; 2023. p. 19730–42. [Google Scholar]
