JMIR Formative Research
. 2025 Nov 3;9:e80752. doi: 10.2196/80752

Quality Assessment of Large Language Model–Generated Medical Dialogue for Clinical Vignettes: Evaluation Study

Yasutaka Yanagita 1,, Daiki Yokokawa 1, Shiichi Ihara 1, Ryo Yoshida 1, Yoshihide Okano 1, Takanori Uehara 1
Editor: Amaryllis Mavragani
Reviewed by: Mohammad Al-Agil, Mohammad Amin Roshani, Shamnad Mohamed Shaffi, Somnath Banerjee, Xiaolong Liang
PMCID: PMC12624296  PMID: 41183323

Abstract

Background

Traditional clinical vignettes, though widely used in medical education, often focus on prototypical presentations; require substantial time and effort to develop; and fail to represent patient diversity, the complexity of clinical conditions, patients’ perspectives, and the dynamic nature of physician-patient interactions.

Objective

This study aimed to evaluate the quality of Japanese-language physician-patient dialogues produced by generative artificial intelligence (AI), focusing on their medical accuracy and overall appropriateness as medical interviews.

Methods

We created an AI prompt that included a specific clinical history and instructed the model to simulate a cooperative patient responding to the physician’s questions to generate a physician-patient dialogue. The target diseases were those covered by the Japanese National Medical Licensing Examination. Each dialogue consisted of 25 turns by the physician and 25 by the patient, reflecting the typical volume of conversation in Japanese outpatient settings. Three internists independently evaluated each generated dialogue using a 7-point Likert scale across 6 criteria: coherence of the conversation, medical accuracy of the patient’s responses, medical accuracy of the physician’s responses, content of the medical history, communication skills, and professionalism. In addition, a composite score for each dialogue was calculated as the overall mean of these 6 criteria. Each dialogue was also examined for the presence of 5 essential clinical components commonly included in medical interviews: chief concern and clinical course since onset, physical findings, test results, diagnosis, and treatment course. A dialogue was considered to include a component only if all 3 evaluators independently confirmed its presence.

Results

The mean composite score was 5.7 (SD 1.0), indicating high overall quality. Mean scores for each criterion were as follows: coherence of the conversation, 5.9 (SD 0.9); medical accuracy of the patient’s responses, 6.0 (SD 0.9); medical accuracy of the physician’s responses, 5.6 (SD 1.1); content of medical history taking, 5.9 (SD 0.9); communication skills, 5.6 (SD 0.9); and professionalism, 5.5 (SD 1.1). Among the 5 clinical components assessed in each dialogue across 47 clinical cases, chief concern and clinical course were included in all 47 (100%) cases, physical findings in 15 (32%) cases, test results in 27 (57%) cases, diagnosis in 45 (96%) cases, and treatment course in 0 (0%) cases.

Conclusions

While physician oversight remains essential, it is feasible to efficiently create AI-generated educational materials for medical education that overcome the limitations of traditional clinical vignettes. This approach may reduce time and financial burdens, enhancing opportunities to practice clinical interviewing in settings that closely mirror real-world encounters.

Keywords: ChatGPT, clinical vignettes, artificial intelligence, generative AI, medical education, physician-patient dialogue

Introduction

Artificial intelligence (AI) is critical for the advancement of medicine. In particular, generative AI, exemplified by large language models (LLMs), has the potential to fundamentally transform health care. Tools such as ChatGPT, developed by OpenAI, possess advanced conversational capabilities and broad applicability [1], with an increasing number of applications in the medical field. From the perspective of conversational capabilities, generative AI–based chatbots can provide rehabilitation guidance and mental health support to patients and assist health care professionals in patient management [2]. Generative AI is versatile enough to have passed the Japanese National Medical Licensing Examination [3]. Moreover, LLMs are now used to enter patients’ medical histories and verify diagnostic accuracy in English and Japanese [4,5], indicating that the scope of generative AI applications is expected to expand further in the future.

The application of generative AI in medical education has attracted increasing interest. The use of generative AI to produce physician-patient dialogues has the potential to replicate disease-specific communication encountered in clinical practice and serve as a valuable educational resource for medical students.

A comparable and widely used educational tool in medical education research is the clinical vignette [6-8]. Vignettes are paper-based instructional resources that present a patient’s medical history and key clinical information in an organized written format. They are considered highly effective in helping medical students understand disease concepts, symptomatology, and treatment pathways [9].

Exposure to various vignettes can deepen students’ clinical understanding. However, traditional vignettes often lack diversity in case scenarios and tend to focus narrowly on diagnostic information, thereby failing to capture the complexity, ambiguity, and uncertainty that characterize real-world clinical practice [10]. In addition, the development of vignettes requires substantial time and effort. As clinical vignettes are generally written from the health care provider’s perspective, they inherently fail to capture the patient’s viewpoint and the dynamics of physician-patient interaction. Using generative AI such as ChatGPT to create dialogue-based educational materials that incorporate these missing elements is considered a promising approach. These educational materials offer several advantages. First, ChatGPT can generate a wide variety of dialogues tailored to different diseases and clinical scenarios, enabling repeated practice. Its use can reduce both the time and financial costs associated with producing educational content. Moreover, medical students can engage in clinical reasoning by organizing information while considering multiple differential diagnoses and can practice effective physician-patient communication, as would be required in actual clinical settings.

AI-generated dialogues incorporate the medical information necessary for diagnostic decision-making along with a variety of interactions that closely resemble those encountered in real clinical settings. This enables medical students to develop information-processing skills in authentic, context-rich scenarios. Through this process, students are expected to acquire the ability to identify clinically relevant information and develop a broader range of clinical competencies, including medical interviewing techniques, patient management skills, and effective communication strategies.

This study examined the feasibility of generating physician-patient dialogues using generative AI and evaluated the quality of the generated dialogues. Generative AI has already demonstrated the ability to generate differential diagnosis lists, illness scripts, and clinical vignettes based on medical information [11,12]. Accordingly, it is considered feasible to generate physician-patient dialogues with a high level of accuracy using generative AI.

In this study, we assessed the quality of dialogues generated in Japanese medical interviews, including their medical accuracy, and investigated their potential utility as educational materials in the context of medical education. By demonstrating the feasibility of easily generating diverse physician-patient dialogues using generative AI, it becomes possible to efficiently produce a large volume of educational content for use in medical education under the supervision of physicians with medical expertise. The appropriate application of AI technology is expected to increase the opportunities for medical students to engage with more practical and interactive content, thereby enhancing the effectiveness of medical education.

Methods

Overview

Physician-patient dialogues were generated using an LLM and were subjected to cross-sectional evaluation. To ensure relevance to medical education, 47 target diseases were selected to represent diseases that medical students were expected to learn. The selection was based on the content of the Japanese National Medical Licensing Examination [13] and finalized through discussion between a board-certified internist (YY) and a board-certified family physician (DY; Textbox 1). To evaluate the generated dialogues, we recruited 3 Japanese internists (SI, RY, and YO) who are actively involved in diagnostic practice at a university hospital and regularly manage a wide range of diagnostically challenging cases. These internists assessed the quality of the dialogues.

List of 47 diseases used to generate and evaluate artificial intelligence–generated physician-patient dialogues.

Diseases

  1. Gastroesophageal reflux disease

  2. Functional dyspepsia

  3. Esophageal achalasia

  4. Crohn disease

  5. Appendicitis

  6. Pheochromocytoma

  7. Primary aldosteronism

  8. Cushing syndrome

  9. Hashimoto disease

  10. Subacute thyroiditis

  11. Graves disease

  12. Depression

  13. Panic disorder

  14. Gout

  15. Pneumothorax

  16. Myasthenia gravis

  17. Trigeminal neuralgia

  18. Parkinson disease

  19. Alzheimer disease

  20. Benign paroxysmal positional vertigo

  21. Migraine

  22. Cluster headache

  23. Neurocardiogenic syncope

  24. Intervertebral disk herniation

  25. Insomnia

  26. Bronchial asthma

  27. Acute eosinophilic pneumonia

  28. Rheumatoid arthritis

  29. Systemic lupus erythematosus

  30. Sjögren syndrome

  31. Measles

  32. Fibromyalgia

  33. Gallstone attack

  34. Acute cholecystitis

  35. Unstable angina

  36. COVID-19

  37. Lumbar spinal stenosis

  38. Thromboangiitis obliterans

  39. Infectious mononucleosis

  40. Streptococcal pharyngitis

  41. Hepatitis

  42. Transient ischemic attack

  43. Iron deficiency anemia

  44. Heatstroke

  45. Acute pericarditis

  46. Chronic obstructive pulmonary disease

  47. Lung cancer

Production Environment

The analysis was performed on a terminal running Ubuntu (version 20.04.2 Long-Term Support; Canonical). The central processing unit was equipped with an AMD EPYC 7402P 24-core processor (Advanced Micro Devices, Inc), with 256 GB of main memory. An NVIDIA A100 graphics processing unit (NVIDIA Corp) with 40 GB of RAM was used for the computations. Python (version 3.6.9) and the OpenAI Python library (version 0.27.8; OpenAI) were used. The application programming interface used to generate the dialogues was the latest one available at the start of the study, gpt-4o-2024-11-20 [14], and the dialogues were generated on November 22, 2024.
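As a minimal sketch of this setup, a request could be assembled as below. The system prompt here is a hypothetical English paraphrase of the study's Japanese prompt (not the verbatim text), and the example vignette is invented for illustration; with the OpenAI Python library v0.27.x, the payload would be passed to `openai.ChatCompletion.create`.

```python
MODEL = "gpt-4o-2024-11-20"  # model version used in the study


def build_request(vignette: str, turns: int = 25) -> dict:
    """Assemble the chat-completion request payload for one dialogue.

    The instruction text is a hypothetical paraphrase of the study's
    Japanese prompt, not the prompt actually used.
    """
    system = (
        "Using the clinical vignette below, simulate a cooperative patient "
        "who avoids medical terminology and answers the physician's "
        f"questions. Generate a physician-patient dialogue with {turns} "
        "turns each for the physician and the patient.\n\n"
        f"Vignette:\n{vignette}"
    )
    return {
        "model": MODEL,
        "messages": [{"role": "system", "content": system}],
    }


# With the OpenAI Python library v0.27.x, the call would look like:
#   import openai
#   response = openai.ChatCompletion.create(**build_request(vignette))
#   dialogue = response["choices"][0]["message"]["content"]

request = build_request("A 45-year-old man reports burning retrosternal pain after meals.")
```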

Prompts to Be Entered Into the LLM

Referring to previous studies that used AI to generate vignettes and illness scripts [12], concise instructions specifying the desired output conditions were provided to the LLM as prompts (Figure 1). To enhance the completeness and coherence of the generated dialogues, the prompts included conditions such as the user supplying a specific vignette before the generation, refraining from using medical terminology, and responding cooperatively to the physician’s questions. The dialogue length was determined with reference to the average volume of conversation in typical outpatient consultations (approximately 5-10 min) [15], and the LLM was instructed to generate 25 turns each from the physician and the patient.

Figure 1.


A portion of the prompt used to generate physician-patient dialogues in Japanese using generative artificial intelligence (Top: Japanese, Bottom: English).

Evaluation of Dialogue

The presence of 5 key components commonly included in medical interviews was assessed: (1) chief concern and clinical course since onset, (2) physical findings, (3) test results, (4) diagnosis, and (5) treatment course. Three physicians independently reviewed each dialogue, and a component was considered present in a dialogue only if all 3 physicians confirmed its inclusion.
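The all-evaluators rule can be expressed concretely. The sketch below uses made-up judgments for a single dialogue (illustrative values only, not study data): a component counts as present only when every evaluator marked it present.

```python
COMPONENTS = [
    "chief concern and clinical course",
    "physical findings",
    "test results",
    "diagnosis",
    "treatment course",
]

# Per-evaluator judgments for one dialogue (illustrative, not study data).
# True means that evaluator judged the component to be present.
judgments = {
    "SI": [True, False, True, True, False],
    "RY": [True, True, True, True, False],
    "YO": [True, False, True, True, False],
}

# A component is counted only if all 3 evaluators confirmed its inclusion.
present = {
    component: all(evaluator[i] for evaluator in judgments.values())
    for i, component in enumerate(COMPONENTS)
}
```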

Next, criteria for assessing the quality of the medical interviews were developed. Six evaluation criteria were selected through discussions between a board-certified internist (YY) and a board-certified family physician (DY), with reference to the evaluation domains of the Mini-Clinical Evaluation Exercise (Mini-CEX) [16]: (1) coherence of the conversation, (2) medical accuracy of the patient’s statements, (3) medical accuracy of the physician’s statements, (4) quality of the physician’s history taking, (5) communication skills, and (6) professionalism.

Criterion 1, coherence of the conversation, was evaluated as a linguistic criterion, focusing on the smoothness of the interaction between the physician and patient, the presence of inconsistencies, grammatical or typographical errors, and the logical relationship of the dialogue. Criteria 2 and 3, medical accuracy of the patient’s and physician’s statements, respectively, were assessed as clinical criteria by evaluating their consistency with the known clinical features of the respective diseases. Criterion 4, history taking, was assessed based on whether the physician elicited information regarding symptom characteristics and exacerbating or relieving factors. Criteria 5 and 6, communication skills and professionalism, respectively, were evaluated by assessing whether the physician explored the patient’s explanatory model and demonstrated respect, compassion, and empathy toward the patient (Table 1).

Table 1.

Evaluation criteria and definitions used to assess artificial intelligence–generated physician-patient dialogues in medical interviews.

Evaluation criteria Evaluation details
Coherence of the conversation Assessed whether the dialogue between the physician and patient proceeded smoothly, with accurate grammar, no typographical or spelling errors, and overall linguistic clarity
Medical accuracy of the patient’s statements Evaluated whether the patient’s utterances accurately reflected the typical onset patterns, symptoms, and clinical course associated with the relevant disease
Medical accuracy of the physician’s statements Assessed whether the physician’s explanations and other statements were medically accurate and aligned with established clinical knowledge
Quality of the physician’s history taking Evaluated whether the physician asked about essential elements of the current illness, including symptom location, characteristics, severity, temporal course, contextual factors, aggravating and relieving factors, associated symptoms, and the patient’s response to the symptoms
Communication skills Assessed whether the physician conducted the interview in a way that facilitated open communication, explored the patient’s explanatory model and psychosocial context, and confirmed the patient’s understanding of the information discussed
Professionalism Evaluated whether the physician demonstrated respect, compassion, and empathy toward the patient and whether a trusting therapeutic relationship was established

Following prior studies [12], we specifically employed 3 Japanese physicians affiliated with a university hospital, each of whom was involved in the supervision and education of medical students and residents, to evaluate the generated dialogues. Each of the 6 evaluation criteria was rated on a 7-point Likert scale, based on the dialogues’ perceived educational usefulness for medical students. The scale was defined as follows: 1=not applicable at all or not useful—major overall revision required, 2=low usefulness—multiple major revisions needed, 3=limited usefulness—some valuable content but substantial revisions necessary, 4=moderate usefulness—both strengths and several areas for improvement, 5=generally useful—some revisions or adjustments desirable, 6=high usefulness—only minor adjustments possibly needed, and 7=extremely useful and complete—no further revisions required.

For each dialogue, the score of each evaluation criterion was calculated as the average of the ratings of the 3 evaluators. The composite score for the dialogue was then derived by averaging the scores across all 6 evaluation criteria and interpreted using the same 7-point Likert scale.
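As a worked example of this two-step averaging for a single dialogue (the ratings below are illustrative, not actual study data):

```python
from statistics import mean

# Ratings for one dialogue: evaluator -> list of scores for the 6 criteria
# (coherence, patient accuracy, physician accuracy, history taking,
# communication skills, professionalism). Illustrative values only.
ratings = {
    "SI": [6, 7, 5, 6, 5, 5],
    "RY": [6, 6, 6, 6, 6, 5],
    "YO": [5, 6, 5, 6, 5, 6],
}

# Step 1: per-criterion score = mean of the 3 evaluators' ratings
criterion_scores = [mean(ev[i] for ev in ratings.values()) for i in range(6)]

# Step 2: composite score = mean of the 6 criterion scores
composite = mean(criterion_scores)
```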

Ethical Considerations

This study did not involve human or animal participants, and therefore, ethics approval was not required.

Results

Using the gpt-4o-2024-11-20 model, physician-patient dialogues were generated for 47 clinical cases (Table 2). Among the 47 generated dialogues, clinical component 1, chief concern and clinical course since onset, was present in all 47 (100%) cases; clinical component 4, diagnosis, was included in 45 (96%) cases, and in each of these cases, the model accurately outputted the specified disease name, as instructed. In contrast, clinical component 2, physical findings, was included in 15 (32%) cases; clinical component 3, test results, was included in 27 (57%) cases; and clinical component 5, treatment course, was not included in any of the cases (0%). Regarding the quality of the medical interviews, the average score was 5.9 (SD 0.9) for coherence of the conversation, 6.0 (SD 0.9) for medical accuracy of the patient’s statements, and 5.6 (SD 1.1) for medical accuracy of the physician’s statements. The average score was 5.9 (SD 0.9) for quality of the physician’s history taking, 5.6 (SD 0.9) for communication skills, and 5.5 (SD 1.1) for professionalism. The overall composite score, calculated as the mean of the 6 evaluation criteria, was 5.7 (SD 1.0).

Table 2.

Average scores for each of the 6 evaluation criteria used to assess artificial intelligence–generated physician-patient dialogues.

Evaluation criteria Average score (SD)
Coherence of the conversation 5.9 (0.9)
Medical accuracy of the patient’s statements 6.0 (0.9)
Medical accuracy of the physician’s statements 5.6 (1.1)
Quality of the physician’s history taking 5.9 (0.9)
Communication skills 5.6 (0.9)
Professionalism 5.5 (1.1)

A focused discussion was conducted among 5 physicians: 2 specialists (YY and DY) and 3 evaluators (SI, RY, and YO). The discussion centered on dialogues that received point deductions, with the aim of identifying and clarifying the specific issues present in the lower-rated dialogues. The results of this analysis are summarized in Table 3. For reference, one dialogue with a perfect average score of 7 and another that received a lower average score of 4 are provided in Multimedia Appendix 1 and Multimedia Appendix 2, respectively.

Table 3.

Problems identified for each evaluation criterion based on expert review of lower-rated artificial intelligence–generated physician-patient dialogues.

Evaluation criteria Problems
Coherence of the conversation
  • Responses to patient questions were omitted.

  • Although some expressions were unnatural, overall coherence was maintained.

Medical accuracy of the patient’s statements
  • Typical responses regarding aggravating and relieving factors were not provided.

Medical accuracy of the physician’s statements
  • A diagnosis was rendered, despite the interview being insufficient.

  • Diagnoses were finalized based on test results that were never mentioned as having been performed, indicating inappropriate or unjustified diagnostic reasoning.

  • The physician asserted that an incurable disease could be cured.

Quality of the physician’s history taking
  • Redundant questions were asked regarding information that had already been provided.

  • When multiple symptoms were present, it was unclear which symptom’s aggravating or relieving factors were being discussed.

  • The structure and content of the medical interview were inadequate.

  • Additional questioning was conducted about symptoms not initially reported by the patient.

  • Unnecessary blood tests were included and described without justification.

Communication skills
  • No attempts were made to confirm the patient’s understanding of the information provided.

  • The patient’s explanatory model was rarely elicited.

  • The patient’s response lacked logical progression or flow.

Professionalism
  • No appropriate responses were provided following expressions of anxiety.

  • The overall focus of the dialogue was limited to diagnostic questioning, with few utterances directed toward building rapport or fostering a therapeutic relationship.

Discussion

Principal Findings

This study aimed to generate physician-patient dialogue using generative AI and evaluate the prompt designs, the resulting outputs, and the overall quality of the dialogues. We created 47 dialogues based on diseases from the Japanese National Medical Licensing Examination and evaluated them using 6 criteria related to clinical communication. The dialogues consistently included the chief concern and clinical course and demonstrated high medical accuracy and coherence, while areas such as professionalism and inclusion of treatment information were less consistently addressed. The overall composite score was 5.7 (SD 1.0), indicating general usefulness with minor revisions required.

The analysis first focused on the outputs generated for 47 clinical cases and the corresponding prompt designs that produced them. Examination of the medical content essential for a clinical interview revealed that all dialogues included descriptions of the chief concern and clinical course since onset and that accurate diagnostic labels were presented in 45 (96%) of the 47 cases. In contrast, none of the dialogues included information regarding the treatment course. In designing the prompts, the number of dialogue turns was set at 25 for each participant (50 in total), considering the average consultation time in Japanese outpatient settings, which is approximately 5 minutes. As an exploratory extension, we tested longer dialogues (50 and 100 turns per speaker). Although brief mentions of treatment began to appear, these dialogues became increasingly verbose, with redundant differential diagnoses and questioning. This indicated a decline in dialogue quality and educational effectiveness, highlighting the need to identify an optimal dialogue length in prompt design. Although the generation of information related to the treatment course may be improved through more sophisticated prompt engineering [17,18] or future updates to generative AI models [19], prior studies have demonstrated that diagnostic reasoning and treatment management involve fundamentally distinct cognitive processes [20]. Therefore, separating these processes into distinct dialogue scenarios may be more educationally effective than integrating them into a single prompt.

The quality of the generated dialogues, produced in Japanese, was evaluated based on 6 criteria. Regarding criterion 1, coherence of the conversation, the dialogues were generally smooth and grammatically correct. This reflects the high-level natural language processing capabilities of generative AI and suggests its potential to replicate physician-patient interactions at a basic level of linguistic fidelity within the context of Japanese-language medical interviews. Regarding criterion 2, medical accuracy of the patient’s statements, although there were instances in which the patient did not provide clear responses to the physician’s questions, particularly in certain disease contexts, such ambiguity may reflect the nature of real-world clinical encounters. From an educational perspective, these instances may be valuable for simulating authentic dialogue dynamics. Criterion 3, medical accuracy of the physician’s statements, received comparatively lower ratings than other criteria. This may be attributable to the perceived need for greater domain-specific precision, particularly when formulating clinical questions and providing medically accurate explanations to patients. This finding aligns with previous research on the limitations of generative AI in clinical reasoning [12]. Regarding criterion 4, quality of the physician’s history taking, in actual clinical practice, once the probability of a particular disease increases based on the patient’s narrative, physicians typically engage in further in-depth inquiry. Generative AI appears to struggle with appropriately weighing clinical information and identifying which elements warrant further exploration; thus, at present, AI may be limited in its capacity to conduct history taking guided by probabilistic diagnostic reasoning. 
Regarding criterion 5, communication skills, the dialogues showed limited use of verbal cues such as acknowledgments or responses that convey an understanding of the patient’s statements, and expressions of empathy toward patients’ concerns were insufficient. In addition, there were few attempts to elicit the patient’s explanatory model, which is a critical step toward building rapport. Prompt design that encourages empathic and interactive responses may help address these limitations in future development. Regarding criterion 6, professionalism, although no ethically inappropriate expressions were identified, the dialogues generally lacked explicit demonstrations of patient-centered attitudes, respect, or empathic concern.

It is important to note that, in face-to-face medical interviews, nonverbal cues, such as facial expressions and nodding, play a substantial role in promoting patient satisfaction and emotional attunement [21,22], and the absence of such elements constitutes a fundamental limitation of text-based dialogue evaluations. Moreover, a small number of dialogues contained clinically inappropriate content, such as failure to provide justification for a diagnosis or incorrect assertion that a typically incurable disease could be cured. However, no ethically problematic or professionally inappropriate statements were identified in the dataset. Taken together, these results, along with the overall composite score of 5.7 (SD 1.0), suggest that, while physician oversight and revision are essential, generative AI–based dialogues may serve as a valuable educational resource for teaching clinical communication skills to medical students and early-stage trainees.

Because of their underlying architecture, LLMs exhibit inherent output variability, and identical prompts do not consistently yield identical responses. Although this randomness poses challenges in terms of reproducibility and control, it offers opportunities for prompt-based modulation, whereby carefully designed prompts can elicit diverse and contextually appropriate outputs [23]. In this respect, the ability to generate a wide range of physician-patient dialogues represents a particularly compelling and pedagogically valuable feature.

In this study, to facilitate the acquisition of fundamental structures in history taking, the AI was instructed to assume the role of a cooperative patient who provided clear and responsive answers. However, the simulated patient need not be restricted to a cooperative profile. It is also feasible to generate dialogues featuring patients with diverse communicative behaviors, such as those who are angry, uncommunicative, or unable to articulate clearly because of their underlying health conditions.

On the basis of our findings, we suggest that physician-patient dialogues generated by generative AI can be developed into high-quality educational materials with relatively minimal effort, provided that particular attention is paid to the medical accuracy of the physician’s utterances and that supervising physicians revise the content as necessary. Given the capability of generative AI to rapidly produce medically relevant content, its application in medical education and clinical practice is expected to expand further in the coming years [24,25]. As demonstrated in previous studies, when generative AI is used in medical education [26], the ability of instructors to review and modify the generated content enables its use as a supplementary instructional resource. Considering the capacity of the model to generate dialogues tailored to a wide range of clinical scenarios, this approach, when used with appropriate oversight, could offer a highly flexible and scalable educational tool.

Limitations

This study had 3 primary limitations. First, the version of the generative AI used in this study was gpt-4o-2024-11-20, and the evaluation was based on the outputs generated on November 22, 2024. As the performance of generative AI models may evolve with future updates, periodic reevaluation is necessary. Moreover, as multiple LLMs are available, the quality and characteristics of dialogues generated by other models may differ from those evaluated in this study.

Second, there is currently no standardized method for prompt construction when interacting with generative AI, and the content of the input prompt can significantly influence the quality and nature of the output. In this study, the extent to which variations in the input conditions affect the generated dialogues was not examined systematically. Further investigations are warranted to clarify the impact of prompt design on output quality.

Third, this study was conducted in Japanese, and all prompts and generated dialogues were also in Japanese. Although prior research suggests a strong potential for adaptation to other languages, further validation is warranted.

Conclusions

In this study, physician-patient dialogues were generated using an LLM, that is, generative AI. These findings indicate that, although physician supervision remains essential, such dialogues can be easily developed into materials suitable for use in medical education. In particular, dialogues that incorporate the patient’s perspective and the interactive elements of physician-patient communication could provide practical learning experiences that are difficult to achieve with conventional educational resources, suggesting a wide range of possible applications in medical education.

Moreover, the use of generative AI enables the efficient and large-scale creation of diverse educational content. This represents a significant advantage in terms of reducing the substantial time and effort traditionally required to develop medical instructional materials. By using this approach, medical students are expected to have more opportunities to learn in contexts that closely resemble actual clinical practice, thereby facilitating the acquisition of more practical medical interview skills.

Acknowledgments

No external financial support or grants were received from any public, commercial, or not-for-profit entities for the research, authorship, or publication of this paper.

Abbreviations

AI

artificial intelligence

LLM

large language model

Mini-CEX

Mini-Clinical Evaluation Exercise

Multimedia Appendix 1

Dialogue with a perfect average score (case 21: migraine).

Multimedia Appendix 2

Dialogue with a lower average score of 4 (case 27: acute eosinophilic pneumonia).

Data Availability

The datasets generated or analyzed during this study are available from the corresponding author on reasonable request.

Footnotes

Authors' Contributions: YY, DY, and TU designed and coordinated the study. YY and DY carried out data analysis and interpretation. YY, DY, and TU drafted the manuscript. SI, RY, YO, and TU revised the manuscript for important intellectual content. All authors read and approved the final manuscript and take responsibility for all aspects of the work, ensuring that any questions regarding accuracy or integrity are appropriately investigated and resolved. This work was supported by JSPS KAKENHI (grant JP25K20499).

Conflicts of Interest: None declared.

References



Articles from JMIR Formative Research are provided here courtesy of JMIR Publications Inc.
