Abstract
Artificial intelligence (AI)-driven large language models (LLMs) hold potential for medical applications but face challenges, such as inaccurate or outdated training data. In this study, ZhongdaChat-ED, a personalized medical LLM integrating retrieval-augmented generation (RAG) technology, was developed to enhance erectile dysfunction (ED) counseling and clinical decision-making. The model was built using the open-source Deepseek-r1:32b framework, augmented with two specialized databases: a patient health consultation database and a clinical decision support database updated with real-time medical advancements. Two versions of ZhongdaChat-ED were developed: a Consumer Version for patient-facing health consultations and a Professional Version for clinician support. Performance was evaluated against four commonly used LLMs (ChatGPT4, Copilot, Claude, and Gemini) through simulated clinical consultations and case analyses. Three urologists and three patients assessed responses across various dimensions, including accuracy, human caring, ease of understanding, clinical significance, and informational frontier. The Consumer Version outperformed commonly used LLMs in accuracy (4.77/5), human caring (4.86/5), and ease of understanding (4.88/5) with all P < 0.001. The Professional Version demonstrated significantly higher clinical significance (>85.2% case score rate) and informational frontier scores (4.52/5) than those of other models (P < 0.001). ZhongdaChat-ED effectively addresses limitations of conventional LLMs by leveraging RAG to integrate real-time, domain-specific data. ZhongdaChat-ED shows promise in enhancing patient health consultation and clinician decision-making for ED, underscoring the value of tailored AI systems in bridging gaps between generalized AI and specialized medical needs. Future work should expand multimodal capabilities and cross-disciplinary integration to broaden clinical utility.
Keywords: artificial intelligence, clinical decision-making, erectile dysfunction, large language models, medical consultation, ZhongdaChat-ED
INTRODUCTION
Artificial intelligence (AI) refers to technology that simulates and executes human intelligence using computer systems or other types of machines, enabling tasks such as learning, reasoning, and natural language processing.1 Large language models (LLMs) are a representative AI technology used to address medical consultation issues, particularly in natural language processing (NLP) and clinical language comprehension tasks.2,3,4
LLMs show promise in patient health counseling and clinical decision support. Liu et al.5 proposed that ChatGPT can provide personalized recommendations on topics such as nutrition, exercise, and psychological support. Additionally, Zhu et al.6 suggested that LLMs can offer tailored clinical advice to patients with prostate cancer. However, in certain medical contexts, the readability and information quality of model outputs need to be improved, raising concerns regarding the reliability of health counseling information.7,8 Furthermore, owing to the limitations of outdated training data, general-purpose LLMs cannot achieve real-time synchronization with emerging medical advancements.9
Erectile dysfunction (ED) is defined as the inability of males to achieve or maintain an erection sufficient for sexual intercourse. Its incidence increases with age, and it is closely associated with various conditions, such as hypertension, cardiovascular disease, diabetes, and depression.10,11 Patients experiencing ED often feel embarrassed, misunderstood, and psychologically stressed. They may also fear treatment, have negative past experiences, and worry about the impact on their relationships and financial well-being, making them reluctant to seek medical assistance.12,13,14,15,16 While clinical guidelines offer evidence-based management strategies, barriers to accessing timely, personalized counseling persist, leaving many patients underserved. LLMs have demonstrated significant potential in health counseling and may provide a low-pressure avenue for patients to seek advice before visiting a hospital.17 However, during the consultation process, commonly used LLMs may generate misleading responses, called hallucinations (i.e., factually incorrect responses), when prompted on specific topics due to outdated or limited training data. Cocci et al.18 demonstrated that ChatGPT frequently generates “hallucinations” when addressing bladder cancer cases, such as offering treatment advice for kidney stones instead of bladder cancer. Untrue or outdated answers can prevent patients from receiving accurate and personalized medical advice, thereby delaying optimal treatment.19
To address these challenges, retrieval-augmented generation (RAG) technology has emerged as a transformative paradigm. RAG enhances LLMs by dynamically retrieving contextually relevant information from external databases during response generation, ensuring that outputs are grounded in authoritative, up-to-date evidence.20
The aim of our study was to develop a personalized, dual-functional medical LLM that integrates health consultation services for patients and clinical decision-making support for professionals. This model, named ZhongdaChat-ED, is based on a patient health consultation database and a clinical decision support database for ED, utilizing the latest open-source LLM, Deepseek-r1:32b, through RAG technology. We compared ZhongdaChat-ED with commonly used LLMs to evaluate its performance in addressing issues related to ED.
PARTICIPANTS AND METHODS
Model development
ZhongdaChat-ED (www.zhongdachat.com) was developed based on the open-source Ollama framework,21 utilizing a database of patient health counseling information for ED and a clinical decision support database of cutting-edge medical research. We trained and fine-tuned the open-source LLM Deepseek-r1:32b,22 which was then integrated into ZhongdaChat-ED using RAG technology. This integration involved two open-source tools, Open WebUI23 and Cpolar (https://www.cpolar.com/, last accessed on 2024 October 28), to facilitate the construction of a public webpage and model function (version) switching. Two versions of ZhongdaChat-ED were developed: Consumer Version (providing patient healthcare consulting services) and Professional Version (providing clinical decision-making assistance for urologists). The web interface is illustrated in Supplementary Figure 1 (1.6MB, tif) . The development process is illustrated in Figure 1a.
Figure 1.
Flowchart of (a) model development, and (b) model mechanism and model assessment. LLMs: large language models; ED: erectile dysfunction; RAG: retrieval-augmented generation.
Specialized database establishment
The patient health consultation database integrates three primary sources: the European Association of Urology (EAU) Guidelines on Sexual and Reproductive Health-2025,24 the MSD Manual Consumer Edition,25 and anonymized ED health consultation data from Chinese online treatment platforms (total size: 116 376 193 bytes, approximately 116 MB). Data processing involved the following steps.
Data cleaning involved duplicate removal via Python’s Pandas library, with manual review of redundant entries from overlapping platforms;26 error correction, standardizing inconsistent terminology (e.g., variant spellings of “erectile dysfunction”) and formatting discrepancies (e.g., date formats and character styles); and exclusion of incomplete entries (<2% of the dataset) lacking critical fields, such as consultations missing patient age or symptom duration. Data validation comprised source cross-verification, which validated guidelines against their original publications and platform data against clinician-reviewed cases for real-world alignment, and temporal exclusion of pre-2020 data to prioritize recent patient interactions. Standardization included mapping structured fields (e.g., symptoms and treatments) to SNOMED-CT using custom ontologies27 and tokenizing and categorizing free-text entries (e.g., patient queries) using spaCy NLP pipelines for uniformity.28
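The cleaning steps above can be sketched in a few lines of Pandas. This is a minimal illustration rather than the study’s actual pipeline; the column names, variant-term map, and toy records are hypothetical.

```python
import pandas as pd

# Hypothetical consultation records; the column names are illustrative only.
records = pd.DataFrame({
    "query": ["I have ED", "I have ED", "erection problems", "weak erections"],
    "diagnosis": ["erectile dysfunction", "erectile dysfunction", "ED", "impotence"],
    "age": [45, 45, 52, None],
    "symptom_duration_months": [6, 6, 12, 3],
})

# 1) Duplicate removal (the step the paper performed with Pandas).
records = records.drop_duplicates()

# 2) Terminology standardization: map variant terms to one canonical label.
term_map = {"ED": "erectile dysfunction", "impotence": "erectile dysfunction"}
records["diagnosis"] = records["diagnosis"].replace(term_map)

# 3) Exclude incomplete entries lacking critical fields (e.g., patient age).
records = records.dropna(subset=["age", "symptom_duration_months"])

print(len(records))  # 2 entries survive cleaning
```

In the toy data, one exact duplicate and one entry with a missing age are removed, and the two variant diagnosis labels collapse into the canonical term.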
The clinical decision support database includes the EAU Guidelines on Sexual and Reproductive Health-2025, its 247 cited articles, and monthly updates from the latest literature on advances in ED treatment (total size: 186 275 773 bytes).29 Processing steps were as follows. Data cleaning involved format harmonization, converting PDF and HTML articles to plain text via Apache Tika with tables and figures extracted separately,30 and extraction of metadata (publication dates, authors, and keywords) for temporal filtering. Validation comprised expert review, with three urologists verifying the clinical relevance of articles, and version control synchronizing the guidelines with the EAU’s official repository for the latest revisions. Standardization entailed Elasticsearch indexing with MeSH term assignment to ensure semantic consistency.31
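The metadata-driven temporal filtering step might look like the following sketch. The article records and cutoff date are illustrative stand-ins, not the study’s actual code or data.

```python
from datetime import date

# Hypothetical article metadata; the fields mirror those the paper
# says were extracted (publication date, authors, keywords).
articles = [
    {"title": "PDE5 inhibitor update", "pub_date": date(2024, 6, 1),
     "keywords": ["ED", "PDE5"]},
    {"title": "Older review", "pub_date": date(2018, 3, 15),
     "keywords": ["ED"]},
]

def temporal_filter(items, cutoff=date(2020, 1, 1)):
    """Keep only articles published on or after the cutoff date."""
    return [a for a in items if a["pub_date"] >= cutoff]

recent = temporal_filter(articles)
print([a["title"] for a in recent])  # ['PDE5 inhibitor update']
```

In a monthly-update workflow, the cutoff would advance with each refresh so that only literature newer than the last synchronization needs to be re-indexed.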
Performance assessment
We developed a series of inquiries and clinical scenarios encompassing 19 ED-related medical advice questions and 6 complex ED clinical cases, each case subdivided into sub-questions. These inquiries ranged from fundamental concepts to cutting-edge research related to ED. First, the inquiries were informed by a systematic review of ED FAQs on Chinese online medical platforms, including standardization of Traditional Chinese Medicine terms (e.g., Kidney deficiency and Yang deficiency). Second, we incorporated common diagnostic and therapeutic challenges from the EAU Guidelines (2025).24 Third, we integrated real-world clinical scenarios validated by urologists for representativeness of ED presentations.
These questions and cases covered a variety of topics, including the basic understanding of ED, diagnosis, self-assessment methods, patient-led self-management, therapeutic interventions, and follow-up procedures for patients treated for ED. Patient information and clinical data have been de-identified under the Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule and all data are HIPAA compliant.32 Both the questions and cases are presented in Supplementary Table 1 (183.2KB, pdf) .
Subsequently, we simulated the roles of patients and urologists in clinical consultations and case discussions using ZhongdaChat-ED and four commonly used LLMs (ChatGPT4, Copilot, Claude, and Gemini). We collected their responses over three tests during a one-week period, randomizing the order of questions and cases. Responses were evaluated by three urologists and three patients with ED. The evaluator cohort size aligned with prior studies of LLM performance,33 balancing rigor and feasibility for preliminary validation.
For ED-related medical consultation questions, three patients (aged 32–68 years) with confirmed ED diagnoses and varying disease durations (1–8 years) were recruited from the Department of Urology, Zhongda Hospital, Southeast University, Nanjing, China, to rate the ease of understanding of the model responses. Three experienced urologists specializing in andrology and ED (each with more than 20 years of clinical experience in ED and ED-related publications) evaluated the accuracy and human caring of the model responses to assess performance in delivering health consultation services. For complex clinical cases, clinically significant responses to each sub-question were screened and evaluated by the three urologists during the LLM case analysis. The total score for the sub-questions was designated to represent the clinical significance of a case response. Additionally, the informational frontier of the responses was evaluated. These factors were combined to assess the ability of the LLMs to provide clinical decision-making assistance (clinically significant responses) in the medical scenarios. Informed consent was obtained from all patient and physician participants after explaining the study’s purpose, procedures, and data usage, and written consent forms were signed. Participants consented to the use of their anonymized evaluation data for the purposes of this study and its publication. The assessment by patients and physicians began on February 2, 2025, and was conducted over a one-week period. The assessment process is illustrated in Figure 1b.
When ZhongdaChat-ED receives an input query about ED, it vectorizes the text (query vector) using the same embedding applied to the specialized databases (embedding). ZhongdaChat-ED then searches the specialized databases for literature or guideline fragments relevant to the transformed vector (relevant splits and retrieval). The retrieved medical evidence is spliced with the user’s original query (context concatenation) and fed into the base generative model, ensuring that the output answers are strictly grounded in the content retrieved from the specialized databases, whereas mainstream LLMs rely on pre-trained static knowledge to generate answers directly. The mechanism is illustrated in Figure 1b.
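The retrieve-then-concatenate mechanism described above can be sketched as follows. This toy example uses bag-of-words counts and cosine similarity as a stand-in for the neural embeddings a production RAG system would use; the two corpus sentences are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus standing in for the specialized databases.
corpus = [
    "PDE5 inhibitors such as sildenafil are first-line therapy for erectile dysfunction",
    "Peyronie's disease causes penile curvature and may require surgical correction",
]

def embed(text):
    """Bag-of-words vector: a crude stand-in for a neural embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=1):
    """Relevant splits and retrieval: rank documents by similarity to the query vector."""
    qv = embed(query)
    return sorted(docs, key=lambda d: cosine(qv, embed(d)), reverse=True)[:k]

def build_prompt(query, docs):
    """Context concatenation: splice retrieved evidence with the user's original query."""
    context = "\n".join(retrieve(query, docs))
    return (f"Answer strictly from the context below.\n"
            f"Context:\n{context}\nQuestion: {query}")

prompt = build_prompt("what is first-line therapy for erectile dysfunction", corpus)
print("sildenafil" in prompt)  # True: the PDE5 document was retrieved
```

The final prompt instructs the base generative model to answer only from the retrieved fragments, which is how RAG constrains outputs to the specialized databases rather than to static pre-trained knowledge.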
The versions of the language models used in the study were as follows: ChatGPT4 (https://openai.com/gpt-4/, May 13, 2024 version), Claude-3.5-Sonnet-20240620 (https://claude.ai/, June 20, 2024 version), Copilot (https://copilot.microsoft.com/, January 15, 2024 version), and Gemini 2.0 Flash (https://gemini.google.com/, December 12, 2024 version). ZhongdaChat-ED was constructed based on the open-source LLM Deepseek-r1:32b (https://github.com/deepseek-ai/DeepSeek-R1.git, September 9, 2024 version), and training was completed on January 28, 2025.
Scoring criteria
The assessment was conducted under blinded conditions, adhering to the principles of objectivity and rigor. For the four dimensions of accuracy, human caring, ease of understanding, and informational frontier scores of clinically significant responses, scoring was conducted using a 5-point Likert scale, with the median or mode representing the final score.34 For clinical significance, the sub-questions in each case were assigned scores of 0, 1, or 2 by three urologists, and the score for each sub-question was the sum of scores from the three urologists. The total score for the case was the sum of the sub-question scores. Notably, when there was one or more “0” values for a sub-question, the score for this sub-question was recorded as 0. Scores greater than 0 indicated a “clinically significant response”, and these were further assessed for frontier scores.
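The sub-question scoring rule described above (a single rater’s zero vetoes the sub-question; otherwise the three 0/1/2 ratings are summed) and the score rate used in the Results section can be expressed compactly. The ratings below are hypothetical.

```python
def sub_question_score(ratings):
    """Sum of the three urologists' 0/1/2 ratings; recorded as 0 if any rater gave 0."""
    if 0 in ratings:
        return 0
    return sum(ratings)

def case_score(sub_ratings):
    """Total case score: the sum of its sub-question scores."""
    return sum(sub_question_score(r) for r in sub_ratings)

# Hypothetical ratings for a case with three sub-questions;
# the third sub-question is vetoed by a single zero.
ratings = [(2, 2, 1), (1, 1, 2), (2, 0, 2)]
total = case_score(ratings)
max_score = len(ratings) * 3 * 2      # 3 raters x max 2 points per sub-question
score_rate = total / max_score * 100  # score rate (%) as defined in the Results

print(total)       # 9
print(score_rate)  # 50.0
```

A sub-question score greater than 0 marks a “clinically significant response” eligible for the subsequent informational frontier rating.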
The Likert scale was defined as: 1 (unacceptable), the responses generated by the LLMs were clearly deficient in the criterion; 2 (poor), the responses were lacking in a criterion, but not severely; 3 (average), the responses were adequate but not particularly consistent with the criterion; 4 (good), the responses were in line with the criterion; and 5 (excellent), the responses surpassed the anticipated standard, demonstrating exceptional performance. Before the formal evaluation, reviewers received brief training on LLM interaction to minimize technology literacy bias. Additionally, we organized a training session for the reviewers on the evaluation criteria and process. During the training, the three urologists reached a consensus to refer to Chapter 5 Management of Erectile Dysfunction of the EAU Guidelines on Sexual and Reproductive Health-2025 and the Erectile Dysfunction: AUA Guideline (2018) as the basis for evaluation.24,35 We also provided detailed explanations and definitions of the five dimensions, including examples to aid in interpretation. Explanations of the five dimensions and examples of the scoring criteria can be found in the Supplementary Information (2.2MB, pdf) (Explanations of the Five Dimensions, and Examples of Scoring Criteria).
Visualization and statistical analysis
Origin 2023 (version 2023b SR1 [9.9.0.225], released on October 10, 2023) was used to create interval dot plots, which illustrated the individual model scores for each question, as well as radar plots comparing the total scores of the different models. All statistical analyses were conducted using R version 4.3.0 (R Foundation for Statistical Computing, Vienna, Austria). Mean scores for each dimension were compared using one-way analysis of variance (ANOVA) with effect sizes (Cohen’s d) and post hoc tests for pairwise comparisons. Cross-tabular Chi-square tests were used to compare the percentages of clinically significant responses. ANOVA assumptions were validated using the Shapiro–Wilk and Levene’s tests. Inter-rater reliability was measured using Fleiss’ kappa. Statistical significance is presented in the Supplementary Information (2.2MB, pdf) (Statistical Significance).
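For readers unfamiliar with these statistics, the following self-contained sketch computes Cohen’s d and a one-way ANOVA F statistic from first principles on hypothetical Likert scores; the study itself used R for these analyses.

```python
from statistics import mean, stdev

def cohens_d(x, y):
    """Standardized mean difference using the pooled sample standard deviation."""
    nx, ny = len(x), len(y)
    pooled = (((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2)
              / (nx + ny - 2)) ** 0.5
    return (mean(x) - mean(y)) / pooled

def one_way_anova_f(groups):
    """F statistic: between-group variance over within-group variance."""
    grand = mean(v for g in groups for v in g)
    k = len(groups)
    n = sum(len(g) for g in groups)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical 5-point Likert accuracy scores for two models.
model_a = [5, 5, 4, 5, 5, 4]
model_b = [4, 3, 4, 4, 3, 4]

print(round(cohens_d(model_a, model_b), 2))  # 1.94 (a large effect)
f = one_way_anova_f([model_a, model_b])
print(round(f, 2))                           # 11.25
```

Cohen’s d near 2 and an F statistic well above 1 together indicate that the two score distributions differ by far more than their within-group spread, which is the pattern reported for the Consumer Version comparisons.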
RESULTS
ZhongdaChat-ED is currently freely available on the Internet (www.zhongdachat.com). The comprehensive responses of the six models to the given questions and cases are documented in the Supplementary Information (2.2MB, pdf) (Responses of the Six Models to the Given Questions and Cases). Evaluative scores provided by the three urologists and three patients for each question were assessed across dimensions and underwent statistical dimensionality reduction and subsequent visual processing. ANOVA assumptions were validated via the Shapiro–Wilk (normality) and Levene’s (homogeneity) tests. Fleiss’ kappa, the measure of inter-rater reliability, was 0.78, indicating strong agreement.
Performance of four LLMs and two ZhongdaChat-ED versions in answering questions related to ED
Figure 2a provides a detailed overview of the accuracy scores. In terms of accuracy, the Consumer Version demonstrated superior performance, consistently earning scores above four points for answers to all 19 questions across three simulation tests. The Consumer Version maintained a higher mean score (4.77) than those of the four commonly used LLMs and the Professional Version (4.63), although the latter difference was not significant (Consumer Version vs Gemini, P < 0.001, Cohen’s d = 1.75; Consumer Version vs ChatGPT4, P < 0.001, Cohen’s d = 0.91; Consumer Version vs Claude, P < 0.001, Cohen’s d = 1.02; Consumer Version vs Copilot, P = 0.01, Cohen’s d = 0.63; Consumer Version vs Professional Version, P = 0.20, Cohen’s d = 0.31). Further statistical analyses of the human caring and ease of understanding scores for each model revealed that the Consumer Version received the highest mean scores in these dimensions, although its human caring score did not differ significantly from that of Copilot. In particular, the Consumer Version yielded values of 4.86 for human caring (Consumer Version vs Gemini, P < 0.001, Cohen’s d = 1.30; Consumer Version vs ChatGPT4, P < 0.001, Cohen’s d = 0.83; Consumer Version vs Claude, P < 0.001, Cohen’s d = 1.13; Consumer Version vs Copilot, P = 0.32, Cohen’s d = 0.25; Consumer Version vs Professional Version, P < 0.001, Cohen’s d = 0.89) and 4.88 for ease of understanding (Consumer Version vs Gemini, P < 0.001, Cohen’s d = 1.73; Consumer Version vs ChatGPT4, P = 0.003, Cohen’s d = 0.66; Consumer Version vs Claude, P < 0.001, Cohen’s d = 1.10; Consumer Version vs Copilot, P = 0.03, Cohen’s d = 0.45; Consumer Version vs Professional Version, P < 0.001, Cohen’s d = 0.78), as displayed in Figure 2b and 2c. Notably, the Consumer Version exhibited the most stable output results, demonstrating minimal fluctuations in scores across the three simulation sessions.
Gemini was prone to answering requests for professional medical treatment of ED with a refusal rather than substantive advice, such as “If you are concerned about ED, it is important to see a doctor to determine the cause of your ED and discuss treatment options. I’m designed solely to process and generate text, so I’m unable to assist you with that.” The results of statistical analyses of the scores are presented in the Supplementary Information (2.2MB, pdf) (ZhongdaChat-ED Consumer Version).
Figure 2.

Performance of four LLMs and two ZhongdaChat-ED versions in answering questions related to ED. (a) Scatter plots show the accuracy scores of each model for each problem, and bar graphs show the average accuracy scores of each model. (b) Scatter plots show the human caring scores of each model for each problem, and bar graphs show the average human caring scores of each model. (c) Scatter plots show the ease of understanding scores of each model in each problem, and bar graphs show the average ease of understanding scores of each model. LLMs: large language models; ED: erectile dysfunction.
Overall, the Consumer Version outperformed the other five models in providing accurate, easy-to-understand, and humanistic responses to ED clinical consultations.
Performance of LLMs in clinical scenarios analysis
We compared the performance of the models on six complex clinical cases related to ED, each subdivided into sub-questions and covering topics from the basics to the frontiers of research. We computed the mean scores for each case across the three trial simulations and divided the total scores by the maximum attainable score (score rate [%] = total sub-question scores/maximum attainable score × 100%). These data were used to generate radar plots (Figure 3a) illustrating the clinical significance scores for specific output responses from the six models. Notably, the Professional Version performed better than the four generic models, exhibiting remarkable stability and exceptional problem-solving abilities across all three tests and consistently achieving a score rate of 85.2% or higher for each clinical case, while it did not differ significantly from the Consumer Version in five of the six cases (Table 1).
Figure 3.

Performance of the models in a clinical scenario. (a) Radar plots illustrate the clinical significance scores (percentages) for specific output responses of the six models for each case. (b) The proportion of “clinically significant responses” provided by each model to sub-questions. (c) Bar graphs show the informational frontier scores of clinically significant responses for each model. ****P < 0.0001. X²_Pearson (5): Pearson’s Chi-squared test statistic (denoted χ²) with 5 degrees of freedom; Cramer’s V: effect size measure for Chi-squared tests, ranging from 0 (no association) to 1 (perfect association); CI: confidence interval.
Table 1.
Clinical significance comparison of Professional Version versus other models across clinical scenarios analysis
| Case | Comparison model | Score rate (%) | P | Cohen’s d |
|---|---|---|---|---|
| 1 | Gemini | 85.2 | <0.001 | 6.56 |
| | Claude | 85.2 | 0.001 | 1.94 |
| | Copilot | 85.2 | 0.11 | 2.11 |
| | ChatGPT4 | 85.2 | 0.046 | 2.59 |
| | Consumer Version | 85.2 | 0.81 | 0.33 |
| 2 | Gemini | 90.3 | <0.001 | 6.67 |
| | Claude | 90.3 | 0.008 | 1.66 |
| | Copilot | 90.3 | 0.57 | 0.58 |
| | ChatGPT4 | 90.3 | 0.70 | 0.39 |
| | Consumer Version | 90.3 | 1.00 | 0.00 |
| 3 | Gemini | 89.7 | <0.001 | 7.33 |
| | Claude | 89.7 | 0.005 | 2.00 |
| | Copilot | 89.7 | 0.02 | 3.54 |
| | ChatGPT4 | 89.7 | 0.007 | 4.00 |
| | Consumer Version | 89.7 | 0.58 | 0.71 |
| 4 | Gemini | 91.7 | 0.001 | 3.85 |
| | Claude | 91.7 | 0.04 | 1.84 |
| | Copilot | 91.7 | 0.001 | 3.85 |
| | ChatGPT4 | 91.7 | 0.002 | 3.73 |
| | Consumer Version | 91.7 | 0.02 | 2.62 |
| 5 | Gemini | 89.8 | <0.001 | 3.63 |
| | Claude | 89.8 | <0.001 | 4.11 |
| | Copilot | 89.8 | 0.01 | 2.37 |
| | ChatGPT4 | 89.8 | 0.09 | 1.42 |
| | Consumer Version | 89.8 | 0.55 | 0.47 |
| 6 | Gemini | 91.7 | <0.001 | 5.56 |
| | Claude | 91.7 | <0.001 | 1.85 |
| | Copilot | 91.7 | <0.001 | 2.50 |
| | ChatGPT4 | 91.7 | 0.001 | 2.31 |
| | Consumer Version | 91.7 | 0.06 | 1.15 |
*Score rate (%) = total sub-question scores/maximum attainable score × 100%. P values indicate the statistical significance of performance differences (two-tailed tests). Cohen’s d quantifies effect size (standardized mean difference). All values were extracted directly from the Supplementary Information (2.2MB, pdf) (Statistical Significance) without modification. Clinical cases 1–6 correspond to Q20–Q25 in Supplementary Table 1 (183.2KB, pdf).
Responses without a score of zero for a sub-question were defined as “clinically significant responses”. By calculating the proportion of such responses for each model, we evaluated their informational frontier. The Professional Version provided clinically significant responses to all sub-questions in all three tests, achieving a rate of 100% (χ² = 25.94, df = 5, P < 0.001) in all cases (Figure 3b). In comparisons of the informational frontier scores of clinically significant responses across the six models, the Professional Version had the highest average score of 4.52 (standard deviation [s.d.] = 0.499; Professional Version vs Gemini, P < 0.001; Professional Version vs Claude, P < 0.001; Professional Version vs Copilot, P < 0.001; Professional Version vs ChatGPT4, P < 0.001; Professional Version vs Consumer Version, P = 0.006; Figure 3c). The statistical results are presented in the Supplementary Information (2.2MB, pdf) (Statistical Significance).
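The six-model proportion comparison above uses Pearson’s chi-square on a 6×2 contingency table (clinically significant vs zero-scored sub-question responses). The sketch below computes the statistic from first principles on hypothetical counts; the real counts are in the Supplementary Information.

```python
def pearson_chi_square(table):
    """Pearson's chi-square statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical counts of (clinically significant, zero-scored) responses
# for six models; the first row mimics a model with a 100% significant rate.
table = [[30, 0], [24, 6], [22, 8], [25, 5], [20, 10], [28, 2]]
stat = pearson_chi_square(table)
df = (len(table) - 1) * (len(table[0]) - 1)

print(df)            # 5 degrees of freedom, as in the six-model comparison
print(stat > 11.07)  # True: exceeds the 0.05 critical value for df = 5
```

With six models and two outcome categories, the test has (6 − 1) × (2 − 1) = 5 degrees of freedom, matching the df = 5 reported above.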
Collectively, these findings indicate that the Professional Version outperforms the other five models significantly in delivering clinically significant and cutting-edge responses in ED scenarios.
ZhongdaChat-ED redefines ED care by offering dual interfaces: the Consumer Version empowers patients with accessible, stigma-free consultations, while the Professional Version equips clinicians with evidence-based decision support. For patients, this model reduces barriers to seeking help, a pivotal advancement given that 65% of patients with ED delay medical visits due to embarrassment or inattention to ED symptoms.16 Clinicians benefit from streamlined case analysis; for instance, in complex scenarios involving comorbidities (e.g., diabetes-related ED), the Professional Version achieved an 85.2% case score rate, outperforming Gemini (63.2%) and Claude (71.5%) with all P < 0.001. Unlike commonly used general-purpose LLMs, the domain-specific training and RAG integration in ZhongdaChat-ED minimize hallucinations, as evidenced by its 100% clinically significant response rate in case analyses.
DISCUSSION
Current state of LLMs in healthcare
LLMs have emerged as a prominent topic in the medical field, with their development progressing rapidly.36 LLMs have the potential to transform healthcare education, research, and clinical practice. For instance, they can facilitate scientific writing, efficiently analyze massive datasets, generate code, and conduct rapid literature reviews. Additionally, they can be beneficial in drug discovery and development, cost savings, documentation, personalized medicine, and improving health literacy.37 Models such as ChatGPT, launched in November 2022, have been particularly influential in facilitating clinical documentation and functioning as chatbots to address specific patient data and concerns.38,39 In the field of andrology, the quality of information provided by commonly used AI chatbots has been evaluated, such as in the work by Pan et al.40 Furthermore, Şahin et al.41 conducted a comparative analysis of five different LLM responses to frequently searched queries about ED, and Razdan et al.42 assessed the ability of ChatGPT to answer questions pertaining to ED. These studies have highlighted the potential benefits of LLMs in healthcare and clinical assistance. However, limitations persist, including constraints related to training data, timeliness, and the accuracy of responses. Current research primarily focuses on assessing the ED problem-solving capabilities of commonly used language models; however, a healthcare model with real-time updated training data designed specifically to address ED issues is lacking.
RAG models versus general LLMs
In such specific medical scenarios, general LLMs often fail to meet practical needs when applied to professional segmentation, primarily due to knowledge limitations. The model’s knowledge is entirely derived from its training data, which are predominantly constructed from publicly available datasets. Consequently, commonly used LLMs (such as ChatGPT, Copilot, and Gemini) rely on static datasets and lack access to real-time, non-public, or offline data, limiting LLMs to knowledge available at the time of training. In contrast, RAG dynamically retrieves real-time, domain-specific data during inference, enabling continuous integration of the latest guidelines and research. For instance, in ophthalmology, Chen et al.43 developed an automated pipeline for fundus fluorescein angiography interpretation and question-answering, named FFA-GPT, utilizing RAG technology. This system effectively integrates FFA image text information and provides question-and-answer interactions, thereby offering valuable medical advice and assistance for ophthalmic consultations. Ge et al.44 developed a liver disease-specific and protected health information-compliant LLM, named LiVersa, using RAG, which demonstrated higher accuracy than that of ChatGPT4 in addressing questions related to hepatology.
Clinical impact and advantages of ZhongdaChat-ED
We successfully developed ZhongdaChat-ED using a patient health advisory database and a frontier medical research database for ED, employing RAG technology. It integrates health consultation services for patients (ZhongdaChat-ED Consumer Version) and clinical decision-making assistance for professionals (ZhongdaChat-ED Professional Version). The Consumer Version distinguishes itself by delivering accurate, easy-to-understand, and humanistic responses to ED-related questions. This means that the ZhongdaChat-ED Consumer Version can provide clear and accurate clinical health counseling advice to patients with little or no medical education while considering the humanistic needs of individuals with ED or those at risk of developing it. The Professional Version offers clinically significant and cutting-edge responses in ED scenarios. For example, ZhongdaChat-ED’s Clinical Decision Support Database is updated monthly, ensuring that clinicians receive recommendations aligned with current evidence. This demonstrates the potential of the ZhongdaChat-ED Professional Version to assist urologists in case analysis and clinical decision-making in complex ED situations as well as to help them stay updated on recent ED research, which is not possible through static LLMs. For example, by querying the recent literature during inference, ZhongdaChat-ED can provide frontier research that may be relevant in the ED diagnosis and treatment process, such as data for the monocyte-to-high-density lipoprotein-cholesterol ratio and inflammatory biomarkers.45,46
The specialization of ZhongdaChat-ED in urology distinguishes it from other medical RAG models. LiVersa (Ge et al.44) and FFA-GPT (Chen et al.43) focus on hepatology and ophthalmology, respectively, but lack tailored frameworks for the psychosocial and physiological complexities for patients with ED.
Prospects and challenges
Although ZhongdaChat-ED incorporates general knowledge of ED treatment recommendations and research, it has limitations in addressing broader cross-disciplinary aspects of medicine. For instance, in clinical practice, ED frequently coexists with cardiovascular and metabolic disorders, necessitating connections with databases for other specialties, such as cardiology and endocrinology.47 Future work should focus on integrating medical literature, clinical guidelines, and additional expertise related to systemic urological and multi-system diseases to enhance the model’s utility. Furthermore, while ZhongdaChat-ED can respond to user inquiries based on text input, real-world health counseling and clinical scenarios often involve multimodal communication. For instance, patients with Peyronie’s disease often require visual assessment of penile curvature. Therefore, expanding the types of data that the model can process and respond to, including images, audio, and video, is critical for enabling more comprehensive clinical applications.48,49 Finally, despite the use of manual evaluations from multiple clinicians and patients, more extensive comparisons with a broader range of ED analysis tools, as well as prospective trials involving more urologists or medical students with varying levels of expertise, are needed.50 Additionally, the model’s interpretability is limited by the “black box” nature of AI neural networks.
CONCLUSION
Our study successfully developed ZhongdaChat-ED, a dual-functional LLM for ED built on RAG technology. The model includes a patient health consultation version (Consumer Version) and a clinical decision support version (Professional Version). By leveraging RAG to integrate real-time, domain-specific data, ZhongdaChat-ED effectively addresses the limitations of commonly used LLMs and, in direct comparisons, demonstrated superior performance in consultation accuracy, humanistic care, and informational frontier while minimizing hallucinations through real-time evidence integration. The Consumer Version addresses patient barriers by offering stigma-free, accessible consultations, whereas the Professional Version enhances clinical workflows with up-to-date recommendations. Current limitations include insufficient cross-disciplinary data integration and multimodal capabilities. Future research should prioritize multimodal expansion and integration of cardiovascular/metabolic disease databases to broaden clinical utility.
AUTHOR CONTRIBUTIONS
GYZ and CS conceived and designed the study, acquired/analyzed/interpreted data, supervised the work, and ensured data integrity/accuracy. YX conceived and designed the study, developed the model, acquired data, drafted the manuscript, performed statistical analysis, and provided technical support. YKZ analyzed data, drafted the manuscript, and provided technical support. CHL interpreted data and drafted the manuscript. XH guided statistical analyses. RXZ and NKZ revised the manuscript. MC supervised the work. All authors read and approved the final manuscript.
COMPETING INTERESTS
All authors declare no competing interests.
ACKNOWLEDGMENTS
This work was supported by the Natural Science Foundation of China (No. 82170703 and No. 81871157), Jiangsu Provincial Hospital Association Hospital Management Innovation Research fund (No. JSYGY-3-2023-410), and Zhongda Hospital Affiliated to Southeast University, Jiangsu Province High-Level Hospital Construction Funds (No. GSP-ZXY12).
Supplementary Information is linked to the online version of the paper on the Asian Journal of Andrology website.
REFERENCES
- 1.Wang H, Fu T, Du Y, Gao W, Huang K, et al. Scientific discovery in the age of artificial intelligence. Nature. 2023;620:47–60. doi: 10.1038/s41586-023-06221-2. [DOI] [PubMed] [Google Scholar]
- 2.Benary M, Wang XD, Schmidt M, Soll D, Hilfenhaus G, et al. Leveraging large language models for decision support in personalized oncology. JAMA Netw Open. 2023;6:e2343689. doi: 10.1001/jamanetworkopen.2023.43689. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Lim ZW, Pushpanathan K, Yew SM, Lai Y, Sun CH, et al. Benchmarking large language models' performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine. 2023;95:104770. doi: 10.1016/j.ebiom.2023.104770. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Schwartz IS, Link KE, Daneshjou R, Cortés-Penfield N. Black box warning: large language models and the future of infectious diseases consultation. Clin Infect Dis. 2024;78:860–6. doi: 10.1093/cid/ciad633. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Liu J, Wang C, Liu S. Utility of ChatGPT in clinical practice. J Med Internet Res. 2023;25:e48568. doi: 10.2196/48568. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Zhu L, Mou W, Chen R. Can the ChatGPT and other large language models with internet-connected database solve the questions and concerns of patient with prostate cancer and help democratize medical knowledge? J Transl Med. 2023;21:269. doi: 10.1186/s12967-023-04123-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Malak A, Şahin MF. How useful are current chatbots regarding urology patient information? Comparison of the ten most popular chatbots' responses about female urinary incontinence. J Med Syst. 2024;48:102. doi: 10.1007/s10916-024-02125-4. [DOI] [PubMed] [Google Scholar]
- 8.Şahin MF, Topkaç EC, Doğan Ç, Şeramet S, Özcan R, et al. Still using only ChatGPT? The comparison of five different artificial intelligence chatbots' answers to the most common questions about kidney stones. J Endourol. 2024;38:1172–7. doi: 10.1089/end.2024.0474. [DOI] [PubMed] [Google Scholar]
- 9.Vaishya R, Misra A, Vaish A. ChatGPT: is this version good for healthcare and research? Diabetes Metab Syndr. 2023;17:102744. doi: 10.1016/j.dsx.2023.102744. [DOI] [PubMed] [Google Scholar]
- 10.Shamloul R, Ghanem H. Erectile dysfunction. Lancet Lond Engl. 2013;381:153–65. doi: 10.1016/S0140-6736(12)60520-0. [DOI] [PubMed] [Google Scholar]
- 11.Fadzil MA, Sidi H, Ismail Z, Hassan MR, Thuzar K, et al. Socio-demographic and psychosocial correlates of erectile dysfunction among hypertensive patients. Compr Psychiatry. 2014;55:S23–8. doi: 10.1016/j.comppsych.2012.12.024. [DOI] [PubMed] [Google Scholar]
- 12.Grant P, Jackson G, Baig I, Quin J. Erectile dysfunction in general medicine. Clin Med Lond Engl. 2013;13:136–40. doi: 10.7861/clinmedicine.13-2-136. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Rosen RC. Psychogenic erectile dysfunction. Classification and management. Urol Clin North Am. 2001;28:269–78. doi: 10.1016/s0094-0143(05)70137-3. [DOI] [PubMed] [Google Scholar]
- 14.Calzo JP, Austin SB, Charlton BM, Missmer SA, Kathrins M, et al. Erectile dysfunction in a sample of sexually active young adult men from a U.S. cohort: demographic, metabolic and mental health correlates. J Urol. 2021;205:539–44. doi: 10.1097/JU.0000000000001367. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Ma J, Zhang Y, Bao B, Chen W, Li H, et al. Prevalence and associated factors of erectile dysfunction, psychological disorders, and sexual performance in primary versus secondary infertility men. Reprod Biol Endocrinol. 2021;19:43. doi: 10.1186/s12958-021-00720-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Akgül M, Yazıcı C, Doğan Ç, Özcan R, Şahin MF. Erectile dysfunction iceberg in an urology outpatient clinic: how can we encourage our patients to be more forthcoming? Andrologia. 2021;53:e14152. doi: 10.1111/and.14152. [DOI] [PubMed] [Google Scholar]
- 17.Yigman M, Untan I, Dogan AE. ChatGPT: a new hope for sexual dysfunction sufferers? J Mens Health. 2024;20:135. [Google Scholar]
- 18.Cocci A, Pezzoli M, Lo Re M, Russo GI, Asmundo MG, et al. Quality of information and appropriateness of ChatGPT outputs for urology patients. Prostate Cancer Prostatic Dis. 2024;27:103–8. doi: 10.1038/s41391-023-00705-y. [DOI] [PubMed] [Google Scholar]
- 19.Shen Y, Heacock L, Elias J, Hentel KD, Reig B, et al. ChatGPT and other large language models are double-edged swords. Radiology. 2023;307:e230163. doi: 10.1148/radiol.230163. [DOI] [PubMed] [Google Scholar]
- 20.Bhayana R, Fawzy A, Deng Y, Bleakney RR, Krishna S. Retrieval-augmented generation for large language models in radiology: another leap forward in board examination performance. Radiology. 2024;313:e241489. doi: 10.1148/radiol.241489. [DOI] [PubMed] [Google Scholar]
- 21.Ollama. Ollama: Get Up and Running with Large Language Models. GitHub. [Last accessed on 2025 Jan 28]. Available from: https://github.com/ollama/ollama .
- 22.DeepSeek-AI. DeepSeek-R1. GitHub. [Last accessed on 2025 Jan 28]. Available from: https://github.com/deepseek-ai/DeepSeek-R1 .
- 23.Open WebUI. GitHub. [Last accessed on 2024 Oct 28]. Available from: https://github.com/open-webui/open-webui .
- 24.Salonia A, Capogrosso P, Boeri L, Cocci A, Corona G, et al. European Association of Urology guidelines on male sexual and reproductive health: 2025 update on male hypogonadism, erectile dysfunction, premature ejaculation, and Peyronie's disease. Eur Urol. 2025;88:76–102. doi: 10.1016/j.eururo.2025.04.010. [DOI] [PubMed] [Google Scholar]
- 25.Jimbo M. Erectile Dysfunction (ED). MSD Manual Consumer Version. [Last accessed on 2025 May 13]. Available from: https://www.msdmanuals.com/home/men-s-health-issues/sexual-function-and-dysfunction-in-men/erectile-dysfunction-ed .
- 26.McKinney W. Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference (SciPy 2010). 2010:51–6. [Google Scholar]
- 27.SNOMED International. Snowstorm: Scalable SNOMED CT Terminology Server Using Elasticsearch. GitHub. [Last accessed on 2025 May 13]. Available from: https://github.com/IHTSDO/snowstorm .
- 28.Sanderson C, Curtin R. gmm_diag and gmm_full: C++ classes for multi-threaded Gaussian mixture models and Expectation-Maximisation. J Open Source Softw. 2017;2:365. [Google Scholar]
- 29.Shojania KG, Sampson M, Ansari MT, Garritty C, Rader T. Updating Systematic Reviews. Agency for Healthcare Research and Quality (US); 2007. [Last accessed on 2025 May 13]. Available from: https://www.ncbi.nlm.nih.gov/books/NBK44099/ [PubMed]
- 30.The Apache Software Foundation. Apache Tika. GitHub. [Last accessed on 2025 May 13]. Available from: https://github.com/apache/tika .
- 31.Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32:D267–70. doi: 10.1093/nar/gkh061. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.U.S. Department of Health and Human Services. Guidance on De-identification of Protected Health Information. [Last accessed on 2025 May 13]. Available from: https://www.hhs.gov/sites/default/files/ocr/privacy/hipaa/understanding/coveredentities/De-identification/hhs_deid_guidance.pdf .
- 33.Song H, Xia Y, Luo Z, Liu H, Song Y, et al. Evaluating the performance of different large language models on health consultation and patient education in urolithiasis. J Med Syst. 2023;47:125. doi: 10.1007/s10916-023-02021-3. [DOI] [PubMed] [Google Scholar]
- 34.Mirahmadizadeh A, Delam H, Seif M, Bahrami R. Designing, constructing, and analyzing Likert scale data. J Educ Community Health. 2018;5:63–72. [Google Scholar]
- 35.Burnett AL, Nehra A, Breau RH, Culkin DJ, Faraday MM, et al. Erectile dysfunction: AUA guideline. J Urol. 2018;200:633–41. doi: 10.1016/j.juro.2018.05.004. [DOI] [PubMed] [Google Scholar]
- 36.Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, et al. Large language models encode clinical knowledge. Nature. 2023;620:172–80. doi: 10.1038/s41586-023-06291-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Sallam M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare. 2023;11:887. doi: 10.3390/healthcare11060887. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Meskó B, Topol EJ. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ Digit Med. 2023;6:120. doi: 10.1038/s41746-023-00873-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Goodman RS, Patrinely JR, Stone CA, Zimmerman E, Donald RR, et al. Accuracy and reliability of chatbot responses to physician questions. JAMA Netw Open. 2023;6:e2336483. doi: 10.1001/jamanetworkopen.2023.36483. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Pan A, Musheyev D, Loeb S, Kabarriti AE. Quality of erectile dysfunction information from ChatGPT and other artificial intelligence chatbots. BJU Int. 2024;133:152–4. doi: 10.1111/bju.16209. [DOI] [PubMed] [Google Scholar]
- 41.Şahin MF, Ateş H, Keleş A, Özcan R, Doğan Ç, et al. Responses of five different artificial intelligence chatbots to the top searched queries about erectile dysfunction: a comparative analysis. J Med Syst. 2024;48:38. doi: 10.1007/s10916-024-02056-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Razdan S, Siegal AR, Brewer Y, Sljivich M, Valenzuela RJ. Assessing ChatGPT's ability to answer questions pertaining to erectile dysfunction:can our patients trust it? Int J Impot Res. 2023;36:1–7. doi: 10.1038/s41443-023-00797-z. [DOI] [PubMed] [Google Scholar]
- 43.Chen X, Zhang W, Xu P, Zhao Z, Zheng Y, et al. FFA-GPT: an automated pipeline for fundus fluorescein angiography interpretation and question-answer. NPJ Digit Med. 2024;7:111. doi: 10.1038/s41746-024-01101-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Ge J, Sun S, Owens J, Galvez V, Gologorskaya O, et al. Development of a liver disease–specific large language model chat interface using retrieval-augmented generation. Hepatology. 2024;80:1158–68. doi: 10.1097/HEP.0000000000000834. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Liu C, Gao Y, Ji J, Sun C, Chen M. Association between inflammatory indexes and erectile dysfunction in U.S. adults: national health and nutrition examination survey 2001–2004. Sex Med. 2023;11:qfad045. doi: 10.1093/sexmed/qfad045. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Üntan İ, Doğan AE. Monocyte-to-high-density lipoprotein-cholesterol ratio as a predictor and severity indicator of erectile dysfunction. Androloji Bül. 2023;25:3. [Google Scholar]
- 47.Terentes-Printzios D, Ioakeimidis N, Rokkas K, Vlachopoulos C. Interactions between erectile dysfunction, cardiovascular disease and cardiovascular drugs. Nat Rev Cardiol. 2022;19:59–74. doi: 10.1038/s41569-021-00593-6. [DOI] [PubMed] [Google Scholar]
- 48.Li J, Li D, Xiong C, Hoi S. BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation. Proceedings of the 39th International Conference on Machine Learning. PMLR; 2022:12888–900. [Last accessed on 2024 Oct 28]. Available from: https://proceedings.mlr.press/v162/li22n.html . [Google Scholar]
- 49.Lin Z, Zhang D, Tao Q, Shi D, Haffari G, et al. Medical visual question answering: a survey. Artif Intell Med. 2023;143:102611. doi: 10.1016/j.artmed.2023.102611. [DOI] [PubMed] [Google Scholar]
- 50.Cutillo CM, Sharma KR, Foschini L, Kundu S, Mackintosh M, et al. Machine intelligence in healthcare – perspectives on trustworthiness, explainability, usability, and transparency. NPJ Digit Med. 2020;3:47. doi: 10.1038/s41746-020-0254-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
Supplementary Materials
Questions and clinical cases
Web interface of ZhongdaChat

