American Journal of Nuclear Medicine and Molecular Imaging. 2025 Aug 15;15(4):146–152. doi: 10.62347/OAHP6281

Advancing patient education in PRRT through large language models: challenges and potential

Tilman Speicher 1, Moritz B Bastian 1, Arne Blickle 1, Armin Atzinger 2, Florian Rosar 1, Caroline Burgard 1, Samer Ezziddin 1
PMCID: PMC12444397  PMID: 40980741

Abstract

The increasing use of artificial intelligence (AI) chatbots for patient education raises questions about their accuracy, readability, and conciseness in delivering medical information. This study evaluates the performance of ChatGPT 4o and DeepSeek V3 in answering common patient inquiries about peptide receptor radionuclide therapy (PRRT). Twelve frequently asked patient questions regarding PRRT were submitted to both chatbots. The responses were assessed by nine professionals using a blinded survey, scoring accuracy, conciseness, and readability on a five-point scale. Statistical analyses included the Mann-Whitney U test for nonparametric data and the Chi-square test for medically incorrect responses. A total of 324 individual assessments were conducted. No significant differences were found in accuracy between ChatGPT 4o (mean 4.43) and DeepSeek V3 (mean 4.56; P = 0.0909) or in readability between ChatGPT 4o (mean 4.38) and DeepSeek V3 (mean 4.26; P = 0.1236). However, ChatGPT 4o provided significantly more concise responses (mean 4.55) than DeepSeek V3 (mean 4.24; P = 0.0013). Medically incorrect information, defined as an accuracy score of ≤ 3, was present in 7-8% of chatbot responses, with no significant difference between the two models (P = 0.8005). Both AI chatbots demonstrated strong performance in providing medical information on PRRT, with ChatGPT 4o excelling in conciseness. However, the presence of medical inaccuracies highlights the need for physician oversight when AI chatbots are used for patient education. Future research should explore methods to enhance AI reliability and personalization in clinical communication.

Keywords: Peptide receptor radionuclide therapy, PRRT, large language models, ChatGPT, DeepSeek, patient education, neuroendocrine tumors

Introduction

The efficacy of peptide receptor radionuclide therapy (PRRT), as demonstrated in the randomized, controlled NETTER-1 trial [1], led to the approval of [177Lu]Lu-DOTATATE by the U.S. Food and Drug Administration (FDA). As a result, more patients with somatostatin receptor-positive neuroendocrine tumors (NETs) are choosing this treatment. Before deciding, they often seek to understand the theranostic approach and evaluate its potential benefits for their specific situation. The rising interest in PRRT therefore suggests that an increasing number of patients may rely on AI chatbots for information and guidance on this advanced treatment option.

The release of ChatGPT in November 2022 catalyzed the ongoing discourse surrounding artificial intelligence, with particular emphasis on multifunctional chatbots [2]. Especially with the advent of advanced large language models (LLMs) such as OpenAI’s ChatGPT (Chat Generative Pre-trained Transformer), Google Gemini [3], and, more recently, DeepSeek [4], AI chatbots have gained substantial attention and are increasingly being integrated into various applications. They are now used across many domains of life, including education [4], science [5], and business [6], as well as healthcare, customer service, software development [7], entertainment, and the creative industries. In the medical field, chatbots are progressively being used as a first point of contact for patients seeking answers to various medical inquiries. It is essential to acknowledge that AI chatbots are still in development and may generate inaccurate or misleading responses, known as ‘hallucinations’ [8-10]. This study examines the reliability of two chatbots (ChatGPT 4o and DeepSeek V3) in the three dimensions of accuracy, conciseness, and readability, using illustrative questions about PRRT as a case study. The study aims to assess both the overall performance of the chatbots and whether one chatbot demonstrates superiority in any of the three dimensions. Additionally, it seeks to assess how frequently the chatbots generate medically false answers.

Materials and methods

Prompting chatbots

The 12 patient questions listed in Table 1 were developed collaboratively by a multidisciplinary team of physicians, including specialists in nuclear medicine and palliative care. While no formal survey data were available to define a standardized set of “most common” questions, the included items were derived from recurring patterns of patient concerns encountered in daily clinical consultations. Notably, one contributing team member, a senior palliative care physician, has extensive experience conducting such discussions regularly, ensuring the clinical relevance and authenticity of the selected questions.

Table 1.

Overview of the 12 questions used to prompt the chatbots

Questions
1 “What is the level of radiation exposure from Peptide Receptor Radionuclide Therapy (PRRT) and what effects does it have on my body?”
2 “Is there a possibility of a cure with Receptor Radionuclide Therapy (PRRT)?”
3 “How many cycles of Peptide Receptor Radionuclide Therapy (PRRT) should I receive?”
4 “How often will follow-up examinations be conducted after the Peptide Receptor Radionuclide Therapy (PRRT)?”
5 “What are the most common side effects of Peptide Receptor Radionuclide Therapy (PRRT)?”
6 “Do I need to continue the therapy with somatostatin analogs (SSAs)?”
7 “How should I behave in the next days after the Peptide Receptor Radionuclide Therapy (PRRT)?”
8 “How soon will I know if the treatment (PRRT) is effective?”
9 “Does the radiation from the radioligand therapy pose a risk of developing a tumor myself?”
10 “How does Peptide Receptor Radionuclide Therapy (PRRT) work?”
11 “How much does Receptor Radionuclide Therapy (PRRT) cost?”
12 “What alternative treatment options are available for my neuroendocrine tumor, and why was radioligand therapy recommended in my case?”

The table presents the specific questions posed to the chatbots to assess their ability to provide accurate, concise, and readable information.

These questions were submitted to DeepSeek V3 and ChatGPT 4o on February 6, 2025, each within a single chat window. For both LLMs, the default temperature setting of 1.0 was applied. The max tokens parameter was not explicitly set; however, the initial prompt instructed the model to provide concise responses, thereby implicitly constraining output length.
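
For illustration, a comparable prompting setup could be scripted against the OpenAI-compatible chat-completion APIs that both vendors offer. The following Python sketch is not the study's actual workflow; the model identifiers, the DeepSeek base URL, and the system-instruction wording are assumptions to be checked against current documentation.

# Illustrative sketch only (not the study's workflow): submitting the same
# patient questions to both models via OpenAI-compatible chat APIs.
# Model names, the DeepSeek base URL, and the system instruction are assumptions.
from openai import OpenAI

QUESTIONS = [
    "What is the level of radiation exposure from Peptide Receptor Radionuclide "
    "Therapy (PRRT) and what effects does it have on my body?",
    # ... the remaining 11 questions from Table 1
]

SYSTEM_PROMPT = "Answer the patient's question about PRRT concisely."  # assumed wording

def ask(client, model, question):
    """Send one question with the default temperature of 1.0 and no explicit max_tokens."""
    response = client.chat.completions.create(
        model=model,
        temperature=1.0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
deepseek_client = OpenAI(api_key="...", base_url="https://api.deepseek.com")  # assumed endpoint

for question in QUESTIONS:
    gpt_answer = ask(openai_client, "gpt-4o", question)
    deepseek_answer = ask(deepseek_client, "deepseek-chat", question)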

Using the chatbot responses, a survey was created on the Qualtrics platform [11], in which each question-answer pair was evaluated across three dimensions: accuracy, conciseness, and readability. The blinded survey was sent to nine raters: eight nuclear medicine physicians and one experienced medical student. The student had direct clinical experience with the relevant patient population through supervised rotations, including patient consultations, which provided sufficient clinical context for participation in the rating process. Each rater scored every question-answer pair on the three criteria using a 5-point scale. In the accuracy category, responses with a score of ≤ 3 were classified as medically misleading or incorrect.
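
As a minimal sketch of how such ratings can be aggregated, assuming a long-format export of the survey scores with hypothetical file and column names (rater, model, question, dimension, score), the summary statistics and the error-rate classification could be computed as follows.

# Minimal aggregation sketch; the file name and column names are hypothetical.
import pandas as pd

ratings = pd.read_csv("prrt_chatbot_ratings.csv")
# expected columns: rater, model, question, dimension, score (1-5)

# Mean, median, and SD per model and dimension (cf. Table 2)
summary = ratings.groupby(["model", "dimension"])["score"].agg(["mean", "median", "std"])
print(summary)

# Responses rated as medically misleading or incorrect (accuracy score <= 3)
accuracy = ratings[ratings["dimension"] == "accuracy"]
incorrect_rate = (accuracy["score"] <= 3).groupby(accuracy["model"]).mean()
print(incorrect_rate)  # about 7-8% per model in this study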

Statistical analysis

Statistical analyses were performed using GraphPad Prism (version 10.2.3 for macOS; GraphPad Software, CA, USA). For nonparametric data, the Mann-Whitney U test was used to compare the scores of DeepSeek V3 and ChatGPT 4o in terms of accuracy, conciseness, and readability. For categorical data, the Chi-square test was applied to compare the frequency of clinically inaccurate information in the responses of the two LLMs. Descriptive statistics were computed for each variable. Statistical significance was defined as P < 0.05.
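
An equivalent comparison can be reproduced outside GraphPad Prism. The sketch below assumes two arrays of 108 individual ratings per model and dimension (12 questions x 9 raters) and uses SciPy's two-sided Mann-Whitney U test.

# Sketch of the per-dimension comparison with SciPy; input arrays are assumed to
# hold the 108 individual 5-point ratings per model (12 questions x 9 raters).
from scipy.stats import mannwhitneyu

def compare_dimension(chatgpt_scores, deepseek_scores):
    """Two-sided Mann-Whitney U test on the ratings of one dimension."""
    statistic, p_value = mannwhitneyu(chatgpt_scores, deepseek_scores,
                                      alternative="two-sided")
    return statistic, p_value

# For conciseness, this comparison yielded U = 4501 and P = 0.0013 in the study.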

Results

A total of 12 questions were evaluated by 9 specialists based on the criteria of accuracy, conciseness, and readability, using a five-point scale. This resulted in a total of 324 individual assessments. The performance of DeepSeek and ChatGPT was compared using the Mann-Whitney U test (Table 2). For accuracy (Figure 1A), ChatGPT 4o achieved a mean score of 4.43 (SEM = 0.06347), while DeepSeek V3 scored 4.56 (SEM = 0.06209); this difference was not statistically significant (P = 0.0909). For conciseness (Figure 1B), ChatGPT 4o reached a mean score of 4.55 (SEM = 0.05633), compared with 4.24 (SEM = 0.07069) for DeepSeek V3; this difference was statistically significant (P = 0.0013). For readability (Figure 1C), ChatGPT 4o obtained a mean score of 4.38 (SEM = 0.07972), while DeepSeek V3 scored 4.26 (SEM = 0.07368), with no statistically significant difference (P = 0.1236).

Examining the individual questions (Figure 2A, 2B), ChatGPT 4o performed poorest in terms of accuracy on Question 7 (“How should I behave in the next days after the Peptide Receptor Radionuclide Therapy (PRRT)?”), while DeepSeek V3 had the lowest performance in readability on Question 11 (“How much does Receptor Radionuclide Therapy (PRRT) cost?”).

The Chi-square test was conducted to analyze the frequency of medically incorrect information in the AI chatbot responses. Experts classified AI-generated answers as incorrect in 7.41% of cases (8/108) for ChatGPT 4o and 8.33% (9/108) for DeepSeek V3, with no significant difference between the two LLMs (P = 0.8005).

In response to Question 7, ChatGPT 4o incorrectly recommended avoiding contact with pets after PRRT, a precaution that is not supported by clinical guidelines and not considered medically necessary. This constitutes a hallucination, as the model introduced a fabricated detail unrelated to evidence-based practice. Such errors may lead to unnecessary patient concern or behavioral restrictions, highlighting the importance of content verification in patient-facing applications.

In response to Question 6, ChatGPT 4o failed to mention the clinically relevant need to pause long-acting somatostatin analogues (SSAs) at least four weeks prior to PRRT, a standard pre-treatment requirement in our department. Instead, it incorrectly stated that SSAs should be paused for 4-6 weeks after therapy, whereas our protocol recommends only a one-week pause post-treatment. This represents an omission error with an additional factual inaccuracy, as it omits a critical preparatory step and misrepresents post-treatment management. Such inaccuracies may negatively affect patient understanding and clinical coordination if relied upon without expert oversight. It should be noted that this statement refers to the standard procedure at our institution; clinical protocols may differ across centers.
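
As a quick check of the reported error-rate comparison, the following sketch applies a Chi-square test to the 2 × 2 table of incorrect versus correct responses. It assumes no continuity correction, since the exact GraphPad Prism settings are not reported.

# Sanity check of the error-rate comparison: 8/108 incorrect responses for
# ChatGPT 4o vs 9/108 for DeepSeek V3. Assumes no continuity correction;
# the exact test options used in GraphPad Prism are not reported.
from scipy.stats import chi2_contingency

table = [[8, 108 - 8],   # ChatGPT 4o: incorrect, correct
         [9, 108 - 9]]   # DeepSeek V3: incorrect, correct
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(round(p, 4))  # approximately 0.80, consistent with the reported P = 0.8005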

Table 2.

A Mann-Whitney U test was conducted to compare the performance in the areas of accuracy, conciseness, and readability

                    Accuracy                   Conciseness                Readability
                    ChatGPT 4o   DeepSeek V3   ChatGPT 4o   DeepSeek V3   ChatGPT 4o   DeepSeek V3
Mean                4.435        4.565         4.556        4.241         4.38         4.259
Median              5            5             5            4             5            4
SD                  0.6596       0.6452        0.5854       0.7346        0.8284       0.7657
Mann-Whitney U      5154                       4501                       5198
P-value             0.0909                     0.0013                     0.1236

A statistically significant difference was found for Conciseness (P = 0.0013), while no significant difference was observed for the other two areas.

Figure 1.

Comparison of the performance of ChatGPT 4o and DeepSeek V3 in the categories of Accuracy (A), Conciseness (B), and Readability (C). The violin plots and boxplots illustrate the distribution of scores for each category, highlighting the differences between the two models in terms of their overall performance.

Figure 2.

Individual question scores for the categories of Accuracy (_a), Conciseness (_c), and Readability (_r) are presented for ChatGPT 4o (A) and DeepSeek V3 (B). The plots display the performance of both models across each category, illustrating the variation in scores for each question.

Overall, both chatbots presented similar performance, with a significant difference observed only in conciseness. Figure 3 shows a sample question. For this question, ChatGPT received an accuracy score of 4.3, a conciseness score of 4.8, and a readability score of 4.5. In comparison, DeepSeek achieved an accuracy score of 4.5, a conciseness score of 4.5, and a readability score of 4.5.

Figure 3.

An example question comparing the performance of ChatGPT 4o and DeepSeek V3, along with the corresponding scores for each model. The figure highlights the differences in performance between the two models for this specific question.

Discussion

The evaluation of 12 questions by 9 professionals resulted in 324 individual assessments based on accuracy, conciseness, and readability. While ChatGPT 4o and DeepSeek V3 showed no significant differences in accuracy and readability, ChatGPT 4o provided significantly more concise responses. Notably, the rate of medically incorrect information was low for both LLMs (7.41% and 8.33%, respectively), with no significant difference between them. Overall, the results indicate that ChatGPT 4o and DeepSeek V3 performed similarly, suggesting that both models could be suitable for patient education on PRRT.

The higher conciseness of ChatGPT 4o may help patients grasp medical information more quickly and easily. However, it is crucial to assess whether this conciseness compromises informational completeness. Omitting key details could lead to misunderstandings, particularly when addressing complex treatments such as PRRT. Future research should focus on balancing conciseness and medical accuracy to ensure both clarity and comprehensiveness.

The lack of significant differences in accuracy and readability suggests that both models provide comparable quality in terms of correctness and comprehensibility. This is particularly relevant for their potential use in patient communication, as it indicates that neither model has a clear advantage in conveying medically accurate and easily understandable information. Further studies should investigate whether these findings hold across different medical contexts and patient populations.

The lower accuracy of ChatGPT 4o on Question 7 and the reduced readability of DeepSeek V3 on Question 11 suggest specific weaknesses in medical knowledge representation. These differences may stem from limitations in training data or response generation, and further analysis is needed to refine AI models for more reliable patient communication.

The application of chatbots for addressing medical inquiries has been explored in the existing literature. Goodman et al. [12] conducted a survey to assess the accuracy of GPT-4 and GPT-3.5 responses to questions from different medical specialties. Their findings indicate that while ChatGPT can deliver precise and detailed medical information, it may also produce hallucinated responses that are partially or entirely incorrect. In the present study, no entirely incorrect responses were observed for ChatGPT 4o or DeepSeek V3, although isolated hallucinated details did occur. The quality of responses from Bing and ChatGPT 3.5 to common patient inquiries about different types of cancer was investigated by Janopaul-Naylor et al. [13]. Their study found that ChatGPT 3.5 provided superior answers compared with Bing, particularly for questions regarding lung, breast, prostate, and colorectal cancer.

Nevertheless, both models occasionally produced entirely incorrect or contradictory responses, a phenomenon not observed in our study. Similarly, Rahsepar et al. [14] assessed how Bing, ChatGPT 3.5, Bard, and Google search performed in responding to patient queries about lung cancer screening and prevention. While ChatGPT 3.5 typically demonstrated higher precision, none of the chatbots was able to provide completely correct responses. In another study, Bilgin et al. [15] examined the performance of ChatGPT 4 and Bard in answering patient questions about PSMA radioligand therapy. Their findings demonstrated the superiority of ChatGPT 4 in terms of medical accuracy, while Bard provided responses with better readability.

ChatGPT 4o and DeepSeek V3 differ in their development and underlying architecture. ChatGPT 4o (OpenAI, USA) is based on the GPT-4 architecture and has been trained on a mix of publicly available and licensed data. In contrast, DeepSeek V3 (DeepSeek AI, China) utilizes a transformer-based architecture specifically optimized for technical and scientific applications. Unlike ChatGPT, DeepSeek does not have real-time internet access; however, similar to ChatGPT 4o, it is based on a pre-trained model built on a large dataset from various sources. DeepSeek AI also offers a search-enabled version, DeepSeek Search, which can retrieve real-time internet sources. The transformer architecture [16], on which both models are based, is characterized by its self-attention mechanism, which determines the importance of different words in a sequence. Unlike earlier sequential models, transformers process all input tokens at once, making training faster and more efficient. Positional encoding preserves word order, since the model does not process words sequentially. Multi-head attention allows the model to capture different relationships between words. Additionally, layer normalization and residual connections help stabilize training and improve performance.
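
To make the self-attention mechanism concrete, the following minimal NumPy sketch implements scaled dot-product self-attention as formulated in [16]; the token count, embedding size, and random inputs are arbitrary toy values and are not tied to either model's actual configuration.

# Minimal scaled dot-product self-attention, following
# Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V [16]; toy shapes and inputs.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # pairwise token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                 # weighted mix of value vectors

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))                       # 4 tokens, 8-dimensional embeddings
output = scaled_dot_product_attention(tokens, tokens, tokens)  # self-attention: Q = K = V
print(output.shape)                                    # (4, 8): all tokens processed in parallel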

AI-assisted consultations have the potential to reduce the workload of physicians. However, the results indicate that medical accuracy cannot yet be fully guaranteed. A combined approach, integrating AI-generated responses with physician oversight, could be a viable solution to enhance patient education while ensuring reliability. Future advancements could enable chatbots to tailor responses to individual patient needs, considering factors such as medical history, health literacy, educational status and personal preferences. This would enhance patient engagement and comprehension. Exploring methods for adaptive AI in patient communication represents a promising area for further research.

A key limitation of this study is the subjective evaluation of the chatbot responses, which may introduce bias into the assessment of accuracy and readability. Furthermore, an important aspect not addressed in this study is the evaluation of patient comprehension. While expert ratings provide valuable insights into clinical accuracy, readability, and relevance, they do not necessarily reflect how well patients understand or engage with the information provided. Incorporating structured comprehension testing with patients would offer a more robust evaluation of the real-world utility of LLM-generated responses. Future studies should therefore include patient-centered validation methods to assess not only factual correctness but also communication effectiveness and the potential for misunderstanding.

We acknowledge that the inclusion of non-native English-speaking raters may have influenced the assessments, particularly for more nuanced language judgments. While our raters had a high level of English proficiency and clinical familiarity, subtle linguistic factors might have been under- or overestimated. Future studies should consider including native English speakers to validate and refine readability scores, thereby enhancing the generalizability and linguistic sensitivity of the findings. Additionally, medical practices and guidelines vary between countries, which may affect the generalizability of the results to other healthcare systems. It is also important to note that LLMs are rapidly evolving, with frequent updates that can significantly alter their capabilities and output quality. The results of this study are therefore specific to the versions tested at the time of analysis and should not be generalized to newer iterations without further validation. Future studies are needed to reassess performance as LLMs continue to develop and improve, particularly in clinical and patient-facing applications.

In summary, both chatbots demonstrated strong performance in accuracy, conciseness, and readability. Despite the overall strong performance, the persistence of 7-8% medical inaccuracies remains concerning. Unfortunately, no directly comparable data on error rates in physician-delivered education could be identified in the literature. However, it is reasonable to assume that experienced physicians typically convey medical information with substantially lower error rates. This underscores the need for critical oversight and cautious integration of LLM-generated medical content into clinical practice, particularly in patient-facing applications.

In conclusion, a significant difference was observed only in conciseness, with ChatGPT 4o performing better, while no differences were found in accuracy and readability. These findings suggest that while AI-driven patient education holds promise, careful evaluation and oversight remain essential.

Disclosure of conflict of interest

None.

References

  • 1.Di Franco M, Zanoni L, Fortunati E, Fanti S, Ambrosini V. Radionuclide theranostics in neuroendocrine neoplasms: an update. Curr Oncol Rep. 2024;26:538–50. doi: 10.1007/s11912-024-01526-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.An J, Ding W, Lin C. ChatGPT: tackle the growing carbon footprint of generative AI. Nature. 2023;615:586. doi: 10.1038/d41586-023-00843-2. [DOI] [PubMed] [Google Scholar]
  • 3.Alhur A. Redefining healthcare with artificial intelligence (AI): the contributions of ChatGPT, Gemini, and Co-pilot. Cureus. 2024;16:e57795. doi: 10.7759/cureus.57795. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Liu A, Feng B, Xue B, Wang B, Wu B, Lu C, et al. DeepSeek-V3 technical report. arXiv [Preprint] 2024. Available from: https://arxiv.org/abs/2412.19437.
  • 5.Chakraborty C, Pal S, Bhattacharya M, Dash S, Lee SS. Overview of chatbots with special emphasis on artificial intelligence-enabled ChatGPT in medical science. Front Artif Intell. 2023;6:1237704. doi: 10.3389/frai.2023.1237704. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Ramya J, Alur S. Unleashing the potential of chatbots in business: a bibliometric analysis. Bus Inf Rev. 2023;40:123–36. [Google Scholar]
  • 7.Rahmaniar W. ChatGPT for software development: opportunities and challenges. IT Prof. 2024;26:80–6. [Google Scholar]
  • 8.Athaluri SA, Manthena SV, Kesapragada VSRKM, Yarlagadda V, Dave T, Duddumpudi RTS. Exploring the boundaries of reality: investigating the phenomenon of artificial intelligence hallucination in scientific writing through ChatGPT references. Cureus. 2023;15:e37432. doi: 10.7759/cureus.37432. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.González-Corbelle J, Bugarín-Diz A, Alonso-Moral J, Taboada J. Dealing with hallucination and omission in neural Natural Language Generation: a use case on meteorology. 15th International Conference on Natural Language Generation. 2022. [Google Scholar]
  • 10.Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388:1233–9. doi: 10.1056/NEJMsr2214184. [DOI] [PubMed] [Google Scholar]
  • 11.Douglas BD, Ewell PJ, Brauer M. Data quality in online human-subjects research: comparisons between MTurk, prolific, cloudresearch, qualtrics, and SONA. PLoS One. 2023;18:e0279720. doi: 10.1371/journal.pone.0279720. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Goodman RS, Patrinely JR, Stone CA Jr, Zimmerman E, Donald RR, Chang SS, Berkowitz ST, Finn AP, Jahangir E, Scoville EA, Reese TS, Friedman DL, Bastarache JA, van der Heijden YF, Wright JJ, Ye F, Carter N, Alexander MR, Choe JH, Chastain CA, Zic JA, Horst SN, Turker I, Agarwal R, Osmundson E, Idrees K, Kiernan CM, Padmanabhan C, Bailey CE, Schlegel CE, Chambless LB, Gibson MK, Osterman TJ, Wheless LE, Johnson DB. Accuracy and reliability of chatbot responses to physician questions. JAMA Netw Open. 2023;6:e2336483. doi: 10.1001/jamanetworkopen.2023.36483. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Janopaul-Naylor JR, Koo A, Qian DC, McCall NS, Liu Y, Patel SA. Physician assessment of ChatGPT and Bing answers to American Cancer Society’s questions to ask about your cancer. Am J Clin Oncol. 2024;47:17–21. doi: 10.1097/COC.0000000000001050. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Rahsepar AA, Tavakoli N, Kim GHJ, Hassani C, Abtin F, Bedayat A. How AI responds to common lung cancer questions: ChatGPT vs Google Bard. Radiology. 2023;307:e230922. doi: 10.1148/radiol.230922. [DOI] [PubMed] [Google Scholar]
  • 15.Belge Bilgin G, Bilgin C, Childs DS, Orme JJ, Burkett BJ, Packard AT, Johnson DR, Thorpe MP, Riaz IB, Halfdanarson TR, Johnson GB, Sartor O, Kendi AT. Performance of ChatGPT-4 and Bard chatbots in responding to common patient questions on prostate cancer (177)Lu-PSMA-617 therapy. Front Oncol. 2024;14:1386718. doi: 10.3389/fonc.2024.1386718. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Advances in Neural Information Processing Systems. 2017 [Google Scholar]
