Abstract
Background:
Patient education is a healthcare concept that involves providing the public with evidence-based medical information. This information strengthens people's capacity to lead healthier lives and better manage their conditions. Large language model (LLM) platforms have recently emerged as powerful natural language processing (NLP) tools capable of producing human-sounding text and, by extension, patient education materials.
Objective:
This study aims to conduct a scoping review to systematically map the existing literature on the use of LLMs for generating patient education materials.
Methods:
The study followed JBI guidelines, searching five databases using set inclusion/exclusion criteria. A RAG-inspired framework was employed to extract the variables, followed by a manual check to verify the accuracy of the extractions. In total, 21 variables were identified and grouped into five themes: Study Demographics, LLM Characteristics, Prompt-Related Variables, PEM Assessment, and Comparative Outcomes.
Results:
Results were reported from 69 studies. The United States contributed the largest number of studies. Models such as ChatGPT-4, ChatGPT-3.5, and Bard were the most investigated. Most studies evaluated the accuracy and readability of LLM responses. Only 3 studies implemented external knowledge bases leveraging a RAG architecture. All studies except 3 conducted prompting in English. ChatGPT-4 was found to provide the most accurate responses in comparison with other models.
Conclusion:
This review examined studies comparing large language models for generating patient education materials. ChatGPT-3.5 and ChatGPT-4 were the most evaluated. Accuracy and readability of responses were the main metrics of evaluation, while few studies used assessment frameworks, retrieval-augmented methods, or explored non-English cases.
Keywords: Patient Education Materials; Natural Language Processing; Artificial Intelligence; Generative AI; Large Language Models; Transformer; ChatGPT; Copilot; Bard; Gemini; DeepSeek; Claude; Retrieval Augmented Generation; Prompts
1. BACKGROUND
Patient education (PE) involves providing medically accurate, evidence-based information to patients or the general public. By promoting informed decisions and positive behavior changes, PE enables better health management and improved quality of life (1). PE also contributes to economic development by improving productivity, reducing healthcare expenses, increasing life expectancy, and extending working life expectancy globally (2).
Historically, PE has transitioned from oral teachings in ancient times to academic texts during the Islamic Golden Age, public health initiatives in Europe’s Enlightenment, and formalized hospital and government efforts in the 20th century. Digitization, the internet, and global health initiatives significantly expanded access to health information, especially during the COVID-19 pandemic (3-6).
Recently, advanced Natural Language Processing (NLP) tools and Large Language Model (LLM) platforms, such as ChatGPT, Bard/Gemini, and Copilot, have emerged as practical applications (7-9). These models, trained on vast textual datasets, recognize patterns and relationships between concepts using transformer-based architectures like GPT, BART, and BERT (10-13). They handle complex reasoning tasks, adapt to context, and address nuanced challenges, making them suitable for healthcare, finance, data analysis, scientific research, content generation, and personalized education (14-16).
In healthcare, NLP models offer potential in generating personalized, accessible Patient Education Materials (PEMs) tailored to individual health literacy, though more research is needed (17). Compared to traditional static resources that require manual updating and lack personalization, NLP models can provide current information, simplify medical terminology, and boost patient engagement.
Readability is crucial for PEMs since complex medical terminology can reduce patient comprehension. Common readability tests include Flesch Reading Ease, Flesch-Kincaid Grade Level, and SMOG readability tests (18-20). Additionally, frameworks such as PEMAT (measuring Understandability and Actionability) and DISCERN (assessing Quality, Reliability, Clarity, and Comprehensiveness) specifically evaluate PEMs (21, 22).
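To make these tests concrete, the most common readability scores can be computed directly from sentence, word, and syllable counts. The sketch below applies the standard published formulas; the syllable counter is a naive vowel-group heuristic we use for illustration, whereas production readability tools rely on pronunciation dictionaries, so exact scores will differ.

```python
import math
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count runs of vowels; real tools use pronunciation dictionaries.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    n_s, n_w = len(sentences), len(words)
    return {
        # Flesch Reading Ease: higher = easier (60-70 is "plain English").
        "flesch_reading_ease": 206.835 - 1.015 * n_w / n_s - 84.6 * syllables / n_w,
        # Flesch-Kincaid Grade Level: approximate US school grade required.
        "flesch_kincaid_grade": 0.39 * n_w / n_s + 11.8 * syllables / n_w - 15.59,
        # SMOG: grade estimate driven by words of three or more syllables.
        "smog": 1.0430 * math.sqrt(polysyllables * 30 / n_s) + 3.1291,
    }
```

For a PEM written entirely in short, monosyllabic sentences, Flesch Reading Ease exceeds 100 and the grade-level scores fall below first grade, which is why these formulas are commonly used to flag overly complex LLM output.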
While this scoping review addresses a topic similar to that explored by the authors of (23), we identified several limitations in their approach. First, although they reviewed 201 studies, they extracted only 7 variables (found in their supplemental documents), which may limit the depth of analysis and the ability to draw comprehensive insights. Additionally, their exclusion criteria did not exclude studies that involved a single LLM, weakening the comparative evaluations that are crucial for understanding the relative performance of various LLMs. Furthermore, their literature search was restricted to the PubMed database, potentially overlooking relevant studies indexed in other libraries and thus narrowing the scope of their findings. Finally, their thematic analyses did not include variables related to the utilization of external knowledge sources, the involvement of actual patients, the most frequently used PEM assessment frameworks, readability tools, or the methods of evaluating the accuracy of LLM-generated PEMs.
2. OBJECTIVE
The aim of this work is to produce a comprehensive mapping of the existing literature comparing LLMs in generating PEMs. Specifically, this review aims to a) Expand the number and depth of extracted variables to facilitate more thorough analyses, b) Search more databases with potentially relevant studies, c) Highlight the comparative performance of LLMs in generating PEMs.
3. MATERIAL AND METHODS
The authors of this article followed the JBI guidelines for scoping reviews when conducting this work.
Supplementary materials
All supplementary figures, tables, and the complete list of reviewed studies (Supplementary Documents 1-6) are publicly available in our GitHub repository at: https://github.com/armbased/The-Use-of-Large-Language-Models-in-Generating-Patient-Education-Materials-a-Scoping-Review.
Search strategy
An initial search strategy was developed by an information specialist for five databases: Ovid MEDLINE(R) ALL, Embase, Scopus, APA PsycInfo, and CINAHL. The search combined terms related to large language models ('large language model*', 'LLMs', 'BERT', 'transformer model*', 'ChatGPT', 'GPT-4', 'LLaMA', 'Google Bard', 'Google Gemini', 'Anthropic Claude', and 'Microsoft Copilot') with terms related to patient education such as 'educat*', 'empower*', 'aware*', and 'patient*'. Searches were limited to studies published from 2020 to September 2024 and restricted to the English language. Supplementary Document 1, Tables 1, 2, and 3 present the detailed search strategies for these databases.
Inclusion and exclusion criteria
Articles were eligible for inclusion if they evaluated the use of LLMs for patient education and compared two or more LLMs. The formulation of the search terms followed the PICO framework (24), and each database was searched using relevant indexing terms. Included studies were limited to journal articles published in English from 2020 to September 29, 2024.
Articles were excluded if they did not provide a comparison between two or more LLMs, were not journal articles, did not focus on patient education, or were not in English. Two researchers, Alhasan AlSammarraie and Abdelrahman AlSaify, used the Rayyan AI tool (25) to make inclusion and exclusion decisions independently. Cohen's kappa was calculated to measure agreement between the researchers. Disagreements were resolved through discussion.
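For reference, Cohen's kappa corrects raw inter-rater agreement for the agreement expected by chance alone. A minimal sketch over two raters' screening decisions follows; the decisions themselves are hypothetical, not taken from this review.

```python
def cohens_kappa(rater_a: list, rater_b: list) -> float:
    # Observed agreement: fraction of items both raters labeled identically.
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of each rater's
    # marginal frequency for that label.
    labels = set(rater_a) | set(rater_b)
    p_e = sum((rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical screening decisions for six articles:
a = ["include", "exclude", "include", "exclude", "include", "exclude"]
b = ["include", "exclude", "exclude", "exclude", "include", "exclude"]
# cohens_kappa(a, b) ≈ 0.667, in the 0.61-0.80 "substantial agreement" band.
```

Values of 0.61-0.80 are conventionally read as substantial agreement, which is the interpretation applied to the kappa reported in the Results.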
Data extraction
For the purpose of this work, the authors have compiled a list of 21 variables for extraction through a thematic inductive approach. The variables were grouped into five primary themes: Study Demographics, LLM Characteristics, LLM Prompt-Related Variables, Generated PEM Assessment, and Comparative Outcomes. Table 1 lists all the extracted variables, themes, and a data point example for each variable.
Table 1. Extracted variables with their corresponding themes and examples.
| Variable | Theme | Example |
|---|---|---|
| Primary Investigator’s Name | Study Demographics | Al Hassan |
| Country of Affiliation | Study Demographics | Qatar |
| Year of Publication | Study Demographics | 2023 |
| Target Population or Disease | Study Demographics | Diabetes |
| LLM Models | LLM Characteristics | ChatGPT-4, Bard |
| LLM Count | LLM Characteristics | 3 |
| Custom LLM Configuration | LLM Characteristics | RAG, Fine-Tuning |
| Usage of Translation Service | LLM Characteristics | Yes, No |
| Language of the Prompts | LLM Prompt-Related Variables | English, Mandarin |
| Prompt Count | LLM Prompt-Related Variables | 10 |
| Sources of the Prompts | LLM Prompt-Related Variables | Clinical Guidelines |
| PEM Evaluation Metrics | Generated PEM Assessment | Accuracy, Readability |
| Metrics Count | Generated PEM Assessment | 4, 5 |
| Assessment Framework(s) | Generated PEM Assessment | PEMAT, NA |
| Accuracy Assessment Method(s) | Generated PEM Assessment | Expert opinion |
| Readability Assessment Method(s) | Generated PEM Assessment | Flesch-Kincaid |
| Readability Assessment Tools Count | Generated PEM Assessment | 3 |
| Patient Interaction with PEM | Generated PEM Assessment | Yes, No |
| Participating Patients Count | Generated PEM Assessment | 5 |
| Most Accurate LLM | Comparative Outcomes | ChatGPT-4 |
| Most Readable LLM | Comparative Outcomes | Bard |
To extract the variables from the studies, the authors decided to leverage the capabilities of LLMs to facilitate a more robust and standardized extraction. The data extraction process consisted of three primary stages: a) Data Vectorization, b) Vector Search and Querying ChatGPT-4o, and c) Data Cleaning and Analysis. Each stage is detailed below.
Data Vectorization: To manage resources and minimize hallucinations (i.e., inaccuracies or fabricated information generated by LLMs), we implemented a targeted search algorithm to select relevant textual segments for each data variable from each study. Initially, textual data from each included study were segmented into 500-character chunks. These segments were converted into embeddings using OpenAI's 'text-embedding-ada-002' model (referred to as ada v2), chosen for its effectiveness in handling large volumes of text (26) and based on OpenAI's recommendation (27). The chunk size of 500 characters was chosen after balancing accuracy against prompt length, assuming retrieval of the five most similar chunks per variable using cosine similarity. Detailed embedding generation steps are provided in Supplementary Document 1, Figure 1.
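The chunking and retrieval steps above can be sketched as follows. The helper names are ours rather than the authors', the embedding call is shown only as a comment, and a real pipeline would batch the API requests.

```python
import math

def chunk_text(text: str, size: int = 500) -> list[str]:
    # Fixed-size 500-character chunks, as used in the review's pipeline.
    return [text[i:i + size] for i in range(0, len(text), size)]

def cosine(u: list[float], v: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_k_chunks(query_vec: list[float], chunk_vecs: list[list[float]],
                 chunks: list[str], k: int = 5) -> list[str]:
    # Rank chunks by similarity to the variable's query embedding and keep
    # the five most similar, mirroring the retrieval described above.
    ranked = sorted(zip(chunks, chunk_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# In the actual pipeline, each vector would come from OpenAI's embeddings API, e.g.
# client.embeddings.create(model="text-embedding-ada-002", input=chunk)
```

With dummy two-dimensional vectors, a query embedding pointing toward a relevant chunk's vector ranks that chunk first, which is all the retrieval step relies on.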
Figure 1. PRISMA flowchart for the included studies.
Querying ChatGPT-4o: Following vectorization, semantic similarity search was used to select the most relevant chunks for each variable. The corresponding text segments were then provided to the ChatGPT-4o API for analysis and data extraction. The LLM was specifically instructed to structure its responses consistently according to the predefined variables. This querying process was repeated systematically across all variables and studies.
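One way to enforce that consistency is to wrap each variable's retrieved chunks in a fixed instruction template before calling the chat API. The wording below is our illustration of such a template, not the authors' actual prompt.

```python
def build_extraction_prompt(variable: str, chunks: list[str]) -> str:
    # Fixed template so every variable, for every study, is requested in the
    # same way; 'NA' keeps missing values machine-readable downstream.
    context = "\n---\n".join(chunks)
    return (
        f"Using ONLY the excerpts below, state the value of '{variable}'.\n"
        "Reply with the value alone, or 'NA' if it is not reported.\n\n"
        f"Excerpts:\n{context}"
    )

# The assembled prompt would then be sent to the ChatGPT-4o API, e.g.:
# client.chat.completions.create(model="gpt-4o",
#     messages=[{"role": "user", "content": prompt}])
```

Constraining the answer to a bare value (or 'NA') makes the outputs directly comparable across the 21 variables and simplifies the subsequent data-cleaning stage.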
Accuracy Verification and Data Cleaning: The accuracy and consistency of the extracted data were validated in multiple stages. Initially, LLM-generated outputs were cross-checked against study abstracts, revealing minimal discrepancies; subsequent in-depth verification showed a 97% accuracy rate. Any inconsistencies identified, such as formatting issues or discrepancies in responses, triggered manual reviews of the full-text articles for correction. Finally, all data underwent full review and refinement to rectify errors, remove unnecessary sentences, and correct formatting. Supplementary Document 2, Algorithm 1 abstracts the whole data extraction process.
4. RESULTS
PRISMA Flowchart and Cohen's Kappa
The identification of studies through the search strategies described above is shown in Figure 1, which resulted in 69 studies included for review [1]-[69], with a Cohen's kappa score of 0.7, indicating substantial agreement.
Figure 2. The LLMs used in the studies. Others*: Baichuan2, ChatGLM, ChatGPT-4Turbo, Claude-2.1, Claude-3, Claude-3 Vision, CLOVAX, DermGPT, DoctorGPT, ERNIEBot, GoogleAssistant, LeChat, Llama-2-70B, MedAlpaca, MetaAI, Mistral Large, Mixtral-8x7B, ORCA-mini, Qwen.

To ensure consistency and comparability across studies, we implemented a standardization process for two key variables.
Extracting relevant information proved challenging, as each study presented similar concepts using distinct terminology and phrasing. Two variables required standardization: Sources of the LLM Questions/Prompts and LLM Response Accuracy Measurement Method. To define standard terminologies, we followed an inductive thematic analysis approach, examining the raw extracted variables from 20 studies to identify the primary themes. Using these themes, we deduced abstract categories for the two variables. When we applied these categories to the remaining studies, all fit within the defined themes and categorizations. Nine types of prompt sources were identified (Supplementary Document 2, Table 1), and three LLM Response Accuracy Measurement Methods were defined (Supplementary Document 2, Table 2).
Table 2. PEM evaluation metrics used by the included studies. The * indicates technical metrics.
| Metric | Frequency | References | Definition |
|---|---|---|---|
| Citation Support* | 1 | [41]. | Whether the LLM includes and properly references credible sources. |
| Patient Satisfaction | 1 | [40]. | The extent to which a patient would find the answer helpful and comforting. |
| Similarity* | 1 | [64]. | How closely the responses of one LLM align with another's. |
| Bias* | 2 | [41], [59]. | The presence of unfair or prejudiced assumptions in the text. |
| Hallucinations* | 2 | [19], [41]. | Fabricating information or presenting false details as fact. |
| Reasoning* | 2 | [22], [59]. | The clarity and logical soundness of the argument or explanation. |
| Response Length* | 2 | [7], [43]. | The conciseness or verbosity of the answer. |
| Responsiveness* | 2 | [7], [43]. | The time to complete the LLM response. |
| Reproducibility* | 5 | [1], [18], [19], [23], [56]. | Consistency of the answer when asked multiple times. |
| Safety | 5 | [22], [37], [41], [59], [69]. | Avoidance of harmful, unethical, or disallowed content. |
| Clarity | 6 | [5], [9], [19], [39], [41], [60]. | How easily the text can be understood. |
| Actionability | 8 | [2], [6], [13], [21], [30], [38], [54], [55]. | Whether the response provides usable advice or next steps. |
| Tone | 11 | [5], [9], [10], [20], [30], [39]–[41], [47], [51], [60]. | The emotional or stylistic manner of the answer. |
| Appropriateness | 13 | [9], [19], [22], [23], [27], [33], [39], [41], [45], [46], [50], [60], [69]. | Suitability of the response for the context and audience. |
| Understandability | 13 | [2], [6], [9], [13], [21], [30], [38], [51], [53]–[55], [59], [60]. | How straightforward and comprehensible the language is. |
| Reliability* | 15 | [10], [15]–[17], [20], [26]–[28], [31], [43], [47], [49], [57], [61], [64]. | Trustworthiness and factual correctness of the content. |
| Quality* | 19 | [5], [11], [15]–[17], [25], [26], [28], [37], [46], [49], [50], [53], [54], [57], [61], [65], [66], [69]. | Overall caliber and usefulness of the response. |
| Comprehensiveness | 24 | [1], [3], [4], [7], [10], [11], [18], [19], [22], [32]–[34], [36], [40], [44], [46]–[48], [50], [51], [56], [60], [67], [68]. | The degree to which the answer covers all relevant points. |
| Readability* | 51 | [1]–[3], [6]–[11], [13]–[17], [20], [21], [23]–[26], [28], [30], [31], [34], [35], [38], [41], [42], [44]–[57], [59]–[67]. | The ease with which the text can be read and parsed. |
| Accuracy* | 54 | [1]–[5], [7], [10]–[13], [16], [18]–[23], [25]–[34], [36], [38]–[44], [46]–[51], [53]–[56], [58]–[60], [62], [63], [66]–[69]. | Correctness and precision of the information provided. |
Demographics
We grouped the countries of affiliation into 7 frequency categories, as shown in Supplementary Document 5, Figure 1. To illustrate, each country under category 1 (listed in the legend) appeared as the country of first-author affiliation exactly once, and so forth. US-affiliated researchers account for just under 50% (n=32) of all included studies, with the runner-up, Turkey, contributing 12% (n=8).
Of the included studies, 40% (n=27) were published in 2023 and 60% (n=42) in 2024.
The target population (disease category) data show that Ophthalmology and Oncology were the two most discussed topics, together contributing 35% (n=25) of the study pool. Plastic surgery, Orthopedics, and Neurology each contributed 7% (n=5). Supplementary Document 5, Figure 2 summarizes the target population data.
LLM Characteristics
Figure 2 shows the distribution of the LLMs used within the included studies. OpenAI's ChatGPT-3.5 and ChatGPT-4 occupied first (n=56) and third (n=36) place in frequency of use, collectively appearing 92 times. Google's chatbot, Bard, came in second place, with 40 studies including it in their work.
The distribution of the number of LLMs tested per study is presented in Supplementary Document 6, Table 1. The weighted average number of LLMs tested per study is approximately 3. The two references with the highest LLM counts are [12] and [58], with 10 and 8 LLMs, respectively.
Of all the included studies, only three references [31], [41], [59] used custom LLM configurations, adopting the Retrieval Augmented Generation (RAG) approach to further enhance the accuracy of the LLMs.
None of the included studies employed a translation service to cater to a non-English-speaking audience.
LLM Prompt-Related Variables
English was the overwhelmingly dominant prompting language, used in 66 studies. The remaining 3 studies [40], [54], [67] considered non-English queries: [40] examined LLMs in the context of Korean, while [54] and [67] investigated Mandarin Chinese.
As for the number of prompts/questions used in the studies, Supplementary Document 5, Figure 3 summarizes the ranges of question counts against their frequency. The maximum number of prompts examined in a single study was 150, by the authors of [58].
Figure 3. Distribution of patient education assessment frameworks.
Nine sources of questions were used in the reviewed studies (Supplementary Document 5, Figure 4). Professional organizations and physicians' expertise were the two primary sources of questions/prompts used to examine the LLMs; together, these sources were used 32 times to generate LLM prompts.
Generated PEM Assessment
An array of evaluation metrics for LLM responses was discussed across the included studies. Table 2 summarizes all the metrics used to assess the LLMs. Accuracy and readability were evaluated in 78% (n=54) and 73% (n=51) of the studies, respectively, while approximately 35% (n=24) measured the comprehensiveness of the generated responses.
Supplementary Document 6, Table 2 presents the number of evaluation metrics used to assess PEMs in each included study. The highest number of metrics, nine, was reported in [41]. The most common counts of metrics per study were 4 and 3, which combined account for approximately 55% (n=37).
PEMAT and DISCERN were the most frequently mentioned frameworks for LLM-PEM assessment, as highlighted by Figure 3, with a combined total of 24 mentions across the studies; some studies, such as [2] and [13], referenced both frameworks. Notably, 48 studies did not utilize any standardized assessment framework, indicating low overall reliance on such methods.
Accuracy was assessed 54 times across the included studies. Among these, only 3 distinct categories of accuracy measurement were identified. Supplementary Document 4, Table 3 summarizes the distribution of these methods. The most prevalent approach was reliance on expert (physician) opinion, accounting for approximately 80% (n=44) of all accuracy assessments. The second most common method, comparing LLM responses to clinical guidelines, was used 7 times.
Readability was the second most frequently assessed metric, measured in 51 studies and omitted in 18. Supplementary Document 5, Figure 5 illustrates the distribution of readability tests used across the included studies. The most widely deployed readability tests were FKL, FRE, and SMOG: FKL was mentioned 36 times, whereas FRE and SMOG were employed 24 and 15 times, respectively.
The analysis of readability test usage per study is presented in Supplementary Document 4, Table 4. Six studies employed the maximum count of seven readability tests, highlighting a strong emphasis on readability in PEM assessment. However, most studies measuring readability relied on a single assessment method, with such studies accounting for 50% (n=25) of all readability-assessing studies.
Out of all the included references, only three studies involved patients in their work [21], [40], [55]. In these studies, the authors incorporated patients’ opinions to assist in assessing the PEM generated by the LLMs, relying on feedback from 14, 5, and 46 patients, respectively.
Comparative Outcomes
Supplementary Document 5, Figure 6 illustrates the most accurate LLMs identified by the included studies that assessed accuracy. ChatGPT-4 emerged as the most dominant, reported as the most accurate in 21 instances. ChatGPT-3.5 and Bard followed, with 19 and 9 studies, respectively.
Supplementary Document 5, Figure 7 highlights the frequency with which each LLM was identified as generating the most readable responses. Similar to the accuracy results, ChatGPT-4 was found to be the most readable; however, it shared the top spot with Bard, with each identified as the most readable in 14 studies. ChatGPT-3.5 followed closely, recognized for generating the most readable responses in 12 studies.
5. DISCUSSION
Our findings highlight that while there is a growing interest in PEM generation using LLMs, there are still areas with untapped potential for exploration.
Of all the studies we investigated, only three explicitly mentioned the use of specialized external knowledge bases leveraging RAG. Implementing RAG is among the simplest approaches to transforming general-purpose LLMs into specialized domain agents, and it can help reduce the frequency of hallucinations typically associated with general-purpose LLMs (28). Although the inherent generality of LLMs allows them to cater broadly to various fields, this generality can be a limitation in domains requiring detailed, precise information. In the context of PEMs, a highly sensitive and accuracy-dependent field, this limitation is particularly pronounced. The authors of this study argue that integrating RAG-based frameworks could enhance the accuracy and reliability of PEMs generated by LLMs. Nevertheless, the practical effectiveness of RAG remains closely tied to the accuracy, quality, and cleanliness of the external knowledge bases used. Consequently, rigorous data validation, careful selection, and continuous updates of these knowledge bases are essential for fully realizing the potential benefits of RAG in medical contexts.
Another promising avenue is exploring open-source LLMs, such as Llama, DeepSeek, Qwen, and similar models, which have been increasingly popular due to their decentralized and accessible design. These open-source models can be operated locally, offering an alternative to cloud-based applications. This feature is especially important for PEMs, given the sensitive nature of patient data. The assumption is that when interacting with chatbots, patients might share personally identifying information to receive guidance or suggestions. By enabling local deployments, risks of data leaks can be minimized because no information is transferred outside the host computer that operates the open-source LLMs, enhancing data confidentiality and patient privacy. The authors believe that these models could effectively deliver essential patient education while providing stronger assurances of confidentiality compared to cloud-based, proprietary platforms such as ChatGPT, Gemini, and Copilot.
Another direction to consider is multilingual PEM generation. Currently, fewer than 20% of the global population speaks English (29), indicating that a significant portion of individuals globally may have limited access to patient education materials presented solely in English. Expanding PEM generation to multiple languages would support broader dissemination of healthcare information and potentially enhance patient comprehension across diverse linguistic groups. Examining PEMs across different languages could also provide insights into cultural and linguistic factors influencing the acceptability and effectiveness of such educational materials. This highlights the importance of including multiple languages in PEM development and evaluation processes. Therefore, future research should consider investigating the impact of multilingual PEMs, assessing user comprehension, and comparing user engagement across languages to support their applicability to diverse populations worldwide.
Finally, the assessment methodology of PEMs appears scattered across existing studies. Currently, there is no standardized approach or consistent set of criteria for evaluating the quality and effectiveness of PEMs, which makes comparing findings across different studies challenging. This variability in evaluation approaches limits the ability to reliably assess the overall quality, accuracy, and effectiveness of PEMs. Additionally, given the increasing integration of language models in PEM generation, new evaluation dimensions specific to LLM-generated content are required. Factors that are particularly relevant include the potential for bias, response time, reproducibility of outputs, and the occurrence of hallucinations. Incorporating these variables into a unified assessment framework would help researchers systematically evaluate the quality and reliability of PEMs. Therefore, future research efforts should aim to develop and validate a structured assessment methodology that includes these critical variables. Establishing such an approach could facilitate meaningful comparisons across studies, enhancing the transparency of PEM evaluations and fostering improvements in the accuracy and overall trustworthiness of PEMs. Additionally, further research could examine the interrelation between these metrics, explore potential trade-offs, and develop guidelines for effectively minimizing risks associated with LLM-generated PEMs.
6. CONCLUSION
This scoping review has mapped the current landscape of research evaluating the use of LLMs in generating PEMs. The adopted search strategy, which queried 5 databases, yielded 69 studies for review; thematic analyses were performed by extracting 21 variables from all included studies. ChatGPT-3.5 (n=56), ChatGPT-4 (n=36), and Bard (n=40) were the most evaluated platforms. ChatGPT-4 was found to be the most accurate (n=21), while it also shared the top spot with Bard for the most readable LLM (n=14). Ophthalmology and Oncology were the two most commonly evaluated topics, with accuracy and readability being the most frequently assessed metrics. However, the widespread variability and fragmentation in assessment methodologies, including the limited application of standardized frameworks such as PEMAT and DISCERN, indicate a clear need for the development and adoption of a comprehensive and standardized evaluation approach. Such an approach should integrate traditional evaluation criteria (e.g., accuracy, readability, understandability, and actionability) alongside LLM-specific metrics, such as bias, response time, reproducibility, and the presence of hallucinations. Additionally, the review highlighted sizeable gaps, particularly regarding multilingual PEM generation and the integration of LLM techniques like RAG. Expanding research to include multilingual PEM generation, particularly in widely spoken languages, is crucial to enhance global applicability and address disparities in patient understanding. Exploring open-source LLMs, which offer greater flexibility, data privacy, and customization potential, is another critical direction that may help mitigate the privacy concerns associated with cloud-based LLM platforms.
In conclusion, future research should focus on establishing a unified PEM assessment framework incorporating LLM-specific metrics, promoting multilingual PEM generation, and leveraging open-source and retrieval-augmented technologies to improve both the reliability and effectiveness of PEMs. This integrated approach hopes to enhance patient education outcomes globally, ensuring accurate, understandable, culturally relevant, safe, and reliable PEMs across diverse populations.
Acknowledgment:
We acknowledge the use of ChatGPT-4o and ChatGPT-4o-mini in the development of this paper. These platforms assisted in structuring our content, generating ideas, extracting the variables, and refining the draft. Their suggestions improved the organization and clarity of our work. However, all AI-generated outputs were carefully reviewed, edited, and integrated by the authors to ensure accuracy and maintain academic integrity. The final manuscript reflects our original insights and analysis, with AI serving solely as a supportive tool.
Authors’ contributions:
Both authors were involved in all steps of preparing this article, including final proofreading.
Conflict of interest:
None to declare.
Financial support and sponsorship:
None.
REFERENCES
- 1.Patient education. American Academy of Family Physicians. American Family Physician. 2000 Oct 1;62:1712-1714. [PubMed] [Google Scholar]
- 2.World Health Organization. Health Literacy: The Solid Facts. 2013. Accessed: 2024-12-11. Available from: https://apps.who.int/iris/handle/10665/326432. [Google Scholar]
- 3.Gutas D. Greek Thought, Arabic Culture: The Graeco-Arabic Translation Movement in Baghdad and Early ‘Abbasaid Society (2nd-4th/5th-10th c.) 1st ed. Routledge; 2012 Oct 12. Available from: https://www.taylorfrancis.com/books/9780203017432. [Accessed on: 2024 Nov 5] [DOI] [Google Scholar]
- 4.Nutton V. Ancient Medicine. 3rd ed. London: Routledge; 2023 Sep 5. Available from: https://www.taylorfrancis.com/books/9781003296102. [Accessed on: 2024 Nov 5] [DOI] [Google Scholar]
- 5.Paakkari L, Okan O. COVID-19: health literacy is an underestimated problem. The Lancet Public Health. 2020 May;5:e249–e250. doi: 10.1016/S2468-2667(20)30086-4.. Available from: https://linkinghub.elsevier.com/retrieve/pii/S2468266720300864. [Accessed on: 2024 Nov 11] [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Zulman DM, Verghese A. Virtual Care, Telemedicine Visits, and Real Connection in the Era of COVID-19: Unforeseen Opportunity in the Face of Adversity. JAMA. 2021 Feb 2;325:437. doi: 10.1001/jama.2020.27304. Available from: https://jamanetwork.com/journals/jama/fullarticle/2775696. [Accessed on: 2024 Dec 11] [DOI] [PubMed] [Google Scholar]
- 7.OpenAI. ChatGPT. Available from: https://openai.com/chatgpt/overview/. [Accessed on: 2024 Dec 11] [Google Scholar]
- 8.Google. Gemini AI. Available from: https://gemini.google.com/. [Accessed on: 2024 Dec 11] [Google Scholar]
- 9.Microsoft. Microsoft 365 Copilot. Available from: https://copilot.microsoft.com/. [Accessed on: 2024 Dec 11] [Google Scholar]
- 10.Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805. 2018 Available from: https://arxiv.org/abs/1810.04805 . [Google Scholar]
- 11.OpenAI. GPT-4 Technical Report. 2023 Available from: https://cdn.openai.com/papers/gpt-4.pdf . [Google Scholar]
- 12.HuggingFace. BART: Bidirectional and Auto-Regressive Transformers Documentation. Available from: https://huggingface.co/docs/transformers/model_doc/bart. [Accessed on: 2024 Dec 11] [Google Scholar]
- 13.Radford A, Narasimhan K, Salimans T, Sutskever I. Improving Language Understanding by Generative Pre-Training. 2018 Available from: https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf . [Google Scholar]
- 14.Gobara S, Kamigaito H, Watanabe T. Do LLMs Implicitly Determine the Suitable Text Difficulty for Users? arXiv preprint arXiv:2402.14453. 2024 Available from: https://arxiv.org/abs/2402.14453 . [Google Scholar]
- 15.Wong MF, Guo S, Hang CN, Ho SW, Tan CW. Natural Language Generation and Understanding of Big Code for AI-Assisted Programming: A Review. Entropy. 2023 Jun 1;25:888. doi: 10.3390/e25060888. Available from: https://www.mdpi.com/1099-4300/25/6/888. [Accessed on: 2024 Dec 11] [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Chen ZZ, Ma J, Zhang X, Hao N, Yan A, Nourbakhsh A, Yang X, McAuley J, Petzold L, Wang WY. A Survey on Large Language Models for Critical Societal Domains: Finance, Healthcare, and Law. arXiv preprint arXiv:2405.01769. 2024 Available from: https://arxiv.org/abs/2405.01769. [Google Scholar]
- 17.Cleveland Clinic. AI in Healthcare: Revolutionizing Diagnosis, Treatment, and Operations. Available from: https://health.clevelandclinic.org/ai-in-healthcare. [Accessed on: 2024 Dec 11] [Google Scholar]
- 18.Flesch R. A new readability yardstick. Journal of Applied Psychology. 1948;32:221–233. doi: 10.1037/h0057532. Available from: https://doi.apa.org/doi/10.1037/h0057532. [Accessed on: 2024 Dec 8] [DOI] [PubMed] [Google Scholar]
- 19.Kincaid JP, Fishburne RP, Rogers RL, Chissom BS. Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy enlisted personnel. Technical Report, Research Branch Report 8-75. Chief of Naval Technical Training. 1975 [Google Scholar]
- 20.McLaughlin GH. SMOG Grading - A New Readability Formula. Journal of Reading. 1969;12:639–646. [Google Scholar]
- 21.Charnock D, Shepperd S, Needham G, Gann R. DISCERN: an instrument for judging the quality of written consumer health information on treatment choices. Journal of Epidemiology Community Health. 1999 Feb 1;53:105–111. doi: 10.1136/jech.53.2.105. Available from: https://jech.bmj.com/lookup/doi/10.1136/jech.53.2.105. [Accessed on: 2024 Dec 12] [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Agency for Healthcare Research and Quality (AHRQ). The Patient Education Materials Assessment Tool (PEMAT) and User’s Guide. 2013. [Accessed on: 2024 Dec 11] [Google Scholar]
- 23.Aydin S, Karabacak M, Vlachos V, Margetis K. Large language models in patient education: a scoping review of applications in medicine. Frontiers in Medicine. 2024 Oct 29;11:1477898. doi: 10.3389/fmed.2024.1477898. Available from: https://www.frontiersin.org/articles/10.3389/fmed.2024.1477898/full. [Accessed on: 2025 Jan 1] [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Cochrane. Cochrane Library: PICO Search. Cochrane Library; 2025. Available from: https://www.cochranelibrary.com/about/pico-search. [Accessed on: 2025 Jan 14] [Google Scholar]
- 25.Rayyan: Systematic Reviews Web App. Rayyan Systems Inc.; 2025. Available from: https://www.rayyan.ai/. [Accessed on: 2025 Jan 14] [Google Scholar]
- 26.OpenAI. Embeddings. Available from: https://platform.openai.com/docs/guides/embeddings. [Accessed on: 2024 Dec 11] [Google Scholar]
- 27.OpenAI. New and improved embedding model. Available from: https://openai.com/index/new-and-improved-embedding-model/. [Accessed on: 2024 Dec 11] [Google Scholar]
- 28.Eberhard DM, Simons GF, Fennig CD. Ethnologue: Languages of the World. 26th ed. SIL International; 2023. Available from: https://www.ethnologue.com. [Google Scholar]