BMJ Health & Care Informatics. 2025 Jul 25;32(1):e101570. doi: 10.1136/bmjhci-2025-101570

Development and evaluation of an agentic LLM based RAG framework for evidence-based patient education

AlHasan AlSammarraie 1,, Ali Al-Saifi 2, Hassan Kamhia 3, Mohamed Aboagla 4, Mowafa Househ 1
PMCID: PMC12306375  PMID: 40713064

Abstract

Objectives

To develop and evaluate an agentic retrieval-augmented generation (ARAG) framework using open-source large language models (LLMs) for generating evidence-based Arabic patient education materials (PEMs), and to assess the LLMs' capabilities as validation agents tasked with blocking harmful content.

Methods

We selected 12 LLMs and applied four experimental setups (base, base+prompt engineering, ARAG, and ARAG+prompt engineering). PEM generation quality was assessed via a two-stage evaluation (automated LLM, then expert review) using five metrics (accuracy, readability, comprehensiveness, appropriateness and safety) against ground truth. Validation agent (VA) performance was evaluated separately on a harmful/safe PEM dataset, measuring blocking accuracy.

Results

ARAG-enabled setups yielded the best generation performance for 10/12 LLMs. Arabic-focused models occupied the top 9 ranks. Expert evaluation ranking mirrored the automated ranking. AceGPT-v2-32B with ARAG and prompt engineering (setup 4) was confirmed highest-performing. VA accuracy correlated strongly with model size; only models ≥27B parameters achieved >0.80 accuracy. Fanar-7B performed well in generation but poorly as a VA.

Discussion

Arabic-centred models demonstrated advantages for the Arabic PEM generation task. ARAG enhanced generation quality, although context limits impacted large-context models. The validation task highlighted model size as critical for reliable performance.

Conclusion

ARAG noticeably improves Arabic PEM generation, particularly with Arabic-centred models like AceGPT-v2-32B. Larger models appear necessary for reliable harmful content validation. Automated evaluation showed potential for ranking systems, aligning with expert judgement for top performers.

Keywords: Artificial intelligence, Large Language Models, Public Health, Public health informatics, Information Literacy


WHAT IS ALREADY KNOWN ON THIS TOPIC

  • Patient education is the cornerstone of modern healthcare. State-of-the-art large language models (LLMs) have strong conversational abilities; they can answer queries in various fields and have shown strong performance in answering patient questions in English. However, no previous study has evaluated the performance of open-source LLMs in answering patient questions in Arabic.

WHAT THIS STUDY ADDS

  • This study establishes a methodology to design a private agentic retrieval-augmented generation framework for evidence-based Arabic patient education materials. It then benchmarks this framework against various open-source LLMs on generating Arabic patient education materials. It also measures the capabilities of the LLMs on validating the tone and safety of Arabic medical content.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

  • The framework discussed in this study provides a blueprint for designing artificial intelligence systems that can serve Arabic speakers with trustworthy medical information. Extrapolation of this framework to other languages can deliver high-quality patient education to populations from any linguistic background while maintaining its fundamentally private and secure design.

Introduction

Patient education (PE) is a planned process intended to positively influence patients’ knowledge, attitudes, skills and health behaviours, empowering them to manage their health and adopt healthier lifestyles.1 PE contributes to improved health literacy and treatment adherence.2 3 Better health literacy correlates with improved health outcomes, while greater adherence is linked to fewer care episodes, reduced health expenditures and prevention of serious complications.4 However, traditional PE methods like PE material (PEM) leaflets face limitations, including time demands on clinicians, high costs for developing tailored materials, challenges ensuring readability for varying literacy levels, lack of personalisation and difficulties keeping content updated.5–7

Large language models (LLMs), artificial intelligence (AI) systems capable of understanding and generating human-like text,8 9 present a potential way to address limitations of traditional PEM methods, such as ensuring readability and meeting diverse patient needs, by enabling text tailoring to specific reading levels and facilitating translation.10 This capability is relevant for expanding PEM resources in languages like Arabic, where materials may be limited. Although preliminary studies suggest LLMs can generate empathetic and accurate responses to patient questions,11–13 their application as a reliable tool for PEM generation requires validation. This validation must encompass accuracy, safety, bias mitigation, ethical use and cultural appropriateness, particularly in patient-facing contexts such as PEM generation.9 11 14

The LLM landscape spans closed-source models, like OpenAI’s ChatGPT and Google’s Gemini, which often offer cutting-edge performance but limit transparency and customisation. In contrast, open-source models such as Microsoft’s Phi, Google’s Gemma and Alibaba’s Qwen prioritise transparency and community development.15 This openness benefits data privacy, task customisation and language tuning, as exemplified by the development of Arabic-centred models like Jais and AceGPT.

To enhance LLM reliability and capabilities, several techniques are used. Retrieval-augmented generation (RAG) improves response quality by grounding the LLM with information retrieved from external, verifiable knowledge sources.16 Prompt engineering involves optimising input prompts to guide LLMs towards generating more relevant outputs.17 AI agents leverage LLMs as a core reasoning engine, enabling systems to plan actions, interact with external tools and execute tasks to achieve complex goals such as web searching and response validation.18

Significant gaps exist in the literature regarding the use of LLMs for PEM generation. There is no research evaluating LLM-generated PEMs in Arabic, although some studies exist for English.19 20 Furthermore, the integration of RAG techniques to enhance PEM generation is underexplored. Finally, there is limited research which evaluates PEMs produced by open-source LLMs.9

This work aims to develop and evaluate an agentic RAG (ARAG) framework, leveraging LLMs, to generate evidence-based Arabic PEMs. Key contributions include:

  1. Establishing a pipeline for collecting and processing evidence-based Arabic PEM data.

  2. Evaluating various open-source LLMs (≤32B parameters), investigating the impact of prompt engineering and ARAG systems in enhancing PEM generation quality.

  3. Employing a two-stage evaluation (automated followed by expert review) to identify optimal LLM configurations for PEM generation.

  4. Assessing selected LLMs as validation agents (VAs) for blocking harmful patient advice.

Methods

Subject-matter experts

The clinical relevance and appropriateness of the evaluation framework were ensured through consultation with two subject-matter experts. HK, an internal medicine specialist with extensive PE experience, including service as the editor-in-chief of the King Abdullah Bin Abdulaziz Arabic Health Encyclopaedia, provided guidance on general medicine topics and PE best practices. Additionally, MA, an oncology specialist at Al Amal Hospital, Qatar, offered specialised oncology expertise. Both experts contributed to defining relevant clinical topics, validating the evaluation questions and assessing the generated LLM responses.

RAG data corpus

Online supplemental table S1 lists the reliable public websites from which we collected our 52k-article ARAG corpus. We selected these sources based on consultation with an expert in Arabic PEMs. The data were collected between August and October 2024 using the Firecrawl API in Python.
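The snippet below is a minimal sketch of this collection step, assuming the firecrawl-py client; the method name, parameters and return shape may differ across SDK versions, and the API key and URL are placeholders for the expert-approved sites listed in online supplemental table S1.

```python
# Hypothetical sketch of scraping one approved source with Firecrawl (firecrawl-py).
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_FIRECRAWL_API_KEY")  # placeholder credential

def scrape_source(url: str) -> str:
    """Return the page content as markdown for downstream chunking and embedding."""
    result = app.scrape_url(url)
    # Some SDK versions return a dict, others an object exposing a `markdown` field.
    return result.get("markdown", "") if isinstance(result, dict) else getattr(result, "markdown", "")
```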

Embedding models selection and chunking strategy

The model selection process identified reputable models supporting Arabic, primarily via the Massive Text Embedding Benchmark,21 excluding fine-tunes from smaller teams for reliability. The candidate models identified included ‘bge-m3’, ‘granite-embedding-278m-multilingual’, ‘static-similarity-mrl-multilingual-v1’ and ‘jina-embedding-v3’ (further details in online supplemental table S2).

A dedicated testing methodology using samples from the project’s Arabic medical corpus (three distinct article/question pairs: prostate cancer, lentil soup, breast cancer) determined the selection. For each candidate model, embeddings were generated for these six texts, and cosine similarities were computed between every article and every question, forming a 3×3 matrix. This approach aimed to quantify how well models separated unrelated topics (eg, cancer vs lentil soup) from related ones (breast vs prostate cancer). The goal was a model assigning high similarity to correctly matched pairs and low similarity to unrelated pairs. By comparing these matrices, the model showing the greatest contrast between correct and incorrect matches, ‘jina-embedding-v3’22 (hereafter ‘jina-v3’), was selected. Online supplemental table S3 presents these experimental results.
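As an illustration of this comparison, the following sketch computes the 3×3 article-question similarity matrix for one candidate model; the texts are placeholders for the corpus samples, and sentence-transformers is assumed as the loading library.

```python
# Sketch: contrast between matched and mismatched article-question pairs for one model.
import numpy as np
from sentence_transformers import SentenceTransformer

articles = ["<prostate cancer article>", "<lentil soup article>", "<breast cancer article>"]
questions = ["<prostate cancer question>", "<lentil soup question>", "<breast cancer question>"]

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)
a_emb = model.encode(articles, normalize_embeddings=True)   # shape (3, dim)
q_emb = model.encode(questions, normalize_embeddings=True)  # shape (3, dim)

# With normalised vectors, cosine similarity is the dot product; the diagonal holds
# the correctly matched pairs and the off-diagonal entries the mismatched ones.
sim = a_emb @ q_emb.T
contrast = sim.diagonal().mean() - sim[~np.eye(3, dtype=bool)].mean()
print(np.round(sim, 3), f"matched-vs-mismatched contrast: {contrast:.3f}")
```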

Sentence-wise chunking was selected as it avoids the context loss caused by splitting midsentence (unlike naive token chunking) and is less computationally demanding than semantic chunking.23 Following recommendations for RAG deployments suggesting 512–1024 tokens per chunk for scientific articles,23 we targeted approximately 512 tokens. This smaller size increases information granularity within the vector database, creating more, smaller chunks. This allows retrieval to access diverse relevant passages, mitigates bias from over-reliance on a single large chunk and supports complex queries that require information synthesised from multiple segments. To achieve the target chunk size given our data’s characteristics (online supplemental table S4), we calculated the number of sentences corresponding to 512 tokens. With an average of 44.53 tokens per sentence, this resulted in 11 sentences per chunk: 512 (tokens/chunk) ÷ 44.53 (tokens/sentence) ≈ 11 (sentences/chunk).
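A minimal sketch of this chunking step is shown below, assuming a simple regex sentence splitter; the splitter used in the actual pipeline may differ.

```python
# Sketch: group sentences into ~512-token chunks (11 sentences at ~44.53 tokens/sentence).
import re

SENTENCES_PER_CHUNK = 11

def chunk_article(text: str, size: int = SENTENCES_PER_CHUNK) -> list[str]:
    # Split on Arabic ('؟') and Latin sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!؟?])\s+", text) if s.strip()]
    # Join consecutive sentences into fixed-size chunks for embedding and storage.
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]
```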

LLMs and ARAG deployment

Given the sensitive nature of patient data and our future plans for patient data integration, our approach focused on open-source LLMs deployable locally. This ensures transparency for data privacy and minimises risks associated with external patient data transmission. An exception was made for Fanar,24 the Qatari sovereign LLM; although closed-source, its local development allows for on-premises deployment.

Practical challenges hinder deploying LLMs >32B parameters in local hospitals (eg, computational resources, costs and the complexities of importing AI graphics processing units (GPUs)), rendering larger models unsuitable for this project’s required scale.

Furthermore, we focused on models trained on Arabic data to process Arabic queries and the Arabic RAG corpus effectively.

Based on these constraints, four selection criteria were defined: (1) source: open-source/-weights (exception: Fanar), (2) size: ≤32B parameters, (3) language: Arabic support and (4) origin: reputable source. Applying these criteria via Hugging Face, web search and the Open Arabic LLM leaderboard25 resulted in the selection of the following 12 LLMs: Qwen-2.5 (7b, 14b, 32b), Phi-4 and Phi-4-mini, alongside Mistral-Small-2409, Gemma-2-27b, Fanar-7B, Falcon3-10B, Jais-family-13b, Jais-adapted-13b and AceGPT-v2-32B (details in online supplemental table S5).

To address PEM sensitivity and enhance safety, a VA was integrated into the RAG pipeline, after LLM response generation. The VA’s core task is to validate safety by assessing text for harmful or unsuitable information, performing minor revisions for appropriateness (eg, tone) and blocking harmful responses. This agent is implemented via a second inference call to the same base LLM, guided by a distinct, specialised prompt for validation and safety filtering (details in online supplemental table S6). Figure 1 depicts the proposed ARAG framework.
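The sketch below illustrates the agentic step under stated assumptions: `call_llm` is a hypothetical wrapper around whichever inference backend is in use, and the prompts merely stand in for the engineered prompts listed in online supplemental table S6.

```python
# Sketch: generate a PEM, then re-invoke the same base LLM as a validation agent.
REFUSAL = "I am sorry I cannot help with this"

def generate_validated_pem(question: str, context: str, call_llm) -> str:
    draft = call_llm(
        system="You are a patient education assistant. Answer in Arabic using only the provided context.",
        user=f"Context:\n{context}\n\nQuestion: {question}",
    )
    # Second inference call: validate safety, revise tone, or block the response outright.
    return call_llm(
        system=("You are a safety validator. If the text contains harmful or unsuitable medical "
                f"advice, reply exactly: '{REFUSAL}'. Otherwise return the text, lightly revised "
                "for an appropriate tone."),
        user=draft,
    )
```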

Figure 1. Diagram depicting our proposed ARAG framework for PEM generation. ARAG, agentic retrieval-augmented generation; LLM, large language model; PEM, patient education material.

Datasets for evaluations

For the PEM generation evaluation exercise, our question acquisition methodology was adapted from prior work.26 Candidate PEM topics within general medicine and oncology were initially identified through multiple rounds of phone consultation with the subject-matter experts for relevance to patient needs. An initial pool of 50 common patient questions (25 general medicine, 25 oncology) was compiled from medical sources (eg, American Academy of Ophthalmology, National Cancer Institute, National Health Service). This pool was then reviewed by the experts, who selected the final 20 questions (10 general medicine, 10 oncology) based on representativeness of typical patient inquiries. For the full list of questions and ground truth answers, consult online supplemental datasets, sheet ‘PEM Questions’.

To assess VA performance, a dataset of 50 PEM examples (details in online supplemental datasets, sheet ‘VA PEMs’) was created and physician-validated through an online form. The dataset included three categories featuring harmful medical advice, ranging in severity from potentially life-threatening (category 1) to unscientific (category 3), subtly integrated within otherwise safe-appearing PEMs to test the blocking capabilities of LLMs. Two non-harmful categories with different tones served as controls. Each of the five categories contributed 10 PEMs, totalling 50.

PEM generation experimental setups and inference

To evaluate the selected LLMs for Arabic PEM generation and assess our ARAG’s contribution, we designed four experimental setups: (1) base LLM performance; (2) base LLM with prompt engineering; (3) ARAG without prompt engineering and (4) ARAG with prompt engineering. Each of the 12 LLMs was evaluated under all configurations, totalling 48 experimental runs (12 LLMs×4 configurations). Figure 2 details the experimental setups and evaluation framework.

Figure 2. Overview of the evaluation framework, detailing the candidate LLMs, the experimental setups and the two-round evaluation. ARAG, agentic retrieval-augmented generation; LLM, large language model.

The adopted prompt engineering techniques for setups 2 and 4 can be found in online supplemental table S6. For configurations using ARAG (setups 3 and 4), the augmented context was limited to the top 3 chunks with the highest cosine similarity scores relative to the user’s query (≈1536 tokens). This standard context limit ensured compatibility across all evaluated LLMs by accommodating the smallest context window. To illustrate, the ‘Jais-family’ model has a 2K token capacity (online supplemental table S5), making it infeasible to supply more than 3 chunks (≈512 tokens each). The 1536 token limit was tested for sufficiency prior to results generation and was found to be adequate in 10 distinct scenarios.
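A minimal sketch of this retrieval step, assuming chunk embeddings precomputed with the same jina-v3 model and held as a normalised NumPy matrix (the framework itself uses a vector database):

```python
# Sketch: select the top-3 chunks by cosine similarity (~1536 tokens of context).
import numpy as np

def retrieve_top_chunks(query: str, model, chunk_texts: list[str],
                        chunk_embs: np.ndarray, k: int = 3) -> str:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_embs @ q                 # cosine similarity via dot product
    top = np.argsort(scores)[::-1][:k]      # indices of the k most similar chunks
    return "\n\n".join(chunk_texts[i] for i in top)
```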

For all 48 experiments, default inference settings were applied. Each evaluation question was formulated as a zero-shot input, augmented with contextual data and/or specific prompt engineering as dictated by the experimental setup. Three distinct inference methods were employed based on model characteristics (detailed in online supplemental table S7):

  • Local deployment: open-source models with ≤14B parameters were run locally using the Ollama framework on an NVIDIA RTX 3090 GPU.

  • Inference application programming interfaces (APIs): models >14B parameters or closed-source models like Fanar-7B were accessed via their respective APIs.

  • Dedicated cloud deployment: AceGPT-v2-32B, which was not available via standard APIs, required deployment on a dedicated cloud virtual machine equipped with an NVIDIA A100 GPU.
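For the local deployment path, a minimal sketch of a single inference call through the Ollama Python client is given below; the model tag is illustrative, and default settings are kept, as in the experiments.

```python
# Sketch: zero-shot inference against a locally served model via Ollama.
import ollama

def local_generate(prompt: str, model: str = "qwen2.5:14b") -> str:
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]
```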

PEM generation evaluation methodology

A two-stage process evaluated the 48 PEM generation experiments (figure 2): an initial automated assessment using an LLM, followed by manual evaluation by domain experts.

The evaluation metrics were carefully selected for this task. Accuracy, readability and comprehensiveness were chosen based on their identification as key criteria in a prior scoping review.9 Additionally, safety and appropriateness were included following recommendations from a PEM specialist. These metrics are defined as follows: accuracy (factual correctness); readability (ease of language); comprehensiveness (how fully the question was addressed); appropriateness (suitability of tone, style and cultural context) and safety (absence of harmful or misleading advice).

Both evaluation stages assessed responses using identical metrics against ground truth answers on a 1–5 Likert scale. The initial automated stage employed ChatGPT o3-mini27 as the evaluator, guided by an engineered prompt (details in online supplemental table S6). During this stage, a language validation rule was applied, assigning a zero score across all metrics to responses containing non-Arabic sentences; this aimed to ensure suitability for the target audience and avoid compromising patient engagement or trust due to language switching. Based on the resulting scores, the top 5 experiments were identified. In the second stage, domain experts independently re-evaluated these top 5 configurations using the same criteria via online forms displaying the question, the ground truth answer and the LLM response. Experts performed the evaluations blindly, unaware of the models’ identities.
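The language validation rule can be sketched as follows, under the assumption that a response is zeroed when any of its sentences contains no Arabic characters; the rule actually enforced through the evaluator prompt may be stricter or looser.

```python
# Sketch: zero all metric scores for responses that switch out of Arabic.
import re

ARABIC = re.compile(r"[\u0600-\u06FF]")
METRICS = ("accuracy", "readability", "comprehensiveness", "appropriateness", "safety")

def apply_language_rule(response: str, scores: dict[str, int]) -> dict[str, int]:
    sentences = [s for s in re.split(r"(?<=[.!؟?])\s+", response) if s.strip()]
    if any(not ARABIC.search(s) for s in sentences):
        return {m: 0 for m in METRICS}   # a non-Arabic sentence zeroes every metric
    return scores
```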

VA evaluation methodology

To evaluate the candidate LLMs’ ability to function as VAs blocking harmful PEMs, an independent test benchmarked their instruction-following and harmful content detection. LLMs were instructed via few-shot examples to respond with ‘I am sorry I cannot help with this’ to harmful PEMs; adherence was measured by the cosine similarity (jina-v3 embeddings) between the LLM’s response and this target phrase. Based on a determined similarity threshold, responses were classified as refusals (‘Yes’) or not (‘No’). Comparing these classifications against ground truth yielded standard metrics: true positive, false positive, false negative and true negative. These were used to calculate the overall accuracy of each LLM as a VA.
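A minimal sketch of this classification and scoring, assuming jina-v3 loaded through sentence-transformers and an illustrative similarity threshold (the threshold used in the study is not reproduced here):

```python
# Sketch: classify VA responses as refusals by similarity to the target phrase, then score.
import numpy as np

REFUSAL = "I am sorry I cannot help with this"

def va_accuracy(responses: list[str], should_block: list[bool],
                model, threshold: float = 0.8) -> float:
    embs = model.encode(responses + [REFUSAL], normalize_embeddings=True)
    sims = embs[:-1] @ embs[-1]                  # similarity of each response to the refusal phrase
    refused = sims >= threshold                  # classified as a refusal ('Yes') or not ('No')
    correct = refused == np.array(should_block)  # true positives and true negatives
    return float(correct.mean())                 # accuracy = (TP + TN) / total
```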

Results

First round of PEM generation evaluations

The first evaluation stage involved an automated assessment of all 48 experiments using ChatGPT o3-mini, scoring generated PEM responses against predefined metrics and reference answers.

Results, detailed in table 1, indicated a performance trend favouring ARAG-enabled setups (setups 3 and 4), which yielded the highest scores for 10 of the 12 models. The effect of prompt engineering on performance varied across LLMs. Some models benefited substantially from it (AceGPT-V2-32b, Jais-adapted-13b, Mistral-small-22b), while others demonstrated less sensitivity, with performance remaining stable or degrading (Fanar-S-1-7b, Gemma-2-27b), highlighting that the impact of the prompting strategy was not universal.

Table 1. Numerical results of the first round of evaluations using ChatGPT-o3 mini as the evaluator.

Rank Model Setup Average general medicine Average oncology Overall average
1 AceGPT-v2-32b 4 4.68 4.70 4.69
5 AceGPT-v2-32b 2 4.40 4.54 4.47
11 AceGPT-v2-32b 1 4.50 4.20 4.35
13 AceGPT-v2-32b 3 4.28 4.28 4.28
2 Fanar-S-1-7b 3 4.62 4.74 4.68
4 Fanar-S-1-7b 4 4.62 4.50 4.56
6 Fanar-S-1-7b 2 4.62 4.30 4.46
7 Fanar-S-1-7b 1 4.60 4.32 4.46
3 Jais-adapted-13b 2 4.66 4.56 4.61
15 Jais-adapted-13b 4 4.32 4.14 4.23
16 Jais-adapted-13b 1 4.08 4.36 4.22
29 Jais-adapted-13b 3 3.66 2.92 3.29
8 Jais-family-13b 4 4.12 4.78 4.45
20 Jais-family-13b 3 3.48 4.64 4.06
23 Jais-family-13b 2 3.84 4.10 3.97
47 Jais-family-13b 1 1.04 1.84 1.44
9 Qwen2.5-14b 3 4.54 4.30 4.42
10 Qwen2.5-14b 4 4.70 4.10 4.40
18 Qwen2.5-14b 1 4.20 4.00 4.10
24 Qwen2.5-14b 2 4.12 3.72 3.92
12 Gemma-2-27b 1 4.88 3.74 4.31
14 Gemma-2-27b 3 5.00 3.46 4.23
25 Gemma-2-27b 2 3.00 4.36 3.68
34 Gemma-2-27b 4 3.84 1.96 2.90
17 Phi4-14b 4 4.08 4.20 4.14
21 Phi4-14b 3 3.96 4.06 4.01
26 Phi4-14b 2 3.42 3.72 3.57
27 Phi4-14b 1 3.08 3.86 3.47
19 Qwen2.5-32b 3 3.80 4.34 4.07
22 Qwen2.5-32b 1 3.40 4.56 3.98
28 Qwen2.5-32b 4 3.38 3.44 3.41
37 Qwen2.5-32b 2 3.28 2.42 2.85
30 Qwen2.5-7b 3 3.30 3.10 3.20
31 Qwen2.5-7b 1 2.96 3.42 3.19
33 Qwen2.5-7b 4 2.58 3.32 2.95
36 Qwen2.5-7b 2 2.40 3.36 2.88
32 Phi4-mini-3.8b 3 3.28 2.72 3.00
35 Phi4-mini-3.8b 4 3.52 2.26 2.89
41 Phi4-mini-3.8b 2 1.86 2.46 2.16
44 Phi4-mini-3.8b 1 1.64 2.06 1.85
38 Mistral-small-22b 4 2.76 2.80 2.78
39 Mistral-small-22b 3 2.38 2.60 2.49
42 Mistral-small-22b 2 1.82 2.04 1.93
45 Mistral-small-22b 1 1.46 1.62 1.54
40 Falcon3-10b 3 2.36 2.02 2.19
43 Falcon3-10b 4 1.92 1.90 1.91
46 Falcon3-10b 1 1.54 1.44 1.49
48 Falcon3-10b 2 1.16 1.66 1.41

Analysis of the top five ranked experiments showed they were all occupied by Arabic-focused LLMs (AceGPT-v2-32B (setups 4 and 2), Fanar-S-1-7B (setups 3 and 4) and Jais-adapted-13B (setup 2)), which performed better on this task than the general-purpose models tested. Jais-family-13b (setup 4) showed the strongest oncology performance (4.78); despite this, its overall average (4.45) placed it eighth due to a lower general medicine score (4.12).

The three leading non-Arabic-centred models were Qwen2.5-14b (Setup 3, ninth rank), Gemma-2-27b (Setup 1, 12th rank) and Phi4-14b (Setup 4, 17th rank). An interesting outlier was Gemma-2-27b (setup 3), which achieved a perfect average score in general medicine (5) but was held back to an overall rank of 14th by a low oncology score (3.46). For the full list of results including individual question scores for all LLMs under all setups, please consult online supplemental results, sheet ‘Round 1’.

Second round of PEM generation evaluations

In the second evaluation stage, domain experts manually assessed the five top-performing configurations using the identical metrics, scoring scale and ground truth answers from the first round to validate results (detailed in table 2). A key finding was the exact match between the expert ranking and the automated ranking for these top five configurations. Although absolute scores differed between the evaluation stages (largely in general medicine), the consistent relative ranking validated the initial automated findings. Expert assessment confirmed AceGPT-v2-32B (setup 4) as the highest-performing approach. The comprehensive results from the expert evaluations can be found in online supplemental results, sheet ‘Round 2’.

Table 2. Numerical results comparing the top 5 experiments from both rounds of evaluations.

Rank Model Setup General average Oncology average Overall average
First round of evaluation (automated LLM evaluation)
 1 Acegpt-v2-32b 4 4.68 4.70 4.69
 2 Fanar-S-1-7B 3 4.62 4.74 4.68
 3 Jais-adapted-13b 2 4.66 4.56 4.61
 4 Fanar-S-1-7B 4 4.62 4.50 4.56
 5 Acegpt-v2-32B 2 4.40 4.54 4.47
Second round of evaluation (expert evaluation)
 1 Acegpt-v2-32b 4 3.68 4.96 4.32
 2 Fanar-S-1-7B 3 3.76 4.70 4.23
 3 Jais-adapted-13b 2 3.68 4.76 4.22
 4 Fanar-S-1-7B 4 3.98 4.42 4.20
 5 Acegpt-v2-32B 2 3.72 4.58 4.15

LLM, large language model.

VA results

VA performance across the 50 PEM examples (table 3) varied significantly and correlated strongly with model size. Top performers like ‘Gemma-2-27b’ (0.82 accuracy), ‘Qwen2.5-32b’ (0.80 accuracy) and ‘AceGPT-v2-32b’ (0.78 accuracy) reliably blocked harmful content from categories 1 and 2 (the most harmful PEMs), while smaller models including the ‘Jais’ variants and ‘Fanar-S-1-7B’ performed poorly (0.40 accuracy). Models under 14B parameters generally scored below 0.50 accuracy, whereas only those with ≥27B parameters exceeded 0.80, suggesting sufficient parameter capacity is needed for reliable content validation in PE workflows. It is noteworthy that the full results from the VA evaluation (online supplemental results, sheet ‘VA Eval’) show that even the top performers failed to dependably block harmful PEMs belonging to category 3 (unscientific or time-wasting medical advice).

Table 3. Performance of the LLMs as VAs.

Model TP FP FN TN Accuracy
≥14B parameters
 Gemma-2-27b 21 0 9 20 0.82
 Qwen2.5-32b 20 0 10 20 0.80
 Acegpt-v2-32b 20 1 10 19 0.78
 Qwen2.5-14b 16 0 14 20 0.72
 Phi4-14b 16 3 14 17 0.66
 Mistral-small-22b-2409 6 3 24 17 0.46
<14B parameters
 Falcon3-10b 9 6 21 14 0.46
 Phi4-mini-3.8b 1 0 29 20 0.42
 Qwen2.5-7b 2 0 28 20 0.44
 Jais-adapted-13b 0 0 30 20 0.40
 Jais-family-13b 0 0 30 20 0.40
 Fanar-S-1-7B 0 0 30 20 0.40

FN, false negative; FP, false positive; LLM, large language model; TN, true negative; TP, true positive; VA, validation agent.

Discussion

Key findings and observations

A key finding is the superior performance of LLMs specifically developed for Arabic. Models with a focus on this language (AceGPT-v2-32B, Fanar-S-1-7B, Jais-adapted-13B) consistently occupied the top ranks, with the first non-Arabic-centric model, Qwen2.5-14b, not appearing until the ninth position. This performance gap suggests that an LLM’s efficacy is shaped by its training data, with language-specific corpora providing a critical advantage in syntax and cultural context that generalised models struggle to replicate.

ARAG-enabled setups also showed superior performance in answering medical-related questions, a finding consistent with a similarly positioned study.28

Furthermore, the results challenge the paradigm that larger parameter counts inherently lead to better performance, highlighting the importance of model architecture and efficiency. This is most evident in the ‘Fanar Anomaly’, where the 7B Fanar-S-1-7B model surpassed several much larger models, with all four of its configurations ranking in the top eight. This, along with non-linear scaling in the Qwen family (where the 14B variant outperformed the 32B), demonstrates that an optimised architecture can be more impactful than size alone.

The inconsistent adherence of some multilingual models to the target language was notable. Despite receiving Arabic-only inputs and RAG context, several models (eg, Qwen-7b/32b, Gemma-2-27b, Jais-family-13b) occasionally produced non-Arabic text, whereas Qwen2.5-14b adhered to Arabic. This suggests potential interference from multilingual training and indicates that RAG context alone may not be sufficient to fully control language output in some models.

Performance consistency also varied greatly. Fanar-S-1-7B stood out as the most robust model, with all four of its configurations landing in the top 7 ranks. In stark contrast, Jais-family-13b was the least consistent, with ranks ranging from a strong eighth place (setup 4) to a very poor 47th (setup 1). This wide variance highlights that some models are inherently more stable, while others are highly dependent on specific augmentation techniques like ARAG to achieve acceptable results.

Finally, the results reveal a critical distinction between a model’s generative and validation abilities. Fanar-S-1-7B is a prime example; despite its exceptional performance as a PEM generator, it struggled significantly as a VA, failing to block any harmful content. This disparity underscores that generating fluent, contextually relevant text is a different skill from critically assessing content against safety rules.

Limitations

First, the field of LLMs evolves rapidly; findings based on models available up to our January 2025 cut-off may become less relevant over time, potentially impacting the observed relative rankings.

Another methodological constraint limited RAG context input to approximately 1536 tokens (top 3 chunks) for fairness across models with varying capacities (eg, accommodating the 2K window of ‘Jais-family’). This prevented larger-window models like Qwen (128K) from fully using their capacity during RAG generation. Further investigation using larger contexts is warranted.

An additional limitation is the reliance on an LLM-based evaluator for the first evaluation round. Despite the agreement between the human and LLM-based evaluations among the top-ranked experiments, it should be noted that this agreement may not generalise to lower-quality outputs. LLM evaluators can suffer from bias or miss subtle safety or clinical context, reaffirming the need for expert validation even when automated and expert rankings align.

Another key limitation is the study’s focus on models of at most 32B parameters, a constraint dictated by the practicalities of local deployment in healthcare settings. As the results indicate a correlation between model size and performance, especially for validation tasks, it is plausible that state-of-the-art models larger than 32B, which were excluded from this evaluation, could offer superior capabilities.

Finally, this study utilised default inference parameters for consistency across configurations. Optimising settings like ‘temperature’ or ‘top-p’ for specific models could potentially yield different performance outcomes than those reported here.

Conclusion and future work

This study developed and evaluated an ARAG framework using open-source LLMs (≤32B parameters) to generate evidence-based Arabic PEMs. (1) A pipeline for collecting and processing suitable Arabic PEM data was successfully established. (2) The evaluation of various LLMs, prompt engineering and the ARAG system confirmed ARAG’s positive contribution, improving performance for most models evaluated, and showed that Arabic-centred LLMs generally outperformed general-purpose ones for this task. (3) Through a two-stage evaluation involving automated assessment and expert review, the optimal configuration was identified as setup 4, using AceGPT-v2-32B combined with both ARAG and prompt engineering. (4) Furthermore, the assessment of these LLMs as VAs indicated that performance in blocking harmful content correlated strongly with model size, although significant limitations remain, particularly in detecting subtle or unscientific harmful advice. Future work should focus on enhancing VA reliability, exploring the impact of parameter tuning and publishing a platform with which patients can interact to generate Arabic PEMs. This platform, or future versions of it, has the potential to deliver the foundational PE that the Arabic-speaking population needs, providing a personalised experience at an unprecedented scale when deployed in a clinical setting.

Supplementary material

online supplemental file 1
bmjhci-32-1-s001.pdf (174.3KB, pdf)
DOI: 10.1136/bmjhci-2025-101570
online supplemental file 2
bmjhci-32-1-s002.xlsx (43.7KB, xlsx)
DOI: 10.1136/bmjhci-2025-101570
online supplemental file 3
bmjhci-32-1-s003.xlsx (1MB, xlsx)
DOI: 10.1136/bmjhci-2025-101570

Acknowledgements

We acknowledge the use of ChatGPT-o3 mini and Gemini 2.5 in the development of this paper. These platforms assisted in structuring our content, generating ideas and refining the draft. Their suggestions improved the organisation and clarity of our work. However, all AI-generated outputs were carefully reviewed, edited and integrated by the authors to ensure accuracy and maintain academic integrity. The final manuscript reflects our original insights and analysis, with AI serving solely as a supportive tool.

Footnotes

Funding: The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

Provenance and peer review: Not commissioned; externally peer reviewed.

Patient consent for publication: Not applicable.

Ethics approval: Not applicable.

Data availability statement

All data relevant to the study are included in the article or uploaded as supplementary information.

References

  • 1.Rankin SH, Stallings KD. Patient education: principles & practices. 4th ed. Philadelphia, PA: Lippincott; 2001. p. 432. [Google Scholar]
  • 2.Sørensen K, Van den Broucke S, Fullam J, et al. Health literacy and public health: a systematic review and integration of definitions and models. BMC Public Health. 2012;12:80. doi: 10.1186/1471-2458-12-80. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Huang L, Yu W, Ma W, et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans Inf Syst. 2025;43:1–55. doi: 10.1145/3703155. [DOI] [Google Scholar]
  • 4.Chen Y-C, Wang Y-C, Chen W-K, et al. The effectiveness of a health education intervention on self-care of traumatic wounds. J Clin Nurs. 2013;22:2499–508. doi: 10.1111/j.1365-2702.2012.04295.x. [DOI] [PubMed] [Google Scholar]
  • 5.Kianian R, Sun D, Crowell EL, et al. The Use of Large Language Models to Generate Education Materials about Uveitis. Ophthalmol Retina. 2024;8:195–201. doi: 10.1016/j.oret.2023.09.008. [DOI] [PubMed] [Google Scholar]
  • 6.Daraz L, Morrow AS, Ponce OJ, et al. Can Patients Trust Online Health Information? A Meta-narrative Systematic Review Addressing the Quality of Health Information on the Internet. J Gen Intern Med. 2019;34:1884–91. doi: 10.1007/s11606-019-05109-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Berland GK, Elliott MN, Morales LS, et al. Health information on the Internet: accessibility, quality, and readability in English and Spanish. JAMA. 2001;285:2612–21. doi: 10.1001/jama.285.20.2612. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Naveed H, Khan AU, Qiu S, et al. A Comprehensive Overview of Large Language Models. arXiv. 2024 doi: 10.48550/arXiv.2307.06435. [DOI] [Google Scholar]
  • 9.AlSammarraie A, Househ M. The Use of Large Language Models in Generating Patient Education Materials: a Scoping Review. Acta Inform Med. 2025;33:4–10. doi: 10.5455/aim.2024.33.4-10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Abbasian M, Azimi I, Rahmani AM, et al. Conversational Health Agents: A Personalized LLM-Powered Agent Framework. arXiv. 2024:2310.02374. doi: 10.48550/arXiv.2310.02374. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Lambert R, Choo Z-Y, Gradwohl K, et al. Assessing the Application of Large Language Models in Generating Dermatologic Patient Education Materials According to Reading Level: Qualitative Study. JMIR Dermatol . 2024;7:e55898. doi: 10.2196/55898. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Luo M, Warren CJ, Cheng L, et al. Assessing Empathy in Large Language Models with Real-World Physician-Patient Interactions. arXiv. 2024 doi: 10.48550/ARXIV.2405.16402. [DOI] [Google Scholar]
  • 13.Koranteng E, Rao A, Flores E, et al. Empathy and Equity: Key Considerations for Large Language Model Adoption in Health Care. JMIR Med Educ. 2023;9:e51199. doi: 10.2196/51199. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Aydin S, Karabacak M, Vlachos V, et al. Large language models in patient education: a scoping review of applications in medicine. Front Med (Lausanne) 2024;11:1477898. doi: 10.3389/fmed.2024.1477898. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Redress Compliance. Open source vs closed source large language models. 2025. Available: https://redresscompliance.com/open-source-vs-closed-source-largelanguage-models/ [Accessed 20 Apr 2025].
  • 16.Lewis P, Perez E, Piktus A, et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv. 2021 doi: 10.48550/arXiv.2005.11401. [DOI] [Google Scholar]
  • 17.Liu P, Yuan W, Fu J, et al. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput Surv. 2023;55:1–35. doi: 10.1145/3560815. [DOI] [Google Scholar]
  • 18.Xi Z, Chen W, Guo X, et al. The Rise and Potential of Large Language Model Based Agents: A Survey. arXiv. 2023 doi: 10.48550/arXiv.2309.07864. [DOI] [Google Scholar]
  • 19.Azzopardi M, Ng B, Logeswaran A, et al. Artificial intelligence chatbots as sources of patient education material for cataract surgery: ChatGPT-4 versus Google Bard. BMJ Open Ophthalmol. 2024;9:e001824. doi: 10.1136/bmjophth-2024-001824. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Durmaz Engin C, Karatas E, Ozturk T. Exploring the Role of ChatGPT-4, BingAI, and Gemini as Virtual Consultants to Educate Families about Retinopathy of Prematurity. Children (Basel) 2024;11:750. doi: 10.3390/children11060750. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Muennighoff N, Tazi N, Magne L, et al. MTEB: Massive Text Embedding Benchmark. arXiv. 2022 doi: 10.18653/v1/2023.eacl-main.148. [DOI] [Google Scholar]
  • 22.Sturua S, Mohr I, Akram MK, et al. jina-embeddings-v3: Multilingual Embeddings With Task LoRA. arXiv. 2024 doi: 10.48550/arXiv.2409.10173. [DOI] [Google Scholar]
  • 23.Wang X, Wang Z, Gao X, et al. Searching for Best Practices in Retrieval-Augmented Generation. arXiv. 2024 doi: 10.48550/ARXIV.2407.01219. [DOI] [Google Scholar]
  • 24.Abbas U, Ahmad MS, Alam F, et al. Fanar: An Arabic-centred Multimodal Generative AI Platform. arXiv. 2025 doi: 10.48550/ARXIV.2501.13944. [DOI] [Google Scholar]
  • 25.El Filali A, Alobeidli H, Fourrier C, et al. Open Arabic LLM leaderboard. 2024. Available: https://huggingface.co/spaces/OALL/Open-Arabic-LLMLeaderboard
  • 26.Mashatian S, Armstrong DG, Ritter A, et al. Building Trustworthy Generative Artificial Intelligence for Diabetes Care and Limb Preservation: a Medical Knowledge Extraction Case. J Diabetes Sci Technol. 2024:19322968241253568. doi: 10.1177/19322968241253568. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.OpenAI. OpenAI o3-mini. 2025. Available: https://openai.com/index/openai-o3mini/ [Accessed 26 Apr 2025].
  • 28.Low YS, Jackson ML, Hyde RJ, et al. Answering real-world clinical questions using large language model, retrieval-augmented generation, and agentic systems. Digit Health. 2025;11:20552076251348850. doi: 10.1177/20552076251348850. [DOI] [PMC free article] [PubMed] [Google Scholar]
