AMIA Annual Symposium Proceedings. 2026 Feb 14;2025:1051–1060.

Design and Evaluation of EMPATHICA: A Chatbot for Enhancing Medication Literacy

Yuri Quintana 1, Katherine Bloom 2, Gyana Srivastava 2,3, Ava Homiar 2, Annlouise Assaf 4, Glenda Thomas 5, Elizabeth Lowe 5, David Hampton 6, Viola Wontor 4
PMCID: PMC12919623  PMID: 41726533

Abstract

Health literacy significantly impacts patient outcomes, yet many patients struggle to understand complex medication information. Medication non-adherence often results from poor comprehension of drug instructions and contributes to preventable hospitalizations and poor treatment outcomes. With the increasing use of digital health interventions, AI-powered chatbots present an opportunity to improve patient access to understandable and personalized medication information. This study evaluates the usability, accessibility, and effectiveness of EMPATHICA, an AI-powered chatbot designed to provide patient-centric medication information. The study assesses whether chatbot-generated responses improve patient comprehension and engagement. The research observed interactions between participants and the web-based application, in which participants posed questions to the chatbot; usability and accuracy were assessed using qualitative and quantitative measures and expert physician evaluation of the responses. AI-driven chatbots have the potential to bridge health literacy gaps by providing clear and accessible medication information. By evaluating EMPATHICA, this study contributes to the growing field of AI applications supporting patient-informed medication use.

Keywords: Health literacy, Chatbot, Medication adherence, Patient Education

1. Introduction

Health literacy significantly impacts patient health outcomes, yet studies indicate that only 12% of Americans exhibit proficient health literacy, leaving a significant proportion of the population struggling to understand medical information 1. The U.S. Department of Health and Human Services reports that more than a third of the U.S. population has only basic or below-basic health literacy skills 2. This gap disproportionately affects low-literacy populations, increasing the risk of medication non-adherence, hospitalizations, and preventable adverse drug events 3. Medication non-adherence alone contributes to 125,000 deaths annually and accounts for up to 50% of treatment failures 4.

Beyond literacy barriers, the complexity of drug interactions presents an additional challenge. Research has shown that gender and age biases exist in drug administration, and understanding drug-drug interactions is becoming increasingly difficult as the number of prescribed medications rises 5. Patients require solutions that simplify medication information and provide contextual guidance tailored to their health status. Misinterpretation of prescription labels further exacerbates medication adherence issues, leading to adverse drug reactions and suboptimal therapeutic outcomes 6. Studies have shown that simplified, literacy-friendly medication summaries can improve patient comprehension from 50% to 71% 7. However, with the growing complexity of medication regimens, patients require more accessible and personalized tools to navigate drug interactions, dosage instructions, and potential side effects 5.

The DCI Network 8 has developed EMPATHICA (Empower, Medication information, Patient-centric, Accessible, Technology-driven, Healthcare journey, Innovative, Communication, AI-powered), an AI-powered chatbot designed to address these challenges and provide clear, reliable, and patient-specific medication information. This study describes the development process of EMPATHICA, including stakeholder engagement, technical design, and evaluation methods. By leveraging AI-driven personalization, EMPATHICA aims to bridge existing gaps in medication literacy, offering relevant, accessible, and user-friendly guidance tailored to individual patient needs. The importance of patient-centered medication information cannot be overstated. The FDA regulates and approves Patient Package Inserts (PPIs) 9 in the medication package, but evidence suggests that a one-size-fits-all approach does not meet diverse patient needs 10. This research seeks to determine to what extent EMPATHICA, an AI-powered chatbot, enhances the usability, accessibility, and accuracy of medication information for patients with varying health literacy levels. This study contributes to the evolving landscape of digital health interventions by investigating the integration of AI-driven solutions in patient education. If effective, EMPATHICA could serve as a scalable model for improving medication literacy, reducing health disparities, and enhancing patient outcomes through AI-enhanced communication.

2. Methods

2.1. Design Workshop

The DCI Network convened a multidisciplinary group comprising healthcare professionals, informaticians, patient advocates, and AI researchers. The collaboration followed a co-design approach, integrating user feedback at each stage 5. The iterative development process included the following:

  • User Needs Assessment: Identifying barriers in medication comprehension through literature review and patient focus groups.

  • Prototype Development: Building chatbot prototypes incorporating conversational AI, FHIR-based medication databases, and multi-modal content (text, visuals, and video).

  • Pilot Testing: Evaluating usability and comprehension with a small but diverse group of patients.

The June 2023 DCI Network retreat at Harvard University involved stakeholders in co-creating and evaluating new user interfaces for medication information dissemination 11. Subsequently, the DCI Network Conference and Retreat on Patient Engagement, held at the Harvard Faculty Club in June 2024, focused on diverse perspectives on patient-centric AI chatbots and mobile applications for medication management. Key discussions included patient voices, digital strategies, behavioral economics, and gamification to enhance patient adherence. The event also sought to review best practices for chatbot usability, effectiveness, and safety while fostering interdisciplinary collaborations among healthcare professionals, technologists, and patient advocacy groups. The conference was followed by a one-day retreat featuring interactive sessions, including brainstorming discussions on engagement strategies for AI-driven medication tools. An AI Chatbot Evaluation Workshop examined performance metrics, data privacy, and patient safety, with breakout groups drafting preliminary evaluation protocols. The retreat concluded by synthesizing key insights and producing a roadmap for continued collaboration in advancing AI-driven patient engagement solutions. The retreat also established a set of evaluation objectives to measure patient comprehension and usability testing to ensure chatbot-generated information aligned with patient needs and literacy levels 12.

2.2. Technical Architecture

The chatbot leverages a hybrid approach of cloud-based and open-source large language models (LLMs) for natural language understanding and generation. Mistral was the only LLM tested in both cloud-hosted and open-source, locally hosted configurations. EMPATHICA integrates:

  • Cloud-Based Models: OpenAI’s ChatGPT 13, Google’s Gemini 14, Mistral 15.

  • Open-Source Models: Mistral 15, Meta’s LLaMA 16, Google’s Gemma 17, and DeepSeek 18

EMPATHICA’s architecture incorporates Ollama, a platform that enables the local deployment of large language models, as a central component for managing and connecting various cloud-based and open-source LLMs. Acting as an intermediary, Ollama facilitates the integration of diverse models, such as OpenAI’s ChatGPT, Google’s Gemini, Mistral, Meta’s LLaMA, and Google’s Gemma, offering enhanced privacy, reduced latency, and greater data control in the process. The main drawback of locally hosting LLMs is the server requirements: substantial compute is needed to keep the system responsive. EMPATHICA additionally integrates a front-end powered by the open-source OpenWebUI, offering users an intuitive interface to interact with the Ollama models. This setup ensures seamless querying and accessibility for managing various language models within the system.

Ollama and OpenWebUI (EMPATHICA) operate within a Docker container hosted on an NVIDIA-enabled EC2 server in the Amazon Cloud. This enables a quick response time for user queries by ensuring that LLM processing is handed off to the GPU instead of the CPU. A Retrieval-Augmented Generation (RAG) 19 approach is implemented to ensure accuracy and relevance. When a user asks a question, Ollama’s RAG system first retrieves relevant information from a knowledge base, which is a vector database containing FDA-approved medication labels and a prompt guide with guardrail guidelines. This retrieved information is then fed to the chosen LLM, allowing it to generate a response grounded in verified data tailored to the user’s query and literacy level while preventing hallucinations. This architecture optimizes responses by providing a structured and reliable information retrieval process before the LLM generates the final answer.
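The retrieve-then-generate flow described above can be sketched as follows. This is a minimal illustration: a toy bag-of-words retriever stands in for the vector database and embedding model, and the label snippets are invented placeholders, not actual FDA text.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for a real vector encoder."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stand-in knowledge base: illustrative label snippets, not real FDA labels.
KNOWLEDGE_BASE = [
    "acetaminophen: do not exceed 4000 mg per day; overdose may cause liver damage",
    "ondansetron: used to prevent nausea and vomiting caused by chemotherapy",
    "filgrastim: stimulates white blood cell production after chemotherapy",
]

def retrieve(query: str, k: int = 1) -> list:
    """Return the k label snippets most similar to the query."""
    q = embed(query)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Ground the eventual LLM call in retrieved label text plus guardrails."""
    context = "\n".join(retrieve(query))
    return (
        "Answer using ONLY the FDA label text below. "
        "Match the user's literacy level and advise consulting a clinician.\n\n"
        f"LABEL TEXT:\n{context}\n\nQUESTION: {query}"
    )

print(build_prompt("How much acetaminophen can I take in a day?"))
```

In the deployed system, the retrieval step would be served by Ollama's vector store and the grounded prompt passed to the selected cloud or local model.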

To ensure reliability, the chatbot grounds its answers in verified sources before generating responses: the RAG knowledge base (vector database) contains the FDA-approved drug labels 20 used for patient-specific recommendations. A prompt guide of guardrail guidelines was also applied when generating responses to tailor answers safely to literacy levels and to prevent hallucinations. The prompt response guidelines document was not created in isolation but through a collaborative and iterative process. Initially, a multidisciplinary group comprising pharmacists, informaticians, and patients convened to outline the foundational principles and key elements that should be included in the guidelines. This group brought together diverse perspectives, ensuring that medical accuracy, technical feasibility, and patient needs were all adequately addressed.

Once a preliminary draft was created, the guidelines were reviewed with the Large Language Models (LLMs). This step allowed the researchers to understand how the LLMs interpreted the instructions and identify potential improvement areas. The LLMs provided suggestions and highlighted ambiguities or inconsistencies in the initial draft. Based on this feedback, the prompt response guidelines were revised multiple times. Each revision incorporated improved details, clarified instructions, and included more concrete examples to guide the LLMs’ responses. This iterative refinement process ensured that the final prompt response guidelines were robust, comprehensive, and effectively guided the LLMs in generating accurate, empathetic, and patient-centric medication information.

A prompt response guidelines document was provided to the LLM that includes the following:

  1. Guidelines for Answering Questions on Medications: This guide outlines the goal and source hierarchy for providing accurate medical information, exclusions for responses, and examples of queries and responses.

  2. Personalizing Responses: This details how to personalize responses using saved facts, age, gender, contextual relevance, location, interests, tone, and language. It also covers handling time-sensitive saved facts, when to use or avoid user interests, location privacy considerations, dynamic conflict handling, and contradictory saved facts.

  3. Privacy Guidelines: Focuses on protecting user privacy by avoiding disclosure of user data, minimizing data usage, using subtle personalization, and not stereotyping based on sensitive characteristics.

  4. Accuracy of Responses: This emphasizes safety, FDA-compliant information, user privacy, personalization, and disclaimers. It also includes validation checklists, quality tracking, guidelines for dosing recommendations, safety warnings, emergency alerts, and handling contradictory information.

  5. Format of Responses: Describes how to format responses, including context prefixes, literacy level adjustments, language response, warnings to consult FDA and healthcare providers, and final checks before providing an answer.
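As an illustration, the five guideline sections above might be assembled into a single system prompt along these lines. The section titles follow the paper, but the body text and function names here are hypothetical placeholders, not the study's actual guidelines document.

```python
# Illustrative sketch: the five guideline sections assembled into one system
# prompt. Section titles follow the paper; the body text is placeholder wording.
GUIDELINE_SECTIONS = {
    "Guidelines for Answering Questions on Medications":
        "Use the FDA-approved label as the primary source; decline out-of-scope queries.",
    "Personalizing Responses":
        "Adapt tone and examples to the user's saved facts, age, and interests.",
    "Privacy Guidelines":
        "Never disclose stored user data; personalize subtly; avoid stereotyping.",
    "Accuracy of Responses":
        "Give FDA-compliant information with disclaimers; flag emergencies.",
    "Format of Responses":
        "Adjust literacy level, add consult-your-provider warnings, run final checks.",
}

def build_system_prompt(sections: dict) -> str:
    """Concatenate titled guideline sections into one system prompt string."""
    parts = [f"## {title}\n{body}" for title, body in sections.items()]
    return "You are a medication-information assistant.\n\n" + "\n\n".join(parts)

prompt = build_system_prompt(GUIDELINE_SECTIONS)
```

Keeping the sections as separate named blocks makes the iterative revision process described below easier, since each section can be refined independently.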

When a pilot testing session is started, we provide the LLM with a fictitious user persona (Emily or Mary) that describes the user’s medical history and background information, along with the role that the LLM should take on, namely a pharmacist, when responding to the user. The user personas included the information listed in Table 1.

A role guideline provided to the LLM defines the role of a virtual pharmacist named Sally (fictitious name) and guidelines for giving medication information, including the following:

  • Sally is described as having 30 years of experience and aims to provide simplified, empathetic responses.

  • Guidelines for interaction include introducing herself only initially, keeping responses simple and empathetic, asking clarifying questions, and using a warm tone.

  • The document specifies using different persona documents for interactions with Emily and Mary.

  • The instructions state to ask clarifying questions if unsure of any information in the prompt.

  • It directs Sally to refer to specific medication documents for information on safety and interactions.

  • The relevant persona document (Emily or Mary) should be referenced for each interaction.
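A sketch of how the Sally role guideline and a per-session persona document might be combined into a chat request. The message structure, field names, and text below are illustrative assumptions, not the study's actual documents.

```python
# Hypothetical role guideline and persona documents; the real documents are
# described in the paper but not reproduced here.
ROLE_GUIDELINE = (
    "You are Sally, a virtual pharmacist with 30 years of experience. "
    "Introduce yourself only at the start, keep answers simple and warm, "
    "and ask clarifying questions when the prompt is ambiguous."
)

PERSONAS = {
    "Emily": "Emily's persona document: medical history, medications, background.",
    "Mary": "Mary's persona document: medical history, medications, background.",
}

def start_session(persona_name: str, first_question: str) -> list:
    """Build the message list for a pilot session with the chosen persona."""
    if persona_name not in PERSONAS:
        raise KeyError(f"No persona document for {persona_name!r}")
    return [
        {"role": "system", "content": ROLE_GUIDELINE},
        {"role": "system", "content": PERSONAS[persona_name]},
        {"role": "user", "content": first_question},
    ]

messages = start_session("Mary", "What is ondansetron for?")
```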

2.3. Evaluation Framework

The evaluation of EMPATHICA, the AI-powered chatbot for enhancing medication literacy, was structured into two distinct phases, as detailed in the paper. Phase 1: Readability Analysis focused on objectively quantifying the complexity and accessibility of the chatbot’s generated responses. This was achieved by employing a comprehensive suite of readability tests, each designed to measure different aspects of text comprehension. These tests included: Flesch-Kincaid Grade Level, Gunning Fog Index, Coleman-Liau Index, SMOG Index, Automated Readability Index, FORCAST Grade Level, Powers Sumner Kearl Grade, Rix Readability, Raygor Readability, Fry Readability, Flesch Reading Ease, CEFR Level, IELTS Level, Spache Score, New Dale-Chall Score, Lix Readability, and Lensear Write Score 21. By applying this diverse range of metrics, the researchers aimed to gain a holistic understanding of how easy or difficult it was for individuals with varying literacy levels to understand the information provided by EMPATHICA. This phase was crucial for ensuring that the chatbot’s responses were tailored to be accessible to a broad patient population, including those with low health literacy.
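For reference, the most widely used of these metrics, the Flesch-Kincaid Grade Level, can be computed directly from word, sentence, and syllable counts. The sketch below uses a crude vowel-group syllable counter, so its scores only approximate those produced by a tool like Readable.com.

```python
import re

def count_syllables(word: str) -> int:
    """Crude heuristic: count groups of consecutive vowels (minimum 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

print(flesch_kincaid_grade("The cat sat on the mat."))
```

Higher scores indicate text requiring more years of schooling; a response at FKGL 8 should be readable by a typical eighth grader.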

Phase 2: Empathy and Accuracy Assessment shifted the focus from purely quantitative measures to qualitative evaluation. Expert reviewers, including healthcare professionals, meticulously examined the chatbot’s responses in this phase. They assessed these responses for several critical factors: empathy, which involved evaluating the chatbot’s ability to provide supportive and contextually sensitive information; correctness, ensuring that the information provided was medically accurate and aligned with established guidelines; and reference to FDA-approved labels, verifying that the chatbot’s responses were grounded in official, regulatory-approved drug information. All Large Language Models (LLMs) integrated into EMPATHICA were initially evaluated for readability. Subsequently, a second round of evaluation explicitly focused on empathy, utilizing a more detailed and refined version of the prompt response guideline. This iterative approach allowed the researchers to progressively improve the chatbot’s performance and ensure it met the desired clarity and compassionate communication standards.

The following measures were used in the two evaluation phases.

  1. Readability Analysis: Comparing generated responses for Flesch-Kincaid Grade Level, Gunning Fog Index, Coleman-Liau Index, SMOG Index, Automated Readability Index, FORCAST Grade Level, Powers Sumner Kearl Grade, Rix Readability, Raygor Readability, Fry Readability, Flesch Reading Ease, CEFR Level, IELTS Level, Spache Score, New Dale-Chall Score, Lix Readability, and Lensear Write Score 21,22.

  2. Empathy and Accuracy Assessment: Expert reviewers assessed chatbot responses for empathy, correctness, and reference to FDA-approved labels.

3. Results

3.1. Phase 1 - Readability Analysis

The chatbot’s responses were analyzed using the Readable.com readability tool. Results indicated that:

  • Cloud-based models (e.g., Gemini, ChatGPT) produced more readable responses, averaging an FKGL of 8-10.

  • Open-source models (e.g., Mistral, LLaMA) generated responses at a higher reading level (FKGL 10-13), making them less accessible for patients with low literacy 22.

Mistral’s responses had the highest complexity (FKGL 11.27), while Google’s Gemini produced the most readable outputs (FKGL 10.66) 22.

The first evaluation was aimed at determining the reading level of the responses generated by the LLMs. Tables 1 and 2 summarize the findings across multiple health literacy assessments. Table 3 summarizes the reading levels of the medication labels provided to the LLMs in the knowledge base.

Table 1 –

User Persona Profile Information


  • Health Profile (Diagnosis, timeline, treatments, medications, allergies)

  • Medical Records (Lab results, medical records storage, appointment schedule)

  • Family Information (Marital status, parents, siblings)

  • Education History (School years, college years, early career)

  • Employment Details (Employment history, current income)

  • Personality Traits and Characteristics

  • Interests and Hobbies (Current and past)

  • Coping Mechanisms

  • Cognitive, Emotional, and Psychological Factors

  • Daily Life and Routines

  • Social Circle

  • Technology Use

  • Information Sources

  • Knowledge Gaps and Misconceptions

  • Financial Management

  • Values and Beliefs

Table 2.

Reading level/scores of LLM responses to questions posed for Mary, as calculated by readable.com

LLM Flesch-Kincaid Grade Level Gunning Fog Index Coleman-Liau Index SMOG Index Automated Readability Index FORCAST Grade Level Powers Sumner Kearl Grade Rix Readability Raygor Readability Fry Readability Flesch Reading Ease CEFR Level IELTS Level Spache Score New Dale-Chall Score Lix Readability Lensear Write
Mistral 12.17 13.84 13.37 14.13 12.54 11.59 6.27 11.00 12.00 14.00 41.03 C2 8+ 3.82 6.73 49.20 69.08
LLama 10.52 12.77 13.98 13.05 10.77 12.08 6.14 9.00 13.00 0.00 41.59 C2 8+ 3.34 7.01 47.37 72.98
DeepSeek 9.94 11.79 13.02 12.43 9.70 11.70 5.85 9.00 0.00 0.00 45.35 C2 8+ 2.82 6.61 44.30 76.23
Gemini 10.51 12.38 13.23 12.94 10.44 11.68 5.97 9.00 12.00 0.00 44.14 C2 8+ 3.21 6.20 44.03 73.58
OpenAI 10.89 12.42 13.62 13.37 11.00 11.48 5.97 9.00 12.00 0.00 42.67 C2 8+ 3.23 6.51 45.46 75.20
Mistral (Cloud) 12.17 13.84 13.37 14.13 12.54 11.59 6.27 11.00 12.00 14.00 41.03 C2 8+ 3.82 6.73 49.20 69.08
Gemma 11.44 13.23 14.48 13.36 10.94 12.30 6.20 9.00 0.00 0.00 34.93 C2 8+ 3.21 7.29 47.40 72.04
Avg 11.09 12.89 13.58 13.34 11.13 11.77 6.09 9.57 8.71 4.0 41.53 C2 8+ 3.35 6.72 46.70 72.59

Table 3.

Reading level of medication labels as calculated by Readable.com21

Medication Label Flesch-Kincaid Grade Level Gunning Fog Index Coleman-Liau Index SMOG Index Automated Readability Index FORCAST Grade Level Powers Sumner Kearl Grade Rix Readability Raygor Readability Fry Readability Flesch Reading Ease CEFR Level IELTS Level Spache Score New Dale-Chall Score Lix Readability Lensear Write
Acetaminophen 8.32 7.09 10.80 9.85 6.79 12.08 4.76 6.00 0.00 0.00 46.41 C2 8+ 3.73 8.00 38.31 98.46
Cyclophosphamide 10.46 7.72 12.69 9.88 8.80 12.84 4.92 6.00 0.00 0.00 28.37 C2 8+ 2.57 9.28 43.53 108.67
Dexamethasone 13.31 14.36 17.25 12.04 11.93 13.40 6.54 8.00 0.00 0.00 11.63 C2 8+ 3.42 9.16 52.55 83.59
Doxorubicin Hydrochloride 11.72 11.11 14.83 10.96 10.03 12.85 5.74 7.00 0.00 0.00 21.92 C2 8+ 2.64 8.66 45.90 92.05
Filgrastim 9.07 10.28 11.32 10.96 7.23 11.69 5.52 7.00 0.00 0.00 43.70 C2 8+ 2.56 7.71 41.08 87.14
Melatonin 5.72 3.36 3.63 8.18 1.99 11.38 3.85 4.00 0.00 0.00 61.30 C2 8+ 4.74 9.34 27.83 130.37
Ondansetron Hydrochloride 11.13 11.23 14.08 10.93 9.45 13.04 5.78 7.00 0.00 0.00 25.80 C2 8+ 2.44 8.80 46.73 91.59
Paclitaxel 9.16 10.21 11.29 10.40 7.22 12.07 5.52 6.00 0.00 0.00 40.05 C2 8+ 2.75 7.86 41.25 96.23
Average 9.86 9.42 11.99 10.40 7.93 12.42 5.33 6.38 0.00 0.00 34.90 C2 8+ 3.11 8.60 42.15 98.51

Phase 2 Evaluation for Readability and Empathy: With a more detailed prompt response guideline, we evaluated whether the responses were at a lower reading level than in the first round. Tables 4 and 5 show the results of the updated responses generated by the LLMs using several health literacy assessments as calculated by Readable.com.

Table 4.

Reading levels/scores of LLM responses for Emily as calculated by Readable.com21

LLM Flesch-Kincaid Grade Level Gunning Fog Index Coleman-Liau Index SMOG Index Automated Readability Index FORCAST Grade Level Powers Sumner Kearl Grade Rix Readability Raygor Readability Fry Readability Flesch Reading Ease CEFR Level IELTS Level Spache Score New Dale-Chall Score Lix Readability Lensear Write
Mistral 12.03 12.54 17.59 13.97 13.90 12.60 6.08 11.00 0.00 0.00 31.95 C2 8+ 3.64 7.99 52.62 68.47
LLama 11.20 13.21 14.29 13.39 11.26 12.22 6.17 10.00 0.00 0.00 39.20 C2 8+ 3.46 6.97 47.98 71.00
DeepSeek 11.34 12.35 15.38 12.77 11.17 12.56 6.00 9.00 0.00 0.00 32.99 C2 8+ 3.13 7.53 48.59 74.18
Gemma 10.55 12.73 12.49 12.92 10.72 11.29 6.03 9.00 11.00 12.00 47.90 C2 8+ 4.02 5.69 42.49 73.29
Avg 11.28 12.71 14.94 13.26 11.76 12.17 6.07 9.75 2.75 3.00 38.01 C2 8+ 3.56 7.05 47.92 71.74

Table 5.

Reading levels/scores of LLM responses for Mary as calculated by Readable.com21

LLM Flesch-Kincaid Grade Level Gunning Fog Index Coleman-Liau Index SMOG Index Automated Readability Index FORCAST Grade Level Powers Sumner Kearl Grade Rix Readability Raygor Readability Fry Readability Flesch Reading Ease CEFR Level IELTS Level Spache Score New Dale-Chall Score Lix Readability Lensear Write
Mistral 12.35 14.60 18.81 14.00 15.82 12.57 6.50 11.00 0.00 0.00 35.04 C2 8+ 3.78 7.45 51.95 65.55
LLama 11.85 13.83 15.11 13.59 11.75 12.34 6.33 10.00 0.00 0.00 33.75 C2 8+ 3.34 7.26 49.23 70.70
DeepSeek 11.34 12.35 15.38 12.77 11.17 12.56 6.00 9.00 0.00 0.00 32.99 C2 8+ 3.13 7.53 48.59 74.18
Gemma 10.11 12.40 12.83 12.65 10.36 11.38 5.96 9.00 12.00 12.00 48.30 C2 8+ 3.80 5.84 42.96 75.64
Avg 11.41 13.30 15.53 13.25 12.28 12.21 6.20 9.75 3.00 3.00 37.52 C2 8+ 3.51 7.02 48.18 71.52

3.2. Empathy and Accuracy Evaluation

Expert evaluators reviewed chatbot interactions for correctness and empathy:

1. Empathy

Measure Used: Empathy was evaluated using the Empathy Scale for Human–Computer Communication (ESHCC), which scores chatbot responses across four dimensions: Responsiveness, Emotional Understanding, Supportiveness, and Engagement. Each transcript was rated on a 7-point Likert scale (1 = not at all, 7 = extensively) based on whether the chatbot acknowledged user emotions, responded to prior input, offered reassurance, or maintained an interactive, patient-centered tone.

Interpretation of Results:

Scores of 24–28 indicated a strong empathetic presence across dimensions. Scores between 18 and 23 reflected moderate empathy with some gaps in personalization or emotional support. Scores below 18 suggested minimal empathetic behavior, with generic or functional responses dominating the interaction.

Findings: Most models demonstrated consistent emotional understanding (e.g., recognizing distress or discomfort) but varied widely in responsiveness and supportiveness. Engagement was often limited to functional follow-up questions, with few examples of true conversational empathy. Overall, while chatbots showed some affective awareness, their ability to offer emotionally attuned, patient-specific support remains limited.
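The ESHCC band thresholds above can be expressed as a small scoring helper. This is an illustrative sketch of the rubric arithmetic, not the evaluators' actual tooling.

```python
def eshcc_band(scores: dict) -> str:
    """Map four 7-point ESHCC dimension ratings to the interpretation bands."""
    dims = {"Responsiveness", "Emotional Understanding", "Supportiveness", "Engagement"}
    if set(scores) != dims:
        raise ValueError("Expected exactly the four ESHCC dimensions")
    if any(not 1 <= v <= 7 for v in scores.values()):
        raise ValueError("Each dimension is rated 1-7")
    total = sum(scores.values())  # maximum 28
    if total >= 24:
        return "strong empathetic presence"
    if total >= 18:
        return "moderate empathy"
    return "minimal empathetic behavior"
```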

2. Accuracy

Measure Used: Expert review against FDA-approved drug labels 23, UpToDate 24, and NCCN guidelines 25: Accuracy was evaluated by comparing chatbot-provided clinical content to authoritative sources. Reviewers examined responses for medical correctness, inclusion of key safety information, and factual alignment with real-world treatment guidance.

Interpretation of Results:

  • Scores of 6–7 indicated complete, accurate, and clinically consistent information.

  • Scores of 4–5 reflected mostly accurate answers with some simplification or omissions.

  • Scores below 4 suggested incorrect, misleading, or unsafe content.

Findings: Cloud-based models more reliably referenced FDA-approved medication details and guidelines, while open-source models occasionally omitted or misstated critical drug information.

3. Emergency Recognition and Flagging

Measure Used: This domain assessed the chatbot’s ability to recognize emergency scenarios and prompt users to seek urgent care. A custom rubric was adapted for this benchmark. Red-flag symptoms included terms such as “chest pain,” “difficulty breathing,” “high fever,” and “severe vomiting.” Chatbots were rated based on whether they recognized these symptoms and provided clear instructions to call emergency services or seek medical help. Generic or vague responses were penalized.

Interpretation of Results:

  • Scores of 6–7 indicated clear, situation-appropriate escalation.

  • Scores of 4–5 included general advice without urgency.

  • Scores below 4 reflected a failure to recognize critical symptoms.

Findings: Most models, including high-performing ones, struggled with emergency recognition, indicating a need for stronger triage logic.
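A minimal sketch of the kind of red-flag matching such a triage layer could use, based on the symptom phrases listed above. The matching logic and escalation text are assumptions for illustration, not the study's implementation.

```python
# Red-flag symptom phrases drawn from the custom rubric described above.
RED_FLAGS = ["chest pain", "difficulty breathing", "high fever", "severe vomiting"]

ESCALATION = (
    "These symptoms can signal an emergency. "
    "Please call emergency services or seek urgent medical care now."
)

def check_red_flags(user_message: str):
    """Return an explicit escalation message if any red-flag phrase appears."""
    text = user_message.lower()
    hits = [flag for flag in RED_FLAGS if flag in text]
    if hits:
        return f"Possible emergency ({', '.join(hits)}). {ESCALATION}"
    return None  # no red flags; answer the medication question normally
```

A deterministic layer like this, run before the LLM responds, avoids penalized vague responses by guaranteeing explicit escalation wording whenever a rubric phrase appears.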

4. Usability

Measure Used: User-system interaction quality was evaluated using a subset of the BUS-II framework26, focusing on four key dimensions: Clarity, Appropriateness, Helpfulness, and Politeness. Each transcript was scored on a 1–5 scale per category, assessing the chatbot’s communication quality, tone, and ability to provide meaningful responses.

Interpretation of Results:

Scores of 5 in each category reflected excellent performance—clear, respectful, and practically useful responses. Scores of 3–4 indicated moderate performance, with some issues in specificity, actionability, or tone.

Scores below 3 denoted vague, unhelpful, or inappropriate communication. Total scores (maximum 20) were interpreted as follows:

  • 18–20: High-quality user experience across dimensions.

  • 14–17: Adequate performance with minor shortcomings.

  • Below 14: Limited effectiveness, often due to unclear or generic responses.

Findings: Most chatbots demonstrated consistent politeness and basic helpfulness. Gemma2_9b (Mary) received the highest score (18/20), indicating well-rounded and user-friendly communication. In contrast, deepseek-r1_8b scored lowest (14/20), reflecting less tailored advice and lower clarity, suggesting a need for improved linguistic precision and contextual adaptation.

4. Discussion

Results highlight the trade-offs between cloud-based and open-source LLMs. While cloud-based models excel in readability and empathy, their proprietary nature raises concerns about transparency and adaptability. Open-source models provide greater control but require fine-tuning for readability and accuracy. Open-source, locally installed models also offer the benefit of security and data privacy, as they limit the flow of personal information from the local server to third parties. Future iterations of EMPATHICA will incorporate reinforcement learning and human-in-the-loop validation to balance these factors. Emily and Mary were both assessed using readable.com and datayze.com. Methodological differences in how the two platforms calculate the scores could slightly alter the results, although the calculation formulas are widely available and generally consistent across platforms.

Cloud-based models like Gemini and ChatGPT produced more readable responses than open-source models like Mistral and LLaMA. The readability scores, measured using the Flesch-Kincaid Grade Level (FKGL), showed that cloud-based models generated text at an FKGL of approximately 8-10, making them more accessible to a broader audience. In contrast, open-source models tended to produce more complex text, with FKGL scores ranging from 10-13. Additionally, medication labels had an average FKGL of 12.13, indicating that they are generally difficult for many patients to understand without simplification.

EMPATHICA’s architecture leverages Ollama 29 as a central component for managing and connecting various Large Language Models (LLMs), both cloud-based and open-source. Ollama acts as an intermediary, enabling the system to utilize models like OpenAI’s ChatGPT 13, Google’s Gemini 14, Mistral 15, Meta’s LLaMA, and Google’s Gemma. A Retrieval-Augmented Generation (RAG) approach is implemented to ensure accuracy and relevance. When a user asks a question, Ollama’s RAG system first retrieves relevant information from a knowledge base (a vector database containing FDA-approved medication labels, uploaded documents, and conversational context) together with a prompt guide of guardrail guidelines. This retrieved information is then fed to the chosen LLM, allowing it to generate a response grounded in verified data, tailored to the user’s query and literacy level while also preventing hallucinations. This architecture optimizes responses by providing a structured and reliable information retrieval process before the LLM generates the final answer. Notably, using locally hosted, open-source models via Ollama provides enhanced data privacy and security, as sensitive patient information remains within the local environment and is not transmitted to external servers. This is particularly important in healthcare settings where patient data must be handled with the utmost confidentiality.

Empathy and personalization were more effectively integrated into responses generated by cloud-based models, demonstrating a greater ability to provide patient-centered communication. These models delivered responses that were not only informative but also conveyed reassurance and support tailored to the user’s concerns. On the other hand, open-source models required additional fine-tuning to achieve the same level of empathy. Without such refinements, their responses often felt more clinical or impersonal, potentially affecting patient trust and engagement with the information provided.

The ability to accurately reference FDA-approved medication labels varied between cloud-based and open-source models. Cloud-based models more consistently included references to official medication information, ensuring responses aligned with regulatory standards. In contrast, open-source models occasionally omitted critical safety warnings, which could pose a risk if patients rely on incomplete or inaccurate medication details. Ensuring accuracy in drug information is essential, as misinformation can lead to medication errors and adverse health outcomes.

All models identified emergency scenarios well and directed users to seek medical assistance when necessary. When users mentioned symptoms indicating a serious health issue, such as chest pain, the models correctly advised them to contact emergency services. However, some responses lacked explicit recommendations for provider referrals in non-emergency situations. This gap highlights the need for continued refinement to ensure that users receive appropriate guidance both for urgent conditions and for general medication inquiries that warrant professional medical input.
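A simplified version of such an emergency guardrail can be sketched as a pre-check that runs before any model response. The trigger phrases and messages below are purely illustrative, since EMPATHICA encodes this behavior in its prompt guardrails rather than in a fixed keyword list.

```python
# Illustrative trigger phrases only; a production guardrail would be far more
# extensive and would also rely on the LLM's own safety instructions.
EMERGENCY_TERMS = ("chest pain", "trouble breathing", "severe bleeding", "overdose")

def flag_emergency(message: str) -> bool:
    """Return True if the message mentions a potentially emergent symptom."""
    text = message.lower()
    return any(term in text for term in EMERGENCY_TERMS)

def triage_banner(message: str) -> str:
    """Prepend guidance: an emergency referral, or a provider referral otherwise."""
    if flag_emergency(message):
        return "This may be an emergency. Call emergency services or go to the nearest ER."
    return "For non-urgent concerns, consider discussing this with your healthcare provider."
```

Routing every non-flagged query through a provider-referral banner, as the second branch does, is one way to close the non-emergency referral gap noted above.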

Implementing updated prompt guidelines in the second evaluation round led to notable improvements in readability and accuracy. By refining the instructions provided to the models, responses became more accessible while still maintaining high standards for safety and empathy. The adjustments resulted in a slight reduction in FKGL scores, making the information easier for patients with varying levels of health literacy to understand. This improvement underscores the importance of ongoing evaluation to enhance the effectiveness of AI-driven medication guidance.
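The Flesch-Kincaid Grade Level underlying these scores is a simple function of sentence length and syllable density, which is why instructing models to use shorter sentences and plainer words lowers it. The sketch below computes FKGL from raw counts; readable.com's tokenization rules may differ in detail, and the sample counts are invented for illustration.

```python
def fkgl(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Invented counts for two hypothetical 100-word responses:
before = fkgl(words=100, sentences=6, syllables=165)   # long sentences, complex words
after = fkgl(words=100, sentences=10, syllables=140)   # shorter sentences, plainer words
```

With these invented counts the grade level drops from roughly 10.4 to roughly 4.8, mirroring the direction, though not the magnitude, of the improvement reported after the prompt-guideline update.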

Future work includes a forthcoming clinical study evaluating EMPATHICA with cancer patients at BIDMC 30. That study will assess patient trust in AI-generated medication information, the impact on medication adherence and comprehension, and chatbot safety in delivering medical guidance. It will include task-based usability testing and open-ended chatbot interactions, allowing for both quantitative and qualitative data collection 30. Metrics will include time to complete medication queries, task success rate, and error rate in AI-generated responses.

5. Conclusion

EMPATHICA represents a significant step toward addressing health literacy challenges in medication management. Preliminary results indicate that this chatbot could improve medication comprehension and adherence by integrating AI, patient-centered design, and rigorous evaluation. Ongoing clinical validation will determine its efficacy and inform future enhancements for scalable deployment.

Figures & Tables

Image 1. Chatbot Architecture

Table 1.

Reading-level scores of LLM responses to questions posed for the Emily persona, as calculated by readable.com

LLM Flesch-Kincaid Grade Level Gunning Fog Index Coleman-Liau Index SMOG Index Automated Readability Index FORCAST Grade Level Powers Sumner Kearl Grade Rix Readability Raygor Readability Fry Readability Flesch Reading Ease CEFR Level IELTS Level Spache Score New Dale-Chall Score Lix Readability Lensear Write
Mistral 12.15 14.48 14.24 14.33 12.82 11.82 6.45 11.00 13.00 0.00 39.37 C2 8+ 3.67 6.98 50.48 68.37
LLaMA 10.47 11.68 13.96 12.29 10.08 12.14 5.82 8.00 0.00 0.00 39.69 C2 8+ 3.10 7.11 43.94 76.21
DeepSeek 11.45 13.11 14.57 13.39 11.05 12.26 6.17 9.00 0.00 0.00 35.06 C2 8+ 3.16 7.26 47.62 72.13
Gemma 8.62 10.30 11.12 11.24 8.27 11.00 5.47 8.00 9.00 9.00 55.47 C2 8+ 3.34 5.15 36.77 80.65
Gemini 11.04 12.88 13.60 13.33 11.19 11.70 6.08 10.00 12.00 0.00 42.54 C2 8+ 3.45 6.50 45.48 71.84
OpenAI/ChatGPT 10.48 11.83 13.61 12.72 10.79 11.64 5.83 9.00 12.00 0.00 44.67 C2 8+ 3.29 6.59 44.79 75.18
Mistral (Cloud) 12.15 14.48 14.24 14.33 12.82 11.82 6.45 11.00 13.00 0.00 39.37 C2 8+ 3.67 6.98 50.48 68.37
Avg 10.91 12.68 13.62 13.09 10.86 11.77 6.04 9.43 8.43 1.29 42.31 C2 8+ 3.38 6.65 45.65 73.25

Table 6.

Empathy subscores evaluated by model using ESHCC 27

LLM Version Persona Responsiveness Emotional Understanding Supportiveness Engagement Total
Gemma Gemma 2 9b Emily 1 7 1 7 16
LLaMA LLaMA 3 8b Emily 1 7 1 1 10
Mistral Mistral 7b Emily 7 7 1 1 16
DeepSeek DeepSeek R1 8b Emily 1 7 1 1 10
Gemma Gemma 2 9b Mary 1 7 1 7 16
LLaMA LLaMA 3 8b Mary 1 7 1 1 10
Mistral Mistral 7b Mary 1 7 1 7 16
DeepSeek DeepSeek R1 8b Mary 1 7 1 1 10

Table 7.

Accuracy levels evaluated by model

LLM Version Persona Accuracy (Guidelines) 28 Emergency Flagging Usability (BUS-11 subset) 26
Gemma Gemma 2 9b Emily 6.5 3 14
LLaMA LLaMA 3 8b Emily 6 4 16
Mistral Mistral 7b Emily 5.5 2.5 16
DeepSeek DeepSeek R1 8b Emily 6 3.5 14
Gemma Gemma 2 9b Mary 6.5 3 18
LLaMA LLaMA 3 8b Mary 6 3.5 16
Mistral Mistral 7b Mary 5.5 2.5 16
DeepSeek DeepSeek R1 8b Mary 6 3 14

References

  • 1. Smith D. L. Compliance packaging: A patient education tool. Am. Pharm. 1989;29:42–53.
  • 2. https://health.gov/our-work/health-literacy
  • 3. Cutilli C. C., Bennett I. M. Understanding the health literacy of America: results of the National Assessment of Adult Literacy. Orthop. Nurs. 2009;28:27–32, 33–4. doi:10.1097/01.NOR.0000345852.22122.d6.
  • 4. Viswanathan M., et al. Interventions to improve adherence to self-administered medications for chronic diseases in the United States: a systematic review. Ann. Intern. Med. 2012;157:785–795. doi:10.7326/0003-4819-157-11-201212040-00538.
  • 5. Correia R. B., de Araújo L. P., Mattos M. M., Rocha L. M. City-wide analysis of electronic health records reveals gender and age biases in the administration of known drug-drug interactions. arXiv [cs.SI] 2018.
  • 6. Patel A., et al. Patient counseling materials: The effect of patient health literacy on the comprehension of printed prescription drug information. Res. Social Adm. Pharm. 2018. doi:10.1016/j.sapharm.2018.04.035.
  • 7. Bailey S. C., et al. Predicting medication understanding and adherence using simple literacy screening. J Gen Intern Med. 2009;24:1173–1178.
  • 8. https://www.dcinetwork.org/about-us
  • 9. Patient Labeling Resources. U.S. Food and Drug Administration. 2024. https://www.fda.gov/drugs/fdas-labeling-resources-human-prescription-drugs/patient-labeling-resources
  • 10. Mcinnes D. K., et al. Patient-centered medication information: Applying health literacy principles to drug labels. J Health Commun. 2015;20:79–87.
  • 11. https://www.dcinetwork.org/patients2024/retreat
  • 12. Quintana Y., Homiar A., Srivastava G. Evaluation of a patient-centric medication information app with a chatbot: Research protocol. Beth Israel Deaconess Medical Center and Dana-Farber Cancer Center. 2025.
  • 13. OpenAI. https://openai.com
  • 14. Google DeepMind. https://deepmind.google
  • 15. Mistral AI. https://mistral.ai
  • 16. https://ai.meta.com
  • 17. Making AI helpful for everyone. https://ai.google
  • 18. DeepSeek. https://www.deepseek.com
  • 19. Martineau K. What is retrieval-augmented generation (RAG)? IBM Research. 2023. https://research.ibm.com/blog/retrieval-augmented-generation-RAG
  • 20. DailyMed. https://dailymed.nlm.nih.gov
  • 21. About Readability. Readable. 2021. https://readable.com/readability/
  • 22. DCI Network Evaluation Team. Readability and Empathy Evaluation of AI-Driven Chatbot Responses.
  • 23. Drugs@FDA: FDA-Approved Drugs.
  • 24. UpToDate®. Nurse Pract. 2025;50:22.
  • 25. Chen S., et al. Use of artificial intelligence chatbots for cancer treatment information. JAMA Oncol. 2023;9:1459–1462. doi:10.1001/jamaoncol.2023.2954.
  • 26. Re-examining the Chatbot Usability Scale (BUS-11) to assess user experience. ResearchGate.
  • 27. Concannon S., Tomalin M. Measuring perceived empathy in dialogue systems. AI Soc. 2023. doi:10.1007/s00146-023-01715-z.
  • 28. Tang C., Hu J., Yu M. Evaluating the clinical accuracy and safety of large language models in medicine. JAMA Netw Open. 2023;6.
  • 29. Ollama. https://ollama.com
  • 30. Quintana Y. IRB-approved evaluation protocol for chatbot safety and comprehension in medication information. Beth Israel Deaconess Medical Center. 2025.

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association
