Table 1. Characteristics of the included studies (n=20).
| Author, year, and country | Study design | Chronic diseases | LLM<sup>a</sup> types | Outcome assessment | Key study findings |
| --- | --- | --- | --- | --- | --- |
| Al-Anezi (2024) [47], Saudi Arabia | Quasi-experimental study | Cancer, diabetes, and kidney failure | ChatGPT 3.5, engaged by participants for ≥15 min daily for 2 weeks | Semistructured interviews | ChatGPT 3.5 improved disease awareness, health behaviors, and access to support while reducing reliance on specialists, yet it faced issues with disease diagnosis, empathy, data privacy, and managing complex conditions. |
| Alanezi et al (2024) [48], Saudi Arabia | Quasi-experimental study | Chronic mental health conditions | ChatGPT 3.5, engaged by participants for ≥15 min daily for 2 weeks | Semistructured interviews | ChatGPT 3.5 enhanced mental health literacy and self-care and delivered crisis interventions; however, it faced challenges with data privacy, accuracy, and catering to cultural and linguistic diversity. |
| Alanezi (2024) [21], Saudi Arabia | Quasi-experimental study | Cancer | ChatGPT 3.5, engaged by participants for 2 weeks | Focus group interviews | ChatGPT 3.5 improved cancer knowledge, self-management, emotional support, and access to social resources; however, it faced privacy, reliability, and personalization challenges. |
| Aliyeva et al (2024) [35], United States | Simulation study | Severe hearing loss | ChatGPT 4.0, posed with five postoperative management questions | Survey | ChatGPT 4.0 had 100% accuracy, rapid response times, 98% clarity, and 92% relevance in its recommendations. |
| Choo et al (2024) [46], South Korea | Simulation study | Colorectal cancer | ChatGPT, used to generate treatment recommendations | Survey | ChatGPT's recommendations for oncological management showed 86.7% concordance with the multidisciplinary team. |
| Dergaa et al (2024) [49], Qatar | Simulation study | Mental health | ChatGPT, engaged as a digital psychiatric provider | Qualitative assessment | ChatGPT offered quick, empathetic, and guideline-concordant responses but struggled to seek clarification and to customize plans for complex scenarios. |
| Dergaa et al (2024) [50], Qatar | Simulation study | Hypertension, osteoarthritis, stress, diabetes, and asthma | ChatGPT 4.0, interacted with five hypothetical patient profiles to prescribe a 30-day fitness program | Qualitative assessment | While ChatGPT 4.0 can generate safety-conscious exercise programs, it lacks variability and cannot perform initial assessments or adjust regimens in real time. |
| Franco D’Souza et al (2023) [51], India | Simulation study | Psychiatric disorders | ChatGPT 3.5, interacted with 100 clinical case vignettes | Survey | ChatGPT 3.5 performed well in generating management strategies for psychiatric conditions, followed by diagnoses. |
| Kianian et al (2024) [36], United States | Simulation study | Glaucoma | ChatGPT, used to generate patient handouts | Survey | ChatGPT generated readable health information at a ninth-grade reading level and scored the quality of health resources with high precision (r=0.725; P<.001). |
| Lim et al (2024) [45], Singapore | Simulation study | Colorectal cancer | Retrieval-augmented generation (RAG)-enhanced ChatGPT 4.0, instructed to provide colonoscopy screening recommendations (an illustrative RAG sketch follows the table notes) | Survey | The enhanced model recommended colorectal screening intervals more accurately (79% vs 50.5%; P<.01) and produced fewer hallucinations than the standard model. |
| Mondal et al (2023) [52], India | Simulation study | Lifestyle-related chronic diseases | ChatGPT 3.5, presented with 20 cases of chronic disease management | Survey | ChatGPT 3.5 generated readable text with a mean FKRE<sup>b</sup> score of 27.8, and its accuracy (mean 1.83, SD 0.37) and applicability (mean 1.90, SD 0.21) scores were significantly higher than the hypothesized median of 1.5. |
| Papastratis et al (2024) [53], Greece | Simulation study | Noncommunicable diseases | ChatGPT 3.5 and ChatGPT 4.0, interacted with 15 profiles to generate weekly meal plans | Survey | ChatGPT 3.5 and 4.0 showed lower nutrient accuracy (81.5% and 81.6%, respectively) than a knowledge-based recommender (91%), although ChatGPT 4.0 improved to 86% when given personalized energy targets as input. |
| Pradhan et al (2024) [37], United States | Simulation study | Liver cirrhosis | ChatGPT 4.0, DocsGPT, Google Bard, and Bing Chat, used to generate a one-page patient education sheet | Survey | LLM-generated materials exhibited higher FKRE scores, 76%-99% accuracy rates, and actionability comparable to human-derived materials. |
| Puerto Nino et al (2024) [43], Canada | Simulation study | Benign prostate enlargement | ChatGPT 4.0+, fed 88 queries on benign prostate enlargement | Survey | ChatGPT 4.0+ achieved precision scores ranging from 0.50 to 1 and a median general quality score of 4. |
| Seth et al (2023) [41], Australia | Simulation study | Carpal tunnel syndrome | ChatGPT (no version number), queried with six inquiries to generate management strategies | Survey | ChatGPT accurately diagnosed carpal tunnel syndrome and recommended treatment options but faced challenges with erroneous references and insufficient information depth. |
| Singer et al (2024) [38], United States | Simulation study | Ophthalmology issues | Aeyeconsult, powered by ChatGPT 4.0, interacted with 260 eye care questions | Survey | Aeyeconsult outperformed ChatGPT 4.0 in accuracy (83.4% vs 69.2%) and demonstrated greater consistency in responses across repeated attempts on OphthoQuestions. |
| Spallek et al (2023) [42], Australia | Simulation study | Mental health and substance use disorders | ChatGPT 4.0 pro, interacted with queries on mental health and substance use | Survey | ChatGPT 4.0 produced accurate content at higher reading levels but lacked the depth and breadth of human expert materials, with 23% featuring stigmatizing phrases. |
| Willms and Liu (2024) [44], Canada | Autoethnographic case study | Chronic disease prevention by increasing physical activity | ChatGPT 3.0, used to generate adaptive physical activity interventions | Qualitative assessment | ChatGPT 3.0 had acceptable accuracy and relevance in responding to prompts but sometimes provided false academic references. |
| Yang et al (2024) [39], United States | Case study | Diet management for preventing chronic illnesses | ChatDiet, based on ChatGPT 3.5 Turbo, used to provide food recommendations | Causal graphs and qualitative assessment | ChatDiet effectively personalized food recommendations (85%-95% effectiveness) and demonstrated interactivity but exhibited occasional hallucinations. |
| Yeo et al (2023) [40], United States | Simulation study | Liver cirrhosis and hepatocellular carcinoma | ChatGPT (Dec 15 version), queried with 164 questions on liver disease management | Survey with qualitative assessment | ChatGPT had accuracy rates of 79.1% and 74% and provided emotional support; however, it might fail to identify eligibility for hepatocellular carcinoma screening and liver transplantation. |
<sup>a</sup>LLM: large language model.
<sup>b</sup>FKRE: Flesch-Kincaid reading ease score.
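
For context on the readability scores reported above (eg, the mean FKRE of 27.8 in Mondal et al [52] and the FKRE scores in Pradhan et al [37]), the sketch below computes the standard Flesch-Kincaid reading ease formula. This is general background on the metric, not code from any included study; the word, sentence, and syllable counts are assumed to come from an external text-analysis step.

```python
def flesch_kincaid_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard FKRE formula; higher scores indicate easier text.
    Scores of 60-70 correspond roughly to an 8th-9th grade reading level,
    while scores below ~30 (eg, 27.8 in Mondal et al [52]) correspond to
    college-graduate-level text.
    """
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# Example: a 120-word passage with 5 sentences and 210 syllables
print(flesch_kincaid_reading_ease(120, 5, 210))  # ~34.4, ie, difficult text
```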
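
The RAG-enhanced pipeline of Lim et al (2024) [45] is reported only at a high level in the table. The following minimal sketch illustrates the general pattern such a pipeline relies on: retrieve relevant guideline passages, then constrain the model's answer to them. The guideline snippets, the word-overlap retriever, and the `call_llm` stub are all illustrative assumptions, not details from the study.

```python
# Minimal retrieval-augmented generation (RAG) pattern: retrieve guideline
# passages relevant to a case, then ask the model to answer using only them.
# Hypothetical sketch; not the actual pipeline from Lim et al (2024) [45].

GUIDELINE_PASSAGES = [  # stand-ins for a real screening-guideline corpus
    "Average-risk adults with a normal colonoscopy: repeat in 10 years.",
    "1-2 small (<10 mm) tubular adenomas: repeat colonoscopy in 7-10 years.",
    "Adenoma with high-grade dysplasia: repeat colonoscopy in 3 years.",
]

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retriever ranking passages by word overlap with the query.
    Production systems typically use embedding similarity over a vector index.
    """
    terms = set(query.lower().split())
    return sorted(corpus, key=lambda p: -len(terms & set(p.lower().split())))[:k]

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion API call (assumed, not specified)."""
    raise NotImplementedError("wire up an LLM client here")

def recommend_interval(case: str) -> str:
    """Assemble a grounded prompt and query the model."""
    context = "\n".join(retrieve(case, GUIDELINE_PASSAGES))
    prompt = (
        "Using only the guideline excerpts below, recommend a colonoscopy "
        f"screening interval.\n\nGuidelines:\n{context}\n\nCase: {case}"
    )
    return call_llm(prompt)
```

The key design point is that the prompt confines the model to retrieved guideline text, grounding answers in the corpus rather than in the model's parametric memory alone, which is consistent with the fewer hallucinations reported for the enhanced model.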