Abstract
This editorial explores the evolving and transformative role of large language models (LLMs) in enhancing the capabilities of virtual assistants (VAs) in the health care domain, highlighting recent research on the performance of VAs and LLMs in health care information sharing. Drawing on recent research, this editorial highlights the marked improvement in the accuracy and clinical relevance of responses from LLMs, such as GPT-4, compared with current VAs, especially in addressing complex health care inquiries, such as those related to postpartum depression. The improved accuracy and clinical relevance of LLM responses mark a paradigm shift in digital health tools and VAs. Furthermore, such LLM applications can adapt dynamically and be integrated into existing VA platforms, offering cost-effective, scalable, and inclusive solutions. These advances suggest a significantly broader range of VA applications in health care, along with increased value, risk, and impact, moving toward more personalized digital health ecosystems. However, alongside these advancements, it is necessary to develop and adhere to ethical guidelines, regulatory frameworks, governance principles, and privacy and safety measures. Robust interdisciplinary collaboration is needed to navigate the complexities of safely and effectively integrating LLMs into health care applications, ensuring that these emerging technologies align with the diverse needs and ethical considerations of the health care domain.
Keywords: large language models, voice assistants, virtual assistants, chatbots, conversational agents, health care
Virtual assistants (VAs)—primarily voice assistants, chatbots, and dialogue-based interactive applications—have been leading conversational technologies, used for health care communications and remote monitoring [1-4]. However, their accuracy and reliability in understanding and responding to medical questions have been limited [5], improving only slowly over the years [6]. Large language models (LLMs) offer scalable and customizable solutions to these limitations of VAs. The body of literature demonstrating the capabilities of LLMs in medicine and health care has been growing, and a number of studies benchmarking LLMs’ performance against each other or against humans’ medical knowledge, decision-making processes, and empathic responses have been published [7-11]. Furthermore, LLM-based services improve equitable access to information and reduce language barriers via contextual and culturally aware systems [12] and privacy-preserving, local solutions for low-resource settings [13-16]. This research and evidence offer a glimpse of the future of personalized VAs.
In the context of mental health and health information–seeking behavior, we investigated the performance of VAs in responding to postpartum depression–related frequently asked questions. The evidence from our two studies (conducted in 2021 with VAs [17] and 2023 with LLMs [18]) provides comparable findings on the 2-year difference in technology; illuminates the evolving roles of artificial intelligence, natural language processing, and LLMs in health care; and shows the promise of a more accurate and reliable digital health landscape.
In our first study, conducted in 2021 [17], we investigated the clinical accuracy of Google Assistant, Amazon Alexa, Microsoft Cortana, and Apple Siri voice assistant responses to postpartum depression–related questions. In our second study, conducted in 2023 [18], we replicated our research using LLMs, and the new study showed significant improvements in the accuracy and clinical relevance of responses. Specifically, GPT-4’s responses to all postpartum depression–related questions were accurate and clinically relevant, whereas the proportion of clinically relevant VA responses did not exceed the 29% reported in our earlier study. In addition, the interrater reliability score for LLMs (GPT-4: κ=1; P<.05) was higher than that for VAs (κ=0.87; P<.001), underscoring LLMs’ consistency in providing clinically relevant responses, which is vital for health care applications. LLMs also recommended consultations with health care providers, an extra layer of safety that was not observed in our earlier study with VAs. This dramatic improvement suggests a paradigm shift in the capabilities of digital health tools for health information–seeking activities. The high clinical accuracy and reliability of LLMs point toward a promising future for their integration into existing VA platforms. LLMs can offer dynamic adaptability for VAs via custom applications and decentralized LLM architectures [19,20]. Given the opportunities for open-source development and collaboration, LLMs could serve as cost-effective and inclusive frameworks for collaborative development (ie, among technology providers, patients, and clinical experts) in fine-tuning and training VAs for specific medical purposes.
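The interrater reliability scores above are Cohen κ values, which correct raw rater agreement for agreement expected by chance. A minimal sketch of the computation is below; the `cohens_kappa` function name and the two sets of binary ratings are illustrative only and are not data from either study.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: proportion of items both raters labeled the same.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1:
        return 1.0  # degenerate case: both raters use one identical label
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings of 10 responses as clinically relevant (1) or not (0):
a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
b = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
print(round(cohens_kappa(a, b), 2))  # prints 0.74
```

Perfect agreement on every item yields κ=1, as reported for GPT-4; values below 1 reflect residual disagreement after discounting chance.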
The empirical data from our studies, as well as the literature [21], indicate a compelling trajectory toward the use of LLMs to potentially improve the clinical and instructional capabilities of conversational technologies. This suggests a shift in our earlier spectrum model for VAs in health care (Figure 1) [22], in which we proposed 4 service levels for a spectrum of VA use, each associated with a level of risk, value, and impact. These levels were “information” (eg, asking Amazon Alexa to start self-care guidance), “assistance” (eg, setting up reminders for medication or self-therapy), “assessment” (eg, identification, detection, prediction with digital biomarkers, and management), and “support” (eg, prescribing, substituting, or supplementing medication and therapy tools). In 2020, the evidence on the utilization of VAs in health care indicated that VAs operated at the “information” and “assistance” levels [22]. However, LLMs are opening up opportunities for VAs, potentially moving them toward the “assessment” and “support” levels. As Figure 1 shows, the level of a service and its associated risk, value, and impact can change based on the targeted problems and solutions. Within a digital health ecosystem, we may envision a future of support from VAs enhanced by LLMs with speech interaction and audio-based sensing capabilities [23]. Such enhancements may include quantifying human behavior–related factors beyond VA engagement, such as social engagement, emotions, neurodevelopmental and behavioral health, sleep health (snoring, heart rate, and movements), respiratory symptoms (sneezing and coughing), and motion (gait, exercise, and sedentary behavior).
Figure 1.

Spectrum of virtual assistants, outlining the risk, value, and impact of health care services, and their applications in digital health ecosystems. These can change based on the targeted problems and solutions. This figure was created with BioRender.com (BioRender).
Despite this promising horizon, we need to proceed cautiously. As LLMs become highly appealing tools for use as VAs in health care, it is imperative to establish a platform that facilitates democratized access and interdisciplinary collaboration during the development of such applications [19,24]. This platform should be designed to bring together a diverse range of stakeholders, including technologists, ethicists, researchers, health care professionals, and patients. Doing so would ensure that the development and integration of VAs are guided by a balanced perspective that considers ethical guidelines and regulatory oversight [25,26], governance principles [27,28], privacy and safety measures [29], feasibility, efficacy, and patient-centric approaches and assessment methods [30,31]. By prioritizing such collaborative and inclusive dialogues, we can better navigate the complex challenges and harness the full potential of these advanced technologies in health care, ensuring that they are developed responsibly, ethically, and in alignment with the diverse needs of all users.
Acknowledgments
The author thanks Yasemin Sezgin for her constructive feedback. Figure 1 was created with BioRender.com (BioRender).
Abbreviations
- LLM
large language model
- VA
virtual assistant
Footnotes
Conflicts of Interest: ES serves on the editorial board of JMIR Publications.
References
- 1. Sezgin E, Huang Y, Ramtekkar U, Lin S. Readiness for voice assistants to support healthcare delivery during a health crisis and pandemic. NPJ Digit Med. 2020 Sep 16;3:122. doi: 10.1038/s41746-020-00332-0
- 2. Corbett CF, Combs EM, Chandarana PS, Stringfellow I, Worthy K, Nguyen T, Wright PJ, O'Kane JM. Medication adherence reminder system for virtual home assistants: mixed methods evaluation study. JMIR Form Res. 2021 Jul 13;5(7):e27327. doi: 10.2196/27327
- 3. Xu L, Sanders L, Li K, Chow JCL. Chatbot for health care and oncology applications using artificial intelligence and machine learning: systematic review. JMIR Cancer. 2021 Nov 29;7(4):e27850. doi: 10.2196/27850
- 4. Sawad AB, Narayan B, Alnefaie A, Maqbool A, Mckie I, Smith J, Yuksel B, Puthal D, Prasad M, Kocaballi AB. A systematic review on healthcare artificial intelligent conversational agents for chronic conditions. Sensors (Basel). 2022 Mar 29;22(7):2625. doi: 10.3390/s22072625
- 5. Palanica A, Thommandram A, Lee A, Li M, Fossat Y. Do you understand the words that are comin outta my mouth? Voice assistant comprehension of medication names. NPJ Digit Med. 2019 Jun 20;2:55. doi: 10.1038/s41746-019-0133-x
- 6. Palanica A, Fossat Y. Medication name comprehension of intelligent virtual assistants: a comparison of Amazon Alexa, Google Assistant, and Apple Siri between 2019 and 2021. Front Digit Health. 2021 May 19;3:669971. doi: 10.3389/fdgth.2021.669971
- 7. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, Payne P, Seneviratne M, Gamble P, Kelly C, Babiker A, Schärli N, Chowdhery A, Mansfield P, Demner-Fushman D, Arcas BAY, Webster D, Corrado GS, Matias Y, Chou K, Gottweis J, Tomasev N, Liu Y, Rajkomar A, Barral J, Semturs C, Karthikesalingam A, Natarajan V. Large language models encode clinical knowledge. Nature. 2023 Aug;620(7972):172–180. doi: 10.1038/s41586-023-06291-2
- 8. Thirunavukarasu AJ, Hassan R, Mahmood S, Sanghera R, Barzangi K, El Mukashfi M, Shah S. Trialling a large language model (ChatGPT) in general practice with the Applied Knowledge Test: observational study demonstrating opportunities and limitations in primary care. JMIR Med Educ. 2023 Apr 21;9:e46599. doi: 10.2196/46599
- 9. Rao A, Pang M, Kim J, Kamineni M, Lie W, Prasad AK, Landman A, Dreyer K, Succi MD. Assessing the utility of ChatGPT throughout the entire clinical workflow: development and usability study. J Med Internet Res. 2023 Aug 22;25:e48659. doi: 10.2196/48659
- 10. Sharma A, Lin IW, Miner AS, Atkins DC, Althoff T. Human–AI collaboration enables more empathic conversations in text-based peer-to-peer mental health support. Nat Mach Intell. 2023 Jan 23;5(1):46–57. doi: 10.1038/s42256-022-00593-2
- 11. Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, Faix DJ, Goodman AM, Longhurst CA, Hogarth M, Smith DM. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023 Jun 1;183(6):589–596. doi: 10.1001/jamainternmed.2023.1838
- 12. Yao B, Jiang M, Yang D, Hu J. Empowering LLM-based machine translation with cultural awareness. arXiv. Preprint posted online on May 23, 2023. https://arxiv.org/pdf/2305.14328.pdf
- 13. Wiest IC, Ferber D, Zhu J, van Treeck M, Meyer SK, Juglan R, Carrero ZI, Paech D, Kleesiek J, Ebert MP, Truhn D, Kather JN. From text to tables: a local privacy preserving large language model for structured information retrieval from medical documents. medRxiv. Preprint posted online on Dec 8, 2023. doi: 10.1101/2023.12.07.23299648
- 14. Cai W. Feasibility and prospect of privacy-preserving large language models in radiology. Radiology. 2023 Oct;309(1):e232335. doi: 10.1148/radiol.232335
- 15. Yang R, Tan TF, Lu W, Thirunavukarasu AJ, Ting DSW, Liu N. Large language models in health care: development, applications, and challenges. Health Care Science. 2023 Jul 24;2(4):255–263. doi: 10.1002/hcs2.61
- 16. Ogueji K, Zhu Y, Lin J. Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages. In: Ataman D, Birch A, Conneau A, Firat O, Ruder S, Sahin GG, editors. Proceedings of the 1st Workshop on Multilingual Representation Learning. Kerrville, TX: Association for Computational Linguistics; 2021. pp. 116–126.
- 17. Yang S, Lee J, Sezgin E, Bridge J, Lin S. Clinical advice by voice assistants on postpartum depression: cross-sectional investigation using Apple Siri, Amazon Alexa, Google Assistant, and Microsoft Cortana. JMIR Mhealth Uhealth. 2021 Jan 11;9(1):e24045. doi: 10.2196/24045
- 18. Sezgin E, Chekeni F, Lee J, Keim S. Clinical accuracy of large language models and Google search responses to postpartum depression questions: cross-sectional study. J Med Internet Res. 2023 Sep 11;25:e49240. doi: 10.2196/49240
- 19. Xu B, Liu X, Shen H, Han Z, Li Y, Yue M, Peng Z, Liu Y, Yao Z, Xu D. Gentopia: a collaborative platform for tool-augmented LLMs. arXiv. Preprint posted online on Aug 8, 2023. doi: 10.18653/v1/2023.emnlp-demo.20
- 20. Chen C, Feng X, Zhou J, Yin J, Zheng X. Federated large language model: a position paper. arXiv. Preprint posted online on Jul 18, 2023. https://arxiv.org/pdf/2307.08925.pdf
- 21. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023 Aug;29(8):1930–1940. doi: 10.1038/s41591-023-02448-8
- 22. Sezgin E, Militello LK, Huang Y, Lin S. A scoping review of patient-facing, behavioral health interventions with voice assistant technology targeting self-management and healthy lifestyle behaviors. Transl Behav Med. 2020 Aug 7;10(3):606–628. doi: 10.1093/tbm/ibz141
- 23. Agbavor F, Liang H. Predicting dementia from spontaneous speech using large language models. PLOS Digit Health. 2022 Dec 22;1(12):e0000168. doi: 10.1371/journal.pdig.0000168
- 24. AMA issues new principles for AI development, deployment and use. American Medical Association. Nov 28, 2023 [accessed 2023-12-25]. https://www.ama-assn.org/press-center/press-releases/ama-issues-new-principles-ai-development-deployment-use
- 25. Meskó B, Topol EJ. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ Digit Med. 2023 Jul 6;6(1):120. doi: 10.1038/s41746-023-00873-0
- 26. Hacker P, Engel A, Mauer M. Regulating ChatGPT and other large generative AI models. In: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency. New York, NY: Association for Computing Machinery; 2023. pp. 1112–1123.
- 27. Mökander J, Schuett J, Kirk HR, Floridi L. Auditing large language models: a three-layered approach. AI Ethics. 2023 May 30. doi: 10.1007/s43681-023-00289-2
- 28. Liao F, Adelaine S, Afshar M, Patterson BW. Governance of clinical AI applications to facilitate safe and equitable deployment in a large health system: key elements and early successes. Front Digit Health. 2022 Aug 24;4:931439. doi: 10.3389/fdgth.2022.931439
- 29. De Angelis L, Baglivo F, Arzilli G, Privitera GP, Ferragina P, Tozzi AE, Rizzo C. ChatGPT and the rise of large language models: the new AI-driven infodemic threat in public health. Front Public Health. 2023 Apr 25;11:1166120. doi: 10.3389/fpubh.2023.1166120
- 30. Ding H, Simmich J, Vaezipour A, Andrews N, Russell T. Evaluation framework for conversational agents with artificial intelligence in health interventions: a systematic scoping review. J Am Med Inform Assoc. 2023 Dec 9. doi: 10.1093/jamia/ocad222. Epub ahead of print.
- 31. Chang Y, Wang X, Wang J, Wu Y, Yang L, Zhu K, Chen H, Yi X, Wang C, Wang Y, Ye W, Zhang Y, Chang Y, Yu PS, Yang Q, Xie X. A survey on evaluation of large language models. arXiv. Preprint posted online on Dec 29, 2023. https://arxiv.org/pdf/2307.03109.pdf
