Abstract
This cross-sectional study assesses a language model’s ability to distinguish messages sent by adolescent patients from those sent by their parents/guardians and to generate appropriate responses.
Health systems have started using large language models (LLMs) to draft patient message responses, which may generate inequities when applied in pediatric care due to unique attributes of pediatric patients.1,2 Specifically, clinicians responding to adolescent patient messages understand that patients’ parents or legal guardians (ie, proxies) have access to most of their children’s health information and can send messages directly on their behalf.3 Pediatric clinicians are therefore required to identify whether messages are being sent from the patient or proxy user (ie, proxy user identification) to communicate effectively and protect adolescent confidentiality. The extent to which LLMs can replicate this capability is unclear and important to quantify to ensure that the implementation of these tools does not impose a disproportionate burden on pediatric clinicians.
This cross-sectional study evaluates LLM-generated responses to clinical messages from adolescent patients and their proxies. Primary outcomes assess an LLM’s ability to identify proxy users and maintain confidentiality. Secondary outcomes measure the quality of LLM-generated responses.
Methods
This study was conducted between September 1, 2023, and December 22, 2023, at Stanford Medicine Children’s Health, an academic children’s health network affiliated with Stanford University School of Medicine. This study followed the STROBE reporting guideline and was approved by the Stanford institutional review board with a waiver of informed consent because data were deidentified. Patient portal messages were randomly obtained from patients aged 12 to 17 years and their associated proxy users. Eligibility criteria included messages in English sent to a clinician providing direct patient care within an ambulatory setting and requiring a response. The responses were generated by accessing a proprietary LLM (GPT-4; OpenAI), using a prompt that contained electronic health record data, including the patient’s problem list, medications, and laboratory results.4
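The letter does not publish the exact prompt or pipeline; purely as an illustration, a draft-response request of this general shape could be assembled for the OpenAI chat API. The function name, field names, and instruction wording below are hypothetical assumptions, not the study’s actual prompt:

```python
# Hypothetical sketch of packing EHR context (problem list, medications,
# laboratory results) and a portal message into a chat prompt. All names
# and instruction text here are illustrative assumptions.

def build_prompt(ehr: dict, message: str) -> list[dict]:
    """Assemble chat messages from EHR context plus the portal message."""
    context = (
        f"Problem list: {', '.join(ehr['problem_list'])}\n"
        f"Medications: {', '.join(ehr['medications'])}\n"
        f"Recent laboratory results: {', '.join(ehr['labs'])}"
    )
    return [
        {"role": "system",
         "content": "Draft a reply to this patient portal message for "
                    "clinician review, using the context below.\n" + context},
        {"role": "user", "content": message},
    ]

prompt = build_prompt(
    {"problem_list": ["asthma"], "medications": ["albuterol"],
     "labs": ["CBC normal"]},
    "My daughter has been coughing at night. Should we adjust her inhaler?",
)
# A real call would then resemble:
# from openai import OpenAI
# reply = OpenAI().chat.completions.create(model="gpt-4", messages=prompt)
```

The drafted reply would be reviewed and edited by a clinician before sending, consistent with the human-oversight model described in the Discussion.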
A rubric was developed a priori based on the primary and secondary outcomes (eMethods in Supplement 1). It included evaluation of LLM accuracy in proxy user identification and protection of confidentiality. Quality of response was measured across the domains of relevance, factual correctness, literacy, conciseness, and style. Two pediatricians (G.T. and A.Z.) applied the rubric to the LLM-generated responses, with 20% overlap. Differences in evaluation were adjudicated by a third reviewer (K.E.M.), and interrater reliability was calculated with the Cohen κ statistic.
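Cohen κ measures agreement between two raters beyond what chance alone would produce: κ = (pₒ − pₑ)/(1 − pₑ), where pₒ is observed agreement and pₑ is the agreement expected from each rater’s marginal label frequencies. A minimal, self-contained sketch (the rater labels in the example are hypothetical):

```python
from collections import Counter

def cohen_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters over the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    if expected == 1.0:
        # Both raters used a single identical label throughout;
        # kappa is undefined and conventionally reported as 0.
        return 0.0
    return (observed - expected) / (1 - expected)

# Hypothetical ratings: high raw agreement, moderate chance-corrected kappa.
print(cohen_kappa(["yes", "yes", "yes", "no"], ["yes", "no", "yes", "no"]))
```

This behavior explains how Table 1 can report 100% interrater agreement alongside κ = 0: when both raters give the same single label on every case, expected agreement reaches 1 and κ is undefined, conventionally reported as 0.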
Results
Of 300 messages, 213 (71%) were sent by proxy users, 35 (12%) were sent by adolescent patients, and 52 (17%) were indiscernible based on message content. Among the 213 proxy user messages, the LLM correctly identified the proxy user in 161 (76%). The LLM disclosed unsolicited confidential information in fewer than 1% of cases (2 of 300) (Table 1). Regarding quality of response, 91% of responses were written in plain language, and 67% were rated as clinically useful (Table 2).
Table 1. Proxy User Identification and Protection of Confidentiality.
| Proxy user identification and confidentiality rubric | Yes, No. (%) | No, No. (%) | Unknown, No. (%) | Interrater agreement, % | Cohen κ |
|---|---|---|---|---|---|
| Is the message sent from a proxy user? | 213 (71) | 35 (12) | 52 (17) | 96 | 0.84 |
| Does the response correctly identify the proxy user? | 161 (76) | 52 (24) | NA | 87 | 0.58 |
| Does the response protect confidential information? | 298 (99) | 2 (1) | NA | 100 | 0 |
Abbreviation: NA, not applicable.
Table 2. Evaluation of Quality of LLM-Generated Responses.
| Quality of response rubric | Agree, No. (%) | Neutral, No. (%) | Disagree, No. (%) | Interrater agreement, % | Cohen κ |
|---|---|---|---|---|---|
| The response is relevant to and appropriately addresses the message based on the information provided. | 228 (76) | 48 (16) | 24 (8) | 68 | 0.14 |
| Based on your knowledge as a medical doctor, the response is factually correct and does not include inaccurate or false information. | 231 (76) | 58 (19) | 11 (4) | 64 | 0.06 |
| The response is easy to understand and uses patient-friendly language. | 273 (91) | 24 (8) | 3 (1) | 91 | 0 |
| The length of the response is concise and does not contain any nonimportant information that distracts from the message content. | 279 (93) | 19 (6) | 2 (1) | 85 | 0 |
| The style and tone of the response is appropriate. | 207 (69) | 93 (31) | 0 | 57 | 0.13 |
| Would this message be clinically useful to you? | 202 (67) | NA | 98 (33) | 79 | 0.51 |
Abbreviation: NA, not applicable.
Discussion
This study highlights how LLMs used to draft patient message responses, in their current state, present unique challenges that disproportionately affect pediatric clinicians. Proxy user identification by the LLM was error prone, and rare but present breaches of confidentiality suggest a need for vigilant human oversight to verify and correct LLM-generated responses. Patient messages are associated with increased time spent after hours in the electronic health record and contribute to clinician burnout, so this additional burden on pediatric clinicians may create disparities across those domains.5,6 Despite these challenges, LLM-generated responses were professional, empathetic, and written in plain language, which has implications for promoting health literacy.
Limitations include an evaluation rubric that has not been validated and a focus solely on English messages within a single academic health network, limiting generalizability to other settings. Nonetheless, the findings reveal that while LLMs can generate high-quality responses, they are not yet reliable at proxy user identification or protection of confidential health information. Future research should optimize LLMs to address unique challenges in the field of pediatrics while ensuring these tools do not place an inequitable burden on pediatric clinicians.
eMethods. User Identification and Confidentiality Rubric and Quality of Response Rubric
Data Sharing Statement
References
1. Garcia P, Ma SP, Shah S, et al. Artificial intelligence-generated draft replies to patient inbox messages. JAMA Netw Open. 2024;7(3):e243201. doi:10.1001/jamanetworkopen.2024.3201
2. Muralidharan V, Burgart A, Daneshjou R, Rose S. Recommendations for the use of pediatric data in artificial intelligence and machine learning: ACCEPT-AI. NPJ Digit Med. 2023;6(1):166. doi:10.1038/s41746-023-00898-5
3. US Dept of Health and Human Services. Summary of the HIPAA Privacy Rule. Published May 7, 2008. Accessed March 19, 2024. https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html
4. OpenAI Developer Platform. Overview. Accessed September 28, 2023. https://platform.openai.com
5. Tai-Seale M, Dillon EC, Yang Y, et al. Physicians’ well-being linked to in-basket messages generated by algorithms in electronic health records. Health Aff (Millwood). 2019;38(7):1073-1078. doi:10.1377/hlthaff.2018.05509
6. Holmgren AJ, Downing NL, Tang M, Sharp C, Longhurst C, Huckman RS. Assessing the impact of the COVID-19 pandemic on clinician ambulatory electronic health record use. J Am Med Inform Assoc. 2022;29(3):453-460. doi:10.1093/jamia/ocab268
