The Gerontologist. 2024 Jun 4;64(8):gnae062. doi: 10.1093/geront/gnae062

Are Virtual Assistants Trustworthy for Medicare Information: An Examination of Accuracy and Reliability

Emily Langston, Neil Charness, Walter Boot
Editor: Joseph E Gaugler
PMCID: PMC11258897  PMID: 38832398

Abstract

Background and Objectives

Advances in artificial intelligence (AI)-based virtual assistants provide a potential opportunity for older adults to use this technology in the context of health information-seeking. Meta-analysis on trust in AI shows that users are influenced by the accuracy and reliability of the AI trustee. We evaluated these dimensions for responses to Medicare queries.

Research Design and Methods

During the summer of 2023, we assessed the accuracy and reliability of Alexa, Google Assistant, Bard, and ChatGPT-4 on Medicare terminology and general content from a large, standardized question set. We compared the accuracy of these AI systems to that of a large representative sample of Medicare beneficiaries who were queried twenty years prior.

Results

Alexa and Google Assistant were found to be highly inaccurate when compared to beneficiaries’ mean accuracy of 68.4% on terminology queries and 53.0% on general Medicare content. Bard and ChatGPT-4 answered Medicare terminology queries perfectly and performed much better on general Medicare content queries (Bard = 96.3%, ChatGPT-4 = 92.6%) than the average Medicare beneficiary. About one month to a month-and-a-half later, we found that Bard and Alexa’s accuracy stayed the same, whereas ChatGPT-4’s performance nominally decreased, and Google Assistant’s performance nominally increased.

Discussion and Implications

LLM-based assistants generate trustworthy information in response to carefully phrased queries about Medicare, in contrast to Alexa and Google Assistant. Further studies will be needed to determine what factors beyond accuracy and reliability influence the adoption and use of such technology for Medicare decision-making.

Keywords: Artificial intelligence, LLM, Medicare

Background and Objectives

In 2021, ~99% of older adults in the United States were enrolled in Medicare, the United States’ federally funded health insurance plan for those 65+ years or those with certain disabilities (Centers for Medicare and Medicaid Services; US Census Bureau, 2023). Medicare has been described as a complex and overwhelming system (Castillo et al., 2023). To successfully navigate Medicare, beneficiaries must find information about plans, plan eligibility, enrollment steps, and deadlines, and evaluate this information from the perspective of their personal preferences and needs. Older adults may lack important knowledge about what services are covered by Medicare plans (Ankuda et al., 2020; Sivakumar et al., 2016). Furthermore, poor health literacy can lead beneficiaries to choose plans that do not align with their actual preferences (Braun et al., 2018).

When enrolling in Medicare, older adults can access resources provided by the Federal government, such as Medicare.gov, the official Medicare handbook, and the 24/7 Medicare helpline accessible by voice or by typing into a chat box. Beneficiaries may also gather information from state health insurance assistance programs, friends, and family, as well as from commercial plan providers. However, such varied Medicare sources can overwhelm beneficiaries, who may receive misleading or conflicting advice (Castillo et al., 2023). A challenge for the information seeker is deciding which information channels are trustworthy, particularly for online sources (Sbaffi & Rowley, 2017).

Medicare.gov is one trustworthy resource with accurate information about Medicare. However, older adults can experience difficulty navigating and completing tasks with Medicare.gov (Czaja et al., 2008; Sharit et al., 2011). A 2019 Government Accountability Office report found that beneficiaries reported difficulty finding information with the Medicare.gov plan finder feature (U.S. Government Accountability Office, 2019). Additionally, Stults et al. (2017, 2018) found that beneficiaries felt overwhelmed and confused when using Medicare.gov to find a prescription drug plan.

Here we investigate the performance of artificial intelligence (AI)-based virtual assistants for Medicare information-seeking.

AI-based virtual assistants are AI-based technologies that users can query or ask to complete a task. Virtual assistants can be voice-activated, for example, through smart speakers such as Alexa, or can be text-input-based chatbots such as ChatGPT and Google Bard. AI-based virtual assistants are becoming increasingly popular: as of 2019, one-quarter of Americans owned a smart speaker, as did 19% of those aged 65+ years (Auxier, 2019).

In this study, we investigated the accuracy of four AI-based virtual assistants (Amazon Alexa, Google Assistant, ChatGPT-4, and Google Bard) for Medicare queries. There are several differences between these systems (see Supplementary Table S1). However, all four of these virtual assistants rely on some form of natural language processing (NLP) to understand and respond to user input (Hoy, 2018; Orrù et al., 2023). NLP refers to the systems’ abilities to determine the meaning behind user input and respond appropriately (Fanni et al., 2023). With the help of NLP, Amazon Alexa and Google Assistant respond to users’ commands using a “command and control” system (Chen, Grant, et al., 2023; Hoy, 2018). In response to a command, these systems provide users with outputs from specialized servers (Hoy, 2018). Alternatively, ChatGPT and Bard rely on large language models (LLMs), a type of NLP model trained on large amounts of textual data (Orrù et al., 2023). LLMs function by identifying statistical relationships, or dependencies, between phrases in the text they are trained on and then generating responses to users’ queries based on those relationships (Demszky et al., 2023).
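To make the idea of “statistical relationships” concrete, the toy sketch below (our own Python illustration, not how Bard or ChatGPT-4 are actually implemented) counts which words follow which in a tiny, made-up corpus and then generates text from those counts; real LLMs learn far richer dependencies with billions of parameters.

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus (hypothetical sentences).
corpus = ("medicare part a covers hospital stays . "
          "medicare part b covers doctor visits .").split()

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for prev_word, next_word in zip(corpus, corpus[1:]):
    bigram_counts[prev_word][next_word] += 1

def generate(start, length=6):
    """Generate text by repeatedly choosing the most frequent next word."""
    words = [start]
    for _ in range(length):
        followers = bigram_counts.get(words[-1])
        if not followers:
            break
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

print(generate("medicare"))  # -> "medicare part a covers hospital stays ."
```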

Despite their popularity, virtual assistants have been shown to output inaccurate information. Amazon Alexa can be vulnerable to third-party applications that rephrase Alexa’s responses to intentionally mislead users (Sharevski et al., 2022). Furthermore, LLMs such as ChatGPT-4 may provide users with “hallucinations”: plausible but inaccurate answers delivered confidently, rather than an acknowledgment that the system does not have an answer to the user’s question (More, 2023; Walters & Wilder, 2023).

Several polls conducted in 2022 suggest that the public is skeptical about the use of AI, particularly for healthcare provision. Results from a representative sample of Americans queried by Pew Research show that more than half of adults between ages 18 and 49 reported being very or somewhat uncomfortable if their healthcare provider were to rely on AI (Tyson et al., 2023). This distrust increases with age: for those 50 years and older, almost two-thirds disapprove of such use. A 2023 poll showed that although most Americans had heard of ChatGPT, very few had tried to use it (Vogels, 2023), with about 21% of those users being ages 18–29 but only 2% being ages 65+. Comparatively, a 2017 poll reported that only 37% of Americans age 50 or older had experience using virtual voice assistants, compared to 55% of Americans age 18–49 (Olmstead, 2017). Such a digital divide in technology use is challenging to address (Charness & Boot, 2022).

A recent meta-analysis of trust in AI by Kaplan et al. (2021) indicated that three dimensions predicted trust: characteristics of the human trustor, the AI trustee, and the shared context of the interaction between them. Kaplan et al. (2021) found that the most heavily weighted factors concerning the AI component were AI performance (d = 1.48) and reliability (d = 2.70). Additionally, Pattanaphanchai et al. (2013) propose that accuracy is an important factor that users should consider when assessing the trustworthiness of online information. Thus, we chose to investigate the accuracy and reliability of these AI-based virtual assistants for Medicare queries.

An additional benchmark proposed for judging the utility of AI systems is whether they are inferior to, comparable to, or superior to people (Di Lillo et al., 2023). We used psychometrically validated knowledge measurement instruments developed for the Centers for Medicare and Medicaid Services to compare the accuracy of these virtual assistants’ responses to Medicare queries with that of a Medicare beneficiary sample queried between May and August 2003 (Bann & McCormack, 2005).

Although our ultimate goal for this CREATE V project (www.create-center.org) is to provide guidelines and support to older adults for Medicare decision-making, here we restrict our attention to the first stage of decision-making, information-seeking, and the potential role of AI-based virtual assistants in supporting information gathering. By investigating the accuracy and reliability of information provided by these virtual assistants, we can assess their feasibility as trustworthy sources for older adults seeking Medicare information. We chose to investigate the accuracy and reliability of both earlier commercially available AI-based virtual assistants, Amazon’s Alexa and Google Assistant, and two newer LLM-based chatbots, ChatGPT-4 and Google’s Bard.

Hence, our goals in this study were (1) to assess scoring reliability for complex responses from these virtual assistants without having to rely on difficult-to-recruit Medicare experts (e.g., in contrast to Singhal et al., 2023’s use of medical specialists), (2) to compare the accuracy of recently developed AI-based chatbots to that of earlier AI-based virtual assistants, Alexa and Google Assistant, as well as to a prior large sample of Medicare beneficiaries, and (3) to assess virtual assistant stability over a one- to one-and-a-half-month period.

Research Design and Methods

Two commercially available LLM-based virtual assistants, Google Bard and ChatGPT-4, and two commercially available non-LLM-based virtual assistants, Amazon Alexa and Google Assistant, were chosen to sample both voice-based (Alexa and Google Assistant) and text-based AI assistants (Bard and ChatGPT-4). These virtual assistants were selected for their popularity among users (Dastin, 2023; Hu, 2023; Wohr, 2023). Questions for assessing virtual assistants’ proficiency at answering Medicare queries were selected from a Medicare Current Beneficiary Survey (MCBS) by Bann and McCormack (2005). MCBSs are distributed annually by Medicare to collect nationally representative estimates of beneficiaries’ demographic information, health status, care costs, program enrollment, satisfaction with care, and understanding of the current Medicare system.

The original Bann and McCormack (2005) survey included 56 “Medicare knowledge” questions from the following sections: “eligibility for Medicare and the structure of original Medicare,” “Medicare + Choice,” “plan choices and decision making,” “information and assistance,” “beneficiary rights,” “Medigap/employer-sponsored supplemental insurance,” as well as 26 “health literacy” questions from the following two sections: “terminology” and “reading comprehension.”

To avoid anthropomorphizing virtual assistants, which do not possess “knowledge” in the cognitive sense of the word, we refer to what Bann and McCormack (2005) call “Medicare knowledge” questions as “general Medicare content” questions. We used all of the general Medicare content questions within this questionnaire except for one question in the “Medigap/Employer-Sponsored Supplemental Insurance” section that lacked a published answer. Within the “health literacy” section, we used all of the terminology questions and excluded the reading comprehension questions. After omitting these questions, we retained 65 of the 81 original questions used by Bann and McCormack (2005).

Because Medicare’s Prescription Drug Plan, “Part D,” went into effect following the creation of the survey by Bann and McCormack (2005), we created a section of Part D queries to assess the virtual assistants’ ability to answer content questions about Medicare Part D. This section included 12 true/false questions about Part D enrollment, coverage, and costs. The questions were drawn from the Part D section of the official U.S. government Medicare handbook for 2023. In total, 77 queries were presented to the virtual assistants.

All of the general Medicare content and Part D questions we used were formatted as having true/false or yes/no answers, whereas all of the terminology questions were formatted to have multiple-choice answers. To allow verbal presentation, true/false questions were turned into interrogative form for Alexa and Google Assistant (see Supplementary Tables S2 and S3). In contrast, we presented ChatGPT-4 and Bard with the original form of each query (see Supplementary Tables S4 and S5). All Part D questions were created and queried as interrogative statements, such as “Is it true or false that I can drop and switch my Medicare drug plan at any time?” Terminology questions were presented to Alexa and Google Assistant by adding the phrase, “Which of the following two statements defines the term, ‘[term in question]’” (see Supplementary Tables S2 and S3). In contrast, we presented ChatGPT-4 and Bard with the original form of each terminology question (see Supplementary Tables S4 and S5).

There were two instances in which terminology questions started with stems that gave a short description of the term in question. For example, one terminology question asked “A provider network is a group of doctors, hospitals, and other healthcare professionals who: specialize in treating people with diseases, work with an HMO to take care of its members.” When querying Google Assistant and Alexa with these questions, the phrase “who do which of the following” was added so that the stem read “a provider network is a group of doctors, hospitals, and other healthcare professionals who do which of the following?” We presented multiple-choice queries to Google Assistant and Alexa by adding possible multiple-choice answers after presenting the question stem (see Supplementary Tables S2 and S3). Alternatively, we presented ChatGPT-4 and Bard with the original forms of the multiple-choice queries (see Supplementary Tables S4 and S5).
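As a minimal sketch of the reformatting just described (our own Python illustration, not the authors’ actual tooling; the helper names and option wording are hypothetical, while the stem phrasing and example statement come from the text above):

```python
def to_voice_true_false(statement):
    """Turn a true/false statement into the interrogative form used for
    Alexa and Google Assistant; the chatbots received the original wording."""
    body = statement.rstrip(".")
    return f"Is it true or false that {body[0].lower()}{body[1:]}?"

def to_voice_terminology(term, options):
    """Prefix a terminology item with the stem used for the voice assistants
    and read out the multiple-choice options after it."""
    stem = f"Which of the following two statements defines the term, '{term}'?"
    return stem + " " + "; or, ".join(options)

print(to_voice_true_false("Part A of the Medicare program covers hospital stays."))
# -> Is it true or false that part A of the Medicare program covers hospital stays?

print(to_voice_terminology(
    "provider network",
    ["a group of providers who specialize in treating people with diseases",
     "a group of providers who work with an HMO to take care of its members"]))
```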

Voice assistants were queried by a native English speaker. We created a Google account to query Bard and Google Assistant. We ensured that Google was not tracking user activity or history, and not using personalized results or advertisements. An Amazon account was created to query Alexa. Google Bard was queried on bard.google.com, using the Google Chrome browser, on a Mac running macOS 13 Ventura. Google Assistant (version 1.9.8073) was queried using Google’s “Assistant” app on an iPhone 13. Amazon Alexa was queried using a fourth-generation Echo Dot. We queried ChatGPT-4 using Microsoft Edge’s Bing Chat, a ChatGPT-4-powered chat feature available through Bing’s search engine. This was also done on a Mac running macOS 13 Ventura. Bing Chat provides users with a choice of three “conversation styles” (“more creative,” “more precise,” and “more balanced”) that determine how Bing Chat will respond to a user’s query. When querying ChatGPT-4 on Bing Chat, we chose the default “more balanced” style.

Google Assistant and Alexa were both queried with our subset of the Bann and McCormack (2005) questionnaire on June 8, 2023, and ChatGPT-4 and Bard were queried on June 26, 2023. To assess reliability, we requeried Google Assistant, Bard, and ChatGPT-4 with the terminology questions on July 25, 2023, and Alexa on July 27, 2023. Bard and ChatGPT-4 chat logs were cleared between queries. Text answers from ChatGPT-4 and Bard were copied and pasted into a score sheet. Our conversation with Google Assistant was recorded and transcribed from audio. Our conversation with Amazon Alexa was accessed by viewing Alexa’s voice history on the Amazon account that we created for this study. In the case of ChatGPT-4, Bard, and Google Assistant, sources, pictures, videos, and charts included in an answer were recorded as well.

The virtual assistants’ query responses were coded independently, as correct or incorrect, by two of the authors (E. Langston and N. Charness). Raters scored answers as correct if the content of the answer aligned with the sentiment expressed by the corresponding answers provided within the assessments taken from Bann and McCormack (2005). In certain cases, Google Assistant would simply respond with a link to a suggested relevant website. These responses were coded as being incorrect. The virtual assistants’ answers were not judged on conciseness. Superfluous information was not counted against them.

Results

Interrater Agreement

Raters coded the recorded responses from the virtual assistants independently and then compared their results. Cases where there was disagreement were discussed by the raters and then rescored to assess accuracy levels. An example of virtual assistant responses and their corresponding ratings can be seen in Table 1. Agreement was assessed using Cohen’s kappa (κ). Raters were in 100% agreement for all virtual assistant responses to terminology questions queried during the initial querying session as well as those requeried one month to a month-and-a-half following the initial session. Raters were also in 100% agreement for all virtual assistant responses to Part D queries. There was strong agreement between raters regarding ChatGPT-4’s (κ = 0.88, 88.3% agreement), Bard’s (100% agreement), Alexa’s (κ = 0.88, 88.0% agreement), and Google Assistant’s (100% agreement) responses to general Medicare content questions.
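As a minimal sketch of the agreement statistics used here, assuming two short hypothetical rating vectors (the actual coded ratings from the study are not reproduced):

```python
from sklearn.metrics import cohen_kappa_score

# 1 = response rated correct, 0 = rated incorrect (hypothetical values).
rater_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_2 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]

kappa = cohen_kappa_score(rater_1, rater_2)
percent_agreement = sum(a == b for a, b in zip(rater_1, rater_2)) / len(rater_1)
print(f"kappa = {kappa:.2f}, agreement = {percent_agreement:.1%}")
# -> kappa = 0.74, agreement = 90.0% (for these hypothetical ratings)
```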

Table 1.

Examples of Virtual Assistant Responses and Rater Coding in Response to the Question “Is It True or False That Part A of the Medicare Program Covers Hospital Stays?”

Responder type AI response Rater coding
Bard “True. Part A of the Medicare program covers hospital stays. It also covers care in a skilled nursing facility, hospice care, and some home healthcare. Part A is funded by payroll taxes, so most people who are eligible for Medicare automatically get Part A. However, there are some exceptions. For example, if you’re younger than 65 and you have a disability, you may have to pay a monthly premium for Part A. If you have Part A, you’ll generally have to pay a deductible for each hospital stay. The deductible for 2023 is $1,556. After you meet the deductible, Medicare will pay for most of your hospital care. However, there are some coinsurance costs that you’ll have to pay. Part A also covers some of the costs of inpatient care in a skilled nursing facility. The maximum number of days that Medicare will cover for skilled nursing facility care is 100 days per benefit period. A benefit period begins the day you’re admitted to a hospital or skilled nursing facility and ends when you haven’t received any inpatient care for 60 days in a row. If you have any questions about Part A of the Medicare program, you should talk to your doctor or a Medicare representative.” Correct
ChatGPT-4 “That’s true. Medicare Part A is hospital insurance and it typically covers inpatient surgeries, bloodwork and diagnostics, and hospital stays. It also includes coverage for blood transfusions when done during a hospital stay, limited skilled nursing facility care, limited home healthcare, and hospice care.” Correct
Alexa “Medicare Part A covers the hospital insurance program, including hospital expenses skilled nursing facilities, home healthcare and hospice care.” Correct
Google Assistant “Here are some results from the web.” Incorrect

Virtual Assistant Proficiency

Virtual assistant accuracy was scored by calculating the number of questions answered correctly in the first round of querying. We utilized data from Bann and McCormack (2005) to compare virtual assistant proficiency to that of a large and diverse sample of current Medicare beneficiaries (n = 2,634). This beneficiary sample was selected to be representative of gender, race, education, age, income, enrollment in managed care, and purchase of private supplemental insurance. Participants were divided into two cohorts and given different sections of the survey to complete. Cohort one received questions related to “knowledge and health literacy,” whereas cohort two received questions related to “knowledge, cognition, as well as the remaining items in the supplement” (Bann & McCormack, 2005). Bann and McCormack (2005) did not specify the number of participants in each cohort. To calculate confidence intervals to compare the accuracy of the Bann and McCormack (2005) participants with that of the virtual assistant responders, we estimated a sample size of 1,317 participants (exactly half of the total sample) per cohort. Comparing the standard error from a sample size of 1,317 participants to the standard error from a conservative sample size of 1,000 participants, a situation possible given the potential for attrition or incomplete responses, we found a difference of less than 0.01. Thus, we found it acceptable to move forward with a sample size of 1,317 participants per cohort. To generate accuracy scores for Medicare beneficiaries by section (i.e., terminology questions, general Medicare content questions, and so on), we averaged across the percent correct scores given for each question. Accuracy scores for virtual assistant responders were calculated by averaging the percentage of questions answered correctly for each section.
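The standard error comparison and confidence intervals described above follow the usual normal approximation for a proportion; a minimal sketch, using the beneficiaries’ 68.4% terminology accuracy (reported in the Results) and the assumed cohort sizes:

```python
import math

def se_of_proportion(p, n):
    """Standard error of a sample proportion under the normal approximation."""
    return math.sqrt(p * (1 - p) / n)

p = 0.684                        # beneficiaries' mean terminology accuracy
n_assumed, n_conservative = 1317, 1000

se_assumed = se_of_proportion(p, n_assumed)
se_conservative = se_of_proportion(p, n_conservative)
print(f"SE difference: {abs(se_assumed - se_conservative):.4f}")  # < 0.01

half_width = 1.96 * se_assumed
print(f"95% CI: [{(p - half_width) * 100:.1f}, {(p + half_width) * 100:.1f}]")
# -> [65.9, 70.9], matching the interval reported in the Results
```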

As seen in Table 2, Bard correctly answered 100% of the terminology questions, 96.3% of the general Medicare content questions, and 100% of the Part D questions, with an overall score of 97.4% of queries answered correctly. ChatGPT-4 correctly answered 100% of the terminology questions, 92.6% of the general Medicare content questions, and 91.7% of the Part D questions for an overall score of 93.4% of queries answered correctly. Alexa answered 0.00% of the terminology questions correctly, 38.9% of the general Medicare content questions correctly, and 16.7% of the Part D questions correctly, for an overall score of 30.3% of queries answered correctly. Google Assistant correctly answered 20.0% of the terminology questions, 0.00% of the general Medicare content questions, and 0.00% of the Part D questions, for an overall score of 2.63% of queries answered correctly. Comparatively, Medicare current beneficiary respondents correctly answered 68.4% of terminology questions, 95% CI [65.9–70.9], and 53.0% of general Medicare content questions, 95% CI [50.3–55.7], for an overall score of 55.4% of queries answered correctly.

Table 2.

Means and Standard Deviations of the Percentage of Questions Answered Correctly by Virtual Assistants for All Question Types During the Initial Querying Phase

Responder type Terminology General Medicare content Part D
M SD M SD M SD
Bard 100.0a 0.0 96.3a 0.2 100.0 0.00
ChatGPT-4 100.0a 0.0 92.6a 0.3 91.7 0.3
Alexa 0.0a 0.0 38.9a 0.5 16.7 0.4
Google Assistant 20.0a 0.4 0.0a 0.0 0.0 0.0

Notes: M = mean; SD = standard deviation.

aIndicates that a significant difference was found when individually comparing the AI’s response accuracy to the confidence intervals of Bann and McCormack (2005) participants’ response accuracy for general Medicare content queries or terminology queries.

When comparing the results for the terminology questions queried in June 2023 with the results for those queried in July 2023, there were some slight differences in virtual assistant performance. As seen in Table 3, during this retest phase, Bard answered terminology queries with 100% accuracy as it did previously, whereas Alexa answered terminology queries with 0.0% accuracy as it did previously. Interestingly, ChatGPT-4 saw a slight drop in the percentage of terminology queries it answered correctly, from 100% accuracy in June to 90.0% accuracy in July. When using a McNemar chi-square test for repeated measures to compare the number of terminology questions answered correctly and incorrectly by ChatGPT-4 during the initial test and retest phases, we found that this decrease in performance was not significant, χ2 (1, N = 20) = 0.05, p = .41 (one-sided). Furthermore, Google Assistant saw a nonsignificant improvement in its response accuracy for terminology queries, from 20.0% accuracy in June to 50.0% accuracy in July, χ2 (1, N = 20) = 1.25, p = .13 (one-sided).
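A minimal sketch of a McNemar test on paired correct/incorrect outcomes, assuming hypothetical cell counts consistent with ChatGPT-4’s drop from 100% to 90% on the 20 terminology items (two items flipping from correct to incorrect); the statistics reported above come from the authors’ own analysis and may reflect different settings:

```python
from statsmodels.stats.contingency_tables import mcnemar

# Rows: initial test (correct, incorrect); columns: retest (correct, incorrect).
# Hypothetical counts: 18 items correct both times, 2 correct then incorrect.
table = [[18, 2],
         [0, 0]]

# The exact binomial version is appropriate with so few discordant pairs.
result = mcnemar(table, exact=True)
print(f"statistic = {result.statistic}, p = {result.pvalue:.3f}")
# -> statistic = 0.0, p = 0.500 (not significant for these hypothetical counts)
```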

Table 3.

Means and Standard Deviations of the Percentage of Questions Answered Correctly by Virtual Assistants for Terminology Questions During the Initial Testing and Retest Phases

Responder type Initial testing phase Retest phase
M SD M SD
Bard 100.0 0.0 100.0 0.0
ChatGPT-4 100.0 0.0 90.0 0.3
Alexa 0.0 0.0 0.0 0.0
Google Assistant 20.0 0.4 50.0 0.5

Note: M = mean; SD = standard deviation.

Virtual assistants also differed in their proficiency in certain subject areas, as seen in Table 4. Within the category of general Medicare content questions, Alexa’s top-performing category, 82.4% correct, was “Eligibility for and Structure of Original Medicare,” whereas “Beneficiary Rights” and “Medigap/Employer-Sponsored Supplemental Insurance” were its lowest-performing categories with 0.0% correct. Bard correctly answered all questions in every category of general Medicare content questions except the “Eligibility for and Structure of Original Medicare” and “Medicare + Choice” sections, in which it achieved 94.1% accuracy. ChatGPT-4 was 100.0% correct within the “Eligibility for and Structure of Original Medicare,” “Information and Assistance,” and “Beneficiary Rights” sections, and performed the worst in the “Medigap/Employer-Sponsored Supplemental Insurance” category with 75.0% correct. Google Assistant notably did not correctly answer any of the questions within the general Medicare content section, though it managed 20.0% correct responses in the terminology section. Comparatively, Bann and McCormack (2005) showed that current Medicare beneficiaries most accurately answered questions in the Beneficiary Rights section, averaging 71.5% accuracy, and demonstrated the lowest accuracy in the Medigap/Employer-Sponsored Supplemental Insurance section, averaging 34.2% accuracy.

Table 4.

Means and Standard Deviations of the Percentage of Questions Answered Correctly by Virtual Assistants for General Medicare Content Question Subcategories

Responder type Eligibility for and structure of original Medicare Medicare + Choice Plan choices and health plan decision-making Information and assistance Beneficiary rights Medigap/employer-sponsored supplemental insurance
M SD M SD M SD M SD M SD M SD
Bard 94.1 0.2 94.1 0.2 100.0 0.0 100.0 0.0 100.0 0.0 100.0 0.0
ChatGPT-4 100.0 0.0 88.2 0.3 83.3 0.4 100.0 0.0 100.0 0.0 75.0 0.5
Alexa 82.4 0.4 29.4 0.5 16.7 0.4 16.7 0.4 0.0 0.0 0.0 0.0
Google Assistant 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Note: M = mean; SD = standard deviation.

Chi-square tests for responder type (Bard, ChatGPT-4, Alexa, Google Assistant, and Bann & McCormack, 2005 participants) by the percentage of correct and incorrect responses for general Medicare content questions and terminology questions showed that there was a significant association between response accuracy and responder type for general Medicare content questions, χ2 (4, N = 500) = 259.96, p < .001, as well as for terminology questions during the initial testing phase, χ2 (4, N = 500) = 345.91, p < .001, and the retest phase, χ2 (4, N = 500) = 264.70, p < .001. To better understand the relationship between responder type and response accuracy, we compared the percentage of correct responses for each virtual assistant responder type against the confidence intervals taken from the accuracy scores of Bann and McCormack (2005).
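A minimal sketch of the responder-type test for general Medicare content questions, treating each responder’s accuracy as correct/incorrect counts out of 100 responses (the implied N = 500); the percentages are taken from Table 2 and the beneficiary mean reported above, and the resulting statistic matches the reported value within rounding:

```python
import numpy as np
from scipy.stats import chi2_contingency

# General Medicare content accuracy (% correct) for each responder type.
accuracy = {"Bard": 96.3, "ChatGPT-4": 92.6, "Alexa": 38.9,
            "Google Assistant": 0.0, "Medicare beneficiaries": 53.0}

# One row per responder: [correct, incorrect] out of 100 responses.
counts = np.array([[acc, 100.0 - acc] for acc in accuracy.values()])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2({dof}, N = {counts.sum():.0f}) = {chi2:.2f}, p = {p:.3g}")
# -> chi2(4, N = 500) is approximately 260, p < .001
```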

As seen in Table 2, we found significant differences between the individual virtual assistants and Medicare beneficiaries in the percentage of terminology and general Medicare content questions answered correctly. Specifically, Bard and ChatGPT-4 answered terminology and general Medicare content questions more accurately than Medicare beneficiaries. This was true for terminology questions queried during the initial testing round and the retest round. Conversely, Medicare beneficiaries answered both terminology and general Medicare content questions more accurately than Alexa and Google Assistant. This was true for terminology query performance during the initial testing round and the retest round. When comparing responders’ performance in various subcategories of general Medicare content questions (see Tables 4 and 5), we found a similar pattern. Bard and ChatGPT-4 performed significantly better than Medicare beneficiaries in all general Medicare content subcategories, whereas Google Assistant and Alexa performed significantly worse than Medicare beneficiaries did in these categories.

Table 5.

Means and 95% Confidence Intervals of the Percentage of Questions Answered Correctly by Medicare Beneficiaries for General Medicare Content Question Subcategories

Responder type Eligibility for and structure of original Medicare Medicare + Choice Plan choices and health plan decision-making Information and assistance Beneficiary rights Medigap/employer-sponsored supplemental insurance
M CI M CI M CI M CI M CI M CI
Medicare Beneficiaries 65.9 [63.4–68.5] 41.2 [38.6–43.9] 53.5 [50.8–56.2] 52.3 [49.6–55.0] 71.5 [69.1–73.9] 34.2 [31.6–36.8]

Note: M = mean; CI = 95% confidence interval.

Chi-square tests between Bard and ChatGPT-4 for the number of correct and incorrect responses to the terminology retest query session found no significant difference in performance, χ2 (1, N = 20) = 1.05, p = .31. Nor did we find a significant difference between Bard and ChatGPT-4 for answering general Medicare content questions, χ2 (1, N = 108) = 0.71, p = .40, or Part D questions, χ2 (1, N = 24) = 1.04, p = .31. Chi-square tests between Google Assistant and Alexa for the number of correct and incorrect responses to question types revealed that there was no significant difference between Google Assistant’s and Alexa’s performance accuracy for terminology questions during the initial round of terminology testing, χ2 (1, N = 20) = 2.22, p = .14. However, there was a significant difference between Google Assistant and Alexa during the retest terminology phase, such that Google Assistant performed significantly better than Alexa, χ2 (1, N = 20) = 6.67, p = .010. Additionally, we found a significant difference between Google Assistant and Alexa for answering general Medicare content questions, with Alexa performing significantly better than Google Assistant, χ2 (1, N = 108) = 26.1, p < .001. We found no significant difference between Google Assistant and Alexa on Part D questions, χ2 (1, N = 24) = 2.18, p = .14.

Discussion and Implications

In terms of our first goal, assessing the reliability of scoring responses from virtual assistants, we found strong evidence, based on high kappa coefficients and percent agreement statistics, that the accuracy of those outputs could be judged reliably. Our second goal, comparing the response accuracy of AI-based virtual assistants to each other and to Medicare beneficiaries on Medicare survey questions, revealed that the earlier AI-based virtual assistants, Alexa and Google Assistant, were inferior to both the average Medicare beneficiary and the LLM-based virtual assistants Bard and ChatGPT-4, whereas Bard and ChatGPT-4 were superior to the Medicare beneficiaries queried 20 years earlier. Our third goal, assessing the stability of the virtual assistants over time, showed that their accuracy did not differ significantly from the initial testing round to the retest round, although ChatGPT-4 showed a nominal decrease in accuracy for terminology questions and Google Assistant showed a nominal increase.

Are these AI-based virtual assistants trustworthy for Medicare information gathering? As Kaplan et al. (2021) found, trust in AI systems depends strongly on their accuracy and reliability. Accuracy was low for Google Assistant and Alexa, which had overall scores of 2.63% and 30.3% of queries answered correctly, respectively, across the categories of general Medicare content, terminology, and Part D. Conversely, Bard and ChatGPT-4 accurately answered all Medicare terminology queries during the initial testing round and answered 96.3% and 92.6% of general Medicare content queries, respectively. The overall performance of Bard and ChatGPT-4 in both domains was significantly better than that of Medicare beneficiaries, who correctly answered 68.4% of terminology queries, 95% CI [65.9–70.9], and 53.0% of general Medicare content queries, 95% CI [50.3–55.7]. This was true for both time points at which terminology queries were given to the virtual assistants. Therefore, based on their mid-2023 performance, Alexa and Google Assistant do not appear to be trustworthy sources of information for our Medicare queries. However, the versions of Bard and ChatGPT-4 tested in the summer of 2023 did demonstrate trustworthiness based on accuracy and reliability. If older adults currently consult friends for Medicare information, they might do better to consult the LLMs, certainly for disambiguating Medicare terminology. Furthermore, one can imagine that Medicare beneficiaries given the same MCBS used in Bann and McCormack (2005) would perform better today, given that they could access information via a search engine or an LLM-based virtual assistant.

As researchers have noted (Lee & See, 2004), having properly calibrated trust is important. Although our research demonstrates that Bard and ChatGPT-4 are fairly accurate at responding to general Medicare content and terminology questions at this time, older adults must keep in mind the possibility that the information given to them by Bard and ChatGPT-4 may not be accurate, even if it is presented in a way that seems plausible. Although this paper examines the trustworthiness of information provided by virtual assistants in the context of Medicare information-seeking using the characteristics of reliability and accuracy, it does not investigate the likelihood of older adults utilizing these resources. An obvious next step would be for researchers to examine older adults’ willingness to use these virtual assistants in the specific context of Medicare information-seeking, as well as in other contexts. Future studies should also explore the characteristics of these virtual assistants that make them trustworthy or untrustworthy to older adults. A limitation of Kaplan et al.’s (2021) meta-analysis on the predictors of trust in AI is that it did not focus on, or have much representation from, older adults, who may assess trust differently. Additionally, some researchers, such as Maier et al. (2022), have found that AI performance is not a strong predictor of users’ trust.

To aid older adults in their use of these systems, it may be helpful to develop guidelines that can help them assess the trustworthiness of these systems. These guidelines should warn users of the possibility of hallucinations and should help them verify the accuracy of received information. For example, these guidelines could advise users to ask the virtual assistant to provide detailed references for its response so that users can verify the answer’s accuracy. However, a problem with this approach is that these systems sometimes provide users with false sources (Walters & Wilder, 2023). To improve acceptability, system developers should provide users and researchers with more transparency about how these systems answer queries. The current lack of transparency raises ethical concerns about how virtual assistants are acquiring their response data, as well as how they are handling potentially sensitive user query data (Wang et al., 2023). These concerns are part of a larger discussion on the ethics of AI research, one that needs to be explored in future studies (Light et al., 2024).

The text-based and voice-based modalities of these virtual assistants led to some notable differences between virtual assistants during our querying experience. For example, Alexa may not have answered any terminology questions correctly because of the verbose nature of these questions. Alexa often responded to these questions while they were still being read aloud, given its limited ability to process lengthy queries. Additionally, Google Assistant, ChatGPT-4, and Bard, unlike Alexa, were queried on devices with screens. This enabled these virtual assistants to respond to queries with text, pictures, links, and tables (see Supplementary Figure S1).

The modality of these virtual assistants and the type of device they are installed on (e.g., speaker, phone, computer) are also important factors when considering older adults’ potential experiences with AI-based virtual assistants (Cho et al., 2019). Furthermore, particularly for voice-based assistants, users’ trust may be influenced by the anthropomorphic features of the device’s response “voice” (Pak et al., 2012). These factors should be explored in future studies examining older adults’ willingness to trust and adopt these virtual assistants.

It is important to note that the differences in performance between ChatGPT-4, Bard, Alexa, and Google Assistant are not unexpected, given prior findings that an assistant’s answering abilities vary depending on the area of inquiry and the type of virtual assistant being queried (Alagha & Helbing, 2019; Goh et al., 2021). Therefore, our finding of significant differences in accuracy for different question types indicates that the relative superiority of certain virtual assistants over certain human populations and other virtual assistants may depend heavily on the specific query domain. Thus, we might not expect our results to generalize to other health domains (e.g., accuracy of disease diagnosis).

There are several caveats to our study. Medicare information represents a very large knowledge domain, one that the U.S. government recognizes as difficult to query; hence, it provides human expert advisors by telephone and live chat 24 hours a day, 7 days a week to supplement the Medicare website and the printed materials released annually. We examined the performance of AI systems on a relatively small subset of potential queries developed nearly two decades ago from a much larger universe of possible queries about Medicare. However, even within our limited query set, there were differences in the performance of the LLMs on terminology queries versus other queries. There is a need to broaden the query set in future studies.

Furthermore, because these queries were taken from a nationally distributed survey and were presented in true/false or multiple-choice formats, they are not representative of the conversational patterns of speech that users might use when communicating with these virtual assistants. Future studies should examine the accuracy of virtual assistants in response to typical queries generated by Medicare beneficiaries. This would allow researchers not only to assess virtual assistant responses to conversational questions but also to identify the types of questions with which Medicare beneficiaries require the greatest help. Another suggestion would be for researchers to assess virtual assistants’ abilities to respond to queries specifically about the Medicare.gov website. This would offer important information regarding the virtual assistants’ abilities to assist with virtual tasks, such as using the Medicare.gov plan finder.

Another caveat, tested in part by repeated queries across points in time, is that these virtual assistants are subject to changes in performance. This can occur because of changes in the version of the system, as when ChatGPT-4 demonstrated improved response accuracy for questions from the bar exam compared to ChatGPT-3.5 (Katz et al., 2023). In some cases, the response accuracy of a certain version of these systems can also increase or decrease depending on when the question was queried (Chen, Zaharia, et al., 2023; Kuroiwa et al., 2023). In the case of our queries, we saw a nonsignificant increase in Google Assistant’s performance on terminology questions (from 20.0% to 50.0%), as well as a nonsignificant decline for ChatGPT-4 on terminology questions (from 100.0% to 90.0%), over a one- to one-and-a-half-month period. Conversely, we saw no differences in the percentage of terminology questions answered correctly by Bard or Alexa when comparing the initial testing phase to the retest phase. However, two time points may be insufficient to assess virtual assistant reliability. Future studies should assess virtual assistant reliability across a minimum of three time points using longer test intervals.

It is also important to note that these virtual assistants are being updated and improved. Alexa will likely be upgraded soon to utilize LLM technology similar to that behind Bard and ChatGPT (Knight, 2023). Furthermore, as of December 2023, Google has rebranded Google Bard as Google Gemini. This change is accompanied by improved capabilities, including the ability to process more diverse and complex information, such as an hour-long movie, within a single prompt (Pichai & Hassabis, 2023). Notably, Google Gemini is reported to be the first model to perform better than human experts on the Massive Multitask Language Understanding benchmark, which assesses “world knowledge” and “problem-solving abilities” across 57 subjects (Pichai & Hassabis, 2023). Unfortunately, at the moment, these systems are effectively “black boxes” to the potential user population, making it difficult to generalize from current results to future results. Future studies could develop a set of standardized, complex conversational queries specific to Medicare beneficiaries, as well as a reliable scoring scheme, to assess changes in capabilities over time.

Perhaps LLM-based virtual assistant technology could be of most use to beneficiaries if the Centers for Medicare and Medicaid Services were able to “fine-tune” a virtual assistant by training it on information sources such as the Medicare website or handbook. Due to the complex and fluid nature of the Medicare system, this assistant would need to undergo rigorous testing and would need to be continuously adjusted in response to regulatory changes. A continually updated virtual assistant could provide users with the information that Bard and ChatGPT could not. This system should implement the accessibility requirements outlined by the U.S. Access Board (Centers for Medicare and Medicaid Services, 2023) to be inclusive of the greatest number of older adults.

Overall, our results strongly indicate that the versions of Google Assistant and Alexa tested in the summer of 2023 demonstrated poor performance in the domain of queries taken from Medicare questionnaires when compared to Bard and ChatGPT-4, as well as to a sample of Medicare beneficiaries queried in 2003. Therefore, we would not recommend the use of these virtual assistants in their current state. On the other hand, Bard and ChatGPT-4 demonstrated superior performance to that of Google Assistant, Alexa, and a sample of Medicare beneficiaries. Bard and ChatGPT-4 also demonstrated stability in their performance when comparing their responses from June 2023 to July 2023. Therefore, Bard and ChatGPT-4 may be promising resources for older adults seeking information about Medicare. However, research examining response accuracy to nuanced, conversational queries is needed. In the meantime, older adults would be wise to verify information obtained from AI systems with trained human advisors or, if they have the skills to navigate the Medicare website, to check it against the information available there.

Supplementary Material

gnae062_suppl_Supplementary_Materials

Contributor Information

Emily Langston, Department of Psychology, Florida State University, Tallahassee, Florida, USA.

Neil Charness, Department of Psychology, Florida State University, Tallahassee, Florida, USA.

Walter Boot, Department of Psychology, Florida State University, Tallahassee, Florida, USA.

Funding

This work was supported in part by a grant from the National Institute on Aging, under the auspices of the Center for Research and Education on Aging and Technology Enhancement [1P01AG073090].

Conflict of Interest

None.

Data Availability

Data, analytical methods, and materials used in this study are available to researchers for replication purposes, upon request. This study was not preregistered.

Author Contributions

Emily Langston (Data curation [Lead], Formal analysis [Lead], Investigation [Lead], Methodology [Lead], Project administration [Lead], Validation [Lead], Visualization [Lead], Writing—original draft [Lead], Writing—review & editing [Lead]); Neil Charness (Conceptualization [Lead], Data curation [Supporting], Formal analysis [Supporting], Funding acquisition [Equal], Investigation [Supporting], Methodology [Supporting], Project administration [Supporting], Resources [Supporting], Supervision [Lead], Validation [Supporting], Visualization [Supporting], Writing—original draft [Supporting], Writing—review & editing [Supporting]); Walter Boot (Conceptualization [Supporting], Data curation [Supporting], Formal analysis [Supporting], Funding acquisition [Equal], Investigation [Supporting], Methodology [Supporting], Project administration [Supporting], Resources [Supporting], Supervision [Supporting], Validation [Supporting], Visualization [Supporting], Writing – original draft [Supporting], Writing—review & editing [Supporting])

References

1. Alagha, E. C., & Helbing, R. R. (2019). Evaluating the quality of voice assistants’ responses to consumer health questions about vaccines: An exploratory comparison of Alexa, Google Assistant and Siri. BMJ Health & Care Informatics, 26(1), e100075. 10.1136/bmjhci-2019-100075
2. Ankuda, C. K., Moreno, J., McKendrick, K., & Aldridge, M. D. (2020). Trends in older adults’ knowledge of Medicare Advantage benefits, 2010 to 2016. Journal of the American Geriatrics Society, 68(10), 2343–2347. 10.1111/jgs.16656
3. Auxier, B. (2019, November 21). 5 things to know about Americans and their smart speakers. Pew Research Center. https://www.pewresearch.org/short-reads/2019/11/21/5-things-to-know-about-americans-and-their-smart-speakers/e
4. Bann, C., & McCormack, L. (2005). Measuring knowledge and health literacy among Medicare beneficiaries. Prepared for the Centers for Medicare & Medicaid Services. RTI International. https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Reports/Research-Reports-Items/CMS062191
5. Braun, R. T., Barnes, A. J., Hanoch, Y., & Federman, A. D. (2018). Health literacy and plan choice: Implications for Medicare managed care. Health Literacy Research and Practice, 2(1), 40–54. 10.3928/24748307-20180201-01
6. Castillo, A., Rivera-Hernandez, M., & Moody, K. A. (2023). A digital divide in the COVID-19 pandemic: Information exchange among older Medicare beneficiaries and stakeholders during the COVID-19 pandemic. BMC Geriatrics, 23(1), 1–10. 10.1186/s12877-022-03674-4
7. Centers for Medicare & Medicaid Services. (2023, September 6). Section 508 and CMS. CMS.gov. https://www.cms.gov/data-research/cms-information-technology/section-508
8. Charness, N., & Boot, W. R. (2022). A grand challenge for psychology: Reducing the age-related digital divide. Current Directions in Psychological Science, 31(2), 187–193. 10.1177/09637214211068144
9. Chen, B. X., Grant, N., & Weise, K. (2023, March 15). How Siri, Alexa and Google Assistant lost the A.I. race. The New York Times. Retrieved January 2, 2024, from https://www.nytimes.com/2023/03/15/technology/siri-alexa-google-assistant-artificial-intelligence.html
10. Chen, L., Zaharia, M., & Zhou, J. (2023). How is ChatGPT’s behavior changing over time? 10.48550/arXiv.2307.0900
11. Cho, E., Molina, M. D., & Wang, J. (2019). The effects of modality, device, and task differences on perceived human likeness of voice-activated virtual assistants. Cyberpsychology, Behavior, and Social Networking, 22(8), 515–520. 10.1089/cyber.2018.0571
12. Czaja, S. J., Sharit, J., & Nair, S. N. (2008). Usability of the Medicare health web site. Journal of the American Medical Association, 300(7), 790–792. 10.1001/jama.300.7.790-b
13. Dastin, J. (2023, November 9). Google wants AI chatbot Bard to help it reach billions of users. Reuters. https://www.reuters.com/technology/reuters-next-google-wants-ai-chatbot-bard-help-it-reach-billions-users-2023-11-09/
14. Demszky, D., Yang, D., Yeager, D. S., Bryan, C. J., Clapper, M., Chandhok, S., Eichstaedt, J. C., Hecht, C., Jamieson, J., Johnson, M., Jones, M., Krettek-Cobb, D., Lai, L., Jones Mitchell, N., Ong, D. C., Dweck, C. S., Gross, J. J., & Pennebaker, J. W. (2023). Using large language models in psychology. Nature Reviews Psychology, 2, 688–701. 10.1038/s44159-023-00241-5
15. Di Lillo, L., Gode, T., Zhou, X., Atzei, M., Chen, R., & Victor, T. (2023). Comparative safety performance of autonomous and human drivers: A real-world case study of the Waymo One service. arXiv, 1–7. 10.48550/arXiv.2309.01206
16. Fanni, S. C., Febi, M., Aghakhanyan, G., & Neri, E. (2023). Natural language processing. In Ranschaert, E. R., Trianni, A., & Klontzas, M. E. (Eds.), Introduction to artificial intelligence (pp. 87–99). Springer International Publishing.
17. Goh, A. S. Y., Wong, L. L., & Yap, K. Y. L. (2021). Evaluation of COVID-19 information provided by digital voice assistants. International Journal of Digital Health, 1(1), 3. 10.29337/ijdh.25
18. Government Accountability Office. (2019). Medicare Plan Finder: Usability problems and incomplete information create challenges for beneficiaries comparing coverage options (pp. 1–20). United States Government Accountability Office.
19. Hoy, M. B. (2018). Alexa, Siri, Cortana, and more: An introduction to voice assistants. Medical Reference Services Quarterly, 37(1), 81–88. 10.1080/02763869.2018.1404391
20. Hu, K. (2023, February 2). ChatGPT sets record for fastest-growing user base—analyst note. Reuters. https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/
21. Kaplan, A. D., Kessler, T. T., Brill, J. C., & Hancock, P. A. (2021). Trust in artificial intelligence: Meta-analytic findings. Human Factors, 65(2), 337–359. 10.1177/00187208211013988
22. Katz, D. M., Bommarito, M. J., Gao, S., & Arredondo, P. (2023). GPT-4 passes the bar exam. SSRN Electronic Journal, 382, 1–35. 10.2139/ssrn.4389233
23. Knight, W. (2023, September 20). Amazon upgrades Alexa for the ChatGPT era. Wired. https://www.wired.com/story/amazon-upgrades-alexa-for-the-chatgpt-era/
24. Kuroiwa, T., Sarcon, A., Ibara, T., Yamada, E., Yamamoto, A., Tsukamoto, K., & Fujita, K. (2023). The potential of ChatGPT as a self-diagnostic tool in common orthopedic diseases: Exploratory study. Journal of Medical Internet Research, 25, e47621. 10.2196/47621
25. Lee, J. D., & See, K. A. (2004). Trust in automation: Designing for appropriate reliance. Human Factors, 46(1), 50–80. 10.1518/hfes.46.1.50_30392
26. Light, L. L., Panicker, S., Abrams, L., & Huh-Yoo, J. (2024). Ethical challenges in the use of digital technologies in psychological science: Introduction to the special issue. American Psychologist, 79(1), 1–8. 10.1037/amp0001286
27. Maier, T., Menold, J., & McComb, C. (2022). The relationship between performance and trust in AI in e-finance. Frontiers in Artificial Intelligence, 5, 891529. 10.3389/frai.2022.891529
28. More, R. (2023, February 11). Google cautions against “hallucinating” chatbots, report says. Reuters. https://www.reuters.com/technology/google-cautions-against-hallucinating-chatbots-report-2023-02-11/
29. Olmstead, K. (2017, December 12). Nearly half of Americans use digital voice assistants, mostly on their smartphones. Pew Research Center. https://www.pewresearch.org/short-reads/2017/12/12/nearly-half-of-americans-use-digital-voice-assistants-mostly-on-their-smartphones/
30. Orrù, G., Piarulli, A., Conversano, C., & Gemignani, A. (2023). Human-like problem-solving abilities in large language models using ChatGPT. Frontiers in Artificial Intelligence, 6, 1199350. 10.3389/frai.2023.1199350
31. Pak, R., Fink, N., Price, M., Bass, B., & Sturre, L. (2012). Decision support aids with anthropomorphic characteristics influence trust and performance in younger and older adults. Ergonomics, 55(9), 1059–1072. 10.1080/00140139.2012.691554
32. Pattanaphanchai, J., O’Hara, K., & Hall, W. (2013). Trustworthiness criteria for supporting users to assess the credibility of web information. Proceedings of the 22nd International Conference on World Wide Web, Rio de Janeiro, Brazil. 10.1145/2487788.2488132
33. Pichai, S., & Hassabis, D. (2023, December 6). Introducing Gemini: Our largest and most capable AI model. The Keyword. https://blog.google/technology/ai/google-gemini-ai/#sundar-note (accessed March 19, 2024).
34. Sbaffi, L., & Rowley, J. (2017). Trust and credibility in web-based health information: A review and agenda for future research. Journal of Medical Internet Research, 19(6), e218. 10.2196/jmir.7579
35. Sharevski, F., Slowinski, A., Jachim, P., & Pieroni, E. (2022). “Hey Alexa, what do you know about the COVID-19 vaccine?”—(Mis)perceptions of mass immunization and voice assistants. Internet of Things, 19, 100566. 10.1016/j.iot.2022.100566
36. Sharit, J., Hernandez, M. A., Nair, S. N., Kuhn, T., & Czaja, S. J. (2011). Health problem solving by older persons using a complex government web site. ACM Transactions on Accessible Computing, 3(3), 1–35. 10.1145/1952383.1952386
37. Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., Payne, P., Seneviratne, M., Gamble, P., Kelly, C., Babiker, A., Schärli, N., Chowdhery, A., Mansfield, P., Demner-Fushman, D., … Natarajan, V. (2023). Large language models encode clinical knowledge. Nature, 620(7972), 172–180. 10.1038/s41586-023-06291-2
38. Sivakumar, H., Hanoch, Y., Barnes, A. J., & Federman, A. D. (2016). Cognition, health literacy, and actual and perceived Medicare knowledge among inner-city Medicare beneficiaries. Journal of Health Communication, 21(suppl 2), 155–163. 10.1080/10810730.2016.1193921
39. Stults, C. D., Baskin, A. S., Bundorf, M. K., & Tai-Seale, M. (2017). Patient experiences in selecting a Medicare Part D prescription drug plan. Journal of Patient Experience, 5(2), 147–152. 10.1177/2374373517739413
40. Stults, C. D., Fattahi, S., Meehan, A., Bundorf, M. K., Chan, A. S., Pun, T., & Tai-Seale, M. (2018). Comparative usability study of a newly created patient-centered tool and Medicare.gov Plan Finder to help Medicare beneficiaries choose prescription drug plans. Journal of Patient Experience, 6(1), 81–86. 10.1177/2374373518778343
41. Tyson, A., Pasquini, G., Spencer, A., & Funk, C. (2023). 60% of Americans would be uncomfortable with provider relying on AI in their own health care. Pew Research Center. https://www.pewresearch.org/science/2023/02/22/60-of-americans-would-be-uncomfortable-with-provider-relying-on-ai-in-their-own-health-care/
42. US Census Bureau. (2023, May 2). Older Americans Month: May 2023. Census.gov. https://www.census.gov/newsroom/stories/older-americans-month.html
43. Vogels, E. A. (2023). A majority of Americans have heard of ChatGPT, but few have tried it themselves. Pew Research Center. https://www.pewresearch.org/short-reads/2023/05/24/a-majority-of-americans-have-heard-of-ChatGPT-but-few-have-tried-it-themselves/
44. Walters, W. H., & Wilder, E. I. (2023). Fabrication and errors in the bibliographic citations generated by ChatGPT. Scientific Reports, 13(1), 14045. 10.1038/s41598-023-41032-5
45. Wang, C., Liu, S., Yang, H., Guo, J., Wu, Y., & Liu, J. (2023). Ethical considerations of using ChatGPT in health care. Journal of Medical Internet Research, 25, e48009. 10.2196/48009
46. Wohr, J. (2023, October 17). Voice assistants: What they are and what they mean for marketing and commerce. EMarketer. https://www.emarketer.com/insights/voice-assistants/
