Abstract
Background:
Large language models (LLMs) driven by artificial intelligence allow people to engage in direct conversations about their health. The accuracy and readability of the answers provided by ChatGPT, the most widely known LLM, about essential tremor (ET), one of the most common movement disorders, have not yet been evaluated.
Methods:
Answers given by ChatGPT to 10 questions about ET were evaluated by 5 professionals and 15 laypeople on a scale from 1 (poor) to 5 (excellent) in terms of clarity, relevance, accuracy (professionals only), comprehensiveness, and overall value of the response. We also calculated the readability of the answers.
Results:
ChatGPT answers received relatively positive evaluations, with median scores ranging between 4 and 5, from both groups and independently of the type of question. However, agreement between raters was only moderate, especially in the group of professionals. Moreover, readability levels were poor for all examined answers.
Discussion:
ChatGPT provided relatively accurate and relevant answers, with some variability in the judgements of the professionals, suggesting that the degree of literacy about ET may have influenced the ratings and, indirectly, that the quality of information provided in clinical practice is also variable. Moreover, the readability of the answers provided by ChatGPT was found to be poor. LLMs will likely play a significant role in the future; therefore, health-related content generated by these tools should be monitored.
Keywords: Essential tremor, Movement disorders, Large language model, Artificial intelligence, ChatGPT
Introduction
Essential Tremor (ET) is defined as a syndrome of action tremor of the upper limbs, of at least 3 years' duration, which can further involve other body regions, in the absence of additional overt neurological signs [1,2,3]. ET is one of the most frequent movement disorders, with an overall prevalence estimate of about 1% that significantly increases with age, affecting more than 6% of people aged 65 years or older [4]. Despite its high prevalence, ET remains frequently undiagnosed or misdiagnosed [5], and its public recognition is low. For instance, in a survey of both neurological patients and caregivers attending general neurology, vascular, or movement disorders clinics, only about 10–30% of respondents reported awareness of the condition [6]. Other studies have demonstrated that ET is frequently misdiagnosed, mainly owing to symptom heterogeneity and ambiguity [5,6,7]. Moreover, a survey including 2864 respondents showed that clinical care of ET is provided by the family physician in up to 26% of cases and that only 19% of patients had seen a movement disorder specialist [8], which might suggest that the information about the condition received by patients could be of variable quality and accuracy.
The challenges associated with the diagnosis of ET, its poor recognition, and the possible variability of the information received by patients in different clinical settings might theoretically be among the reasons driving online health-related information-seeking behaviors. In a previous study on movement disorders, we in fact demonstrated that most of the queries on Google were related to definitions, causes and symptoms of different conditions, possibly conducted to aid self-diagnosis [9]. Furthermore, another infodemiologic study carried out in the field of movement disorders showed that the term “tremor” was the most queried on Google, with the relative search volume of “Essential Tremor” being 4 times higher than that of “Tourette’s syndrome” [10].
The practice of using the internet to access real-time information on health-related issues is increasing pervasively worldwide, but studies have shown the inherent risks associated with accessing poor-quality, incorrect and/or difficult-to-read information [11,12], which might in turn lead to negative health behaviors and outcomes [13]. The internet and social media are helpful tools to increase awareness about a particular condition, but there is also the risk of perpetuating myths and disinformation [14].
ChatGPT, introduced by OpenAI in November 2022 [15], is one of the available large language model (LLM) tools, i.e., computational artificial intelligence (AI) systems specifically trained to process and generate text. It enables users to engage in human-like conversations, provides detailed responses on any topic, including healthcare, and is increasingly replacing common search engines. It took only 5 days for ChatGPT’s user base to reach one million following the launch of GPT-3.5 in November 2022 and, according to the latest available data, ChatGPT currently has around 180.5 million users, with about 13 million unique visitors per day in January 2024, more than double the level of December 2023 [16]. The potential value of LLMs has already been shown in several healthcare areas. For instance, in 2023 Singhal and colleagues described Med-PaLM, an LLM designed to provide high-quality and accurate answers to medical questions, which demonstrated an accuracy of 67% on a dataset consisting of US Medical Licensing Exam-style questions, surpassing the prior state of the art by more than 17% [17]. Moreover, Lim and colleagues showed that LLM chatbots, including ChatGPT and Google Bard, provided comprehensive responses to myopia-related health concerns [18], while Goodman and colleagues showed that GPT-3.5 and GPT-4 were largely accurate in addressing complex medical queries across 17 different medical specialties [19].
However, LLMs’ knowledge derives from a broad range of information, including books, articles, websites, and other written material available online, some of which is likely to be inaccurate or, at least, difficult for laypeople to understand, as shown in a recent work on cardiopulmonary resuscitation [20]. Therefore, also in view of the likely wider use of LLM tools in the near future, we aimed to analyze the accuracy, relevance, comprehensiveness and overall perceived value of the answers provided by ChatGPT to a list of frequently asked questions (FAQs) about ET. We selected ChatGPT over other LLMs because of its popularity, public accessibility, convenient usability, and human-like output. The latter is achieved through an incorporated reward model based on human feedback, known as reinforcement learning from human feedback, which results in more credible output than that of other LLMs [21,22].
Methods
FAQs about ET were first produced by one of the authors (RE) who, based on his clinical experience with ET, developed a list of the 25 most frequently received questions, revolving around diagnosis, clinical aspects of ET, differential diagnosis, therapeutic options (including alternative treatments) and progression of the condition. This list was proposed to three ET patients [2 males and 1 female, aged 67, 59 and 63 years, with a disease duration of 41, 35, and 39 years, respectively, and with a phenotype of “pure” ET with tremor in the upper limbs only, of mild to moderate severity (TETRAS performance sub-scale score of 24, 14, and 18, respectively)], who were asked to independently rank them according to their perceived importance. Rankings were subsequently averaged [i.e., (rank position × number of responses for each rank position)/total number of responses] and the top 10 questions were used for the current study. They were then ordered according to the type of question (i.e., the first 2 questions about the condition in general, followed by 4 about its progression and 4 about possible treatments).
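As a rough illustration of this ranking step, the sketch below (Python; the question labels and rank values are hypothetical placeholders, not the actual patient rankings) averages the rank positions assigned by the three patients and keeps the 10 questions with the best average rank:

```python
# Minimal sketch of the question-ranking step; question labels and ranks are hypothetical.
import numpy as np

questions = [f"Q{i + 1}" for i in range(25)]  # the 25 candidate FAQs

# ranks[i, j] = rank position given to question i by patient j (1 = most important)
rng = np.random.default_rng(0)
ranks = np.array([rng.permutation(25) + 1 for _ in range(3)]).T  # shape (25, 3)

# Mean rank across the 3 patients; with one rank per patient this is equivalent
# to the weighted-average formula described in the text (lower = more important).
avg_rank = ranks.mean(axis=1)

# Keep the 10 questions with the lowest (best) average rank
top10 = [questions[i] for i in np.argsort(avg_rank)[:10]]
print(top10)
```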
On December 20th, 2023, we asked ChatGPT (using the free version 3.5) to provide answers using each question as input. The questions were entered sequentially in a single interaction with ChatGPT (i.e., in a single browsing session), the model was not primed with any other question or statement, and no question was repeated.
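The study itself used the free ChatGPT-3.5 web interface. Purely as a hypothetical illustration of the single-session, sequential-question protocol, a reproduction through the OpenAI Python API might look like the sketch below (the model name "gpt-3.5-turbo" and the truncated question list are assumptions, not what was actually run):

```python
# Hypothetical sketch of querying the model programmatically; the study used the web interface.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
questions = [
    "What is essential tremor?",
    "Is essential tremor like Parkinson's disease?",
    # ... remaining FAQs
]

history = []   # keep all turns in one conversation, mirroring a single chat session
answers = {}
for q in questions:
    history.append({"role": "user", "content": q})
    resp = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    answers[q] = answer
```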
The generated answers were then evaluated by two groups: 5 professionals (group 1) and 15 laypeople (group 2), the latter being people who had never been formally exposed to information about ET, recruited among friends and family members of the professionals who participated in the study.
Group 1 included one neurology resident (CG), three junior neurologists (CS, VC, MR) and one early career faculty (RE), all participating in clinical and research activities about ET.
Participants were asked to evaluate each ChatGPT-generated answer with a score ranging from 1 to 5 (1 = poor; 2 = not satisfactory; 3 = adequate; 4 = good; 5 = excellent) according to 5 parameters: clarity (How well is the answer written, structured and presented?), relevance (How well does it answer the question that was asked? A relevant answer addresses the question directly), accuracy (only for professionals; How correct is it, i.e., is it factually correct and free from errors?), comprehensiveness (Does the answer include all or nearly all the information you would expect?), and overall value of the response, in line with a previous study on cardiopulmonary resuscitation [20]. All participants but one (RE) were blinded to the fact that the answers were generated using ChatGPT and were told the answers would be used for an informational leaflet for patients. Beyond descriptive statistics, ratings were compared between groups (professionals vs laypeople) using the Mann-Whitney test to explore whether literacy about ET could influence the ratings; they are presented as medians with interquartile ranges (IQR), with p < 0.05 deemed significant. We further calculated the percentage agreement of ratings for each parameter for each question in the two groups, the level of agreement being deemed good when ≥75% of raters gave a score of 4 or 5 for any item [23].
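A minimal sketch of these analyses, using hypothetical example ratings rather than the study data and standard Python tools (NumPy and SciPy), might look like this:

```python
# Minimal sketch of the rating analysis; the rating vectors below are hypothetical examples.
import numpy as np
from scipy.stats import mannwhitneyu

prof = np.array([5, 4, 4, 5, 3])                                  # 5 professionals, one item
lay = np.array([4, 5, 4, 4, 5, 4, 3, 5, 4, 4, 5, 4, 4, 5, 4])     # 15 laypeople, same item

def med_iqr(x):
    """Median and interquartile range, as reported in Table 1."""
    q1, q3 = np.percentile(x, [25, 75])
    return np.median(x), q3 - q1

# Between-group comparison (professionals vs laypeople); p < 0.05 deemed significant
u_stat, p_value = mannwhitneyu(prof, lay)

# Percentage agreement: proportion of raters scoring 4 or 5; "good" when >= 75%
agreement = np.mean(prof >= 4) * 100
good_agreement = agreement >= 75

print(med_iqr(prof), med_iqr(lay), p_value, agreement, good_agreement)
```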
Finally, the readability of the ChatGPT answers was evaluated by means of several metrics, using an online tool (readable.com). Namely, we used the Flesch Reading Ease test [24], which was initially developed for school books and uses two core measures (i.e., word length and sentence length), higher scores indicating that the material is easier to read; the Simple Measure of Gobbledygook (SMOG) index [25], which is widely used to check health-related content and estimates the years of education a person needs to understand a piece of writing based on the number of polysyllables (words of 3 or more syllables) in a fixed number of sentences; the Coleman-Liau Index [26], which, unlike syllable-based readability indices, relies on word length in characters, its output approximating the U.S. grade level thought necessary to comprehend the text; and the FORCAST Grade Level [27], which relies on the number of “easy” one-syllable words in a sample of 100–150 words and provides a value estimating the number of years of education a reader requires to understand the text. We also used two further measures that the online tool provides by combining the aforementioned readability metrics: the “Readability Score”, which uses an A-E rating system (text aimed at the general public should be rated B or better), and the “Reach” metric, which represents the percentage of reached audience among the general literate population.
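For reference, the sketch below implements the published formulas behind these four indices; the syllable counter is a crude heuristic, and readable.com's exact implementation (and hence its scores) may differ:

```python
# Approximate implementations of the readability formulas named above.
# Syllable counting is heuristic; the online tool's exact implementation may differ.
import re

def count_syllables(word):
    # Crude vowel-group heuristic
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n_words = len(words)
    letters = sum(len(w) for w in words)
    syllables = sum(count_syllables(w) for w in words)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    monosyllables = sum(1 for w in words if count_syllables(w) == 1)

    flesch = 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)
    smog = 1.0430 * (polysyllables * 30 / sentences) ** 0.5 + 3.1291
    coleman_liau = 0.0588 * (letters / n_words * 100) - 0.296 * (sentences / n_words * 100) - 15.8
    forcast = 20 - (monosyllables * 150 / n_words) / 10  # scaled to a 150-word sample

    return {"Flesch Reading Ease": flesch, "SMOG": smog,
            "Coleman-Liau": coleman_liau, "FORCAST": forcast}

print(readability("Essential tremor is a common movement disorder. It mainly affects the hands."))
```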
Results
The two groups did not differ in terms of age [32 (6) vs 42 (32) years, professionals vs laypeople, respectively, z = 27.5; p = 0.395] or sex distribution (60% vs 46.7% male, professionals vs laypeople, respectively, χ² = 0.267; p = 0.500); however, education was significantly higher in professionals than in laypeople [19 (3) vs 18 (2), respectively, z = 27.5; p < 0.001].
Overall, all answers were rated as relatively good, with median scores ranging between 4 and 5, by both professionals and laypeople, although with wide IQRs ranging from 1 to 4. Answers received similar scores from both types of raters, independently of the type of question (i.e., about clinical aspects, progression, or therapeutic options). Complete aggregate results of the ratings, presented by type of rater, can be found in Table 1.
Table 1.
Comparison of ratings between healthcare professionals and laypeople for each question.
| Question | Quality/Clarity: Prof. | Quality/Clarity: Lay | p | Relevance: Prof. | Relevance: Lay | p | Accuracy: Prof. | Comprehensiveness: Prof. | Comprehensiveness: Lay | p | Overall judgement: Prof. | Overall judgement: Lay | p |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| What is essential tremor? | 5.0 (2.0) | 4.0 (1.0) | 0.866 | 5.0 (1.0) | 5.0 (1.0) | 0.866 | 5.0 (2.0) | 5.0 (2.0) | 5.0 (1.0) | 0.800 | 4.0 (2.0) | 4.0 (1.0) | 1.000 |
| Is essential tremor like Parkinson’s disease? | 4.0 (2.0) | 4.0 (1.0) | 0.230 | 5.0 (3.0) | 4.0 (1.0) | 0.933 | 4.0 (3.0) | 4.0 (2.0) | 4.0 (1.0) | 0.266 | 4.0 (3.0) | 4.0 (1.0) | 0.497 |
| I received a diagnosis of essential tremor. Will I get worse? | 5.0 (2.0) | 4.0 (1.0) | 0.933 | 5.0 (3.0) | 4.0 (1.0) | 0.800 | 5.0 (3.0) | 4.0 (3.0) | 5.0 (1.0) | 0.266 | 5.0 (2.0) | 5.0 (1.0) | 0.800 |
| Can I do something to slow its progression? | 5.0 (1.0) | 5.0 (1.0) | 0.933 | 5.0 (1.0) | 4.0 (3.0) | 0.197 | 5.0 (3.0) | 4.0 (3.0) | 5.0 (0.0) | 0.119 | 5.0 (3.0) | 5.0 (0.0) | 0.395 |
| I had a diagnosis of essential tremor. Am I at risk of developing Parkinson’s disease? | 4.0 (2.0) | 4.0 (1.0) | 0.672 | 5.0 (3.0) | 4.0 (1.0) | 0.800 | 5.0 (3.0) | 4.0 (4.0) | 4.0 (1.0) | 0.612 | 5.0 (3.0) | 4.0 (1.0) | 0.933 |
| Will I have memory problems? | 5.0 (1.0) | 4.0 (1.0) | 0.349 | 4.0 (2.0) | 4.0 (2.0) | 1.000 | 4.0 (3.0) | 4.0 (3.0) | 4.0 (1.0) | 0.445 | 4.0 (3.0) | 4.0 (2.0) | 0.800 |
| Drugs for essential tremor aren’t working. Can I take something else? | 5.0 (1.0) | 5.0 (1.0) | 0.933 | 5.0 (2.0) | 4.0 (1.0) | 0.800 | 5.0 (3.0) | 5.0 (3.0) | 5.0 (1.0) | 0.612 | 5.0 (3.0) | 5.0 (1.0) | 0.553 |
| Can cannabis be useful for essential tremor? | 5.0 (2.0) | 4.0 (2.0) | 0.445 | 4.0 (3.0) | 4.0 (1.0) | 0.800 | 4.0 (3.0) | 4.0 (3.0) | 4.0 (2.0) | 0.800 | 4.0 (3.0) | 4.0 (2.0) | 0.445 |
| I have heard of ultrasound therapy for essential tremor. Can you tell me more? | 5.0 (1.0) | 4.0 (1.0) | 0.266 | 4.0 (1.0) | 4.0 (1.0) | 0.800 | 4.0 (2.0) | 4.0 (2.0) | 4.0 (1.0) | 0.933 | 5.0 (2.0) | 4.0 (2.0) | 0.800 |
| Does ultrasound therapy have any side effects? | 5.0 (1.0) | 4.0 (1.0) | 0.497 | 5.0 (1.0) | 4.0 (1.0) | 0.349 | 5.0 (2.0) | 4.0 (2.0) | 5.0 (1.0) | 0.445 | 4.0 (2.0) | 4.0 (1.0) | 0.612 |

Values are median (IQR). Prof. = professionals (N = 5); Lay = laypeople (N = 15); p = professionals vs laypeople (Mann-Whitney test). Accuracy was rated by professionals only.
Conversely, the cumulative percentage agreement of ratings ≥4 was variable between the groups. Good agreement (i.e., ≥75%) between professional raters was found for 8/10 answers in terms of clarity, 4/10 in terms of relevance, 1/10 in terms of accuracy, 2/10 in terms of comprehensiveness and 1/10 in terms of overall value of the response (Figure 1). Higher cumulative percentage agreement was found among the laypeople raters, with good agreement being observed for 9/10 answers in terms of clarity, 9/10 in terms of relevance, 9/10 in terms of comprehensiveness and 7/10 in terms of overall value of the response (Figure 1). Detailed percentage agreement for each rating is available in the supplemental tables. Notably, regarding the “accuracy” parameter, which was only assessed by the professionals, 7/10 answers obtained a negative rating (i.e., 2 = not satisfactory) from one or more raters (supplemental tables).
Figure 1.
Aggregate percentage agreement of ratings ≥4 for each parameter for each question in the group of professionals (plain circles) and laypeople (dashed circles). Green circles indicate good agreement (≥75%) and yellow circles moderate agreement (≥60%).
Overall, readability levels were consistently poor for all questions. Details about the readability of the individual answers according to the different metrics are provided in Table 2.
Table 2.
Readability metrics of the answers provided by ChatGPT.
| Question | Flesch Reading Ease | SMOG Index | Coleman-Liau Index | FORCAST Grade Level | Readability Score | Reach |
|---|---|---|---|---|---|---|
| What is essential tremor? | 32.0 | 15.1 | 13.9 | 12.1 | D | 59% |
| Is essential tremor like Parkinson’s disease? | 28.7 | 14.6 | 14.2 | 12.7 | D | 57% |
| I have been diagnosed with essential tremor. Do I carry a higher risk of developing Parkinson’s disease? | 8.1 | 18.5 | 17.2 | 13.2 | E | 25% |
| I have been diagnosed with ET. Will I get worse? | 35.6 | 14.3 | 14.9 | 12.4 | D | 68% |
| Is there anything I can do to slow its progression down? | 31.6 | 14.6 | 16.0 | 12.7 | D | 67% |
| Does ET affect memory? | 13.7 | 17.8 | 17.2 | 13.9 | E | 32% |
| Is cannabis useful for ET? | 20.0 | 17.0 | 15.2 | 12.4 | E | 42% |
| Prescribed drugs for essential tremor are not working. Can I try something else? | 20.2 | 17.3 | 17.9 | 13.0 | E | 43% |
| I heard about ultrasound in essential tremor. Can you tell me more? | 26.3 | 16.6 | 14.8 | 12.6 | E | 51% |
| Does ultrasound therapy have side effects? | 32.6 | 16.0 | 13.7 | 12.1 | D | 57% |
Considering the aggregate metrics (i.e., Readability Score and Reach), 50% of the answers were graded as D and the remaining 50% as E, with the percentage of reached audience among the general literate population ranging between 25% and 68% and with 4/10 answers reaching less than 50% of the potential audience.
Discussion
In this study we evaluated the ability of ChatGPT to answer a list of questions commonly asked by patients and their caregivers about ET. We used the version of ChatGPT that was freely available at the time of the study (i.e., ChatGPT-3.5) to simulate the most likely real-world scenario, and ChatGPT was chosen over other LLMs because of its popularity. It was beyond the scope of this study to systematically assess the capabilities of ChatGPT (or other LLMs) in answering FAQs related to ET. In fact, a given LLM may produce a different answer each time it is asked the same, identically worded question. The order of the questions, as well as “priming” of the model (for instance, by instructing it to “act as a movement disorder specialist”), might significantly affect the output. Furthermore, the output might change over time depending on the sources from which LLMs gather the data used to generate their answers. A systematic assessment of the capabilities of LLMs to answer any given query, with their respective pros and cons, would require multiple iterations, changes in wording, different priming, and so on [28,29]. Therefore, also taking into account that the newly developed version 4 of ChatGPT outperforms earlier models as well as LLMs specifically fine-tuned on medical knowledge such as Med-PaLM [30], our data should be taken as an initial picture of the outcome of a single interaction with ChatGPT, simulating what would likely happen in the real world.
Generated answers were rated by laypeople as well as by professionals, who judged them as relatively factually correct in terms of accuracy, although with some variability. Overall, ChatGPT answers received good median scores in terms of clarity, relevance, comprehensiveness, and overall value of the response from both laypeople and professionals, with no significant difference between the two types of raters. However, we also found only moderate agreement between professionals in terms of relevance, comprehensiveness, accuracy and overall value of the response for most of the answers, whereas good agreement was generally observed among the laypeople. This might be explained by the different literacy about ET between the two types of raters. However, there was variable agreement even within the group of 5 professionals, with most answers (7/10) receiving a negative rating (i.e., 2 = not satisfactory) in terms of accuracy from at least one rater, and with junior raters being more likely to give higher ratings (data not shown). The latter result might therefore indirectly indicate that the information people receive in the real world is also of variable quality, given that ET patients are cared for by clinicians with different backgrounds, including general practitioners [8].
To the best of our knowledge, this is the first study to evaluate the answers of ChatGPT in the field of movement disorders, with a particular focus on ET, and our results add to a growing – although still fairly small – body of research [18,19,20,31,32,33] suggesting that ChatGPT can provide reasonably correct medical information in a human-like manner, being therefore more appreciated than static internet information and offering higher quality and more empathy than the answers provided by verified physicians [34]. However, differently from previous research in other medical disciplines [18,20,31,32,33] or in multiple sclerosis [34], our results somewhat diverge in that there was only moderate agreement between the professional raters. This might ultimately be related to the profound disagreement, even among movement disorder experts, about the nature of ET [35,36,37].
Contrary to previous research [20,31,32], which involved patients/caregivers affected by the specific condition addressed by the questions, we decided to involve laypeople who had never been exposed to any information about ET, thereby avoiding any bias that might have arisen from prior knowledge of the condition. The choice of selecting people never exposed to any information about ET was also driven by our previous results showing that health-related information-seeking behaviors mostly occur to aid self-diagnosis prior to encountering any medical professional [9]. We further excluded potential biases arising from any skepticism related to AI, since participants were blinded to the fact that the answers were generated by an LLM.
On the one hand, the positive ratings, especially those given by the laypeople, suggest that ChatGPT might be a useful tool, potentially improving patient engagement and reducing the workload of healthcare providers. On the other hand, readability, although the metrics we used have intrinsic differences and were not all developed to assess medical content [38], was consistently very poor, with a reach that was at times as low as 25% of the potential audience. The discrepancy between the positive ratings and the poor readability scores can be easily explained by the high level of education of the subjects involved in this study. The latter should therefore be acknowledged as a limitation, considering that, for instance, according to the Flesch Reading Ease score, 6/10 answers obtained a score indicating they were “extremely difficult to read” and only suitable for college graduates.
One might also argue that the list of questions was not comprehensive enough to depict the complexity of the condition, as we only selected the top 10 questions ranked by 3 ET patients about different aspects of the condition. For instance, questions related to first-line pharmacological approaches were not included, which is reasonably due to the fact that our 3 ET patients were already being treated and, therefore, might have been more interested in knowing about alternative options, including surgical treatments. However, although we acknowledge that the FAQs were derived from one provider and only 3 patients rather than a more robust number of each, we would not expect grossly different results simply by increasing the number of providers, patients and/or questions.
Our results show that ChatGPT provides relatively accurate, relevant, and comprehensive, yet poorly readable, answers to questions frequently asked by patients with ET. However, concerns remain around several aspects of this and similar LLMs. Although ChatGPT is trained on massive amounts of text data, including books, articles, and websites, it might not easily deal with disputed areas of knowledge and might not provide correct answers when contradictions are present in the input data, as in the case of ET, where profound disagreement exists even among experts [35,36,37]. Moreover, there have been reports of a tendency of the model to occasionally “hallucinate”, that is, to provide confidently formulated answers with incorrect or nonsensical content [39]. Users should also be aware that ChatGPT and similar models might show bias against individuals based on sex, ethnicity, or disability [40]. This issue is particularly sensitive in the field of ET, where stigma is prevalent and contributes to social dysfunction [41]. An additional concern relates to the potential for data breaches or unauthorized access to protected health information [42]. This issue should be mitigated by specific laws designed to protect the privacy and security of individuals’ health information. Finally, we evaluated ChatGPT answers in Italian and, although this tool can in fact be exploited to produce text in several languages, ultimately reducing linguistic barriers [43], it is unknown whether the value, including the accuracy, of answers provided in different languages is comparable.
Notwithstanding these caveats, the power of text-generating LLM tools is undeniable, and it is likely that they will be pervasively used by patients and their caregivers in the near future. While it is crucial to establish robust monitoring measures for health-related information generated by these systems, movement disorder specialists could cautiously attempt to integrate them into their clinical practice and make use of their great potential [43].
Data Accessibility Statement
Data will be made available upon request to the corresponding author.
Additional File
The additional file for this article can be found as follows:
Detailed percentage agreement of ratings in the two groups of raters.
Acknowledgements
We thank the 3 ET patients who ranked the frequently asked questions.
Competing Interests
The authors have no competing interests to declare.
Author Contributions
1. Conception and design of the study, or acquisition of data, or analysis and interpretation of data;
2. Drafting the article or revising it critically for important intellectual content;
3. Final approval of the version to be submitted.

CS: 1, 2, 3; VC: 1, 2, 3; MR: 1, 2, 3; CG: 1, 2, 3; PB: 1, 2, 3; RE: 1, 2, 3.
References
- 1. Bhatia KP, Bain P, Bajaj N, Elble RJ, Hallett M, Louis ED, Raethjen J, Stamelou M, Testa CM, Deuschl G; Tremor Task Force of the International Parkinson and Movement Disorder Society. Consensus Statement on the classification of tremors. From the task force on tremor of the International Parkinson and Movement Disorder Society. Mov Disord. 2018 Jan; 33(1): 75–87. DOI: 10.1002/mds.27121
- 2. Erro R, Fasano A, Barone P, Bhatia KP. Milestones in Tremor Research: 10 Years Later. Mov Disord Clin Pract. 2022 Feb 26; 9(4): 429–435. DOI: 10.1002/mdc3.13418
- 3. Erro R, Pilotto A, Esposito M, Olivola E, Nicoletti A, Lazzeri G, Magistrelli L, Dallocchio C, Marchese R, Bologna M, Tessitore A, Misceo S, Gigante AF, Terranova C, Moschella V, di Biase L, Di Giacopo R, Morgante F, Valentino F, De Rosa A, Trinchillo A, Malaguti MC, Brusa L, Matinella A, Di Biasio F, Paparella G, De Micco R, Contaldi E, Modugno N, Di Fonzo A, Padovani A, Barone P; TITAN Study Group. The Italian tremor Network (TITAN): rationale, design and preliminary findings. Neurol Sci. 2022 Sep; 43(9): 5369–5376. DOI: 10.1007/s10072-022-06104-w
- 4. Louis ED, McCreary M. How Common is Essential Tremor? Update on the Worldwide Prevalence of Essential Tremor. Tremor Other Hyperkinet Mov (N Y). 2021 Jul 9; 11: 28. DOI: 10.5334/tohm.632
- 5. Amlang CJ, Trujillo Diaz D, Louis ED. Essential tremor as a “waste basket” diagnosis: diagnosing essential tremor remains a challenge. Front Neurol. 2020; 11: 172. DOI: 10.3389/fneur.2020.00172
- 6. Shalaby S, Indes J, Keung B, et al. Public knowledge and attitude toward essential tremor: a questionnaire survey. Front Neurol. 2016; 7: 60. DOI: 10.3389/fneur.2016.00060
- 7. Jain S, Lo SE, Louis ED. Common misdiagnosis of a common neurological disorder: how are we misdiagnosing essential tremor? Arch Neurol. 2006; 63: 1100–1104. DOI: 10.1001/archneur.63.8.1100
- 8. Gupta HV, Pahwa R, Dowell P, Khosla S, Lyons KE. Exploring essential tremor: Results from a large online survey. Clin Park Relat Disord. 2021 Jun 25; 5: 100101. DOI: 10.1016/j.prdoa.2021.100101
- 9. Brigo F, Erro R. Why do people google movement disorders? An infodemiological study of information seeking behaviors. Neurol Sci. 2016 May; 37(5): 781–7. DOI: 10.1007/s10072-016-2501-5
- 10. Pajo AT, Jamora RDG, Espiritu AI. Online Health Information-Seeking Behavior for Movement Disorders: An Infodemiologic Study. DOI: 10.2139/ssrn.4105828
- 11. Brigo F, Erro R. The readability of the English Wikipedia article on Parkinson’s disease. Neurol Sci. 2015 Jun; 36(6): 1045–6. DOI: 10.1007/s10072-015-2077-5
- 12. Daraz L, Morrow AS, Ponce OJ, Beuschel B, Farah MH, Katabi A, Alsawas M, Majzoub AM, Benkhadra R, Seisa MO, Ding JF, Prokop L, Murad MH. Can Patients Trust Online Health Information? A Meta-narrative Systematic Review Addressing the Quality of Health Information on the Internet. J Gen Intern Med. 2019 Sep; 34(9): 1884–1891. DOI: 10.1007/s11606-019-05109-0
- 13. Murray E, Lo B, Pollack L, Donelan K, Catania J, Lee K, Zapert K, Turner R. The impact of health information on the internet on health care and the physician–patient relationship: national US survey among 1.050 US physicians. J Med Internet Res. 2003; 5: e17. DOI: 10.2196/jmir.5.3.e17
- 14. Trethewey SP. ‘Cough CPR’: Misinformation perpetuated by social media. Resuscitation. 2018 Dec; 133: e7–e8. DOI: 10.1016/j.resuscitation.2018.10.003
- 15. OpenAI. ChatGPT: Optimizing language models for dialogue. OpenAI; 2022. https://openai.com/blog/chatgpt/ (accessed 3 January 2023).
- 16. https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/
- 17. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023; 620: 172–180. DOI: 10.1038/s41586-023-06291-2
- 18. Lim ZW, Pushpanathan K, Yew SME, Lai Y, Sun CH, Lam JSH, Chen DZ, Goh JHL, Tan MCJ, Sheng B, Cheng CY, Koh VTC, Tham YC. Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine. 2023 Sep; 95: 104770. DOI: 10.1016/j.ebiom.2023.104770
- 19. Goodman RS, Patrinely JR, Stone CA, et al. Accuracy and Reliability of Chatbot Responses to Physician Questions. JAMA Netw Open. 2023; 6(10): e2336483. DOI: 10.1001/jamanetworkopen.2023.36483
- 20. Scquizzato T, Semeraro F, Swindell P, Simpson R, Angelini M, Gazzato A, Sajjad U, Bignami EG, Landoni G, Keeble TR, Mion M. Testing ChatGPT ability to answer laypeople questions about cardiac arrest and cardiopulmonary resuscitation. Resuscitation. 2024 Jan; 194: 110077. DOI: 10.1016/j.resuscitation.2023.110077
- 21. Stiennon N, et al. Learning to summarize from human feedback. In: Proc. 34th International Conference on Neural Information Processing Systems, 3008–3021. Curran Associates Inc.; 2020.
- 22. Gao L, Schulman J, Hilton J. Scaling laws for reward model overoptimization. PMLR. 2023; 202: 10835–10866.
- 23. DeVellis RF. Inter-Rater Reliability. In: Kempf-Leonard K (ed.), Encyclopedia of Social Measurement. Elsevier; 2005. DOI: 10.1016/B0-12-369398-5/00095-5
- 24. Flesch R. A new readability yardstick. J Appl Psychol. 1948 Jun; 32(3): 221–33. DOI: 10.1037/h0057532
- 25. Mc Laughlin GH. SMOG grading – a new readability formula. Journal of Reading. 1969; 12(8): 639–646.
- 26. Coleman M, Liau TL. A computer readability formula designed for machine scoring. Journal of Applied Psychology. 1975; 60(2): 283. DOI: 10.1037/h0076540
- 27. Caylor JS, Sticht TG, Fox LC, Ford JP. Methodologies for Determining Reading Requirements of Military Occupational Specialties. Alexandria, VA: Human Resources Research Organization; 1973.
- 28. Clusmann J, Kolbinger FR, Muti HS, et al. The future landscape of large language models in medicine. Commun Med. 2023; 3: 141. DOI: 10.1038/s43856-023-00370-1
- 29. Jahan I, Laskar MTR, Peng C, Huang JX. A comprehensive evaluation of large language models on benchmark biomedical text processing tasks. Comput Biol Med. 2024 Mar; 171: 108189. DOI: 10.1016/j.compbiomed.2024.108189
- 30. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on medical challenge problems. Preprint at arXiv; 2023. DOI: 10.48550/arXiv.2303.13375
- 31. Walker HL, Ghani S, Kuemmerli C, Nebiker CA, Müller BP, Raptis DA, et al. Reliability of medical information provided by ChatGPT: Assessment against clinical guidelines and patient information quality instrument. J Med Internet Res. 2023; 25: e47479. DOI: 10.2196/47479
- 32. Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023; 183: 589–96. DOI: 10.1001/jamainternmed.2023.1838
- 33. Rahsepar AA, Tavakoli N, Kim GHJ, Hassani C, Abtin F, Bedayat A. How AI Responds to Common Lung Cancer Questions: ChatGPT vs Google Bard. Radiology. 2023 Jun; 307(5): e230922. DOI: 10.1148/radiol.230922
- 34. Maida E, Moccia M, Palladino R, Borriello G, Affinito G, Clerico M, Repice AM, Di Sapio A, Iodice R, Spiezia AL, Sparaco M, Miele G, Bile F, Scandurra C, Ferraro D, Stromillo ML, Docimo R, De Martino A, Mancinelli L, Abbadessa G, Smolik K, Lorusso L, Leone M, Leveraro E, Lauro F, Trojsi F, Streito LM, Gabriele F, Marinelli F, Ianniello A, De Santis F, Foschi M, De Stefano N, Morra VB, Bisecco A, Coghe G, Cocco E, Romoli M, Corea F, Leocani L, Frau J, Sacco S, Inglese M, Carotenuto A, Lanzillo R, Padovani A, Triassi M, Bonavita S, Lavorgna L; Digital Technologies, Web, Social Media Study Group of the Italian Society of Neurology (SIN). ChatGPT vs. neurologists: a cross-sectional study investigating preference, satisfaction ratings and perceived empathy in responses among people living with multiple sclerosis. J Neurol. 2024 Apr 3 [Epub ahead of print]. DOI: 10.1007/s00415-024-12328-x
- 35. Espay AJ, Lang AE, Erro R, Merola A, Fasano A, Berardelli A, Bhatia KP. Essential pitfalls in “essential” tremor. Mov Disord. 2017 Mar; 32(3): 325–331. DOI: 10.1002/mds.26919
- 36. Lenka A, Louis ED. Do We Belittle Essential Tremor by Calling It a Syndrome Rather Than a Disease? Yes. Front Neurol. 2020 Oct 15; 11: 522687. DOI: 10.3389/fneur.2020.522687
- 37. Erro R, Picillo M, Pellecchia MT, Barone P. Diagnosis Versus Classification of Essential Tremor: A Research Perspective. J Mov Disord. 2023 May; 16(2): 152–157. DOI: 10.14802/jmd.23020
- 38. Mac O, Ayre J, Bell K, McCaffery K, Muscat DM. Comparison of Readability Scores for Written Health Information Across Formulas Using Automated vs Manual Measures. JAMA Netw Open. 2022; 5(12): e2246051. DOI: 10.1001/jamanetworkopen.2022.46051
- 39. Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of hallucination in natural language generation. ACM Comput Surv. 2022; 55(12): 1–38. DOI: 10.1145/3571730
- 40. Venkit PN, Srinath M, Wilson S. A study of implicit bias in pretrained language models against people with disabilities. In: Proceedings of the 29th International Conference on Computational Linguistics. Gyeongju, Republic of Korea: International Committee on Computational Linguistics; 2022. p. 1324–32. Available from: https://aclanthology.org/2022.coling-1.113
- 41. O’Suilleabhain P, Berry DS, Lundervold DA, Turner TH, Tovar M, Louis ED. Stigma and Social Avoidance in Adults with Essential Tremor. Mov Disord Clin Pract. 2023 Jun 21; 10(9): 1317–1323. DOI: 10.1002/mdc3.13774
- 42. Masters K. Ethical use of artificial intelligence in health professions education: AMEE Guide No. 158. Med Teach. 2023 Mar 13 [Epub ahead of print]. DOI: 10.1080/0142159X.2023.2186203
- 43. Deik A. Potential Benefits and Perils of Incorporating ChatGPT to the Movement Disorders Clinic. J Mov Disord. 2023 May; 16(2): 158–162. DOI: 10.14802/jmd.23072