Abstract
INTRODUCTION
We examine how non‐expert humans (young adults unfamiliar with dementia) and large language models (LLMs) perceive dementia in transcribed texts—recognizing signs that may indicate cognitive decline. Human perception is important, as it is often the driver for seeking medical evaluation. LLM perception is equally interesting, given these models' potential as screening tools.
METHODS
Humans and LLMs intuitively judged whether transcribed picture descriptions came from dementia patients or healthy controls. We represented texts using high‐level, expert‐guided features and used logistic regression to model perceptions and analyze coefficients.
RESULTS
Human judgments are inconsistent, relying on a narrow and sometimes misleading set of cues. LLMs use a richer, more clinically aligned feature set. Both groups show a tendency toward false negatives.
DISCUSSION
This work highlights the need to educate humans and LLMs to recognize a broader range of dementia‐related linguistic signals. It also underscores the value of interpretability in dementia research.
Highlights
Explainable artificial intelligence (AI) uncovers linguistic cues that humans and large language models (LLMs) associate with dementia.
LLMs allow scalable extraction of expert‐defined features on picture descriptions.
LLMs use broader cues than humans to detect dementia and better align with diagnoses.
Humans and LLMs exhibit false negatives; LLMs view fluency as cognitive health.
Understanding non‐expert perceptions can guide education and improve early awareness.
Keywords: Alzheimer's disease, artificial intelligence, cookie theft, dementia, large language models, natural language processing, societal perception
1. BACKGROUND
Subtle language dysfunctions have long been recognized as early signs of cognitive decline, making linguistic signals valuable for early dementia detection. 1 , 2 In turn, hundreds of studies have experimented with natural language processing (NLP) methods to improve and accelerate the detection process, 3 primarily using transcribed cognitive assessments obtained in a clinical setting.
In reality, however, the initial “detection” of dementia symptoms rarely begins with a clinician. More often, individuals themselves or their close environment first notice signs of cognitive decline and initiate medical evaluation. 4 , 5 This highlights the importance of understanding how dementia is perceived by non‐experts, that is, those who are not yet patients, caregivers, or clinicians. Identifying which linguistic behaviors are perceived as related to dementia can reveal where public intuition aligns with clinical insight and where it falls short and can benefit from further education.
Today, humans are not the only ones who can track linguistic changes and raise concerns. Adults over 55, an age group at increased risk for cognitive decline, 6 regularly use large language models (LLMs). 7 Given LLMs’ extensive world knowledge and access to user input over time, one can imagine LLMs flagging subtle linguistic shifts as signs of cognitive impairment. Therefore, understanding which cues drive LLMs’ perceptions is essential.
So far, most NLP research on dementia focuses on predicting clinical diagnoses, that is, labeling a speaker as “healthy” or “cognitively impaired” based on their speech. 3 In our study, however, we do not aim to replicate clinical diagnoses, but to explore the reasoning of non‐experts: the individuals who often serve as informal “screeners”. In general, the question of how dementia is perceived is rarely studied through the lens of NLP, and when it is, the focus is typically on stigma in social media discourse. 8 , 9
To the best of our knowledge, this is the first work to examine how non‐experts and LLMs perceive dementia through language: which cues lead them to perceive someone as cognitively impaired, based on a spontaneous speech task routinely used in clinical practice. We frame LLMs not merely as technological tools, but as emerging stakeholders in the dementia landscape: bystanders who interact daily with at‐risk populations. We also prioritize inherent interpretability in a domain dominated by “black‐box” solutions.
Throughout this work, we use “humans” to refer to non‐expert humans and “clinicians” or “clinical diagnosis” to refer to expert medical judgment. We presented 514 Cookie Theft picture descriptions from the Pitt corpus 10 to 27 human annotators and three LLMs (LLaMA 3, 11 GPT‐4o, 12 and Gemini 1.5 Pro 13 ), asking them to intuitively judge whether each text was produced by someone healthy or cognitively impaired.
To analyze these collected perceptions, we propose a four‐step explainable method, inspired by studies such as Badian et al. 14 and Lissak et al. 15 : (1) design intuitive, human‐centered features in consultation with domain experts; (2) extract these features using an LLM as an annotator, with quality control; (3) train inherently interpretable logistic regression models to predict perceptions and clinical diagnosis; and (4) analyze coefficients to identify the linguistic cues influencing dementia perceptions. Figure 1 outlines our full methodology.
FIGURE 1. Illustration of the end‐to‐end methodological process.
We hope this study lays a foundation for future research on dementia from the perspective of all stakeholders, including non‐experts and society as a whole. By shedding light on the linguistic cues that non‐experts and LLMs rely on, we believe our findings can contribute to broader public awareness and support earlier detection. Finally, we aspire to encourage the use of LLMs for advancing dementia detection while preserving the essential principle of interpretability.
2. METHODS
2.1. Preprocessing
RESEARCH IN CONTEXT
Systematic review: We reviewed literature from medical and technological sources (e.g., PubMed, ACL Anthology). Prior natural language processing (NLP) work on dementia focused on classifying clinical transcripts using computational features and black‐box models lacking interpretability. Research on dementia perception is scarce, with minimal work on how non‐experts interpret patient language.
Interpretation: Our study is the first to systematically model and compare linguistic cues shaping dementia perception in non‐expert humans and large language models (LLMs). Using expert‐defined, human‐intuitive features, we reveal differences in the breadth and consistency of cues, providing insights for public education and model design.
Future directions: We present a methodology for studying dementia perception, which could later extend to how experienced individuals (e.g., caregivers) judge language relative to clinical diagnoses and help improve non‐expert and LLM perceptions. From an ecological validity standpoint, dementia perception should also be studied through multimodal, longitudinal input reflecting real‐life interactions and familiarity with a person's baseline communication before decline.
We use the Pitt corpus, 10 which includes data from healthy individuals and dementia patients across varying stages and types. The corpus provides participant demographics, standardized cognitive scores, and transcribed recordings of linguistic tasks such as the Cookie Theft picture description, 16 on which we focus in our study. In this task, participants are shown an image (Figure A.1) and asked to describe what they see, with their responses recorded, transcribed, and analyzed. These descriptions help clinicians form a general, holistic impression of an individual's information processing, linguistic performance, motor speech function, and communicative ability.
We use 514 Cookie Theft picture descriptions and their corresponding diagnoses, binarized into “Healthy” and “Dementia.” Patients with MCI (mild cognitive impairment) are included under the dementia category to simplify the experiment for the non‐experts. We clean the transcripts of interviewers’ comments and extraneous characters. We use only the raw text, excluding any demographic or clinical metadata, to reduce potential bias (e.g., assumptions based on the speaker's age) and to comply with the dataset's guidelines.
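To make this step concrete, below is a minimal cleaning sketch in Python. It assumes CHAT‐formatted transcripts as distributed in DementiaBank (*PAR: marking participant turns, *INV: interviewer turns); the regular expressions are illustrative assumptions, not the study's actual pipeline, since which markup symbols to strip depends on the corpus version.

```python
import re

def clean_transcript(raw: str) -> str:
    """Keep participant speech; drop interviewer turns and CHAT markup.

    Sketch only: assumes DementiaBank CHAT conventions (*PAR:/*INV:).
    """
    utterances = []
    for line in raw.splitlines():
        if not line.startswith("*PAR:"):
            continue  # skip interviewer turns, %-tiers, and file headers
        text = line[len("*PAR:"):]
        text = re.sub(r"\[[^\]]*\]", " ", text)  # bracketed codes, e.g., [//]
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            utterances.append(text)
    return " ".join(utterances)

example = "*INV: tell me what you see.\n*PAR: the boy [//] the boy is on a stool."
print(clean_transcript(example))  # -> "the boy the boy is on a stool."
```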
3. PERCEPTION ANNOTATIONS
3.1. Human perception
We recruited 27 non‐expert annotators to read the preprocessed picture descriptions and intuitively judge each as “Healthy” or “Dementia.” See full participant demographics in Table B.1.1. Recruitment and consent procedures are described in Appendix B.2. Annotators received no prior instructions regarding which cues to consider when making their decisions (see Appendix C.1 for annotation guidelines). We focused on young non‐expert adults with little to no knowledge of the disease, limited at most to occasional encounters with a cognitively impaired relative. We deliberately chose not to include expert populations, such as caregivers or clinicians, as (a) expert and non‐expert groups are likely to produce significantly different perception signals, and (b) young adults are a particularly relevant group for studying dementia perception, as they are increasingly likely to observe signs of cognitive decline in older family members. Understanding what they perceive, and eventually helping them recognize those signs, is of high public health value.
Each description was labeled by 10 annotators, and the majority vote was used as the final perception label. No ties occurred in any of the samples. Annotations show a relatively low inter‐annotator agreement (Fleiss’ κ = 0.28), which is expected given the inherent subjectivity of the task. 17 After completing the task, annotators were asked to describe any cues they noticed that may have influenced their decision.
3.2. LLM perception
GPT‐4o, LLaMA 3, and Gemini‐1.5‐Pro were provided with the same transcripts and asked to provide a similar intuitive judgment (full prompt in Appendix C.2). Each model labeled the entire dataset, and we used their majority vote in our analysis. LLMs show stronger inter‐annotator agreement than humans (Fleiss’ κ = 0.465), perhaps due to their shared training data and structure.
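For readers who wish to reproduce these agreement statistics, the sketch below computes majority votes and Fleiss’ κ with statsmodels. The rating matrix is toy data standing in for the study's annotations (10 human raters per description; three LLMs over the full dataset).

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy ratings: one row per description, one column per annotator,
# 0 = "Healthy", 1 = "Dementia".
ratings = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
])

counts, _ = aggregate_raters(ratings)  # item-by-category count table
print(f"Fleiss' kappa = {fleiss_kappa(counts):.3f}")

majority = (ratings.mean(axis=1) > 0.5).astype(int)  # final perception labels
print(f"majority votes: {majority}")
```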
3.3. Feature design and annotation
In collaboration with a neurologist and a neuropsychologist, we defined 38 binary features capturing informative aspects of the picture descriptions in an intuitive manner. All features are anchored in dementia research (see Table D.1 for features, examples, and sources 18 , 19 , 20 , 21 , 22 , 23 , 24 , 25 , 26 , 27 , 28 , 29 , 30 , 31 , 32 , 33 , 34 , 35 , 36 , 37 , 38 , 39 , 40 ). For example, to represent the noun‐to‐verb ratio, 41 we asked: “Did the speaker focus on actions over objects?” This framing makes the feature easier to interpret and potentially adopt as a guideline. The features span five categories (Figure 2) aligned with cognitive processes involved in picture description: what is directly observed (Objective Interpretation); what is inferred or assumed beyond what is seen (Subjective Interpretation); how these observations are linguistically expressed (Linguistic); and the emotional or experiential states and situational dynamics expressed throughout (Human Experience and Interview Context). Grouping features into broader categories enables higher‐level analysis and more generalizable insights. 42
FIGURE 2. Feature categories, definitions, and examples.
Traditionally, extracting such high‐level features would require extensive feature engineering and custom algorithmic logic for each individual feature, posing a significant scalability challenge. Manual annotation is similarly impractical, with over 19,000 annotations needed (38 features × 514 descriptions). To address this, we build on prior work such as He et al. 43 and use LLMs as annotators, leveraging their scalability while maintaining interpretability. We prompted LLaMA 3, GPT‐4o, and Gemini‐1.5‐Pro to read each picture description and assign Yes/No values to all features. All features and prompts are described in Appendix D. Five features appeared in fewer than 5% of samples: vision difficulties, introduction, naming characters, empathy, and irritability. These were removed from the dataset and excluded from analyses (see Figure D.1 for feature value distribution).
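The exact prompts appear in Appendix D; as a hypothetical minimal version of this annotation loop, one could query GPT‐4o through the OpenAI client as below. The feature questions shown are paraphrases for illustration, not the study's wording.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical phrasings; the study's exact prompts are in its Appendix D.
FEATURE_QUESTIONS = {
    "disfluencies": "Does the speaker produce disfluencies (fillers, restarts)?",
    "actions_over_objects": "Did the speaker focus on actions over objects?",
}

def annotate(description: str) -> dict:
    """Assign a Yes/No (1/0) value to each feature for one description."""
    values = {}
    for name, question in FEATURE_QUESTIONS.items():
        response = client.chat.completions.create(
            model="gpt-4o",
            temperature=0,
            messages=[
                {"role": "system", "content": "Answer strictly with Yes or No."},
                {"role": "user",
                 "content": f"Picture description:\n{description}\n\n{question}"},
            ],
        )
        answer = response.choices[0].message.content.strip().lower()
        values[name] = int(answer.startswith("yes"))
    return values
```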
To ensure the validity of LLM feature extraction, we ran a blind annotation study: three human annotators independently labeled a subset of the corpus (30 descriptions × 38 features, totaling N = 1140 values per annotator). Inter‐annotator agreement between human and LLM annotators was moderate (Fleiss’ κ = 0.579). We then applied the Alternative Annotator Test, 44 a statistical method for assessing the chance that LLM annotations are as good as human annotations. GPT‐4o performed best, scoring 90%, outperforming Gemini‐1.5‐Pro (83%) and Llama‐3.1 (70%). Notably, it passed the test with a conservative threshold (ε = 0.1), limiting the acceptable disagreement between LLM and human annotations, as recommended by Calderon et al. 44 Further validation methods, including prompt sensitivity analysis and a per‐feature agreement analysis between GPT‐4o and the human annotators in the blind study, are presented in Appendix E. Given these statistical justifications, we used GPT‐4o to label all 38 binary features across our 514 descriptions, a total of 19,532 annotated values.
3.4. Perception modeling
We train three logistic regression models to predict LLM perception, human perception, and clinical diagnosis. We chose logistic regression because it is inherently interpretable, and we apply stepwise selection to eliminate weak predictors. While these are predictive models by definition, our end goal is not to optimize predictive power but rather to provide insights, through coefficient analysis, into what the models captured. We therefore train on the entire dataset and evaluate model fit using McFadden's R², 45 a standard goodness‐of‐fit metric quantifying how well the predictors explain the outcome relative to a null model. 46 For reference, values between 0.2 and 0.4 are considered strong, comparable to R² values of 0.6–0.8 in linear regression. 47 To validate the robustness of our findings, rule out overfitting, and assess predictive power, we conducted a complementary analysis using 20‐fold cross‐validation, with performance metrics, receiver operating characteristic (ROC) curves, and calibration metrics presented in Appendix F (Table F.1 and Figure F.1).
After training, we examine the coefficients of features with statistically significant effects. For example, the feature disfluencies has a coefficient of β = 2.16 in the LLM perception model. This means that the log‐odds of the model assigning a “Dementia” label increase by 2.16 when disfluencies are present, holding all else constant. Converting this to an odds ratio, the model is approximately 9 times more likely (e^2.16 ≈ 8.67) to assign a “Dementia” label than a “Healthy” one. In contrast, rich vocabulary has β = −1.46 in the LLM perception model, corresponding to an odds ratio of e^−1.46 ≈ 0.23, meaning the model is less likely to assign a “Dementia” label when this feature is present. In general, positive coefficients indicate a shift toward the “Dementia” label (1) and negative toward “Healthy” (0), with larger values reflecting greater impact.
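To make the modeling and interpretation steps concrete, the sketch below fits a logistic regression with statsmodels and reports McFadden's R² and odds ratios. The feature matrix is random placeholder data and stepwise selection is omitted; the study fits the 514 annotated descriptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Placeholder 0/1 features and labels (1 = "Dementia", 0 = "Healthy").
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 2, size=(514, 4)),
                 columns=["disfluencies", "rich_vocabulary",
                          "short_sentences", "non_specific_language"])
y = rng.integers(0, 2, size=514)

model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(f"McFadden's R^2 = {model.prsquared:.3f}")  # 1 - llf / llnull

# Coefficients convert to odds ratios, e.g., beta = 2.16 -> e^2.16 ~ 8.67.
summary = pd.DataFrame({"beta": model.params,
                        "odds_ratio": np.exp(model.params),
                        "p_value": model.pvalues})
print(summary)
```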
4. RESULTS
4.1. Modeling perceptions and diagnosis
Figure 3 shows the coefficients of statistically significant features for human perception, LLM perception, and clinical diagnosis. Detailed coefficients and p‐values are provided in Table G.1.
FIGURE 3. Main results from the logistic regression coefficient analysis and perception disagreements. Left: Statistically significant features associated with clinical diagnosis, LLM perception, and human perception. Colors indicate the source of judgment; bar direction reflects the sign of the logistic regression coefficient (right = dementia, left = healthy). Dotted lines separate feature categories (linguistic, objective interpretation, etc.). Significance levels: * p < 0.05; ** p < 0.01; *** p < 0.001. Top Right: Overlap between clinically diagnosed cases and those perceived as dementia by humans and/or LLMs. Of 283 diagnosed, 98 (green, teal, and purple) were missed by humans, LLMs, or both. Bottom Right: Confusion matrices showing alignment between perceptions and diagnosis. LLM, large language model.
McFadden's R² values indicate a good model fit for clinical diagnosis (0.209) and a very strong fit for LLM dementia perception (0.527), suggesting that our model and features capture a reliable underlying signal. Human perception, however, was harder to model, with McFadden's R² = 0.058. These results were reproduced in the 20‐fold cross‐validation experiment (Appendix F).
Only a small set of straightforward features is significantly associated with human perception: non‐specific language, short sentences, the girl explicitly mentioned, and the mother explicitly mentioned, all of which are significantly associated with clinical diagnosis as well. Additional features significantly linked to clinical diagnosis include actions over objects, other characters mentioned, and weather conditions mentioned. Interestingly, while clinicians associate short sentences with dementia, non‐experts interpret them as a sign of cognitive health.
Coefficient analysis reveals that all three judgment types are significantly associated with linguistic features such as non‐specific language, as well as objective interpretations (e.g., whether the boy, girl, or mother is explicitly mentioned). LLMs rely on a broader range of features and categories than clinical diagnoses, showing greater sensitivity to subjective interpretation cues. This includes the use of Theory of Mind (describing others’ emotions, intentions, or thoughts) or referring to characters not present in the picture. LLMs also place greater emphasis on emotional expression (lightheartedness, self‐limitations, etc.).
4.2. Misperceptions analysis
We examine the misalignments between human and LLM perceptions and clinical diagnoses. Figure 3 presents the corresponding confusion matrices, alongside a Venn diagram illustrating the overlap between clinically diagnosed cases and those perceived as dementia. Among the 283 clinically diagnosed dementia cases, humans correctly identified 57%. LLMs, though more conservative in assigning the dementia label, matched 60%. Some misalignment is expected, as our non‐experts rely solely on picture descriptions, while clinical diagnoses draw on a broader range of signals. Errors by both humans and LLMs followed a similar pattern: 70% of LLM errors were false negatives (i.e., missing clinically diagnosed cases), while 30% were false positives. Humans showed a similar trend, with a 65–35 ratio. As such, both groups exhibit blind spots and room for improvement. When focusing on MCI (42 of the 283 dementia cases), LLMs perform less accurately than humans, identifying only 26% as cognitively impaired compared to 52% by humans. These preliminary findings call for further research on how different types and stages of dementia are perceived.
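The error breakdown above follows directly from the confusion matrices; a short sketch with toy labels shows the computation (variable names are ours).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = "Dementia", 0 = "Healthy".
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])  # clinical diagnosis
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])  # majority-vote perception

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"diagnosed cases caught: {tp / (tp + fn):.2f}")
print(f"false-negative share of errors: {fn / (fn + fp):.2f}")
```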
Next, we examine two key subsets: cases where humans disagreed with both LLMs and clinicians (n = 121), for example, perceiving dementia when both others judged healthy, and cases where LLMs disagreed with both humans and clinicians (n = 88). In this analysis, we go beyond comparing perceptions and diagnoses. Instead, we ask: what cues are so prominent that both clinicians and LLMs capture them, but humans miss? And which cues do clinicians and humans detect, but LLMs overlook? We then train a stepwise logistic regression model on each subset to predict human and LLM misperceptions. As before, positive coefficients indicate a shift toward the “Dementia” label and negative toward “Healthy”—but here, they reflect false dementia and false healthy, respectively.
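In code, these disagreement subsets reduce to two boolean masks over the label vectors; the sketch below uses toy labels and variable names of our own choosing.

```python
import numpy as np

# Toy binary labels (1 = "Dementia") for the three judgment sources.
diagnosis = np.array([1, 1, 0, 0, 1, 0])
human = np.array([0, 1, 1, 0, 1, 0])
llm = np.array([1, 1, 0, 0, 0, 0])

# Humans disagree with both LLMs and clinicians (n = 121 in the study).
human_vs_both = (human != llm) & (human != diagnosis)
# LLMs disagree with both humans and clinicians (n = 88 in the study).
llm_vs_both = (llm != human) & (llm != diagnosis)
```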
The human misperception model shows a strong fit (McFadden's R² = 0.6) and reveals two systematic patterns in human misjudgment: (1) Features used differently in misperceived texts: Non‐specific language received a positive coefficient in the full dataset of 514 samples, indicating that, in general, humans associate non‐specific language with dementia. However, in the smaller misperception subset, this feature received a negative coefficient, indicating a shift toward the “Healthy” label (a false healthy judgment in this context). This suggests that human reliance on this cue is inconsistent: it is typically treated as a sign of dementia but sometimes overlooked or misinterpreted.
(2) Features emerging only in misperceptions: Some features are not significant in the overall human perception model but become significant in misperception cases. Namely, rich vocabulary, actions over objects, outside mentioned, and boy mentioned all show a shift toward false healthy judgments, that is, mistakenly labeling clinically diagnosed dementia cases as healthy. Lightheartedness, on the other hand, becomes significant in the misperception model with a shift toward false dementia, suggesting that humans may sometimes associate this cue with cognitive decline. This indicates that certain features, while not consistently relied upon in the general model, may exert misleading influence in specific cases of misjudgment.
The LLM disagreement analysis revealed quasi‐complete separation: whenever features such as hesitation, reiterating ideas, grammatical inaccuracies, or disfluencies received a value of ‘0’ (i.e., were not present in the text), the LLM majority vote always labeled the text as “Healthy.” This means that LLMs systematically equate fluency with cognitive health, which is not always warranted. Other LLM errors may be linked to features that are rare in the disagreement set, such as sadness, many objects mentioned, mother mentioned, or other characters mentioned. For example, only seven texts expressed sadness, and five mentioned other characters. This pattern suggests that data sparsity may also contribute to LLM errors.
4.3. Self‐reported human rationale
At the end of the task, without having seen our predefined features, annotators were asked whether they had noticed any patterns in the text that helped with their judgment. These retrospective reports, provided by 18 of 27 participants, may not fully capture real‐time reasoning, yet they offer valuable insight into the behaviors that annotators noticed and said they relied on.
Upon manual review, we found that 65% of the self‐reported cues closely align with our predefined feature set, suggesting that our features are indeed naturally noticed by humans. The most frequently cited were reiterating ideas, disfluencies, and improbable interpretations. Notably, all four cues identified as significant for human perception in our coefficient analysis (Figure 3) were also self‐reported by annotators. A full analysis, including details of newly emergent features from participant responses, is provided in Appendix H and Table H.1.
5. DISCUSSION
Our results reveal that non‐expert humans perceive dementia through a narrow and inconsistent lens. This is reflected in the low McFadden's R² score, poor inter‐annotator agreement, and conflicting rationales—both in the features cited and the direction of their presumed influence. Many self‐reported rationales did not align with the statistically significant features identified by the model, suggesting that people may not rely on the cues they believe they do. The limited performance of the human perception model likely stems not only from model constraints but also from the inherent noise and inconsistency in human judgment. One example involves the self‐reported cue level of detail, mentioned by many annotators: while some viewed it as a sign of cognitive impairment, others associated it with cognitive health. Another example is the association of short sentences with healthy descriptions, whereas clinical diagnosis links them to dementia. Unlike humans, LLMs draw on a broad range of cues across four of our five feature categories, aligning more closely with clinical diagnoses. This likely reflects the extensive prior knowledge embedded in their training data, for example, “learning” that Theory of Mind correlates with certain dementia types. 48 These findings highlight the importance of education and background knowledge in helping non‐experts and clinicians recognize a broader and more accurate set of linguistic markers for dementia.
Both humans and LLMs exhibit a tendency toward false negatives, misjudging dementia cases as healthy. However, in the case of LLMs, a clear pattern emerges: when no linguistic difficulties are present, the model is extremely prone to assigning a “healthy” label. Thus, LLMs may be quick to judge a speaker as cognitively healthy if no linguistic dysfunction is apparent, potentially leading to missed signs of early cognitive impairment. This systematic error pattern likely mirrors a broader societal bias: equating fluency with cognitive health, which is not always the case. It also highlights the need for caution when applying LLMs in dementia‐related settings.
Real‐world deployment of LLMs as diagnostic tools would require prospective validation, clinician oversight, and user‐facing disclaimers to ensure responsible use. Specifically, the systematic tendency of LLMs to equate linguistic fluency with cognitive health demands targeted safeguards. For instance, prompts could explicitly remind the model that cognitive impairment may still occur in fluent speakers, particularly in the early stages of dementia. If feasible, fine‐tuning LLMs on transcribed speech from diverse patient populations could improve their ability to capture both fluent and disfluent manifestations of cognitive decline. Furthermore, incorporating a preliminary fluency assessment prior to querying the model could allow dynamic adjustment of sensitivity thresholds, raising sensitivity when fluency is high.
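As a hypothetical illustration of the first safeguard (our wording, not a validated clinical prompt), a judgment prompt might be augmented as follows:

```python
# Illustrative wording only; not the study's prompt.
SAFEGUARDED_PROMPT = """You will read a transcribed picture description.
Judge intuitively whether the speaker sounds Healthy or Cognitively Impaired.
Important: cognitive impairment can be present even when speech is fluent,
grammatical, and free of hesitations, particularly in early-stage dementia.
Do not treat fluency alone as evidence of cognitive health.
Answer with exactly one word: Healthy or Impaired."""
```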
In sum, this study explores how young non‐expert human adults and LLMs perceive dementia in transcribed picture descriptions and how their perceptions align with clinical diagnoses. We present an interpretable method using high‐level, expert‐guided features annotated by GPT‐4o, followed by logistic regression and coefficient analysis. Ultimately, our work highlights: (1) the inherent difficulty of this task, both in perceiving dementia through text alone and in modeling such a complex phenomenon; (2) the importance of educating both humans and LLMs to recognize a broader range of linguistic signals to improve early detection; and (3) the growing opportunity for clinicians to integrate accessible LLM‐based tools into their practice. This study exemplifies the value of NLP in enhancing scalability across both screening tasks and labor‐intensive annotation processes, while preserving interpretability.
6. LIMITATIONS AND ETHICAL CONSIDERATIONS
Demographically, our human annotators came from a very specific pool that does not necessarily represent the general public as a whole. In addition, as non‐native English speakers evaluated transcriptions in English, subtle nuances in language or expression may have influenced human judgments. Future studies should aim to include more diverse populations, spanning broader age groups, cultural backgrounds, linguistic profiles, and levels of familiarity with dementia.
Our annotators judged anonymous, single‐text descriptions, unlike real‐life scenarios where impressions are shaped through repeated personal interactions. However, this is intentional—focusing on individuals who do not have sustained exposure to people with dementia helps capture perceptions in inexperienced audiences. We assume participants relied on generalized expectations of how a “healthy” person sounds, potentially mirroring stereotypes or media portrayals.
Furthermore, we chose to binarize the samples into Healthy and Dementia classes, the latter containing all MCI samples as well. As such, we did not model the perception of MCI, a decision based on two factors: (1) the Pitt corpus contains only a small subset of MCI samples, and (2) the added complexity of asking non‐experts to distinguish not only between Healthy and Dementia but also MCI, a condition that is difficult to diagnose even for trained clinicians. Future work could extend our approach to dedicated studies of MCI perception.
Additionally, logistic regression may be less effective at capturing complex feature interactions due to its linearity assumption. However, this choice aligns with the explainable AI literature, where simple linear models are consistently preferred in user studies 49 for their transparent coefficient structures, which provide meaningful, directional insights into how individual features relate to perception labels. Future work should explore more expressive models capturing non‐linear effects while preserving explainability.
Feature extraction via LLMs may be sensitive to prompt phrasing—a concern we addressed through statistical validation methods and robustness evaluation (Appendices E, F). We also note that LLMs are likely to reflect societal and cultural biases rather than clinical reasoning. 50 By framing their outputs as perceptions rather than detection, we mitigate this concern: LLM judgments reflect learned associations, not clinical truths.
We note that the Pitt corpus, used in this study, is available upon request. We used only anonymized transcripts without personally identifiable information (PII), in accordance with ethical standards.
Finally, we emphasize that non‐experts are neither equipped nor expected to diagnose dementia. When loved ones raise concerns, their suspicions, however well‐meaning, can lead to stigma, emotional distress, or erosion of trust. For the person being judged, this may trigger isolation or anxiety; for the observer, it may result in guilt if early signs are missed. These complex dynamics underscore the need for professional involvement. Diagnosing dementia and delivering difficult news should remain the role of trained clinicians, who can do so with care and provide appropriate follow‐up. While laypeople can help prompt medical evaluation, the burden of detection should not fall on them, and missing early signs is never a personal failure.
CONFLICT OF INTEREST STATEMENT
We confirm that this manuscript is original and not under consideration elsewhere. A separate version of this study has been accepted to EMNLP Findings 2025, which presents the work from a technical, algorithmic perspective for the NLP community. The present manuscript is reframed for a clinical audience, with emphasis on methodological innovation, ethical considerations, and the potential of explainable AI for early dementia detection and screening. We also disclose that two of the authors, Prof. Roi Reichart and Prof. Michal Schnaider‐Beeri, serve as Guest Editor and Senior Associate Editor of the journal, respectively. Author disclosures are available in the Supporting Information.
CONSENT STATEMENT
We confirm that all human participants provided informed verbal consent (see Appendix B.2).
Supporting information
ACKNOWLEDGMENTS
The authors sincerely thank Dr. Itamar Ganmore and Yonatan Shwartz for sharing their clinical expertise and guiding the feature design process. They also thank Prof. Uri Obolski for assistance with statistical analysis, and the Technion NLP lab for their support and helpful discussions.
Zadok M, Peled‐Cohen L, Calderon N, Gonen H, Beeri MS, Reichart R. Human and large language model judgments of cognitive impairment from language: An explainable artificial intelligence approach. Alzheimer's Dement. 2026;18:e70248. 10.1002/dad2.70248
Contributor Information
Maya Zadok, Email: maya.zadok@campus.technion.ac.il.
Lotem Peled‐Cohen, Email: lotemi.peled@gmail.com.
REFERENCES
- 1. Verma M, Howard RJ. Semantic memory and language dysfunction in early Alzheimer's disease: a review. Int J Geriatr Psychiatry. 2012;27(12):1209‐1217. [DOI] [PubMed] [Google Scholar]
- 2. Cho S, Cousins KAQ, Shellikeri S, et al. Lexical and acoustic speech features relating to Alzheimer disease pathology. Neurology. 2022;99(4):e313‐e322. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Peled‐Cohen L, Reichart R. A systematic review of NLP for dementia‐tasks, datasets and opportunities. arXiv preprint arXiv:2409.19737, 2024.
- 4. van Harten AC, Mielke MM, Swenson‐Dravis DM, et al. Subjective cognitive decline and risk of MCI: the Mayo Clinic Study of Aging. Neurology. 2018;91(4):e300‐e312. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Jessen F, Amariglio RE, Buckley RF, et al. The characterisation of subjective cognitive decline. Lancet Neurol. 2020;19(3):271‐278. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Bai W, Chen P, Cai H, et al. Worldwide prevalence of mild cognitive impairment among community dwellers aged 50 years and older: a meta‐analysis and systematic review of epidemiology studies. Age Ageing. 2022;51(8):afac173. [DOI] [PubMed] [Google Scholar]
- 7. Smith J. The future of senior care may be AI. https://www.retirementliving.com/aging-with-ai, 2025. [Accessed 2025‐05‐08].
- 8. Benbow SM, Jolley D. Dementia: stigma and its effects. Neurodegener Dis Manag. 2012;2(2):165‐172. [Google Scholar]
- 9. Monfared AAT, Stern Y, Doogan S, Irizarry M, Zhang Q. Stakeholder insights in Alzheimer's disease: natural language processing of social media conversations. J Alzheimers Dis. 2022;89(2):695‐708. [DOI] [PubMed] [Google Scholar]
- 10. Becker JT, Boiler F, Lopez OL, Saxton J, McGonigle KL. The natural history of Alzheimer's disease: description of study cohort and accuracy of diagnosis. Arch Neurol. 1994;51(6):585‐594. [DOI] [PubMed] [Google Scholar]
- 11. Grattafiori A, Dubey A, Jauhri A, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- 12. OpenAI. GPT‐4 technical report. CoRR, abs/2303.08774, 2023. doi: 10.48550/ARXIV.2303.08774 [DOI]
- 13. Team G, Georgiev P, Lei VI, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- 14. Badian Y, Ophir Y, Tikochinski R, Calderon N, Brunstein Klomek A, Fruchter E, Reichart R. Social media images can predict suicide risk using interpretable large language‐vision models. J Clin Psychiatry. 2023;85(1):50516. [DOI] [PubMed] [Google Scholar]
- 15. Lissak S, Ophir Y, Tikochinski R, et al. Bored to death: artificial intelligence research reveals the role of boredom in suicide behavior. Front Psychiatry. 2024;15:1328122. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Goodglass H, Kaplan E, Weintraub S. BDAE: The Boston Diagnostic Aphasia Examination. PA: Lea & Febiger; 1983. [Google Scholar]
- 17. Rottger P, Vidgen B, Hovy D, Pierrehumbert J. Two contrasting data annotation paradigms for subjective NLP tasks. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics; 2022. [Google Scholar]
- 18. Nicholas LE, Brookshire RH. A system for quantifying the informativeness and efficiency of the connected speech of adults with aphasia. J Speech Hear Res. 1993;36(2):338‐350. doi: 10.1044/jshr.3602.338. ISSN 0022‐4685. [DOI] [PubMed] [Google Scholar]
- 19. Cho S, Nevler N, Ash S, et al. Automated analysis of lexical features in frontotemporal degeneration. Cortex. 2021;137:215‐231. doi: 10.1016/j.cortex.2021.01.012. ISSN 19738102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Croisile B, Ska B, Brabant M‐J, et al. Comparative study of oral and written picture description in patients with Alzheimer's disease. Brain Lang. 1996;53(1):1‐19. [DOI] [PubMed]
- 21. Fraser KC, Meltzer JA, Rudzicz F. Linguistic features identify Alzheimer's disease in narrative speech. J Alzheimers Disease. 2016;49(2):407‐422. doi: 10.3233/JAD-150520. ISSN 1875‐8908. [DOI] [PubMed] [Google Scholar]
- 22. Ortiz‐Perez D, Ruiz‐Ponce P, Tomás D, Garcia‐Rodriguez J, Flores Vizcaya‐Moreno M, Leo M. A deep learning‐based multimodal architecture to predict signs of dementia. Neurocomputing. 2023;548:126413. doi: 10.1016/J.NEUCOM.2023.126413. ISSN 0925‐2312. [DOI] [Google Scholar]
- 23. Kempler D, Zelinski EM. Language in dementia and normal aging. Dementia and Normal Aging. 1994:331‐365. [Google Scholar]
- 24. Kempler D, Goral M. Language and dementia: neuropsychological aspects. Annual review of applied linguistics. 2008;28:73‐90. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. Cummings L. Describing the cookie theft picture. Pragmat Soc. 2019;10(2):153‐176. doi: 10.1075/ps.17011.cum. ISSN 1878‐9714. [DOI] [Google Scholar]
- 26. Mueller KD, Koscik RL, Turkstra LS, et al. Connected language in late middle‐aged adults at risk for Alzheimer's disease. J Alzheimers Dis. 2016;54(4):1539‐1550. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Forbes‐McKay KE, Venneri A. Detecting subtle spontaneous language decline in early Alzheimer's disease with a picture description task. Neurolog Sci. 2005;26:243. [DOI] [PubMed] [Google Scholar]
- 28. Karlekar S, Niu T, Bansal M. Detecting linguistic characteristics of Alzheimer's dementia by interpreting neural models. In: Walker M, Ji H, Stent A, eds. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics; 2018:701‐707. doi: 10.18653/v1/N18-2110 [DOI] [Google Scholar]
- 29. Mueller KD, Koscik RL, Hermann BP, Johnson SC, Turkstra LS. Declines in connected language are associated with very early mild cognitive impairment: results from the Wisconsin Registry for Alzheimer's Prevention. Front Aging Neurosci. 2018;9(JAN). doi: 10.3389/fnagi.2017.00437. ISSN 16634365. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30. Kumar Y, Maheshwari P, Joshi S, Baths V. ML‐based analysis to identify speech features relevant in predicting Alzheimer's disease. In: Proceedings of the 8th International Conference on Computing and Artificial Intelligence; 2022:207‐213.
- 31. Rudzicz F, Currie LC, Danks A, Mehta T, Zhao S. Automatically identifying trouble‐indicating speech behaviors in Alzheimer's disease. In Proceedings of the 16th International ACM SIGACCESS Conference on Computers & Accessibility, ASSETS ’14, Association for Computing Machinery; 2014:241‐242. ISBN 978‐1‐4503‐2720‐6. doi: 10.1145/2661334.2661382 [DOI] [Google Scholar]
- 32. Yorkston KM, Beukelman DR. An analysis of connected speech samples of aphasic and normal speakers. J Speech Hear Disord. 1980;45(1):27‐36. doi: 10.1044/jshd.4501.27 [DOI] [PubMed] [Google Scholar]
- 33. Chow TE, Veziris CR, Mundada N, et al. Medial temporal lobe tau aggregation relates to divergent cognitive and emotional empathy abilities in Alzheimer's disease. J Alzheimers Dis. 2023;96(1):313‐328. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Demichelis OP, Coundouris SP, Grainger SA, Henry JD. Empathy and theory of mind in Alzheimer's disease: a meta‐analysis. J Int Neuropsychol Soc. 2020;26(10):963‐977. doi:10.1017/S1355617720000478. ISSN 1355‐6177, 1469‐7661. [DOI] [PubMed] [Google Scholar]
- 35. Cipriani G, Vedovello M, Ulivi M, Nuti A, Lucetti C. Repetitive and stereotypic phenomena and dementia. Am J Alzheimers Dis Other Demen. 2013;28(3):223‐227. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. Baylis GC, Baylis LL, Gore CL. Visual neglect can be object‐based or scene‐based depending on task representation. Cortex. 2004;40(2):237‐246. [DOI] [PubMed] [Google Scholar]
- 37. Lawrence V, Murray J, Banerjee S, et al. “out of sight, out of mind”: a qualitative study of visual impairment and dementia from three perspectives. Int Psychogeriatr. 2009;21(3):511‐518. [DOI] [PubMed] [Google Scholar]
- 38. Ilias L, Askounis D. Explainable identification of dementia from transcripts using transformer networks. IEEE J Biomed Health Inf. 2022;26(8):4153‐4164. [DOI] [PubMed] [Google Scholar]
- 39. Sirts K, Piguet O, Johnson M. Idea density for predicting Alzheimer's disease from transcribed speech. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017) . 2017:322‐332.
- 40. Wankerl S, Nöth E, Evert S. An n‐gram based approach to the automatic diagnosis of Alzheimer's disease from spoken language. In Interspeech. 2017:3162‐3166.
- 41. Williams E, Theys C, McAuliffe M. Lexical‐semantic properties of verbs and nouns used in conversation by people with Alzheimer's disease. PLoS ONE. 2023;18(8):e0288556. doi:10.1371/journal.pone.0288556. ISSN 1932‐6203. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Cummings L. Describing the cookie theft picture: sources of breakdown in Alzheimer's dementia. Pragmat Soc. 2019;10(2):153‐176. [Google Scholar]
- 43. He X, Lin Z, Gong Y, et al. AnnoLLM: making large language models to be better crowdsourced annotators. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics; 2024. [Google Scholar]
- 44. Calderon N, Reichart R, Dror R. The alternative annotator test for LLM‐as‐a‐judge: How to statistically justify replacing human annotators with LLMs. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics ; Association for Computational Linguistics; 2025. [Google Scholar]
- 45. McFadden D. Conditional logit analysis of qualitative choice behavior. Frontiers in Econometrics. Academic Press; 1972. [Google Scholar]
- 46. Smith TJ, McKenna CM. A comparison of logistic regression pseudo r2 indices. Gen Linear Model J. 2013;39(2):17‐26. [Google Scholar]
- 47. Louviere JJ, Hensher DA, Swait JD. Stated Choice Methods: Analysis and Applications. Cambridge University Press; 2000. [Google Scholar]
- 48. Bora E, Walterfang M, Velakoulis D. Theory of mind in behavioural‐variant frontotemporal dementia and Alzheimer's disease: a meta‐analysis. J Neurol Neurosurg Psychiatry. 2015;86(7):714‐719. [DOI] [PubMed] [Google Scholar]
- 49. Poursabzi‐Sangdeh F, Goldstein DG, Hofman JM, Wortman Vaughan J, Wallach H. Manipulating and measuring model interpretability. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery; 2021. [Google Scholar]
- 50. Busch F, Hoffmann L, Rueger C, et al. Current applications and challenges in large language models for patient care: a systematic review. Commun Med. 2025;5(1):26. [DOI] [PMC free article] [PubMed] [Google Scholar]