Abstract
Primary progressive aphasia can be classified into one of three variants: semantic, non-fluent/agrammatic, and logopenic. While a considerable body of work exists characterizing each variant, few prior studies have addressed how to optimize behavioral assessment in a typical outpatient evaluation setting. Our aim is to examine the sensitivity and specificity of a battery of cognitive and linguistic assessments and to determine optimal cutoff scores for distinguishing patients’ subtypes based on these instruments.
This was a retrospective analysis of outpatient clinical testing of individuals with known or suspected primary progressive aphasia. Evaluations included the National Alzheimer’s Coordinating Center frontotemporal lobar degeneration module and additional measures of naming, semantic association, word verification, and picture description. Receiver operating characteristic analysis was used to examine the utility of each task in distinguishing each variant from the others. Logistic regressions were used to examine the combined utility of tasks for distinguishing a given subtype.
We examined 435 evaluations of 222 patients retrospectively. The battery was most consistent in distinguishing semantic variant by low scores and non-fluent/agrammatic variant by high scores on a similar subset of tasks. Tasks best distinguishing semantic variant produced a model that correctly classified 86% of cases. Tasks best distinguishing non-fluent/agrammatic variant correctly classified 77% of cases. The battery of tasks was weakest in identifying logopenic variant; only the ratio of sentence reading to sentence repetition performance was identified as a reasonable predictor, and it had predictive accuracy of 67%.
Naming assessments were the strongest basis for distinguishing all variants, particularly semantic variant from non-fluent/agrammatic variant. These data illustrate that a number of commonly used assessments perform at chance in distinguishing variants and preliminarily support an abbreviated battery that marginally favors tools not currently included in the frontotemporal lobar degeneration module.
Keywords: aphasia, dementia, language, evaluation
Introduction
Primary progressive aphasia (PPA) refers to a progressive neurodegenerative impairment in which language deficits are the central feature. For many individuals, it can be further classified into one of three variants1: semantic variant PPA (svPPA), non-fluent/agrammatic variant PPA (nfavPPA), and logopenic variant PPA (lvPPA). Variants are distinguishable by relative deficits within the domain of language. Individuals with svPPA have impaired word retrieval and object knowledge, but spared repetition and grammar. Individuals with nfavPPA have spared single-word comprehension and object knowledge, but experience agrammatism,2 apraxia of speech,3 and impaired comprehension of complex sentences.4 Individuals with lvPPA experience impaired word retrieval and impaired repetition (often attributed to domain-general working memory deficits) with spared grammar, object knowledge, and comprehension at the word level.2 NfavPPA and svPPA are associated primarily with underlying frontotemporal lobar degeneration, whereas lvPPA is associated primarily with Alzheimer’s disease.5 Unclassified or mixed PPA profiles are not uncommon. These individuals meet the consensus criteria required for a diagnosis of PPA but do not fit the profile of any variant, either because they lack the core features of every variant or because they have core features of multiple variants.
While a considerable body of work exists characterizing each PPA subtype, few prior studies have addressed the problem of selecting the optimal combination of behavioral instruments and cutoff scores for distinguishing the variants in a typical outpatient setting. This gap in the literature leads to the use of extended batteries that often are impractical to implement and challenging to complete for patients, particularly those further along in their disease progression. The aim of this work is to provide insight into two related questions:
What is the relative sensitivity and specificity of a battery of subtests derived from the National Alzheimer’s Coordinating Center (NACC) Uniform Dataset (UDS) standardized evaluation battery frontotemporal lobar degeneration (FTLD) module version 2.0 (https://naccdata.org/data-collection/forms-documentation/ftld-2), in combination with other measures and ratios between key assessment scores, in identifying the three PPA variants, as assessed using receiver operating characteristic (ROC) analysis?
What are the optimal “cut-point” scores for distinguishing patients’ subtype based on these instruments?
We anticipated that naming assessments would be the strongest basis for distinguishing svPPA from nfavPPA, with svPPA associated with poor naming performance and nfavPPA associated with preserved naming performance. We also anticipated that relationships (ratios) among scores might provide more insight into subtyping than raw scores, and we therefore examined relative performance on spoken and written naming tasks, combining scores from noun and verb stimuli. We anticipated that written naming better than oral naming on the same test (written: oral naming ratio > 1) and object naming much better than action naming might better predict nfavPPA, while written: oral naming and object: action naming ratios ≤ 1 might better predict svPPA. Given the nature of lvPPA, we predicted that it would be best captured by relative performance on tasks with similar linguistic demands that vary in working memory demands (sentence reading more accurate than repetition, or a high reading: repetition ratio).
Materials and Methods
We report how we determined our sample size, all data exclusions, all inclusion/exclusion criteria, whether inclusion/exclusion criteria were established prior to data analysis, all manipulations, and all measures in the study.
Records reviewed
PPA diagnosis and subtype identification were determined on the basis of a battery derived from a modified version of the FTLD module (as described below) in conjunction with comprehensive neurological evaluation, review of medical history, and consultation with family, if present. The findings of this evaluation were further affirmed with magnetic resonance imaging for all patients, using the location and severity of atrophy to support differential diagnosis of the variant. Asymmetric frontal atrophy was required for a diagnosis of nfavPPA. Left temporo-parietal atrophy was required for a diagnosis of lvPPA, and primarily left anterior temporal atrophy was required for a diagnosis of svPPA. Taken together, these results provided the standard basis for PPA diagnosis and subtype identification.
Patients seen in the Johns Hopkins Outpatient Center between 2008 and January 2021 were sampled based on the use of two key instruments for examining picture naming: the 30-item form of the Boston Naming Test (BNT)6 and the Hopkins Action Naming Assessment (HANA; https://score.jhmi.edu/downloads.html).7 These were chosen because they were the most common tasks administered during the retrospective time period regardless of predicted diagnostic outcome. These measures are prioritized in the clinical setting because they address patients’ most common primary presenting complaint that they cannot retrieve words. If a patient received either assessment during the course of outpatient evaluation or management, performance on all completed tasks in the battery across all visits was considered for analysis. This resulted in the review of 485 evaluations across 256 individual patients (summarized in Table 1). No part of the study procedures or analysis plans was pre-registered prior to the research being undertaken.
Table 1:
Summary statistics for sample
| | Logopenic | Non-fluent/agrammatic | Semantic | Unclassified |
|---|---|---|---|---|
| N individuals | 86 | 63 | 73 | 34 |
| N visits | 167 | 132 | 136 | 50 |
| Age, mean (SD) years | 70.9 (7.3) | 70.0 (8.3) | 67.2 (7.1) | 71.2 (6.6) |
| Education, mean (SD) years | 16.7 (2.8) | 15.9 (2.9) | 16.2 (2.1) | 17.2 (2.5) |
Patients who were followed beyond the initial referral for differential diagnosis typically returned every 6 months for the first year or two, after which visits were spaced annually or at “as needed” intervals. During late-stage disease, formal standardized assessments were less commonly administered. Beyond this general chronology, the timeline of visits and disease progression varied widely from patient to patient. Some patients were first assessed before mild symptoms could be captured by formal assessment and then were followed as their performance evolved, whereas others were not referred until relatively late in their disease progression. For this reason, in addition to inconsistency and bias in patient and family reporting of symptom onset, we chose not to address disease progression directly when identifying PPA subtype status.
Assessment battery
We modified the National Alzheimer’s Coordinating Center (NACC) Uniform Dataset (UDS) standardized evaluation battery frontotemporal lobar degeneration (FTLD) module version 2.0 (https://naccdata.org/data-collection/forms-documentation/ftld-2) by adding measures to further evaluate language generation in response to a picture,8 action naming,7 action and object semantics,9–11 and word comprehension with a more sensitive test of word/picture verification.12–14 Action naming was assessed with the HANA, in which action names are matched in length and frequency to items on the BNT. The HANA is a relatively new assessment that previously has been used in both post-stroke aphasia and PPA,7 where it was observed that the HANA is generally more difficult for patients than the BNT. Accuracy on the BNT and HANA is assessed similarly, counting only items correctly named spontaneously. While the standard administration of the BNT was used, items for which a semantic or phonemic cue was required before a correct response was achieved were not considered correct for the present analysis. We also removed the question anagram test in 2014, as a preliminary study showed that it did not distinguish among variants,15 and replaced it with a modified version.16 The Frontal Behavioral Inventory also was added.17 A summary of all the tools in the battery is presented in Table 2. Testing typically is completed in 2–2.5 hours following the initial neurological evaluation. Not all patients completed all assessments; the count of scores available for each task × subtype combination is provided in Table A.1. Scoring followed NACC guidelines for the FTLD module. Written naming was considered correct if all letters of the target name were correct and in the accurate order, regardless of case. Oral naming was scored as correct if all phonemes were present and in the correct order, irrespective of articulatory distortions. Other than where indicated, legal copyright restrictions prevent public archiving of the various assessment instruments and test batteries used in this research, which can be obtained from the copyright holders via the cited references.
Table 2:
Primary progressive aphasia battery
| Task | Description |
|---|---|
| Frontotemporal Lobar Degeneration Module | |
| Benson Figure Direct & Delayed Copy (17 points each) | Patients directly copy a figure then are asked to draw the figure from memory 10–15 minutes later |
| Verbal phonemic fluency (F, L) | Patients have one minute to list words beginning with a given letter |
| Regular & Irregular Word Tests – Reading & Spelling to Dictation (15 points each) | Patients read words aloud then write them as they are read aloud |
| Semantic Word-picture Matching Test (20 points) | Patients point to the image matching a spoken word from a field of 4 possible responses |
| Semantic Associates Test (16 points) | Patients indicate the associated pair of images from a field of 2 pairs |
| Sentence Repetition Test (5 items) | Patients repeat a sentence they just heard |
| Sentence Reading Test (5 items) | Patients read a sentence |
| Noun & Verb Naming Subtests – Oral and Written Modalities (16 points each) | Patients name pictures of common objects and actions then write their names |
| Social Norms Questionnaire (22 points) | Patients indicate whether different social behaviors are acceptable or not |
| Additional measures | |
| Hopkins Action Naming Assessment7 (30 points) | Patients name pictures of common actions |
| Boston Naming Test 30-item short form6 | Patients name pictures of common objects |
| Northwestern Anagram Test – active & passive voice16 (5 points each) | Patients make grammatically correct sentences by manipulating provided words |
| Pyramids & Palm Trees Test 14-item short form9,10 | Patients indicate the associated pair of objects from a field of 2 |
| Kissing & Dancing Test 15-item short form11 | Patients indicate the associated pair of actions from a field of 2 |
| Berndt Picture-Word Verification Nouns & Verbs13,14 (30 points each) | Patients indicate whether a picture of an object or action matches a spoken word |
| Boston Diagnostic Aphasia Examination: Picture description8 (4 points) | Patients describe a picture aloud using complete sentences if possible |
| Frontal Behavioral Inventory17 (24 items) | Patient caregivers rate changes in personality and behavior |
Statistical analysis
Aim 1
The first aim of this work was to examine the relative performance of the instruments, and of ratios between them, with regard to sensitivity and specificity for each of the three PPA variants. To examine Aim 1, two separate analysis methods were used to address the potential influence of repeated measures, each with strengths and weaknesses. In the first analysis, each visit in which a PPA battery or subset of battery tasks was completed was treated as an independent (rather than repeated) portrayal of subtype characteristics. This considerably increased the sample available for subtype identification but ignored potential intra-patient influences on prediction. In the second, exploratory analysis, only the first visit at which an individual was identified with a given subtype was analyzed, and subsequent visits between the provider and the same patient were ignored. This allowed us to examine the magnitude of effect repeated measures may have had on the overall assessment of a test’s utility. All visits in which the individual did not meet criteria for a given PPA subtype (Table 1 “unclassified” individuals) were dropped from further analysis.
All calculations were done in R using plotROC 2.2.18 to calculate empirical ROC curves and cutpointr19 to optimize the cutoff score on the unsmoothed curve. Receiver operating characteristic (ROC) analysis illustrates the binary diagnostic ability (subtype present or absent) of a given task as the critical threshold score is varied. The x-axis is the false positive rate (FPR, or 1 − specificity); the y-axis is the true positive rate (TPR, or sensitivity). Area under the receiver operating characteristic curve (AUROC) is a common way to define the overall performance of a measure in performing the diagnostic classification, where 0.5 represents chance and 1.0 represents perfect performance. To facilitate comparison, curves were inverted prior to calculating AUROC when lower task performance was associated with a higher probability of the individual belonging to that subtype.
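Because several of the strongest tasks predict subtype through low scores, the inversion step matters in practice. The following is a minimal base-R sketch of the empirical AUROC using its Mann-Whitney (rank-sum) formulation, with the inversion applied as described above; the function and toy data are illustrative, not the study’s code.

```r
# Empirical AUROC via the Mann-Whitney (rank-sum) identity, with the
# inversion rule described above: when lower scores predict subtype
# membership, the curve is flipped so an AUROC >= 0.5 is reported.
auroc <- function(score, is_subtype) {
  pos <- score[is_subtype]    # scores for patients with the subtype
  neg <- score[!is_subtype]   # scores for all other patients
  r <- rank(c(pos, neg))      # average ranks give ties 0.5 credit
  auc <- (sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2) /
    (length(pos) * length(neg))
  max(auc, 1 - auc)           # invert when low scores carry the signal
}

# Toy example: low naming scores predicting svPPA (simulated data)
set.seed(1)
naming <- c(rnorm(30, mean = 11, sd = 4), rnorm(60, mean = 20, sd = 5))
svppa  <- rep(c(TRUE, FALSE), times = c(30, 60))
auroc(naming, svppa)   # well above 0.5 after inversion
```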
An AUROC of ≥ 0.7 was used to vet the tasks most useful for identifying each subtype, as this is widely considered acceptable performance.20 These tasks then were entered into a binomial logistic regression in order to estimate the overall diagnostic utility of the selected subset of tasks in predicting the subtype. Each of the three regressions was evaluated at a threshold conservatively corrected for multiple comparisons (α = 0.05/3 ≈ 0.017).
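As a hedged sketch of this step, the binomial logistic regression and its classification accuracy can be computed in R as below; the simulated data frame and predictor names are placeholders for the vetted task scores, not the study’s variables.

```r
# Simulated stand-in for vetted task scores predicting svPPA status
set.seed(2)
n <- 120
battery <- data.frame(
  is_svppa = rep(c(TRUE, FALSE), each = n / 2),
  bnt      = c(rnorm(n / 2, 11, 4), rnorm(n / 2, 21, 4)),
  hana     = c(rnorm(n / 2, 12, 4), rnorm(n / 2, 20, 4)),
  ppt      = c(rnorm(n / 2, 10, 2), rnorm(n / 2, 13, 1))
)

# Binomial logistic regression pooling tasks that cleared the AUROC screen
fit <- glm(is_svppa ~ bnt + hana + ppt, family = binomial, data = battery)
summary(fit)   # per-task Wald tests, judged against alpha = 0.05/3

# Proportion correctly classified at the 0.5 predicted-probability threshold
classified <- predict(fit, type = "response") > 0.5
mean(classified == battery$is_svppa)
```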
Aim 2
The second aim of this work was to identify the optimal cut-point scores for distinguishing patients’ subtype based on these instruments. To examine Aim 2, “optimal” cut-points were established by maximizing the Youden index (sensitivity + specificity − 1)21 for any task with an AUROC of ≥ 0.7 for any subtype. These values identify scores at which sensitivity and specificity interests are balanced for a given subtype classification, and can be further described in terms of accuracy, sensitivity, and specificity at that score. Where lower task performance was associated with a higher probability of the individual belonging to that subtype, uninverted curves were used to calculate cut-points unless no cut-point could be computed (marked with an asterisk). This facilitated the interpretation of cut-points (which refer to actual test scores) for the purpose of identifying the most probable subtype.
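For illustration, the cut-point step can be reproduced with the cutpointr package the authors cite; the data here are simulated, and the column names are hypothetical.

```r
library(cutpointr)

# Toy data standing in for one task score and a subtype label
set.seed(3)
dat <- data.frame(
  bnt      = c(rnorm(60, 11, 4), rnorm(120, 20, 5)),
  is_svppa = rep(c(TRUE, FALSE), times = c(60, 120))
)

# Maximize the Youden index (sensitivity + specificity - 1) on the
# unsmoothed empirical curve; direction is detected from the data
cp <- cutpointr(dat, x = bnt, class = is_svppa, pos_class = TRUE,
                method = maximize_metric, metric = youden)
summary(cp)   # optimal cutpoint plus accuracy, sensitivity, specificity
```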
Standard Protocol Approvals, Registrations, and Patient Consents
All work was conducted with the formal approval of the Johns Hopkins University School of Medicine Institutional Review Board (IRB-3; Federal Wide Assurance # FWA00005752-JHUSOM, FWA00006087-JHH & JHHS, and FWA00005719-KKI OHRP IRB Registration #IRB 00001656). Participant consent was not required.
Data Availability
The conditions of our ethics approval do not permit public archiving of study data. Anonymized data not otherwise present in the appendix are available upon request to the authors, subject to review by the Johns Hopkins University School of Medicine Institutional Review Board resulting in a formal data sharing agreement.
Results
Receiver Operating Characteristic (ROC) curves – All visits
Results of the ROC analysis are presented in Table 3 (Aim 1). Based on the a priori cutoff of 0.7, 14 of the 29 tasks and relationships examined were further analyzed to determine optimal sensitivity and specificity cut-point scores (Aim 2). Additional descriptive information regarding performance by variant is available in the appended Figs. A.2–A.23.
Table 3:
Area under the receiver operating characteristics – All visits
| | lvPPA | nfavPPA | svPPA |
|---|---|---|---|
| **Hopkins Action Naming Assessment\*** | 0.59† | 0.79 | 0.78† |
| **Boston Naming Test\*** | 0.57† | 0.77 | 0.77† |
| **Berndt picture-word verification - nouns** | 0.50† | 0.71 | 0.82† |
| **Oral noun naming** | 0.52† | 0.72 | 0.76† |
| **Benson delayed figure copy** | 0.64† | 0.71 | 0.63† |
| **Berndt picture-word verification - verbs** | 0.51† | 0.71 | 0.79† |
| **Sentence reading: Sentence repetition** | 0.70 | 0.60† | 0.54† |
| **Written noun naming** | 0.54† | 0.70 | 0.74† |
| **Kissing & Dancing** | 0.54† | 0.68 | 0.75† |
| Oral verb naming | 0.55† | 0.64 | 0.68† |
| **Spelling to dictation** | 0.52† | 0.64 | 0.77† |
| Written verb naming | 0.53† | 0.62 | 0.68† |
| Picture description | 0.59† | 0.60 | 0.58† |
| **Pyramids & Palm Trees** | 0.53 | 0.66 | 0.77† |
| Sentence repetition | 0.66† | 0.59 | 0.53† |
| Verbal fluency | 0.59 | 0.60† | 0.56† |
| **Semantic association** | 0.55 | 0.62 | 0.76† |
| **Semantic word picture matching** | 0.54 | 0.62 | 0.74† |
| Frontal Behavioral Inventory – Emotional | 0.52† | 0.58 | 0.51† |
| Frontal Behavioral Inventory – Disinhibition | 0.61† | 0.51 | 0.64 |
| Social norms | 0.51 | 0.57† | 0.63 |
| Benson direct figure copy | 0.53† | 0.57 | 0.59† |
| Oral: Written | 0.53 | 0.59† | 0.60 |
| Frontal Behavioral Inventory – Negative | 0.51† | 0.51 | 0.59 |
| Anagrams – active voice | 0.54† | 0.55 | 0.56† |
| Boston Naming Test: Hopkins Action Naming Assessment | 0.56 | 0.53 | 0.59† |
| Oral reading | 0.51 | 0.57 | 0.69† |
| Anagrams – passive voice | 0.56† | 0.52 | 0.54† |
| Sentence reading | 0.51 | 0.51† | 0.57† |
lvPPA: logopenic variant, nfavPPA: non-fluent/agrammatic variant, svPPA: semantic variant
*Tests with a mean AUROC of ≥ 0.7. All tests with at least one AUROC value ≥ 0.7 are bolded.
†ROC curves were inverted in order to calculate maximum AUROC; for these values, lower performance was associated with greater predictive accuracy.
Two tasks demonstrated a mean AUROC of ≥ 0.7: the Hopkins Action Naming Assessment and the Boston Naming Test (Fig. 1).
Fig. 1:

Hopkins Action Naming Assessment and Boston Naming Test results by variant
Battery utility in identifying svPPA
As anticipated given the focus of this battery on word-finding abilities across modalities and contexts, multiple tasks associated with naming and semantic association demonstrated high predictive value for identifying svPPA. Lower scores on each of these tasks were stronger predictors of svPPA (Table 4). Optimal scores refer to the score below which the individual is likely to have svPPA, at the accuracy level provided. All identified tasks were entered as predictors of svPPA status in a binary logistic regression. The model was statistically significant, χ2(11) = 63.24, p < 0.001, explained 52% (Nagelkerke R2) of the variance in svPPA, and correctly classified 86% of cases. The Pyramids & Palm Trees test, Berndt picture-word verification of verbs, and BNT were independent, significant predictors (Table 5). Due to the observed high correlation between variables (see Appendix Fig. A.1) and resulting multicollinearity, a complementary model was calculated with a least absolute shrinkage and selection operator (LASSO) penalty, using a lambda that minimized the test mean squared error; this model, reported in Table 5, explained 35% of the variance.
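A sketch of the LASSO step is below, using glmnet as a standard implementation (the paper does not name the package it used); the simulated matrix stands in for the 11 vetted task scores, and lambda is chosen by cross-validated mean squared error to mirror the text.

```r
library(glmnet)

# Simulated stand-in: 11 task scores (columns) and binary svPPA labels
set.seed(4)
n <- 150
x <- matrix(rnorm(n * 11), n, 11,
            dimnames = list(NULL, paste0("task", 1:11)))
y <- rbinom(n, 1, plogis(-0.8 * x[, 1] - 0.6 * x[, 2]))

# alpha = 1 gives the LASSO penalty; lambda chosen to minimize
# cross-validated mean squared error, as described in the text
cv <- cv.glmnet(x, y, family = "binomial", alpha = 1, type.measure = "mse")
coef(cv, s = "lambda.min")   # shrunken coefficients; exact zeros drop tasks
```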
Table 4:
Task prediction performance at optimal score
| | Optimal score | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|
| Semantic variant | ||||
| Pyramids & Palm Trees | 12/14 | 79% | 0.63 | 0.85 |
| Semantic association | 14/16 | 78% | 0.64 | 0.82 |
| Berndt picture-word verification - verbs | 20/30 | 77% | 0.64 | 0.82 |
| Berndt picture-word verification - nouns | 26/30 | 72% | 0.82 | 0.68 |
| Boston Naming Test | 11/30 | 71% | 0.78 | 0.69 |
| Written noun naming | 13/16 | 70% | 0.78 | 0.67 |
| Hopkins Action Naming Assessment | 12/30 | 70% | 0.87 | 0.64 |
| Oral noun naming | 13/16 | 71% | 0.74 | 0.71 |
| Semantic word picture matching | 19/20 | 68% | 0.75 | 0.66 |
| Kissing & Dancing | 12/15 | 67% | 0.79 | 0.63 |
| Spelling to dictation | 22/30 | 67% | 0.86 | 0.62 |
| Non-fluent/agrammatic variant | ||||
| Hopkins Action Naming Assessment | 20/30 | 77% | 0.64 | 0.83 |
| Boston Naming Test | 20/30 | 74% | 0.65 | 0.78 |
| Oral noun naming | 16/16 | 74% | 0.61 | 0.79 |
| Berndt picture-word verification - verbs | 29/30 | 72% | 0.52 | 0.81 |
| Berndt picture-word verification - nouns | 29/30 | 71% | 0.62 | 0.74 |
| Benson delayed figure copy | 9/17 | 67% | 0.73 | 0.64 |
| Written noun naming | 15/16 | 66% | 0.66 | 0.66 |
Table 5:
Results of the logistic regression analyses
| | B(SE) | Wald χ2 | Odds ratio | 95% CI | LASSO β |
|---|---|---|---|---|---|
| Semantic variant | |||||
| Pyramids & Palm Trees* | −0.62(0.26) | 5.76 | 0.54 | 0.33, 0.89 | −0.07 |
| Semantic association | −0.12(0.19) | 0.36 | 0.89 | 0.61, 1.30 | |
| Berndt picture-word verification - verbs* | −0.16(0.07) | 6.08 | 0.85 | 0.75, 0.97 | −0.01 |
| Berndt picture-word verification - nouns | 0.07(0.08) | 0.65 | 1.07 | 0.91, 1.26 | |
| Boston Naming Test** | −0.21(0.07) | 8.02 | 0.81 | 0.71, 0.94 | −0.01 |
| Written noun naming | 0.11(0.10) | 1.16 | 1.11 | 0.92, 1.35 | |
| Hopkins Action Naming Assessment | 0.10(0.07) | 1.97 | 1.10 | 0.96, 1.27 | |
| Oral noun naming | 0.05(0.11) | 0.24 | 1.06 | 0.85, 1.31 | |
| Semantic word picture matching | 0.08(0.16) | 0.24 | 1.08 | 0.79, 1.49 | |
| Kissing & Dancing | −0.05(0.15) | 0.10 | 0.96 | 0.72, 1.27 | |
| Spelling to dictation | −0.02(0.06) | 0.07 | 0.99 | 0.88, 1.10 | |
| Non-fluent/agrammatic variant | |||||
| Hopkins Action Naming Assessment* | 0.13 (0.05) | 5.49 | 1.14 | 1.02, 1.26 | 0.02 |
| Boston Naming Test | 0.07 (0.05) | 1.88 | 1.07 | 0.97, 1.17 | 0.01 |
| Oral noun naming* | −0.23 (0.10) | 5.40 | 0.79 | 0.65, 0.96 | −0.03 |
| Berndt picture-word verification - verbs | 0.01 (0.06) | 0.01 | 1.01 | 0.89, 1.14 | |
| Berndt picture-word verification - nouns | 0.01 (0.07) | 0.01 | 1.01 | 0.88, 1.16 | |
| Benson delayed figure copy | 0.05 (0.05) | 1.41 | 1.06 | 0.97, 1.15 | 0.01 |
| Written noun naming | −0.04 (0.07) | 0.34 | 0.96 | 0.83, 1.10 | |
**p < 0.01,
*p < 0.05
Battery utility in identifying nfavPPA
A number of tasks with high predictive value for identifying nfavPPA were identified. High scores on naming tests and minimal difficulty including details in a delayed figure copy were the best predictors of this subtype (Table 4). Optimal scores refer to the cutoff score above which the individual is likely to have nfavPPA, at the accuracy level provided. All identified tasks were entered as predictors of nfavPPA status in a binary logistic regression. The model was statistically significant, χ2(7) = 42.99, p < 0.001, explained 33% (Nagelkerke R2) of the variance in nfavPPA, and correctly classified 77% of cases. The HANA and oral noun naming tasks were independent, significant predictors (Table 5). The LASSO-corrected model explained 36% of the variance.
Battery utility in identifying lvPPA
Only a single relationship provided high predictive value for identifying lvPPA: the ratio of sentence reading to sentence repetition performance. The optimal cut-point for this ratio was calculated as 1.25 (e.g., a reading score of 5 with a repetition score of 4). This resulted in a predictive accuracy of 67% (sensitivity = 0.73, specificity = 0.64). Additional cut-point values are illustrated in Fig. 2. The ratio of sentence reading to sentence repetition performance was entered as a predictor of lvPPA status in a binary logistic regression. The model was statistically significant, χ2(1) = 19.65, p < 0.001, explained 11% (Nagelkerke R2) of the variance in lvPPA, and correctly classified 69% of cases. The task was a significant predictor (B = 0.65, SE = 0.16, Wald χ2(1) = 16.68, p < 0.001, odds ratio = 1.92 [95% CI = 1.40, 2.63]).
Fig. 2:

Logopenic variant receiver operating characteristic curve for Sentence reading: Sentence repetition.
Optimal cutoff point noted in white.
Receiver Operating Characteristic (ROC) curves – First visits
Re-analysis of the AUROC values using a single visit per participant (i.e., excluding follow-up visits with the same patient once a PPA subtype was identified) is presented in Table A.2. This included 64 assessments of individuals with svPPA, 56 assessments of individuals with nfavPPA, and 77 assessments of individuals with lvPPA. Overall findings were remarkably similar between the two analyses. The HANA and BNT were the strongest overall predictors of subtype, and using single visits also resulted in Berndt picture-word verification of nouns having a mean AUROC of ≥ 0.7. Using first visits only, performance data from individuals with svPPA showed a nearly identical pattern of predictor strength to the analysis using all visit scores, with no difference in the predictors identified. For nfavPPA, the same tasks were identified as the strongest predictors, though their predictive value estimates were slightly higher. Boston Diagnostic Aphasia Examination (BDAE) picture description also emerged as a task that exceeded the utility threshold that had been set. No task met criterion for lvPPA in isolation, though the ratio of sentence reading to sentence repetition again performed relatively well, as did the Frontal Behavioral Inventory Disinhibition sub-score. Given the gross similarities between the two results, no additional analyses were conducted on the data subset drawn from first visits in isolation.
Discussion
In this retrospective analysis, we explored the utility of the current tasks used in our PPA battery in distinguishing the three PPA subtypes. Findings highlighted the diagnostic value of 14 of the 26 tasks. From the FTLD module, these included the Benson delayed figure copy, oral and written noun naming, sentence reading and repetition, spelling to dictation, semantic association, and semantic word picture matching. Additional tasks make up our current battery: the HANA, BNT, Berndt picture-word verification of nouns and verbs, 15-item version of the Kissing and Dancing test,11 and 14-item version9 of the Pyramids and Palm Trees test.10
Considering our first aim, the pattern of findings mirrored what we anticipated. Naming assessments were the strongest basis for distinguishing svPPA from nfavPPA, with svPPA associated with poor naming performance and nfavPPA associated with strong naming performance. One elegant aspect of the findings was the evidence that the same key tasks were identified as having high utility in parsing these two variants. For both nfavPPA and svPPA, these key tasks included the HANA, BNT, oral noun naming, written noun naming, and Berndt picture-word verification of nouns and verbs (see summary in Table 6). In service of Aim 2, cutoff scores also showed potential usefulness for differential diagnosis.
Table 6:
Summary of key tasks for the nfavPPA-svPPA distinction
| | Non-fluent/agrammatic variant lower bound | Semantic variant upper bound |
|---|---|---|
| Hopkins Action Naming Assessment | 20/30 | 12/30 |
| Boston Naming Test | 20/30 | 11/30 |
| Oral noun naming | 16/16 | 13/16 |
| Written noun naming | 15/16 | 13/16 |
| Berndt picture-word verification - nouns | 29/30 | 26/30 |
| Berndt picture-word verification - verbs | 29/30 | 20/30 |
Scores above the lower bound for non-fluent/agrammatic variant are highly likely to indicate its presence. Scores below the upper bound for semantic variant are highly likely to indicate its presence.
We found no evidence that nfavPPA was better predicted by written scores or that svPPA was better predicted by oral scores; response modality seemed not to be relevant to diagnostic utility. Likewise, relative performance on nouns and verbs did not perform well in parsing subtypes. This was unexpected, since prior work has found that individuals with agrammatism, whether due to stroke or PPA, appear to have greater difficulty naming verbs than nouns, whereas this pattern is reversed for individuals with svPPA.22, 23 However, those analyses often have relied on the Northwestern Naming Battery (NNB) items (a subset of which make up the stimuli used in the FTLD module), whereas the hypothesis tested in the present analysis examined the ratio of BNT and HANA scores. It is possible that our findings were inconsistent with this prior work because of the leptokurtic, negatively skewed distribution of NNB/FTLD item scores produced by ceiling effects (particularly among nouns), which was not present in our participants’ BNT or HANA performance.
Given the nature of lvPPA, we predicted that it would be best captured by relative performance on tasks with similar linguistic demands but differing working memory demands (sentence reading and sentence repetition). This was supported by our data, though it was somewhat surprising that this ratio was uniquely valuable in distinguishing lvPPA out of all tasks considered. Our findings are grossly consistent with prior work examining repetition as a means of distinguishing lvPPA, which also suggested that poor repetition scores would perform well as a basis for distinguishing variants.24 However, we found that the ratio of reading to repetition performance performed better than either repetition score or reading score in isolation. Of note, Lukic and colleagues24 used a more nuanced operationalization of accuracy than we chose for this analysis; examining accuracy at the syllable level as a function of length and meaningfulness may result in a higher AUROC for these individual tasks than identifying whether sentences were repeated verbatim, as was done here.
Implications for task selection
There are a number of implications that can be distilled from these analyses when trying to optimize time and utility in a clinical setting. First, a striking number of the commonly utilized tasks we examined performed at or near chance (0.5) in distinguishing any PPA subtype from the others. This often occurred when tasks were nearly always performed at ceiling regardless of subtype, as observed in the word and sentence reading tasks. Nearly ubiquitous probes used with adults with suspected language disturbance, especially those with aphasia due to stroke (for example, verbal fluency and the BDAE picture description score), also performed surprisingly poorly in distinguishing subtypes. Notably, other measures derived from a BDAE picture description have demonstrated far more utility in complementary analyses identifying PPA subtype.25
Second, in situations where the same or similar underlying constructs were probed by multiple tasks, the FTLD tasks marginally underperformed independently selected tasks targeting similar skills (see summary in Table 7). This may be due to the abbreviated nature of a number of the FTLD tasks. However, it is striking to note that the FTLD tasks were designed or chosen for their value in examining language among people with dementia, in contrast to the majority of the independently selected tasks, which were designed with post-stroke aphasia in mind. The utility of the FTLD module in isolation for distinguishing PPA subtypes previously was explored,26 and our work affirms and complements those findings in a number of ways. We too found that naming showed the most promise for providing a three-way distinction of PPA variant, though our findings suggest verb naming performed marginally better than noun naming. As observed by Staffaroni et al.,26 the FTLD semantic association and word-picture matching tests were notable only for their use in distinguishing svPPA from the other two subtypes.
Table 7:
Mean area under the receiver operating characteristic curve for similar tasks
| Skill | Frontotemporal Lobar Degeneration Module | Mean AUROC | Additional measures | Mean AUROC |
|---|---|---|---|---|
| Noun naming | Oral noun naming | 0.67 | Boston Naming Test | 0.70 |
| | Written noun naming | 0.66 | | |
| Verb naming | Oral verb naming | 0.62 | Hopkins Action Naming Assessment | 0.72 |
| | Written verb naming | 0.61 | | |
| Auditory comprehension | Semantic word picture matching | 0.63 | Berndt picture-word verification - nouns | 0.68 |
| | | | Berndt picture-word verification - verbs | 0.67 |
| Semantic association | Semantic associates | 0.64 | Pyramids & Palm Trees | 0.65 |
| | | | Kissing & Dancing | 0.66 |
Our findings can be used to identify a relatively abbreviated, 6-task battery for distinguishing subtypes: the HANA, BNT, and Berndt picture-word verification of nouns and verbs, which are important to the distinction of svPPA and nfavPPA, and the sentence reading and repetition tasks, considered as a ratio when identifying lvPPA (using a reading score of 5 combined with a repetition score of 4). There is cause to consider a 7th assessment of semantic knowledge, the 14-item version9 of the Pyramids and Palm Trees test,10 given its high performance in svPPA and in order to directly address this dimension of the svPPA profile, despite this assessment’s moderate performance when averaged across all three subtypes. At present, the unabridged PPA battery used at Johns Hopkins requires 2–2.5 hours to complete and is conducted in addition to a full neurological evaluation and counseling during outpatient appointments for a patient under investigation. Abbreviated batteries can be particularly useful in this population given the likelihood of comorbid general cognitive impairments, the frequency of follow-up appointments, and the overall fatigue and frustration that individuals with dementia often experience.
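To illustrate how the abbreviated battery’s cut-points could be composed at the point of care, the sketch below combines the Table 6 bounds for the four naming/verification tasks with the reading:repetition ratio. It is purely illustrative and not a validated algorithm: the function, its boundary handling at the cut-points, and its ordering of checks are our own simplifications, and any real decision would rest on the full clinical picture.

```r
# Illustrative screening heuristic combining the reported cut-points.
# NOT a validated diagnostic rule; thresholds are from Table 6 and the
# lvPPA ratio analysis, and >=/<= boundary handling is a simplification.
screen_ppa <- function(hana, bnt, berndt_nouns, berndt_verbs,
                       reading, repetition) {
  if (hana >= 20 && bnt >= 20 && berndt_nouns >= 29 && berndt_verbs >= 29)
    return("possible nfavPPA")
  if (hana <= 12 && bnt <= 11 && berndt_nouns <= 26 && berndt_verbs <= 20)
    return("possible svPPA")
  if (repetition > 0 && reading / repetition >= 1.25)
    return("possible lvPPA")
  "indeterminate: full evaluation indicated"
}

screen_ppa(hana = 25, bnt = 24, berndt_nouns = 30, berndt_verbs = 30,
           reading = 5, repetition = 5)   # "possible nfavPPA"
```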
Prior work has promoted the use of even more comprehensive, theory-driven batteries far broader than either the larger or the abbreviated battery discussed in the present work.27 Our findings were consistent with these recommendations to an extent, particularly when contrasting svPPA and nfavPPA (i.e., the use of the BNT, Pyramids and Palm Trees test, and semantic word picture matching) but were markedly dissonant with theory-driven recommendations for assessment of lvPPA. This difference may be due to differences in the focus of the aims; our analyses focused on utility for subtype identification in isolation. However, it is important to note that diagnostic utility is not the only consideration when designing a battery of this nature. This battery is given in whole or in part throughout the patient’s disease course and provides valuable information to caregivers and clinical team members regarding the patient’s shifting strengths and weaknesses over time – a richer profile than is captured by these tasks alone. Further vetting both of this shorter subset of tasks and of theory-driven comprehensive evaluation procedures, such as the one presented by Henry and Grasso,27 remains an important task for future investigation.
This analysis also highlights areas where the existing battery did not perform well. Overwhelmingly, the battery is most useful in identifying svPPA. This was true of the FTLD module as well.26 This is not wholly surprising, as permutations of naming tasks with diverse additional cognitive-linguistic demands are relatively easy to create and extremely common in language assessment in general. However, it is not ideal that the battery tends to identify nfavPPA only by virtue of this subtype’s lack of deficits and resulting near-ceiling performance, rather than by the presence of its hallmark agrammatism.24 Tasks on which individuals with agrammatism typically do not perform at ceiling should be considered to address this relative weakness, such as the Morphosyntactic Generation test28 or a fluency score derived from a narrative sample or picture description using a rubric similar to that in the Western Aphasia Battery.29 The battery also would likely benefit from the addition of tasks that directly capture apraxia of speech, as well as tasks that place incrementally greater loads on short-term and working memory for language units, perhaps using sentences of systematically increasing length and complexity.
Limitations & final thoughts
There are some clear limitations of this analysis beyond those previously discussed. Foremost, it is not best practice to use the same factor, in this case the assessment battery, both as a truth criterion for subtype identification and as the tool under investigation. Ideally, one would want an independent truth criterion (e.g., imaging or another biomarker) against which to compare the utility of a behavioral battery, avoiding a kind of circular reasoning. However, there is no current gold-standard biomarker for PPA or its subtypes. Importantly, although the assessment battery supplies the backbone of PPA variant identification, no single task in the battery or aspect of the neurological evaluation is utilized to the exclusion of other available information. Diagnosis is made based on a preponderance of all available information about a given patient. The impact of this limitation on establishing ground truth independent of any one task is mitigated by this holistic perspective and the utilization of neuroimaging in patient diagnosis.
A second limitation is that some of the measures added to the FTLD module-based assessment battery are relatively new, in some cases having been developed or modified by the authors (i.e., the HANA and the use of the Berndt stimuli for picture-verification). While this does not pose a limitation on the clinical or research use of these freely available materials, it does limit the ability to generalize from these tools to broader claims about noun and verb capacities or representations. For example, while the HANA is matched with the BNT stimuli in word frequency and length, it has not been examined for other factors, such as familiarity, age of acquisition, imageability, or visual complexity of the items. Recent work has suggested these additional factors may influence verb naming in svPPA.30
One strength of this retrospective dataset in particular is that, for many patients, we were able to affirm behaviorally based subtype identification with neuroimaging used to locate areas of emerging atrophy and infer etiology. While this is a relatively recent addition to the process of diagnosing PPA among outpatients at Johns Hopkins, in future work we will be able to examine both the value of the battery assessments against diagnostic biomarkers and the added sensitivity and specificity that comes from incorporating neuroimaging into the more traditional behavioral basis of subtype identification. Another limitation is that, when considering diagnostic utility in medicine more generally, an AUROC of 0.7 is a fairly liberal, error-prone threshold, and not all clinical circumstances call for the equal optimization of sensitivity and specificity (e.g., this balance would yield poor utility for screening).
However, as evidenced by the regressions, the behavioral tasks performed better together and no battery of language assessments is relied upon in isolation for the purposes of identifying PPA or its subtypes. Language batteries perform an important role in the classification of subtypes in concert with thorough medical history, neurological evaluation, examination for the presence of apraxia of speech, and, where available, the consideration of emerging biomarker candidates, such as differences seen in molecular PET imaging,31, 32 MRI,33, 34 and CSF,35 for identifying underlying pathology.1, 36
Supplementary Material
Acknowledgements
This work is supported by National Institutes of Health/National Institute on Deafness and Other Communication Disorders (NIH/NIDCD): R01 DC05375, P50 DC014664, and R01 DC011317.
Declarations of Interest
None.
References
- 1.Gorno-Tempini ML, Hillis AE, Weintraub S, et al. Classification of primary progressive aphasia and its variants. Neurology. 2011;76(11):1006–1014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Montembeault M, Brambati SM, Gorno-Tempini ML, Migliaccio R. Clinical, anatomical, and pathological features in the three variants of primary progressive aphasia: a review. Front Neurol. 2018;9:692. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Ash S, McMillan C, Gunawardena D, et al. Speech errors in progressive non-fluent aphasia. Brain Lang. 2010;113(1):13–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Thompson CK, Mack JE. Grammatical impairments in PPA. Aphasiology. 2014;28(8–9):1018–1037. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Harris JM, Gall C, Thompson JC, et al. Classification and pathology of primary progressive aphasia. Neurology. 2013;81(21):1832–1839. [DOI] [PubMed] [Google Scholar]
- 6.Williams BW, Mack W, Henderson VW. Boston naming test in Alzheimer’s disease. Neuropsychologia. 1989;27(8):1073–1079. [DOI] [PubMed] [Google Scholar]
- 7.Breining BL, Faria AV, Caffo B, et al. Neural regions underlying object and action naming: Complementary evidence from acute stroke and primary progressive aphasia. Aphasiology. 2021;doi: 10.1080/02687038.2021.1907291 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Goodglass H, Kaplan E, Weintraub S. BDAE: The Boston Diagnostic Aphasia Examination. Lippincott Williams & Wilkins; Philadelphia, PA; 2001. [Google Scholar]
- 9.Breining BL, Tippett DC, Davis C, et al. Assessing dissociations of object and action naming in acute stroke. 2015.
- 10.Howard D, Patterson K. The Pyramids and Palm Trees Test: A test of semantic access from words and pictures. Pearson assessment; 1992. [Google Scholar]
- 11.Bak TH, Hodges JR. Kissing and dancing—a test to distinguish the lexical and conceptual contributions to noun/verb and action/object dissociation. Preliminary results in patients with frontotemporal dementia. J Neurolinguistics. 2003;16(2–3):169–181. [Google Scholar]
- 12.Breese EL, Hillis AE. Auditory comprehension: Is multiple choice really good enough? Brain Lang. 2004;89(1):3–8. [DOI] [PubMed] [Google Scholar]
- 13.Berndt RS, Mitchum CC, Haendiges AN, Sandson J. Verb retrieval in aphasia. 1. Characterizing single word impairments. Brain Lang. 1997;56(1):68–106. [DOI] [PubMed] [Google Scholar]
- 14.Breining BL. Berndt Picture-Word Verification Nouns & Verbs. 2011.
- 15.Sebastian R, Davis C, Gomez Y, Trupe L, Hillis A. Syntactic Processing Skills in the Three Variants of PPA (P6.229). Neurology. 2014;82(10 Supplement):P6.229. [Google Scholar]
- 16.Weintraub S, Mesulam M-M, Wieneke C, Rademaker A, Rogalski EJ, Thompson CK. The Northwestern Anagram Test: measuring sentence production in primary progressive aphasia. American Journal of Alzheimer’s Disease & Other Dementias. 2009;24(5):408–416. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Kertesz A, Davidson W, Fox H. Frontal Behavioral Inventory: diagnostic criteria for frontal lobe dementia. Can J Neurol Sci. 1997;24(1):29–36. [DOI] [PubMed] [Google Scholar]
- 18.Sachs MC. plotROC: a tool for plotting ROC curves. Journal of Statistical Software. 2017;79. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Thiele C, Hirschfeld G. cutpointr: Improved estimation and validation of optimal cutpoints in R. arXiv preprint arXiv:2002.09209. 2020. [Google Scholar]
- 20.Hosmer DW, Lemeshow S, Sturdivant RX. Applied logistic regression. vol 398. John Wiley & Sons; 2013. [Google Scholar]
- 21.Perkins NJ, Schisterman EF. The Youden Index and the optimal cut- point corrected for measurement error. Biometrical Journal: Journal of Mathematical Methods in Biosciences. 2005;47(4):428–441. [DOI] [PubMed] [Google Scholar]
- 22.Thompson CK, Lukic S, King MC, Mesulam MM, Weintraub S. Verb and noun deficits in stroke-induced and primary progressive aphasia: The Northwestern Naming Battery. Aphasiology. 2012;26(5):632–655. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Lukic S, Borghesani V, Weis E, et al. Dissociating nouns and verbs in temporal and perisylvian networks: Evidence from neurodegenerative diseases. Cortex. 2021; [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Lukic S, Mandelli ML, Welch A, et al. Neurocognitive basis of repetition deficits in primary progressive aphasia. Brain Lang. 2019;194:35–45. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Themistocleous C, Ficek B, Webster K, den Ouden D-B, Hillis AE, Tsapkini K. Automatic subtyping of individuals with Primary Progressive Aphasia. J Alzheimers Dis. 2020;(Preprint):1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Staffaroni AM, Weintraub S, Rascovsky K, et al. Uniform data set language measures for bvFTD and PPA diagnosis and monitoring. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring. 2021;13(1):e12148. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Henry ML, Grasso SM. Assessment of individuals with primary progressive aphasia. NIH Public Access; 2018:231. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Stockbridge MD, Matchin W, Walker A, et al. One cat, two cats, red cat, blue cats: eliciting morphemes from individuals with primary progressive aphasia. Aphasiology. 2021:1–12. doi: 10.1080/02687038.2020.1852167 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Kertesz A. The Western Aphasia Battery: a systematic review of research and clinical applications. Aphasiology. 2020:1–30. doi: 10.1080/02687038.2020.1852002 [DOI] [Google Scholar]
- 30.Marcotte K, Graham NL, Black SE, et al. Verb production in the nonfluent and semantic variants of primary progressive aphasia: The influence of lexical and semantic factors. Cogn Neuropsychol. 2014;31(7–8):565–583. [DOI] [PubMed] [Google Scholar]
- 31.Quigley H, Colloby SJ, O’Brien JT. PET imaging of brain amyloid in dementia: a review. Int J Geriatr Psychiatry. 2011;26(10):991–999. [DOI] [PubMed] [Google Scholar]
- 32.Matias-Guiu JA, Díaz-Álvarez J, Ayala JL, et al. Clustering analysis of FDG-PET imaging in primary progressive aphasia. Front Aging Neurosci. 2018;10:230. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Rogalski E, Cobia D, Harrison T, Wieneke C, Weintraub S, Mesulam M-M. Progression of language decline and cortical atrophy in subtypes of primary progressive aphasia. Neurology. 2011;76(21):1804–1810. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Rosen HJ, Kramer JH, Gorno-Tempini ML, Schuff N, Weiner M, Miller BL. Patterns of cerebral atrophy in primary progressive aphasia. The American journal of geriatric psychiatry. 2002;10(1):89–97. [PubMed] [Google Scholar]
- 35.Paraskevas GP, Kasselimis D, Kourtidou E, et al. Cerebrospinal fluid biomarkers as a diagnostic tool of the underlying pathology of primary progressive aphasia. J Alzheimers Dis. 2017;55(4):1453–1461. [DOI] [PubMed] [Google Scholar]
- 36.Tippett DC. Classification of primary progressive aphasia: challenges and complexities. F1000Research. 2020;9 [DOI] [PMC free article] [PubMed] [Google Scholar]