Abstract
The prevalence of concomitant proteinopathies and heterogeneous clinical symptoms in neurodegenerative diseases hinders the identification of individuals who might be candidates for a particular intervention. Here, by applying an unsupervised clustering algorithm to post-mortem histopathological data from 895 patients with degeneration in the central nervous system, we show that six non-overlapping disease clusters can simultaneously account for tau neurofibrillary tangles, α-synuclein inclusions, neuritic plaques, inclusions of the transcriptional repressor TDP-43, angiopathy, neuron loss and gliosis. We also show that membership to the six transdiagnostic disease clusters, which explains more variance in cognitive phenotypes than can be explained by individual diagnoses, can be accurately predicted from scores of the Mini-Mental Status Exam, protein levels in cerebrospinal fluid, and genotype at the APOE and MAPT loci, via cross-validated multiple logistic regression. This combination of unsupervised and supervised data-driven tools provides a framework that could be used to identify latent disease subtypes in other areas of medicine.
Age-related neurodegenerative diseases affect more than 7 million Americans1, accounting for more than US$500 billion in healthcare costs annually2. This public health issue is projected to worsen1,3,4 as life expectancy increases and the US population continues to skew towards older individuals. The neurodegenerative disease umbrella includes major clinicopathological entities such as Alzheimer’s disease5, Parkinson’s disease6 and frontotemporal dementia7, in addition to less common diseases such as progressive supranuclear palsy (PSP)7, corticobasal degeneration (CBD)7 and multiple systems atrophy8. Neurology is in dire need of translational research that accounts for the heterogeneous presentations of neurodegeneration to facilitate the development of targeted treatments.
Decades of evidence support the notion that pathological protein aggregation is a primary disease process in neurodegeneration9,10. These protein aggregates may spread along large white matter fibres over time, causing dysfunction in distant regions11,12, and their toxicity is thought to be mediated in part by the inflammatory system13,14. Different neurodegenerative syndromes are characterized by aggregation of specific proteins; classically, Alzheimer’s disease involves both amyloid-β and microtubule-associated protein tau9,10; Parkinson’s disease involves α-synuclein6; and frontotemporal dementia can involve tau7,15 or transactive response DNA binding protein 43 kDa (TDP-43)16.
Despite this apparent specificity, aggregation of amyloid-β, tau, α-synuclein and TDP-43 can be found post-mortem in virtually all brains with neurodegenerative disease as well as brains from cognitively healthy individuals17–20. To complicate matters further, numerous in vitro and animal studies have demonstrated that these proteins interact to produce unique, concomitant dysfunction21–24. Moreover, there are multiple mechanisms by which molecular pathology causes cellular dysfunction25 (that is, angiopathy, gliosis, neuronal cell death or signalling dysfunction), and cellular dysfunction may be a more specific marker of cognitive dysfunction than plaque burden alone26,27. Additionally, most studies have examined the overlap of only two or three aggregates simultaneously or have preselected subjects with specific diagnoses. Thus, the latent structure of copathology that may emerge when all regions, aggregates and patients are considered simultaneously remains unknown.
The healthcare system is familiar with the problem of highly complex biology underlying variable clinical presentations. The ever-decreasing costs of computing hardware and easily accessible programming libraries for advanced computational approaches, such as machine learning and network science, provide tools to parse heterogeneity by defining new data-driven disease subtypes. These tools have been applied in multiple contexts in cancer biology28,29, epilepsy30,31 and psychiatry32–34. Machine learning techniques and network approaches have been utilized in speech recordings35, neuroimaging36,37 and clinical data38 of patients with neurodegenerative diseases, but have thus far been underutilized in the field of neuropathology, in which multiple forms of pathological protein aggregates with different morphologies can be measured alongside cellular dysfunction. It is also difficult to map specific imaging phenotypes or biomarkers to a particular disease, due to inaccuracies in clinical diagnoses39. Indeed, the gold standard for identifying a particular neuropathological syndrome is evaluation of proteinopathic burden on autopsy5,7,40.
Here, instead of focusing on 1 or 2 diseases or proteins, we simultaneously analysed copathology between 7 types of pathology (α-synuclein inclusions, tau neurofibrillary tangles, TDP-43 inclusions, neuritic plaques, neuronal loss, angiopathy and gliosis) across 15 brain regions (98 total features) in a sample of 895 patients evaluated by expert neuropathologists on autopsy (Fig. 1a,b). Next, we used a data-driven clustering approach that assigns each patient to a single, data-driven, transdiagnostic ‘disease cluster’ while accounting for all available forms of pathology (Fig. 1b). We evaluated this approach alongside the existing model of neurodegenerative disease, in which diseases are defined by one or two protein species and patients simultaneously meet criteria for multiple diagnoses. The resulting clusters grouped together diseases known to be driven by the same pathogenic protein, providing a data-driven confirmation of traditional disease classification schemes. Additionally, we found a separate cluster containing strong copathology between neuritic plaques, Lewy bodies and tau neurofibrillary tangles. These disease clusters, which were defined solely by histopathology, differed in terms of cognitive phenotypes, cerebrospinal fluid (CSF) protein levels, and genotype at the APOE and MAPT loci. Finally, using a random forest classifier, we achieved accurate identification (area under the receiver-operator characteristic curve (AUC) = 0.85) of the data-driven tauopathy cluster from a heterogeneous clinical population using only data available in vivo, which exceeded our ability to distinguish individual tauopathies (AUC < 0.75). Our findings complement current definitions of neurodegenerative disease syndromes and provide clinicians with greater explanatory power for parsing disease heterogeneity in the context of existing biomarkers.
Fig. 1 |. Schematic of data processing.
a, The burden of amyloid-β plaques, α-synuclein plaques, tau neurofibrillary tangles, TDP-43 inclusions, ubiquitin, neuritic plaques, angiopathy, gliosis and neuron loss was evaluated on a five-tier ordinal scale (0, rare, 1+, 2+ or 3+) via cerebral autopsy in 895 patients through the Integrated Neurodegenerative Disease Database79. Evaluation of pathological burden was performed for all proteins for the listed regions, in addition to the substantia nigra and locus coeruleus, which are hidden for ease of visualization. Dentate gyrus and CA1–subiculum were quantified separately but are shown together here as the hippocampus for ease of visualization. Abbreviations for brain regions can be found in Supplementary Table 1. b, We computed an 895 × 895 similarity matrix in which element i,j contains a polychoric correlation (r) between pathology score vectors for patient i and patient j. Next, we used k-medoids clustering to assign each patient to a data-driven disease cluster. α-Syn, α-synuclein. c, Using linked data from CSF protein testing and genotyping, we trained statistical models to predict membership to disease clusters.
Results
Copathology-driven clusters group heterogeneous diseases by underlying proteinopathies.
Co-occurring pathology is a common feature of several neurodegenerative disorders17, complicating the interpretation of diagnoses as uncovering mono- or di-proteinopathic disease processes; however, copathology has mostly only been studied in a pairwise or triadic fashion, often in subsets of patients with a specific disease. As a result, the extent to which individual diagnoses represent well-defined groups when considering the full space of copathology and patients remains unknown. To address this problem, we sought to identify latent, data-driven clusters of neurodegenerative disease patients that explicitly account for copathology. Here, using a sample of 895 autopsy cases with various neurodegenerative diagnoses, we computed the polychoric correlation41,42 between vectors of pathology scores for each pair of patients as a measure of pairwise interpatient similarity across 98 features of molecular and cellular pathology (Fig. 2a). These pathological features included measurements of different types of proteinopathic features (thioflavin-staining neuritic plaques, tau neurofibrillary tangles, α-synuclein inclusions or TDP-43 inclusions) or types of histological features (angiopathy, gliosis or neuron loss) taken from up to 15 brain regions. When patients are ordered by their arbitrary subject number, the resulting matrix of inter-subject polychoric correlations has no obvious structure (Fig. 2a). When patients are ordered by primary histopathological diagnosis, block-like structure becomes evident, indicating similar histopathological findings in patients with the same primary diagnoses (Fig. 2b). However, we also observe large, positive correlation values on the off-diagonal blocks, indicating similarity in histopathological findings that is unexplained by the primary diagnosis.
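As a rough illustration of the interpatient similarity matrix described above, the following sketch correlates ordinal score vectors for every pair of patients. It is not the procedure used in the study: Spearman rank correlation is substituted for the polychoric correlation, which would require estimating latent thresholds. The 0 to 4 integer encoding of the ordinal scale, the use of None for unscored features, the function names and the toy patients are all assumptions made for illustration.

```python
from statistics import mean

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def similarity_matrix(scores):
    """Patient-by-patient matrix of correlations over jointly scored features."""
    n = len(scores)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = [(a, b) for a, b in zip(scores[i], scores[j])
                      if a is not None and b is not None]
            r = spearman([a for a, _ in shared], [b for _, b in shared])
            sim[i][j] = sim[j][i] = r
    return sim

# Hypothetical patients, each scored on six features (0 = none ... 4 = 3+).
patients = [
    [0, 1, 2, 3, 4, 4],
    [0, 0, 2, 3, 4, 3],
    [4, 3, 2, 1, 0, 0],
]
similarity = similarity_matrix(patients)
```

In this toy example the first two patients have similar profiles and correlate strongly, whereas the third has a reversed profile and correlates negatively with the first.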
Fig. 2 |. Unsupervised clustering of copathology groups disease entities into proteinopathy families.
a, We computed a matrix of polychoric correlations between vectors of pathology scores for each pair of subjects across all available pathological features to quantify the similarity in pathology scores. b, The same matrix as in panel a, where rows and columns are ordered by primary histopathologic diagnosis. Black lines along the diagonal mark blocks of patients with the same diagnosis. AD, Alzheimer’s disease; PiD, Pick’s disease; oth, other; T–O, tau other. c, The same matrix as in panel a, where rows and columns are ordered by a partition detected through k-medoids clustering. Black lines along the diagonal mark blocks of patients grouped into the same cluster. d, Representative vectors of pathology scores for each cluster (cluster centroids) demonstrate distinct profiles of pathology that map to underlying molecular drivers of disease, including tau, amyloid-β, TDP-43 and α-synuclein. Thio, thioflavin-staining neuritic plaques. e, Composition of each cluster in terms of primary histopathologic diagnoses (see Methods section ‘Sample construction’). Each cluster is composed of disease entities that are putatively caused by the protein most highly represented in the cluster’s centroid. Counts placed above stacked bars indicate the number of patients in each cluster. f, In a subset of patients, all of whom have a primary diagnosis of Alzheimer’s disease (high or intermediate ADNC), we show the composition of each cluster in terms of secondary histopathologic diagnosis. Counts placed above stacked bars indicate the number of patients with Alzheimer’s disease in each cluster. ADNC is identified through ABC staging43. See Methods section ‘Sample construction’ for definition of ‘tau other’ and ‘other’.
To parse the observed overlap between disease entities, we grouped patients according to their distributions of pathology using an unsupervised clustering algorithm known as k-medoids (Fig. 2c; see Methods and Supplementary Fig. 7 for details, including selection of k and assessment of reliability). Notably, this algorithm was agnostic to histopathologic diagnosis, yet consistently grouped histopathologic diagnoses together by their underlying molecular drivers. This became evident when we constructed a representative patient for each cluster by calculating the average pathology scores across all patients in that cluster. Specifically, we saw clusters characterized by tau (cluster 1), amyloid-β and tau (cluster 2), TDP-43 (cluster 3) and α-synuclein (cluster 4) (Fig. 2d). Indeed, the subjects belonging to cluster 1 harboured diagnoses of primary tauopathies7, such as Pick’s Disease, PSP and CBD (Fig. 2e). Cluster 2 was composed primarily of patients with high levels of Alzheimer’s disease neuropathologic change (ADNC)43 (Fig. 2e). Cluster 3 exhibited strong representation of TDP-43 proteinopathies, namely frontotemporal lobar degeneration (FTLD) with TDP-43 aggregates (FTLD–TDP) and amyotrophic lateral sclerosis (ALS) (Fig. 2e). Cluster 4 contained primarily synucleinopathies, housing patients with a spectrum of Lewy body disease (LBD) and multiple systems atrophy (MSA) (Fig. 2e). Cluster 5 contained strong copathology between α-synuclein and ADNC, and cluster 6 contained individuals with minimal cerebral pathology (Fig. 2d), who mostly met criteria for low ADNC and/or ALS. We did not find any differences in the prevalence of early-onset Alzheimer’s disease between clusters. When we plotted the mean pathology scores for each cluster on spatial maps of the brain (Supplementary Fig. 8a) using neuroimaging tools44,45, it became apparent that the amygdala and hippocampus tended to have relatively high pathology scores across all clusters.
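The clustering step can be sketched in a few lines. The example below is a minimal k-medoids routine that alternates between assigning patients to their nearest medoid and re-electing the medoid of each cluster. It operates on a dissimilarity matrix taken here as 1 − r. That conversion, the deterministic farthest-first initialisation and the toy similarity values are assumptions made for illustration, and the study's own implementation and its procedure for selecting k are described in its Methods.

```python
def assign(dist, medoids):
    """Label each patient with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]

def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a symmetric dissimilarity matrix."""
    n = len(dist)
    # Start from the most central patient, then add the farthest remaining ones.
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(max(candidates,
                           key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster empties
                updated.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the rest.
            updated.append(min(members,
                               key=lambda i: sum(dist[i][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(dist, medoids), medoids

# Hypothetical similarity matrix with two blocks of mutually similar patients.
sim = [
    [1.00, 0.90, 0.80, 0.10, 0.00, 0.10],
    [0.90, 1.00, 0.85, 0.00, 0.10, 0.20],
    [0.80, 0.85, 1.00, 0.10, 0.20, 0.00],
    [0.10, 0.00, 0.10, 1.00, 0.90, 0.80],
    [0.00, 0.10, 0.20, 0.90, 1.00, 0.95],
    [0.10, 0.20, 0.00, 0.80, 0.95, 1.00],
]
dist = [[1.0 - r for r in row] for row in sim]
labels, medoids = k_medoids(dist, k=2)
```

Because each medoid is an actual patient, the routine needs only pairwise dissimilarities, which suits a correlation-based similarity matrix for which feature-space means are not directly meaningful.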
Collectively, these findings suggest that the solution of our clustering algorithm respects the known hierarchy of neurodegenerative diseases, which is driven by aggregation of specific pathogenic proteins, and captures a known pattern of prominent copathology17,18,46,47 as a distinct group. We also applied an alternative approach called exploratory factor analysis42,48, which independently confirmed the finding that each pathological protein contributes a large amount of latent structure to the 98-dimensional pathology space (Supplementary Fig. 10a).
Targeted investigation of α-synuclein and Alzheimer’s disease copathology.
After establishing the similarity between our clustering solution and existing schemas for categorizing neurodegenerative disease, we next explored how our clustering approach might expand on these schemas by its explicit treatment of copathology. A previous study found distinct spatial patterns of α-synuclein pathology in patients with ADNC and LBD49, but the patterns of copathology between ADNC (tau and amyloid-β) and α-synucleinopathy remain unstudied. As a first step towards measuring copathology, we visualized the distribution of secondary diagnoses in each cluster given to all patients with intermediate (iAD) or high (hAD) levels of ADNC. In clusters 1 and 3, individuals with iAD had secondary diagnoses of tauopathies and TDP-43 proteinopathies, respectively (Fig. 2e,f), suggesting that moderate ADNC is within the histopathological bounds of these latent disease groups. Interestingly, in clusters 2, 4 and 5, we found secondary diagnoses of LBD in individuals with iAD and hAD (Fig. 2e,f), suggesting that the algorithm identified three subgroups of patients with concomitant ADNC and α-synuclein pathology.
To probe these subgroups further, we performed a targeted investigation of the pathology scores found in patients with both Alzheimer’s disease and LBD in clusters 2, 4 and 5 (Fig. 3). First, we isolated patients in clusters 2, 4 and 5 with both LBD and intermediate (iAD) to high (hAD) levels of ADNC43. Next, we calculated the median score of each pathological feature across subjects with LBD and iAD or hAD (Fig. 3a), which revealed three visually different patterns of cerebral copathology. Interestingly, we found that cluster 5 represents a strong Alzheimer’s disease–α-synuclein neocortical copathology subgroup, cluster 2 represents a ‘pure’ Alzheimer’s disease subgroup with a mild amygdalar synucleinopathy, and cluster 4 represents a ‘pure’ α-synuclein subgroup with a mild, limbic-predominant tauopathy and little neuritic plaque pathology.
Fig. 3 |. Comparison of ADNC and Lewy body copathology clusters.
a, Median pathology scores for patients with intermediate to high ADNC and LBD in cluster 2 (left), cluster 4 (middle), and cluster 5 (right), represented as a region × type matrix of pathological features. b, Matrix of pairwise comparisons of median pathology scores for each pathological feature, where colour axis reflects the indicated difference in pathology. *PFDR < 0.05. FDR, false discovery rate100; PFDR, the P value after correcting for multiple comparisons by controlling FDR at <0.05; Cing, anterior cingulate cortex; SMT, superior-middle temporal cortex; MF, middle frontal gyrus; Ang, angular gyrus; CS, CA1/subiculum; EC, entorhinal cortex; Amyg, amygdala; TS, thalamus; CP, caudate-putamen; GP, globus pallidus; SN, substantia nigra; Med, medulla; CB, cerebellum; MB, midbrain. Refer to Supplementary Table 1 for a tabulation of abbreviations.
Next, as pathology scores were not normally distributed, we used pairwise Wilcoxon rank-sum tests at every one of the 98 pathological features to quantify the differences in median pathology scores between the three groups (Fig. 3b). Compared with cluster 2, cluster 5 had stronger widespread α-synuclein pathology (Fig. 3b, middle; PFDR < 0.05 for all regions except amygdala and cerebellum), weaker cortical tau pathology (Fig. 3b, middle; PFDR < 0.05 for anterior cingulate, superior middle temporal lobe, mid-frontal cortex and angular gyrus) and similar neuritic plaque pathology (Fig. 3b, middle; PFDR > 0.05 except for in CA1–subiculum). Additionally, the downstream effects of this pathology, as measured by neuron loss and gliosis, also differed between these two groups. Despite having a similar cortical neuritic plaque burden and increased cortical α-synuclein pathology, cluster 5 had less cortical gliosis and neuron loss than cluster 2 (Fig. 3b, middle; PFDR < 0.05 for superior middle temporal lobe, mid-frontal cortex and angular gyrus), suggesting that cortical gliosis and neuron loss are more tightly linked to tau pathology than α-synuclein pathology. Importantly, these two groups do not differ in their median age at death (Supplementary Fig. 5a) or onset of disease (Supplementary Fig. 5b), suggesting that pathological differences are not explained by differing time courses of disease. These findings suggest that in the presence of α-synuclein copathology, an alternative pattern of ADNC with decreased cortical tau and similar amyloid-β may occur.
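The feature-wise testing described above can be illustrated with a short sketch: a two-sided rank-sum test per feature, followed by Benjamini-Hochberg adjustment across features. The normal approximation with a tie correction is an assumption of this sketch rather than a detail taken from the study, and the feature names and scores are hypothetical.

```python
import math

def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted(list(x) + list(y))
    rank_of, tie_term, i = {}, 0.0, 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1] == pooled[i]:
            j += 1
        t = j - i + 1                       # size of this group of tied values
        rank_of[pooled[i]] = (i + j) / 2 + 1
        tie_term += t ** 3 - t
        i = j + 1
    u = sum(rank_of[v] for v in x) - n1 * (n1 + 1) / 2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:                       # every observation is identical
        return 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):            # walk from the largest p value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical ordinal scores (0 to 4) for two features in two groups.
group_a = {"amygdala_syn": [0, 1, 0, 1, 0, 1, 1, 0],
           "angular_tau": [3, 4, 3, 4, 4, 3, 4, 4]}
group_b = {"amygdala_syn": [3, 4, 4, 3, 4, 3, 4, 4],
           "angular_tau": [3, 4, 4, 3, 3, 4, 4, 3]}
features = sorted(group_a)
raw = [rank_sum_p(group_a[f], group_b[f]) for f in features]
p_fdr = dict(zip(features, bh_adjust(raw)))
```

With only five possible score values, ties are pervasive, which is why the variance term includes the tie correction. An exact test would be preferable for very small groups.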
Disease clusters exhibit unique in vivo phenotypes.
After generating data-driven disease categories solely on the basis of post-mortem pathology, we sought to characterize patients within each cluster in terms of in vivo phenotypes. First, we evaluated cognition in a subsample of n = 159 patients with available data from the Montreal Cognitive Assessment (MoCA)50. The limited availability of MoCA data was due to variable protocols across the multiple clinical cores contributing to the Integrated Neurodegenerative Disease Database. To compare MoCA scores between clusters, we defined Mi−j as the median difference between scores from patients in cluster i and scores from patients in cluster j. Because scores were not all normally distributed, we used the Wilcoxon rank-sum test to evaluate the null hypothesis that Mi−j = 0 for each MoCA score and all unique pairs i,j where i ≠ j. We found significant differences in the overall MoCA scores between clusters (Supplementary Fig. 4a), with cluster 2 harbouring the lowest scores (median MoCA = 11, PFDR < 0.05 with clusters 1, 4 and 6), and cluster 5 (M5−6 = −7.00, n = 52, PFDR = 0.0016) and cluster 4 (M4−6 = −6.00, n = 66, PFDR = 0.0050) containing lower scores than cluster 6, which was characterized by minimal pathology (Fig. 2d). Interestingly, overall MoCA scores did not significantly differ between cluster 5 and cluster 2 or cluster 4 (Supplementary Fig. 4a; M5−2 = 4.00, n = 76, PFDR = 0.077; M5−4 = −1.00, n = 72, PFDR = 0.56). These findings suggest that our data-driven clusters capture cognitive differences, and that the identified subtype of ADNC–α-synuclein copathology is not characterized by globally worse cognition, contrary to what has previously been shown in more general investigations of Alzheimer’s disease–α-synuclein copathology17,51.
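To make the pairwise quantity Mi−j concrete, the sketch below enumerates the unique cluster pairs and computes a median difference for each. The text does not state which estimator was used. Reading Mi−j as the median of all between-cluster score differences (the Hodges-Lehmann location shift) is an assumption of this sketch, and the difference of the two cluster medians is another plausible reading. The scores are hypothetical.

```python
from itertools import combinations
from statistics import median

def median_difference(scores_i, scores_j):
    """Median of all pairwise differences a - b, with a in cluster i and b in cluster j."""
    return median(a - b for a in scores_i for b in scores_j)

# Hypothetical total scores for six clusters.
scores = {
    1: [24, 26, 22, 25],
    2: [10, 12, 11, 14],
    3: [23, 21, 26],
    4: [18, 20, 17, 19],
    5: [15, 17, 16],
    6: [25, 27, 24, 26],
}

# Six clusters give 15 unique unordered pairs (i, j) with i != j.
pairs = list(combinations(sorted(scores), 2))
shifts = {(i, j): median_difference(scores[i], scores[j]) for i, j in pairs}
```

A negative value for a pair (i, j) indicates that cluster i tends to score lower than cluster j.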
Next, we wanted to determine whether these differences in overall MoCA score were harboured in specific MoCA items, or found throughout all items in the assessment. We compared MoCA subscores between clusters and corrected for multiple comparisons across all 15 unique pairwise comparisons for all 6 MoCA subscores (90 total comparisons) by adjusting the FDR (q < 0.05). We found no statistically significant differences between clusters for digit attention and naming items. For repetition, orientation and delayed-recall items, cluster 2 had lower median scores than at least one other cluster, with the largest differences found for orientation testing (Fig. 4; all PFDR < 0.05). However, the visuospatial subsection produced better separation in performance across multiple clusters, with cluster 4 (median, 1; 3rd quartile, 3) and cluster 5 (median, 2; 3rd quartile, 2) exhibiting the lowest scores (Fig. 4; M5−6 = −2.00, n = 52, PFDR = 0.00051; M5−1 = −2, n = 39, PFDR = 0.013; M4−6 = −2, n = 66, PFDR = 0.0011). For this section, the cluster 2 median was lower than the cluster 1 median (Fig. 4; n = 60, M2−1 = −2.00, PFDR < 0.01) and the cluster 3 median (Fig. 4; n = 58, M2−3 = −2.30, PFDR < 0.05). The cluster 1 median and cluster 3 median visuospatial scores were higher than the cluster 4 median scores (Fig. 4; M1−4 = 1.21, n = 89, PFDR < 0.05; M3−4 = 1.50, n = 87, PFDR < 0.05). These results suggest that testing of visuospatial cognition may be more sensitive to α-synuclein pathology or Alzheimer’s disease–α-synuclein copathology than orientation, which is primarily sensitive to a group of Alzheimer’s disease patients with minimal copathology (Fig. 2d,f). Importantly, this trend, although not statistically significant due to a large reduction in power, was still present when we either isolated (Supplementary Fig. 12b) or excluded (Supplementary Fig. 12a) patients with intermediate to high ADNC from clusters 2, 4 and 5, suggesting that our finding crosses the boundaries of neuropathologically defined Alzheimer’s disease and may be characteristic of the data-driven disease cluster.
Fig. 4 |. Comparison of MoCA scores between clusters.
Pairwise intercluster comparisons of median MoCA subscores50 using the Wilcoxon rank-sum test, FDR-corrected for multiple comparisons (q < 0.05) over all pairwise tests for six subscores. Plots were constructed using code from the R package ggpubr101. NS, PFDR > 0.05; *PFDR < 0.05, **PFDR < 0.01, ***PFDR < 0.001 and ****PFDR < 10−6. In box plots, box edges represent the 25th and 75th percentiles, the centre line shows the median and whiskers extend from the box edges to the most extreme data point value that is at most 1.5 × interquartile range (IQR). Data beyond the end of the whiskers are plotted individually as dots. Precise P values can be found in Supplementary Table 3.
In order to probe the apparent item-dependent differences in MoCA scores, we next examined how pathology at individual regions could explain cognitive scores compared to summaries of whole-brain pathology, such as our clusters or ADNC staging. Here, we examined the visuospatial sub-item, the orientation sub-item, and the total Mini-Mental Status Exam (MMSE) score as cognitive measurements. We computed the set of Spearman correlations between each pathological feature (that is, scores for a specific type of pathology in each region) and each cognitive measure (Fig. 5a). This analysis revealed that corticolimbic α-synuclein pathology and neuritic plaques were most strongly negatively related to visuospatial subscores (Fig. 5a, left; amygdalar α-synuclein, ρ = −0.36, PFDR = 3.9 × 10−4; middle frontal gyrus neuritic plaques, ρ = −0.22, PFDR = 0.034). By contrast, corticolimbic tau, neuritic plaques, gliosis and neuron loss were negatively related to orientation subscores (Fig. 5a, middle; middle frontal gyrus tau, ρ = −0.42, PFDR = 1.4 × 10−6) and overall MMSE (Fig. 5a, right; angular gyrus neuron loss ρ = −0.47, PFDR = 4.5 × 10−25).
Fig. 5 |. Disease clusters capture the relationship between cognitive measures and pathology scores.
a, Pairwise Spearman correlations between each pathology score and the MoCA visuospatial subscore (left), the MoCA orientation subscore (middle) or total MMSE score (right) thresholded at PFDR < 0.05, corrected within each subpanel. b, Relative node purity index, a measure of feature importance (Imp.) for random forest models trained to use all pathology scores to predict MoCA visuospatial subscores (left), the MoCA orientation subscores (middle) or total MMSE scores (right). Titles indicate the average model R2 in held-out data over 50 repetitions of fivefold cross-validation. c, Distributions of R2 values for predicting MoCA visuospatial subscores (left), the MoCA orientation subscores (middle) or total MMSE scores (right) in held-out data over 50 repetitions of fivefold cross-validation, using different sets of predictors: all pathology, entire matrix in Supplementary Fig. 9; path type average, average score collapsed over regions for each type of pathological feature, that is, synuclein, TDP-43 and others; regional pathology average, average pathology score for each region collapsed over types of pathology; clusters, binary indicators of cluster membership; ADNC, binary indicators of level of ADNC43; LBD, binary indicators of LBD distribution52; PCA, first six principal components of polychoric correlations of pathology matrix (Supplementary Fig. 11); EFA, first six exploratory factors from Supplementary Fig. 10. In box plots, box edges represent the 25th and 75th percentiles, the centre line shows the median and whiskers extend from the box edges to the most extreme data point value that is at most 1.5 × IQR. Data beyond the end of the whiskers are plotted individually as dots.
It is important to note that these findings are potentially nonspecific due to the covariance between regional pathology scores. To address this covariance, we trained random forest models to predict each cognitive measure from different transformations of the pathology matrix and evaluated the performance of these models in multiple held-out samples. In doing so, we found that knowledge of each patient’s cluster assignment explained nearly as much variance in visuospatial and orientation subscores (Fig. 5c; mean visuospatial R2 = 0.18, mean orientation R2 = 0.20) as did the entire matrix (Supplementary Fig. 9a) of pathology scores (Fig. 5c; mean visuospatial R2 = 0.23, mean orientation R2 = 0.26) and explained more variance than simple regional averages (Fig. 5c; mean visuospatial R2 = 0.14, mean orientation R2 = 0.18), pathological feature type averages for visuospatial scores (Fig. 5c; mean visuospatial R2 = 0.16, mean orientation R2 = 0.23), ADNC level43 (Fig. 5c; mean visuospatial R2 = 0.06, mean orientation R2 = 0.20), or LBD classification52 (Fig. 5c; mean visuospatial R2 = 0.13, mean orientation R2 = 0.05). Collectively, these findings suggest that these data-driven clusters capture cognitively relevant spatial patterns of pathology, which may separate global deficits captured through orientation from more specific deficits in visuospatial tasks.
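The evaluation scheme described above (repeated fivefold cross-validation, with R2 computed on held-out patients) can be sketched as follows. To keep the example self-contained, the random forest is replaced by a much simpler predictor that assigns each held-out patient the mean training score of their cluster, which corresponds to using cluster membership as the only predictor. The R2 definition (one minus the ratio of residual to total sum of squares in the held-out fold), the function names and the simulated data are assumptions of this sketch.

```python
import random
from statistics import mean

def r_squared(observed, predicted):
    """1 - SS_res / SS_tot, computed on held-out observations."""
    centre = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - centre) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def repeated_cv_r2(clusters, scores, folds=5, repeats=50, seed=0):
    """Held-out R2 for a cluster-mean predictor over repeated k-fold splits."""
    rng = random.Random(seed)
    n = len(scores)
    results = []
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        for f in range(folds):
            test = order[f::folds]                 # every folds-th shuffled index
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            grand_mean = mean(scores[i] for i in train)
            by_cluster = {}
            for i in train:
                by_cluster.setdefault(clusters[i], []).append(scores[i])
            cluster_mean = {c: mean(v) for c, v in by_cluster.items()}
            # Fall back to the grand mean for clusters absent from the training fold.
            predicted = [cluster_mean.get(clusters[i], grand_mean) for i in test]
            results.append(r_squared([scores[i] for i in test], predicted))
    return results

# Simulated data: three clusters whose scores differ mainly by cluster.
noise = random.Random(1)
clusters = [i % 3 for i in range(60)]
scores = [10 * c + noise.gauss(0, 1) for c in clusters]
r2_values = repeated_cv_r2(clusters, scores)
```

Fitting the cluster means on training folds only keeps the held-out estimate from being inflated by information leaking from the test patients.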
CSF biomarker profiles of disease clusters.
In addition to cognitive biomarkers of disease, we were also interested to know how our pathology-defined clusters separated patients with respect to the levels of proteins found in CSF, a diagnostic test primarily used to identify Alzheimer’s disease40,53–56 with mixed success in FTLD39. Specifically, we assessed CSF amyloid-β1−42, phosphorylated tau and total tau in a subsample of n = 214 patients with available data (Fig. 6). Again, due to skewed distributions of CSF protein levels, we used the Wilcoxon rank-sum test to evaluate the null hypothesis that the median sample difference between every unique pairwise combination of clusters is equal to 0 for each CSF protein, correcting for multiple comparisons over 15 unique tests for 3 CSF proteins by adjusting the FDR (q < 0.05). We found that median CSF amyloid-β1−42, phosphorylated tau and total tau in cluster 2 and cluster 5 differed in a statistically significant manner from clusters 1, 3, 4 and 6 (all PFDR < 0.01), except for cluster 5 and cluster 3 with respect to total tau (M5−3 = 15.43, n = 46, PFDR = 0.33). However, we did not find statistically significant differences in median amyloid-β1−42, phosphorylated tau or total tau levels between cluster 2 and cluster 5 (all PFDR > 0.05). This finding is consistent with CSF amyloid-β and tau markers demonstrating copathology-independent associations with Alzheimer’s disease pathology39,53, although we included more patients with tauopathies in our sample. As expected, these findings were driven by the large constituents of patients with Alzheimer’s disease in clusters 2 and 5 (Supplementary Fig. 13a). Although the small subset of patients in cluster 2 with no or low ADNC43 tended to have lower CSF amyloid-β1−42 and high phosphorylated and total tau, those trends were not statistically significant, owing to the large reduction in power.
Fig. 6 |. Comparison of CSF protein levels between disease clusters.
Pairwise intercluster comparisons of median CSF protein levels for amyloid-β1−42, phosphorylated tau and total tau using the Wilcoxon rank-sum test, FDR-corrected for multiple comparisons (q < 0.05) over all pairwise tests for all three proteins. Plots were constructed using code from the R package ggpubr101. NS, PFDR > 0.05, *PFDR < 0.05, **PFDR < 0.01, ***PFDR < 0.001 and ****PFDR < 10−6. In box plots, box edges represent the 25th and 75th percentiles, the centre line shows the median and whiskers extend from the box edges to the most extreme data point value that is at most 1.5 × IQR. Data beyond the end of the whiskers are plotted individually as dots. Precise P values can be found in Supplementary Table 4.
Genotypic signatures of disease clusters.
Genetic factors have an important role in determining risk for development of neurodegenerative disease. The ϵ4 allele at the apolipoprotein E (APOE) gene locus is a strong risk factor for Alzheimer’s disease, whereas the ϵ2 allele is thought to be protective57. Interestingly, the H1 haplotype at the gene locus encoding tau (MAPT) has been associated with PSP58, Parkinson’s disease59 and Alzheimer’s disease60: three diseases with putatively distinct aetiologies. We sought to understand how these genetic risk factors might be represented in our pathology-based disease clusters. Using a subsample of 861 patients with genotyping at the APOE and MAPT loci, we measured the representation of genotypes within each cluster (Fig. 7a). Next, using multiple logistic regression, we quantified the change in the log odds of cluster membership given the presence of risk alleles as the regression coefficient β for allele count.
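The interpretation of β as a change in log odds can be illustrated with a minimal single-predictor logistic regression fitted by Newton-Raphson. This is a sketch under stated assumptions, not the model specification used in the study, which may include additional terms. The counts are hypothetical, and the allele count is restricted to 0 or 1 in the toy data so that the fitted β can be checked against the log odds ratio of the corresponding 2 × 2 table.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Fit logit P(y = 1) = intercept + beta * x by Newton-Raphson."""
    intercept = beta = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1 / (1 + math.exp(-(intercept + beta * xi)))
            w = p * (1 - p)
            g0 += yi - p              # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w                  # entries of the information matrix X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        intercept += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return intercept, beta

# Hypothetical counts: y = 1 for membership of cluster i, y = 0 for cluster j.
# Cluster i: 30 carriers and 20 non-carriers; cluster j: 10 carriers and 40 non-carriers.
allele_count = [1] * 30 + [0] * 20 + [1] * 10 + [0] * 40
in_cluster_i = [1] * 50 + [0] * 50
intercept, beta = fit_logistic(allele_count, in_cluster_i)
odds_ratio = math.exp(beta)
```

With a binary predictor, the fitted β equals the logarithm of the cross-product ratio, here log((30 × 40)/(20 × 10)) = log 6. With allele counts of 0, 1 or 2, β would instead describe the change in log odds per additional copy of the allele.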
Fig. 7 |. Prevalence of Alzheimer’s disease risk alleles differs across disease clusters.
a, Within each cluster, we calculated the proportion of each genotype for APOE (left) and MAPT (right). b,d, Matrix of logistic regression β-weights, whose element i,j reflects the increase in log odds ratio for membership to cluster i relative to cluster j given the presence of MAPT H1 (b) or MAPT H2 (d). c, Matrix of logistic regression β-weights, whose element i,j reflects the increase in log odds ratio for membership to cluster i relative to cluster j given the presence of APOE ϵ2 (left), APOE ϵ3 (middle) or APOE ϵ4 (right). NS, PFDR > 0.05; *PFDR < 0.05, **PFDR < 0.01, ***PFDR < 0.001 and ****PFDR < 10−6.
First, we used this approach to investigate how APOE alleles were distributed across our disease clusters. We found that subjects carrying an APOE ϵ4 allele had higher odds of belonging to cluster 2 than to any other cluster (Fig. 7c; cluster 1, β = 1.9, d.f. = 356, PFDR = 1.4 × 10−7; cluster 3, β = 1.4, d.f. = 346, PFDR = 2.8 × 10−6; cluster 4, β = 0.96, d.f. = 380, PFDR = 1.1 × 10−5; cluster 6, β = 1.2, d.f. = 495, PFDR = 1.0 × 10−10) except for cluster 5 (β = 0.12, d.f. = 361, PFDR = 0.66). Similarly, subjects carrying an APOE ϵ4 allele had higher odds of belonging to cluster 5 than to any other cluster (Fig. 7c; cluster 1, β = 2, d.f. = 174, PFDR = 1.5 × 10−6; cluster 3, β = 1.4, d.f. = 164, PFDR = 5.2 × 10−5; cluster 4, β = 0.98, d.f. = 198, PFDR = 0.00062; cluster 6, β = 1.2, d.f. = 313, PFDR = 1.4 × 10−6) besides cluster 2. These results align with previous findings that APOEϵ4 is associated with ADNC, even with α-synuclein copathology61. Also, subjects carrying an APOE ϵ2 allele had higher odds of belonging to cluster 1 (Fig. 7c; β = 1.2, d.f. = 356, PFDR = 0.0043) or cluster 6 (Fig. 7c; β = 1.3, d.f. = 495, PFDR = 0.00042) than to cluster 2. These findings suggest that APOEϵ2 is protective against ADNC, which is less prevalent in cluster 1 and cluster 6 than in cluster 2 and cluster 5.
Next, we used the same regression-based approach to examine how MAPT alleles were distributed across our disease clusters. We found that the presence of two H2 alleles portended lower odds of cluster 1 membership relative to other clusters (Fig. 7d; cluster 2, β = −2.4, d.f. = 357, PFDR = 1.5 × 10−5; cluster 4, β = −1.3, d.f. = 194, PFDR = 0.037; cluster 6, β = −2.4, d.f. = 309, PFDR = 2.3 × 10−5). Similarly, the odds of cluster 1 membership were higher given the presence of an H1 allele relative to cluster 2 (Fig. 7b; β = 0.72, d.f. = 357, PFDR = 0.023) and cluster 6 (Fig. 7b; β = 0.84, d.f. = 309, PFDR = 0.0091). PSP is known to be associated with the MAPT H1 haplotype and was primarily found in cluster 1. Therefore, we repeated the above analysis while excluding PSP patients from cluster 1 to test whether the associations between cluster 1 membership and MAPT haplotypes could be simply explained by the sequestration of PSP patients in cluster 1. When excluding PSP from cluster 1, the relationship between the odds of cluster 1 membership and the presence of an H2 allele was still present (Supplementary Fig. 15a; cluster 2, β = −2.3, d.f. = 312, PFDR = 0.00018; cluster 6, β = −2.3, d.f. = 264, PFDR = 0.00033), although the relationship between cluster 1 odds and H1 allele presence was weakened and not statistically significant (Supplementary Fig. 15a; cluster 2, β = 0.25, d.f. = 312, PFDR = 0.62; cluster 6, β = 0.35, d.f. = 264, PFDR = 0.47). Overall, these findings suggest that our pathology-based clusters automatically produced categories whose genotypic compositions cross boundaries drawn by existing disease labels.
Multivariate classification of disease labels from in vivo biomarkers.
Clinicians currently utilize CSF protein analysis and genotyping in diagnosing neurodegenerative disease7,40,54. However, non-concordant clinical diagnosis39,46 and heterogeneous copathology within existing disease categories hinder the use of these tests to accurately infer a specific disease process afflicting a patient with putative dementia39. Previous studies have demonstrated utility of CSF testing in predicting histopathologically confirmed Alzheimer’s disease39,53,55, FTLD39 or concurrent synucleinopathy in patients with Alzheimer’s disease56. Here, we were interested in quantifying the utility of multiple in vivo biomarkers in explicitly predicting data-driven groupings of patients with copathology (Fig. 2) compared with labels from existing disease definitions. In 287 patients (194 patients with both pathology and CSF data, as well as 93 additional clinically unaffected controls with CSF data only), we trained multiple logistic regression and random forest decision tree models to identify a single class of disease label out of the remaining heterogeneous group of patients using CSF amyloid-β1−42, phosphorylated tau, total tau protein levels and MMSE scores. In this way, we converted a multi-class prediction problem to a one-versus-all scenario to increase the available sample size for model training. Additionally, compared to a pairwise comparison between two disease entities55,56, a one-versus-all prediction requires no clinical priors or narrowing of problem scope to a particular syndrome, which may be unrealistic given the weak correspondence between clinical diagnosis and underlying neuropathology39,46. Importantly, compared to previous studies53,55,56 that predict histopathologically confirmed diagnoses from CSF, we did not preselect or exclude certain diagnoses from our heterogeneous sample, we tested nonlinear models, we utilized cognitive and genetic data and we report only cross-validated algorithm performance in data not used to train our models.
Interestingly, we found that the out-of-sample AUC values were largely similar with the inclusion of genotype data (Supplementary Fig. 16) and use of random forest models62, a class of model that utilizes decision trees and can learn non-linear relationships (Supplementary Fig. 17). These results suggest that CSF protein levels and APOE/MAPT genotype explain common variance in disease status. However, both genetic data and random forest models boosted AUC values for predicting primary tauopathies (PSP and CBD) and our primary tauopathy-driven cluster 1 (Supplementary Figs. 16 and 17). Notably, disease labels could still be predicted from genotype alone with above-chance accuracy (Supplementary Fig. 18).
Here we present the out-of-sample performance of a multiple logistic regression model using only CSF protein levels as features, but results with inclusion of genotypic data and use of random forest models are presented in the Supplementary Information (Supplementary Figs. 16 and 17). Confirming previous findings39,53, the model was able to identify intermediate to high ADNC with mean AUC = 0.91 and clinically unaffected patients with mean AUC = 0.90 (Fig. 8a). Performance in identifying neuropathologic diagnoses of LBD or FTLD–TDP was weaker but still above chance, with mean AUCs of 0.66 and 0.78, respectively (Fig. 8a). PSP and CBD could be identified by generalized linear models with AUCs of 0.6 and 0.72 (Fig. 8a), respectively, but random forest models could identify each of these diseases with an AUC of 0.75 (Supplementary Fig. 17). Collectively, these findings suggest that CSF protein can be used to distinguish patients with a particular traditional neuropathologic diagnosis from a group of patients with a heterogeneous mix of neurodegenerative diseases.
Fig. 8 |. Identifying disease labels from initial testing of CSF protein.
a,b, Characteristics of prediction of existing diagnoses (a) or disease clusters (b) in held-out testing data using multiple logistic regression to predict disease labels from CSF protein levels. Sub-panels (i) and (ii) show the test-set sensitivity and specificity, respectively, using a threshold value of 0.5. Subpanel (iii) shows the area under the curve (AUC) on the test set, reflecting performance over a range of threshold values. Bar length represents mean performance, and error bars indicate 95% confidence intervals over 100 repetitions of k-fold cross-validation at k = 5; mean value and 95% confidence interval are shown in each bar. Subpanel (iv) shows representative receiver-operator characteristic curves for test-set predictions of existing diagnoses (a) or disease clusters (b). Subpanel (v) shows mean standardized multiple logistic regression β weights across 100 repetitions of k-fold cross-validation at k = 5 in the prediction task. The β weights can be interpreted as the increase in log odds ratio for a one s.d. increase in the value of the predictor. TPR, true positive rate; FPR, false positive rate; total tau, total CSF tau protein; phosph. tau, total CSF phosphorylated tau; amyloid-β1−42, total CSF amyloid-β1−42.
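The three test-set metrics named in this caption can be illustrated with a minimal sketch. It assumes that a predicted probability equal to the threshold counts as a positive call and that AUC is computed from pairwise rank comparisons; neither detail is stated in the text. The function names and the toy labels and probabilities are hypothetical and are not taken from the study.

```python
def sensitivity_specificity(y_true, scores, threshold=0.5):
    """Sensitivity (TPR) and specificity (1 - FPR) at a fixed threshold."""
    # Assumption: a score equal to the threshold is called positive.
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy held-out labels (1 = positive for the disease label) and predicted probabilities.
y_test = [1, 1, 1, 0, 0, 0]
p_hat = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
sens, spec = sensitivity_specificity(y_test, p_hat)
area = auc(y_test, p_hat)
```

Because AUC depends only on the ranking of predicted probabilities, it summarizes performance over all thresholds, whereas sensitivity and specificity describe a single operating point.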
After demonstrating that we could predict existing disease labels with high accuracy from CSF protein levels, we next investigated whether the disease clusters we identified could improve our ability to infer histopathologic syndromes from clinical data by comparing the AUC for identifying each data-driven cluster to the AUCs for identifying the individual diseases that constitute each cluster (Fig. 2e,f). With the same group of patients, we carried out the same identification procedure using disease cluster membership as class labels instead of existing disease definitions. We achieved mean AUCs of 0.79, 0.88, 0.82, 0.72, 0.74 and 0.61 for each cluster, respectively (Fig. 8b). While the AUC for identifying intermediate to high ADNC (0.91) was higher than the AUC for identifying cluster 2 (Fig. 8b), our data-driven tauopathy and TDP-43 proteinopathy clusters appeared to be easier to identify than their constituent diseases. Interestingly, we were able to identify cluster 1 with a mean AUC of 0.86 using a random forest algorithm and genetic data, which was better than the AUCs for PSP and CBD by 0.11 with the same approach (Supplementary Fig. 17; AUC = 0.86 versus 0.75). Using linear models, we also identified cluster 3 with a higher AUC by 0.04 than FTLD–TDP (AUC = 0.82 versus AUC = 0.78). The feature weights of these models were similar when predicting a cluster or its constituent diseases (Fig. 8 and Supplementary Fig. 17b,d), suggesting that the cluster-predicting models are more accurate due to the grouping together of multiple diseases with similar biomarker patterns. We also confirmed that our ability to diagnose non-Alzheimer’s disease pathology was not trivially due to our ability to accurately identify Alzheimer’s disease pathology by demonstrating similarly accurate classification of cluster 1 (AUC = 0.79) and cluster 3 (AUC = 0.84) in a subsample of n = 181 patients with no intermediate to high ADNC in clusters 2, 4 and 5 (Supplementary Fig. 19).
These findings suggest that groups of tauopathies can be better resolved than individual tauopathies using CSF amyloid-β1−42 and tau, whereas ADNC may be better identified as currently defined by ABC staging43, regardless of copathology53.
Discussion
Neurodegenerative diseases are typically defined by the increased burden of one or two pathogenic protein species. However, it is the rule rather than the exception that individual patients meet criteria for multiple diseases and exhibit several pathogenic protein aggregates. In this study, we analyse a large post-mortem sample of patients with a diverse representation of neurodegenerative diseases. Using a disease-blind approach, we assigned each patient to a new disease cluster by simultaneously accounting for the levels of 4 key aggregated protein species and 3 histological features across 15 regions. The resulting clusters, defined only from pathology data, differed in terms of cognition, genetics, and CSF protein levels. Finally, we trained statistical models that were able to classify cluster membership with above-chance accuracy from in vivo measurements. This work advances our understanding of the clinical and neuropathologic heterogeneity among neurodegenerative diseases. Furthermore, our methods and approach provide a general framework that could be applied to various clinical populations outside of patients with neurodegeneration.
Accounting for copathology produces transdiagnostic categories of neurodegenerative disease.
Existing definitions of neurodegenerative disease typically require the presence of one or two pathological proteins and reflect decades of clinical and scientific consensus5,7,8. As a result of this partial treatment of copathology, the extent to which individual diagnoses represent well-defined groups remains unknown. In the present work, we define 6 non-overlapping disease categories using an unsupervised approach that simultaneously considers 98 pathological features (Fig. 2). These categories appear to be primarily driven by global levels of four proteins (Supplementary Fig. 8; cluster 1: tau, cluster 2: amyloid-β, cluster 3: TDP-43, and cluster 4: α-synuclein), rather than their regional distributions. Cluster 5 contained individuals with strong amyloid-β, tau, and α-synuclein copathology, and cluster 6 contained individuals with minimal cerebral pathology. We found that our clustering approach groups together existing disease entities caused by each pathological protein, while simultaneously parsing copathology between different proteinopathies. Patients with Alzheimer’s disease were found in all 6 clusters, though primarily in cluster 2, consistent with known pairwise measures of copathology between ADNC, α-synuclein or TDP-43 (refs. 17–20). Overall, these results provide a data-driven confirmation of existing models of neurodegenerative disease.
Interestingly, clusters 2, 4 and 5 all contained individuals with both intermediate to high ADNC and evidence of LBD in various regional distributions52. We compared regional pathology scores between patients with intermediate to high ADNC and LBD in each group and found distinct patterns of ADNC–α-synuclein copathology. Cluster 2 had strong neocortical Alzheimer’s disease pathology with amygdalar synucleinopathy, cluster 4 had strong neocortical and limbic synucleinopathy with limbic-predominant tauopathy and minimal neuritic plaque pathology, and cluster 5 had strong cortical Alzheimer’s disease pathology and synucleinopathy. The separation of patients with concomitant amyloid-β, tau and α-synuclein pathology into cluster 2, cluster 4 and cluster 5 suggests that there may exist a pathophysiologically separate form of strong Alzheimer’s disease–α-synuclein copathology, whereas limbic tauopathy and synucleinopathy may be a normal finding alongside intermediate to high ADNC63. Alternatively, this separation could reflect the existence of multiple strains of α-synuclein64, which may differentially interact with tau65.
Despite being defined solely from pathology data, our clusters separated patients phenotypically and genotypically. We found that individual CSF analyte levels and APOE ϵ4 allele representation in cluster 2 and cluster 5 differed from all other clusters, but did not differ between cluster 2 and cluster 5 (Figs. 4 and 7). Cluster 2 trended towards a more pathological biomarker profile than cluster 5, consistent with previous studies using manually selected ‘pure’ Alzheimer’s disease and Alzheimer’s disease–α-synuclein copathology groups39,56. The fact that our data-driven ‘pure’ Alzheimer’s disease cluster (cluster 2) still contained small amounts of limbic synucleinopathy may reflect a pathophysiological link between Alzheimer’s disease and α-synuclein pathology66,67. Interestingly, we found that the orientation and visuospatial MoCA items differentially separated the clusters from one another. While orientation scores were lower in cluster 2 than in other clusters including cluster 5, the visuospatial scores were lowest in cluster 4 and cluster 5 (Fig. 4). These trends were still apparent even when we excluded or isolated patients from clusters 2, 4 and 5 with intermediate to high ADNC, suggesting that visuospatial deficits and intact orientation may distinguish Alzheimer’s disease–α-synuclein copathology from ‘pure’ Alzheimer’s disease. Indeed, performance on a pentagon-copying task was predictive of the development of clinical dementia in patients with Parkinson’s disease68. Importantly, the cluster assignments alone explained more variance in the visuospatial MoCA item than LBD subtype, ADNC level, regional average pathology or average amounts of each type of pathology (Fig. 5), suggesting that multiple forms of cognitive impairment may be captured by this classification system.
Statistical models expand the utility of CSF protein analysis.
An ideal route towards targeted therapies generally involves the identification of a sufficiently homogeneous clinical population whose biological characteristics motivate the use of a targeted treatment. In the case of neurodegenerative disease, phenotypic and genotypic heterogeneity, along with discordance between clinical diagnoses and gold-standard neuropathological diagnoses39, limit a clinician’s ability to map biomarkers to specific neurodegenerative processes. To address this problem, we trained generalized linear models and random forest decision trees to predict histopathologic disease class membership on the basis of CSF protein levels, MMSE scores and genotype at the APOE and MAPT loci. Notably, for all predictive modelling, we included all available patients regardless of diagnosis, incorporated additional cognitively unaffected controls, used the chronologically earliest available CSF sample, and reported model performance on a distribution of held-out samples in order to mirror a realistic clinical scenario. Consistent with previous studies39,53,55, we were able to identify intermediate to high ADNC (AUC = 0.91) and cluster 2 membership (AUC = 0.88) from a heterogeneous clinical population of patients with neurodegenerative disease using CSF protein levels (Fig. 8). Interestingly, incorporation of genetic data and use of a nonlinear random forest model allowed us to distinguish our tauopathy cluster (cluster 1) from all other patients with an AUC of 0.86, compared with AUCs of 0.75 for classifying PSP and CBD, two major constituent diseases of cluster 1. Similarly, we were able to classify cluster 3 with an AUC of 0.82 compared to an AUC of 0.78 for FTLD–TDP. Finally, we demonstrated above-chance accuracy in disease class identification based solely on genotype at two loci.
In sum, these findings suggest an untapped utility in genotyping and CSF protein analysis for identifying subgroups of patients within flexibly defined histopathologic boundaries, using simple statistical models to generate predictions. Such models may be valuable for designing and repurposing pharmaceuticals, immunotherapies or combination therapies69 by allowing their efficacy to be compared against probabilistic estimates of membership to a particular histopathologic disease class. Moreover, predictions based on genotype would be stable throughout an individual’s life span, allowing for prospective evaluation of treatments designed to intervene early in the disease course.
Limitations.
We acknowledge a number of limitations in the present study. First, the semiquantitative assessment of the degree of each pathological feature precludes the use of statistical methods relying on interval or continuous data and is subject to issues of inter-rater reliability. More granularity in the levels of pathology might be obtained through quantitative, automated image analysis70,71. Continuous pathology data would lend itself well to principal component analysis, which might identify dimensions of covarying neuropathological features. The loadings on these dimensions could then be correlated with phenotype in a continuous fashion, rather than generating a hard partition and performing comparisons of means, as in the present work. However, such approaches for quantitative mapping are still being validated, and the benefit of a large sample of manually labelled data vastly outweighs that of a smaller sample of automatically labelled data. We took a first step towards this goal by performing an exploratory factor analysis (Supplementary Fig. 10) and principal component analysis (Supplementary Fig. 11) using polychoric correlations. These methods provided complementary rather than redundant information when viewed alongside the k-medoids clusters that were primarily studied. The fact that we were able to generate meaningful disease categories consistent with earlier work argues that the use of ordinal data is not a drastic limitation, and may inspire others to use similar approaches on datasets that have remained incompletely explored due to their discrete nature.
Another caveat is that the cluster definitions depend on the sample composition and the set of available features. Application of the same approach to samples with different compositions might yield independently interesting results by relative subtyping. Nevertheless, a new observation (from any new patient) could be assigned to one of the clusters analysed in this paper by simply taking the maximum polychoric correlation with each cluster centroid. Adding more pathological features would probably increase the number of stable clusters. The lack of distinct regional patterns in each cluster is probably due to effects of autopsy timing in relation to disease progression and the spatially coarse sampling of the brain (15 regions instead of 70–1,000 in most neuroimaging studies72), the latter of which may be addressed in the future through automated approaches for whole-brain histology73. In our sample, different morphologies of tau were quantified together, but could give rise to new clusters if quantified separately. We also excluded amyloid-β antibody staining from the clustering procedure due to missing data, which precluded us from detecting diffuse amyloid-β plaques. It is also important to acknowledge that our sample consisted primarily of white patients (Supplementary Table 2, 89.6% overall), unlike the population of patients with dementia in the U.S.2,74. Therefore, these findings may not generalize to non-white subjects, nor will they help to explain the increased prevalence of dementia in non-white populations2. Future studies that include additional features and subjects might subdivide the clusters studied here, whereas studies that focus on a smaller subset of features might glean insights about the structure of the data at a different scale.
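The assignment rule mentioned above, in which a new observation joins the cluster whose centroid it correlates with most strongly, can be sketched as follows. This is a minimal illustration under stated assumptions: Pearson correlation stands in for the polychoric correlation, and the two centroids, the four-feature layout and the function names are hypothetical rather than values or code from the study.

```python
import math


def correlation(x, y):
    """Pearson correlation, used here only as a stand-in for polychoric correlation."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def assign_to_cluster(new_scores, centroids):
    """Return the key of the centroid most correlated with the new patient's scores."""
    return max(centroids, key=lambda name: correlation(new_scores, centroids[name]))


# Hypothetical centroids over four pathology features (ordinal scores 1 to 5).
toy_centroids = {1: [5, 4, 1, 1], 2: [1, 1, 5, 4]}
new_patient = [4, 5, 1, 2]
assigned = assign_to_cluster(new_patient, toy_centroids)
```

In this toy case the new patient's profile rises and falls with the first centroid, so it is assigned to cluster 1.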
Finally, the availability of phenotypic measurements (MoCA and CSF protein analysis) is biased by clinical decision making. One can imagine that lumbar puncture followed by CSF protein analysis and detailed cognitive phenotyping were not routinely performed on patients primarily exhibiting motor symptoms. However, our sample size was large enough that we were able to validate our model in a distribution of held-out samples, unlike previous applications of CSF-based predictive models which report training-set accuracy in smaller samples with well-circumscribed disease boundaries55,56. This validation procedure increases confidence in the external validity of our model. Additionally, the diagnostic composition of our CSF sample closely matched the diagnostic composition of the entire sample, suggesting that such biases had minimal impact. Nevertheless, the true model performance ‘in the wild’ cannot be accurately estimated without external validation in an independent sample.
Outlook.
The utility of the statistical models in this study is limited by sample size, but also by the availability of relevant features. Incorporation of quantitative data on clinical symptomatology, along with more complete genomic data, would probably also enhance the predictive value of these models. Several of the disorders studied in the present work have unique clinical attributes and are associated with multiple genetic variants7,75,76. Clinical symptomatology and genotype can be obtained noninvasively, and might be applied more easily as well as explain additional variance in disease. In particular, a model based purely on genotype could theoretically provide an estimate of disease risk at birth, which would allow for the development of preventative therapies targeting early, preclinical disease. Incorporation of additional pathology features would probably produce further granulation of a truly hierarchical disease scheme, only one level of which has been studied in depth here. While we demonstrate how individuals cluster together with respect to neuritic plaques, tau, α-synuclein and TDP-43, other features, such as the type of gliosis found in each region, different morphological strains of tau and α-synuclein, and characteristics of the local inflammatory response to protein aggregates, would probably split the clusters identified here into more granular subtypes. Similarly, incorporation of additional biomarkers emphasized in the β amyloid deposition, pathologic tau and neurodegeneration (AT(N)) framework77, such as PET imaging for amyloid-β or tau, would enhance the biological interpretation of these clusters and probably improve classification accuracy. Unlike CSF protein markers used here, PET imaging directly quantifies the spatial distributions of tau and amyloid-β. CSF amyloid-β is asymptotically related to PET amyloid-β77,78, which may partly explain the smaller regression weights we observed for CSF amyloid-β compared with CSF tau.
Furthermore, the general approach we use in the present work is not specific or limited to neurodegenerative disease. Unsupervised learning can be applied to similarity matrices based on any set of features pertinent to a disease with overlapping pathophysiologic modes, as in multifactorial disorders such as epilepsy, vascular disease or cancer. Crucial to this process is the compilation of large multimodal and multisite datasets that capture a broad range of diagnoses, phenotypes and genotypes. In addition to the potential clinical utility of biomarker-based forecasting of histopathological syndromes, the present work may serve as a model for the use of unsupervised methods to identify data-driven, transdiagnostic disease subtypes in any field.
Methods
Sample construction.
All data were obtained through the Integrated Neurodegenerative Disease Database79,80, hosted by the Center for Neurodegenerative Disease Research at the Hospital of the University of Pennsylvania. A team of expert neuropathologists assessed the extent of 6 molecular pathological features (amyloid-β, neuritic plaques, tau, α-synuclein, TDP-43, and ubiquitin) and 3 cellular pathological features (angiopathy, neuron loss, and gliosis) for 15–20 brain regions during autopsy for 1,659 patients as previously described79,80, totalling 172 pathological features. All patients gave informed consent in accordance with protocols approved by the University of Pennsylvania’s Institutional Review Board. Semiquantitative pathology scores (0, rare, 1+, 2+ or 3+) were converted to integers from 1 to 5 and treated as ordinal data except when computing mean regional pathology scores to visualize cluster centroids. Here, we studied a subset of 895 patients (age at death: mean 73.2 yr, s.d. 11.58 yr; sex: 43.8% female, 56.2% male) over 98 pathological features. See Supplementary Information, ‘Treatment of missing data’, for a discussion of how these pathological features and subjects were selected (Supplementary Fig. 1).
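The conversion of semiquantitative grades to ordinal integers can be written as a simple lookup. The sketch below assumes the order-preserving correspondence 0 → 1, rare → 2, 1+ → 3, 2+ → 4 and 3+ → 5, which the text implies but does not spell out. The names and the use of None for a missing score are illustrative choices.

```python
# Ordered semiquantitative grades, lowest to highest, as listed in the text.
GRADES = ["0", "rare", "1+", "2+", "3+"]

# Assumed order-preserving mapping of grades onto the integers 1 to 5.
GRADE_TO_INT = {grade: rank for rank, grade in enumerate(GRADES, start=1)}


def encode_scores(scores):
    """Convert grade strings to ordinal integers; None marks a missing score."""
    return [GRADE_TO_INT[s] if s is not None else None for s in scores]


example_scores = encode_scores(["0", "rare", "3+", None, "1+"])
```

Only the ordering of these integers carries meaning under an ordinal treatment; the spacing between adjacent values does not.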
Expert neuropathologists assigned up to 5 histopathologic diagnoses to each patient according to well-accepted disease definitions codified in the neuropathology literature; namely, Alzheimer’s disease5,43, ALS7, argyrophilic grain disease7, cerebral amyloid angiopathy5, cerebrovascular disease5, CBD7, FTLD–TDP7, FTLD without TDP-43 inclusions7, dementia with Lewy bodies52, Parkinson’s disease with and without dementia52, multiple system atrophy (MSA)8, primary age-related tauopathy7,81, Pick’s disease7 and PSP7. Counts of primary histopathologic diagnoses in the full sample, along with counts of infrequent diagnoses that we classified as ‘other’ or ‘tau other,’ are shown in Supplementary Fig. 2. We defined Alzheimer’s disease using the ABC staging system defined in Table 3 of ref. 43, which describes Alzheimer’s disease as a spectrum of low, intermediate and high levels of ADNC. Throughout the paper, we note where we observe low levels of ADNC, but consider ‘Alzheimer’s disease’ to be intermediate to high levels of ADNC, which is generally considered sufficient to explain symptoms of dementia43. We combined dementia with Lewy bodies, Parkinson’s disease and Parkinson’s disease with dementia into LBD, because these entities can only be reliably distinguished on the basis of clinical features. We stratified LBD into subtypes occurring in brainstem (bLBD), limbic (lLBD), amygdala-only (aLBD) or neocortical (nLBD) distributions, as described in ref. 52. If LBD subtype was not available for a patient with an entry for LBD, we classified these patients as LBD not otherwise specified (LBD-NOS). The presence of tau Pick bodies, synuclein in glia, tau-tufted astrocytes, and astrocytic plaques and tau-positive oligodendrocytes were used to diagnose Pick’s disease7, MSA8, PSP7 and CBD7, respectively, but were not included in the clustering because they were not measured at the regional level. FTLD–TDP and ALS were diagnosed neuropathologically as described in ref. 7.
All patients were also given a clinical diagnosis by their physician before autopsy. Analyte levels in CSF were measured using a Luminex assay82.
Unsupervised clustering of patients by molecular pathology.
Traditional definitions of neurodegenerative diseases typically only account for one or two species of protein aggregate. Here, we sought to group patients into clusters in an unsupervised manner83,84 in the space defined by the N × p matrix P, where p = 98 is the number of available pathological features and N = 895 is the number of subjects in our sample. To perform clustering of the ordinal semiquantitative pathology scores, we constructed a matrix R whose elements Rij equal the polychoric correlation41,42 between rows i and j of P, corresponding to the vectors of pathology scores for patient i and patient j. Next, we performed k-medoids clustering83 on the dissimilarity matrix D = 1 − R. We used a bootstrapping procedure to select k = 6 on the basis of maximizing the silhouette score and minimizing the area under the curve of the cumulative distribution function of the Jaccard similarity between bootstrapped samples84–87 (for detailed discussion, see Supplementary Information, ‘Optimization and validation of clustering procedure’).
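The dissimilarity-based clustering described above can be sketched as follows. This is a minimal illustration under several assumptions, not the implementation used in the study: Pearson correlation stands in for the polychoric correlation, a simple alternating (assign, then update) k-medoids scheme is used, the initial medoids are supplied by the caller, and the toy matrix has 6 patients, 4 features and k = 2 rather than the 895 patients, 98 features and k = 6 analysed here. All function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation, used here only as a stand-in for polychoric correlation."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def dissimilarity_matrix(P):
    """D = 1 - R, where R[i][j] is the correlation between patients (rows) i and j."""
    n = len(P)
    return [[1.0 - pearson(P[i], P[j]) for j in range(n)] for i in range(n)]


def assign(D, medoids):
    """Label each patient with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: D[i][medoids[c]]) for i in range(len(D))]


def k_medoids(D, initial_medoids, max_iter=100):
    """Alternating k-medoids on a precomputed dissimilarity matrix."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(D, medoids)
        updated = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                updated.append(old)  # keep the old medoid if a cluster empties
                continue
            # New medoid: the member with the smallest total dissimilarity to its cluster.
            updated.append(min(members, key=lambda m: sum(D[m][i] for i in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(D, medoids), medoids


# Toy ordinal pathology scores (1 to 5): rows are patients, columns are features.
P_toy = [
    [5, 4, 1, 1], [5, 5, 1, 2], [4, 5, 2, 1],
    [1, 1, 5, 4], [1, 2, 5, 5], [2, 1, 4, 5],
]
D_toy = dissimilarity_matrix(P_toy)
labels_toy, medoids_toy = k_medoids(D_toy, initial_medoids=[0, 3])
```

Because k-medoids needs only pairwise dissimilarities, any correlation suited to ordinal data can replace the stand-in without changing the clustering code. A scheme of this kind can converge to a local optimum, so in practice it would be run from several initializations.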
Group comparisons of phenotypic measurements.
We used nonparametric testing to compare phenotypes between disease clusters. Using pairwise Wilcoxon rank-sum tests, we compared mean scores in six subdomains of the MoCA50 between all unique combinations of clusters. All P-values were FDR-corrected (q < 0.05) over all comparisons for all six subdomains. The abstraction score was excluded due to missing data. We also used pairwise Wilcoxon rank-sum tests to compare CSF protein levels between all unique combinations of clusters. For patients that underwent multiple CSF studies, we only considered their earliest measurement in order to control for disease progression as much as possible. All P-values were FDR-corrected (q < 0.05) over all comparisons for the three CSF proteins that we assessed.
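The pooling of P values across all pairwise comparisons before FDR correction can be sketched as below. The text does not name the FDR procedure, so the Benjamini-Hochberg step-up adjustment is an assumption. The function name and the four example P values are hypothetical placeholders for the outputs of rank-sum tests computed elsewhere.

```python
from itertools import combinations


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# All unique unordered pairs of the six clusters.
cluster_pairs = list(combinations(range(1, 7), 2))

# Placeholder P values, pooled across comparisons before adjustment.
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)
significant = [p < 0.05 for p in adj_p]
```

With six clusters there are 15 unique pairs per measurement, so the correction spans 15 tests for each of the six MoCA subdomains in one family and 15 tests for each of the three CSF proteins in the other.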
Logistic regression of cluster membership on allele counts.
Assuming a multiplicative genomics model88, we used multiple logistic regression to perform pairwise comparisons of allele distributions between disease clusters, treating cluster membership as a binary phenotype89. First, for each pair of clusters i = 1,..., k and j = 1,..., k, where k = 6, we constructed an Nij × 1 binary outcome vector Yij, where Nij is the number of patients belonging to cluster i or cluster j. The elements of Yij indicate whether a patient belongs to cluster i or cluster j. A value of 1 indicates membership to cluster i and a value of 0 indicates membership to cluster j. Next, we constructed a corresponding Nij × ng allele table Ag, where ng is the number of non-wild-type alleles for a given gene g. The table Ag consists of ng column vectors, Aa, where a = 1,..., ng, and the elements of each column vector Aa indicate how many copies of allele a of gene g are present in each of the Nij patients belonging to cluster i or cluster j. Next, we assumed additive genetic effects and used logistic regression89 to fit the following model to predict cluster membership from the allele table Ag for each gene g:
$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{a=1}^{n_g} \beta_a A_a + \epsilon \qquad (1)$$
where p is the probability that a patient in Yij belongs to cluster i, β0 and βa are parameters obtained by fitting the model and ϵ is an error term. We fit this model for all 15 unique pairwise comparisons of clusters i and j. In this model, β0 is interpreted as the log odds that a patient belongs to cluster i given that the patient has two wild-type alleles. The parameter βa is the change in log odds that a patient belongs to cluster i given the presence of 1 copy of allele a. We adjusted P-values over all comparisons for all genes and all alleles to control the FDR at q < 0.05.
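To make the construction of Yij and the allele table concrete, the sketch below builds both for one pair of clusters and converts a set of coefficients into a membership probability. The cluster labels, allele counts and coefficient values are invented for illustration, and the helper names pairwise_design and membership_probability are hypothetical. No model is fitted here.

```python
import numpy as np
from itertools import combinations

def pairwise_design(cluster, allele_counts, i, j):
    """Return (Y_ij, A) restricted to patients in cluster i or cluster j."""
    keep = (cluster == i) | (cluster == j)
    Y = (cluster[keep] == i).astype(int)  # 1 = cluster i, 0 = cluster j
    A = allele_counts[keep]               # N_ij x n_g allele counts (0, 1 or 2)
    return Y, A

def membership_probability(beta0, beta, counts):
    """Inverse logit of beta0 + sum_a beta_a * A_a."""
    log_odds = beta0 + np.dot(counts, beta)
    return 1.0 / (1.0 + np.exp(-log_odds))

# Toy data: six patients, and a gene with two non-wild-type alleles (n_g = 2).
cluster = np.array([1, 1, 2, 3, 2, 1])
allele_counts = np.array([[0, 1], [2, 0], [1, 1], [0, 0], [0, 2], [1, 0]])

Y, A = pairwise_design(cluster, allele_counts, i=1, j=2)
n_pairs = len(list(combinations(range(1, 7), 2)))  # unique pairs for k = 6

# Hypothetical coefficients: beta0 = 0 gives probability 0.5 for two wild-type
# alleles; beta_2 = ln(2) doubles the odds for each copy of the second allele.
beta0, beta = 0.0, np.array([0.0, np.log(2.0)])
p_wild_type = membership_probability(beta0, beta, np.array([0, 0]))
p_one_copy = membership_probability(beta0, beta, np.array([0, 1]))
print(Y, A.shape, n_pairs, p_wild_type, p_one_copy)
```

Because βa acts on the log odds, exp(βa) is the odds ratio per copy of allele a. In the toy values above, one copy raises the probability of belonging to cluster i from 1/2 to 2/3.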
Supervised learning of disease labels.
In order to demonstrate a practical clinical use for multivariate models of biomarkers, we used data available in vivo to predict whether a patient met criteria for a specific neurodegenerative disease or was classified into a particular data-driven disease cluster. We converted this multi-class prediction scenario into a one-versus-all binary prediction scenario, in which we attempted to predict an N × 1 binary disease label vector Di, whose elements equal 1 for patients positive for disease label i and equal 0 for patients negative for disease label i. Here, the disease labels were Alzheimer’s disease, LBD, FTLD–TDP, PSP, CBD, unaffected individuals, and the four clusters, indexed by i = 1,..., 10, and N is the number of patients with available data, either CSF protein levels, allele counts, or both. We used multiple logistic regression to predict p, where p is the probability that a patient is positive for disease label i, from a feature matrix X, whose dimensions are N × q, where N is the number of patients with available data and q is the number of features. Depending on the analysis, X contained allele counts for non-wild-type alleles of 2 genes (N = 1,312, q = 3), levels of 3 CSF proteins at initial evaluation (N = 268, q = 3), or both allele counts and CSF protein levels (N = 262, q = 6).
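The one-versus-all conversion can be sketched in a few lines. The patients and their label sets are invented, the sketch assumes that a patient may carry more than one label (for example a diagnosis and a cluster), and the function name one_versus_all is hypothetical.

```python
def one_versus_all(patient_labels, label_names):
    """Build one binary vector D_i per label: 1 if the patient carries it, else 0."""
    return {
        name: [1 if name in labels else 0 for labels in patient_labels]
        for name in label_names
    }

# Hypothetical patients, each with a set of disease and cluster labels.
patient_labels = [
    {"Alzheimer's disease", "Cluster 1"},
    {"LBD", "Cluster 2"},
    {"Alzheimer's disease", "LBD", "Cluster 1"},
    {"Unaffected"},
]
label_names = ["Alzheimer's disease", "LBD", "Unaffected", "Cluster 1", "Cluster 2"]

D = one_versus_all(patient_labels, label_names)
for name in label_names:
    print(name, D[name])
```

Each vector in D then serves as the outcome of a separate binary logistic regression on the same feature matrix X.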
When evaluating the performance of a model in predicting disease label i, we wanted to ensure that our results were robust across multiple patient samples and consistent in a held-out sample of patients that were not involved in training the model. Therefore, we used 100 repetitions of k-fold cross-validation90 to obtain estimates and confidence intervals for the out-of-sample performance of each model. For each repetition, we randomly split X and Di along their rows into k = 5 subsets Xj and Dj for j = 1,..., k. When the jth subset was used as the held-out testing dataset, Xj and Dj were the independent and dependent variables, respectively. The remaining k − 1 subsets were compiled into a training dataset, whose independent and dependent variables we refer to as Xr and Dr. In order to aid the training process, we ensured class balance in Xr and Dr by randomly under-sampling their rows91,92, such that 50% of patients in Xr and Dr were positive for disease label i prior to training. Using multiple logistic regression, we fit the following equation with the training data:
$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_r X_r + \epsilon \qquad (2)$$
where p is a vector of probabilities that each patient is positive for disease label i, β0 is an intercept vector, βr is a vector of feature weights obtained through model fitting and ϵ is a vector of random error terms. Next, we used the intercept β0 and feature weights βr obtained by fitting our model to the training data to evaluate the out-of-sample performance of our model. We used the held-out testing data Xj to compute Dp = β0 + βrXj, where Dp is a vector of predicted log odds that each patient is positive for disease label i. We converted Dp from a vector of log odds to a vector of probabilities and evaluated the performance of Dp in predicting Dj. We repeated this procedure for j = 1,..., k until each subset Dj and Xj was used as the held-out data exactly one time. The entire k-fold cross-validation procedure was repeated 100 times for disease label i to generate a distribution of test-set performance metrics (sensitivity, specificity and area under the curve). Reported sensitivity, specificity, and accuracy were obtained using a classification threshold value of 0.5 for Dp, although prediction characteristics for a range of threshold values from 0 to 1 can be found in the receiver-operator characteristic curves.
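The cross-validation procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code: the data are simulated, the logistic model is fitted with a Newton-Raphson routine that includes a small ridge penalty for numerical stability (an addition of this sketch), and the function names fit_logistic, undersample and repeated_cv are hypothetical. Only sensitivity and specificity at a threshold of 0.5 are computed, and 10 repetitions are used instead of 100 to keep the example fast.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, ridge=1e-3):
    """Newton-Raphson fit of ln(p / (1 - p)) = b0 + X @ b; returns [b0, b...]."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        W = p * (1.0 - p)
        grad = Z.T @ (y - p) - ridge * beta
        hess = (Z * W[:, None]).T @ Z + ridge * np.eye(Z.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def undersample(X, y, rng):
    """Randomly drop majority-class rows so that 50% of rows are positive."""
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    n = min(len(pos), len(neg))
    keep = np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
    return X[keep], y[keep]

def repeated_cv(X, y, k=5, reps=100, threshold=0.5, seed=0):
    """Return one (sensitivity, specificity) row per held-out fold."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(reps):
        folds = np.array_split(rng.permutation(len(y)), k)
        for j in range(k):
            test = folds[j]
            train = np.concatenate([folds[m] for m in range(k) if m != j])
            # Balance classes in the training data only.
            Xr, Dr = undersample(X[train], y[train], rng)
            beta = fit_logistic(Xr, Dr)
            # Predicted log odds on held-out data, converted to probabilities.
            log_odds = beta[0] + X[test] @ beta[1:]
            prob = 1.0 / (1.0 + np.exp(-log_odds))
            pred = (prob >= threshold).astype(int)
            truth = y[test]
            tp = np.sum((pred == 1) & (truth == 1))
            tn = np.sum((pred == 0) & (truth == 0))
            sens = tp / max(np.sum(truth == 1), 1)
            spec = tn / max(np.sum(truth == 0), 1)
            results.append((sens, spec))
    return np.array(results)

# Simulated data: 300 patients, q = 3 features, one informative feature.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
true_log_odds = -1.0 + 2.0 * X[:, 0]
y = (rng.random(300) < 1.0 / (1.0 + np.exp(-true_log_odds))).astype(int)

results = repeated_cv(X, y, k=5, reps=10)
mean_sens, mean_spec = results.mean(axis=0)
print(results.shape, mean_sens, mean_spec)
```

Under-sampling is applied only to the training folds, so the held-out fold keeps its original class proportions and the test-set metrics reflect the imbalance present in the data.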
Citation diversity statement.
Recent work in neuroscience and other fields has identified a bias in citation practices such that papers from women and other minorities are under-cited relative to the number of such papers in the field93–98. Here we sought to proactively consider choosing references that reflect the diversity of the field in thought, form of contribution, gender and other factors. We obtained the predicted gender of the first and last author of each reference by using databases that store the probability of a name being carried by a woman93,99. By this measure (and excluding self-citations to the first and last authors of our current paper), our references contain 51.6% man (first)/man (last), 14.3% man/woman, 22% woman/man, and 12.1% woman/woman. This method is limited in that: (1) names, pronouns and social media profiles used to construct the databases may not, in every case, be indicative of gender identity; and (2) it cannot account for intersex, non-binary or transgender people. We look forward to future work that could help us to better understand how to support equitable practices in science.
Reporting Summary.
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
Source data for all figures and pathology scores for the 895 patients analysed here are available from figshare at https://doi.org/10.6084/m9.figshare.12519488.v1. The raw patient data are available from the authors, subject to approval from the Institutional Review Board of the University of Pennsylvania. For data requests, please visit https://www.med.upenn.edu/cndr/biosamples-brainbank.html and complete a Biosample Request Form. Source data are provided with this paper.
Code availability
All analysis code is available at https://github.com/ejcorn/neuropathcluster.
Acknowledgements
D.S.B. and E.J.C. acknowledge support from the John D. and Catherine T. MacArthur Foundation, the Alfred P. Sloan Foundation, the ISI Foundation, the Paul Allen Foundation, the Army Research Laboratory (W911NF-10-2-0022), the Army Research Office (Bassett-W911NF-14-1-0679, Grafton-W911NF-16-1-0474, DCIST-W911NF-17-2-0181), the Office of Naval Research, the National Institute of Mental Health (2-R01-DC-009209-11, R01-MH112847, R01-MH107235, R21-M MH-106799), the National Institute of Child Health and Human Development (1R01HD086888-01), National Institute of Neurological Disorders and Stroke (R01 NS099348) and the National Science Foundation (BCS-1441502, BCS-1430087, NSF PHY-1554488 and BCS-1631550). E.J.C. also acknowledges support from the National Institute of Mental Health (F30 MH118871-01). D.J.I. acknowledges the National Institute of Neurological Disorders and Stroke (R01-NS109260). J.Q.T., V.M.-Y.L. and E.B.L. thank members of the Center for Neurodegenerative Disease Research who contributed to the studies reviewed here. J.Q.T., V.M.-Y.L. and E.B.L. also thank the patients and their families for brain donation. J.Q.T., V.M.-Y.L. and E.B.L. acknowledge funding support from AG10124, AG17586, AG62418 and the Woods Foundation. The authors thank D. Wolk for helpful comments on the manuscript during the review process. The content is solely the responsibility of the authors and does not necessarily represent the official views of any of the funding agencies.
Footnotes
Competing interests
The authors declare no competing interests.
Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41551-020-0593-y.
References
- 1. Hebert LE, Scherr PA, Bienias JL, Bennett DA & Evans DA Alzheimer disease in the US population. Arch. Neurol 60, 1119 (2003).
- 2. Alzheimer’s Association 2019 Alzheimer’s disease facts and figures. Alzheimers Dement. 15, 321–387 (2019).
- 3. Dorsey ER et al. Projected number of people with Parkinson disease in the most populous nations, 2005 through 2030. Neurology 68, 384–386 (2007).
- 4. Brookmeyer R, Gray S & Kawas C. Projections of Alzheimer’s disease in the United States and the public health impact of delaying disease onset. Am. J. Public Health 88, 1337–1342 (1998).
- 5. Hyman BT et al. National Institute on Aging–Alzheimer’s association guidelines for the neuropathologic assessment of Alzheimer’s disease. Alzheimers Dement. 8, 1–13 (2012).
- 6. Jankovic J. Parkinson’s disease: clinical features and diagnosis. J. Neurol. Neurosurg. Psychiatry 79, 368–376 (2008).
- 7. Irwin DJ et al. Frontotemporal lobar degeneration: defining phenotypic diversity through personalized medicine. Acta Neuropathol. 129, 469–491 (2015).
- 8. Gilman S et al. Second consensus statement on the diagnosis of multiple system atrophy. Neurology 71, 670–676 (2008).
- 9. Selkoe DJ & Hardy J. The amyloid hypothesis of Alzheimer’s disease at 25 years. EMBO Mol. Med 8, 595–608 (2016).
- 10. Thal DR & Fändrich M. Protein aggregation in Alzheimer’s disease: Aβ and τ and their potential roles in the pathogenesis of AD. Acta Neuropathologica 129, 163–165 (2015).
- 11. Raj A, Kuceyeski A & Weiner M. A network diffusion model of disease progression in dementia. Neuron 73, 1204–1215 (2012).
- 12. Pandya S, Mezias C & Raj A. Predictive model of spread of progressive supranuclear palsy using directional network diffusion. Front. Neurol 8, 692 (2017).
- 13. Wyss-Coray T. Inflammation in Alzheimer disease: driving force, bystander or beneficial response? Nat. Med 12, 1005–1015 (2006).
- 14. Akiyama H et al. Inflammation and Alzheimer’s disease. Neurobiol. Aging 21, 383–421 (2000).
- 15. Rademakers R, Cruts M & van Broeckhoven C. The role of tau in frontotemporal dementia and related tauopathies (MAPT). Hum. Mutat 24, 277–295 (2004).
- 16. Neumann M et al. Ubiquitinated TDP-43 in frontotemporal lobar degeneration and amyotrophic lateral sclerosis. Science 314, 130–133 (2006).
- 17. Robinson JL et al. Neurodegenerative disease concomitant proteinopathies are prevalent, age-related and APOE4-associated. Brain 141, 2181–2193 (2018).
- 18. Higashi S et al. Concurrence of TDP-43, tau and α-synuclein pathology in brains of Alzheimer’s disease and dementia with Lewy bodies. Brain Res. 1184, 284–294 (2007).
- 19. Nakashima-Yasuda H et al. Co-morbidity of TDP-43 proteinopathy in Lewy body related diseases. Acta Neuropathol. 114, 221–229 (2007).
- 20. Takahashi RH, Capetillo-Zarate E, Lin MT, Milner TA & Gouras GK. Co-occurrence of Alzheimer’s disease β-amyloid and tau pathologies at synapses. Neurobiol. Aging 31, 1145–1152 (2010).
- 21. Fang YS et al. Full-length TDP-43 forms toxic amyloid oligomers that are present in frontotemporal lobar dementia-TDP patients. Nat. Commun 5, 4824 (2014).
- 22. He Z et al. Amyloid-β plaques enhance Alzheimer’s brain tau-seeded pathologies by facilitating neuritic plaque tau aggregation. Nat. Med 24, 29–38 (2017).
- 23. Giasson BI et al. Initiation and synergistic fibrillization of tau and alpha-synuclein. Science 300, 636–640 (2003).
- 24. Clinton LK, Blurton-Jones M, Myczek K, Trojanowski JQ & LaFerla FM. Synergistic interactions between Aβ, tau, and α-synuclein: acceleration of neuropathology and cognitive decline. J. Neurosci 30, 7281–7289 (2010).
- 25. Styr B & Slutsky I. Imbalance between firing homeostasis and synaptic plasticity drives early-phase Alzheimer’s disease. Nat. Neurosci 21, 463–473 (2018).
- 26. Jeong S. Molecular and cellular basis of neurodegeneration in Alzheimer’s disease. Mol. Cells 40, 613–620 (2017).
- 27. Yankner BA & Lu T. Amyloid β-protein toxicity and the pathogenesis of Alzheimer disease. J. Biol. Chem 284, 4755–4759 (2009).
- 28. Li A et al. Unsupervised analysis of transcriptomic profiles reveals six glioma subtypes. Cancer Res. 69, 2091–2099 (2009).
- 29. Liu JJ et al. Multiclass cancer classification and biomarker discovery using GA-based algorithms. Bioinformatics 21, 2691–2697 (2005).
- 30. Bartolomei F et al. Seizures of temporal lobe epilepsy: identification of subtypes by coherence analysis using stereo-electro-encephalography. Clin. Neurophysiol 110, 1741–1754 (1999).
- 31. Cragar DE, Berry DT, Schmitt FA & Fakhoury TA. Cluster analysis of normal personality traits in patients with psychogenic nonepileptic seizures. Epilepsy Behav. 6, 593–600 (2005).
- 32. Xia CH et al. Linked dimensions of psychopathology and connectivity in functional brain networks. Nat. Commun 9, 3003 (2018).
- 33. Drysdale AT et al. Resting-state connectivity biomarkers define neurophysiological subtypes of depression. Nat. Med 23, 28–38 (2017).
- 34. Grisanzio KA et al. Transdiagnostic symptom clusters and associations with brain, behavior, and daily function in mood, anxiety, and trauma disorders. JAMA Psychiatry 75, 201–209 (2018).
- 35. König A et al. Automatic speech analysis for the assessment of patients with predementia and Alzheimer’s disease. Alzheimers Dement. 1, 112–124 (2015).
- 36. Avants BB, Cook PA, Ungar L, Gee JC & Grossman M. Dementia induces correlated reductions in white matter integrity and cortical thickness: a multivariate neuroimaging study with sparse canonical correlation analysis. NeuroImage 50, 1004–1016 (2010).
- 37. Brier MR et al. Tau and Aβ imaging, CSF measures, and cognition in Alzheimer’s disease. Sci. Transl. Med 8, 338ra66 (2016).
- 38. Khanna S et al. Using multi-scale genetic, neuroimaging and clinical data for predicting Alzheimer’s disease and reconstruction of relevant biological mechanisms. Sci. Rep 8, 11173 (2018).
- 39. Toledo JB et al. CSF biomarkers cutoffs: the importance of coincident neuropathological diseases. Acta Neuropathol. 124, 23–35 (2012).
- 40. Kansal K & Irwin DJ. The use of cerebrospinal fluid and neuropathologic studies in neuropsychiatry practice and research. Psychiatr. Clin. North Am 38, 309–22 (2015).
- 41. Olsson U. Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika 44, 443–460 (1979).
- 42. Revelle W psych: Procedures for Psychological, Psychometric, and Personality Research (CRAN, 2019); https://cran.r-project.org/web/packages/psych/index.html
- 43. Montine TJ et al. National Institute on Aging–Alzheimer’s Association guidelines for the neuropathologic assessment of Alzheimer’s disease: a practical approach. Acta Neuropathol. 123, 1–11 (2012).
- 44. Jenkinson M, Beckmann CF, Behrens TE, Woolrich MW & Smith SM. FSL. NeuroImage 62, 782–790 (2012).
- 45. Cammoun L et al. Mapping the human connectome at multiple scales with diffusion spectrum MRI. J. Neurosci. Methods 203, 386–397 (2012).
- 46. Beach TG, Monsell SE, Phillips LE & Kukull W. Accuracy of the clinical diagnosis of Alzheimer disease at National Institute on Aging Alzheimer disease centers, 2005–2010. J. Neuropathol. Exp. Neurol 71, 266–273 (2012).
- 47. Irwin DJ et al. Neuropathological and genetic correlates of survival and dementia onset in synucleinopathies: a retrospective analysis. Lancet Neurol. 16, 55 (2017).
- 48. Holgado-Tello FP, Chacón-Moscoso S, Barbero-García I & Vila-Abad E. Polychoric versus Pearson correlations in exploratory and confirmatory factor analysis of ordinal variables. Qual. Quant 44, 153–166 (2010).
- 49. Toledo JB et al. Pathological α-synuclein distribution in subjects with coincident Alzheimer’s and Lewy body pathology. Acta Neuropathol. 131, 393–409 (2016).
- 50. Nasreddine ZS et al. The Montreal cognitive assessment, MoCA: a brief screening tool for mild cognitive impairment. J. Am. Geriatrics Soc 53, 695–699 (2005).
- 51. Irwin JD & Hurtig IH The contribution of tau, amyloid-beta and alpha-synuclein pathology to dementia in Lewy body disorders. J. Alzheimer’s Dis. Parkinsonism 08, 444 (2018).
- 52. McKeith IG et al. Diagnosis and management of dementia with Lewy bodies: Fourth Consensus Report of the DLB Consortium. Neurology 89, 88–100 (2017).
- 53. Tapiola T et al. Cerebrospinal fluid β-amyloid 42 and tau proteins as biomarkers of Alzheimer-type pathologic changes in the brain. Arch. Neurol 66, 382–389 (2009).
- 54. Kang JH, Korecka M, Toledo JB, Trojanowski JQ & Shaw LM Clinical utility and analytical challenges in measurement of cerebrospinal fluid amyloid-1–42 and proteins as Alzheimer disease biomarkers. Clin. Chem 59, 903–916 (2013).
- 55. Perneczky R et al. CSF soluble amyloid precursor proteins in the diagnosis of incipient Alzheimer disease. Neurology 77, 35–38 (2011).
- 56. Irwin DJ et al. CSF tau and amyloid-β predict cerebral synucleinopathy in autopsied Lewy body disorders Alzheimer disease biomarkers and synucleinopathy. Neurology 90, 1038–1046 (2018).
- 57. Liu CC, Kanekiyo T, Xu H & Bu G. Apolipoprotein E and Alzheimer disease: risk, mechanisms and therapy. Nat. Rev. Neurol 9, 106–118 (2013).
- 58. Baker M. et al. Association of an extended haplotype in the Tau gene with progressive supranuclear palsy. Hum. Mol. Genet 8, 711–715 (1999).
- 59. Tobin JE et al. Haplotypes and gene expression implicate the MAPT region for Parkinson disease: the GenePD study. Neurology 71, 28–34 (2008).
- 60. Myers AJ et al. The H1c haplotype at the MAPT locus is associated with Alzheimer’s disease. Hum. Mol. Genet 14, 2399–2404 (2005).
- 61. Tsuang D. et al. APOE ϵ4 increases risk for dementia in pure synucleinopathies. JAMA Neurol. 70, 223–228 (2013).
- 62. Liaw A. et al. Classification and regression by RandomForest. R News 2, 18–22 (2002).
- 63. Uchikado H, Lin WL, Delucia MW & Dickson DW Alzheimer disease with amygdala Lewy bodies: a distinct form of α-synucleinopathy. J. Neuropathol. Exp. Neurol 65, 685–697 (2006).
- 64. Covell DJ et al. Novel conformation-selective α-synuclein antibodies raised against different in vitro fibril forms show distinct patterns of Lewy pathology in Parkinson’s disease. Neuropathol. Appl. Neurobiol 43, 604–620 (2017).
- 65. Guo JL & Lee VMY Cell-to-cell transmission of pathogenic proteins in neurodegenerative diseases. Nat. Med 20, 130–138 (2014).
- 66. Guo JL et al. Distinct α-synuclein strains differentially promote tau inclusions in neurons. Cell 154, 103–117 (2013).
- 67. Masliah E. et al. β-Amyloid peptides enhance α-synuclein accumulation and neuronal deficits in a transgenic mouse model linking Alzheimer’s disease and Parkinson’s disease. Proc. Natl Acad. Sci. USA 98, 12245–12250 (2001).
- 68. Williams-Gray CH et al. The distinct cognitive syndromes of Parkinson’s disease: 5 year follow-up of the CamPaIGN cohort. Brain 132, 2958–2969 (2009).
- 69. Perry D. et al. Building a roadmap for developing combination therapies for Alzheimer’s disease. Expert Rev. Neurother 15, 327–333.
- 70. Neltner JH et al. Digital pathology and image analysis for robust high-throughput quantitative assessment of Alzheimer disease neuropathologic changes. J. Neuropathol. Exp. Neurol 71, 1075–1085 (2012).
- 71. Irwin DJ et al. Semi-automated digital image analysis of Pick’s disease and TDP-43 proteinopathy. J. Histochem. Cytochem 64, 54–66 (2016).
- 72. Eickhoff SB, Yeo BT & Genon S Imaging-based parcellations of the human brain. Nat. Rev. Neurosci 19, 672–686 (2018).
- 73. Alegro M. et al. Deep learning based large-scale histological tau protein mapping for neuroimaging biomarker validation in Alzheimer’s disease. Preprint at https://www.biorxiv.org/content/10.1101/698902v1 (2019).
- 74. Demirovic J. et al. Prevalence of dementia in three ethnic groups: the South Florida program on aging and health. Ann. Epidemiol 13, 472–478 (2003).
- 75. Van Cauwenberghe C, Van Broeckhoven C & Sleegers K The genetic landscape of Alzheimer disease: clinical implications and perspectives. Genet. Med 18, 421–430 (2016).
- 76. Kalinderi K, Bostantjopoulou S & Fidani L The genetic background of Parkinson’s disease: current progress and future prospects. Acta Neurol. Scand 134, 314–326 (2016).
- 77. Jack CR et al. NIA–AA research framework: toward a biological definition of Alzheimer’s disease. Alzheimers Dement. 14, 535–562 (2018).
- 78. Landau SM et al. Comparing positron emission tomography imaging and cerebrospinal fluid measurements of β-amyloid. Ann. Neurol 74, 826–836 (2013).
- 79. Toledo JB et al. A platform for discovery: the University of Pennsylvania Integrated Neurodegenerative Disease Biobank. Alzheimers Dement. 10, 477–484 (2014).
- 80. Xie SX et al. Building an integrated neurodegenerative disease database at an academic health center. Alzheimers Dement. 7, e84–e93 (2011).
- 81. Crary JF et al. Primary age-related tauopathy: a common pathology associated with human aging (PART). Acta Neuropathol. 128, 755–66 (2014).
- 82. Shaw LM et al. Cerebrospinal fluid biomarker signature in Alzheimer’s disease neuroimaging initiative subjects. Ann. Neurol 65, 403–413 (2009).
- 83. Kaufman L & Rousseeuw PJ Finding Groups in Data: An Introduction to Cluster Analysis Vol. 344 (John Wiley & Sons, 1990).
- 84. Hastie T, Tibshirani R & Friedman J The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer Science & Business Media, 2009).
- 85. Kerr MK & Churchill GA Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc. Natl Acad. Sci. USA 98, 8961–8965 (2001).
- 86. von Luxburg U Clustering stability: an overview. Found. Trends Mach. Learn 2, 235–274 (2010).
- 87. Ben-Hur A, Elisseeff A & Guyon I in Biocomputing 2002 (eds Altman RB et al.) 6–17 (World Scientific, 2001).
- 88. Clarke GM et al. Basic statistical analysis in genetic case-control studies. Nat. Protoc 6, 121–33 (2011).
- 89. Cook JP, Mahajan A & Morris AP Guidance for the utility of linear models in meta-analysis of genetic association studies of binary phenotypes. Eur. J. Hum. Genet 25, 240–245 (2016).
- 90. Kohavi R A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI 14, 1137–1145 (1995).
- 91. Drummond C & Holte RC C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling Beats Over-Sampling (ICML, 2003).
- 92. Mani I & Zhang J kNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction (ICML, 2003).
- 93. Dworkin JD et al. The extent and drivers of gender imbalance in neuroscience reference lists. Nat. Neurosci 10.1038/s41593-020-0658-y (2020).
- 94. Maliniak D, Powers R & Walter BF The gender citation gap in international relations. Int. Organ 67, 889–922 (2013).
- 95. Caplar N, Tacchella S & Birrer S Quantitative evaluation of gender bias in astronomical publications from citation counts. Nat. Astron 1, 0141 (2017).
- 96. Chakravartty P, Kuo R, Grubbs V & McIlwain C #CommunicationSoWhite. J. Commun 68, 254–266 (2018).
- 97. Thiem Y, Sealey KF, Ferrer AE, Trott AM & Kennison R Just Ideas? The Status and Future of Publication Ethics in Philosophy: A White Paper (Publication Ethics, 2018); https://publication-ethics.org/white-paper/
- 98. Dion ML, Sumner JL & Mitchell SML Gendered citation patterns across political science and social science methodology fields. Political Anal. 26, 312–327 (2018).
- 99. Zhou D. et al. Gender Diversity Statement and Code Notebook v1.0 (2020); https://zenodo.org/record/3672110
- 100. Benjamini Y & Hochberg Y Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995).
- 101. Kassambara A ggpubr: ‘ggplot2’ Based Publication Ready Plots v.0.2.4 (CRAN, 2019).