Genetic Signatures of Exceptional Longevity in Humans

Paola Sebastiani; Nadia Solovieff; Andrew T DeWan; Kyle M Walsh; Annibale Puca; Stephen W Hartley; Efthymia Melista; Stacy Andersen; Daniel A Dworkis; Jemma B Wilk; Richard H Myers; Martin H Steinberg; Monty Montano; Clinton T Baldwin; Josephine Hoh; Thomas T Perls

doi:10.1371/journal.pone.0029848

PLoS One. 2012 Jan 18;7(1):e29848. doi: 10.1371/journal.pone.0029848

Editor: Greg Gibson

PMCID: PMC3261167 PMID: 22279548

Abstract

Like most complex phenotypes, exceptional longevity is thought to reflect a combined influence of environmental (e.g., lifestyle choices, where we live) and genetic factors. To explore the genetic contribution, we undertook a genome-wide association study of exceptional longevity in 801 centenarians (median age at death 104 years) and 914 genetically matched healthy controls. Using these data, we built a genetic model that includes 281 single nucleotide polymorphisms (SNPs) and discriminated between cases and controls of the discovery set with 89% sensitivity and specificity, and with 58% specificity and 60% sensitivity in an independent cohort of 341 controls and 253 genetically matched nonagenarians and centenarians (median age 100 years). Consistent with the hypothesis that the genetic contribution is largest with the oldest ages, the sensitivity of the model increased in the independent cohort with older and older ages (71% to classify subjects with an age at death>102 and 85% to classify subjects with an age at death>105). For further validation, we applied the model to an additional, unmatched 60 centenarians (median age 107 years) resulting in 78% sensitivity, and 2863 unmatched controls with 61% specificity. The 281 SNPs include the SNP rs2075650 in TOMM40/APOE that reached irrefutable genome wide significance (posterior probability of association = 1) and replicated in the independent cohort. Removal of this SNP from the model reduced the accuracy by only 1%. Further in-silico analysis suggests that 90% of centenarians can be grouped into clusters characterized by different “genetic signatures” of varying predictive values for exceptional longevity. The correlation between 3 signatures and 3 different life spans was replicated in the combined replication sets. The different signatures may help dissect this complex phenotype into sub-phenotypes of exceptional longevity.

Introduction

The average human lifespan in developed countries now ranges from about 80 to 85 years. Environmental factors such as lifestyle choices and where we choose to live as well as genetic factors all contribute to healthy aging. Supporting the importance of environmental factors in survival to old age is the 88 year average life expectancy of Seventh-Day Adventists [1], who by virtue of their religion have health related behaviors conducive to healthy aging.

Human twin studies suggest that only 20–30% of the variation in survival to about 85 years is determined by genetic variation [2]. However, the existence of rare families demonstrating remarkable clustering for extreme ages [3], [4], the increased relative risks of survival amongst siblings of nonagenarians [5] and of centenarians [6], [7], [8], [9], [10], [11], [12], [13], the fact that children of centenarians experience a marked delay in age-related diseases [14], and the similarity of centenarians' lifestyles to the general population [15], all argue that genetic factors play a much stronger role in living 25–35 years beyond the mid-eighties [10], [16], [17]. Impressively, siblings of centenarians born in 1900 have a relative risk of living nearly 100 years that is 8 (females) to 17 times (males) greater than that for the average of their birth cohort [10]. The rarity of the trait —only 1 centenarian amongst approximately 5,000 people in the US and only 1 supercentenarian (age 110+ years) amongst seven million people [18]— places exceptional longevity in a very different category from both average life expectancy and common complex traits associated with aging.

Based upon the hypothesis that exceptionally old individuals are carriers of multiple genetic variants that influence human lifespan, we conducted a genome-wide association study (GWAS) of centenarians. We began with a traditional one SNP at a time analysis to identify SNPs that are individually associated with exceptional longevity. We then used a novel approach to build a family of genetic risk models based on Bayes rule which, while taking into account the simultaneous influence of many genetic variants, can accurately discriminate between subjects with average versus exceptional longevity. Next, we used this family of models to construct subject-specific genetic risk profiles that, by cluster analysis, can be used to discover sub-phenotypes of exceptional longevity that are characterized by different genetic signatures. Figure 1 summarizes the steps of the analyses.

Figure 1. The analysis included genetic matching to remove confounding by population stratification between cases and controls of the discovery set and replication set 1, discovery and replication of single SNP associations, multivariate genetic risk modeling and generation of predictive genetic profiles, and cluster analysis of genetic risk profiles to discover genetic signatures of exceptional longevity (EL).

Results

Primary and secondary sets

Our primary set (discovery set) consisted of 801 unrelated subjects enrolled in the New England Centenarian Study (NECS) and 914 genetically matched controls. NECS subjects were Caucasians who were born between 1890 and 1910 with an age range of 95 to 119 years (median age 104 years). Approximately one-third of the NECS sample included centenarians with a first-degree relative also achieving exceptional longevity, thus enhancing the sample's power [19]. Controls included 241 genetically matched NECS referent subjects who were spouses of centenarian offspring or children of parents who died at an age ≤73 years, and 673 genetically matched subjects selected from the Illumina control database. For genetic matching we used a previously described algorithm [20] that groups subjects by ethnicities based on cluster analysis of the most informative principal components of genome-wide genotype data (Figure S1). Note that, based on the U.S. Social Security Administration's 1920 birth cohort life table, the average life expectancy in the cohort is 82 years, with standard deviation of 7.9 years, so that the mean age of the cases in our study and the average life expectancy in the cohort differ by 2.69 times the standard deviation. Furthermore, the mean age of NECS controls was 75 years, with standard deviation 7 years. Therefore, the difference between mean age of centenarians in the discovery set and NECS controls was more than 4 times the standard deviation, thus boosting the power of the study. For replication we used two additional sets. The replication set 1 (“ELIX”) consisted of 253 North American Caucasian subjects enrolled by Elixir Pharmaceuticals between 2001 and 2003. These individuals were born between 1890 and 1910 (age range of 89–114 years, median age 100) and were recruited and phenotyped using a protocol similar to the NECS. 
Referent subjects (n = 341) were identified from the remaining Illumina controls and genetically matched to the 253 cases using the same matching algorithm used in the discovery set. The replication set 2 was composed of 60 centenarians that included 39 subjects of European ancestry enrolled in the NECS between June 2009 and September 2010 (age range 100–114, mean age 108) plus 21 centenarians (age range 101–115, mean age 107) not included in the discovery set during the genetic matching, and all available Caucasian samples from the Illumina control database not used in the above comparisons. Centenarians and controls in replication set 2 were deliberately left unmatched in order to test the generalizability of the results. Figure 2 displays the age distributions of centenarians in the discovery and replication sets 1 and 2. We also used an additional set of 867 neurologically normal subjects, originally collected as controls for a Parkinson's disease GWAS [21], to test the robustness of single SNP associations. We analyzed 243,980 SNPs that passed a stringent quality control protocol described in the methods.
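The standard-deviation comparisons quoted for the discovery set can be checked with a few lines of arithmetic. The sketch below back-calculates the mean case age implied by the reported 2.69 standard deviations (the text states only the median age, so this implied mean is an inference, not a reported value) and then expresses its distance from the NECS control mean in units of the control standard deviation, which is an assumed choice of scale.

```python
# Values quoted in the text.
cohort_mean, cohort_sd = 82.0, 7.9      # cohort life expectancy and SD (years)
control_mean, control_sd = 75.0, 7.0    # NECS controls: mean age and SD (years)
reported_z = 2.69                       # reported distance of case mean from cohort mean

# Mean case age implied by the reported distance (an inference; only the
# median age of 104 years is stated explicitly).
implied_case_mean = cohort_mean + reported_z * cohort_sd

# Distance between cases and NECS controls, in control-SD units (assumed scale).
z_vs_controls = (implied_case_mean - control_mean) / control_sd

print(f"implied case mean: {implied_case_mean:.2f} years")
print(f"distance from NECS controls: {z_vs_controls:.2f} SD")
```

Under these assumptions the implied mean is about 103.25 years and the distance from the NECS controls is about 4.04 standard deviations, consistent with the statement that the difference exceeds 4 standard deviations.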

Figure 2. NECS: centenarians of the discovery set; ELIX: nonagenarians and centenarians from the ELIX replication set; NECS 2: additional NECS replication set of 60 centenarians. The y-axis reports the density, and the x-axis reports the age in groups of 2 years. The number of subjects with ages between x and x+2 is 2*density*(sample size).
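The conversion from plotted density to a subject count can be written as a one-line helper. This is a minimal sketch; the function name and the example bar height are hypothetical and are not read from the actual figure.

```python
def subjects_in_bin(density: float, sample_size: int, bin_width: float = 2.0) -> float:
    """Approximate number of subjects in one age bin of a density histogram."""
    return bin_width * density * sample_size

# Hypothetical reading: a bar of height 0.10 in a panel with 801 subjects.
print(round(subjects_in_bin(0.10, 801)))
```

With these illustrative inputs the bin would hold roughly 160 subjects.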

Single SNP Analysis

First we conducted a traditional single SNP analysis in which we ranked SNPs in the discovery set by the strength of association. We employed both Bayesian and traditional frequentist analyses of 4 different genetic models (general/genotypic, allelic/additive, recessive and dominant associations) to maximize power [22], [23]. With the Bayesian analysis, we scored each SNP association by the Bayes Factor (BF), which is the posterior odds for the association when the null hypothesis of no association and the alternative hypothesis of an association have the same prior probability [24], and then we used the maximum BF (MBF) as a measure of statistical significance. Figure S2 shows the error rate of decision rules based on several thresholds for MBF. The matching strategy appeared to remove confounding by stratification because we did not observe any inflation of associations and the genomic control factor in allelic association was 0.99 (Figure S3). We also conducted additional analyses described below to investigate whether residual confounding by population stratification could bias the results and found no evidence of bias.
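The relationship between a Bayes Factor and the posterior probability of association can be sketched as follows. The function names and the example log10(BF) values are hypothetical; the sketch assumes only what the text states, namely that the BF equals the posterior odds when both hypotheses have equal prior probability, and that the MBF is the largest BF across the four genetic models.

```python
def posterior_probability(log10_bf: float, prior: float = 0.5) -> float:
    """Posterior probability of association given log10(BF) and a prior probability."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = (10.0 ** log10_bf) * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

def maximum_bayes_factor(log10_bf_by_model: dict[str, float]) -> tuple[str, float]:
    """Return the genetic model with the largest log10(BF) and that value."""
    best = max(log10_bf_by_model, key=log10_bf_by_model.get)
    return best, log10_bf_by_model[best]

# Hypothetical log10(BF) values for one SNP under the four genetic models.
example = {"general": 1.2, "allelic": 2.4, "dominant": 2.9, "recessive": -0.3}
model, log10_mbf = maximum_bayes_factor(example)
print(model, log10_mbf, posterior_probability(log10_mbf))
```

With equal priors, log10(BF) = 0 corresponds to a posterior probability of 0.5, and log10(BF) = 2 to posterior odds of 100 to 1.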

The Manhattan plot (Figure 3) displays the log10(MBF) for each tested SNP. This analysis identified a single SNP in APOE/TOMM40 as irrefutably genome-wide significant (P<10e-8, Table 1). The association was replicated in the ELIX set, and was maintained when we used 867 referent subjects included in a GWAS of Parkinson's disease as alternative controls (Table 1).

Figure 3. The SNPs are ordered by chromosome (alternate color bands) and, within chromosome, by physical position (x-axis). We tested the association of each SNP with exceptional longevity using general, allelic, dominant and recessive models and the y-axis reports the maximum log10(Bayes factor) observed for each SNP. The SNP rs2075650 in *APOE/TOMM40* reached irrefutable genome wide significance (log10(MBF) = 7.9 and p-value < 1e-10). Figure S3 shows the Manhattan plot and QQ plot for the additive model using logistic regression.

Table 1. Replication of the association of rs2075650 in TOMM40/APOE.

| Set (cases, controls) | SNP | Gene | Chrom | Alleles | LOG10(BF) | p-value | OR | p(A) |
|---|---|---|---|---|---|---|---|---|
| Discovery Set (801, 914) | rs2075650 | *TOMM40/APOE* | chr19:50087459 | AG/GG v AA | 6.31 | 1.03E-08 | 0.49 | 0.15/0.26 |
| Replication Set (Elix 253, 341) | | | | | 2.04 | 0.000468 | 0.47 | 0.15/0.27 |
| Combined (1054, 1255) | | | | | 9.30 | 1.01E-11 | 0.48 | 0.15/0.26 |
| Coriell (801, 867) | | | | | 3.73 | 3.86E-06 | 0.55 | 0.15/0.24 |

The table shows the replicated associations of the SNP rs2075650 in TOMM40/APOE in the replication set 1 and the additional control set from the Parkinson's disease study. Column legends: SNP = official dbSNP identifier. Gene = official gene name for SNPs that are within 20 kb from transcribed regions. Chrom = chromosome and physical position of the SNP in hg18. Alleles = the two SNP alleles (allele 1 v allele 2) in the genetic model that reached strongest significance in the Bayesian analysis. LOG10(BF) = the base-10 logarithm of the Bayes Factor for the association relative to the null model of no association. Assuming uniform prior probabilities for the two hypotheses, the BF represents the posterior odds for association. P-value = p-value for the 1 degree of freedom test of the dominant model AG/GG versus AA. OR = odds ratio for exceptional longevity in subjects who carry allele 1 relative to allele 2. For example, subjects who carry allele 1 of SNP rs2075650 (AG/GG: either the genotype AG or GG) have 0.49 times the odds for exceptional longevity compared to subjects who carry allele 2 (AA). P(A) = prevalence of allele 1 in cases and controls. For example, 15% of centenarians carry the allele AG/GG of SNP rs2075650 compared to 26% of controls. Row 1 shows the results in the discovery set; row 2 in the ELIX set; row 3 in the combined discovery and ELIX datasets; and row 4 in the set in which the 914 matched controls of the discovery set were replaced with the unmatched Coriell controls.
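As a rough check on how the OR and P(A) columns relate, the odds ratio can be recomputed from the carrier prevalences. This is a sketch using the rounded prevalences printed in Table 1, so the result is only expected to approximate the tabulated OR; the function name is hypothetical.

```python
def odds_ratio(prevalence_cases: float, prevalence_controls: float) -> float:
    """Odds ratio comparing carrier odds in cases with carrier odds in controls."""
    odds_cases = prevalence_cases / (1.0 - prevalence_cases)
    odds_controls = prevalence_controls / (1.0 - prevalence_controls)
    return odds_cases / odds_controls

# Rounded carrier prevalences (AG/GG) from the discovery-set row of Table 1.
discovery_or = odds_ratio(0.15, 0.26)
print(f"{discovery_or:.2f}")
```

The rounded prevalences give an odds ratio of about 0.50, close to the tabulated 0.49; the small gap is what one would expect from rounding the prevalences to two decimals.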

The apolipoprotein E (APOE) is associated with human lifespan [25], [26], [27]. SNP rs2075650 occurs in an intron of TOMM40 but it is a strong proxy of the SNPs that define the APOE alleles [28]. This SNP has been associated with Alzheimer's disease (AD) [29], [30] and lipid levels [31], [32].

Genetic Risk Modeling

In the single SNP analysis, we observed a substantial enrichment for significant associations which do not meet the stringent threshold for genome wide significance. For example, 112 SNPs were associated with exceptional longevity with log10(MBF)>2 against an estimated error rate of 4 in 100,000 independent tests and hence 8–10 false positive associations expected by chance in ∼250,000 tested SNPs if there were no significant associations and all SNPs were independent (Figure S2). The clusters of associations in chromosomes 8, 9 and 21 in Figure 3 point to interesting regions, although they fail to reach genome wide significance. Several authors have argued that SNPs that do not reach genome wide significance may be biologically important by virtue of their joint effect [33], [34], [35], [36], and have successfully built risk models that can predict genetic susceptibility to several complex traits that are highly heritable [37], [38], [39], [40], [41]. We similarly explored the hypothesis that different sets of SNPs that are associated with exceptional longevity, although with moderate effects, may jointly characterize the genetic predisposition to exceptional longevity [42], [43] and therefore provide a model for in silico analysis that can suggest targets and genetic paths to exceptional longevity.
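The expected number of chance findings quoted above follows directly from the estimated error rate. The sketch below reproduces that arithmetic under the same simplifying assumptions stated in the text (no true associations, independent SNPs); the function name is hypothetical.

```python
def expected_false_positives(error_rate: float, n_tests: int) -> float:
    """Expected number of false positives among n_tests independent null tests."""
    return error_rate * n_tests

error_rate = 4 / 100_000   # estimated error rate for log10(MBF) > 2
observed = 112             # SNPs with log10(MBF) > 2 in the discovery set

for n_tests in (243_980, 250_000):
    expected = expected_false_positives(error_rate, n_tests)
    print(f"{n_tests} tests: about {expected:.1f} expected by chance, {observed} observed")
```

Using the 243,980 SNPs actually analyzed gives just under 10 expected false positives, so the 112 observed associations are roughly an order of magnitude more than chance alone would predict.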

Selection of Predictive SNPs

To proceed with this analysis, we had to make several decisions about the class of models to work with, how to determine the number of SNPs to be included in the model, and the overall search strategy. We chose to compute the genetic risk associated with a set of SNPs using a simple but effective Bayesian classification model, also known as the naïve Bayes classifier (Figure 4A) [44]. This approach, also used in [39] to accurately predict the susceptibility to carotid atherosclerosis, classifies a subject as predisposed to exceptional longevity if the posterior probability of exceptional longevity, given genotypes of a set of SNPs, exceeds the posterior probability of average longevity (Figure 4A). The advantage of this method is that there is virtually no upper limit to the number of SNPs that can be used for classification, and it can be used for risk prediction even if the data used for the analysis are from a case control study. We designed a forward search procedure to discover a sufficient number of predictive SNPs (Figure 4A). The procedure builds a series of nested genetic risk models starting with the most significant SNP in the discovery set and incrementally adding one SNP at a time from a pruned set of SNPs that are sorted in decreasing order of log10(MBF). Each model is used for prediction, and the accuracy of each model to predict exceptional longevity and average longevity is evaluated by sensitivity and specificity (Figure 4B). The trend of sensitivity and specificity in Figure 4B shows that including more SNPs increases both sensitivity and specificity, but the gain in accuracy diminishes as SNPs with decreasing statistical significance (lower MBF) are added. In particular, the sensitivity plateaus between 275 and 285 SNPs, so that including more SNPs does not appear to improve the sensitivity further (Figure 4B). Because the model with 281 SNPs gives the closest sensitivity and specificity, we stopped the search for predictive SNPs at 281.
We also used a resampling approach (Figure S4A) to validate this choice, and examined the effect of changing the SNP order in our heuristic search (Figure S4C and D), and possible lab-genotyping bias (Figure S4B).

Figure 4. A) We ordered SNPs by maximum Bayes Factor in the discovery set and built nested SNP sets starting with the most significant SNP and then adding one SNP at a time from the ordered list. The conditional probabilities of SNP genotypes in centenarians (p(SNP_i|EL)) and controls (p(SNP_i|AL)) are used to compute the posterior probability of exceptional longevity (p(EL|Σ_k)) using Bayes' theorem and prior probability p(EL) = 0.5. The classification rule is the standard Bayesian classification rule that is optimal under a 0–1 loss function. **B) Sensitivity and specificity of 400 nested models.** The x-axis reports the number of SNPs in each of the nested models, and the y-axis reports sensitivity (% of centenarians with posterior probability of exceptional longevity > posterior probability of average longevity) and specificity (% of controls with posterior probability of exceptional longevity < posterior probability of average longevity).
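The naïve Bayes rule described in this caption can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the single-SNP example uses the rounded carrier prevalences of rs2075650 from Table 1 under a two-category (dominant) coding, whereas the actual model uses the genotype probabilities listed in Table S1.

```python
import math

def posterior_el(genotypes, p_given_el, p_given_al, prior_el=0.5):
    """Posterior probability of exceptional longevity (EL) under naive Bayes.

    genotypes  : observed genotype label for each SNP in the set
    p_given_el : one dict per SNP mapping genotype -> p(genotype | EL)
    p_given_al : one dict per SNP mapping genotype -> p(genotype | AL)
    """
    # Work with log odds so that many SNPs do not cause numerical underflow.
    log_odds = math.log(prior_el / (1.0 - prior_el))
    for genotype, p_el, p_al in zip(genotypes, p_given_el, p_given_al):
        log_odds += math.log(p_el[genotype]) - math.log(p_al[genotype])
    return 1.0 / (1.0 + math.exp(-log_odds))

def classify(posterior):
    """Assign EL when P(EL | genotypes) exceeds P(AL | genotypes)."""
    return "EL" if posterior > 0.5 else "AL"

# Assumed two-category coding of rs2075650 (rounded prevalences from Table 1).
rs2075650_el = {"AA": 0.85, "AG/GG": 0.15}
rs2075650_al = {"AA": 0.74, "AG/GG": 0.26}

p_aa = posterior_el(["AA"], [rs2075650_el], [rs2075650_al])
p_carrier = posterior_el(["AG/GG"], [rs2075650_el], [rs2075650_al])
print(round(p_aa, 2), classify(p_aa))
print(round(p_carrier, 2), classify(p_carrier))
```

With a prior of 0.5 and only two classes, comparing the two posteriors is the same as thresholding P(EL) at 0.5, which is why `classify` uses that cut-off.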

Table S1 provides complete details of all of the 281 SNPs, and the probabilities that are used to compute the prediction using the formula in Figure 4A. Reliability of the Illumina genotyping was double-checked by re-genotyping the top 28 SNPs of the model using TaqMan genotyping in an independent lab, and the 99.7% concordance suggests that the data are reliable (Figure S5). Intensity plots of the 281 SNPs are available from www.bumc.bu.edu/centenarian. Of the 281 SNPs, 137 occur in 130 genes, some of which have been previously associated with aging, such as LMNA (rs915179), WRN (rs1800392), and SOD2 (rs2758331), and several of them are in close proximity to coding SNPs [45]. The LMNA gene, which encodes the nuclear envelope proteins lamin A and lamin C, has been associated with the progeroid (premature aging-like) syndrome, Hutchinson-Gilford syndrome [46]. The WRN gene encodes a DNA helicase and exonuclease that plays a deterministic role in DNA repair and in another progeroid syndrome, Werner's Syndrome [47]. The WRN gene has been associated with longevity in the Framingham Heart Study (FHS) sample [48]. It is remarkable that the two genes responsible for the best known progeroid syndromes appear in the genetic risk model, and this may reflect the power of the discovery sample, which includes such extreme old ages. Another gene, also noted to be associated with longevity in the FHS sample as well as the Jerusalem Study, is SOD2, or superoxide dismutase 2 [49]. SOD2 is a key free radical scavenger, and free radical damage likely plays an important pathogenic role in aging and numerous age-related diseases [50]. CDKN2A (rs1063192) performs a key step in the p53 pathway that has been posited to play a key role in inducing cellular senescence [51] and it has been associated with adult onset diabetes [52]. SORCS1 (rs7907713) and SORCS2 (rs6812745) have been linked to AD [53].
Gastric inhibitory polypeptide (GIP), commonly referred to as glucose-dependent insulinotropic peptide, encodes a protein that regulates insulin secretion and activates AKT [54]. The association of this gene (rs9899404) supports the potential role of insulin regulation in exceptional longevity [55], and suggests new target genes for human aging beyond FOXO1, FOXO3A and IGF-IR [56], [57], [58]. There is also growing evidence of GIP playing a protective role in both diabetes and AD and GIP is being investigated as a therapeutic target [59].

We used Genomatix (http://www.genomatix.de) to annotate the list of 130 genes included in the genetic risk model and the analysis showed that the list was enriched for several groups of genes linked to both common and rare diseases (MeSH). Genes related to Alzheimer's disease, dementia and tauopathies were the most significant: 38 of the 130 genes were linked to AD in the literature (p-value to test the null hypothesis that this happens by chance was 6.17e-7) and they are displayed in Figure 5; 42 genes were linked to dementia (Figure S6, p-value to test the null hypothesis that this happens by chance was 1.07e-6) and 38 to tauopathies (p-value 8.47e-7). The fact that so many genes are noted to play a role in dementia is consistent with the epidemiologic finding that dementia is absent or markedly delayed amongst centenarians (average age of onset, 93 years) [60]. Genes related to other age related diseases were also significantly represented: 24 genes were linked to coronary artery disease (Figure 5), and several genes were linked to neoplasms.

Figure 5. The two networks display 38 of the 130 genes in the genetic risk model that are linked to Alzheimer's disease (top) and 24 of the 130 genes that are linked to coronary artery disease (bottom) in the literature, either by functional or genetic association studies. Nodes that are linked by an edge represent genes that are either “co-cited” (dashed lines) or “associated by expert curation” (continuous lines). The shape of the arrow head indicates the type of association: activation (triangle), inhibition (circle), modulation (diamond), or conversion (arrow head). The node shape indicates known roles of the genes (see inset). Singleton nodes were linked to AD/CAD in the literature but not together with other genes. The number of genes linked to each disease was compared to what is expected by chance using Fisher's exact test, and the p-values show that the gene sets are unlikely to be the result of chance. (Networks generated with Genomatix.)
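An enrichment test of this kind can be sketched as a one-sided Fisher exact test, which is the upper tail of a hypergeometric distribution. The text does not report the size of the background gene set or how many background genes are linked to each disease, so the background numbers below are purely hypothetical placeholders, and the sketch is not a reproduction of the Genomatix computation.

```python
from math import comb

def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher exact p-value, P(X >= k), for gene-list enrichment.

    N : genes in the background set
    K : background genes linked to the disease
    n : genes in the list of interest
    k : genes in the list that are linked to the disease
    """
    upper = min(n, K)
    # Exact integer arithmetic for the numerator avoids floating-point overflow.
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)

# 38 of the 130 model genes are linked to AD (from the text); the background
# size and the number of AD-linked background genes are HYPOTHETICAL.
hypothetical_background = 20_000
hypothetical_ad_linked = 2_000
p = enrichment_p_value(38, 130, hypothetical_ad_linked, hypothetical_background)
print(f"p = {p:.3g}")
```

The p-value depends strongly on the assumed background, so the printed number should not be compared with the values reported in the text.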

Genetic Risk Profiles and Ensemble of Risk Models

To better understand the role of these 281 SNPs in shaping the genetic susceptibility to exceptional longevity, we generated a genetic risk profile for each subject by plotting the posterior probability of exceptional longevity (p(EL|Σ_k), y axis) against the number of SNPs in each of the 281 SNP sets Σ_k (x-axis) and examined their patterns. Figure 6 shows, for example, the profiles from 3 centenarians and a control. In each profile, an increasing posterior probability of exceptional longevity shows strong enrichment of longevity associated variants, because the posterior probability of exceptional longevity increases when the profile includes a new SNP genotype that is more frequent in centenarians than in controls (see methods).

Figure 6. 281 nested SNP sets were used to compute the posterior probability of exceptional longevity in the 4 subjects (y-axis) and were plotted against the number of SNPs in each set (x-axis). In the 107 year old, the first 5 SNP sets Σ₁ = [rs2075650], Σ₂ = [Σ₁, rs1322048], …, Σ₅ = [Σ₄, rs6801173] determine a posterior probability of exceptional longevity ranging between 0.54 and 0.28. This subject carries genotypes AA, AG, AG, CC, AA for the 5 SNPs respectively and, with the exclusion of genotype AA of rs2075650 that is more common in centenarians, the other genotypes are more common in controls than centenarians and determine a posterior probability of exceptional longevity that is lower than the posterior probability of average longevity. The sixth SNP set, Σ₆ = [Σ₅, rs337656], predicts an almost 30% chance of exceptional longevity. The subject carries the AA genotype for the SNP rs337656 that is more frequent in centenarians (Table S1), and carrying this genotype increases the posterior probability of exceptional longevity. The probability predicted by the next SNP sets increases steadily and all models with more than 20 SNPs predict more than a 50% chance of exceptional longevity. This genetic profile shows that the subject carries some combinations of SNP alleles that are associated with exceptional longevity, while other alleles are associated with “average longevity”. However, the overall genetic risk profile determined by all 281 SNP sets makes a strong case for exceptional longevity because the majority of models predict more than an 80% chance of exceptional longevity. The genetic risk profile of the centenarian who died at age 119 years is even more convincing: with the exception of the first SNP, all subsequent SNP sets determine more than a 70% chance of exceptional longevity, and 272 of the 281 models predict more than an 80% chance for exceptional longevity.
This profile shows that this subject is highly enriched for SNP alleles that are more common in centenarians (longevity associated variants) and that probably played a determinant role in the extreme survival. The profile of the third subject, age 108 years, shows that different SNP sets determine different chances for exceptional longevity, and only the overall trend of genetic risk provides evidence for exceptional longevity. The fourth plot displays the profile of a control, and shows that this subject carries some longevity associated variants; however, the overall trend of genetic risk points to average longevity rather than exceptional longevity.
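A genetic risk profile of this kind is the sequence of posterior probabilities obtained as each SNP is added to the nested sets. The sketch below shows one way to compute such a profile from per-SNP likelihood ratios; the function names and the six example ratios are hypothetical and are not taken from Table S1.

```python
import math

def risk_profile(log_likelihood_ratios, prior_el=0.5):
    """Posterior P(EL) after each nested SNP set, in the order SNPs are added.

    Each entry of log_likelihood_ratios is
    log(p(genotype | EL) / p(genotype | AL)) for the subject's genotype.
    """
    log_odds = math.log(prior_el / (1.0 - prior_el))
    profile = []
    for llr in log_likelihood_ratios:
        log_odds += llr  # adding a SNP multiplies the posterior odds by its ratio
        profile.append(1.0 / (1.0 + math.exp(-log_odds)))
    return profile

def models_above(profile, threshold):
    """Number of nested models whose posterior P(EL) exceeds the threshold."""
    return sum(p > threshold for p in profile)

# Hypothetical likelihood ratios for one subject's first six SNPs.
ratios = [0.85 / 0.74, 0.60, 0.70, 1.50, 1.80, 2.00]
profile = risk_profile([math.log(r) for r in ratios])
print([round(p, 2) for p in profile])
print(models_above(profile, 0.5), "of", len(profile), "models predict P(EL) > 0.5")
```

A ratio above 1 (a genotype more frequent in centenarians) raises the profile and a ratio below 1 lowers it, which is the behavior described for the example subjects.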

These examples support the hypothesis that exceptional longevity is determined by varying combinations of longevity associated variants, and that a given number of SNPs may be optimal for classifying some subjects but not others. Consistent with this observation, we chose an ensemble of all 281 genetic risk models to compute the posterior probability of exceptional longevity. This ensemble of 281 genetic risk models provides 89% specificity and sensitivity in the discovery set (Figure 7A). We next evaluated the predictive accuracy of this ensemble of models in the two replication sets, the ELIX set and a recently enrolled sample of NECS centenarians.

Figure 7. **Panel A:** Posterior probability of exceptional longevity (EL) and average longevity (AL) (x-axis) in the centenarians (red boxplots) and controls (AL1: Illumina controls, blue boxplots, AL2: NECS controls, green boxplots) of the discovery set (NECS, top left). Both sensitivity and specificity were 89%. The boxplots in blue and green show that the distributions of the posterior probability of EL in the two control groups are not statistically different (p-value from t-test comparing the posterior probability of EL = 0.21). **Panel B:** Posterior probability of EL and AL (x-axis) in the centenarians (red boxplots) and controls of the replication set 1. Sensitivity and specificity were 60% and 58% and the distributions of the predictive score are significantly different (t-test p-value = 0.001). **Panel C:** Median values of the posterior probability of EL (predictive score) in subsets of centenarians of the replication set 1 with increasing ages. The barplot shows that the median score increases with older ages. **Panel D:** Sensitivity of the classification rule in subsets of centenarians of the replication set 1 with increasing ages. The barplot shows the increasing sensitivity in older groups that reaches 85% in 20 subjects aged 106 and older. **Panel E:** Distribution of the posterior probability of exceptional longevity in the 253 cases of the replication set divided into two age groups (<103 years, pale blue, mean age 99 years, and ≥103 years, red, mean age 106). The sensitivities in the two groups are 57% and 71.4%. The three distributions are significantly different (p-value = 0.04 from t-test comparing Illumina controls and centenarians aged <103; p-value = 0.004 from t-test comparing the centenarians stratified by age).
**Panel F:** Sensitivity and specificity in an additional set of 2863 controls from the Illumina database (blue), and an additional set of 60 centenarians that include 39 centenarians enrolled since June 2009 (mean age 108) and 21 centenarians that were excluded from the earlier analysis because of genetic matching (mean age 106). The specificity in the additional Illumina controls is 61.2%. The sensitivity in the additional centenarians was 71.5% in the set of 21, and 82% in the additional 39, for a total of 78% (p-value from t-test comparing the posterior probabilities of EL in controls and centenarians <1e-10).

Sensitivity and specificity in the replication set 1 (the ELIX sample), comprising 253 nonagenarians and centenarians and 341 genetically matched controls, were 60% and 58% (Figure 7B), with AUC = 0.58 (Figure S7). Although the distributions of the predictive scores are significantly different (p-value from t-test comparing the predicted probabilities of exceptional longevity in the two groups was 0.001), the discrimination of the model is less remarkable than in the discovery set. Since the subjects in this replication set are younger than the centenarians in the discovery set (median age in the ELIX set was 100 years compared to 104 in centenarians of the discovery set) and because we expect that the genetic component of exceptional longevity increases with age, we next examined the distribution of the predictive score and the trend of sensitivity in subsets of subjects with older ages. The median probability of exceptional longevity in subsets of increasing age of survival increases to more than 68% in the 81 subjects with ages >101 (Figure 7C) and, consistently, the sensitivity of the model to correctly classify older subjects increases with older ages and reaches 85% in 20 subjects ages 106 and older (Figure 7D). For example, when the 253 cases of the replication set were divided into two age groups to better match the ages of the substantially older discovery set (204 subjects, age <103, median age 100 years, and 49 subjects, age ≥103, median age 105), the sensitivity of the model was 57% in the younger group and 71% in the older group (Figure 7E).
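The accuracy measures used here can be computed directly from the predicted probabilities. The sketch below uses small made-up score lists, not study data, and hypothetical function names; sensitivity and specificity follow the classification rule P(EL) > P(AL), i.e. a score above 0.5, and the AUC is computed as the probability that a randomly chosen case scores higher than a randomly chosen control.

```python
def sensitivity_specificity(case_scores, control_scores, threshold=0.5):
    """Sensitivity: fraction of cases scored above the threshold.
    Specificity: fraction of controls scored below the threshold."""
    sensitivity = sum(s > threshold for s in case_scores) / len(case_scores)
    specificity = sum(s < threshold for s in control_scores) / len(control_scores)
    return sensitivity, specificity

def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparisons (ties count one half)."""
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in case_scores
        for k in control_scores
    )
    return wins / (len(case_scores) * len(control_scores))

# Made-up posterior probabilities of EL, for illustration only.
cases = [0.9, 0.6, 0.4]
controls = [0.2, 0.7, 0.3, 0.1]
sens, spec = sensitivity_specificity(cases, controls)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} AUC={auc(cases, controls):.2f}")
```

Restricting `cases` to older subjects before calling `sensitivity_specificity` mirrors the age-stratified sensitivities reported for the replication set.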

To further investigate our hypothesis that the genetic contribution to exceptional longevity increases with older ages, we evaluated the sensitivity of the classification rule in a second replication set of newly enrolled NECS centenarians (n = 39) plus NECS centenarians not included in the discovery set (n = 21), the sum of which had a median age of 107 years (Figure 7F). The sensitivity was 78% (71.5% in the group of 21 with median age 106 and 82% in the recently enrolled and older group of 39), confirming increasing sensitivity with increasing ages. The boxplot in Figure 7F shows that the specificity in an additional set of 2863 controls of replication set 2 was 61.2%, and the AUC in this second replication set was 0.74 (Figure S7). Figure S8 shows that classification rules based on randomly ordering the top 281 SNPs (mid panels) or selecting 281 SNPs at random have lower sensitivity and specificity.

Our analysis used genetic matching to remove confounding by population structure. However, since we matched subjects within clusters, residual stratification might still confound the association and possibly affect the classification rule. To test the hypothesis that there is no confounding by residual stratification, we conducted two traditional analyses. In one analysis, we adjusted the associations of the 281 SNPs by the top 4 principal components, and in the second analysis we did not. We then checked whether adjusting the analysis by the principal components would change the results of the unadjusted analysis. Figure S9 shows that the distributions of p-values for the two analyses in different genetic models are essentially identical (correlation coefficient 0.98 to 0.99). This analysis would indicate that there is no confounding due to residual stratification. We repeated the analysis adjusting for the top 10 principal components. The effect of this more stringent adjustment made 3 of the 281 SNPs borderline significant. We also checked if there is any residual correlation between the top two PCs and the score predicted by our model, and there appears to be none (Figure S10).

Genetic Signatures

Some genetic risk profiles were recurrent and we speculated that groups of centenarians may have genetic risk profiles that are associated with different sub-types of exceptional longevity such as different prevalences or ages of onset of age-related diseases. To test this hypothesis, we used cluster analysis to group the genetic risk profiles into prototypical signatures. We then investigated whether groups of centenarians with particular genetic risk profiles shared specific age-related sub-phenotypes.

Cluster analysis identified 26 groups of 8 to 94 centenarians (90% of the discovery set) with similar genetic risk profiles, while 10% of the centenarians had rare profiles that occur in groups of 7 centenarians or fewer. Figure 8 shows, for example, the 9 largest clusters, while all clusters are shown in Figure S11. The prototypical genetic risk profiles associated with each cluster are informative displays of the longevity associated variants, and represent different genetic signatures of exceptional longevity. While the ensemble of genetic risk models provides a global estimate of the probability of exceptional longevity, the pattern itself provides information about the different sets of longevity associated variants that drive a subject toward this probability. The same cluster analysis of predicted profiles in centenarians of the merged replication sets 1 and 2 identified 15 clusters with 8 or more subjects, while approximately 35% of profiles clustered in groups of 7 or fewer. The two most predictive and the one least predictive clusters from the replication set are also shown in Figure 8. Figure S12 depicts all 15 clusters with 8 or more subjects in the merged replication sets.

In each plot, the x-axis reports the number of SNPs in each genetic risk model (1,…,281), and the y-axis reports the posterior probability of exceptional longevity predicted by each model. The boxplots (one for each SNP set on the x axis) display the genetic risk profiles of the centenarians grouped in the same cluster. Numbers N in parentheses are the cluster sizes, and the average posterior probability of exceptional longevity. Color coding represents the strength of the genetic risk to predict EL (Blue: P(EL|∑₂₈₁)>0.95; Red: 0.5<P(EL|∑₂₈₁)<0.95; Orange: 0.20<P(EL|∑₂₈₁)<0.5; Green: P(EL|∑₂₈₁)<0.2). The full set of 26 clusters is in **Figure S11** and includes more than 90% of centenarians in the discovery set.

To examine the specificity of the profiles in characterizing exceptional longevity, we also generated genetic risk profiles of the control subjects in the discovery set and used cluster analysis to group them. Only 5 subjects had profiles that predicted exceptional longevity with more than 90% posterior probability (Figure S13). Other clusters with more than 8 subjects show that the majority of these profiles match either the lack of a predictive genetic signature as in cluster C26 or the sporadic presence of longevity associated variants of clusters C24–C25 in Figure S11. To further extend this analysis, we clustered the genetic profiles of all 4118 controls that include all controls in the discovery and replication sets 1 and 2. Cluster analysis identified several signatures, of which only 17% predict exceptional longevity with more than 70% posterior probability, and 67% predict average longevity (Figure S14). The most predictive genetic signatures that characterize exceptional longevity are rare amongst control subjects, and only 0.6% of the genetic signatures of control subjects have a posterior probability of exceptional longevity >0.95.

Interestingly, the patterns of genetic risk profiles that cluster into genetic signatures distinctly differ from clusters of genetic risk profiles generated from SNPs selected at random (Figure S15). We also investigated if some clusters were enriched for specific ethnicities, but no clusters showed enrichment for any specific European ethnicity.

We next investigated whether different genetic signatures correlate with different life spans (Figure 9). Some genetic signatures were indeed associated with significantly different life spans. For example, the most predictive signature (C1) comprised centenarians with significantly longer survival compared to centenarians with signatures C2 (the second most predictive) or C26 (the least predictive): the median survival in centenarians with signature C1 was 105 years, compared to 104 years in centenarians with signature C2 and 103 years in centenarians with signature C26. We observed a similar result when we compared the survival of centenarians with the most predictive signatures in the merged replication sets (R1 and R2), and when we compared the survival of centenarians with the most and the least predictive signatures (R1 and R15) (see Figure 9). However, not all signatures correlated with different survival. For example, centenarians with signatures C1 and C3 did not demonstrate different survival (see Figure S16). Preliminary analyses provided in the supplementary material (in need of replication) suggest that the different genetic signatures of exceptional longevity associate with varying prevalences and ages of onset of various age-related diseases (Figure S17, Table S2).

**Panel A:** Some genetic signatures are associated with significantly different life spans. For example, the most predictive signature (C1) comprises centenarians with significantly longer survival compared to centenarians with signatures C2 or C26 (p-values 0.01 and 0.02). More examples are in **Figure S15**. **Panel B:** The two most predictive genetic signatures and the least predictive signature in the centenarians of the merged replication sets show consistent results. The comparison between survival of centenarians with the most predictive signature R1 and the least predictive signature R15 reaches statistical significance (p-value = 0.003), while the comparison between survival distributions of centenarians with signatures R1 and R2 does not reach statistical significance (p-value 0.10).

For 17 of the 28 centenarians in cluster C26 who lack almost all the longevity associated variants discovered in this study, we had information about familial longevity. Twenty-five percent (n = 5) had >50% of siblings who survived past the age of 90 and some had evidence for longevity as shown in some pedigrees in Figure S18. This could indicate that such families have more private or rare variants not captured by either the genotyping or the model.

Discussion

Though living to very old age runs strongly in families, it is also a very complex phenomenon with many different patterns of survival that include disease-free survival but also survival with various age-related diseases. Given this complexity, it is extremely unlikely that a single or few genes confer this survival advantage, but rather it is likely that many genes are involved. To capture this genetic complexity we developed an approach that uses genetic risk modeling for in-silico genetics. Our approach includes 3 steps: 1) a single SNP analysis to identify and rank SNPs that are significantly associated with exceptional longevity, 2) genetic risk modeling based on nested Bayesian classifiers that produce genetic risk profiles and 3) cluster analysis of the profiles to discover genetic signatures and correlate these to different survival patterns or subphenotypes of exceptional longevity.
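Step 2 can be illustrated with a minimal sketch of how nested naïve Bayes classifiers yield a genetic risk profile: SNPs are added one at a time in rank order, and the posterior probability of exceptional longevity (EL) is recomputed after each addition. This is a sketch under stated assumptions and not the implementation used in the study. The function name, the genotype frequencies, and the prior of 0.5 are all hypothetical, and the sketch applies no smoothing, so every frequency must be non-zero.

```python
from math import exp, log


def nested_risk_profile(genotypes, freq_cases, freq_controls, prior_el=0.5):
    """Posterior P(EL | first k SNPs) for k = 1..len(genotypes).

    genotypes:        observed genotype per SNP, in rank order.
    freq_cases[k]:    dict genotype -> P(genotype | EL) for SNP k.
    freq_controls[k]: dict genotype -> P(genotype | not EL) for SNP k.
    prior_el:         assumed prior probability of EL (hypothetical value).
    """
    log_odds = log(prior_el / (1.0 - prior_el))
    profile = []
    for k, g in enumerate(genotypes):
        # Naive Bayes: SNPs are treated as conditionally independent given
        # the class, so each one adds its own log likelihood ratio.
        log_odds += log(freq_cases[k][g]) - log(freq_controls[k][g])
        profile.append(1.0 / (1.0 + exp(-log_odds)))
    return profile


# Hypothetical genotype frequencies for two SNPs (illustration only).
freq_cases = [{"AA": 0.80, "AG": 0.19, "GG": 0.01},
              {"CC": 0.50, "CT": 0.40, "TT": 0.10}]
freq_controls = [{"AA": 0.70, "AG": 0.27, "GG": 0.03},
                 {"CC": 0.40, "CT": 0.45, "TT": 0.15}]

profile = nested_risk_profile(["AA", "CC"], freq_cases, freq_controls)
print([round(p, 3) for p in profile])  # prints [0.533, 0.588]
```

The resulting list, one posterior per nested model, corresponds to the kind of genetic risk profile that step 3 would cluster.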

Limitations

Although we elected to work with naïve Bayesian classifiers, many alternative approaches to genetic risk modeling exist and our method could be extended and/or improved to include, for example, different parametric models or different types of cluster analyses to discover genetic signatures. We conducted extensive simulation studies to compare our approach to logistic regression that uses the genetic data, or a summary of the genetic data in a genetic risk score. Our analyses show that when all SNPs have an additive effect, a Bayesian classifier and a logistic regression model with a weighted genetic risk score perform equivalently. However, when the genetic effects include different models of inheritance, such as a combination of dominant/recessive/general associations, then a Bayesian classifier is more robust than logistic regression with a weighted genetic risk score. In either case, the approach we chose guarantees robustness as indicated in simulation studies (Clustering by genetics ancestry using genome-wide single nucleotide polymorphisms and incorporating genetic ancestry into genetic prediction models, Doctoral dissertation by Nadia Solovieff, May 2011, available upon request). Furthermore, many other “machine-learning type” approaches exist that can be used to generate genetic risk models, and years of comparative evaluations in the machine learning community have shown that there is no clear winner, but different problems require different solutions [61]. In our search for genetic predictors of exceptional longevity, Bayesian classifiers appear to perform reasonably well and can be extended to more general directed graphical models to include interactions between SNPs and between genes and environmental factors [62]. Our approach for selecting predictive features appears to work well in this application. However, other search procedures for feature selection need to be explored and may produce even better predictive accuracy.

There are aspects of our method that are based on heuristics. For example, our choice of the number of SNPs to be used in the genetic risk modeling is based on a heuristic rule. The choice of the optimal number of features to be used in a classifier is a well-known problem, with no simple solution [44] and to limit the effect of a sub-optimal selection we used an ensemble of classifiers to gain robustness. This approach is known to produce better classifiers than one single model [63]. Our heuristic search orders SNPs by maximum Bayes factor. Our secondary analyses show that random reordering of the 281 SNPs decreases the specificity slightly and selecting SNPs at random from the most significant 1700 SNPs gives models that are less predictive in independent sets (Figure S4 and S8). If other investigators apply this approach to other domains, they may want to conduct similar secondary analyses to evaluate whether the same heuristics lead to better models.

A major challenge we faced with our genome-wide association study was the choice of appropriate controls. Because of the limited number of controls in the NECS, we had to resort to healthy controls from other genome-wide association studies (the Illumina control data set and the NECS controls where genotype data were generated in different labs with different SNP arrays) as other investigators have done [64]. Our stringent quality control approach and the genetic matching minimized the number of false positive associations, likely at the expense of missing some true positive associations. We decided to use genetic matching to reduce the effect of population stratification because our initial genome-wide association study that included all control subjects from the Illumina repository had a genomic control factor >1.3 suggesting substantial population stratification between cases and controls. Simulation studies that we published in [65] showed that matching is a good way to remove the effect of stratification without losing too much power. In addition, a traditional model that includes principal components from genome-wide principal component analysis would not be useful for prediction because the values of the principal components for new subjects would be missing. Our analysis does not show any systematic difference between results in the controls genotyped in our lab compared to healthy controls genotyped elsewhere ( Figures 5 and 8 ). Also, additional analyses using traditional principal-components approaches to control for population stratification suggest that no residual stratification is likely to confound the associations (Figures S9 and S10). However, only replication of these results in independent data from comparably old subjects by independent investigators will definitively validate the results and this approach.

In our study we included only Caucasian subjects and the extent to which this analysis applies to other racial groups is an open question.

Novel insights about the genetics of exceptional longevity

The large number of SNPs in our genetic risk model and the variety of genetic signatures confirm that exceptional longevity is influenced by the combined effects of a large number of SNPs. The genetic risk model implicates 130 genes, most of them known to play a role in various disease mechanisms ( Figure 5 ), and our findings suggest that different variants of these genes may have a protective role. The most intriguing examples are LMNA and WRN: while specific variants of these two genes determine progeria and accelerated aging, alternative variants may increase life span. About 50% of the SNPs in the genetic risk model are in intragenic regions and this also suggests that regulatory mechanisms play an important role in exceptional longevity. We also found that the sensitivity of the prediction in independent sets increases with the ages of centenarians, and therefore likely, the genetic contribution to lifespan increases with increasing ages of the centenarians.

Our analysis provides further insight about the role of APOE in survival to extreme ages. Although the SNP rs2075650 in TOMM40/APOE is the most significantly associated with exceptional longevity, the value of this SNP to identify who can live to 100 and older appears to be limited. The traces of sensitivity and specificity of the nested genetic models in Figure 4B show that the model with only this SNP has 85% sensitivity to predict exceptional longevity but only 26% specificity in the discovery set. We conducted an ROC analysis to show the poor predictive value of this SNP alone (Figure S19, AUC = 0.62). Also, sensitivity and specificity of the model with only this SNP are 85%/26% in the ELIXIR set, and 82%/23% in the second replication set. The traces of sensitivity/specificity of the models with increasing number of SNPs show that the predictive accuracy increases only when a substantial number of variants are added to the model that includes rs2075650 (Figure 4B). We also examined the changes in sensitivity/specificity when we removed this SNP from the list of 281, and dropping rs2075650 resulted in a loss of approximately 1% accuracy (88% sensitivity/specificity in the discovery set (AUC = 0.95); 55% sensitivity and 58% specificity in the ELIX set (AUC = 0.56); and 75% sensitivity and 60% specificity in the additional 60 centenarians and 2863 Illumina controls (AUC = 0.73)). These results are summarized in Figure S7. This SNP is only in weak linkage disequilibrium with the two SNPs that define the 3 alleles of APOE but its association with longevity was shown to be dependent on the APOE alleles in [66].
The reason for the low predictive value of rs2075650 alone is that the GG genotype of this SNP is rare in the population (genotype frequency 3%) but virtually absent in centenarians (genotype frequency 0.1%). Therefore, a carrier of the GG genotype is unlikely to become a centenarian, while predicting the outcome in carriers of the AA or AG genotypes is more difficult without additional genetic data.
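The asymmetry described here follows directly from Bayes' rule. The sketch below is an illustration rather than part of the study's analysis: it assumes that the 3% genotype frequency applies to the general population and it borrows the prevalence of exceptional longevity (1 in 5,000) quoted later in the Discussion. The function name is hypothetical.

```python
def posterior_el(freq_in_centenarians, freq_in_population, prevalence):
    """Bayes' rule: P(EL | genotype) = P(genotype | EL) * P(EL) / P(genotype)."""
    return freq_in_centenarians * prevalence / freq_in_population


prevalence = 1 / 5000  # assumed prevalence of living to 100+, from the Discussion

# GG genotype: 0.1% of centenarians versus 3% of the population.
p_gg = posterior_el(0.001, 0.03, prevalence)

# AA or AG genotypes: the complements, 99.9% versus 97%.
p_other = posterior_el(0.999, 0.97, prevalence)

print(f"GG:    {p_gg / prevalence:.3f} x baseline")     # prints GG:    0.033 x baseline
print(f"AA/AG: {p_other / prevalence:.3f} x baseline")  # prints AA/AG: 1.030 x baseline
```

Under these assumptions, carrying GG lowers the probability of exceptional longevity about 30-fold, whereas carrying AA or AG leaves it almost unchanged, which is why the SNP alone separates cases from controls poorly.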

The NECS previously showed that centenarians fall into different groups in terms of age of onset of age-related diseases: survivors (onset of aging disease ≤80 years), delayers (onset of aging disease between 80 and 100 years) and escapers (age of onset ≥100 years) [67]. This current analysis now shows that some of the centenarians carry genetic signatures that correlate with different ages of survival and suggests that the complexity of aging and the different patterns of survival to the age of 100 and older may be the result of different genetic profiles. Unlike the typical approach of finding individuals with a specific phenotype in common and then performing a genetic association study to discover genetic associations with the trait, our approach tries to dissect a complex phenotype into sub-phenotypes based on the genetic data. Our analysis is preliminary, based on a small sample, and needs to be replicated, but we hope that this new approach may prove useful in dissecting other complex genetic traits [68].

While large numbers of longevity associated variants appear to be necessary for extreme survival, we did not observe a substantial difference in the numbers of a large sample of known disease-associated variants carried by centenarians and controls ( Figure 10 , Table S3). The Leiden Longevity and Leiden 85+ Studies recently produced similar findings for alleles associated with specific age-related diseases amongst 85+ year olds and nonagenarians [69]. Furthermore, only 13 SNPs previously associated with common diseases in genome wide association studies reach statistical significance in the discovery set, and the risk alleles are significantly less frequent in centenarians than in controls (Table S4) [70], [71], [72], [73], [74], [75].

Risk alleles were derived from the GWAS catalogue at the NHGRI (downloaded in April 2011) and the Human Genome Mutation Database (HGMD). The boxplots display the rate of risk alleles carried by centenarians (red) and controls (blue). The diseases described are: lupus, cholesterol level (Chol), macular degeneration (MD), Parkinson's Disease (PD), Crohn's disease (chr), diabetes (diab), cardiovascular disease (CVD), cancer (canc), and Alzheimer's (AD). GWAS.pt is the group of alleles related to personality disorders that were found in GWAS; gwas.qt is the group of alleles related to QTL from GWASs and includes cholesterol, BMI, obesity, etc.; GWAS.cc is the group of risk alleles found from case/control GWASs and so includes, for example, cancer, PD, MD, etc.; cod is for coding variants from the HGMD; and all is the full set of 1214 variants. Table S3 reports the actual rates.

These preliminary data suggest that exceptional longevity may be the result of an enrichment of longevity associated variants that counter the effect of disease-risk alleles and contribute to the compression of morbidity and/or disability towards the end of very long lives [43].

In our analysis we also found that specific signatures correlated with the prevalence and age of onset of some age-related diseases, and further investigation is needed to understand how and why they predispose for exceptional longevity and for specific, different patterns of aging. The genetic signatures were built by using an ensemble of genetic risk models. The high sensitivity of these predictions in independent samples of centenarians shows that genetic data can indeed predict exceptional longevity without knowledge of any other risk factors. The high sensitivity is consistent with (1) theoretical results that show potentially high predictability of rare and highly heritable traits even when only 50% of the genetic variants that determine the trait are found [36] and (2) the accuracy of genetic risk models that have been developed to predict complex and highly heritable traits [37], [38], [39], [40], [41]. To quantify the amount of genetic variance in liability to exceptional longevity that is explained by our model, we used the online calculator http://gump.qimr.edu.au/genroc/ to translate the predictive accuracy measured by the AUC into the proportion of explained genetic variance on the liability scale [36]. Based on previous reports and the latest US 2010 Census (http://www.census.gov/prod/cen2010/briefs/c2010br-03.pdf), we estimated that the prevalence of exceptional longevity (living to 100+) is 1 in every 5,000 people, while the sibling relative risk for exceptional longevity ranges between 8 and 17 [9], [10]. With these numbers, we estimated that the maximum AUC of a genetic model of exceptional longevity ranges between 0.95 and 0.98, and our genetic model that reaches AUC = 0.74 in the second replication set (Figure S7) explains between 12% and 17% of the genetic variance on the liability scale. In the ELIXIR replication set, the AUC of our genetic risk model is 0.58 and this would represent 1–2% of explained genetic variance.
Since the ELIXIR set includes more nonagenarians than centenarians, and their prevalence in the population is 0.5% and the sibling relative risk of this trait is approximately 2.5, we repeated the calculations in this scenario and the 0.58 AUC translated into approximately 4% of the genetic variance on the liability scale. These results show that although we explained a good amount of genetic variability on the liability scale to live to very old ages, more than 80% of the heritability is still missing and remains to be explained, and more comprehensive genetic studies have the real potential to decipher the genetic basis of this complex phenotype.
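The conversion from AUC to explained genetic variance can be sketched with the standard liability-threshold relations. The code below is an assumption about the kind of calculation such a calculator performs, not its actual code: it uses the usual expression for heritability of liability from prevalence K and sibling relative risk, and the normal-approximation formula for the AUC of a genetic profile explaining a given share of liability variance. All function names are illustrative.

```python
from statistics import NormalDist

_N = NormalDist()  # standard normal distribution


def liability_params(K):
    """Threshold T, mean liability of cases i, and of non-cases v."""
    T = _N.inv_cdf(1.0 - K)
    i = _N.pdf(T) / K
    v = -i * K / (1.0 - K)
    return T, i, v


def heritability_from_sibling_risk(K, lambda_s):
    """Heritability of liability implied by prevalence and sibling relative risk."""
    T, i, _ = liability_params(K)
    T1 = _N.inv_cdf(1.0 - lambda_s * K)
    num = 2.0 * (T - T1 * (1.0 - (T * T - T1 * T1) * (1.0 - T / i)) ** 0.5)
    return num / (i + T1 * T1 * (i - T))


def auc_from_variance(K, h2_x):
    """AUC of a profile explaining variance h2_x (> 0) on the liability scale."""
    T, i, v = liability_params(K)
    spread = h2_x * ((1.0 - h2_x * i * (i - T)) + (1.0 - h2_x * v * (v - T)))
    return _N.cdf((i - v) * h2_x / spread ** 0.5)


def fraction_explained(K, lambda_s, auc, tol=1e-10):
    """Share of genetic variance matching an observed AUC, found by bisection."""
    h2 = heritability_from_sibling_risk(K, lambda_s)
    lo, hi = 0.0, h2
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if auc_from_variance(K, mid) < auc:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0 / h2


K = 1 / 5000  # prevalence of living to 100+, as quoted in the text
for lambda_s in (8, 17):
    h2 = heritability_from_sibling_risk(K, lambda_s)
    print(lambda_s, auc_from_variance(K, h2), fraction_explained(K, lambda_s, 0.74))
```

Worked by hand, these inputs give a maximum AUC near 0.95 to 0.98 and an explained share near 12% to 17% for AUC = 0.74, close to the ranges quoted above.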

Some centenarians in our study however lack a genetic signature conducive to exceptional longevity. The strong clustering of exceptional longevity in some of their families suggests that these individuals harbor rare or private alleles associated with exceptional longevity. This in turn would suggest that sequencing these individuals could be particularly fruitful.

The specificity of our classification rule is 60–61% in the independent sets and is comparable to other genetic studies of complex traits [76], [77], [78]. Although the specificity is better than random, it would not be useful as a diagnostic test. The decreased specificity in this study could be explained by the fact that the control subjects from the Illumina database are primarily made up of healthy controls used for other genome-wide association studies and therefore the control data set may be enriched for healthy aging subjects.

Our finding that about 17% of Illumina controls have signatures with >70% chance of exceptional longevity (Figure S14) suggests that a substantial proportion of this group have a genetic predisposition to exceptional longevity. If this observation is replicated in more representative samples of the population, it could in part explain why centenarians are the fastest growing age group in developed countries [79], [80]. At the turn of the last century, infant mortality was approximately 25%. As public health measures markedly reduced infant mortality rates in the first quarter of the 20th century, a greater and greater proportion of the population had the opportunity to age into middle and older ages. If nearly one fifth of the population had an increased genetic predisposition to survive to 100 years, it is understandable why the number of centenarians is growing at such a relatively high rate.

Although sensitivity and specificity of our classification rule may improve with a more comprehensive knowledge of human genomic variation, its limitations could also suggest that environmental factors (e.g., lifestyle) contribute in important ways to the ability of people to survive to very old ages. Replications of these results in independent cohorts will help to answer these questions.

Materials and Methods

Ethics statement

NECS and Elixir subjects were enrolled under similar protocols approved by Boston Medical Center's Institutional Review Board and the Western Institutional Review Board, respectively. Written informed consent was obtained for all NECS and ELIXIR subjects.

Study populations

The New England Centenarian Study (NECS) began in 1994 as a population-based study of all centenarians living within 8 towns in the Boston area [81]. Since ∼2000, the NECS expanded enrollment to include centenarians from throughout the USA (www.bumc.bu.edu/centenarian). Potential subjects are ascertained via voting records and media alerts. Subjects are sent a demographic data, life style choices, medical history and functional status questionnaire, family pedigree form and blood kit. A dementia scale test is administered over the telephone. The study is still actively recruiting centenarians, with an average of 50 subjects enrolled per year.

Elixir Pharmaceuticals American Centenarians

In 2001–2003, Elixir Pharmaceuticals (co-founded by Leonard Guarente and Cynthia Kenyon) conducted a U.S. nation-wide centenarian recruitment effort. Elixir's centenarian research effort has been inactive since 2006; DNA and data are stored and have been shared with the NECS, where genotyping of all the samples was performed in 2008. Recruitment and data collection were modeled after the NECS protocol.

NECS controls

The NECS has recruited approximately 450 referent subjects comprised of spouses of centenarian offspring and children of parents who died at the mean age of 73 years, with an age at enrollment ranging between 53 and 90 years.

Illumina controls

We identified 3,613 Caucasian healthy controls from the Illumina control database (iControlDB, http://www.illumina.com/downloads/PurposeDocument.pdf). No phenotypic information is available for subjects selected from the Illumina repository, except for gender (∼60% females) and age at blood draw for some subjects (age range 0–75 years).

The Coriell NINDS control sample in the Parkinson's disease (PD) set is described elsewhere [21].

Subjects from these studies were combined to generate a discovery and replication set using genetic matching (see below) and an additional replication set in which subjects were not genetically matched.

Discovery set (NECS)

This consisted of 801 cases and 914 controls. Cases are long lived individuals from the NECS who were born between 1880 and 1910 and reached an age at death between 95 and 119 (mean 104±3, median 104). Controls were comprised of 673 healthy controls from the Illumina database (Illumina I), and 241 referent subjects from the NECS. Controls were selected to match the genetic background of cases.

Replication 1 (ELIX)

This is comprised of 253 long lived individuals enrolled from ELIXIR Pharmaceutical (mean age 101±3, median 100), and 341 healthy controls from the Illumina database (Illumina II). Controls were selected to match the genetic background of the 253 cases in this set.

Replication 2 (NECS 2)

This set comprises 60 NECS individuals and 2863 healthy controls from the Illumina database (Illumina III). In this set, no genetic matching was performed. The 60 centenarians include 39 subjects of European ancestry enrolled between June 2009 and September 2010 (age range 100–114, mean age 108) plus 21 centenarians also of European ancestry (age range 101–115, mean age 107) that were not included in the discovery set during the genetic matching.

SNP genotyping

We analyzed 1 µg of genomic DNA for NECS and ELIXIR samples, using the Illumina 370 CNV chip, v.1, the Human610-Quad v1.0, and the Human 1 M v1.0 (Illumina, San Diego, CA). We used the Beadstudio software for genotype calling using the top-strand rule, so that SNP alleles are coded using lexicographical order (typically A/G and A/C). The data in the Illumina repository were generated with different SNP arrays (300 and 550) and we selected the SNPs that were common to all platforms. SNPs with reversed alleles, or monomorphic in some of the arrays, were detected by comparing allele frequencies in controls (300 vs 550, 370 vs 550) and in centenarians (370 vs 1 M, 370 vs 610). Table 2 summarizes the arrays used.

Table 2. Breakdown of genotyped samples by Illumina SNP array type (columns 3–7), laboratory (column 8), and case/control/study status (rows).

		370	610	1 M	300	550	Lab
Centenarians	NECS	583	102	176	0	0	BU
	ELIXIR	209	44	0	0	0	BU
Controls	NECS	237	4	0	0	0	BU
	Illumina I	0	0	0	89	584	unknown
	Illumina II	0	0	0	62	279	unknown
	Illumina III	0	0	0	574	2289	unknown
	Coriell NINDs	867	0	0	0	0	CIDR

The columns of the table denote the Illumina array types. The column “Lab” denotes the laboratory that performed the genotyping: BU = Boston University; CIDR = Center for Inherited Disease Research. The row Illumina I denotes the control samples included in the discovery set; Illumina II denotes the control samples included in the first replication set, and Illumina III denotes the residual samples from the Illumina repository; Coriell NINDs denotes the neurologically normal controls.

Quality Control

Rules for sample inclusion

Raw GWAS data were clustered using standard Illumina cluster definitions in array-specific batches (all 370 samples together, all 1 M samples together, all 610 samples together). Specifically, we performed sample-based QC checks and produced QC statistics to compute sample call rates (CR). We eliminated all samples with CR<96.5% and remaining samples were reclustered. After re-clustering, we included the “excluded” samples using this new cluster file. If the previously excluded samples had a CR above 93% they were included in the final analysis.
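The two-stage sample inclusion rule can be written as a small predicate. This is a sketch of the rule as described above, with a hypothetical function name. It assumes that samples passing the first threshold are retained without a second check and that the boundary values behave as coded.

```python
def keep_sample(initial_cr, reclustered_cr=None,
                first_cutoff=0.965, rescue_cutoff=0.93):
    """Two-stage inclusion rule based on the sample call rate (CR).

    initial_cr:     CR under the standard Illumina cluster definitions.
    reclustered_cr: CR of an initially excluded sample when called with the
                    cluster file rebuilt from the passing samples.
    """
    if initial_cr >= first_cutoff:
        # Not eliminated in the first pass (only CR < 96.5% was eliminated).
        return True
    # Previously excluded samples are rescued if CR exceeds 93% after reclustering.
    return reclustered_cr is not None and reclustered_cr > rescue_cutoff


print(keep_sample(0.970))         # prints True
print(keep_sample(0.950, 0.940))  # prints True
print(keep_sample(0.950, 0.920))  # prints False
```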

We also used the genome-wide identity by descent analysis in PLINK [82] to discover unknown relatedness and to estimate the error rate from the number of mismatches between replicated samples (2%). With this analysis we discovered one subject enrolled in both the NECS and ELIX studies, whom we removed from the ELIX set. We also removed samples for which the gender inferred from heterozygosity of the X chromosome was inconsistent with the gender recorded in the database.

Rules for SNP inclusion

SNPs were included in the final clean data set if all these conditions were satisfied:

CR>98% in each array type (300, 370, 550, 610, 1 M) in both centenarians and controls of the discovery set, and overall CR>98% in all samples included in discovery and replication sets.
Cluster separation score >0.25.
Excess heterozygosity score between −0.3 and 0.3.
Hardy Weinberg equilibrium χ² statistics in controls <50.
Minor allele frequency difference between any pair of array type <0.2

A total of 243,980 SNPs were selected for the analysis.
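The SNP inclusion rules listed above amount to a conjunction of threshold checks. The sketch below assumes the per-SNP QC metrics have already been computed and summarized; the `SnpQc` record and its field names are hypothetical, and the heterozygosity bounds are treated as exclusive, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class SnpQc:
    """Pre-computed QC summaries for one SNP (hypothetical record layout)."""
    min_array_call_rate: float    # lowest CR over array types, cases and controls
    overall_call_rate: float      # CR over all discovery and replication samples
    cluster_separation: float     # cluster separation score
    excess_heterozygosity: float  # excess heterozygosity score
    hwe_chi2_controls: float      # Hardy-Weinberg chi-square statistic in controls
    max_maf_difference: float     # largest MAF difference between two array types


def passes_snp_qc(s: SnpQc) -> bool:
    """Apply all five inclusion conditions; a SNP must satisfy every one."""
    return (
        s.min_array_call_rate > 0.98
        and s.overall_call_rate > 0.98
        and s.cluster_separation > 0.25
        and -0.3 < s.excess_heterozygosity < 0.3
        and s.hwe_chi2_controls < 50
        and s.max_maf_difference < 0.2
    )


good = SnpQc(0.995, 0.997, 0.60, 0.02, 1.3, 0.01)
bad = SnpQc(0.995, 0.997, 0.60, 0.02, 75.0, 0.01)  # fails the HWE condition
print(passes_snp_qc(good), passes_snp_qc(bad))     # prints True False
```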

Assessment of between arrays bias and batch effects

The 610-Quad is part of the new line of Infinium high density whole-genome genotyping products, and has undergone substantial design changes compared to the Human CNV370, Human 1 M, HumanHap550-Duo and HumanHap300. We used data from 32 samples that had been genotyped with both the Human CNV370 and 610-Quad Illumina arrays and that underwent the same QC procedure, to test for systematic bias between the two arrays. 345,219 SNPs were in common between the two arrays but only 294,153 SNPs had CR>0.97 (so at least 31 genotypes were called) in both arrays after reclustering. In this set, 915 SNPs had 2 or more different genotypes, and only 28 SNPs had allele frequencies that differed by more than 0.05. The plot of allele frequencies (Figure S20) suggests that there is no systematic bias between arrays but rather sporadic errors that can be identified by plotting allele frequencies.

We tested the agreement between allele coding in the other arrays by comparing the allele frequencies. See Figure S21. The plots rule out general bias between arrays and show that SNPs with reversed alleles were removed.

The additional sample of 60 centenarians included 39 subjects that were genotyped in September 2010, using the 610-Quad array. To be able to test for batch effects, we genotyped the 39 samples in a batch of 48 that included two replicated samples, and 7 samples that had been genotyped with the Human 1 M in the original analysis. The agreement between genotype calls in the 7 samples genotyped with the 610-Quad and the Human 1 M ranged between 99.2% and 99.7%.

Genetic matching of controls

Population stratification was assumed to be a serious problem with the centenarian and control data, because a large proportion of NECS subjects were immigrants from Europe, and the patterns of immigration at the end of the 19th century may lead to an overrepresentation of some European ethnic groups [83]. In fact, an initial GWAS analysis in which we randomly selected controls from the Illumina repository pointed to substantial stratification (genomic control factor ∼1.3). We therefore reduced possible confounding due to population stratification by selecting controls to match the genetic backgrounds of NECS subjects.
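For readers unfamiliar with the genomic control factor, the sketch below uses its usual definition: the median of the observed 1-df association chi-square statistics divided by the median expected under the null. The text does not state how the factor was computed, so this definition is an assumption, and the input values are hypothetical.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom (about 0.455),
# obtained as the square of the 75th percentile of the standard normal.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_control_factor(chi2_stats):
    """Ratio of the observed median chi-square to its null expectation.

    Values near 1 suggest little inflation; values such as 1.3 suggest
    substantial stratification or another systematic bias.
    """
    return median(chi2_stats) / CHI2_1DF_MEDIAN


# Hypothetical statistics whose median is inflated by 30%.
inflated = [0.1, 1.3 * CHI2_1DF_MEDIAN, 4.2]
print(round(genomic_control_factor(inflated), 2))  # prints 1.3
```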

To identify the population substructure in the centenarians and controls, we ran a principal components analysis with the software EIGENSOFT [84], using GWAS SNP data for SNPs common to the NECS and Illumina datasets that had a SNP call rate>0.95 and MAF>0.05. SNPs in strong LD were pruned with the program PLINK, using a window of 50 SNPs that slid 5 SNPs at a time and removing 1 SNP from each pair of SNPs with r²>0.30, which left 97,508 SNPs for this analysis. We found that the top several principal components (PCs) correlated with genetic ancestry and formed a pattern similar to that seen in other studies of subjects of European ancestry [84], [85]. However, the analysis also showed that the Illumina controls contain many more ethnic groups than the NECS (Figure S1), and the inclusion of these control subjects might therefore inflate false positive associations.

We used the clustering algorithm in [65] to group individuals with similar ancestry into the same cluster. The algorithm uses k-means clustering to iteratively group individuals into a number of clusters varying from 2 to 30 and then computes a scoring index for each number of clusters that accounts for the accuracy of the subjects' cluster assignments, the stability of k-means clustering from iteration to iteration and the ability of the algorithm to maximize the distance between subjects allocated to different clusters. This analysis identified 20 clusters corresponding to sub-populations with different genetic structure, and Figure S1 shows the details of the clusters and their ethnic labels based on the known mother tongue and ancestry of the cluster members. NECS cases were present in only 16 of the 20 clusters, as shown in Table 3, which displays the frequency of NECS cases (row 2), NECS controls (row 3) and Illumina controls (row 4). For example, no centenarians were allocated to cluster 1 or 15 (empty and full red dots in Figure S1, which may represent Franks and Celtics-Alpine ethnicities).

To increase the number of controls, we randomly selected additional Illumina controls from those 16 clusters so as to maintain the same ratio of cases to controls in each cluster. For example, we sampled 4 additional Illumina controls from cluster 2, so that the case/control ratio in cluster 2 was 21/24 = 0.88; similarly, we sampled 19 additional controls from cluster 9, so that the case/control ratio in cluster 9 was 31/35 = 0.88, and so on.
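The arithmetic behind the two examples can be reproduced with a short Python sketch. The target ratio of 0.88 is inferred from the two worked examples rather than stated as a design parameter, and the rounding rule, the function names and the seeded random draw are assumptions for illustration only.

```python
import random


def additional_controls(cases, necs_controls, target_ratio=0.88):
    """Extra controls needed so that cases / total controls is near target_ratio.

    The total number of controls is rounded to the nearest integer (an
    assumed rule); clusters that already have enough controls need none.
    """
    needed_total = round(cases / target_ratio)
    return max(0, needed_total - necs_controls)


def sample_controls(pool, n, seed=0):
    """Randomly draw n control ids from a cluster's pool, without replacement."""
    rng = random.Random(seed)
    return rng.sample(pool, min(n, len(pool)))


# Counts for clusters 2 and 9 taken from Table 3: (cases, NECS controls, Illumina pool).
clusters = {2: (21, 20, 310), 9: (31, 16, 277)}

for cluster, (cases, controls, pool_size) in clusters.items():
    extra = additional_controls(cases, controls)
    drawn = sample_controls(list(range(pool_size)), extra)
    print(cluster, extra, len(drawn))
```

With these counts the sketch asks for 4 additional controls in cluster 2 and 19 in cluster 9, matching the numbers quoted in the text.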

Table 3. Distribution of NECS cases (row 2), NECS controls (row 3) and Illumina controls (row 4) in clusters of genetic ethnicity (columns).

Cluster	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
Cent	0	21	34	79	27	189	6	0	31	102	22	20	3	94	0	15	94	34	0	25
Control	2	20	8	14	30	38	2	1	16	19	18	3	4	12	4	3	29	7	0	12
Illumina	90	310	192	47	278	168	223	104	277	288	200	120	173	132	169	54	266	154	118	250

The table shows the 20 clusters of genetic ethnicity that were discovered using a clustering algorithm described in reference [20]. Note that no centenarians were allocated to cluster 1 or 15. These clusters are represented by full red dots in Figure S1 and denote Franks and Celtics-Alpine ethnicities.