Abstract
Genome-wide studies often exclude family members, even though they are a valuable source of information. We identified parent–offspring pairs, siblings and couples in the UK Biobank and implemented a family-based DNA-derived heritability method to capture additional genetic effects and multiple sources of environmental influence on neuroticism and years of education. Compared to estimates from unrelated individuals, total heritability increased from 10 to 27% and from 17 to 56% for neuroticism and education respectively by including family-based genetic effects. We detected no family environmental influences on neuroticism. The couple similarity variance component explained 35% of the variation in years of education, probably reflecting assortative mating. Overall, our genetic and environmental estimates closely replicate previous findings from an independent sample. However, more research is required to dissect contributions to the additional heritability by rare and structural genetic effects, assortative mating, and residual environmental confounding. The latter is especially relevant for years of education, a highly socially contingent variable, for which our heritability estimate is at the upper end of twin estimates in the literature. Family-based genetic effects could be harnessed to improve polygenic prediction.
Electronic supplementary material
The online version of this article (10.1007/s10519-019-09984-5) contains supplementary material, which is available to authorized users.
Keywords: Genomics, Family data, Heritability, Neuroticism, Education
Introduction
Heritability measures the proportion of individual differences in a trait explained by inherited genetic variation, and is traditionally estimated by comparing the resemblance of identical and non-identical twins (Knopik et al. 2017). Researchers can also now estimate single nucleotide polymorphism (SNP) heritability, the variance explained by the additive effects of common genetic variants tagged by a genotyping array (Yang et al. 2010, 2011). SNP heritability is expected to be less than twin and family-based heritability, since the former only estimates the additive effects of measured common variants, plus variants that are correlated (i.e. in linkage disequilibrium) with them, and ignores influences of DNA sequence differences that are rare and/or not well tagged by genotyping arrays. Since genome-wide association studies (GWAS) also only consider the additive effects of common variants, SNP heritability provides an estimate of the total genetic effect that could be identified with well-powered association studies of a given phenotype in a given population. Given the importance of SNP heritability, researchers have investigated approaches to maximise the accuracy of estimates, beyond increasing sample sizes or denser genotyping (van den Berg et al. 2014; Laurin et al. 2015; Cheesman et al. 2018; van der Sluis et al. 2010).
The dominant method for the estimation of SNP heritability, Genomic-RElatedness-based restricted Maximum-Likelihood (GREML), takes advantage of small genetic differences among many unrelated individuals to predict trait similarity (Yang et al. 2011). The effects of genetic variants that are not in linkage disequilibrium with common genotyped SNPs will not be captured using this method. However, when GREML is applied to family data, the higher genetic relatedness among relatives increases the correlation between genotyped SNPs and causal variants, because they are more likely to be inherited together (Zaitlen et al. 2013; Xia et al. 2016). This increase in linked variants helps to capture additional genetic variation not normally picked up in population studies of unrelated people, such as rare single nucleotide variants, copy number variants, and structural variants.
An extension of the method that uses family data, GREML-KIN, allows the estimation of two categories of genetic influence: population-level common variant heritability, plus additional heritability that is associated with kinship (Xia et al. 2016; Zaitlen et al. 2013). The first estimate is similar to that derived from GREML using unrelated individuals. The latter heritability estimate captures additional family-based effects, due to the increased correlation between genotyped SNPs and causal variants among relatives. The GREML-KIN method also allows for effects of environment sharing amongst family members, siblings and couples. This inclusion is important as it attempts to remove confounding that results from people who are more genetically related having more similar environments and higher phenotypic resemblance than people who are less related.
One study applied GREML-KIN to neuroticism and years of education (Hill et al. 2018) in Generation Scotland, a large family-based study (Smith et al. 2006). Neuroticism is a personality trait characterised by readily experiencing negative emotions. It is a strong predictor of common mental disorders, occupational attainment and mortality (Ormel et al. 2013). Years of education is also a complex trait that shows significant associations with diverse socioeconomic and health outcomes (Mackenbach et al. 2008). Twin studies have repeatedly demonstrated substantial broad sense heritability for neuroticism (40–60%) (Hettema et al. 2006) and educational attainment (40–50%) (Branigan et al. 2013). Extended twin family designs estimate the narrow sense heritability of neuroticism to be ~ 25% (Coventry and Keller 2005). These designs provide a better benchmark for additive genetic influence than the classical twin design, since they incorporate relatives of twin pairs to allow explicit separation of additive genetic influences from non-additivity, shared environment, assortative mating, and gene-environment correlation. We are not aware of applications of the design to educational attainment, but one study estimated the narrow sense heritability of adult intelligence, a comparable phenotype, to be 44% (Vinkhuyzen et al. 2012). Neuroticism and educational attainment both index a range of important traits, and are available in numerous large datasets. As a result, they have been subject to some of the largest GWA studies of psychological traits (N = 449,484 for neuroticism (Nagel et al. 2018; Luciano et al. 2018); N = 1.1 million for years of education (Lee et al. 2018)). Nonetheless, only ~ 10% of the variance in neuroticism can be explained by the additive effects of common SNPs (Luciano et al. 2018; Nagel et al. 2018). Estimates of SNP heritability for years of education also fall substantially short of twin and pedigree estimates (14.7%; Lee et al. 2018).
GREML-KIN analyses in Generation Scotland revealed a large increase in heritability compared to a standard GREML analysis of unrelated individuals (Hill et al. 2018). For neuroticism, the total heritability from the best-fitting model was 30%, primarily accounted for by kin-based genetic effects (19%), as well as common variant effects tagged in studies of unrelated individuals (11%—akin to SNP heritability). They detected no family environment effects. For years of education, there was a strong kin genetic component (28%) in addition to common genetic influence (16%), plus substantial variance explained by sibling and couple similarity (11% and 31%, respectively). The findings align well with evidence from twin studies that the family environment influences education-related outcomes (~ 36% of the variance; Branigan et al. 2013) but not neuroticism (Polderman et al. 2015). If these results are replicated, then the total DNA-based heritability of neuroticism and education would be close to twin and pedigree study estimates. Moreover, a replication of these results would suggest that most of the variance in education (86%) can be captured with measured parameters. Notably, for each, the larger component of the genetic contributions results from less common variants not identified in genomic studies of unrelated individuals. The authors also found that rarer variants (0.1–1% in frequency) explained 12% of the variance in education, but did not influence variation in neuroticism (Hill et al. 2018).
Our study aimed to estimate familial influences on neuroticism and years of education in the UK Biobank, using GREML-KIN. We capitalise on the presence of thousands of family members in the UK Biobank to shed light on the genetic and environmental architecture of these two phenotypes. To robustly replicate the previous Generation Scotland study, we ensured that phenotype definitions were as similar as possible, and that there was no sample overlap. Based on previous research, we hypothesised that neuroticism and years of education would show increased heritability by exploiting the higher linkage disequilibrium within families. Our secondary analyses aimed to validate our kin-based estimates and specifically quantify the contribution of rarer genetic variants to the heritability of neuroticism and years of education. For this, we used the LDMS-I method, and stratified imputed genetic variants by their individual level of linkage disequilibrium and allele frequency to allow estimation of the variance explained by rarer variants (Evans et al. 2018).
Methods
Sample
Analyses were conducted using the UK Biobank, a resource containing rich phenotype and genotype data on ~ 500,000 individuals aged between 40 and 70 (Allen et al. 2014). To minimise confounding from population stratification, analyses were limited to white British individuals. We identified families and restricted heritability analyses to individuals with at least one family member in the UK Biobank, as well as phenotype data on neuroticism and/or years of education. Previous publications suggested a sample size of ~ 40,000 pairs of family members (parent-offspring, siblings, and couples) within the full dataset (Bycroft et al. 2018).
Genotyping
Genome-wide genetic data from the full release of the UK Biobank data were collected and processed according to the quality control pipeline (Bycroft et al. 2018). For primary GREML-KIN analyses, we used genotyped or imputed SNPs with minor allele frequency > 0.01 and imputation confidence (INFO) score > 0.4, indicating well imputed variants. Due to computing memory constraints, we used PLINK2 to prune down to 241,678 variants in approximate linkage equilibrium using an r2 threshold of 0.2 (Chang et al. 2015) before calculating genetic relatedness matrices.
Measures
Neuroticism was measured as a continuous trait, captured with 12 questionnaire items such as “Does your mood often go up and down?”, “Would you call yourself tense or ‘highly strung’?”. This trait was defined previously in the UK Biobank (Smith et al. 2013, 2016).
The years of education variable was defined according to ISCED categories, as in previous genomic studies in the UK Biobank and other samples (Hill et al. 2018; Lee et al. 2018). The seven response categories were: none of the above (no qualifications) = 7 years of education; CSEs or equivalent = 10 years; O levels/GCSEs or equivalent = 10 years; A levels/AS levels or equivalent = 13 years; other professional qualification = 15 years; NVQ or HNC or equivalent = 19 years; college or university degree = 20 years of education. To test whether the number of response categories affected heritability estimates, as has been shown previously in the UK Biobank (Lee et al. 2018), we ran sensitivity analyses using a ‘coarsened’ years of education variable, plus a binary variable reflecting college completion (see Supplementary Fig. 1).
In all analyses the following covariates were included: age, sex, the first 40 ancestry principal components from the UK Biobank (Bycroft et al. 2018), genotyping batch, and assessment centre.
Analyses
Identification of family members
Sibling and parent–offspring pairs were identified using relatedness files (KING n.d.) received with the UK Biobank data. Relatedness between two individuals is summarised by a kinship coefficient, which is defined as the probability that a random allele from an individual is identical by descent (IBD) with an allele at the same locus from the other individual (i.e. identical and inherited from a common ancestor). For example, in parent–offspring duos, kinship is ~ 0.25, as it is the probability that a random allele in a child is from one specific parent (0.5 since humans are diploid) multiplied by the probability that the parental allele from that parent is passed to the child (0.5; independent of the first probability). To allow for normal variation in within-pair similarity, first-degree relatives are therefore defined as pairs that have a pairwise kinship coefficient of ≥ 0.177 and ≤ 0.354.
To distinguish parent–offspring pairs from sibling pairs, we plotted the proportion of SNPs with zero identity-by-state (IBS0) within the kinship bounds of 0.177–0.354 (Fig. 1). IBS describes the probability that alleles are the same regardless of common ancestry. When comparing two individuals, variants are termed IBS0 if neither allele is shared by the pair. Parent–offspring pairs have IBS0 = ~0 since they share one allele inherited by descent (IBD) in all positions on autosomes. In other words, an individual is unlikely to share zero variants with one of their parents, unless for example both copies come from the other parent (uniparental disomy), or unless there are genotyping errors meaning that shared variation is not called. In contrast, siblings have a higher pairwise IBS0.
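The classification rule described above can be illustrated with a minimal sketch. The kinship bounds are those given in the text and the IBS0 threshold is the one reported in the Results; the function name, the tuple layout and the example pairs are hypothetical and are not taken from the analysis scripts.

```python
# Thresholds taken from the text; everything else is illustrative.
KINSHIP_MIN = 0.177       # lower kinship bound for first-degree relatives
KINSHIP_MAX = 0.354       # upper kinship bound for first-degree relatives
IBS0_SIBLING_MIN = 0.001  # pairs above this IBS0 value are treated as siblings


def classify_pair(kinship, ibs0):
    """Label a pair as 'sibling' or 'parent-offspring'; None if not first-degree."""
    if not (KINSHIP_MIN <= kinship <= KINSHIP_MAX):
        return None
    return "sibling" if ibs0 > IBS0_SIBLING_MIN else "parent-offspring"


# Hypothetical pairs: (id1, id2, kinship, IBS0)
pairs = [
    ("A", "B", 0.25, 0.0000),  # shares one allele IBD everywhere, so IBS0 is ~0
    ("C", "D", 0.24, 0.0120),  # some sites with no shared allele
    ("E", "F", 0.05, 0.0300),  # outside the first-degree kinship bounds
]
labels = {(a, b): classify_pair(k, i) for a, b, k, i in pairs}
```

In this toy input the first pair is labelled parent–offspring, the second sibling, and the third is left unclassified.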
Couples were identified as pairs of unrelated opposite-sex individuals matching exactly on a string of household variables: social deprivation (Townsend Deprivation Index), assessment centre, income, time at address, smoker in household, type of accommodation, relatives in household, number in household. This approach of matching on household variables was used in a recent study of assortative mating in the UK Biobank (Yengo et al. 2018). We note that there is potential for type 1 error: it is possible, especially in densely populated areas, that people could match on all eight variables by chance.
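A minimal sketch of this exact-match rule is given below. The field names are hypothetical stand-ins for the eight household variables, and restricting matches to groups of exactly two people is an assumption made for the illustration, not a detail stated in the text.

```python
from collections import defaultdict

# Hypothetical field names standing in for the eight household variables.
HOUSEHOLD_KEYS = [
    "townsend", "centre", "income", "time_at_address",
    "smoker_in_household", "accommodation",
    "relatives_in_household", "n_in_household",
]


def find_couples(people, related_pairs):
    """Return (id, id) tuples for unrelated opposite-sex pairs that match
    exactly on every household variable.

    people: list of dicts with 'id', 'sex' and the HOUSEHOLD_KEYS fields.
    related_pairs: set of frozenset({id1, id2}) for genetically related pairs.
    """
    households = defaultdict(list)
    for person in people:
        key = tuple(person[k] for k in HOUSEHOLD_KEYS)
        households[key].append(person)

    couples = []
    for members in households.values():
        if len(members) != 2:  # assumption: ignore ambiguous larger groups
            continue
        a, b = members
        unrelated = frozenset((a["id"], b["id"])) not in related_pairs
        if a["sex"] != b["sex"] and unrelated:
            couples.append((a["id"], b["id"]))
    return couples


# Toy example with made-up values.
base = dict(townsend=1.2, centre=11, income=3, time_at_address=10,
            smoker_in_household=0, accommodation=1,
            relatives_in_household=1, n_in_household=2)
people = [
    dict(base, id=1, sex="F"),
    dict(base, id=2, sex="M"),
    dict(base, id=3, sex="F", income=4),  # differs on income, so no match
]
couples = find_couples(people, related_pairs=set())
```

Because matching is exact, two strangers with identical values on all eight variables would also be paired, which is the source of the type 1 error noted above.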
Kin-based SNP heritability method accounting for environmental similarity: GREML-KIN
GREML requires the calculation of genetic similarity for each pair of individuals across genotyped variants. This matrix of genomic similarity is compared to a matrix of pairwise phenotypic similarity using a random-effects mixed linear model, such that the variance of a trait can be decomposed into genetic and residual components, using maximum likelihood. Ordinarily, GREML is applied in samples of unrelated individuals and has a single common genetic variance component.
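For reference, the standard single-component model can be written as follows. The notation is the conventional one for linear mixed models and is an assumption here, since the symbols are not defined in the text: y is the phenotype vector, X the covariate matrix with fixed effects β, g the vector of total additive genetic effects, G the genomic relatedness matrix, and I an identity matrix.

```latex
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{g} + \boldsymbol{\varepsilon},
\qquad
\operatorname{var}(\mathbf{y}) = \mathbf{G}\,\sigma^2_g + \mathbf{I}\,\sigma^2_e,
\qquad
h^2_{\mathrm{SNP}} = \frac{\sigma^2_g}{\sigma^2_g + \sigma^2_e}
```

SNP heritability is then the share of phenotypic variance attributed to the genetic variance component.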
GREML-KIN is an extension of GREML that estimates the variance explained by multiple genetic and non-genetic sources. The method uses a linear mixed model to fit five matrices: G = common genotyped SNP effects; K = kin genetic effects; F = nuclear family (siblings, parent-offspring, and couples); S = siblings; C = couples. For the G matrix, we calculated genetic similarity for all possible pairs of individuals across all genotyped SNPs. As GREML-KIN allows for effects of the family environment, no relatedness cut-off was applied to the G matrix (unlike the standard GREML model applied only to unrelated individuals, where a cut-off of < 0.025 is typically used). The K matrix is a modified G matrix, containing only information on relatives (cut-off > 0.025), since values for unrelated pairs are set to 0. Family, sibling and couple (F, S and C) similarity matrices were created in the format required for GCTA. Elements in the genomic relatedness matrix were replaced by 0 if a pair did not have the specific relationship, and by 1 if a pair did have the relationship or if the element represented an individual’s relatedness to themselves.
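The construction of the K matrix and of the 0/1 relationship matrices can be sketched on a toy example. This is only an illustration using nested lists; the function names and the three-person G matrix are hypothetical, and the real matrices were built in the format GCTA requires.

```python
RELATED_CUTOFF = 0.025  # relatedness cut-off given in the text


def kin_matrix(G, cutoff=RELATED_CUTOFF):
    """Copy of G in which entries for unrelated pairs (<= cutoff) are set to 0.
    Diagonal entries are close to 1, so they are always retained."""
    return [[value if value > cutoff else 0.0 for value in row] for row in G]


def indicator_matrix(n, pairs):
    """n x n matrix with 1 on the diagonal and for each listed pair, else 0."""
    M = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in pairs:
        M[i][j] = M[j][i] = 1.0
    return M


# Toy G matrix: persons 0 and 1 are siblings, person 2 is unrelated to both.
G = [
    [1.00, 0.50, 0.010],
    [0.50, 0.98, -0.004],
    [0.010, -0.004, 1.02],
]
K = kin_matrix(G)                       # off-diagonals kept only for relatives
S = indicator_matrix(3, pairs=[(0, 1)])  # sibling similarity matrix
C = indicator_matrix(3, pairs=[])        # no couples in this toy sample
```

Here K keeps the sibling entry of 0.50 and zeroes the small values involving the unrelated person, while S marks the sibling pair with 1.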
Importantly, the variance components are not purely ‘genetic’ and ‘environmental’. The sibling and couple environment sharing matrices likely pick up variance due to other processes that inflate covariance between relatives, including dominance and assortative mating, respectively.
Assortative mating refers to greater similarity between partners than is expected by chance. This can result from multiple mechanisms, including direct choice based on phenotype, social homogamy, and convergence over time due to shared environments. Assortative mating amongst couples in the UK Biobank sample will be captured by fitting the couple similarity matrix (C). However, to the extent that phenotypic similarity among the parents of the UK Biobank participants reflects their genetic similarity, it is also likely that assortative mating in their parents will contribute to the additive genetic variance in our estimates (G + K). This is because assortative mating induces a positive correlation between trait-increasing alleles (‘gametic phase disequilibrium’), which elevates trait-specific genetic and phenotypic variance (Peyrot et al. 2016).
The genetic variance components are also likely to include some bias from the indirect effects of genetic variants shared with relatives. Genetic variants in the parents do not only have direct effects on offspring traits by being transmitted, but they also have indirect influences on offspring traits through the environment that they provide for their children. This can bias SNP-based heritability estimates (Young et al. 2018).
The residual component includes sources of variance that are not captured by the G, K, F, S or C matrices, particularly other environmental influences (idiosyncratic, individual-specific environments or perceptions that are not shared by family members) and error.
To identify the best-fitting model for each trait, we ran a model for every possible combination of variance components (31 models), and compared them with backwards stepwise likelihood ratio testing, starting with the full model and dropping non-significant parameters.
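The count of 31 follows from taking every non-empty subset of the five variance components (2^5 − 1). The sketch below enumerates these subsets and computes a likelihood ratio test p-value for dropping a single component. The function names are hypothetical, the log-likelihoods would come from the fitted models, and the plain one-degree-of-freedom chi-square reference distribution is an assumption of this sketch; tests of variance components on the boundary are often referred to a mixture distribution instead.

```python
import math
from itertools import combinations

COMPONENTS = ["G", "K", "F", "S", "C"]


def all_models(components=COMPONENTS):
    """Every non-empty combination of variance components."""
    return [
        combo
        for size in range(1, len(components) + 1)
        for combo in combinations(components, size)
    ]


def lrt_p_value(loglik_full, loglik_reduced):
    """P-value for dropping one component, using a 1-df chi-square
    (an assumption; see the note on boundary tests above)."""
    stat = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(stat / 2.0))


models = all_models()
n_models = len(models)  # 31 for five components
```

In a backwards stepwise search, the component with the largest non-significant p-value would be dropped at each step until all remaining components are significant.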
We compared GREML-KIN results against those from a standard GREML model in a subset of unrelated individuals from the family-based analyses. The standard GREML model uses a single genomic relatedness matrix with a cut-off to exclude one from each pair of related individuals (cut-off > 0.025). This approach therefore only detects population-level additive genetic effects tagged by common genotyped SNPs, plus potential confounding, for example from gene-environment correlation and population stratification. The residual component contains other sources of variance: gene–environment interaction, error, plus all of the environmental influence, rare variant effects that are not captured when using an unrelated population sample, and non-additive genetic effects.
GREML-LDMS-I to investigate the effects of less common variants
In our GREML-LDMS-I analyses, we started with whole genome data imputed from the HRC panel (93,095,623 autosomal variants; see Bycroft et al. 2018). We ran quality control to include variants across the allele frequency spectrum that were imputed with high confidence (INFO score > 0.80), and removed multiallelic variants. Three allele frequency bins were made, containing variants with minor allele frequency ranges of: 0.001–0.01, 0.01–0.1, 0.1–0.5, respectively. SNPs in each bin were split into high versus low linkage disequilibrium categories. We stratified by individual (rather than regional) SNP LD scores, since this has been shown to yield SNP heritability estimates that are more robust across different genetic architectures than estimates from other approaches (Evans et al. 2018). This led to six genome-wide genetic relatedness matrices, one for each allele frequency and LD bin (non-overlapping). All matrices included the same number of unrelated individuals (cut-off 0.025) with phenotype data and with at least one family connection as in standard GREML. The matrices were simultaneously fitted using a linear mixed model, and estimates were allowed to be negative. In supplementary analyses we explored whether the variance explained by the rarest alleles could be underestimated because their imputation quality was lower. Hence, we checked how many SNPs in each minor allele frequency bin were dropped when applying the INFO > 0.8 cutoff.
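The six-way stratification can be sketched as follows. The allele frequency boundaries are those given in the text; splitting each frequency bin at the median of the individual SNP LD scores is an assumption of this sketch (suggested by the near-equal SNP counts per LD category), and the function names and example SNPs are hypothetical.

```python
from statistics import median


def maf_bin(maf):
    """Index of the allele frequency bin (0, 1 or 2), or None if out of range."""
    if 0.001 <= maf <= 0.01:
        return 0
    if 0.01 < maf <= 0.1:
        return 1
    if 0.1 < maf <= 0.5:
        return 2
    return None


def assign_bins(snps):
    """Map each SNP id to (maf_bin, 'lower' | 'higher').

    snps: list of (snp_id, maf, ld_score) tuples.
    The LD split uses the median LD score within each MAF bin (an assumption).
    """
    by_bin = {}
    for snp_id, maf, ld_score in snps:
        index = maf_bin(maf)
        if index is not None:
            by_bin.setdefault(index, []).append((snp_id, ld_score))

    assignment = {}
    for index, items in by_bin.items():
        cut = median(ld for _, ld in items)
        for snp_id, ld in items:
            assignment[snp_id] = (index, "higher" if ld > cut else "lower")
    return assignment


# Hypothetical SNPs: (id, minor allele frequency, individual LD score)
snps = [
    ("rs1", 0.005, 1.0),
    ("rs2", 0.008, 3.0),
    ("rs3", 0.200, 10.0),
    ("rs4", 0.300, 20.0),
    ("rs5", 0.0005, 5.0),  # below the lowest MAF bound, so excluded
]
bins = assign_bins(snps)
```

Each of the six resulting SNP sets would then be used to build its own genetic relatedness matrix.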
Sample independence
To ensure that our results were independent from the previous Generation Scotland study, we compared checksums for both samples to identify and remove overlapping participants. A checksum is the sum of nine numbers taken from binary genotype files. Checksums were obtained from Generation Scotland without accessing genotype data directly. We then ran checksums in the UK Biobank (after ensuring quality control of genomic data was the same), using a script from the Broad Institute, which is available online: https://personal.broadinstitute.org/sripke/share_links/checksums_download/outdated_readme/id_geno_checksum.v2.
We note that the number of relationships per individual in the UK Biobank is lower than in Generation Scotland. Their sample was selected to capture dense kinships, where many individuals have siblings, parents and spouses who are also study participants. This may result in lower power in the UK Biobank to detect influences of family similarity, especially if small in magnitude, and reduces power to separate confounding factors, as in biometric designs (McAdams et al. 2018).
Software
We used the following software in our analyses: identification of family members was performed using R; construction of genomic relationship matrices was done in GCTA; family, sibling and couple similarity matrices were made in bash; GREML analyses were conducted in GCTA. Scripts are available from the lead author on request. The UK Biobank is a public dataset available to all bona fide researchers (with funds to pay the access fee).
Results
Identification of family members
Columns 2–4 of Table 1 (bold) show how many pairs of the three types of family members we identified with available data on neuroticism and years of education. The numbers of family relationships closely matched findings from previous publications (Yengo et al. 2018; Bycroft et al. 2018). Column 5 contains the sample sizes of family pairs (the total of couple, sibling and parent–offspring pairs). From the number of family pairs, we derived the number of families (column 6), or in other words, the number of discrete sets of individuals who have at least one connection. The number of unique individuals (column 7) represents the total number of participants with at least one family member, after removing double-counted individuals who have multiple connections.
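Deriving the number of families from the list of pairs amounts to counting connected components in a graph whose edges are the family pairs. The sketch below uses a simple union-find structure; the function name and the example pairs are hypothetical.

```python
def count_families(pairs):
    """Return (number of families, number of unique individuals).

    pairs: iterable of (id1, id2) family pairs. A family is a set of
    individuals linked directly or indirectly by at least one pair.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two families

    roots = {find(x) for x in list(parent)}
    return len(roots), len(parent)


# Toy example: 1-2-3 form one family, 4-5 another.
n_families, n_individuals = count_families([(1, 2), (2, 3), (4, 5)])
```

In this toy input three pairs yield two families and five unique individuals, which illustrates why the pair, family and individual counts in Table 1 differ.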
Table 1.
Phenotype | Couple (pairs) | Sib (pairs) | Parent-offspring (pairs) | Nuclear (pairs) | Families | Unique individuals |
---|---|---|---|---|---|---|
Neuroticism | 16,451 | 14,562 | 4004 | 35,017 | 31,369 | 65,361 |
Education | 23,201 | 21,564 | 5912 | 50,677 | 44,316 | 93,737 |
The discrepancies between the numbers of individuals, families and nuclear pairs reflect that most people only have one family member in the study. This contrasts with samples with dense kinship networks such as Generation Scotland, where many individuals have siblings, parents and spouses who are also study participants. As described in the Methods, we distinguished parent–offspring and sibling pairs according to their IBS0 (Fig. 1). We chose a threshold of IBS0 > 0.001 to define siblings (blue) separately from parent–offspring pairs (yellow).
Phenotypic correlations for neuroticism were 0.03 for couples, 0.14 for siblings, and 0.13 for parent–offspring pairs. Phenotypic correlations for years of education were 0.38 for couples, 0.30 for siblings, and 0.26 for parent–offspring pairs.
Kin-based SNP heritability method accounting for environmental similarity: GREML-KIN
Figure 2 shows results for the full GREML-KIN models for neuroticism and education years. For neuroticism, the full model indicated that the variance explained by common genetic and kin-based variants is 11% (se = 0.01) and 20% (se = 0.09), respectively (31% in total, first two bars). Bars 3–5 demonstrate that there is no significant influence of family, sibling or couple similarity. For neuroticism, the selected model gives similar results and contains common SNP and kin-based genetic influences (11% (se = 0.01) and 16% (se = 0.02), respectively). Compared to this best-fitting model, the inclusion of matrices to control for the influence of family environments (in the full model) does not reduce heritability. The total additive heritability of neuroticism when including relatives (27%, selected model) is substantially higher than in our analysis of unrelated individuals: 10% (se = 0.01; N = 44,694 unrelated individuals; Supplementary Table 1).
Figure 2 also shows the results for years of education from our full model. Due to computational memory limitations, it was necessary to run models for education in two parts (explained below). We report meta-analysed (inverse variance weighted means) results here. The results for each part are in Supplementary Tables 2 and 3. The heritability of years of education in our selected model was 56%, made up of 12% common SNPs at the population level (se = 0.01) and 44% kin-based effects (se = 0.02). This increases heritability considerably from 17% (se = 0.01) in unrelated individuals. The final model for education also contains a couple similarity effect of 35% (se = 0.01) in addition to the common and kin-based genetic influences.
See Supplementary Tables 1–3 for full model-fitting results, including all sub-models and fit statistics. As noted above, we ran two sets of GREML-KIN models in independent samples to reduce the computational burden. The first used the same matrices as in analyses of neuroticism. The second used new matrices including individuals who have education data and at least one family member, and who were not included in the neuroticism matrices. In defining these groups, we ensured that individuals in the same family were kept together.
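The inverse variance weighted mean used to combine the two sets of education results can be sketched as below. The function name is hypothetical and the example inputs are made-up numbers for illustration, not the estimates reported in the Supplementary Tables.

```python
import math


def ivw_meta(estimates, standard_errors):
    """Inverse variance weighted mean and its standard error."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Made-up example: two estimates with equal precision.
pooled, pooled_se = ivw_meta([0.40, 0.50], [0.02, 0.02])
```

With equal standard errors the pooled estimate is the simple average, and the pooled standard error shrinks by a factor of the square root of two; with unequal standard errors the more precise part receives more weight.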
Supplementary Fig. 1 gives GREML-KIN results for alternative education phenotypes (years of education with fewer categories, and degree/college completion). Estimates differed only slightly, and the conclusions remain the same.
GREML-LDMS-I to investigate the effects of less common variants
Table 2 shows our estimates of the contribution of variants of different allele frequencies and individual linkage disequilibrium levels to neuroticism and years of education. For neuroticism, there is no evidence of a contribution of SNPs in the lowest minor allele frequency (MAF) bin (0.001–0.01), but SNPs of MAF 0.01–0.1 explained 3% of the phenotypic variance. All variants explained 11% (se = 0.02) of phenotypic variation. For education, variants tagged by low frequency SNPs (MAF 0.001–0.01 and 0.01–0.1), particularly those in lower linkage disequilibrium, make a modest contribution to phenotypic variation, and all variants explained 21% (se = 0.02) of the phenotypic variance. Supplementary Table 4 shows that the lower the allele frequency bin, the more SNPs were dropped due to low imputation confidence.
Table 2.
MAF, LD bin | 0.001–0.01, lower LD | 0.001–0.01, higher LD | > 0.01–0.1, lower LD | > 0.01–0.1, higher LD | > 0.1–0.5, lower LD | > 0.1–0.5, higher LD | Total |
---|---|---|---|---|---|---|---|
No. SNPs | 1,772,407 | 1,772,399 | 1,756,294 | 1,756,290 | 2,105,014 | 2,104,972 | 11,267,376 |
Neuroticism h2 (se) | 0.00 (0.02) | 0.00 (0.01) | 0.02 (0.01) | 0.01 (0.03) | 0.05 (0.01) | 0.04 (0.004) | 0.11 (0.02) |
Education h2 (se) | 0.06 (0.02) | 0.00 (0.05) | 0.03 (0.01) | 0.01 (0.01) | 0.07 (0.01) | 0.05 (0.03) | 0.21 (0.02) |
MAF minor allele frequency, LD linkage disequilibrium, h2 (se) variance explained by SNPs in MAF and LD bin, plus standard error
Checksum analyses indicated that there were no family members with phenotype data in the UK Biobank who were also in Generation Scotland. We can therefore be confident that our results are independent of the previous study (Hill et al. 2018).
Discussion
In this study, we capitalised on the family information and genome-wide genotypes in the UK Biobank to estimate genetic and family environmental influences on neuroticism and years of education. Inclusion of family-based genetic effects substantially increases the heritabilities of neuroticism and education from 10% and 17% in standard analyses of unrelated individuals, to 27% and 56% respectively. Our estimates closely replicate previous findings from an independent sample (Generation Scotland; Hill et al. 2018). The additional family-based influences likely include rare variants, copy number variants, and structural variants. Turning to our non-genetic findings, we did not detect any influence of nuclear family, sibling or couple similarity on variation in neuroticism. This is consistent with evidence against shared environmental contributions to personality (Nivard et al. 2015; Hettema et al. 2006; Coventry and Keller 2005), and with the previous study using the same method (Hill et al. 2018). For education years, we found that 35% of the variance was explained by couple similarity. This couple effect likely captures assortative mating (education is likely to have been completed before partners became couples) and family environment effects on education.
At least three potential biases mean that the higher GREML-KIN heritability may not only be explained by ‘revealing additional genetic effects’. First, confounding is likely to remain in the heritability estimates due to indirect effects of alleles shared by parents and offspring (passive gene-environment correlation). Simulations in an Icelandic dataset suggest that although GREML-KIN gives unbiased heritability estimates when family environmental influences are present, it over-estimates heritability if phenotypes are also substantially influenced by passive gene-environment correlation (Young et al. 2018). The influence of passive gene-environment correlation for educational attainment in the UK Biobank is suggested by the lower SNP heritability for adoptees, whose rearing environments are less correlated with their genotypes (Cheesman et al. 2019). In the future, methods that can distinguish direct from indirect influences should be applied to neuroticism, education and other complex, socially-contingent traits (Eaves et al. 2014; Visscher et al. 2006; Young et al. 2018).
Second, geographic population differences complicate our interpretations of heritability estimates, particularly for education. One study found that controlling for urban childhood residence, a proxy for a range of environmental factors, reduced the SNP heritability of education (but not the heritabilities of height or body mass index) above and beyond other controls for population stratification (Conley et al. 2014). Stratification in the UK Biobank is present even at a fine scale, is unlikely to be completely removed by controlling for principal components, and is potentially more severe in rare variant analyses (Haworth et al. 2019; Young 2019).
A third key issue is assortative mating. There is clear evidence for assortment on educational attainment, including in the UK Biobank (Hugh-Jones et al. 2016; Robinson et al. 2017; Yengo et al. 2018). Our estimates of additive genetic variance are likely to be overestimated due to assortative mating among the parents of the UK Biobank participants (see Methods). Even beyond this, assortative mating may inflate the estimated additive genetic variance because it increases phenotypic covariance among relatives and the covariance among specific trait-associated loci, but it is not captured by genome-wide genetic relatedness (the predictor in our GREML models). Assortative mating may also exacerbate the bias due to gene-environment correlation (Keller et al. 2009; Kong et al. 2018). Specifically, the effects of variants influencing educational attainment in offspring may be magnified because variants are also present in the parents, influencing parental education, parents’ choice of a mate with similar attainment, and the rearing environment they provide. The effects of assortment on variance component estimates from genomic heritability methods are likely to be complicated and are an important area for future research.
Although the potential biases mean that interpretations should be cautious, our heritability estimates are notably closer to twin and extended twin family study estimates than are those from standard GREML analysis. The latter design is able to differentiate the effects of additive genetics, common environment, assortative mating and gene-environment correlation, and the best-powered heritability estimate for neuroticism of ~25% (Coventry and Keller 2005) is close to ours (27%). On the other hand, our total heritability estimate for educational attainment (56%) is notably high and likely to be more strongly biased. It is at the upper end of twin estimates in the literature (40–50%), even though twin heritability does not only reflect additive genetic effects. The increase beyond twin heritability is likely to be partly due to passive gene-environment correlation, which inflates the shared environment component rather than heritability in twin studies.
Our allele frequency and LD partitioned heritability results (GREML-LDMS-I) give mixed support for the notion that the contribution of family-associated genetic influence is explained by rarer variants. For education, some effects of low allele frequency, low LD variants are captured by our partitioned genomic relatedness matrices. For neuroticism, the same GREML-LDMS-I heritability analyses indicated less rare genetic influence than expected based on the large kin-genetic component from the GREML-KIN model-fitting. The gain in heritability from GREML-KIN for neuroticism, despite the lack of variance in the 0.001–0.01 frequency bins, is perplexing but may be partly explained by poor imputation, very rare and structural variation, and population stratification. With regard to imputation quality, the rarest SNPs were more likely to be dropped prior to analysis (see Supplementary Table 4). Since rare SNPs tag other rare SNPs poorly, dropping rare ones would lead to more dramatic underestimation of their effects.
Evidence from other studies suggests an important role for rarer variants, which will be further elucidated with the increasing availability of whole genome sequence data. Multifactorial traits such as personality and educational attainment are thought to have complex genetic aetiologies, consisting of interplay between common and rare variation. Recent exome sequencing work has demonstrated rare coding variant effects on many psychiatric traits that are closely related to neuroticism (Ganna et al. 2018). Other evidence for interactions between common variants across the genome and rare copy number variants in schizophrenia (Bergen et al. 2018) shows the importance of examining different types of genetic variation together. Importantly, pedigree estimates of the heritability of height and BMI can be recovered by using whole genome sequence data (rather than imputed data) in GREML analyses (Wainschtein et al. 2019). However, much larger sample sizes are needed to detect rare genetic influences, and confounding factors such as stratification, indirect effects and assortative mating are still present.
The additional influences correlated with kinship detected in this study indicate a chance to improve prediction using polygenic scores. Prediction accuracy is a function of the SNP heritability of the target sample, and of the SNP heritability of, and genetic correlation with, the GWAS discovery phenotype (de Vlaming et al. 2017). Polygenic scores for neuroticism currently explain at most 4.2% of the phenotypic variance in independent samples, which is a fraction of the common SNP heritability (~10%; Nagel et al. 2018). For years of education, polygenic scores explain >10% of the phenotypic variance, close to the SNP heritability of 15%. Increasing sample sizes and phenotypic homogeneity for GWAS of unrelated population samples will continue to narrow this gap between polygenic prediction and SNP heritability. However, the population-level common variant approach may be limited for educational attainment, given the relatively high variance already explained by polygenic scores. For both traits, genomic prediction could be improved by leveraging family information (Lee et al. 2017). Inclusion of relatives could help to capture additional effects of typically untagged variants that are rare and possibly even family-specific. Importantly, for the purpose of prediction, genetic scores need not be, and already are not, ‘pure’ indices of individual genetic propensity. It may help to tag influential aspects of the environment, for example by combining parent and child polygenic scores within a single model.
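The gap between polygenic prediction and SNP heritability described above can be made concrete with a short calculation. The sketch below is illustrative only. The first function takes the ratio of variance explained to SNP heritability, using the figures quoted in the text. The second uses a widely cited approximation for expected out-of-sample prediction, R² ≈ h² / (1 + M / (N·h²)), which is not a result of this study. The function names, the discovery sample sizes and the effective number of independent SNPs (M = 60,000) are hypothetical assumptions.

```python
def fraction_of_snp_h2_captured(pgs_r2, snp_h2):
    """Share of SNP heritability already explained by a polygenic score."""
    if not 0 < snp_h2 <= 1:
        raise ValueError("snp_h2 must lie in (0, 1]")
    if not 0 <= pgs_r2 <= 1:
        raise ValueError("pgs_r2 must lie in [0, 1]")
    return pgs_r2 / snp_h2


def expected_pgs_r2(snp_h2, n_discovery, m_effective):
    """Approximate expected polygenic score R2 in an independent sample.

    Uses R2 = h2 / (1 + M / (N * h2)), which assumes a perfect genetic
    correlation between discovery and target phenotypes. m_effective is
    an assumed number of independent SNPs, not a value from this study.
    """
    if not 0 < snp_h2 <= 1:
        raise ValueError("snp_h2 must lie in (0, 1]")
    if n_discovery <= 0 or m_effective <= 0:
        raise ValueError("n_discovery and m_effective must be positive")
    return snp_h2 / (1.0 + m_effective / (n_discovery * snp_h2))


if __name__ == "__main__":
    # Figures quoted in the text: neuroticism score R2 4.2%, SNP h2 ~10%
    print(f"Neuroticism: {fraction_of_snp_h2_captured(0.042, 0.10):.2f}")
    # Education: score R2 taken at its quoted lower bound of 10%, SNP h2 15%
    print(f"Education (lower bound): {fraction_of_snp_h2_captured(0.10, 0.15):.2f}")
    # Hypothetical discovery sample sizes, M assumed to be 60,000
    for n in (100_000, 1_000_000):
        print(n, round(expected_pgs_r2(0.10, n, 60_000), 4))
```

Under this approximation the expected R² approaches, but never exceeds, the SNP heritability as the discovery sample grows. Gains beyond that ceiling would therefore have to come from other sources, such as the family-based effects discussed above.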
In summary, we provide evidence for substantial family-based and common genetic effects on neuroticism and years of education in the UK Biobank. These results motivate the recruitment of samples with dense kinships and methods to leverage genomic family information, whilst understanding gene-environment interplay.
Acknowledgements
We would like to thank the scientists involved in the construction of the UK Biobank and all of the participants who have shared their life experiences with investigators in the UK Biobank. This research has been conducted using the UK Biobank Resource, under the application 18177 (with thanks to Paul F. O’Reilly). This study represents independent research part funded by the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care. High performance computing facilities were funded with capital equipment grants from the GSTT Charity (TR130505) and Maudsley Charity (980). T.C. Eley is part funded by a program grant from the UK Medical Research Council (MR/M021475/1). R. Cheesman is supported by an ESRC studentship. C. Rayner is supported by a grant from Fondation Peters to T.C. Eley and G. Breen. K.L. Purves is part supported by a grant from the Alexander Von Humboldt Foundation and UK Medical Research Council (MR/M021475/1). G. Morneau-Vaillancourt is supported by a studentship from the Quebec Network on Suicide, Mood Disorders and Related Disorders. S.W. Choi is funded by the UK Medical Research Council (MR/N015746/1). K. Glanville is supported by a PhD studentship awarded from the UK Medical Research Council. We thank David M. Howard for assistance with checksums for Generation Scotland, and Loic Yengo for sharing information on identification of couples.
Compliance with ethical standards
Conflict of interest
G. Breen is a consultant for Eli Lilly. R. Cheesman, J. Coleman, C. Rayner, K. L. Purves, G. Morneau-Vaillancourt, K. Glanville, S. W. Choi and T. C. Eley declare no conflicts of interest.
Human and Animal Rights and Informed Consent
All participants gave full informed written consent for participation in the UK Biobank. This study was performed in accordance with the criteria defined by the rules of the UK Biobank.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
R. Cheesman, Email: rosa.cheesman@kcl.ac.uk
T. C. Eley, Email: thalia.eley@kcl.ac.uk
References
- Allen NE, Sudlow C, Peakman T, Collins R. UK Biobank data: come and get it. Sci Transl Med. 2014;6(224):224ed4. doi: 10.1126/scitranslmed.3008601.
- Bergen SE, Ploner A, Howrigan D, et al. Joint contributions of rare copy number variants and common SNPs to risk for schizophrenia. Am J Psychiatry. 2018;176(1):29–35. doi: 10.1176/appi.ajp.2018.17040467.
- Branigan AR, McCallum KJ, Freese J. Variation in the heritability of educational attainment: an international meta-analysis. Soc Forces. 2013;92(1):109–140.
- Bycroft C, Freeman C, Petkova D, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–209. doi: 10.1038/s41586-018-0579-z.
- Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4:7. doi: 10.1186/s13742-015-0047-8.
- Cheesman R, Major Depressive Disorder Working Group of the Psychiatric Genomics Consortium, Purves KL, et al. Extracting stability increases the SNP heritability of emotional problems in young people. Transl Psychiatry. 2018;8(1):223.
- Cheesman R, Hunjan A, Coleman JR, Ahmadzadeh Y, Plomin R, McAdams TA, Eley TC, Breen G. Comparison of adopted and non-adopted individuals reveals gene-environment interplay for education in the UK Biobank. bioRxiv. 2019:707695.
- Conley D, Siegal ML, Domingue BW, Mullan Harris K, McQueen MB, Boardman JD. Testing the key assumption of heritability estimates based on genome-wide genetic relatedness. J Hum Genet. 2014;59(6):342–345. doi: 10.1038/jhg.2014.14.
- Coventry WL, Keller MC. Estimating the extent of parameter bias in the classical twin design: a comparison of parameter estimates from extended twin-family and classical twin designs. Twin Res Hum Genet. 2005;8(3):214–223. doi: 10.1375/1832427054253121.
- de Vlaming R, Okbay A, Rietveld CA, et al. Meta-GWAS accuracy and power (MetaGAP) calculator shows that hiding heritability is partially due to imperfect genetic correlations across studies. PLoS Genet. 2017;13(1):e1006495. doi: 10.1371/journal.pgen.1006495.
- Eaves LJ, Pourcain BS, Smith GD, York TP, Evans DM. Resolving the effects of maternal and offspring genotype on dyadic outcomes in genome wide complex trait analysis (“M-GCTA”). Behav Genet. 2014;44(5):445–455. doi: 10.1007/s10519-014-9666-6.
- Evans LM, Tahmasbi R, Vrieze SI, et al. Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits. Nat Genet. 2018;50(5):737–745. doi: 10.1038/s41588-018-0108-x.
- Ganna A, Satterstrom FK, Zekavat SM, et al. Quantifying the impact of rare and ultra-rare coding variation across the phenotypic spectrum. Am J Hum Genet. 2018;102(6):1204–1211. doi: 10.1016/j.ajhg.2018.05.002.
- Haworth S, Mitchell R, Corbin L, Wade KH, Dudding T, Budu-Aggrey A, Carslake D, Hemani G, Paternoster L, Smith GD, Davies N. Apparent latent structure within the UK Biobank sample has implications for epidemiological analysis. Nat Commun. 2019;10(1):333. doi: 10.1038/s41467-018-08219-1.
- Hettema JM, Neale MC, Myers JM, Prescott CA, Kendler KS. A population-based twin study of the relationship between neuroticism and internalizing disorders. Am J Psychiatry. 2006;163(5):857–864. doi: 10.1176/ajp.2006.163.5.857.
- Hill WD, Arslan RC, Xia C, et al. Genomic analysis of family data reveals additional genetic effects on intelligence and personality. Mol Psychiatry. 2018;23(12):2347. doi: 10.1038/s41380-017-0005-1.
- Hugh-Jones D, Verweij KJ, Pourcain BS, Abdellaoui A. Assortative mating on educational attainment leads to genetic spousal resemblance for polygenic scores. Intelligence. 2016;59:103–108.
- Keller MC, Medland SE, Duncan LE, Hatemi PK, Neale MC, Maes HH, Eaves LJ. Modeling extended twin family data I: description of the Cascade model. Twin Res Hum Genet. 2009;12(1):8–18. doi: 10.1375/twin.12.1.8.
- KING: Relationship Inference Software [Online]. Available at: http://people.virginia.edu/~wc9c/KING/
- Knopik VS, Neiderhiser JM, DeFries JC, Plomin R, editors. Behavioral genetics. 7. New York: Worth; 2017.
- Kong A, Thorleifsson G, Frigge ML, Vilhjalmsson BJ, Young AI, Thorgeirsson TE, Benonisdottir S, Oddsson A, Halldorsson BV, Masson G, Gudbjartsson DF. The nature of nurture: effects of parental genotypes. Science. 2018;359(6374):424–428. doi: 10.1126/science.aan6877.
- Laurin CA, Hottenga J-J, Willemsen G, Boomsma DI, Lubke GH. Genetic analyses benefit from using less heterogeneous phenotypes: an illustration with the hospital anxiety and depression scale (HADS). Genet Epidemiol. 2015;39(4):317–324. doi: 10.1002/gepi.21897.
- Lee SH, Weerasinghe WMSP, Wray NR, Goddard ME, van der Werf JHJ. Using information of relatives in genomic prediction to apply effective stratified medicine. Sci Rep. 2017;7:42091. doi: 10.1038/srep42091.
- Lee JJ, Wedow R, Okbay A, et al. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nat Genet. 2018;50(8):1112–1121. doi: 10.1038/s41588-018-0147-3.
- Luciano M, Hagenaars SP, Davies G, et al. Association analysis in over 329,000 individuals identifies 116 independent variants influencing neuroticism. Nat Genet. 2018;50(1):6–11. doi: 10.1038/s41588-017-0013-8.
- Mackenbach JP, Stirbu I, Roskam A-JR, et al. Socioeconomic inequalities in health in 22 European countries. N Engl J Med. 2008;358(23):2468–2481. doi: 10.1056/NEJMsa0707519.
- McAdams TA, Hannigan LJ, Eilertsen EM, Gjerde LC, Ystrom E, Rijsdijk FV. Revisiting the children-of-twins design: improving existing models for the exploration of intergenerational associations. Behav Genet. 2018;48(5):397–412. doi: 10.1007/s10519-018-9912-4.
- Nagel M, Jansen PR, Stringer S, et al. Meta-analysis of genome-wide association studies for neuroticism in 449,484 individuals identifies novel genetic loci and pathways. Nat Genet. 2018;50(7):920–927. doi: 10.1038/s41588-018-0151-7.
- Nivard MG, Middeldorp CM, Dolan CV, Boomsma DI. Genetic and environmental stability of neuroticism from adolescence to adulthood. Twin Res Hum Genet. 2015;18(6):746–754. doi: 10.1017/thg.2015.80.
- Ormel J, Jeronimus BF, Kotov R, et al. Neuroticism and common mental disorders: meaning and utility of a complex relationship. Clin Psychol Rev. 2013;33(5):686–697. doi: 10.1016/j.cpr.2013.04.003.
- Peyrot WJ, Robinson MR, Penninx BW, Wray NR. Exploring boundaries for the genetic consequences of assortative mating for psychiatric traits. JAMA Psychiatry. 2016;73(11):1189–1195. doi: 10.1001/jamapsychiatry.2016.2566.
- Polderman TJC, Benyamin B, de Leeuw CA, et al. Meta-analysis of the heritability of human traits based on fifty years of twin studies. Nat Genet. 2015;47(7):702–709. doi: 10.1038/ng.3285.
- Robinson MR, Kleinman A, Graff M, et al. Genetic evidence of assortative mating in humans. Nat Hum Behav. 2017;1(1):0016.
- Smith BH, Campbell H, Blackwood D, et al. Generation Scotland: the Scottish Family Health Study; a new resource for researching genes and heritability. BMC Med Genet. 2006;7:74. doi: 10.1186/1471-2350-7-74.
- Smith DJ, Nicholl BI, Cullen B, et al. Prevalence and characteristics of probable major depression and bipolar disorder within UK biobank: cross-sectional study of 172,751 participants. PLoS ONE. 2013;8(11):e75362. doi: 10.1371/journal.pone.0075362.
- Smith DJ, Escott-Price V, Davies G, et al. Genome-wide analysis of over 106 000 individuals identifies 9 neuroticism-associated loci. Mol Psychiatry. 2016;21(6):749–757. doi: 10.1038/mp.2016.49.
- van den Berg SM, de Moor MHM, McGue M, et al. Harmonization of neuroticism and extraversion phenotypes across inventories and cohorts in the genetics of personality consortium: an application of item response theory. Behav Genet. 2014;44(4):295–313. doi: 10.1007/s10519-014-9654-x.
- van der Sluis S, Verhage M, Posthuma D, Dolan CV. Phenotypic complexity, measurement bias, and poor phenotypic resolution contribute to the missing heritability problem in genetic association studies. PLoS ONE. 2010;5(11):e13929. doi: 10.1371/journal.pone.0013929.
- Vinkhuyzen AA, Van Der Sluis S, Maes HH, Posthuma D. Reconsidering the heritability of intelligence in adulthood: taking assortative mating and cultural transmission into account. Behav Genet. 2012;42(2):187–198. doi: 10.1007/s10519-011-9507-9.
- Visscher PM, Medland SE, Ferreira MAR, et al. Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings. PLoS Genet. 2006;2(3):e41. doi: 10.1371/journal.pgen.0020041.
- Wainschtein P, Jain DP, Yengo L, et al. Recovery of trait heritability from whole genome sequence data. bioRxiv. 2019. doi: 10.1101/588020.
- Xia C, Amador C, Huffman J, et al. Pedigree- and SNP-associated genetics and recent environment are the major contributors to anthropometric and cardiometabolic trait variation. PLoS Genet. 2016;12(2):e1005804. doi: 10.1371/journal.pgen.1005804.
- Yang J, Benyamin B, McEvoy BP, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42(7):565–569. doi: 10.1038/ng.608.
- Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011;88(1):76–82. doi: 10.1016/j.ajhg.2010.11.011.
- Yengo L, Robinson MR, Keller MC, et al. Imprint of assortative mating on the human genome. Nat Hum Behav. 2018;2(12):948–954. doi: 10.1038/s41562-018-0476-3.
- Young AI. Solving the missing heritability problem. PLoS Genet. 2019;15(6):e1008222. doi: 10.1371/journal.pgen.1008222.
- Young AI, Frigge ML, Gudbjartsson DF, et al. Relatedness disequilibrium regression estimates heritability without environmental bias. Nat Genet. 2018;50(9):1304–1310. doi: 10.1038/s41588-018-0178-9.
- Zaitlen N, Kraft P, Patterson N, et al. Using extended genealogy to estimate components of heritability for 23 quantitative and dichotomous traits. PLoS Genet. 2013;9(5):e1003520. doi: 10.1371/journal.pgen.1003520.