Abstract
Linkage studies have successfully mapped loci underlying monogenic disorders, but mostly failed when applied to common diseases. Conversely, genome-wide association studies (GWAS) have identified replicable associations between thousands of single nucleotide polymorphisms (SNPs) and complex traits yet capture less than half of the total heritability. Here we reconcile these two approaches by showing that linkage signals of height and body mass index (BMI) from 119,000 sibling pairs colocalize with GWAS-identified loci. Concordant with polygenicity, we observed a genome-wide inflation of linkage test statistics, found that GWAS results predict linkage signals, and showed that adjusting phenotypes for polygenic scores reduces linkage signals. Finally, we develop a method using recombination rate stratified identity-by-descent sharing between siblings to obtain unbiased estimates of the heritability of height (0.76 ± 0.05) and BMI (0.55 ± 0.07). Our results imply that substantial heritability remains unaccounted for by GWAS-identified loci, and this residual genetic variation is polygenic and enriched near GWAS-identified loci.
INTRODUCTION
Genetic studies of human complex traits have progressed from using pedigree designs (e.g., twin studies1) to estimate heritability to population-based genomic surveys for dissecting genetic variation at the level of individual loci. In between, there was a period of two decades (roughly from 1985–2005) where researchers used pedigree-based genome-wide linkage scans to map disease loci. Linkage analysis is an experimental design that studies the segregation of markers and phenotypes within a pedigree to map trait loci. It is robust to confounding effects caused by population stratification, requires only sparse marker maps but is also low in power and mapping resolution, the latter because there are very few recombination events within a pedigree2,3. Linkage analyses have been highly successful in mapping single gene Mendelian disorders4. Linkage studies of complex traits were predicated on the success of mapping single gene disorders and major risk loci affecting common diseases such as breast cancer5 and Alzheimer’s Disease6. However, linkage studies for most common diseases and other complex traits were largely a disappointment in that they failed to produce robust and replicable results7. Many explanations have been given in the literature for this failure, including “genetic heterogeneity” (family-specific genetic loci cause disease) and statistical artefacts due to model selection, insufficient correction for multiple testing and small sample sizes. The lack of power and mapping resolution of linkage studies prompted the development of genome-wide association studies (GWAS).2
Despite initial scepticism that GWAS would lead to marker-trait discoveries (e.g., refs.8,9), GWAS based on common SNP markers have been highly successful in detecting robust associations between SNPs and complex traits. Yet, SNPs tested for association in standard GWASs only capture one-third to one-half of the genetic variation estimated from pedigree data9. The polygenicity and effect sizes distribution of the genetic variation not accounted for by GWAS remains uncertain. Even after early GWAS results it was hypothesised that family-specific rare mutations of large effect could explain a substantial fraction of common and complex human diseases such as schizophrenia.10 Although this view has remained controversial,11,12 it remains possible that certain genomic loci harbor multiple penetrant alleles whose effects cannot be fully captured in a GWAS but are detectable using a linkage design. Such loci might also contain common variants already detected by GWAS. The architecture of the genetic variation unaccounted for by GWAS determines what experimental design is best for its discovery and dissection. For example, if common variation acts on phenotypes through gene expression networks that ultimately affect gene regulation at a small number of core genes13, then residual genetic variation due to rare variants may concentrate on those core genes in cis, and either large scale population-based exome sequencing studies or large-scale family-based linkage studies may identify such genes. In contrast, if residual genetic variation is just as polygenic as common genetic variance, then large-scale population-based whole genome sequencing studies would be best for variant discovery.
Family-based designs combined with dense genome-wide marker (SNP) data can be used to address a number of questions that neither pedigree nor GWAS designs alone can answer. They can be used to estimate genetic variance within families by exploiting the variation in actual relatedness around its expectation, resulting in estimates of heritability free from confounding due to population stratification and other sources of biases14,15. Estimates of SNP effects within families are likewise unaffected by population stratification and can be contrasted with between-family estimates from population-based studies to dissect direct from indirect genetic effects16,17 and to estimate the effect of non-random mating18.
In this study, we use data from n = 119,457 sibling pairs from six large cohorts of European ancestry participants (Generation Scotland19 (GS, n = 8,368), the Queensland Institute of Medical Research cohort (QIMR, n = 12,844), the Lifelines Cohort Study20,21 (LL, n = 16,836), the UK Biobank22 (UKB, n = 21,756), the Estonian Biobank23 (EBB, n = 25,333) and the Trøndelag Health Study24,25 (HUNT, n = 34,575)). We estimate genome-wide and locus-specific genetic variation for height and body mass index (BMI) and assess through theory, simulation and analysis of real data the consistency between linkage and population-based association studies. We investigate how genome-wide identity-by-descent (IBD) estimators from SNP data can yield biased heritability estimates under recombination-rate (RR) dependent genetic architectures and propose that RR-stratified analysis is a robust approach to reduce or eliminate this bias. We estimate a total heritability that is consistent with that of pedigree (twin) studies and about two-fold larger than that captured from common SNPs in GWAS, implying that a substantial proportion of genetic variation in the human genome is not captured by the common variant GWAS paradigm. We provide evidence that the residual genetic variation is also polygenic.
RESULTS
Estimates of heritability from IBD regression
Recombination rate dependent biases in IBD regression
We used the IBD regression method14,26 to partition the phenotypic correlation between siblings (reported for height and BMI in Supplementary Table 1) into a genetic and a shared environment component. Classically, this method quantifies genome-wide IBD sharing as a fraction of the length of the genome measured in centimorgan (cM) units, which implicitly upweights the contribution of loci with high recombination rates (RR). Alternatively, genome-wide IBD sharing could be quantified as the proportion of DNA base pairs shared between relatives. The latter approach implicitly assumes independence between RR and the genomic distribution of genetic variance.
We tested these two implicit assumptions through simulations and found that using either measure of IBD sharing can lead to biased heritability estimates when the genomic distribution of causal variants depends on RR (Supplementary Note, Supplementary Fig. 1–2, Supplementary Tables 2–5), as shown previously with population-based designs27. To remedy this problem, we propose an RR-stratified estimation method to account for variation in IBD sharing between loci with different RRs. Briefly, our method (i) groups genomic loci across the genome into four classes of homogeneous-RR loci, (ii) quantifies the average IBD sharing in each class, (iii) estimates the contribution of each class to the phenotypic correlation between siblings, then (iv) sums those contributions to obtain a final estimate of heritability (more details in Supplementary Note). We show through simulations that RR-stratified estimation is robust to differences in RR between markers and unobserved causal variants (Supplementary Table 5). Therefore, we hereafter only report results obtained with this approach. For comparison, we also report unstratified results in Supplementary Table 6.
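Steps (i) to (iv) can be illustrated with a minimal sketch. It assumes a Haseman-Elston-style least-squares regression of sibling phenotype cross-products on class-specific IBD sharing, which is a simplification swapped in for the restricted maximum likelihood estimation used in this study. All function names and toy numbers are hypothetical.

```python
from statistics import mean

def rr_classes(rr, n_classes=4):
    """Step (i): assign loci to equal-sized classes ranked by recombination rate."""
    order = sorted(range(len(rr)), key=lambda i: rr[i])
    labels = [0] * len(rr)
    for rank, i in enumerate(order):
        labels[i] = min(rank * n_classes // len(rr), n_classes - 1)
    return labels

def class_ibd(ibd_by_locus, labels, n_classes=4):
    """Step (ii): average IBD sharing of one sibling pair within each RR class."""
    return [mean(v for v, l in zip(ibd_by_locus, labels) if l == k)
            for k in range(n_classes)]

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def rr_stratified_h2(pair_ibd, cross_products):
    """Steps (iii)-(iv): regress cross-products on class IBD, then sum the slopes.

    Returns (sum of class slopes, intercept). The intercept plays the role of
    sibling resemblance that is independent of IBD sharing.
    """
    x = [[1.0] + row for row in pair_ibd]
    p = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(x, cross_products)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum(beta[1:]), beta[0]

# Toy recombination rates for eight loci (hypothetical values).
labels = rr_classes([0.1, 2.0, 0.5, 1.5, 0.2, 3.0, 0.9, 1.1])

# Noise-free toy data: class-specific IBD for six sibling pairs (hypothetical).
pair_ibd = [[0.50, 0.50, 0.50, 0.50], [0.60, 0.50, 0.50, 0.50],
            [0.50, 0.40, 0.50, 0.50], [0.50, 0.50, 0.70, 0.50],
            [0.50, 0.50, 0.50, 0.30], [0.45, 0.55, 0.50, 0.60]]
true_slopes = [0.30, 0.20, 0.15, 0.10]
cross = [0.05 + sum(b * v for b, v in zip(true_slopes, row)) for row in pair_ibd]
h2, c2 = rr_stratified_h2(pair_ibd, cross)
```

Because the toy data contain no noise, the regression recovers the simulated class contributions exactly. With real data, each class slope carries sampling error and the summed estimate inherits it.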
RR-stratified estimates of heritability
The parameters of our RR-stratified IBD regression model include the sum of RR-stratified full-sib IBD heritability (hereafter, the IBD heritability) and the proportion of variance due to effects common to siblings and independent of IBD sharing (hereafter, the residual sibling covariance). Estimates of both parameters were obtained in each cohort using unconstrained restricted maximum likelihood (Methods); that is, estimates were allowed to be negative to ensure unbiasedness. We then performed an inverse-variance weighted meta-analysis to combine estimates across cohorts (Fig. 1, Supplementary Table 6a). Heritability estimates across cohorts were largely consistent for height (heterogeneity I2 = 14.5%, derived from Cochran’s Q-statistic) but showed moderate heterogeneity for BMI (I2 = 56.8%). The meta-analysed estimates of IBD heritability were high for both height (0.76, standard error (s.e.) 0.05) and BMI (0.55, s.e. 0.07), consistent with large heritability estimates from twin studies1,28,29. We found a residual sibling covariance significantly different from zero for height (s.e. 0.03) but not for BMI (s.e. 0.04), although both estimates showed moderate heterogeneity across cohorts (height: I2 = 61.7%; BMI: I2 = 51.3%). A significantly positive residual sibling covariance could be due to either assortative mating, shared environmental effects, or both. We show in the Supplementary Note that the significant residual sibling covariance observed for height can be mostly explained by assortative mating, thus leaving little room for other effects. We repeated all analyses using rank-based transformed traits and found highly consistent results (Supplementary Table 6b).
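The meta-analysis step can be sketched as a standard fixed-effect, inverse-variance weighted combination with Cochran's Q and the derived I2 statistic. The inputs below are the per-cohort heritability estimates and standard errors for PGS-adjusted height as reported in Table 2; the function name is illustrative, and the exact software used in the study is not specified here.

```python
from math import sqrt

def ivw_meta(estimates, ses):
    """Fixed-effect inverse-variance weighted meta-analysis with Cochran's Q and I2 (%)."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    pooled_se = sqrt(1.0 / total)
    # Cochran's Q: weighted squared deviations of cohort estimates from the pooled one.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, pooled_se, q, i2

# Cohort order: GS, QIMR, LL, UKB, EBB, HUNT (PGS-adjusted height, Table 2).
est = [0.76, 0.80, 0.71, 0.45, 0.81, 0.67]
se = [0.21, 0.21, 0.15, 0.13, 0.13, 0.12]
pooled, pooled_se, q, i2 = ivw_meta(est, se)
```

With these inputs the pooled estimate and its standard error round to the meta-analysis column of Table 2 (0.68 and 0.06).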
Figure 1. Recombination-rate stratified estimates of heritability for height (a) and BMI (b).

a–b, Estimates were obtained using restricted maximum likelihood in six cohorts of European-ancestry individuals: the UK Biobank (UKB), Generation Scotland (GS), the Lifelines Study (LL), the Queensland Institute of Medical Research cohort (QIMR), the Estonian Biobank (EBB), the HUNT study (HUNT) and the fixed-effect meta-analysis results combining all cohorts (META). The number of quasi-independent sib-pairs (n) for each trait and cohort is indicated on y-axis. Each dot represents a point estimate, and the corresponding error bar represents its standard error (s.e.). Numeric values are given in Supplementary Table 3. Estimated variance components were not constrained to be positive to ensure unbiasedness.
Locus-specific linkage analysis of height and BMI
Next, we performed a locus-by-locus linkage analysis to quantify the amount of variation explained by IBD status at 0.5 cM spaced loci across each autosome. As before, we analysed each cohort separately, then meta-analysed locus-specific linkage signals across cohorts. The mean linkage test statistic across genomic locations was 3.43 (s.e. 0.38) for height and 1.43 (s.e. 0.22) for BMI, consistent with estimates of heritability (Methods, Extended Data Fig. 1, Supplementary Table 7a) and the effective number of independent genomic segments (Supplementary Table 8). Analyses of rank-based transformed traits yielded similar results (Supplementary Table 7b, Supplementary Fig. 3). We detected five loci on chromosomes 1, 5, 7, 10 and 19 showing significant linkage with height (Fig. 2, Table 1, for all chromosomes see Supplementary Fig. 4a), but none for BMI (Supplementary Fig. 4b). The statistical significance of linkage signals was determined using 3.6 as the threshold for the logarithm of the odds (LOD) score, as previously suggested for genome-wide significance in linkage studies.30 This threshold corresponds to a p-value of approximately 2 × 10⁻⁵.
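The relationship between a LOD score and a p-value can be sketched as follows. The sketch assumes the usual asymptotic result for a one-sided variance-component test, in which 2 ln(10) × LOD follows a 50:50 mixture of a point mass at zero and a chi-square distribution with one degree of freedom. This convention is an assumption made for illustration and may differ from the exact procedure in the Methods.

```python
from math import erfc, log, sqrt

def lod_to_chi2(lod):
    """Convert a LOD score to a likelihood-ratio chi-square statistic."""
    return 2.0 * log(10.0) * lod

def lod_to_pvalue(lod):
    """One-sided p-value under a 50:50 mixture of 0 and chi-square(1 df)."""
    chi2 = lod_to_chi2(lod)
    # Survival function of chi-square with 1 df is erfc(sqrt(x / 2)).
    return 0.5 * erfc(sqrt(chi2 / 2.0))

chi2_threshold = lod_to_chi2(3.6)
p_threshold = lod_to_pvalue(3.6)
```

Under this convention a LOD score of 3.6 corresponds to a chi-square statistic of about 16.6.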
Figure 2. Chromosomes containing loci significantly linked with height.

Linked loci were identified from the meta-analysis of 119,457 quasi-independent sibling-pairs before and after adjustment for genetic predictors (PGS, polygenic score) derived from the largest available GWAS of height33 (average proportion of height variance explained across cohorts: 0.38). The genetic position of independent trait-associated SNPs is represented below the y=0 line by blue dots, whose radius is proportional to the association statistic. Results for all the autosomes for height and BMI are shown in Supplementary Fig. 4a–b. The vertical dashed lines indicate the two LOD drop-off confidence interval (relative to the peak LOD score) on each side of a genetic position where the linkage LOD score exceeds 3.6 (Table 1). The black horizontal dotted line represents the threshold for significantly linked loci (LOD score ≥ 3.6). The grey horizontal dashed line indicates a LOD score of 0.
Table 1. Genomic regions significantly linked with height.
Linkage peaks were defined using the two LOD drop-off method on each side of a genetic position where the linkage LOD score exceeds 3.6. Proportion of genes (or trait-associated SNPs) in peaks is defined relative to the number of genes (or trait-associated SNPs) on the chromosome. Genomic positions correspond to the hg19 genome build. CHR: chromosome.
| CHR | Start (bp) | Stop (bp) | Length (Mb) | Start (cM) | Stop (cM) | Length (cM) | max LOD score | Proportion of Genes in Peak | Proportion of Trait-associated SNPs in Peak |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 213,682,673 | 234,840,374 | 21.2 | 200.5 | 224.5 | 24 | 4.1 | 163/2566 | 119/970 |
| 5 | 78,955,590 | 89,401,919 | 10.4 | 85.0 | 93.5 | 8.5 | 5.5 | 50/1172 | 37/742 |
| 7 | 42,401,253 | 81,189,104 | 38.8 | 58.0 | 86.0 | 28 | 3.8 | 271/1235 | 131/673 |
| 10 | 104,479,048 | 122,171,602 | 17.7 | 111.5 | 130.5 | 19 | 4.0 | 118/990 | 69/558 |
| 19 | 4,196,471 | 8,360,632 | 4.2 | 16.0 | 27.0 | 11 | 3.8 | 120/1806 | 43/375 |
For each locus, we defined a confidence interval for the location of the underlying causal variants using the “two LOD drop-off” method31 (Methods). We conservatively chose a 2-unit LOD score drop-off to ensure a coverage of at least 95%. The length of these five confidence intervals varied between 4.2 Mb (height-linked locus: chr19: 4,196,471–8,360,632; genomic position in build hg19) and 38.8 Mb (height-linked locus: chr7: 42,401,253–81,189,104), which remain quite broad despite a sample size >119,000 sibling pairs (Table 1).
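A minimal sketch of a two LOD drop-off interval is given below. It assumes the interval extends outward from the peak on each side until the LOD score first falls more than two units below the peak value; the exact rule is given in the Methods and may differ in detail. The function name and the toy LOD profile are hypothetical.

```python
def lod_dropoff_interval(positions_cm, lod, threshold=3.6, drop=2.0):
    """Return (start_cM, stop_cM, peak_LOD) around the highest peak, or None.

    The interval is only reported when the peak reaches the significance
    threshold. It extends while neighbouring LOD scores stay within `drop`
    units of the peak.
    """
    peak = max(range(len(lod)), key=lambda i: lod[i])
    if lod[peak] < threshold:
        return None
    cutoff = lod[peak] - drop
    left = peak
    while left > 0 and lod[left - 1] >= cutoff:
        left -= 1
    right = peak
    while right < len(lod) - 1 and lod[right + 1] >= cutoff:
        right += 1
    return positions_cm[left], positions_cm[right], lod[peak]

# Toy LOD profile on a 0.5 cM grid starting at 10 cM (hypothetical values).
positions = [10.0 + 0.5 * i for i in range(9)]
profile = [0.5, 1.2, 2.3, 3.0, 4.1, 3.5, 2.2, 1.9, 0.8]
interval = lod_dropoff_interval(positions, profile)
```

In the toy profile the peak LOD of 4.1 gives a cutoff of 2.1, so the interval spans the grid points from 11.0 to 13.0 cM.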
In summary, we confirm that widespread genetic variation underlies height and BMI and detect five loci showing significant linkage with height.
GWAS and linkage results are significantly correlated
Overview of the predLINK method
We developed a method, predLINK, to predict the variance explained at a given locus from a linkage genome scan. The predicted linkage signal is calculated as a sum of variances explained at all neighbouring causal (or trait-associated) loci weighted by their genetic distance to the locus of interest (Eqn. 1; Methods). This method builds upon previous work in which the theoretical expectation of the genetic variance captured by a marker was derived for a linkage analysis in an outbred population under an infinitesimal architecture.32 We assessed the performance of predLINK using the correlation between the observed and predicted variance explained at 0.5-cM-spaced loci across each chromosome.
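The idea of a distance-weighted sum can be sketched as follows. Because Eqn. 1 is not reproduced in this section, the weight used here is an assumption: the classical sib-pair result that the variance captured at a marker decays as (1 − 2θ)², where θ is the recombination fraction, combined with Haldane's map function so that (1 − 2θ)² = exp(−4d) for a distance d in Morgans. Function names and toy values are hypothetical, and this is not the authors' implementation.

```python
from math import exp, sqrt

def predicted_linkage(grid_cm, causal_cm, causal_var):
    """Predicted variance explained at each grid position.

    Sums the variance of every causal locus, down-weighted by exp(-4 d),
    where d is the genetic distance in Morgans (assumed decay function).
    """
    signal = []
    for x in grid_cm:
        total = 0.0
        for pos, var in zip(causal_cm, causal_var):
            d_morgan = abs(x - pos) / 100.0
            total += var * exp(-4.0 * d_morgan)
        signal.append(total)
    return signal

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / sqrt(saa * sbb)

# One causal locus at 50 cM explaining 1% of variance (toy example).
grid = [0.5 * i for i in range(201)]          # 0 to 100 cM in 0.5 cM steps
pred = predicted_linkage(grid, [50.0], [0.01])
```

The correlation between such a predicted profile and an observed linkage profile on the same grid can then be computed with `pearson`.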
PredLINK applied to simulated data
We first performed large scale simulations of up to 100,000 sib-pairs to validate predLINK under various polygenic architectures. Overall, the correlation between the theoretically predicted (Extended Data Fig. 2a–b: black line) and observed linkage signals (Extended Data Fig. 2a–b: yellow and grey lines) was largest for traits where the genetic variance is contributed by just a few causal variants with large effects, and it decreased with higher trait polygenicity (Extended Data Fig. 2a–c: left to right) and with a smaller discovery sample size (Extended Data Fig. 2a–c: yellow vs. grey lines). The mean correlation across 100 replicates for the least polygenic genetic architecture was 0.91 (s.e. 0.02) and 0.69 (s.e. 0.04) in the linkage analyses of 100,000 and 20,000 sib-pairs, respectively (Extended Data Fig. 2c, left-most panel, enlarged symbols; Supplementary Table 9) and decreased to 0.34 (s.e. 0.00) and -0.01 (s.e. 0.00) for the infinitesimal model (Extended Data Fig. 2c, right-most panel, enlarged symbols; Supplementary Table 9). Moreover, we found that errors in estimated SNP effects slightly decrease the correlation, especially when polygenicity is low (Extended Data Fig. 2c, Supplementary Table 9).
PredLINK applied to height and BMI
Next, we applied predLINK to assess the co-localization of observed and predicted linkage signals from GWAS, using 12,111 and 795 genome-wide significant SNPs associated with height33 and BMI (Supplementary Data 1), respectively (Methods). The 795 genome-wide significant SNPs associated with BMI were obtained from re-analysing data from ref. 34 after excluding sib-pairs (and their close relatives) from the UK Biobank also included in our linkage analyses (Methods). For height, the per-chromosome correlation ranged from -0.08 to 0.94, and the length-weighted average across chromosomes (length measured in cM) had an s.e. of 0.05 (Supplementary Fig. 5a, Supplementary Table 10). We found a smaller correlation (s.e. 0.08) for BMI (Supplementary Fig. 5b, Supplementary Table 10), consistent with a lower power to detect association and linkage for BMI as compared to height (Supplementary Note, Supplementary Fig. 6–8, Supplementary Tables 9 and 11).
Curvature effect
Under an infinitesimal genetic architecture, stronger linkage signals are expected at the centre of a chromosome as compared to its ends35. This is due to centrally located markers being in linkage with more causal variants than those located terminally, which appears as a curved linkage signal (Extended Data Fig. 2a–b). Given that predLINK recapitulates such intrinsic curvature of linkage signals, a non-zero correlation between observed and predicted linkage signals is expected for polygenic traits like height and BMI. We call this the “curvature effect” (CE). As an empirical illustration of this CE, we not only observed a significant colocalization between height-associated SNPs and linkage signals for height (s.e. 0.05, Fig. 3a) but also between height-associated SNPs and linkage signals for BMI (s.e. 0.10, Fig. 3b, Supplementary Table 10). Similarly, we found a significant colocalization between BMI-associated SNPs and linkage signals for height (s.e. 0.08, Fig. 3c, Supplementary Table 10). This observation cannot be explained by pleiotropy alone, given the low genetic correlation between height and BMI36 (rG = −0.10).
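The curvature can be made explicit with a short worked calculation. It assumes, for illustration only, that genetic variance is spread uniformly along a chromosome of length L Morgans with density v per Morgan, and that the signal from a causal variant at distance d Morgans decays as exp(−4d) (Haldane's map function for sibling pairs). The symbols v, L, x and S(x) are introduced here and do not appear in the original text.

```latex
S(x) = \int_{0}^{L} v\, e^{-4|x - t|}\, dt
     = \frac{v}{4}\left(2 - e^{-4x} - e^{-4(L - x)}\right),
\qquad
S\!\left(\tfrac{L}{2}\right) = \frac{v}{2}\left(1 - e^{-2L}\right),
\qquad
S(0) = S(L) = \frac{v}{4}\left(1 - e^{-4L}\right).
```

For a chromosome of 2 Morgans this gives S(L/2)/S(0) ≈ 1.96, so under these assumptions the expected signal at the centre is close to twice that at either end.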
Figure 3. Colocalization between observed and GWAS-predicted linkage signals.

Row-panels (row 1 = panels a and b; row 2 = panels c and d) represent predicted linkage signals based on a given set of trait-associated SNPs and column-panels represent observed linkage signals for height (panels a and c) and body mass index (BMI; panels b and d). The x-axis in each panel displays the correlation between observed and predicted (from GWAS results; Methods) linkage signals. The y-axis represents counts. In each panel, the vertical dashed line represents the correlation between observed linkage signals for the trait specified in the corresponding column-panel header and predicted linkage signals from either 12,010 height-associated SNPs (panels a and b) or 787 BMI-associated SNPs (panels c and d). Predicted linkage signals were also obtained under the null hypothesis (that is, “the correlation between observed and predicted linkage signals is due to the curvature effect”) using 1,000 draws of random SNPs with similar minor allele frequency and linkage disequilibrium properties as trait-associated SNPs. The histogram in each panel represents the distribution of correlations (under the null) between observed linkage for the trait indicated in the corresponding column-panel and predicted linkage obtained from these 1,000 draws. The P-values (P) reported in the top-left corner of each panel assess the statistical significance of the difference between the observed correlation and the mean of correlations obtained under the null hypothesis, using a two-sided Wald test (conditional on the observed correlation) and based on the sampling variance of the null correlations across replicates. At a significance threshold P<0.05, our results imply that linkage signals for height are predictable from height-associated SNPs (panel a), but not from BMI-associated SNPs (panel c), and that linkage signals for BMI are also predictable from BMI-associated SNPs (panel d), but not from height-associated SNPs (panel b). Numeric values are presented in Supplementary Table 10.
To assess the significance of the observed correlation beyond the correlation expected from the CE, we generated a null distribution from predicted linkage based on random SNPs matched on minor allele frequency and linkage disequilibrium with the actual trait-associated SNPs. For height, the average correlation over 1,000 sets of 12,010 random SNPs (s.e. 0.04) was significantly lower than the correlation obtained using height-associated SNPs (s.e. 0.05; Wald test, Fig. 3a, Supplementary Table 10), which implies that predLINK can significantly predict a height-specific linkage signal over and above the CE. The same was true for BMI, where the observed correlation (s.e. 0.08) was significantly higher than the average correlation (s.e. 0.05) obtained over 1,000 sets of 787 random SNPs (Fig. 3d, Supplementary Table 10). Overall, these results show that the colocalization between GWAS and linkage signals for height and BMI is only partially explained by the high polygenicity of these two traits.
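The comparison against the null distribution can be sketched as a two-sided Wald test in which the observed correlation is treated as fixed and the standard deviation of the null correlations across draws serves as the standard error, following the description in the legend of Fig. 3. The function name and the toy numbers are hypothetical and are not results from this study.

```python
from math import erfc, sqrt
from statistics import mean, stdev

def wald_vs_null(observed, null_draws):
    """Two-sided Wald test of an observed correlation against null draws.

    Returns (null mean, z statistic, p-value). The standard error is the
    sample standard deviation of the null correlations across draws.
    """
    null_mean = mean(null_draws)
    null_sd = stdev(null_draws)
    z = (observed - null_mean) / null_sd
    p = erfc(abs(z) / sqrt(2.0))  # two-sided normal tail probability
    return null_mean, z, p

# Toy example: three null correlations and one observed correlation.
null_mean, z, p = wald_vs_null(0.4, [0.1, 0.2, 0.3])
```

In practice `null_draws` would hold the 1,000 correlations obtained from matched random SNP sets.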
A polygenic missing heritability enriched near GWAS loci
We hypothesize that if concordance between linkage and association results is due to the same genetic loci, then correcting phenotypes for polygenic scores (PGS) should reduce the test statistic for linkage. We focus on height to test this hypothesis and used a PGS based on 12,111 height-associated variants33 explaining 0.35–0.41 (weighted mean = 0.38, Table 2) of phenotypic variance in our cohorts. After adjustment for the PGS, estimates of the IBD heritability and the residual sibling covariance were 0.68 (s.e. 0.06) and 0.08 (s.e. 0.03), and the average test statistic decreased from 3.43 (s.e. 0.38) to 2.2 (s.e. 0.28) (Table 2), implying that ~51% (i.e., 1 − (2.2 − 1)/(3.43 − 1)) of the height genetic variance in linkage analysis is captured by height-associated SNPs. Importantly, the proportion of height variance explained by each chromosome remained significantly correlated with chromosome length before and after adjustment for PGS (before adjustment: 0.73 (s.e. 0.17); after PGS-adjustment: 0.61 (s.e. 0.22); Fig. 4), implying that the unaccounted genetic variance for height is also polygenic. Finally, adjustment for PGS reduced LOD scores below 3.6 at all height-linked loci (Fig. 2).
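The ~51% figure follows from comparing the excess of the mean test statistic over 1 before and after PGS adjustment. As the subtraction in the expression implies, 1 is taken as the expected value of the test statistic in the absence of genetic variance:

```latex
1 - \frac{2.2 - 1}{3.43 - 1}
  = 1 - \frac{1.20}{2.43}
  \approx 1 - 0.49
  = 0.51
```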
Table 2. Estimates from linkage analyses of PGS-adjusted traits.
The prediction accuracy of the PGS in each cohort was measured by the squared correlation between the trait and the PGS, and its approximate s.e. was obtained using the Delta method. The row labelled “Mean” reports the mean test statistic for linkage. Standard errors are denoted as s.e. and were obtained using recombination-stratified estimation. Note that 3 significant digits were used to report standard errors for the prediction accuracy because of the larger precision of these estimates relative to that of heritability estimates.
| Trait | GS | QIMR | LL | UKB | EBB | HUNT | Meta-analysis |
|---|---|---|---|---|---|---|---|
| Height | 0.44 (0.03) | 0.43 (0.04) | 0.48 (0.01) | 0.50 (0.01) | 0.52 (0.02) | 0.50 (0.01) | 0.50 (0.01) |
| Height (PGS adjusted) | 0.41 (0.009) | 0.35 (0.009) | 0.38 (0.006) | 0.41 (0.005) | 0.37 (0.005) | 0.39 (0.005) | 0.38 (0.002) |
| Mean (s.e.) | 1.05 (0.17) | 1.02 (0.13) | 1.20 (0.18) | 1.01 (0.13) | 1.34 (0.16) | 1.09 (0.15) | 2.2 (0.28) |
| Median | 0.87 | 1.22 | 1.13 | 1.14 | 1.64 | 1.18 | 2.44 |
| Proportion > 0 | 0.62 | 0.65 | 0.67 | 0.62 | 0.74 | 0.74 | 0.84 |
| IBD heritability (s.e.) | 0.76 (0.21) | 0.80 (0.21) | 0.71 (0.15) | 0.45 (0.13) | 0.81 (0.13) | 0.67 (0.12) | 0.68 (0.06) |
| Residual sibling covariance (s.e.) | 0.08 (0.11) | −0.10 (0.11) | 0.11 (0.08) | 0.25 (0.07) | 0.02 (0.07) | 0.05 (0.06) | 0.08 (0.03) |
| BMI | 0.26 (0.03) | 0.25 (0.04) | 0.27 (0.01) | 0.26 (0.01) | 0.30 (0.02) | 0.25 (0.01) | 0.26 (0.01) |
| BMI (PGS adjusted) | 0.11 (0.006) | 0.09 (0.006) | 0.10 (0.004) | 0.10 (0.003) | 0.09 (0.003) | 0.08 (0.003) | 0.09 (0.002) |
| Mean (s.e.) | 0.84 (0.11) | 0.87 (0.14) | 0.68 (0.09) | 1.01 (0.15) | 0.77 (0.11) | 1.04 (0.16) | 1.39 (0.21) |
| Median | 0.95 | 0.85 | 0.72 | 0.99 | 0.64 | 0.99 | 1.38 |
| Proportion > 0 | 0.54 | 0.61 | 0.58 | 0.71 | 0.62 | 0.62 | 0.76 |
| IBD heritability (s.e.) | 0.26 (0.27) | 0.75 (0.22) | 0.44 (0.19) | 0.96 (0.17) | 0.70 (0.16) | 0.33 (0.14) | 0.58 (0.07) |
| Residual sibling covariance (s.e.) | 0.15 (0.13) | −0.14 (0.11) | 0.04 (0.1) | −0.23 (0.09) | −0.12 (0.08) | 0.05 (0.07) | −0.05 (0.04) |
Figure 4. Correlation between chromosome length and estimates of variance explained from linkage analyses of height.

Analyses were based on summary statistics from a linkage meta-analysis of height and height adjusted for polygenic score (PGS) in 119,457 quasi-independent sibling pairs. Each dot represents a chromosome. The x-axis represents the physical length of each chromosome relative to the size of the autosome (i.e., ~2879 Mb). The y-axis represents the expected variance explained for each chromosome, estimated from the mean across the chromosome of the estimates of locus-specific variance and an effective number of independent markers per chromosome (Supplementary Table 8). Error bars around each dot are based on the standard deviation of linkage estimates across the chromosome. Standard errors (s.e.) of the regression slopes were obtained using a leave-one-chromosome-out jackknife approach. 95% confidence intervals (CI) for the regression slopes were calculated as the estimate ± 1.96×s.e.
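A leave-one-chromosome-out jackknife for a regression slope can be sketched as below. The sketch assumes a simple ordinary least-squares regression of per-chromosome variance explained on relative chromosome length and the standard delete-one jackknife variance formula; any weighting used in the actual analysis is not reproduced. Function names and toy data are hypothetical.

```python
from math import sqrt

def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def jackknife_slope(x, y):
    """Return (slope, jackknife s.e., 95% CI) leaving out one point at a time."""
    n = len(x)
    full = ols_slope(x, y)
    loo = [ols_slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    se = sqrt((n - 1) / n * sum((s - mean_loo) ** 2 for s in loo))
    return full, se, (full - 1.96 * se, full + 1.96 * se)

# Toy data: one point per chromosome (hypothetical values).
length = [1.0, 2.0, 3.0, 4.0, 5.0]
variance = [1.1, 1.9, 3.2, 3.9, 5.1]
slope, se, ci = jackknife_slope(length, variance)
```

Each leave-one-out slope drops a whole chromosome, so the resulting standard error reflects chromosome-to-chromosome variability in the estimates.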
Next, we estimated the correlation between linkage signals for PGS-adjusted height and predicted linkage signals from height-associated SNPs. We found a correlation (s.e. 0.08) that was significant beyond the expected CE (s.e. 0.03 for the mean correlation under the null, testing the difference between the observed correlation and this null mean; Extended Data Fig. 3), which provides evidence that causal variants for height not captured by current GWAS are also enriched within GWAS-detected height-associated loci.
In conclusion, our results support the hypothesis that the same genetic loci underlie association and linkage signals and suggest that the missing heritability of height is polygenic and involves causal variants enriched near height-associated SNPs identified through large scale GWAS.
DISCUSSION
We have conducted a large linkage analysis of complex traits in humans using ~119,000 sibling pairs. We first estimated the variance explained by degree of IBD sharing across the entire genome, and subsequently estimated the variance explained by IBD sharing at specific loci across the genome (a traditional linkage scan across the genome). We estimated a heritability of 0.76 (s.e. 0.05) and 0.55 (s.e. 0.07) for height and BMI, respectively, and detected multiple significant linkage peaks for height. Analyses were repeated after adjusting the phenotype for known population-wide associations from GWAS, using a polygenic score, and this showed a reduction of the heritability and a reduction in the linkage signal, consistent with theoretical expectation.
Whether genetic associations and linkage peaks are caused by the same loci had been previously debated37. For example, a difference would be expected if family-specific mutations with large effects are the cause of the observed phenotype. Previous linkage studies in 20,240 sibling pairs failed to detect a colocalization between linkage signals and trait-associated SNPs from an independent GWAS of height and BMI in ~130,000 individuals38. This lack of concordance could be due to insufficient power in either or both linkage and GWAS. Our study revisited this observation using a new method, predLINK, and data from a 5-fold larger linkage study and that from the currently largest GWAS of height (N=5M) and BMI (N=650K). We show that observed and GWAS-predicted linkage signals are correlated across the genome, confirming that some of the same genetic loci contribute to within-family and population-wide genetic and phenotypic variance.
The heritability estimates for both height and BMI are remarkably similar to those estimated from twin studies1,28,29, despite the experimental designs being orthogonal (between versus within families). This concordance suggests that assumptions underlying twin studies (e.g., the “equal environment” assumption for identical and non-identical twin pairs) may not be strongly violated, at least for the traits studied. A large meta-analysis of twin studies across multiple complex traits concluded that the most parsimonious explanation of observed twin correlations was a simple model in which all familial correlations are due to additive genetic variance1. Our results for height and BMI agree with that conclusion in that we find no evidence of a residual sibling covariance for BMI, while the significant residual sibling covariance observed for height is largely explained by assortative mating. Nevertheless, we cannot rule out that the similarity of estimates from twin studies and within-family segregation for height and BMI is just a coincidence. Research involving more complex pedigrees and the analysis of multiple traits is needed to thoroughly test the congruence of estimates of genetic segregation variance and estimates of genetic variance from the phenotypic correlation between relatives.
Our heritability estimates for height (0.76) and BMI (0.55) are significantly larger than 0.55 (s.e. 0.04) and 0.29 (s.e. 0.06) obtained by Young and colleagues15 from data in Iceland using their relatedness disequilibrium regression (RDR) method (a generalisation of the sibling IBD regression method to estimate heritability in a complex pedigree). Interestingly, their estimates from an analysis of a large collection of full-sib pairs (n = 56,461–64,847) were 0.68 (s.e. 0.10) and 0.39 (s.e. 0.12) for height and BMI, respectively, not significantly different from our estimates. There are many reasons why estimates of segregation variance from complex pedigrees could differ from those obtained only using nuclear families. First, while unaccounted interactions between genes and shared environment can yield biases in both contexts, the magnitude of these biases can differ across study designs if the amount of shared environment varies between first-degree and distant relatives. Moreover, estimates of additive genetic variance in a sibling design could also be biased upwards in the presence of dominance and epistatic effects, and any such bias is likely smaller in a complex pedigree. Yet there is little evidence for non-additive genetic variance for height and BMI39,40. Our estimates of 0.76 and 0.55 can also be compared to estimates from GWAS and WGS data. For height, the SNP-based estimate is about 0.55–0.60 (ref. 41) and the WGS estimate ~0.70 (ref. 42; but with a large s.e. of ~0.10). These estimates imply that for height there is substantial genetic variation not captured by SNP arrays or, to a lesser extent, by sequence data, presumably ultra-rare variants (frequency <1/10,000, not included in ref. 42) and, perhaps, complex structural variation not captured by short-read sequencing technologies. For BMI the gap is much larger, since estimates from both GWAS and WGS data are about 0.30 (refs. 41,42).
Large exome studies have detected multiple genes with a significant burden of rare coding variants43,44, but these variants together explain a trivial amount of variation in the population. Our results after an adjustment for PGS imply that the remaining (“still-missing”) genetic variation for height is also polygenic, and not concentrated in a small number of genes. It is currently unknown what the genetic architecture of the remaining variants is in terms of allele frequency and effect sizes. All we can say for now is that they are not captured by common SNPs and large WES studies. Future studies on WGS data and large sample sizes, for example in the UK Biobank, may be able to refine the genetic architecture for height and BMI and other complex traits.
The estimate of genetic trait variation from realised relationships is, by definition, within-family segregation variance. Usually, this variance is the same as genetic variance in the population. However, correlation between genes and environments, assortative mating and population stratification can all lead to a difference between population and within-family variance3. For height, and less so for BMI, there is strong evidence for assortative mating, including in the UK Biobank45–47. In the presence of assortative mating, the estimate of heritability from “sib regression”, as used in our study, is biased downwards with respect to the heritability in the assortatively-mating current population26. Therefore, if we account for assortative mating and assume that the resemblance between siblings is solely due to genetic effects (and not to common environmental effects), then our data are consistent with a heritability of 0.87 (s.e. 0.05) for height in the current population (Supplementary Note). Under these assumptions, the still-missing heritability for height is even larger.
Our genome-wide linkage scan detected several regions that traditionally would be termed “significant” and followed up with fine-mapping or candidate gene studies. This experimental design hypothesised that the cause of the linkage peak was a single genomic locus with one or more sequence variants of large effect that co-segregated with the trait in families. However, linkage analyses for complex traits were largely unsuccessful in identifying individual loci responsible for the observed linkage peaks. Our study provides strong empirical evidence that part and perhaps most of the explanation of the failure of the linkage design is that polygenic variation creates the appearance of “major loci” when there are none. For example, the linkage peak for height on chromosome 10, which contains 69 independent height-associated SNPs (Table 1), disappears completely after adjustment for the PGS (Fig. 2). However, it is also possible that the clustering of height-associated SNPs at that locus is caused by an underlying structural variant partially tagged by each of those 69 SNPs33. If the null hypothesis being tested in linkage analysis is a highly polygenic model (e.g., the infinitesimal model), instead of the traditional null hypothesis of no genetic variation anywhere in the genome, then the threshold for declaring a significant linkage peak would be larger and most declared “significant” loci would likely disappear48.
There are several limitations to our study. First, even with more than 100,000 sibling pairs, the s.e. of our heritability estimates is still 0.05–0.07, and the linkage scans show large sampling variance (as shown in our simulations, Extended Data Fig. 2). The non-centrality parameter of the test to detect locus-specific linkage with a sibling-pairs design is approximately $n q^4 (1+r^2) / [8(1-r^2)^2]$, with $n$ the number of sibling pairs, $r$ the phenotypic correlation between siblings, and $q^2$ the variance explained by a locus49,50. Therefore, if a gene contains multiple rare variants of large effect that cumulatively explain 0.5% of height variance (assuming $r = 0.5$), then at least 3.5 million sibling pairs would be required to yield 80% statistical power (at a 5% significance threshold) to detect linkage with that gene. Although using extended families (that is, beyond sibling pairs) could improve power, a large number of informative meioses would still be needed. Secondly, we did not use a sex-specific recombination map in our RR-stratified analyses, which implies that our estimates might be affected by residual biases if height and BMI heritability is enriched at loci where RR varies between males and females. However, to the best of our knowledge, this is not supported by any evidence and would warrant further investigations beyond this study. Nevertheless, our RR-stratified framework can be easily extended to incorporate sex-specific information by using a partition of the genome that distinguishes loci with discordant RR between sexes. Thirdly, our conclusions are limited to height and BMI because they are among the most commonly measured complex traits and hence provide the largest sample sizes. Fourthly, our study focused on individuals of European ancestries because less than 5% of all sibling pairs available across cohorts could be assigned to other ancestry groups. Fifthly, our investigation only considered autosomal genetic variation.
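The sample-size argument above can be illustrated with a short Python sketch. It is not the code used in the study. It assumes (i) the approximation NCP ≈ n q⁴ (1 + r²) / [8 (1 − r²)²] for a sibling-pair linkage test, (ii) a sibling correlation of r = 0.5, and (iii) that the 5% significance threshold is genome-wide and corresponds to a LOD score of 3.6; the function names are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def ncp_per_pair(q2, r):
    """Approximate non-centrality contributed by one sibling pair (assumed form)."""
    return q2 ** 2 * (1 + r ** 2) / (8 * (1 - r ** 2) ** 2)


def required_sib_pairs(q2, r, power=0.80, lod_threshold=3.6):
    """Number of sibling pairs needed to reach the target power.

    Uses a normal approximation to the 1-d.f. non-central chi-square test.
    """
    chi2_threshold = 2 * log(10) * lod_threshold  # convert LOD to chi-square
    z_power = NormalDist().inv_cdf(power)
    ncp_needed = (sqrt(chi2_threshold) + z_power) ** 2
    return ncp_needed / ncp_per_pair(q2, r)


# A locus explaining 0.5% of trait variance, sibling correlation assumed 0.5
n_pairs = required_sib_pairs(q2=0.005, r=0.5)
print(f"{n_pairs:,.0f} sibling pairs")
```

Under these assumptions the required number of pairs is close to 3.5 million, and it scales with the inverse square of the variance explained by the locus.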
Finally, future family-based studies on common diseases and traits for which there is evidence of assortative mating and indirect genetic effects (e.g., educational attainment) could provide estimates of genetic variance that are not confounded by such population-level effects. A Supplementary Discussion section addresses additional points regarding lack of power for analyses of the missing heritability of BMI (Table 2; Extended Data Fig. 4) and the effect of dominance (Supplementary Table 12).
In conclusion, we report strong evidence of a high heritability of both height and BMI, consistent with inference from twin and family studies. Our results imply a substantial still-missing heritability, i.e., a large gap between the estimate of total additive genetic variation from our study and estimates of SNP-heritability from GWAS data, in particular for BMI where this gap is approximately 30% of phenotypic variance. We reconcile results from linkage and association studies, show that “significant” linkage peaks can be created from polygenic signals and that the still-missing heritability is also polygenic.
METHODS
Ethics declaration
The research was carried out under the University of Queensland Institutional Human Research Ethics (UQ-HREC) Approval Number UQ 2020/HE002938. Written informed consent was obtained from every participant in each study, and the study was approved by the relevant ethics committees. UKB: Ethics approval for the UK Biobank study was obtained from the North West Centre for Research Ethics Committee (11/NW/0382). HUNT: The HUNT Study was approved by the Regional Committee for Medical and Health Research Ethics, Norway and all participants gave informed written consent (REK Central application number 2018/2488). Lifelines Study: The Lifelines study was approved by the ethics committee of the University Medical Center Groningen, document number METC UMCG METc 2007/152. EBB: The Estonian Biobank is regulated by the Estonian Human Genes Research Act (HGRA) and all participants have signed a broad informed consent form. The use of the information for this study was approved by the Estonian Committee on Bioethics and Human Research (approval No 1.1–12/1478). QIMR: The QIMR studies were approved by the Human Research Ethics Committee of the QIMR Berghofer Medical Research Institute. GS: Ethical approval for the GS:SFHS study was obtained from the Tayside Committee on Medical Research Ethics A (ref 05/S1401/89). Generation Scotland obtained Research Tissue Bank approval from the East of Scotland Research Ethics Service (Ref 20/ES/0021).
Genotyping and Phenotyping
Sample selection
Genotypic and phenotypic information was collected for six large cohorts: the UK Biobank22 (UKB, N = 488,410), Generation Scotland19 (GS, N = 20,032), the Lifelines Cohort Study20,21 (LL, N = 64,623), the Queensland Institute of Medical Research cohort (QIMR, N = 13,154), the Estonian Biobank23 (EBB, N = 197,582) and the Trøndelag Health Study24,25 (HUNT, N = 70,517), where the sample size (N) refers to the number of genotyped individuals before selection of siblings and quality control. Sample overlap (N = 622) between GS and UKB was handled by removing the overlapping individuals from the UKB full-sib sample (N = 90). We restricted our analyses to adult (i.e., aged at least 18 years) full siblings of European ancestries with available phenotype measurements. Ancestry inference and sample exclusions are described in Supplementary Methods.
Sibship inference
We identified sibling pairs using the estimated kinship coefficients and, where available, pedigree information. In the UKB, the kinship coefficients and the proportion of markers for which pairs share no alleles (IBS0) were provided as a part of the data release (estimated using the KING51 software) and were used to infer the sibling pairs following the procedure outlined in Bycroft et al.22. For all other cohorts, we similarly used the KING software (v2.2.7) option --related to estimate pairwise kinship coefficients and to infer IBD-sharing segments for first-degree relationships and then selected the inferred full-sib (FS) pairs using either SNP information (EBB and LL) or both SNP and pedigree information where available (GS and QIMR). Consistent with previous studies14,38, we used a simplified data structure for our analyses by assuming sibling pairs to be independent even when the siblings involved were from the same family. We refer to those as quasi-independent sib-pairs (QISP). For example, a sibship of four siblings would lead to $4 \times 3 / 2 = 6$ QISPs included in our analysis. In total, 119,457 adult QISPs with available measures of height and/or BMI were taken forward for the analysis: 8,368 from GS; 12,844 from QIMR; 16,581 from LL; 21,756 from UKB; 25,333 from EBB and 34,575 from HUNT. Further details for each cohort are provided in Supplementary Table 13 and in Supplementary Methods.
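As a minimal illustration of how quasi-independent sib-pairs arise from sibships of arbitrary size, the following Python sketch enumerates all pairs within each family. The family and individual identifiers are hypothetical, and the sketch ignores the kinship-based inference that precedes this step.

```python
from itertools import combinations


def quasi_independent_pairs(sibships):
    """Enumerate every pair of siblings within each sibship.

    `sibships` maps a family identifier to the list of its full siblings.
    """
    pairs = []
    for family_id, members in sibships.items():
        for sib1, sib2 in combinations(sorted(members), 2):
            pairs.append((family_id, sib1, sib2))
    return pairs


# Toy input: one sibship of four and one sibship of two
sibships = {"F1": ["a", "b", "c", "d"], "F2": ["e", "f"]}
pairs = quasi_independent_pairs(sibships)
print(len(pairs))  # 6 pairs from F1 plus 1 pair from F2
```

A sibship of size k contributes k(k − 1)/2 pairs, so large sibships are over-represented relative to their number of individuals, which is why the pairs are described as only quasi-independent.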
SNP selection for IBD inference
We selected approximately 25,000 directly genotyped and LD-independent SNPs per cohort to be used in the analysis, with the exception of LL data, where high-quality (imputation $r^2$ > 0.9) imputed LD-independent HapMap3 markers were used (Supplementary Methods). The SNP genotyping array, number of SNPs available for analysis, number of SNPs passing quality control steps and FST metrics are presented in Supplementary Table 14. Genomic positions used in this study correspond to the hg19 genome build.
Phenotype quality control
Phenotype adjustments were performed within a sample of siblings in each cohort, separately for males and females. We set phenotype outliers (> 6 SD away from the mean) to missing and residualized phenotypes by fitting the age at assessment (AGE) as well as AGE$^2$ as covariates in a linear regression model. The phenotypes were then scaled to have a mean equal to 0 and a variance equal to 1 (or rank-based inverse normally transformed) within each sex. The cohort-specific phenotype means (before adjusting for fixed effects) and SDs (after adjusting for fixed effects, before scaling) are provided in Supplementary Table 15. The age distribution across cohorts is presented in Supplementary Table 16.
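The adjustment steps described above can be sketched in Python for a single sex within a single cohort. This is an illustrative sketch and not the pipeline used in the study: the function names and toy data are hypothetical, age is centred before squaring purely for numerical stability (which leaves the residuals unchanged), and the rank-based inverse normal alternative is not shown.

```python
import random
from statistics import mean, pstdev


def solve_linear(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def adjust_phenotype(values, ages, sd_cutoff=6.0):
    """Set outliers to None, regress on AGE and AGE^2, then standardise."""
    mu, sd = mean(values), pstdev(values)
    keep = [i for i, v in enumerate(values) if abs(v - mu) <= sd_cutoff * sd]
    mean_age = mean(ages[i] for i in keep)
    # Design matrix: intercept, centred AGE, centred AGE squared
    x = [[1.0, ages[i] - mean_age, (ages[i] - mean_age) ** 2] for i in keep]
    y = [values[i] for i in keep]
    xtx = [[sum(r[a] * r[b] for r in x) for b in range(3)] for a in range(3)]
    xty = [sum(r[a] * yi for r, yi in zip(x, y)) for a in range(3)]
    beta = solve_linear(xtx, xty)
    resid = [yi - sum(b * xi for b, xi in zip(beta, r)) for r, yi in zip(x, y)]
    resid_sd = pstdev(resid)
    adjusted = [None] * len(values)
    for i, e in zip(keep, resid):
        adjusted[i] = e / resid_sd
    return adjusted


# Toy data: 200 plausible heights (cm) plus one gross outlier
rng = random.Random(1)
ages = [rng.uniform(18, 80) for _ in range(200)] + [50.0]
heights = [170 + 0.2 * a - 0.003 * a ** 2 + rng.gauss(0, 7) for a in ages[:200]]
heights.append(400.0)
adjusted = adjust_phenotype(heights, ages)
```

In this sketch the outlier is flagged relative to the raw mean and SD before covariate adjustment, which is one reading of the procedure described above.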
IBD estimation and linkage analysis
Estimation of IBD coefficients
IBD coefficients between siblings were estimated along a grid of 0.5-cM spaced locations on each chromosome (genetic map positions from the interpolated CEU genetic map generated by the 1000 Genomes Project using OMNI arrays, see Data Availability) using the MERLIN52 software package (v1.1.2). Prior to IBD estimation we detected and set unlikely genotypes to missing (--error function and pedwipe module in the MERLIN52 software package, respectively). Using the estimated IBD probabilities we further calculated the locus-specific IBD-sharing proportions as $\hat{\pi}_l = 0.5\,P_{1,l} + P_{2,l}$, where $P_{1,l}$ and $P_{2,l}$ are the probabilities of the siblings sharing 1 or 2 alleles identical by descent at locus $l$, respectively. The dominance IBD coefficient (IBD2), i.e., the proportion of sharing two alleles by descent, was estimated as $\hat{\delta}_l = P_{2,l}$. Subsequently, the chromosome-wide IBD and IBD2 were estimated as an average of $\hat{\pi}_l$ and $\hat{\delta}_l$, respectively, across the chromosome grid locations. The genome-wide IBD was obtained as the length-weighted (length expressed in cM) average of chromosome-wide IBD coefficients. Coordinates on our genetic map (in cM) were converted to hg19 genomic positions to re-estimate IBD sharing proportions in mega-base-pairs (Mb). The distributions of genome-wide and chromosome-wide IBD coefficients between siblings are presented for each cohort in Supplementary Table 2 and Supplementary Fig. 1.
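The following Python sketch illustrates the averaging steps for one sibling pair, assuming the standard definition of the IBD-sharing proportion as half the probability of sharing one allele plus the probability of sharing two. The probabilities, chromosome lengths and function names are made up for illustration.

```python
def ibd_proportion(p1, p2):
    """Expected proportion of alleles shared IBD at one locus."""
    return 0.5 * p1 + p2


def length_weighted_mean(values, lengths_cm):
    """Average of chromosome-wide values weighted by genetic length (cM)."""
    total = sum(lengths_cm)
    return sum(v * w for v, w in zip(values, lengths_cm)) / total


# Toy IBD probabilities at three grid locations on one chromosome
p1 = [0.5, 0.6, 0.2]   # probability of sharing exactly one allele IBD
p2 = [0.25, 0.3, 0.8]  # probability of sharing two alleles IBD
pi_locus = [ibd_proportion(a, b) for a, b in zip(p1, p2)]

# Chromosome-wide IBD and IBD2: plain averages over the grid
chrom_ibd = sum(pi_locus) / len(pi_locus)
chrom_ibd2 = sum(p2) / len(p2)

# Genome-wide IBD from two toy chromosomes of 100 cM and 300 cM
genome_ibd = length_weighted_mean([chrom_ibd, 0.45], [100.0, 300.0])
print(round(chrom_ibd, 4), round(chrom_ibd2, 4), round(genome_ibd, 4))
```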
Recombination rate stratified IBD coefficients
We stratified the genome into four groups of 0.5-cM-long genomic segments corresponding to quartile groups of recombination rates (RR) within those segments. The RR within each segment was calculated as the ratio of its genetic length (i.e., 0.5 cM) over its physical length in mega-base-pairs (Mb), both obtained from the interpolated CEU genetic map used in our linkage analyses. The physical-length and RR distribution across these segments is shown in Supplementary Fig. 2. IBD sharing within each RR-quartile group was calculated as the length-weighted (in cM or Mb) average of the locus-specific IBD-sharing proportions for segments allocated to that group.
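A minimal Python sketch of this stratification is given below. It assumes one IBD-sharing value per 0.5-cM segment and uses the default quartile definition of Python's `statistics.quantiles`, which may differ slightly from the quartile rule used in the study; the segment lengths and IBD values are invented.

```python
from statistics import quantiles

SEGMENT_CM = 0.5  # genetic length of every segment


def rr_quartile_ibd(physical_mb, ibd, weight="cM"):
    """Length-weighted mean IBD within each recombination-rate quartile.

    Returns four values ordered from the lowest-RR to the highest-RR quartile.
    """
    rr = [SEGMENT_CM / mb for mb in physical_mb]  # cM per Mb
    q1, q2, q3 = quantiles(rr, n=4)

    def group(value):
        return 0 if value <= q1 else 1 if value <= q2 else 2 if value <= q3 else 3

    sums, weights = [0.0] * 4, [0.0] * 4
    for mb, rate, pi in zip(physical_mb, rr, ibd):
        w = SEGMENT_CM if weight == "cM" else mb
        g = group(rate)
        sums[g] += w * pi
        weights[g] += w
    return [s / w if w > 0 else None for s, w in zip(sums, weights)]


# Toy segments: physical length (Mb) and IBD sharing for one sibling pair
physical_mb = [0.1, 0.2, 0.4, 0.5, 1.0, 1.25, 2.0, 2.5]
ibd = [0.5, 0.6, 0.4, 0.5, 0.45, 0.55, 0.3, 0.7]
by_cm = rr_quartile_ibd(physical_mb, ibd, weight="cM")
by_mb = rr_quartile_ibd(physical_mb, ibd, weight="Mb")
print(by_cm, by_mb)
```

Because every segment spans the same genetic length, weighting by cM reduces to a simple average within each quartile, whereas weighting by Mb gives more influence to physically long (low-RR) segments.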
Genome-wide linkage analysis
Locus-specific linkage analysis was performed along the same 0.5-cM grid using the Visscher-Hopper computationally fast regression approach50, which performs a weighted analysis of separate regressions of sibling phenotypic mean-centred squared differences and squared sums on their locus-specific estimated IBD coefficients. Linkage peaks were declared significant when the LOD score at a given locus exceeded 3.6, as recommended previously30. Confidence intervals for the location of causal variants underlying significant linkage peaks were calculated using the LOD drop-off method31. In brief, this method determines confidence intervals by finding the genomic locations on both sides of the peak corresponding to a decrease in LOD score of 1 or 2 units. We conservatively chose a 2-unit LOD score drop-off to ensure a coverage of at least 95%.
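The LOD drop-off rule can be sketched as follows. This simplified Python illustration works on the grid itself, without interpolating between grid points, and takes the first grid location on each side of the peak at which the LOD score has fallen by at least the chosen number of units; the LOD profile is invented.

```python
def lod_drop_interval(positions_cm, lod, drop=2.0):
    """Return (peak position, left bound, right bound) for a LOD drop-off interval."""
    peak = max(range(len(lod)), key=lod.__getitem__)
    cutoff = lod[peak] - drop
    left = peak
    while left > 0 and lod[left] > cutoff:
        left -= 1
    right = peak
    while right < len(lod) - 1 and lod[right] > cutoff:
        right += 1
    return positions_cm[peak], positions_cm[left], positions_cm[right]


# Toy LOD profile on a 0.5-cM grid
positions_cm = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
lod = [0.5, 1.0, 2.5, 3.9, 4.6, 4.0, 2.4, 1.2, 0.3]
peak_pos, lower, upper = lod_drop_interval(positions_cm, lod, drop=2.0)
is_significant = max(lod) > 3.6
print(peak_pos, lower, upper, is_significant)
```

If the profile never drops by the required amount on one side, the sketch returns the end of the grid on that side, so the interval should then be read as open-ended.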
Heritability estimation
We estimated and (and dominance variance; Supplementary Table 12) in each cohort using the REstricted Maximum Likelihood (REML) implemented in a custom R-script available on Zenodo (see Code Availability).
GWAS of body mass index
We previously published a large GWAS of BMI34 (N~700,000 participants) combining data from the GIANT consortium53 (hereafter simply referred to as GIANT) and the UKB. To avoid biases due to the sample overlap with UKB (in particular for our prediction analyses), we regenerated GWAS summary statistics for BMI after excluding sibling pairs (and their relatives; defined as when the estimated genomic relationship exceeds 0.05) identified in the UKB. We then used the same analysis pipeline as Yengo et al. (2018)34 after excluding that sample.
In brief, we first conducted a GWAS of BMI using BOLT-LMM v2.4.1 (ref. 54) in a sub-sample of 397,279 UKB participants excluding sib-pairs and their relatives. The BMI phenotype was as described previously34. We analysed SNPs from the third (v3) release of imputed UKB data (imputed to the HRC and UK10K reference panels) with an imputation quality score above 0.3. For each UKB participant, the genotypes were hard-called with posterior probability larger than 0.9, retaining SNPs with call rate >0.95, p-value for the Hardy–Weinberg test larger than $1 \times 10^{-5}$ and MAF ≥ 0.001. We used a set of 561,573 HM3 SNPs (MAF ≥ 1% and LD-pruned at $r^2$ > 0.9) as “model SNPs” to control for population structure and remaining relatedness in the sample. We then meta-analysed our results from UKB with summary statistics from GIANT across a subset of ∼1.1 million HM3 SNPs with MAF ≥ 1% and consistent alleles and allele frequencies (maximum absolute difference <0.15) between the UKB and GIANT. Finally, we used GCTA55 to perform a COnditional and JOint (COJO) analysis of summary statistics from the latter meta-analysis using a random sample of 50,000 UKB participants as LD reference. This analysis identified 795 conditionally and jointly significant SNPs at the genome-wide significance threshold $p = 5 \times 10^{-8}$ (Supplementary Data 1), explaining ~5% of BMI variance. We then relaxed the significance threshold to $p = 1 \times 10^{-3}$ to include 4,582 SNPs (Supplementary Data 2) collectively explaining ~9% of BMI variance (Table 2).
Predicted linkage signal
For each chromosome, we predict the expected linkage signal, measured as predicted variance explained, $q^2(x)$, at a given genetic position $x$ (in Morgan) using Equation (1) below:
$$q^2(x) = \sum_{j=1}^{m} 2p_j(1-p_j)\,\beta_j^2\,\rho_j(x) \tag{1}$$
where $m$ is the number of causal SNPs on the chromosome, and $p_j$, $\beta_j$ and $x_j$ are the minor allele frequency, the minor allele effect and the genetic position of the $j$-th causal SNP, respectively.
The intuition behind Equation (1) is to predict linkage as the product between the variance explained by the $j$-th causal SNP (i.e., $2p_j(1-p_j)\beta_j^2$) and the expected correlation $\rho_j(x)$ between indicators that alleles at positions $x$ and $x_j$ are inherited by both siblings from the same parent14. Assuming Haldane’s mapping function, $\rho_j(x)$ can be expressed as $e^{-4|x - x_j|}$ (ref. 14). Then, the overall expectation is obtained by summing up the products $2p_j(1-p_j)\beta_j^2\rho_j(x)$ across all causal SNPs on the chromosome. If each causal variant equally contributes to heritability, then Equation (1) describes a discretised version of Dekkers and Dentine’s32 results obtained under an infinitesimal genetic architecture. We implemented this method in an R script available on Zenodo (see Code Availability).
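The prediction described above can be illustrated with a short Python sketch. It is not the authors' R implementation: it assumes that, under Haldane's mapping function, the correlation between IBD sharing at two positions separated by d Morgan is exp(−4d), and the function name and SNP values are hypothetical.

```python
from math import exp


def predicted_linkage(x, snps):
    """Predicted variance explained at genetic position x (in Morgan).

    `snps` is an iterable of (minor allele frequency, effect, position) tuples
    for the causal or trait-associated SNPs on one chromosome.
    """
    return sum(
        2 * p * (1 - p) * beta ** 2 * exp(-4 * abs(x - x_j))
        for p, beta, x_j in snps
    )


# Toy chromosome with two SNPs located at 1.00 and 1.05 Morgan
snps = [(0.5, 0.10, 1.00), (0.2, 0.05, 1.05)]
grid = [i * 0.005 for i in range(0, 401)]  # 0.5-cM grid over 2 Morgan
profile = [predicted_linkage(x, snps) for x in grid]
peak_position = grid[profile.index(max(profile))]
print(peak_position, max(profile))
```

Each SNP contributes its full variance at its own position and a contribution that decays exponentially with distance, so clusters of small-effect SNPs can add up to a broad peak.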
We used the same framework to derive an expectation of linkage signal even when causal variants are unknown, by replacing causal SNPs with independent trait-associated SNPs identified from GWAS and causal SNP effects with the estimated joint SNP effects obtained from GWAS summary statistics using the GCTA-COJO module56. On average the number of trait-associated SNPs (for height and BMI) is proportional to the length of the chromosome. Genetic distances were obtained using sex-averaged genetic map positions from the interpolated CEU genetic map generated by the 1000 Genomes Project from OMNI arrays (see Data Availability). We used 12,111 genome-wide significant SNPs jointly associated with height33 and 795 genome-wide significant SNPs jointly associated with BMI (see “GWAS of body mass index” above). Unique genetic positions on the CEU genetic map were available for 12,010 height-associated and 787 BMI-associated SNPs, and the allele frequencies were from the UKB sib-pair sample.
We assessed the accuracy of Equation (1) to predict observed linkage signals by calculating the Pearson correlation between observed and expected linkage signals for each chromosome. We also report the chromosome-length-weighted (length measured in cM units) average of these correlations ($\bar{\rho}$) across chromosomes. The standard error of $\bar{\rho}$ is calculated based on a leave-one-chromosome-out jackknife procedure using Equation (2) below
$$\mathrm{s.e.}(\bar{\rho}) = \sqrt{\frac{21}{22}\sum_{k=1}^{22}\left(\bar{\rho}_{-k} - \bar{\rho}\right)^2} \tag{2}$$
where $\bar{\rho}_{-k}$ denotes the chromosome-length-weighted average across all chromosomes but chromosome $k$. Note that this standard error does not capture sampling variation across individuals but only tracks LD and genetic architecture differences between chromosomes.
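A leave-one-chromosome-out jackknife of a length-weighted average can be sketched in Python as below. The sketch assumes the usual delete-one jackknife form, in which the squared deviations of the leave-one-out estimates from the full estimate are summed and scaled by (K − 1)/K for K chromosomes; the exact form used in the study may differ, and the input values are invented.

```python
from math import sqrt


def weighted_mean(values, weights):
    """Weighted average of `values`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def jackknife_se(correlations, lengths_cm):
    """Length-weighted mean correlation and its delete-one jackknife s.e."""
    k = len(correlations)
    full = weighted_mean(correlations, lengths_cm)
    leave_one_out = [
        weighted_mean(
            correlations[:i] + correlations[i + 1:],
            lengths_cm[:i] + lengths_cm[i + 1:],
        )
        for i in range(k)
    ]
    variance = (k - 1) / k * sum((est - full) ** 2 for est in leave_one_out)
    return full, sqrt(variance)


# Toy per-chromosome correlations and chromosome lengths (cM)
correlations = [0.10, 0.20, 0.30, 0.40]
lengths_cm = [100.0, 100.0, 100.0, 100.0]
rho_bar, rho_se = jackknife_se(correlations, lengths_cm)
print(round(rho_bar, 4), round(rho_se, 4))
```

With equal chromosome lengths this reduces to the familiar standard error of a mean, which provides a simple sanity check.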
Simulated null distribution reflecting curvature effects
We first grouped all SNPs in the HapMap 3 panel into 28 MAF-LD categories corresponding to 7 MAF classes (defined as <1%; between 1% and 5%; between 5% and 10%; between 10% and 20%; between 20% and 30%; between 30% and 40%; and between 40% and 50%) and 4 LD classes (defined by quartile groups of the LD score distribution across HapMap 3 SNPs). For each simulation replicate, we sampled the same number of SNPs as trait-associated SNPs present within each MAF-LD category. Finally, randomly sampled SNPs were allocated effect sizes also randomly sampled from the set of estimated SNP effects at trait-associated SNPs present in the corresponding MAF-LD category. We tested the statistical significance of the difference between the correlation obtained from trait-associated SNPs and the mean correlation obtained under the null using a Wald test conditional on the observed value of the former (that is, it is assumed to be fixed) and based on the sampling variance of the null correlations across simulation replicates.
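The matched sampling and the Wald test can be sketched in Python as follows. This is a simplified illustration with hypothetical SNP identifiers and category labels: it assumes that effects are drawn with replacement within each MAF-LD category (the text does not specify this) and that the two-sided P-value is obtained from a standard normal distribution.

```python
import random
from collections import defaultdict
from statistics import NormalDist, mean, stdev


def sample_matched_null(trait_snps, panel_snps, rng):
    """Draw random SNPs matched on MAF-LD category to the trait-associated SNPs.

    trait_snps: list of (snp_id, category, effect)
    panel_snps: list of (snp_id, category)
    """
    panel_by_cat = defaultdict(list)
    for snp_id, cat in panel_snps:
        panel_by_cat[cat].append(snp_id)
    effects_by_cat = defaultdict(list)
    for _, cat, effect in trait_snps:
        effects_by_cat[cat].append(effect)
    draw = []
    for cat, effects in effects_by_cat.items():
        # Same number of SNPs as trait-associated SNPs in this category
        for snp_id in rng.sample(panel_by_cat[cat], len(effects)):
            draw.append((snp_id, cat, rng.choice(effects)))
    return draw


def wald_p_value(rho_observed, rho_null_draws):
    """Two-sided Wald test treating the observed correlation as fixed."""
    z = (rho_observed - mean(rho_null_draws)) / stdev(rho_null_draws)
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(2024)
panel_snps = [(f"snp{i}", "A" if i % 2 == 0 else "B") for i in range(20)]
trait_snps = [("t1", "A", 0.02), ("t2", "A", -0.01), ("t3", "B", 0.05)]
null_draw = sample_matched_null(trait_snps, panel_snps, rng)
print(null_draw)
```

In a full analysis each draw would be passed through the linkage prediction to obtain one null correlation, and the collection of null correlations across draws would then be supplied to `wald_p_value`.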
Polygenic scores (PGS) and PGS-adjusted phenotypes
For each individual, we calculated the PGS based on the conditional and joint (COJO) effect estimates ($\hat{\beta}_j$, at SNP $j$) of height-associated SNPs from a genome-wide association study (GWAS) of over 5 million individuals33 and of BMI-associated SNPs from a GWAS of ~650,000 individuals (see “GWAS of body mass index”). We used 12,111 genome-wide significant SNPs for height and 4,582 SNPs selected with a less stringent p-value threshold ($p = 1 \times 10^{-3}$) for BMI. We applied allelic scoring implemented in the PLINK v1.90b6.20 software package57 (--score option) to calculate the PGS of each individual included in our study. More specifically, the PGS of individual $i$ (hereafter denoted $\mathrm{PGS}_i$) was calculated as $\mathrm{PGS}_i = \sum_j \hat{\beta}_j x_{ij}$, where $x_{ij}$ is the minor allele count of individual $i$ at SNP $j$ and $\hat{\beta}_j$ the estimated effect of the minor allele at SNP $j$. In each cohort the imputed genotypes were used to extract available trait-associated SNPs. The imputation panel and the number of SNPs used in PGS calculations are reported in Supplementary Table 17 and the amount of variance explained by the PGS in each cohort is shown in Table 2.
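The score defined above, and one simple way of adjusting a phenotype for it, can be sketched in Python. The adjustment shown (residuals from a simple linear regression of the phenotype on the PGS) is an assumption made for illustration, since this section does not detail how the adjustment was performed; the data and function names are invented.

```python
def polygenic_score(allele_counts, effects):
    """Sum of minor allele counts weighted by estimated SNP effects."""
    return sum(b * x for b, x in zip(effects, allele_counts))


def adjust_for_pgs(phenotypes, scores):
    """Residuals from a simple linear regression of phenotype on PGS."""
    n = len(phenotypes)
    mean_y = sum(phenotypes) / n
    mean_s = sum(scores) / n
    sxy = sum((s - mean_s) * (y - mean_y) for s, y in zip(scores, phenotypes))
    sxx = sum((s - mean_s) ** 2 for s in scores)
    slope = sxy / sxx
    return [(y - mean_y) - slope * (s - mean_s) for s, y in zip(scores, phenotypes)]


# Toy data: three SNPs, four individuals
effects = [0.10, -0.20, 0.05]
genotypes = [[0, 1, 2], [2, 0, 1], [1, 1, 1], [2, 2, 0]]
scores = [polygenic_score(g, effects) for g in genotypes]
phenotypes = [0.3, 0.9, -0.2, -1.0]
residuals = adjust_for_pgs(phenotypes, scores)
print(scores, residuals)
```

By construction the residuals are uncorrelated with the PGS, so any remaining linkage signal cannot be attributed to the SNPs included in the score, up to estimation error in their effects.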
Estimation of SNP-based heritability
We estimated the SNP-heritability of height and BMI in each cohort using the Genome-based Restricted Maximum Likelihood (GREML) method implemented in GCTA55. We calculated genomic relationship matrices (GRM) using SNPs in the HapMap 3 panel selected to have a MAF larger than 0.01 and a p-value for the Hardy-Weinberg Equilibrium test larger than $1 \times 10^{-6}$. We hereafter refer to this GRM as the full GRM. We modelled shared genetic and environmental effects between close relatives using another GRM obtained from the full GRM by setting all off-diagonal elements lower than 0.05 to 0. We used these two GRMs to jointly estimate the SNP-based heritability and the residual component capturing familial effects, as done previously58. Sample sizes for these analyses are reported in Supplementary Table 13. SNP-based heritability estimates are shown in Table 2.
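The construction of the second relationship matrix can be illustrated with a few lines of Python operating on a plain nested list. This is only a sketch of the thresholding rule stated above, not of the GCTA implementation, and the matrix values are invented.

```python
def threshold_grm(grm, cutoff=0.05):
    """Copy of the GRM with off-diagonal entries below `cutoff` set to 0."""
    n = len(grm)
    return [
        [grm[i][j] if i == j or grm[i][j] >= cutoff else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Toy full GRM: individuals 0 and 1 are siblings, individual 2 is unrelated
full_grm = [
    [1.01, 0.49, 0.01],
    [0.49, 0.98, -0.02],
    [0.01, -0.02, 1.00],
]
family_grm = threshold_grm(full_grm)
for row in family_grm:
    print(row)
```

Fitting both matrices jointly lets the full GRM capture variation tagged by SNPs among all individuals while the thresholded GRM absorbs the extra resemblance between close relatives.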
Simulation of linkage studies
We performed simulations to assess the predictive performances of Equation (1) under various genetic architectures. We simulated traits with a heritability (to maximize statistical power) and varied the proportion of causal variants across the genome between 0.1%, 0.5%, 1%, 5%, 10%, 20%, 50% and 100%, thus defining 8 different scenarios. All simulations were conditional on real data as described below.
Simulation of genotypes for IBD inference
We simulated genotypes of 100,000 sibling pairs using phased haplotypes of 972 unrelated individuals in the UKB. As previously described59, phasing was performed with SHAPEIT v2 (ref. 60) using genotypes of both parents for these 972 individuals, who were also participants of the UKB. We modified the R script proposed in ref. 59 (which initially focused on simulating inbreeding) to simulate sibling pairs. Our modified R script is available on Zenodo (see Code Availability). We simulated genotypes over 301,412 SNPs but focused our analyses on 26,136 LD-pruned (LD $r^2$ < 0.05 in a 5-Mb window) SNPs with MAF >10%, consistent with our real data analysis in the UKB. Genetic positions were updated using the genetic maps downloaded from the BCFtools software website (see Data Availability). Genotypes were simulated once then fixed across simulation replicates.
Simulation of phenotypes
Phenotypes were simulated conditionally on simulated genotypes described above. Under each scenario (i.e., proportion of causal variants) and for each simulation replicate, we randomly sample causal variants out the 26,136 SNPs, then assign each of them an allelic effect (for causal SNP with MAF ) such that each causal SNP explains the same amount of trait variance. To achieve this, we set
| (3) |
Next, we simulate the phenotype of individual using the following equation
| (4) |
where is the minor allele count for individual at causal SNP . By construction the phenotypic variance is .
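The idea of assigning effects so that every causal SNP explains the same share of trait variance can be sketched in Python. The sketch rests on assumptions that are not stated in this section: the phenotypic variance is taken to be 1, causal SNPs are treated as uncorrelated and in Hardy-Weinberg equilibrium, genotypes are centred, any residual is Gaussian with variance 1 − h², and all effects are given a positive sign. The heritability value used below is arbitrary.

```python
import random
from math import sqrt


def equal_variance_effects(mafs, h2):
    """Allelic effects such that each SNP explains h2 / m of the variance."""
    m = len(mafs)
    return [sqrt(h2 / (m * 2 * p * (1 - p))) for p in mafs]


def simulate_phenotype(allele_counts, effects, mafs, h2, rng):
    """Genetic value from centred genotypes plus an optional Gaussian residual."""
    genetic = sum(
        b * (x - 2 * p) for b, x, p in zip(effects, allele_counts, mafs)
    )
    residual = rng.gauss(0.0, sqrt(1 - h2)) if h2 < 1 else 0.0
    return genetic + residual


rng = random.Random(7)
mafs = [0.15, 0.30, 0.45, 0.25]
h2 = 0.8  # arbitrary value chosen for this illustration
effects = equal_variance_effects(mafs, h2)

# Toy genotypes drawn under Hardy-Weinberg equilibrium
genotypes = [
    [sum(rng.random() < p for _ in range(2)) for p in mafs] for _ in range(5)
]
phenotypes = [simulate_phenotype(g, effects, mafs, h2, rng) for g in genotypes]
print(effects, phenotypes)
```

Rarer variants receive larger per-allele effects under this scheme, because the variance explained by a SNP is proportional to its heterozygosity times the squared effect.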
Impact of estimation error in SNP effects from GWAS
We assessed the impact on of errors in estimated causal SNP effects (Extended Data Fig. 2c) by replacing with , where is a random error term with mean equal to 0 and variance defined as
| (5) |
In Equation (5), denotes the sample size of a hypothetical GWAS from which SNP effects were estimated. We choose such that the expected prediction accuracy of a PGS calculated from the ’s is (ref.61).
Expected linkage test statistics under an infinitesimal genetic architecture
As in ref.14, we determined for each chromosome an effective number () of independent chromosomal segments (Supplementary Table 8). This number corresponds to the number of independent loci over which the variance of IBD sharing would be equivalent to the observed variance for the whole genome. Note that the sum of across chromosomes is , which is ~10% larger than obtained by Visscher and colleagues14 using microsatellites.
Under an infinitesimal genetic architecture, each of the independent segments is expected to explain of trait-variance. Therefore, we used to predict the non-centrality parameter (NCP) of the linkage test statistic across the genome as
| (6) |
with the number of sibling pairs and the phenotypic correlation between siblings, , and 49,50, and thereby derived expectations for the mean and SD of across loci as and . Finally, we predicted the proportion of test statistics with a positive value using the cumulative distribution function of the normal distribution with mean 0 and variance equal to NCP.
Extended Data
Extended data Figure 1. Observed and theoretically predicted statistics for locus-specific linkage analysis.

Panel a, the observed and predicted mean test statistics of linkage () test statistics for height and BMI. The error-bars indicate standard errors (s.e.) calculated as the standard deviation of locus-specific statistics divided by the square root of the effective number independent markers, that is ~ (Supplementary Table 8). The size of the circle is proportional to sample size. The theoretically predicted values are based on the REML estimates of heritability from genome wide IBD regression () and the observed correlation between siblings. Panel b, the proportion of loci with positive (i) estimated linkage (the bars and the values) and (ii) theoretically predicted (the black rectangles +/- s.e., Methods). The dotted horizontal line represents the proportion (i.e., 0.5) expected in the absence of a genetic contribution to the trait. The data is shown for Generation Scotland (GS, number of quasi-independent sib-pairs (n) = 8,368), the Queensland Institute of Medical Research cohort (QIMR, n = 12,844), the Lifelines Cohort (LL, n = 16,581), the UK Biobank (UKB, n = 21,756), the Estonian Biobank (EBB, n = 25,333), the HUNT study (HUNT, n = 34,575) and the meta-analysis combining all cohorts (META, n = 119,457). The numerical values for mean and median and proportion of > 0 are presented in Supplementary Table 7A.
Extended data Figure 2. Effect of polygenicity and sample size of linkage studies on the correlation between predicted and observed linkage signals in simulated data.

The results are shown for 8 simulated genetic architectures (polygenicity = 0.1%-100%) with a genome-wide . a-b, show the observed and predicted linkage signals (measured as variance explained) on chromosomes 1 and 22, respectively, for one simulation replicate. The simulated causal variants are depicted as green stars. The predicted signal, estimated as a weighted sum of simulated effects (Methods, Eq. 1) is depicted by the black curve. The grey and yellow lines show the observed linkage signal from the analysis of 20,000 and 100,000 simulated sib-pairs, respectively, where the phenotypes were simulated using the same causal variants (green stars). The correlations for each polygenicity panel are the chromosome-wide estimates for each linkage sample size (yellow: n=20,000; grey: n=100,000). c, the summary of results across 100 replicates. is estimated per chromosome across the grid of 0.5 cM, then a chromosome length weighted average is calculated for each replicate. Each symbol represents a mean value across 100 simulation replicates and the error bars are standard deviation across replicates. The left-most enlarged symbols for each polygenicity panel indicate that the true simulated SNP effects were used predict linkage signal, i.e., the expected prediction accuracy from polygenic scores () using these causal variants = 1. To approximate estimation errors of SNP effects in a GWAS of finite sample, was also calculated using causal variants with (regular symbols). For the numeric values see Supplementary Table 9. Estimated variance components were not constrained to ensure unbiasedness. Therefore, if a region of the genome does not explain any genetic variation, then 50% of the estimates are expected to be negative.
Extended data Figure 3. Colocalization between GWAS-predicted and observed linkage signals for traits adjusted for polygenic scores (PGS).

Panel a, the correlation between observed linkage signals for PGS-adjusted height and predicted linkage signals from 12,010 height-associated SNPs. Panel b, the correlation between observed linkage signals for PGS-adjusted BMI and predicted linkage signals from 787 BMI-associated SNPs. Height was adjusted using a PGS based on the same 12,010 height-associated SNPs (explaining 38% of height variance), while BMI was adjusted using a PGS including 4,582 SNPs (explaining 9% of BMI variance). The x-axis in each panel displays the correlation () between observed and predicted (from GWAS results; Methods) linkage signals. In each panel, the vertical dashed line represents the correlation between observed and predicted linkage signals from either height-associated SNPs (a) or 787 BMI-associated SNPs (b). Predicted linkage signals were also obtained under the null hypothesis (that is “the correlation between observed and predicted linkage signals is due to the curvature effect”) using 1,000 draws of random SNPs with similar minor allele frequency and linkage disequilibrium properties as trait-associated SNPs. The histogram in each panel represents the distribution of correlations (under the null) between observed linkage for the trait indicated in the corresponding column-panel and predicted linkage obtained from these 1,000 draws. The mean of correlations obtained under the null hypothesis is denoted . The P-values (P) reported in the top-left corner of each panel assess the statistical significance of the difference between and using a two-sided Wald test. Numeric values are presented in Supplementary Table 10.
Extended data Figure 4. Correlation between chromosome length and estimates of variance explained from linkage analyses of BMI.

Analyses were based on summary statistics from a linkage meta-analysis of BMI and BMI adjusted for polygenic score (PGS). The x-axis represents the physical length of each chromosome relative to the size of the autosome (i.e., ~2879 Mb). The y-axis represents the expected variance explained () for each chromosome () estimated as , where is the mean across the chromosome of estimates of locus-specific variance, and an effective number of independent markers per chromosome (Supplementary Table 8). Error bars around each dot represent times the standard deviation of linkage estimate across the chromosomes. Standard errors (s.e.) of the regression slopes were obtained using a leave-one-chromosome-out jackknife approach. 95% confidence intervals (CI) were calculated as 1.96s.e.
ACKNOWLEDGEMENT
We acknowledge the participants and analysts in each cohort contributing to this study. LY was funded by the Australian Research Council (DE200100425, FT220100069). PMV was funded by the Australian Research Council (FL180100072) and the Australian National Health and Medical Research Council (Grant 113400). BCD is supported by NHMRC CJ Martin Fellowship APP1161356. GHM is the recipient of an Australian Research Council Discovery Early Career Award (Project number: DE220101226) funded by the Australian Government and supported by the Research Council of Norway (Project grant: 325640). DC is supported by the Ragnar Söderberg Foundation (E42/15 to DC); DC and DJB by Open Philanthropy (grant 010623–00001 to DJB); DJB by the National Institute on Aging (NIA)/National Institutes of Health (NIH) (grants R24-AG065184 and R01-AG042568 to DJB). DME is supported by an Australian National Health and Medical Research Council Investigator Award (2017942). Additional acknowledgements are provided in the Supplementary Information.
CONSORTIUM AUTHOR LIST AND AFFILIATIONS
Estonian Biobank Research Team
Reedik Mägi9, Andres Metspalu9
9Estonian Genome Centre, Institute of Genomics, University of Tartu, Tartu, Estonia
A full list of members and their affiliations appears in the Supplementary Information.
Lifelines Cohort Study
Alireza Ani11,12, Rujia Wang11, Ilja M. Nolte11, Harold Snieder11
11Department of Epidemiology, University of Groningen, University Medical Center Groningen, Groningen, Netherlands
12Department of Bioinformatics, Isfahan University of Medical Sciences, Isfahan, Iran
A full list of members and their affiliations appears in the Supplementary Information.
Footnotes
COMPETING INTERESTS
The authors declare no competing interests.
CODE AVAILABILITY
The custom code generated in this paper (source code of predLINK, R script to simulate sib-pairs, R script to run restricted maximum likelihood estimation for QISPs) is available on Zenodo: https://zenodo.org/records/10416893 (ref. 62). All other analyses were performed using publicly available software. Statistical analyses were performed using R (v4.1.0 and v4.2.1; https://cran.r-project.org/). MERLIN v1.1.2 was used to estimate IBD sharing (https://csg.sph.umich.edu/abecasis/Merlin/download/). KING v2.2.7 was used to identify sibling pairs (https://www.kingrelatedness.com/Download.shtml). GWAS of BMI was performed using BOLT-LMM v2.4.1 (https://alkesgroup.broadinstitute.org/BOLT-LMM/BOLT-LMM_manual.html). GCTA (gcta_1.93.1beta and v1.93.2beta) was used for genotype data quality control (including PC calculation, SNP loading calculation and PC projection for ancestry inference), SNP-heritability estimation, and COJO analysis (https://yanglab.westlake.edu.cn/software/gcta/index.html). Genotype data quality control, including filtering and LD pruning, as well as allelic scoring, was performed with PLINK v1.90b6.20 (https://www.cog-genomics.org/plink/).
DATA AVAILABILITY
Individual-level data used in this study are available through application to the relevant cohort. The individual-level UK Biobank data are available upon application to the UK Biobank (http://www.ukbiobank.ac.uk, accessed under project number 12505). Average identity-by-descent status across four groups of loci defined by quartiles of the recombination rate distribution will be returned to the UK Biobank for the 21,756 sibling pairs analysed in this study. These data will be accessible to researchers registered with the UK Biobank. The genetic map for linkage analyses of height and BMI was downloaded from https://github.com/joepickrell/1000-genomes-genetic-maps/tree/master/interpolated_OMNI. The genetic map used in simulations was obtained from BCFtools (https://samtools.github.io/bcftools/bcftools.html).
Summary statistics from the GWAS of BMI conducted in this study are available in Supplementary Data and in the GWAS Catalog (https://www.ebi.ac.uk/gwas/) under accession number GCST90446645.
REFERENCES
- 1. Polderman TJC et al. Meta-analysis of the heritability of human traits based on fifty years of twin studies. Nat Genet 47, 702–709 (2015).
- 2. Risch N & Merikangas K. The future of genetic studies of complex human diseases. Science 273, 1516–1517 (1996).
- 3. Lynch M & Walsh B. Genetics and Analysis of Quantitative Traits. (Sinauer Associates, Inc., Sunderland, MA, 1998).
- 4. Botstein D & Risch N. Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet 33, 228–237 (2003).
- 5. Hall JM et al. Linkage of early-onset familial breast cancer to chromosome 17q21. Science 250, 1684–1689 (1990).
- 6. Goate A et al. Segregation of a missense mutation in the amyloid precursor protein gene with familial Alzheimer’s disease. Nature 349, 704–706 (1991).
- 7. Risch NJ. Searching for genetic determinants in the new millennium. Nature 405, 847–856 (2000).
- 8. Weiss KM & Terwilliger JD. How many diseases does it take to map a gene with SNPs? Nat Genet 26, 151–157 (2000).
- 9. Visscher PM, Brown MA, McCarthy MI & Yang J. Five years of GWAS discovery. Am J Hum Genet 90, 7–24 (2012).
- 10. McClellan J & King MC. Genetic Heterogeneity in Human Disease. Cell 141, 210–217 (2010).
- 11. Klein RJ, Xu X, Mukherjee S, Willis J & Hayes J. Successes of Genome-wide Association Studies. Cell 142, 350–351 (2010).
- 12. Wang K, Bucan M, Grant SFA, Schellenberg G & Hakonarson H. Strategies for Genetic Studies of Complex Diseases. Cell 142, 351–353 (2010).
- 13. Boyle EA, Li YI & Pritchard JK. An Expanded View of Complex Traits: From Polygenic to Omnigenic. Cell 169, 1177–1186 (2017).
- 14. Visscher P et al. Assumption-Free Estimation of Heritability from Genome-Wide Identity-by-Descent Sharing between Full Siblings. PLoS Genet 2, e41 (2006).
- 15. Young AI et al. Relatedness disequilibrium regression estimates heritability without environmental bias. Nat Genet 50, 1304–1310 (2018).
- 16. Kong A et al. The nature of nurture: Effects of parental genotypes. Science 359, 424–428 (2018).
- 17. Howe LJ et al. Within-sibship genome-wide association analyses decrease bias in estimates of direct genetic effects. Nat Genet 54, 581 (2022).
- 18. Lee JJ et al. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nat Genet 50, 1112–1121 (2018).
- 19. Smith BH et al. Cohort profile: Generation Scotland: Scottish Family Health Study (GS:SFHS). The study, its participants and their potential for genetic research on health and illness. Int J Epidemiol (2013) doi: 10.1093/ije/dys084.
- 20. Scholtens S et al. Cohort Profile: LifeLines, a three-generation cohort study and biobank. Int J Epidemiol (2015) doi: 10.1093/ije/dyu229.
- 21. Sijtsma A et al. Cohort Profile Update: Lifelines, a three-generation cohort study and biobank. Int J Epidemiol 51, e295–e302 (2022).
- 22. Bycroft C et al. The UK Biobank resource with deep phenotyping and genomic data. Nature (2018) doi: 10.1038/s41586-018-0579-z.
- 23. Leitsalu L et al. Cohort profile: Estonian Biobank of the Estonian Genome Center, University of Tartu. Int J Epidemiol 44, 1137–1147 (2015).
- 24. Brumpton BM et al. The HUNT study: A population-based cohort for genetic research. Cell Genom 2, 100193 (2022).
- 25. Åsvold BO et al. Cohort Profile Update: The HUNT Study, Norway. Int J Epidemiol (2022) doi: 10.1093/ije/dyac095.
- 26. Kemper KE et al. Phenotypic covariance across the entire spectrum of relatedness for 86 billion pairs of individuals. Nat Commun 12, 1050 (2021).
- 27. Gazal S et al. Linkage disequilibrium-dependent architecture of human complex traits shows action of negative selection. Nat Genet 49, 1421–1427 (2017).
- 28. Schousboe K et al. Sex Differences in Heritability of BMI: A Comparative Study of Results from Twin Studies in Eight Countries. Twin Res 6, 409–421 (2003).
- 29. Silventoinen K et al. Heritability of Adult Body Height: A Comparative Study of Twin Cohorts in Eight Countries. Twin Res 6, 399–408 (2003).
- 30. Lander E & Kruglyak L. Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat Genet 11, 241–247 (1995).
- 31. Lander ES & Botstein D. Mapping mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121, 185–199 (1989).
- 32. Dekkers JCM & Dentine MR. Quantitative genetic variance associated with chromosomal markers in segregating populations. Theor Appl Genet 81, 212–220 (1991).
- 33. Yengo L et al. A saturated map of common genetic variants associated with human height. Nature (2022) doi: 10.1038/s41586-022-05275-y.
- 34. Yengo L et al. Meta-analysis of genome-wide association studies for height and body mass index in ∼700000 individuals of European ancestry. Hum Mol Genet 27, 3641–3649 (2018).
- 35. Visscher PM. Proportion of the Variation in Genetic Composition in Backcrossing Programs Explained by Genetic Markers. J Hered 87, 136–138 (1996).
- 36. Bulik-Sullivan B et al. An atlas of genetic correlations across human diseases and traits. Nat Genet (2015) doi: 10.1038/ng.3406.
- 37. Hodge SE. Linkage analysis versus association analysis: distinguishing between two models that explain disease-marker associations. Am J Hum Genet 53, 367–384 (1993).
- 38. Hemani G et al. Inference of the Genetic Architecture Underlying BMI and Height with the Use of 20,240 Sibling Pairs. Am J Hum Genet 93, 865–875 (2013).
- 39. Hivert V et al. Estimation of non-additive genetic variance in human complex traits from a large sample of unrelated individuals. Am J Hum Genet 108, 786–798 (2021).
- 40. Hill WG, Goddard ME & Visscher PM. Data and Theory Point to Mainly Additive Genetic Variance for Complex Traits. PLoS Genet 4, e1000008 (2008).
- 41. Yang J et al. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nat Genet 47, 1114–1120 (2015).
- 42. Wainschtein P et al. Assessing the contribution of rare variants to complex trait heritability from whole-genome sequence data. Nat Genet 54, 263–273 (2022).
- 43. Backman JD et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature 599, 628–634 (2021).
- 44. Akbari P et al. Sequencing of 640,000 exomes identifies GPR75 variants associated with protection from obesity. Science 373 (2021).
- 45. Tenesa A, Rawlik K, Navarro P & Canela-Xandri O. Genetic determination of height-mediated mate choice. Genome Biol 16, 269 (2016).
- 46. Yengo L et al. Imprint of assortative mating on the human genome. Nat Hum Behav 2, 948–954 (2018).
- 47. Robinson MR et al. Genetic evidence of assortative mating in humans. Nat Hum Behav 1, 16 (2017).
- 48. Visscher PM & Haley CS. Detection of putative quantitative trait loci in line crosses under infinitesimal genetic models. Theor Appl Genet 93, 691–702 (1996).
- 49. Sham PC, Cherny SS, Purcell S & Hewitt JK. Power of linkage versus association analysis of quantitative traits, by use of variance-components models, for sibship data. Am J Hum Genet 66, 1616–1630 (2000).
- 50. Visscher PM & Hopper JL. Power of regression and maximum likelihood methods to map QTL from sib-pair and DZ twin data. Ann Hum Genet (2001).
- 51. Manichaikul A et al. Robust relationship inference in genome-wide association studies. Bioinformatics (2010) doi: 10.1093/bioinformatics/btq559.
- 52. Abecasis GR, Cherny SS, Cookson WO & Cardon LR. Merlin — Rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet (2002) doi: 10.1038/ng786.
- 53. Locke AE et al. Genetic studies of body mass index yield new insights for obesity biology. Nature 518, 197–206 (2015).
- 54. Loh PR, Kichaev G, Gazal S, Schoech AP & Price AL. Mixed-model association for biobank-scale datasets. Nat Genet 50, 906–908 (2018) doi: 10.1038/s41588-018-0144-6.
- 55. Yang J, Lee SH, Goddard ME & Visscher PM. GCTA: A tool for genome-wide complex trait analysis. Am J Hum Genet 88, 76–82 (2011).
- 56. Yang J et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat Genet 44, 369 (2012).
- 57. Chang CC et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, 7 (2015).
- 58. Zaitlen N et al. Using Extended Genealogy to Estimate Components of Heritability for 23 Quantitative and Dichotomous Traits. PLoS Genet 9, e1003520 (2013).
- 59. Yengo L, Wray NR & Visscher PM. Extreme inbreeding in a European ancestry sample from the contemporary UK population. Nat Commun 10, 3719 (2019).
- 60. Delaneau O, Zagury J-F & Marchini J. Improved whole-chromosome phasing for disease and population genetic studies. Nat Methods 10, 5–6 (2013).
- 61. Wray NR et al. Pitfalls of predicting complex traits from SNPs. Nat Rev Genet 14, 507–515 (2013).
- 62. Yengo L. Genetic Architecture Reconciles Linkage and Association Studies of Complex Traits. Zenodo (v1) https://zenodo.org/records/10416893 (2023).
