Liability threshold modeling of case-control status and family history of disease increases association power

Margaux LA Hujoel; Steven Gazal; Po-Ru Loh; Nick Patterson; Alkes L Price

doi:10.1038/s41588-020-0613-6

. Author manuscript; available in PMC: 2020 Oct 20.

Published in final edited form as: Nat Genet. 2020 Apr 20;52(5):541–547. doi: 10.1038/s41588-020-0613-6

Liability threshold modeling of case-control status and family history of disease increases association power

Margaux LA Hujoel ^1,^*, Steven Gazal ^2,³, Po-Ru Loh ^3,⁴, Nick Patterson ³, Alkes L Price ^1,^2,^3,^*

PMCID: PMC7210076 NIHMSID: NIHMS1576071 PMID: 32313248

Abstract

Family history of disease can provide valuable information in case-control association studies, but it is currently unclear how to best combine case-control status and family history of disease. We developed an association method based on posterior mean genetic liabilities under a liability threshold model, conditional on case-control status and family history (LT-FH). Analyzing 12 diseases from the UK Biobank (average N=350K), we compared LT-FH to genome-wide association without using family history (GWAS) and a previous proxy-based method incorporating family history (GWAX). LT-FH was +63% (s.e. 6%) more powerful than GWAS and +36% (s.e. 4%) more powerful than the trait-specific maximum of GWAS and GWAX, based on the number of independent genome-wide significant loci across all diseases (e.g. 690 loci for LT-FH vs. 423 for GWAS); relative improvements were similar when applying BOLT-LMM to GWAS, GWAX, LT-FH phenotypes. Thus, LT-FH greatly increases association power when family history of disease is available.

Introduction

Family history of disease can provide valuable information about an individual’s genetic liability for disease, potentially increasing the power of case-control association studies¹, the focus of this study. More broadly, leveraging data from ungenotyped but phenotyped relatives has a rich history in genetic risk prediction² and analyses of quantitative traits in humans³ and livestock^4–6. Standard case-control genome-wide association studies (GWAS) ignore family history information (Fig 1a). A recent method, genome-wide association by proxy¹ (GWAX), compares disease cases and “proxy cases” (controls with family history of disease) to controls without family history of disease (Fig 1a). This approach has proven successful in studies of Alzheimer’s disease^7,8, and has provided a valuable contribution in highlighting the value of family history information, but more accurate modeling of family history — for example, distinguishing disease cases from proxy cases — may be beneficial¹.

Figure 1: — **(a)** GWAS uses binary case-control status, ignoring family history; GWAX uses binary proxy-case-control status, merging controls with family history of disease with disease cases; LT-FH uses continuous-valued posterior mean genetic liability, appropriately differentiating all case-control and family history configurations. **(b)** LT-FH computes posterior mean genetic liabilities (left panel) and then tests for association between genotype and posterior mean genetic liability (right panel).

We propose a new association method based on posterior mean genetic liabilities under a liability threshold model, conditional on both case-control status and family history (LT-FH); the liability threshold model, in which an individual is a disease case if and only if an underlying continuous-valued liability lies above a threshold, has proven valuable in a wide range of settings^2,9–13. LT-FH computes association statistics via linear regression of genotypes and posterior mean genetic liabilities; association statistics can also be computed using efficient mixed-model methods^14,15. LT-FH captures the fact that controls with family history have higher genetic liability than controls without family history (and likewise for disease cases) (Fig 1a). In contrast to GWAX, which assigns the same binary phenotype to disease cases and proxy cases, LT-FH accurately models a broad range of case-control status and family history configurations, greatly increasing power for diseases of both low and high prevalence in analyses of 12 diseases from the UK Biobank.

Results

Overview of methods

The LT-FH method relies on the liability threshold model⁹, and consists of two main steps (Fig 1b): (1) compute posterior mean genetic liabilities for each genotyped individual, conditional on available case-control status and/or family history information; (2) compute association statistics via linear regression of genotypes and posterior mean genetic liabilities. In step 1, we first compute posterior mean genetic liabilities for every possible configuration of case-control status and family history (377 configurations, accounting for parental history, sibling history, missing data, etc.) and then perform a lookup for each genotyped individual, avoiding redundant computation. The posterior mean genetic liability for each configuration is computed via Monte Carlo integration, incorporating estimates of disease prevalence and parental disease prevalence from the target samples and estimates of narrow-sense heritability (which differs from SNP-heritability¹⁶) from the literature; we show that Monte Carlo integration outperforms an analytical approach (the Pearson-Aitken(PA) formula^2,17,18) in analyses of sibling history data.

Step 2 generalizes the Armitage trend test¹⁹, and we show that it is equivalent to a score test; an alternative is to apply BOLT-LMM^14,15 (see Code availability) to posterior mean genetic liabilities, increasing power in large samples. We emphasize that we utilize posterior mean genetic liabilities. We note that raw LT-FH effect sizes are not on the liability scale, but can nonetheless be transformed to the observed scale. Details of the LT-FH method are provided in the Methods section; we have publicly released open-source software implementing the method (see Code availability). We consider a simple example of a disease with a prevalence of 5% and narrow-sense heritability of 50%, utilizing parental history only. The posterior mean genetic liability is −0.10 for a control with 0 affected parents vs. 0.39 for a control with 1 affected parent, but GWAS treats these individuals identically. In addition, the posterior mean genetic liability is 1.56 for a case with 2 affected parents vs. 0.39 for a control with 1 affected parent, but GWAX treats these individuals identically. On the other hand, LT-FH appropriately differentiates all case-control and family history configurations.

Simulations

We performed simulations by simulating genotypes at 100,000 unlinked SNPs and case-control status plus family history (parental history for both parents) for 100,000 unrelated target samples; we did not include sibling history in these simulations. We simulated genotypes for both parents, used these to simulate genotypes for target samples (offspring), and simulated case-control status for both parents and target samples using a liability threshold model; target samples were not ascertained for case-control status. Our default parameter settings involved 500 causal SNPs explaining h² = 50% of variance in liability, disease prevalence K = 5% (implying liability threshold T = 1.64 and observed-scale h² = 11%), and accurate specification of h² and K to the LT-FH method; other parameter settings were also explored. We compared three methods: GWAS, GWAX, and LT-FH. Further details of the simulation framework are provided in the Methods section. We note that simulations using real LD patterns are essential for methods impacted by LD between SNPs; however, the LT-FH method (like GWAS and GWAX) is not impacted by LD between SNPs, because no genotype data is used to compute posterior mean genetic liabilities, and thus LD between the focal SNPs and other SNPs cannot impact the power of the LT-FH method at a focal SNP (aside from the fixed loss of information due to incomplete tagging of causal SNPs). We further note that simulations with LD using a subset of individuals from UK Biobank would not be feasible, as simulations of family history require genotypes of both target samples and relatives (in order to simulate the case-control status of both target samples and relatives), but genotypes of relatives are not available for (nearly all) UK Biobank samples.

We assessed the calibration of each method using average χ² statistics at null SNPs (Figure 2a and Supplementary Table 1). All methods were correctly calibrated, with an average χ² of 1.00 for GWAS, GWAX and LT-FH. Accordingly, false-positive rates at various a levels (5 * 10⁻² to 5 * 10⁻⁶) matched the corresponding a level (Supplementary Table 2), as indicated by QQ plots for null SNPs (Extended Data Figure 1).

We assessed the association power of each method using both average χ² statistics at causal SNPs and formal power calculations (proportion of causal SNPs with p < 5 * 10⁻⁸) (Figure 2b,c and Supplementary Table 1). LT-FH was the most powerful method, with a +22% increase in average χ² and a +55% increase in power compared to GWAX, which outperformed GWAS at default parameter settings.

We performed 14 secondary analyses (see Supplementary Note and Supplementary Tables 1–13 and Extended Data Figure 1). Our main findings were that LT-FH outperformed both GWAX and GWAS, regardless of the underlying disease prevalence; that the second best method after LT-FH was GWAX for lower-prevalence diseases and GWAS for higher-prevalence diseases; that the impact of a misspecified value of disease prevalence or liability-scale heritability on the association power of LT-FH was negligible; that shared environment led to slightly smaller improvements for LT-FH compared to GWAS; and that family history reporting bias led to smaller improvements for LT-FH compared to GWAS but no increase in type I error. We caution that the generative model in all of our simulations was the same as the liability threshold model that we used for inference, and thus these simulations should be viewed as a best-case scenario for LT-FH.

Analysis of 12 diseases in the UK Biobank

We analyzed 12 diseases in the UK Biobank²⁰ with genotype and case-control and/or family history data available for up to 381,493 unrelated individuals of European ancestry (20 million imputed SNPs with MAF >0.1%) and family history of disease available for most individuals (Methods, Table 1, Supplementary Table 14,Supplementary Table 15). Disease prevalence ranged from 0.001 for Alzheimer’s disease to 0.32 for hypertension; for lower-prevalence diseases, we applied stricter MAF thresholds to avoid type I error in unbalanced case-control settings²¹ (Supplementary Table 16, Supplementary Table 17). Family history information was available for each disease, including parental history (presence or absence of disease in each respective parent) and sibling history (number of brothers and sisters + presence or absence of disease in the set of all siblings). Averaged across the 12 diseases, 88% of target samples had complete parental history information and 93% of target samples had complete sibling history information (Supplementary Table 18); the accuracy of family history information is assessed below. We defined GWAS, GWAX and LT-FH phenotypes accordingly (Methods, Supplementary Table 19; Extended Data Figure 2) and compared these three methods. We included 20 principal components (PCs), assessment center, genotype array, sex, age and age squared as covariates in association analyses for each method. In this data set, the computational cost of computing LT-FH phenotypes (posterior mean genetic liabilities) was 1–2 orders of magnitude lower than the cost of computing association statistics (Supplementary Table 20).

Table 1: Overview of 12 diseases in the UK Biobank.

We report the prevalence, parental prevalence, liability-scale SNP-heritability, and number of samples analyzed by GWAS, GWAX and LT-FH. GWAX sample sizes are lower than GWAS and LT-FH sample sizes due to incomplete family history data.

Trait	UKBB Prevalence	UKBB Parental Prevalence	${\hat{h}}_{g}^{2}$	s.e. ( ${\hat{h}}_{g}^{2}$ )	N_GWAS	N_GWAX	N_LT-FH
AD	0.001	0.065	0.12	0.11	381493	324512	381493
PD	0.003	0.020	0.15	0.05	381493	318792	381493
Lung cancer	0.006	0.064	0.10	0.03	381493	323838	381493
Bowel cancer	0.013	0.054	0.14	0.018	381493	322887	381493
Stroke	0.023	0.14	0.08	0.011	381493	332468	381493
COPD	0.035	0.082	0.17	0.0092	381493	329495	381493
Prostate cancer^*	0.037	0.075	0.30	0.019	175450	331458	368940
T2D	0.042	0.092	0.36	0.0093	380180	330267	381390
Breast cancer^*	0.061	0.082	0.20	0.012	206043	348170	371064
Depression	0.073	0.052	0.11	0.0058	381493	328186	381493
CAD	0.083	0.25	0.20	0.0058	381493	341810	381493
HTN	0.32	0.26	0.31	0.0037	381493	354208	381493

Open in a new tab

denotes sex-specific diseases (for which GWAX and LT-FH incorporate family history data from samples of both sexes). Estimates of (liability-scale) narrow-sense heritability (h²) from the literature, which are used by the LT-FH method, are reported in Supplementary Table 15. Diseases are listed in order of disease prevalence. AD: Alzheimer’s disease/dementia. PD: Parkinson’s disease. COPD: Chronic bronchitis/emphysema. T2D: Type 2 diabetes. CAD: Coronary artery disease. HTN: Hypertension.

We assessed the calibration of GWAS, GWAX and LT-FH using stratified LD score regression (S-LDSC) attenuation ratio^15,22–24, defined as (S-LDSC intercept −1)/(mean χ² − 1) (Methods and Supplementary Table 21a). Attenuation ratios were similar for each method, with averages across 12 diseases of 0.18, 0.16 and 0.15, respectively (and inverse variance-weighted averages close to 0.1, consistent with ref.¹⁵); we note that attenuation ratios slightly above 0 may be caused by attenuation bias¹⁵. Although there can be no guarantee that analyses of UK Biobank data using PC covariates are completely devoid of confounding²⁵, these results confirm that that LT-FH is well-calibrated and there is no confounding that is specific to the LT-FH method.

We assessed the association power of each method by computing the number of independent genome-wide significant loci identified for each disease, as defined similar to ref.¹⁵ (Methods, Figure 3a and Supplementary Table 21b); we computed standard errors (s.e.) on relative improvements via block-jackknife (Methods). Across all diseases, LT-FH was +63% (s.e. 6%) more powerful than GWAS and +36% (s.e. 4%) more powerful than the trait-specific maximum of GWAS and GWAX (e.g. 690 independent loci for LT-FH vs. 423 for GWAS). For the 8 lower-prevalence diseases (K < 5%), LT-FH was +39% (s.e. 7%) more powerful than the trait-specific maximum of GWAS and GWAX, and GWAX was more powerful than GWAS, consistent with simulations at lower prevalence. For the 4 higher-prevalence diseases (K > 5%), LT-FH was +35% (s.e. 4%) more powerful than the trait-specific maximum of GWAS and GWAX, and GWAS was more powerful than GWAX, consistent with simulations at higher prevalence; GWAS vs. GWAX results were dominated by hypertension, the disease with highest prevalence. We also evaluated the association power of each method using relative effective sample size¹⁵; the average relative effective sample size for was 1.31 for LT-FH vs. GWAS and 1.27 for LT-FH vs. the trait-specific maximum of GWAS and GWAX (Supplementary Table 22). Finally, we assessed the association power of each method across all 12 diseases by computing the average χ² across all genome-wide significant SNPs identified by any method (Supplementary Table 21c) and average χ² across all SNPs (Supplementary Table 21d); we determined that these quantities were much larger for LT-FH than for the other methods.

Figure 3: — We report results of GWAS, GWAX and LT-FH using either **(a)** linear regression or **(b)** BOLT-LMM on unrelated European individuals. Numerical results are reported in Supplementary Table 21 and Supplementary Table 35.

We assessed whether GWAS, GWAX and LT-FH phenotypes reflect the same underlying genetic architectures by estimating pairwise genetic correlations between these phenotypes using BOLT-REML²⁶ (Table 2). Genetic correlations were high, much higher than the corresponding phenotypic correlations (e.g. 0.97 and 0.70 respectively for correlations between LT-FH and GWAS, averaged across 12 diseases), indicating that incorporating family history preserves the same underlying genetic architecture but provides substantial independent information. Phenotypic correlations for LT-FH (or GWAX) vs. GWAS were particularly low for low-prevalence diseases, for which family history is more informative than case-control status. We also computed correlations between −log₁₀ association p-values, which mirrored the phenotypic correlations (Supplementary Table 23). Notably, LT-FH phenotypes had higher values of sample size times observed-scale SNP-heritability (a measure of total genetic signal²³) than GWAS phenotypes for all 12 diseases (59% higher on average; Supplementary Table 24). These findings support the use of the LT-FH method to increase association power.

Table 2: Genetic and phenotypic correlations between GWAS, GWAX and LT-FH phenotypes.

We report the genetic correlation (estimated using BOLT-REML) and phenotypic correlation between each pair of GWAS, GWAX and LT-FH phenotypes. BOLT-REML estimates are constrained to be ≤ 1.00. Standard errors of genetic correlation estimates are reported in parentheses. Standard errors of phenotypic correlation estimates are ≤ 0.003 in each case.

Trait	Genetic Correlation (r_g)						Phenotypic Correlation
	GWAS/GWAX		GWAS/LT-FH		GWAX/LT-FH		GWAS/GWAX	GWAS/LT-FH	GWAX/LT-FH
	r_g	s.e.	r_g	s.e.	r_g	s.e.	GWAS/GWAX	GWAS/LT-FH	GWAX/LT-FH
AD	1.00	0.38	1.00	0.34	1.00	0.002	0.09	0.30	0.96
PD	1.00	0.16	1.00	0.10	1.00	0.009	0.26	0.57	0.93
Lung cancer	1.00	0.14	1.00	0.12	1.00	0.004	0.20	0.49	0.91
Bowel cancer	0.95	0.06	0.97	0.03	0.99	0.006	0.30	0.64	0.91
Stroke	0.83	0.07	0.88	0.04	0.99	0.008	0.24	0.62	0.86
COPD	0.90	0.02	0.95	0.01	0.99	0.003	0.40	0.77	0.87
Prostate cancer	1.00	0.03	1.00	0.006	1.00	0.005	0.56	0.88	0.89
T2D	0.97	0.01	0.99	0.004	0.99	0.002	0.38	0.77	0.86
Breast cancer	1.00	0.02	1.00	0.006	1.00	0.004	0.58	0.87	0.89
Depression	0.88	0.02	0.93	0.007	0.98	0.003	0.58	0.84	0.89
CAD	0.89	0.02	0.96	0.005	0.98	0.005	0.31	0.76	0.78
HTN	0.95	0.006	0.98	0.001	0.98	0.003	0.52	0.86	0.75
Average	0.95		0.97		0.99		0.37	0.70	0.88

Open in a new tab

We performed a replication analysis using independent data to assess whether the novel associations identified by LT-FH are genuine. We analyzed 4 diseases (CAD, T2D, breast cancer and prostate cancer) with publicly available summary statistics for case-control data independent from UK Biobank (see Methods). For both genome-wide significant loci identified by GWAS and genome-wide significant loci identified by LT-FH, we computed the replication slope (the slope of a regression of standardized effect sizes in case-control replication data vs. UK Biobank discovery data²⁷); association statistics for case-control replication data were always computed using GWAS. The replication slopes were similar for GWAS (slope=0.81, se=0.02, 124 loci) and LT-FH (slope=0.79, se=0.01, 243 loci) (Figure 4 and Supplementary Table 25). In addition, the replication slopes were similar for loci detected only by GWAS (slope=0.67, se=0.06, 7 loci) and loci detected only by LT-FH (slope=0.69, se=0.03, 126 loci); as expected, these slopes were lower than 0.81 and 0.79 since loci detected by only one method represent weaker effects that are more susceptible to winner’s curse. These results indicate that the novel loci identified by LT-FH are genuine associations, as assessed by the GWAS method in replication data.

Figure 4: — We plot standardized effect sizes ( $Z / \sqrt{N_{eff}}$ ) in the non-UK Biobank replication data (average N_eff = 99K for GWAS) vs. the UK Biobank discovery data (average N_eff = 62K for GWAS, 102K for LT-FH), aggregated across 4 diseases (CAD, T2D, breast cancer and prostate cancer), for **(a)** the 124 loci identified by GWAS, **(b)** the 243 loci identified by LT-FH, **(c)** the 7 loci identified by GWAS but not LT-FH, and **(d)** the 126 loci identified by LT-FH but not GWAS. Numerical results are reported in Supplementary Table 25.

We performed six secondary analyses (see Supplementary Note, Supplementary Tables 27–34, Extended Data Figure 3). Our main findings were that using only parental history was slightly less powerful than using parental and sibling history; that downweighting family history information based on its accuracy had little impact on association results; that explicitly accounting for age when computing posterior mean genetic liabilities had little impact on association results; and that our (Monte-Carlo) LT-FH method attained higher power than the (analytical) PA formula when using parental and sibling history. In summary, LT-FH was well-calibrated and more powerful than all other methods in all primary and secondary analyses.

BOLT-LMM increases GWAS, GWAX and LT-FH association power

BOLT-LMM, a method for efficient Bayesian mixed-model analysis, has been shown to increase association power (for GWAS phenotypes) as compared to linear regression^14,15. We thus investigated the application of BOLT-LMM to GWAS, GWAX and LT-FH phenotypes, analyzing the same data for 12 UK Biobank diseases as above. We first confirmed that GWAS, GWAX and LT-FH remained well-calibrated (Supplementary Table 35a). We then assessed association power (Figure 3b and Supplementary Table 35b). Across all diseases, LT-FH was +67% (s.e. 6%) more powerful than GWAS and +39% (s.e. 4%) more powerful than the trait-specific maximum of GWAS and GWAX (e.g. 736 independent loci for LT-FH vs. 442 for GWAS), implying similar relative power among the three methods when using BOLT-LMM; the absolute increases in power attained by using BOLT-LMM were modest (Figure 3b vs. Figure 3a) and smaller than in analyses of highly heritable UK Biobank traits¹⁵, because disease traits have lower observed-scale SNP-heritability (particularly for lower-prevalence diseases) and the increase in power attained by BOLT-LMM scales with SNP-heritability¹⁵. Analyses of average χ² across all genome-wide significant SNPs identified by any method (Supplementary Table 35c) and average χ² across all SNPs (Supplementary Table 35d) yielded similar conclusions.

Our above analyses using both linear regression and BOLT-LMM were restricted to unrelated target samples (up to 381,493 unrelated individuals of European ancestry), consistent with previous studies leveraging family history using GWAX^1,7,8. We investigated the consequences of including related individuals in BOLT-LMM analyses. We first applied BOLT-LMM to GWAS, GWAX and LT-FH phenotypes of target samples defined without applying any filter for relatedness (up to 459,256 related individuals of European ancestry). We determined that GWAX and LT-FH suffered poor calibration (average S-LDSC attenuation ratios equal to 0.26 for GWAX and 0.22 for LT-FH, vs. 0.14 for GWAS) (Supplementary Table 36a), rendering moot the increases in power due to larger sample size (Supplementary Table 36b). A likely explanation for the poor calibration is the extreme concordance between sibling pair phenotypes (close to 100% for GWAX) arising from nearly identical family histories. Our results indicate that BOLT-LMM is unable to correct for inflation in test statistics in such extreme cases, as it does not have the flexibility to model excess phenotypic concordance for sibling pairs specifically. Thus, BOLT-LMM analysis of GWAX and LT-FH phenotypes should not be applied to target samples without any filter for relatedness.

We thus applied BOLT-LMM to LT-FH phenotypes for all target samples (up to 459,256 related individuals of European ancestry) that were modified to incorporate only case-control status (ignoring family history information) for each individual from a sibling pair or parent-offspring pair within the set of target samples (also see secondary analysis below). We determined that this approach was well-calibrated, with an average S-LDSC attenuation ratio of 0.14 for LT-FH, vs. 0.14 for GWAS (and inverse variance-weighted average close to 0.1, consistent with ref.¹⁵) (Supplementary Table 37a). We compared the power of this approach to applying BOLT-LMM to GWAS for all related Europeans (see above) and GWAX for all unrelated Europeans (as previously recommended^1,7,8; see above). Across all diseases, LT-FH was +54% (s.e. 5%) more powerful than GWAS and +44% (s.e. 4%) more powerful than the trait-specific maximum of GWAS and GWAX (e.g. 908 independent loci for LT-FH vs. 590 for GWAS) (Extended Data Figure 4 and Supplementary Table 37b), similar to our previous analyses (Figure 3). Thus, BOLT-LMM analysis of LT-FH phenotypes defined using only case-control status for all sibling pairs and parent-offspring pairs within the set of target samples is the recommended approach; we have publicly released LT-FH association statistics computed using this approach (see Data availability).

We performed a secondary analysis in which we instead applied BOLT-LMM to LT-FH phenotypes for all target samples (up to 459,256 related individuals of European ancestry) that were modified to incorporate family history information for exactly one sibling for each set of siblings within the set of target samples (with no filter on family history information for parent-offspring pairs). We note that this approach may suffer from residual concordance between sibling pair phenotypes (and parent-offspring phenotypes) arising from redundancy between the family history of one target sample and the case-control status of the other target sample. Indeed, we determined that LT-FH phenotypes defined in this way were slightly but significantly miscalibrated (average S-LDSC attenuation ratio equal to 0.15 for LT-FH, vs. 0.14 for the recommended LT-FH approach; p < 10⁻⁶ for difference in inverse-variance weighted means) (Supplementary Table 38a), rendering moot the slightly higher power of this approach (Supplementary Table 38b).

In summary, BOLT-LMM analysis increases the power of LT-FH (and GWAS and GWAX), but care is required to avoid poor calibration of LT-FH (and GWAX) due to extreme concordance in sibling pair phenotypes. Our recommendation is for BOLT-LMM analysis of LT-FH phenotypes defined using only case-control status for all sibling pairs and parent-offspring pairs within the set of target samples.

Discussion

We have introduced a new method, LT-FH, that accurately models a broad range of case-control status and family history configurations to increase association power compared to prevailing methods¹; in particular, LT-FH differs from other methods, including methods developed in the livestock genetics community^4–6, by allowing for complex modes of reporting family history (e.g. binary response for sibling history, i.e. at least one sibling has the disease), instead of requiring phenotypes for each relative. Across 12 diseases from the UK Biobank, LT-FH was +63% (s.e. 6%) more powerful than GWAS and +36% (s.e. 4%) more powerful than the trait-specific maximum of GWAS and GWAX, while maintaining correct calibration. LT-FH attained similar increases in power when using BOLT-LMM to compute association statistics for all three methods, and when incorporating related individuals. These findings provide a strong motivation to apply LT-FH to genetic data sets for which family history information is available, and to collect family history information in future genetic studies.

Although LT-FH greatly increases association power, it has several limitations. First, self-reported family history information may be inaccurate. Indeed, we determined that the sibling concordance of self-reported family history was substantially lower than 100% (Supplementary Table 28); however, our efforts to account for this had little impact on association power (Supplementary Table 29). Second, family history may reflect a different underlying genetic architecture than case-control status, e.g. due to differences in the etiology of early-onset vs. late-onset disease or differences in diagnostic criteria over time; however, we observed high genetic correlations between GWAS and LT-FH phenotypes (average of 0.97 across diseases; Table 2). Third, real diseases may not adhere to the liability threshold model used by LT-FH, which may explain why increases in power in analyses of real diseases (Figure 3) were not as large as in simulations (Figure 2c). Moreover, posterior mean genetic liabilities inferred from family history information may be biased by unmodeled effects. This might reduce the power of the LT-FH method, but would not cause its test for additive association with genetic variants to produce false positives (Supplementary Table 7 and Supplementary Table 8) Fourth, just like standard GWAS analysis, LT-FH may require MAF thresholds stricter than our default value of 0.001 to avoid type I error in unbalanced case-control settings²¹. Although the number of diseases requiring stricter MAF thresholds was reduced from 8 for standard GWAS analysis to 2 for LT-FH (due to lower kurtosis; Supplementary Table 16), more sophisticated approaches (generalizing ref.²¹) could be explored. Fifth, just like GWAX, LT-FH should not be applied to target samples without any filter for relatedness, because extreme concordance in sibling pair phenotypes leads to miscalibration — even when applying BOLT-LMM, an efficient mixed-model method that is intended to account for phenotypic concordance among related individuals. Specifically, GWAS analysis of related target samples can induce miscalibration because target samples have correlated genotypes and correlated phenotypes. Mixed model methods, such as BOLT-LMM, are designed to correct this miscalibration. However, if related target samples have extremely concordant phenotypes due to inclusion of family history information, BOLT-LMM may fail to correct this miscalibration (Supplementary Table 36). Thus, scenarios of extreme concordance between phenotypes of related target samples should be avoided. We instead recommend applying BOLT-LMM to LT-FH phenotypes defined using only case-control status for all sibling pairs and parent-offspring pairs within the set of target samples; this approach is straightforward to implement and still greatly increases association power while maintaining correct calibration. Sixth, LT-FH does not currently incorporate family history information for genetically correlated diseases, which could be highly informative; however, LT-FH association results for two or more genetically correlated diseases could be jointly analyzed using existing methods²⁸. Seventh, the novel genome-wide significant associations detected by LT-FH explain less variance on average, such that the increase in variance explained by LT-FH compared to other methods is lower than the increase in number of genome-wide significant loci. However, small genetic effects can still produce important biological insights^29,30. Eighth, posterior mean genetic liabilities can be computed using an analytical approach, the PA formula^2,17,18, instead of LT-FH. However, the PA formula is not well-suited to data that includes sibling history as a binary response (i.e. at least one sibling has the disease). Indeed, we confirmed that LT-FH attains higher power than the PA formula in simulations with sibling history (Supplementary Table 11) and analyses of UK Biobank diseases (Supplementary Table 34). Finally, we have not explored the use of family history to improve the accuracy of genetic risk prediction², which remains as a direction for future research. Despite all these limitations, we anticipate that LT-FH will attain large increases in power in future association studies.

Methods

LT-FH method: estimating posterior mean genetic liability

The liability threshold model can be written as ϵ = Xβ + ϵ_e, where ϵ is the liability, X is the genotypes at candidate SNPs (normalized to mean 0, variance 1), β are the effect sizes of SNPs on the liability scale, and an individual is a case (z = 1) if and only if ϵ ≥ T and is a control otherwise (z = 0). T determines the disease prevalence (Φ (T) = P (x ≥ T) where x ~ N(0, 1)). We further assume that ϵ_g = Xβ ~ N(0, h²), and assume that when testing a SNP g for association, the effect size is small enough such that the liability is βg + ϵ = βg + ϵ_g + ϵ_e where ϵ ~ N(0, 1), ϵ_g ~ N(0, h²), ϵ_e ~ N(0, 1−h²).

We assume a multivariate normal distribution for ϵ: for two individuals Cov(ϵ₁, ϵ₂) = K₁₂h², where K₁₂ is the coefficient of relationship for the pair of individuals (e.g. 1/2 for parent-offspring). We assume that we have the genotype and case-status of an individual, and potentially the disease status of family members. For example, to incorporate parental history and sibling history we assume

(\begin{matrix} ϵ_{o, e} \\ ϵ_{o, g} \\ ϵ_{p 1} \\ ϵ_{p 2} \\ ϵ_{s} \end{matrix}) ~ {MVN}_{5} ((\begin{matrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{matrix}), (\begin{matrix} 1 - h^{2} & 0 & 0 & 0 & 0 \\ 0 & h^{2} & 0.5 h^{2} & 0.5 h^{2} & 0.5 h^{2} \\ 0 & 0.5 h^{2} & 1 & 0 & 0.5 h^{2} \\ 0 & 0.5 h^{2} & 0 & 1 & 0.5 h^{2} \\ 0 & 0.5 h^{2} & 0.5 h^{2} & 0.5 h^{2} & 1 \end{matrix}))

(1)

where ϵ_{o, e}, ϵ_{o, g} are the environmental and genetic components of the liability of the target sample (offspring), ϵ_p1 and ϵ_p2 are the liabilities of the parents, and ϵ_s is the liability of the sibling(s) (for simplicity we include only one sibling here, but this can be extended to an arbitrary number of siblings). We estimate the posterior mean genetic liability $E [ϵ_{o, g} | z_{o}, z_{p 1}, z_{p 2}, \vec{z_{s}}]$ for each individual given the case-control status of the genotyped individual (z_o), both parents (z_p1, z_p2), and sibling(s) ( $\vec{z_{s}}$ ). When disease status is unknown for subsets of these individuals, we condition only on known information. We are interested in genetic signal and therefore we estimate the posterior mean genetic liability, rather than the posterior mean liability; genetic association power is increased by integrating out the noise contributed by the environmental component of liability.

We estimate E[ϵ_{o, g} |z_o, z_p1, z_p2, z_s|] using Monte Carlo integration. Briefly, (1) we sample from a multivariate normal with the given covariance structure and compute the posterior mean genetic liability of the offspring in all possible configurations of case-control status and family history (using liabilities that fall above or below a given threshold to determine case-control status and family history). If the posterior mean genetic liabilities in any configuration has a standard error of the mean above 0.01, we resample from a conditional multivariate normal to ensure that we sample in a manner that represents the tails of the distribution appropriately. (2) We sample relevant liabilities conditional on the offspring’s case-control status (z_o) or we sample conditional on the parents (i.e. z_p1 + z_p2 = {0, 1, 2}). (3) We then combine all samples of ϵ_{o, g} from given configurations. (4) We compute posterior mean genetic liabilities and assess the standard error of each posterior mean. We then repeat (2)-(4) until the standard error of the mean is less than 0.01 for all configurations (further details in Supplementary Note and Supplementary Table 39). We note that family history reporting bias may reduce the power of LT-FH, but would not lead to false positives (Supplementary Table 8). In addition, distinct from family history, non-random missingness of target sample phenotypes would equally impact GWAS and LT-FH. Specifically, LT-FH is identical to GWAS in the absence of family history information, because in this setting both methods exclude individuals with missing case-control status.

We considered the Pearson-Aitken (PA) formula as an approximate analytical approach and implemented this approximation for LT-FH_no-sib^2,17,18 (Supplementary Note and Supplementary Table 32). However, in UK Biobank data, sibling history is provided as a binary response, i.e., at least one sibling has the disease. This “at least one” condition is complex to incorporate using analytical approaches but straightforward to incorporate using Monte Carlo integration. For a given number of siblings (S), we sample S liabilities (ϵ_s1, ϵ_s2,…, ϵ_sS) and can easily assess the presence or absence of disease in the set of all siblings by comparing each sibling’s liability to the relevant threshold; the absence of disease corresponds to ϵ_si < T for all i ∈ {1, 2,…, S} whereas the presence of disease corresponds to at least one i with ϵ_si ≥ T. Thus, we estimate posterior mean genetic liabilities using Monte Carlo integration in our main analyses and open-source software.

LT-FH method: LT-FH association statistic

The LT-FH association statistic is a measure of association between genotype g and posterior mean genetic liability across samples. We can compute this statistic either using linear regression or using other methods such as BOLT-LMM^14,15, while incorporating additional covariates such as principal components. If we treat the posterior mean genetic liability as a continuous variable and compute the number of samples times the squared correlation between g and posterior mean genetic liability (generalizing the Armitage trend test¹⁹), this is equivalent to the Score test (see Supplementary Note).

LT-FH effect sizes

Although LT-FH estimates the posterior mean genetic liability, the raw effect sizes computed when using LT-FH are not on the liability scale. However, the raw effect sizes computed when using linear regression can be transformed to the observed scale by making use of the effective sample size¹⁵. In detail, to obtain per-allele observed-scale effect sizes for a non-standardized phenotype (as is computed by BOLT-LMM software) we compute:

{\hat{β}}_{LT-FH,obs} = \frac{{\hat{β}}_{LT-FH}}{se ({\hat{β}}_{LT-FH}) \sqrt{{c*N}_{GWAS}}} * \sqrt{\frac{K (1 - K)}{2 * MAF * (1 - MAF)},}

(2)

where ${\hat{β}}_{LT-FH}$ is raw per-allele effect size, c is the relative effective sample size for LT-FH vs GWAS, K is the disease prevalence, and MAF is the minor allele frequency. We note that it is straightforward to use minor allele frequency (MAF) to convert between per-allele and per-standardized-genotype effect sizes, distinct from converting between liability and observed scales. Further details are provided in the Supplementary Note.

GWAS and GWAX methods

GWAS aims to discover associations between variants and disease by comparing cases to controls of a disease. GWAX compares cases and proxy cases (controls with a family history of the disease) to controls with no family history of disease. GWAX can be particularly valuable when the case-control status of the genotyped individual is unknown, however when the case-control status of the genotyped individual is known incorporating this information increases power¹, and as such we incorporate the case-control status of the genotyped individual (if known) when computing GWAX association statistics. For both GWAS and GWAX, association statistics can be computed using methods such as linear regression, logistic regression, BOLT-LMM¹⁵, or SAIGE²¹.

For GWAX-2df test we perform an F-test, regressing genotype on case status and control with family history status (two binary variables) and testing whether the coefficients associated with these two indicator variables are both 0 (see Supplementary Note). When no covariates are included, results are virtually identical to a Pearson-χ² test on a 2×3 table.

Simulations

We simulate genotypes at 100,000 unlinked SNPs and case-control status plus family history (parental history for both parents) for 100,000 unrelated target samples; we do not include sibling history in these simulations. We simulate genotypes for both parents and use these to simulate genotypes for target samples (offspring); the underlying MAF is ~ U(0.01, 0.5). Using these genotypes we simulate case-control status for both parents and target samples using a liability threshold model in which we normalize genotype to a mean of 0 and variance of 1 according to the true MAF, set the effect sizes for the C causal SNPs on the liability scale to be $\sqrt{h^{2} / C}$ , and add environmental noise with a variance of 1 − h². Target samples are not ascertained for case-control status. Our default parameter settings involve 500 causal SNPs explaining h² = 50% of variance in liability, disease prevalence K = 5% (implying liability threshold T = 1.64 and observed-scale h² = 11%), and accurate specification of h² and K to the LT-FH method. For each parameter setting we compute 10 simulation replicates in which there is a different underlying MAF distribution for each simulation replicate. Further details are provided in the Supplementary Note.

UK Biobank data sets

We analyzed genetic data from the UK Biobank²⁰ consisting of 381,493 unrelated individuals or 459,256 related individuals of European ancestry (based on self-reported white ethnicity; sample sizes depended on the trait analyzed; Table 1; Supplementary Table 14) and ~20 million imputed variants with MAF >0.1%¹⁵. We identified pairs of siblings (full siblings and MZ twins) as outlined previously^20,31. We note that it is easy to distinguish sibling pairs from parent-child pairs using identity-by-descent (IBD). We identified parent-offspring pairs as individuals having a kinship in (0.177, 0.354) and having proportion of SNPs with identity-by-state equal to 0 (IBS0) < 0.0012. We computed association statistics applying either linear regression or default BOLT-LMM analysis (both implemented in BOLT-LMM software). We note that for this default analysis, BOLT-LMM association statistics reduce to infinitesimal mixed model association statistics (BOLT-LMM-inf) when the non-infinitesimal BOLT-LMM association statistics are not expected to increase power^14,15. We controlled for assessment center, genotype array, sex, age, age squared, and the first 20 principal components¹⁵.

We considered 12 diseases in UK Biobank for which family history of disease was available (Table 1); these 12 diseases were considered when GWAX was proposed¹. These 12 diseases had a wide range of narrow-sense heritabilities (Supplementary Table 15). We assigned case-control status for all genotyped individuals based on the presence or absence of ICD-9 codes, ICD-10 codes, or other relevant codes (Supplementary Table 14). Individuals with no ICD-9 or ICD-10 codes were selected assigned as controls. An alternative would be to assign missing phenotypes to individuals with no ICD9/10 codes, which reduces sample size while increasing disease prevalence (Supplementary Table 40). We decided against this possibility, because we determined that it slightly but consistently reduces sample size times observed-scale SNP-heritability (a measure of total genetic signal²³) and has very little impact on GWAS power (Supplementary Table 41)

Parental history and sibling history information was available for all 12 diseases. Parental history information consisted of presence or absence of disease in each respective parent; we note that the parental prevalence was often greater than the disease prevalence (Table 1). Sibling history information consisted of number of brothers and sisters + presence or absence of disease in the set of all siblings. Family history information is self-reported and thus there is a risk of misreporting, recall bias, or phenotype misclassification. For example, participants were asked about their family history of “diabetes” but we assume that this family history corresponds to that specifically of type II diabetes, consistent with previous studies¹. To assess the accuracy of family history information we compute the correlation of self-reported family history between siblings. When computing the correlation of self-reported sibling history we restricted to concordant sibling pairs (e.g. both cases or both controls) and for sex-specific diseases (breast and prostate cancer) we restricted to concordant sibling pairs of the relevant sex, sibling pairs of the non-relevant sex, and sibling pairs of discordant sex where the relevant sex is a control. We observed a moderately high concordance of self-reported family history among siblings (Supplementary Table 28).

LT-FH in UK Biobank

There are 377 possible configurations of case-control status and family history of disease. For example, there are 2 configurations when case-control status is available but no family history is available (case or control) and 4 configurations when case-control status and one parent’s disease status is available (case with affected parent, case with unaffected parent, control with affected parent, control with unaffected parent). In detail, there are 252 configurations when case-control status is available: 2 configurations of case-control status with no family history (see above); 4 of case-control status and one parent’s disease status (see above); 6 of case-control status and both parents’ disease status; 40 of case-control status and sibling’s disease status (1–10 siblings; at least one or none affected); 80 of case-control status, one parent’s disease status, and sibling’s disease status; 120 of case-control status, both parents’ disease status, and sibling’s disease status. There are 125 configurations when case-control status is unavailable: 2 of one parent’s disease status; 20 of sibling’s disease status; 3 of both parents’ disease status; 40 of one parent’s disease status and sibling’s disease status; 60 of both parents’ disease status and sibling’s disease status. A comprehensive list of configurations is provided in Supplementary Table 42.

We condition on known information to compute the posterior mean genetic liability. We computed E[ϵ_{o, g}| ·] (where · is all known disease status information) through Monte Carlo integration as described above. We allow for genotyped individuals and their siblings to have a different disease prevalence than their parents. We select the prevalence for genotyped individuals and their siblings using the prevalence of the disease in genotyped (unrelated European) individuals, and the prevalence for parents using the parental disease prevalence (for parents of unrelated European individuals). h² values were obtained from the published literature (Supplementary Table 15). We used h² estimates from twin studies, but we note the existence of h² estimates based on family history³².

For individuals who report having 0 siblings, more than 10 siblings, or who do not know or prefer not to answer how many brothers and/or sisters they have, we ignore sibling history and use parental history only for LT-FH; for all other individuals we incorporate both the number of siblings as well as whether or not at least one is affected in the computation of E[ϵ_{o, g}| ·]. For sex-specific diseases our exclusion criteria remains consistent however we now incorporate the number of siblings of the relevant sex as well as whether or not at least one is affected in the computation of E[ϵ_{o, g}| ·]. Further details are provided in the Supplementary Note.

GWAS and GWAX in UK Biobank

We implemented GWAS considering the case-control status of the genotyped individual only (Supplementary Table 19; Table 1). Due to the increased risk of a type I error in unbalanced case-control settings, for some diseases we selected a stringent MAF threshold when computing the number of independent loci^15,21. However, one could also use a method such as SAIGE that controls type I error in unbalanced case-control settings²¹.

We considered the case-control status of the genotyped individual and the disease history of parents and siblings when implementing GWAX (Supplementary Table 19). The GWAX prevalence (i.e. prevalence of proxy cases) was generally more than double the parental prevalence and many times larger than the disease prevalence (see Supplementary Table 43). For individuals with at least 1 sibling or who do not know or prefer not to answer how many brothers and/or sisters they have we assign using sibling and parental disease history; for individuals with 0 reported siblings we assign using parental disease history only (Supplementary Table 19). For sex-specific diseases (breast and prostate cancer) the conditions are replaced with siblings of the relevant sex.

Assessing calibration in UK Biobank

We assessed the calibration of each method using stratified LD score regression (S-LDSC) attenuation ratio^15,22–24. We report two estimates for the mean attenuation ratio (and mean attenuation ratio difference) across all traits. Let x_i denote the attenuation ratio (or attenuation ratio difference) for trait i and σ_i the reported jackknife standard error (for the ratio or the difference of ratios). We compute a simple average across traits, with standard error computed as $\frac{1}{12} \sqrt{\sum_{i} σ_{i}^{2}}$ . We also compute an inverse-variance weighted mean where the weights are determined by the variance of the GWAS attenuation ratio. The inverse-variance weighted estimate is $\sum_{i} {x_{i} / σ_{gwas,i}^{2}} / \sum_{i} {1 / σ_{gwas,i}^{2}}$ , with standard error computed as $\sqrt{\sum_{i} [σ_{i}^{2} / σ_{gwas,i}^{4}]} / \sum_{i} [1 / σ_{gwas,i}^{2}]$ . Further details are provided in the Supplementary Note.

Assessing power in UK Biobank

We computed the number of independent loci as defined similar to ref.¹⁵. The relative improvement for a method M versus a method m is defined as $\frac{\sum_{t} #_{t,M} - \sum_{t} #_{t,m}}{\sum_{t} #_{t,m}}$ where #_{t, M} is the number of independent loci for trait t and method M; the standard error is computed via block jackknife of the genome (200 blocks). Further details are provided in the Supplementary Note.

Assessing heritability and correlation in UK Biobank

We estimate observed-scale SNP-heritability ( ${\hat{h}}_{g,o}^{2}$ ) for each method by applying BOLT-REML²⁶ to unrelated European individuals. Observed-scale SNP-heritability is converted to liability-scale SNP-heritability using the in-sample disease prevalence¹⁰. Genetic correlation is estimated using BOLT-REML²⁶. The phenotypic correlation and correlation of −log₁₀ (p) (across all variants with a minimum MAF determined by case prevalence; Supplementary Table 16) is estimated in R. For both genetic and phenotypic correlation we restrict attention to individuals with non-NA values for both methods being compared.

Replication analysis

We conducted a replication analysis of loci identified by GWAS and/or LT-FH in independent non-UK Biobank data sets (4 diseases: coronary artery disease, type 2 diabetes, breast cancer, and prostate cancer) with publicly available summary statistics^33–36 (Supplementary Table 25). The replication summary statistics are from studies consisting of predominantly non-UK Europeans and were always computed using GWAS (not LT-FH). For both genome-wide significant loci identified in UK Biobank by GWAS and genome-wide significant loci identified in UK Biobank by LT-FH, we computed the replication slope (the slope of a regression of standardized effect sizes in case-control replication data vs. UK Biobank discovery data).

In detail, we restricted association results from linear regression on unrelated Europeans for GWAS and LT-FH to SNPs which (1) appear in the replication studies SNP set, (2) meet the MAF threshold for that trait from our main analysis, and (3) are significant in GWAS or LT-FH. We then LD-clumped this restricted set of association statistics (stringent 5Mb window and R² threshold of 0.01 for LD clumping). For GWAS (and LT-FH, respectively) for each LD clump with at least one significant SNP, we selected the most significant SNP as the lead SNP. When an LD clump contained only SNPs that were significant for LT-FH (resp. GWAS), this was considered a LT-FH only locus (resp. GWAS only). The replication slope^27,37 was then computed as the slope of a regression of standardized effect sizes of lead SNPs in case-control replication data vs. GWAS or LT-FH UK Biobank discovery data. Standardized effect sizes were computed as $Z / \sqrt{N_{eff}}$ , where Z denotes z-score, for GWAS (and GWAS replication data) N_eff was computed as $\frac{4}{\frac{1}{N_{A}} + \frac{1}{N_{U}}}$ , and for LT-FH N_eff was computed as N_eff,GWAS *c (where c is the relative effective sample size of LT-FH vs GWAS; see Supplementary Table 22).

Extended Data

Extended Data Fig. 3 — We plot the increase in number of independent loci for ${LT-FH}_{n o - s i b, a g e}^{P A}$ relative to for ${LT-FH}_{n o - s i b}^{P A}$ (Table S32) against the liability threshold model parameter |c_age| (Table S30).

Extended Data Fig. 4 — We report results of GWAS using BOLT-LMM on related Europeans, GWAX using BOLT-LMM on unrelated Europeans, and LT-FH using BOLT-LMM on related Europeans using only case-control status for all sibling pairs and parent-offspring pairs within the set of target samples. Numerical results are reported in Table S37.

Extended Data Fig. 5 — We plot GWAS BOLT-LMM-inf effect sizes and transformed LT-FH BOLT-LMM-inf effect sizes for genome-wide significant effect sizes (P ≤ 5 * 10⁻⁸ for both GWAS and LT-FH BOLT-LMM-inf).

We note that BOLT-LMM only outputs effect size estimates for BOLT-LMM-inf, the BOLT-LMM approximation to the infinitesimal mixed model. Our effect size for GWAS is the outputted *β_{GWAS,BOLT - LMM - in f}* (per-allele observed scale) and for LT-FH we estimate a (per-allele observed scale) effect size as $β = \frac{β_{LT - FH, BOLT - LMM - in f}}{se (β_{LT - FH, BOLT - LMM - in f}) \sqrt{N_{GWAS} * c}} \frac{\sqrt{K (1 - K)}}{\sqrt{2 (MAF) (1 - MAF)}}$ , where c is the boost in N_eff for LT-FH relative to GWAS, K is disease prevalence in GWAS and *MAF* is the minor allele frequency of the SNP.

Supplementary Material

supplement

NIHMS1576071-supplement-supplement.pdf^{(493.2KB, pdf)}

Acknowledgements

We are grateful to L. O’Connor, O. Weissbrod, N. Zaitlen, G. Kichaev, and A. Gusev for helpful discussions, and E.M. Pedersen for computational suggestions. This research was funded by NIH grants R01 HG006399 (N.P. and A.L.P), R01 MH101244 (A.L.P.), R01 MH107649 (A.L.P.), NSF CAREER award DBI-1349449 (S.G. and A.L.P.), and 5T32CA009337–32 (M.L.A.H.). P.-R.L. was supported by the Next Generation Fund at the Broad Institute of MIT and Harvard and a Sloan Research Fellowship. This research was conducted using the UK Biobank Resource under Application #10438.

Footnotes

Competing Interests Statement

The authors declare no competing interests.

Data Availability

This study analyzed data from the UK Biobank, which is publicly available by application (http://www.ukbiobank.ac.uk/). We have publicly released summary association statistics computed by applying our LT-FH method to UK Biobank data; LT-FH summary association statistics for 12 diseases: https://data.broadinstitute.org/alkesgroup/UKBB/LTFH/sumstats/. A Life Sciences Reporting Summary is available.

Code Availability

We have publicly released open-source software implementing our LT-FH method; LT-FH software (v1 and v2): https://data.broadinstitute.org/alkesgroup/UKBB/LTFH/. BOLT-LMM v2.3 software: https://data.broadinstitute.org/alkesgroup/BOLT-LMM. LTSOFT software: https://data.broadinstitute.org/alkesgroup/LTSOFT/. PLINK software: https://www.cog-genomics.org/plink2.

References

1.Liu JZ, Erlich Y & Pickrell JK Case-control association mapping by proxy using family history of disease. Nat. Genet 49, 325–331 (2017). [DOI] [PubMed] [Google Scholar]
2.So H-C, Kwan JSH, Cherny SS & Sham PC Risk Prediction of Complex Diseases from Family History and Known Susceptibility Loci, with Applications for Cancer Screening. Am. J. Hum. Genet 88, 548–565 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Visscher PM & Duffy DL The Value of Relatives With Phenotypes But Missing Genotypes in Association Studies for Quantitative Traits. Genet. Epidemiol 30, 30–36 (2006). [DOI] [PubMed] [Google Scholar]
4.Hayes BJ, Bowman PJ, Chamberlain AJ & Goddard ME Genomic selection in dairy cattle: Progress and challenges. J. Dairy Sci 92, 433–443 (2008). [DOI] [PubMed] [Google Scholar]
5.Misztal I, Legarra A & Aguilar I Computing procedures for genetic evaluation including phenotypic, full pedigree, and genomic information. J. Dairy Sci 92, 4648–4655 (2009). [DOI] [PubMed] [Google Scholar]
6.Liu Z, Goddard ME, Reinhardt F & Reents R A single-step genomic model with direct estimation of marker effects. J. Dairy Sci 97, 5833–5850 (2014). [DOI] [PubMed] [Google Scholar]
7.Marioni RE, Harris SE, Zhang Q & et al. GWAS on family history of Alzheimer’s disease. Transl. Psychiatry 8, (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Jansen IE, Savage JE, Watanabe K & et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk. Nat. Genet 51, 404–413 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Falconer DS The inheritance of liability to diseases with variable age of onset, with particular reference to diabetes mellitus. Ann. Hum. Genet 31, 1–20 (1967). [DOI] [PubMed] [Google Scholar]
10.Lee SH, Wray NR, Goddard ME & Visscher PM Estimating missing heritability for disease from genome-wide association studies. Am. J. Hum. Genet 88, 294–305 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Zaitlen N, Lindstrom S, Pasaniuc B, Cornelis M & et al. Informed Conditioning on Clinical Covariates Increases Power in Case-Control Association Studies. PLoS Genet. 8, (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Weissbrod O, Lippert C, Geiger D & Heckerman D Accurate liability estimation improves power in ascertained case-control studies. Nat. Methods 12, 332–334 (2015). [DOI] [PubMed] [Google Scholar]
13.Hayeck TJ, Zaitlen NA, Loh P-R, Vilhjalmsson B & et al. Mixed Model with Correction for Case-Control Ascertainment Increases Association Power. Am. J. Hum. Genet 96, 720–730 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Loh P-R, Tucker G, Bulik-Sullivan BK, Vilhjalmsson BJ & et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet 47, 284–290 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Loh P-R, Kichaev G, Gazal S, Schoech AP & Price AL Mixed-model association for biobank-scale datasets. Nat. Genet 50, 906–911 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Yang J et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet 42, 565–569 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Pearson K Mathematical contributions to the theory of evolution. XI. On the influence of natural selection on the variability and correlation of organs. Philos. Trans. R. Soc. Math. Phys. Eng. Sci 200, 1–66 (1903). [Google Scholar]
18.Aitken AC Note on selection from a multivariate normal population. Proc. Edinb. Math. Soc. B 4, 106–110 (1934). [Google Scholar]
19.Armitage P Tests for Linear Trends in Proportions and Frequencies. Biometrics 11, 375–386 (1955). [Google Scholar]
20.Bycroft C et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Zhou W, Nielsen JB, Fritsche LG & et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet 50, 1335–1341 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Bulik-Sullivan BK et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet 47, 291–295 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Finucane HK et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet 47, 1228–1235 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Gazal S et al. Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection. Nat. Genet 49, 1421–1427 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Haworth S et al. Apparent latent structure within the UK Biobank sample has implications for epidemiological analysis. Nat. Commun 10, (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Loh P-R, Bhatia G, Gusev A, Finucane HK & et al. Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance components analysis. Nat. Genet 47, 1385–1392 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Marigorta U & Navarro A High Trans-ethnic Replicability of GWAS Results Implies Common Causal Variants. PLOS Genet. 9, (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Turley P, Walters RK & et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat. Genet 50, 229–237 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Price AL, Spencer CCA & Donnelly P Progress and promise in understanding the genetic basis of common diseases. Proc R Soc B 282, (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Visscher PM, Wray NR, Zhang Q & et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet 101, 5–22 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Manichaikul A et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Munoz M, Pong-Wong R, Canela-Xandri O & et al. Evaluating the contribution of genetic and familial shared environment to common disease using the UK Biobank. Nat. Genet 48, 980–983 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
33.Schunkert H, Konig IR, Kathiresan S & et al. Large-scale association analyses identifies 13 new susceptibility loci for coronary artery disease. Nat. Genet 43, 333–338 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Morris AP, Voight BF, Teslovich TM & et al. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat. Genet 44, 981–990 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Michailidou K, Lindstrom S, Dennis J & et al. Association analysis identifies 65 new breast cancer risk loci. Nature 551, 92–94 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
36.Schumacher F, Amin Al Olama A, Berndt SI & et al. Association analyses of more than 140,000 men identify 63 new prostate cancer susceptibility loci. Nat. Genet 50, 928–936 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
37.Kichaev G, Bhatia G, Loh P-R & et al. Leveraging Polygenic Functional Enrichment to Improve GWAS Power. Am. J. Hum. Genet 104, 65–75 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

supplement

NIHMS1576071-supplement-supplement.pdf^{(493.2KB, pdf)}

[R1] 1.Liu JZ, Erlich Y & Pickrell JK Case-control association mapping by proxy using family history of disease. Nat. Genet 49, 325–331 (2017). [DOI] [PubMed] [Google Scholar]

[R2] 2.So H-C, Kwan JSH, Cherny SS & Sham PC Risk Prediction of Complex Diseases from Family History and Known Susceptibility Loci, with Applications for Cancer Screening. Am. J. Hum. Genet 88, 548–565 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] 3.Visscher PM & Duffy DL The Value of Relatives With Phenotypes But Missing Genotypes in Association Studies for Quantitative Traits. Genet. Epidemiol 30, 30–36 (2006). [DOI] [PubMed] [Google Scholar]

[R4] 4.Hayes BJ, Bowman PJ, Chamberlain AJ & Goddard ME Genomic selection in dairy cattle: Progress and challenges. J. Dairy Sci 92, 433–443 (2008). [DOI] [PubMed] [Google Scholar]

[R5] 5.Misztal I, Legarra A & Aguilar I Computing procedures for genetic evaluation including phenotypic, full pedigree, and genomic information. J. Dairy Sci 92, 4648–4655 (2009). [DOI] [PubMed] [Google Scholar]

[R6] 6.Liu Z, Goddard ME, Reinhardt F & Reents R A single-step genomic model with direct estimation of marker effects. J. Dairy Sci 97, 5833–5850 (2014). [DOI] [PubMed] [Google Scholar]

[R7] 7.Marioni RE, Harris SE, Zhang Q & et al. GWAS on family history of Alzheimer’s disease. Transl. Psychiatry 8, (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] 8.Jansen IE, Savage JE, Watanabe K & et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk. Nat. Genet 51, 404–413 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] 9.Falconer DS The inheritance of liability to diseases with variable age of onset, with particular reference to diabetes mellitus. Ann. Hum. Genet 31, 1–20 (1967). [DOI] [PubMed] [Google Scholar]

[R10] 10.Lee SH, Wray NR, Goddard ME & Visscher PM Estimating missing heritability for disease from genome-wide association studies. Am. J. Hum. Genet 88, 294–305 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] 11.Zaitlen N, Lindstrom S, Pasaniuc B, Cornelis M & et al. Informed Conditioning on Clinical Covariates Increases Power in Case-Control Association Studies. PLoS Genet. 8, (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] 12.Weissbrod O, Lippert C, Geiger D & Heckerman D Accurate liability estimation improves power in ascertained case-control studies. Nat. Methods 12, 332–334 (2015). [DOI] [PubMed] [Google Scholar]

[R13] 13.Hayeck TJ, Zaitlen NA, Loh P-R, Vilhjalmsson B & et al. Mixed Model with Correction for Case-Control Ascertainment Increases Association Power. Am. J. Hum. Genet 96, 720–730 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] 14.Loh P-R, Tucker G, Bulik-Sullivan BK, Vilhjalmsson BJ & et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet 47, 284–290 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] 15.Loh P-R, Kichaev G, Gazal S, Schoech AP & Price AL Mixed-model association for biobank-scale datasets. Nat. Genet 50, 906–911 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] 16.Yang J et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet 42, 565–569 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] 17.Pearson K Mathematical contributions to the theory of evolution. XI. On the influence of natural selection on the variability and correlation of organs. Philos. Trans. R. Soc. Math. Phys. Eng. Sci 200, 1–66 (1903). [Google Scholar]

[R18] 18.Aitken AC Note on selection from a multivariate normal population. Proc. Edinb. Math. Soc. B 4, 106–110 (1934). [Google Scholar]

[R19] 19.Armitage P Tests for Linear Trends in Proportions and Frequencies. Biometrics 11, 375–386 (1955). [Google Scholar]

[R20] 20.Bycroft C et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] 21.Zhou W, Nielsen JB, Fritsche LG & et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet 50, 1335–1341 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] 22.Bulik-Sullivan BK et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet 47, 291–295 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] 23.Finucane HK et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet 47, 1228–1235 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] 24.Gazal S et al. Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection. Nat. Genet 49, 1421–1427 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] 25.Haworth S et al. Apparent latent structure within the UK Biobank sample has implications for epidemiological analysis. Nat. Commun 10, (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] 26.Loh P-R, Bhatia G, Gusev A, Finucane HK & et al. Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance components analysis. Nat. Genet 47, 1385–1392 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] 27.Marigorta U & Navarro A High Trans-ethnic Replicability of GWAS Results Implies Common Causal Variants. PLOS Genet. 9, (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] 28.Turley P, Walters RK & et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat. Genet 50, 229–237 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R29] 29.Price AL, Spencer CCA & Donnelly P Progress and promise in understanding the genetic basis of common diseases. Proc R Soc B 282, (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R30] 30.Visscher PM, Wray NR, Zhang Q & et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet 101, 5–22 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R31] 31.Manichaikul A et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R32] 32.Munoz M, Pong-Wong R, Canela-Xandri O & et al. Evaluating the contribution of genetic and familial shared environment to common disease using the UK Biobank. Nat. Genet 48, 980–983 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R33] 33.Schunkert H, Konig IR, Kathiresan S & et al. Large-scale association analyses identifies 13 new susceptibility loci for coronary artery disease. Nat. Genet 43, 333–338 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R34] 34.Morris AP, Voight BF, Teslovich TM & et al. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat. Genet 44, 981–990 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R35] 35.Michailidou K, Lindstrom S, Dennis J & et al. Association analysis identifies 65 new breast cancer risk loci. Nature 551, 92–94 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R36] 36.Schumacher F, Amin Al Olama A, Berndt SI & et al. Association analyses of more than 140,000 men identify 63 new prostate cancer susceptibility loci. Nat. Genet 50, 928–936 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R37] 37.Kichaev G, Bhatia G, Loh P-R & et al. Leveraging Polygenic Functional Enrichment to Improve GWAS Power. Am. J. Hum. Genet 104, 65–75 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Liability threshold modeling of case-control status and family history of disease increases association power

Margaux LA Hujoel

Steven Gazal

Po-Ru Loh

Nick Patterson

Alkes L Price

Abstract

Introduction

Figure 1: Overview of LT-FH and other methods.

Results

Overview of methods

Simulations

Figure 2: LT-FH is well-calibrated and increases association power in simulations.

Analysis of 12 diseases in the UK Biobank

Table 1: Overview of 12 diseases in the UK Biobank.

Figure 3: LT-FH increases association power across 12 diseases from the UK Biobank.

Table 2: Genetic and phenotypic correlations between GWAS, GWAX and LT-FH phenotypes.

Figure 4: Loci identified by LT-FH replicate in independent data sets.

BOLT-LMM increases GWAS, GWAX and LT-FH association power

Discussion

Methods

LT-FH method: estimating posterior mean genetic liability

LT-FH method: LT-FH association statistic

LT-FH effect sizes

GWAS and GWAX methods

Simulations

UK Biobank data sets

LT-FH in UK Biobank

GWAS and GWAX in UK Biobank

Assessing calibration in UK Biobank

Assessing power in UK Biobank

Assessing heritability and correlation in UK Biobank

Replication analysis

Extended Data

Extended Data Fig. 1. QQ plots from simulations with default parameter settings.

Extended Data Fig. 2. Distribution of LT-FH phenotypes for 12 UK Biobank diseases.

Extended Data Fig. 3. Impact of modifying the LT-FH method to incorporate age information as a function of the liability threshold model parameter for age for 12 UK Biobank diseases.

Extended Data Fig. 4. LT-FH increases association power across 12 diseases from the UK Biobank in analyses incorporating related individuals.

Extended Data Fig. 5. Strong concordance between GWAS BOLT-LMM-inf effect sizes and transformed LT-FH BOLT-LMM-inf effect sizes.

Supplementary Material

Acknowledgements

Footnotes

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases