American Journal of Human Genetics. 2015 Nov 5;97(5):677–690. doi: 10.1016/j.ajhg.2015.10.002

Two-Variance-Component Model Improves Genetic Prediction in Family Datasets

George Tucker 1,2,3,10, Po-Ru Loh 2,3,10, Iona M MacLeod 4,5, Ben J Hayes 5,6,7, Michael E Goddard 4,6, Bonnie Berger 1,8, Alkes L Price 2,3,9
PMCID: PMC4667134  PMID: 26544803

Abstract

Genetic prediction based on either identity by state (IBS) sharing or pedigree information has been investigated extensively with best linear unbiased prediction (BLUP) methods. Such methods were pioneered in plant and animal-breeding literature and have since been applied to predict human traits, with the aim of eventual clinical utility. However, methods to combine IBS sharing and pedigree information for genetic prediction in humans have not been explored. We introduce a two-variance-component model for genetic prediction: one component for IBS sharing and one for approximate pedigree structure, both estimated with genetic markers. In simulations using real genotypes from the Candidate-gene Association Resource (CARe) and Framingham Heart Study (FHS) family cohorts, we demonstrate that the two-variance-component model achieves gains in prediction r2 over standard BLUP at current sample sizes, and we project, based on simulations, that these gains will continue to hold at larger sample sizes. Accordingly, in analyses of four quantitative phenotypes from CARe and two quantitative phenotypes from FHS, the two-variance-component model significantly improves prediction r2 in each case, with up to a 20% relative improvement. We also find that standard mixed-model association tests can produce inflated test statistics in datasets with related individuals, whereas the two-variance-component model corrects for inflation.

Introduction

Mixed linear models (MLMs) are widely used for genetic prediction and association testing in genome-wide association studies (GWASs). In prediction, MLMs produce best linear unbiased predictions (BLUPs); BLUP and its extensions were first developed in agricultural genetics1, 2, 3, 4 and have since been applied to human genetics.5, 6, 7, 8, 9, 10 In association testing, MLMs model relatedness and population stratification, correcting for confounding and increasing power over linear regression (essentially by testing association of the residual from BLUP).11, 12, 13, 14, 15, 16 Mixed-model methods harness information from either genetic markers (identity by state [IBS] sharing) or known pedigree relationships. Recent work on the estimation of components of heritability17 has demonstrated the advantages of a model with two variance components: one component for IBS sharing (corresponding to SNP heritability, $h_g^2$ 18, 19) and one for approximate pedigree structure, estimated via IBS sharing above a threshold (corresponding to total narrow-sense heritability, $h^2$ 20). However, the potential advantages of this model for genetic prediction (or mixed-model association) have not been explored.

Through systematic simulations and analyses of quantitative phenotypes in the Candidate-gene Association Resource (CARe)21 and Framingham Heart Study (FHS)22, 23 cohorts, we show that the two-variance-component model improves prediction r2 over single-variance-component (standard BLUP) methods. Our simulations demonstrate that this improvement is achieved both at current sample sizes and for larger sample sizes, and our analyses of real CARe and FHS phenotypes confirm relative improvements in prediction r2 of up to 20%. We also consider the situation in which phenotypes are available for ungenotyped individuals who are related to the genotyped cohort (e.g., via family history24, 25) and show that leveraging this additional information for genetic prediction within a two-variance-component model achieves similar gains.

Additionally, we investigate the utility of the two-variance-component model for association testing. We evaluate the standard prospective MLM association statistic15 in the context of familial relatedness and observe inflation of test statistics over a range of simulation parameters, contrary to previous findings.11, 13, 14, 15, 26 We show that the two-variance-component model substantially reduces the inflation in simulations and in GWASs of CARe and FHS phenotypes.

Material and Methods

Overview of Methods

We use the two-variance-component model described in previous work on the estimation of components of heritability.17 The first variance component is the usual genetic relationship matrix (GRM) computed from genetic markers (corresponding to $h_g^2$).18 The second variance component is a thresholded version of the GRM in which pairwise relationship estimates smaller than a threshold $t$ are set to zero. The idea is to capture strong relatedness structure, much as a pedigree relationship matrix does. (If full pedigree information is available, the pedigree relationship matrix can also be used directly.) Explicitly modeling relatedness in this way allows the two-variance-component mixed model to capture additional heritability from untyped SNPs (corresponding to $h^2 - h_g^2$).17 We used the two-variance-component model to compute genetic predictions via BLUP and to compute association test statistics via a Wald test.1, 11, 27 (We note that best linear unbiased prediction, BLUP, is a general method for prediction that can be applied once a covariance model has been established, whether from one or many variance components. We will therefore use “standard BLUP” to refer to BLUP with the GRM as a single variance component, and we will use “BLUP” to more generally refer to BLUP with any number of variance components.) We further developed methods to treat the situation in which phenotypes for ungenotyped relatives are available; in brief, our approach uses pedigree information to impute the missing information.28 Full mathematical details are provided below and in the Appendix, and we have released an open source Matlab implementation of these methods (FAMBLUP; see Web Resources).

Standard Mixed Model for Prediction

We begin by establishing notation and reviewing standard formulas for mixed-model prediction (i.e., standard BLUP) and association testing with one variance component.1, 11, 27 Let $N$ be the number of individuals in the study and $M$ be the number of genotyped SNPs. Denote phenotypes by $y$, fixed-effect covariates by $X$, and normalized genotypes by $W$, all of which are mean-centered. We normalize each genotype by dividing by $\sqrt{2\hat{p}(1-\hat{p})}$, where $\hat{p}$ is the empirical minor allele frequency (MAF).18 We model phenotypes by using the following mixed model:

y=Xb+g+ϵ, (Equation 1)

where $g \sim N(0, \Sigma_g)$ is a random-effect term modeling genetic effects, $\epsilon \sim N(0, \sigma_e^2 I)$ is a random-effect term modeling noise, and $b$ is a vector of coefficients for the fixed effects. In the standard marker-based mixed model, we assume $g = W\alpha$ is a linear combination of genotyped SNPs, where $\alpha$ is an $M$-vector of independent and identically distributed (iid) normal SNP effect sizes (the infinitesimal model), so that

y=Xb+Wα+ϵ. (Equation 2)

Then the genetic covariance satisfies $\Sigma_g = \sigma_g^2 WW^T/M$, where $WW^T/M$ is the GRM and $\sigma_g^2$ and $\sigma_e^2$ are variance parameters typically estimated via restricted maximum likelihood (REML).29 In pedigree-based models that do not use marker information, $\Sigma_g = \sigma_h^2 \Theta$, where $\Theta$ is the pedigree relationship matrix and $\sigma_h^2$ and $\sigma_e^2$ are again estimated via REML.
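As an illustration of the normalization and GRM definitions above, the following is a minimal sketch in plain Python. The function names and the toy genotype matrix are hypothetical and are not taken from FAMBLUP; the sketch assumes allele counts coded 0/1/2 and that every SNP is polymorphic in the sample.

```python
import math

def normalize_genotypes(genotypes):
    """Mean-center each SNP and divide by sqrt(2 * p * (1 - p)).

    genotypes: N x M list of allele counts in {0, 1, 2}.
    Assumes every SNP is polymorphic (0 < p < 1) in the sample.
    """
    n, m = len(genotypes), len(genotypes[0])
    w = [[0.0] * m for _ in range(n)]
    for j in range(m):
        column = [genotypes[i][j] for i in range(n)]
        p_hat = sum(column) / (2.0 * n)  # empirical allele frequency
        scale = math.sqrt(2.0 * p_hat * (1.0 - p_hat))
        for i in range(n):
            w[i][j] = (column[i] - 2.0 * p_hat) / scale
    return w

def genetic_relationship_matrix(w):
    """Return the N x N GRM, W W^T / M, from normalized genotypes W."""
    n, m = len(w), len(w[0])
    return [[sum(w[i][k] * w[j][k] for k in range(m)) / m for j in range(n)]
            for i in range(n)]

# Toy data: 4 individuals, 3 SNPs (hypothetical values).
toy_genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [1, 2, 1]]
W = normalize_genotypes(toy_genotypes)
K = genetic_relationship_matrix(W)
```

Because $2\hat{p}(1-\hat{p})$ is symmetric in $\hat{p}$, the scaling is the same whichever allele is counted.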

These models naturally yield formulas for standard BLUP prediction.1 Explicitly, if we denote training individuals (i.e., those with observed phenotypes) with subscript $-i$ and test individuals (i.e., those with phenotypes to be predicted) with subscript $i$, predictions are given by

$$\hat{y}_i = \sigma_g^2 \frac{W_i W_{-i}^T}{M} \left( \sigma_g^2 \frac{W_{-i} W_{-i}^T}{M} + \sigma_e^2 I \right)^{-1} (y_{-i} - X_{-i} b) + X_i b.$$ (Equation 3)
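The prediction formula can be sketched in plain Python as follows. This is an illustrative sketch, not the FAMBLUP implementation: it assumes the GRM blocks (scaled by $M$) are already computed, that the variance parameters are given rather than estimated by REML, and that the training phenotypes have already had fixed effects removed. The helper `solve` and all variable names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.

    a: n x n list of lists; b: n x k list of lists. Returns x as n x k.
    """
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def blup_predict(k_train, k_test_train, resid_train, sigma_g2, sigma_e2):
    """Standard BLUP of the genetic component for test individuals.

    k_train: GRM among training individuals (n x n).
    k_test_train: GRM between test and training individuals (m x n).
    resid_train: training phenotypes minus fixed effects (length n).
    """
    n = len(resid_train)
    v = [[sigma_g2 * k_train[i][j] + (sigma_e2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    v_inv_resid = solve(v, [[r] for r in resid_train])  # V^{-1} (y - X b)
    return [sigma_g2 * sum(row[j] * v_inv_resid[j][0] for j in range(n))
            for row in k_test_train]

# Toy example: the test individual is related to training individuals 0 and 1.
k_train = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
k_test_train = [[0.5, 0.5, 0.0]]
resid_train = [1.2, 0.8, -2.0]
prediction = blup_predict(k_train, k_test_train, resid_train, 0.5, 0.5)
```

For realistic sample sizes, a dense linear-algebra library would replace the hand-written solver; it is used here only to keep the sketch self-contained.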

Standard Mixed-Model Association Test

To test a candidate SNP (w) for association with the phenotype (y), we augment the marker-based model by including w as an additional fixed-effect covariate:

y=wβ+Xb+Wα+ϵ, (Equation 4)

where $\beta$ is the coefficient for the SNP ($w$) and we wish to test whether $\beta \neq 0$. To do so, we estimate the variance parameters ($\sigma_g^2$, $\sigma_e^2$) by using REML and estimate the fixed-effect coefficients ($\beta$, $b$) by using maximum likelihood.27 We then compute the Wald statistic to test $\beta \neq 0$ as follows. Let

$$V = \hat{\sigma}_g^2 WW^T/M + \hat{\sigma}_e^2 I$$ (Equation 5)

denote the total phenotypic covariance and let $Q = [w; X]$ denote the combined fixed effects. Then $\hat{\beta}$ is equal to the first entry of $(Q^T V^{-1} Q)^{-1} Q^T V^{-1} y$ and $\mathrm{var}(\hat{\beta})$ is equal to the first diagonal entry of $(Q^T V^{-1} Q)^{-1}$. The Wald chi-square test statistic is given by

$$\chi^2_{\mathrm{Wald}} = \frac{\hat{\beta}^2}{\mathrm{var}(\hat{\beta})}$$ (Equation 6)

and is asymptotically $\chi^2$ distributed with 1 degree of freedom (df) under the null distribution. (We note that, theoretically, comparing the square root of Equation 6 to a t distribution is more precise, but in practice, the distinction is negligible at sample sizes of many thousands.)
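The Wald statistic in Equation 6 can be sketched as a generalized least-squares computation. The code below is an illustrative sketch under the assumption that the covariance matrix $V$ has already been formed from estimated variance parameters; `solve` and `wald_statistic` are hypothetical names, and the small solver stands in for a proper linear-algebra library.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def wald_statistic(w, x, v, y):
    """Wald chi-square statistic for candidate SNP w.

    w: candidate SNP (length N); x: N x c covariate matrix;
    v: N x N phenotypic covariance; y: phenotypes (length N).
    """
    n = len(y)
    q = [[w[i]] + list(x[i]) for i in range(n)]  # Q = [w; X]
    k = len(q[0])
    v_inv_q = solve(v, q)                        # V^{-1} Q
    v_inv_y = solve(v, [[yi] for yi in y])       # V^{-1} y
    qvq = [[sum(q[i][a] * v_inv_q[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    qvy = [sum(q[i][a] * v_inv_y[i][0] for i in range(n)) for a in range(k)]
    eye = [[1.0 if a == b else 0.0 for b in range(k)] for a in range(k)]
    cov = solve(qvq, eye)                        # (Q^T V^{-1} Q)^{-1}
    beta_hat = sum(cov[0][b] * qvy[b] for b in range(k))
    return beta_hat ** 2 / cov[0][0]

# Toy check with V = I and an intercept as the only covariate.
w = [-1.0, 1.0, -1.0, 1.0]
x = [[1.0], [1.0], [1.0], [1.0]]
v = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
y = [-2.0, 2.0, -1.0, 1.0]
stat = wald_statistic(w, x, v, y)
```

With $V = I$ the computation reduces to ordinary least squares with known unit variance, which makes the toy case easy to verify by hand ($\hat{\beta} = 1.5$, $\mathrm{var}(\hat{\beta}) = 0.25$).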

We make one slight modification to the above association test to avoid proximal contamination (i.e., masking of the association signal by SNPs included in the random-effects term that are in linkage disequilibrium [LD] with the SNP being tested). Specifically, we use a leave-one-chromosome-out procedure in which, when testing SNP w, we exclude all SNPs on the same chromosome as w from the genotype matrix W used to model random genetic effects.15, 30, 31

Two-Variance-Component Mixed Model

Our use of a two-variance-component mixed model is motivated by the idea that in a sample containing related individuals, the pedigree relationship matrix (or an approximation thereof) can model additional heritable variance explained by untyped SNPs.17 More precisely, consider expanding the marker-based model (Equation 2) to

y=Xb+Wα+Uγ+ϵ, (Equation 7)

where $U\gamma$ is the analog of $W\alpha$ for untyped SNPs, $U$, so that the total genetic effect is $g = W\alpha + U\gamma$. Ideally, we would use this model for prediction and its augmentation for association testing, but $U$ is unobserved. Because the BLUP and Wald statistic formulas only require $UU^T$, however, we can still improve upon the standard model (Equation 2) by using an approximation of $UU^T$. If we let $M_h$ denote the number of untyped SNPs, the matrix $UU^T/M_h$ is the realized relationship matrix from untyped SNPs. Assuming a fixed pedigree relationship matrix ($\Theta$), we have

$$E[UU^T/M_h] = \Theta,$$ (Equation 8)

where the expectation is computed over possible realizations of genotypes passed down by descent (e.g., siblings share half of their genomes on average). When the study samples include close relatives, off-diagonal entries of $\Theta$ can be large, in which case these entries are good approximations of the corresponding entries of $UU^T/M_h$ and hold additional information not fully harnessed by models that use only the usual GRM from typed SNPs ($WW^T/M$). Substituting $\Theta$ for $UU^T/M_h$ gives the model

$$y \sim N(Xb,\ \sigma_g^2 WW^T/M + \sigma_h^2 \Theta + \sigma_e^2 I).$$ (Equation 9)

In our case, the pedigree relationship matrix ($\Theta$) is also unavailable, so we need to make a further approximation in which we replace $\Theta$ with the estimator

$$\Theta \approx (WW^T/M)_{>t},$$ (Equation 10)

obtained from the usual GRM by keeping only those entries larger than a threshold $t$ and setting all other entries to zero.17 This approximation gives the model

$$y \sim N(Xb,\ \sigma_g^2 WW^T/M + \sigma_h^2 (WW^T/M)_{>t} + \sigma_e^2 I).$$ (Equation 11)
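The covariance structure of the two-variance-component model can be assembled directly from a GRM, as in the following sketch. The function names and the toy GRM are hypothetical, and the variance parameters are supplied by hand here rather than estimated by REML.

```python
def threshold_grm(k, t=0.05):
    """Keep GRM entries larger than t; set all other entries to zero."""
    return [[v if v > t else 0.0 for v in row] for row in k]

def two_component_covariance(k, sigma_g2, sigma_h2, sigma_e2, t=0.05):
    """Phenotypic covariance: sigma_g2 * K + sigma_h2 * K_{>t} + sigma_e2 * I."""
    k_thr = threshold_grm(k, t)
    n = len(k)
    return [[sigma_g2 * k[i][j] + sigma_h2 * k_thr[i][j]
             + (sigma_e2 if i == j else 0.0)
             for j in range(n)] for i in range(n)]

# Toy GRM: individuals 0 and 1 are close relatives; individual 2 is unrelated.
K = [[1.00, 0.48, 0.01],
     [0.48, 1.02, -0.03],
     [0.01, -0.03, 0.99]]
V = two_component_covariance(K, sigma_g2=0.25, sigma_h2=0.25, sigma_e2=0.5)
```

In this toy case the small off-diagonal entries (0.01 and -0.03) contribute only through the first component, whereas the related pair contributes through both.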

In theory, the optimal threshold ($t$) depends on $M$, $N$, and the amount of relatedness in the dataset, but in our genetic prediction analyses using human datasets, we found that the results were robust to the choice of $t$, so we set $t = 0.05$. For association testing, we found $t = 0.05$ to generally be robust (and we expect this choice to be appropriate in human genetics settings), but in more extreme simulation scenarios in which we built the GRM from only a few chromosomes, we observed that higher thresholds were required to model relatedness accurately enough to produce well-calibrated statistics. We therefore optimize $t$ in all association analyses (all of which we conduct by using a leave-one-chromosome-out procedure15, 30, 31) by using the following approach. For each chromosome $c$ in turn, we choose $t$ to minimize the deviation between the thresholded GRM, $(W_{-c}W_{-c}^T/M_{-c})_{>t}$, computed with all chromosomes but $c$, and the GRM, $W_cW_c^T/M_c$, computed on the left-out chromosome $c$. We measure this deviation with the squared Frobenius norm

$$\left\| W_cW_c^T/M_c - (W_{-c}W_{-c}^T/M_{-c})_{>t} \right\|_F^2,$$ (Equation 12)

i.e., the sum of squared differences between matrix entries. Prediction and association testing proceed as before, once the threshold ($t$) has been set: we estimate $\sigma_g^2$, $\sigma_h^2$, and $\sigma_e^2$ by REML to enable calculation of BLUP predictions, and for association testing, we again introduce an additional fixed-effect term, $w\beta$, for the SNP being tested and construct a Wald statistic. (Again, for computational efficiency, we apply a leave-one-chromosome-out procedure within which we reuse variance parameters fitted once per left-out chromosome.13, 15, 16) We note that the computation of predictions, $\hat{y}$, can no longer be expressed as a simple matrix-vector product of genotypes with a vector ($\hat{\beta}$) of SNP weights, as is the case for standard (one-variance-component) genomic BLUP. Instead, the formula for $\hat{y}$ (given in the Appendix) involves two terms, only one of which has the above form; the other involves combining training and testing genotypes and has substantially greater computational cost ($O(N^3)$). Although performing prediction with the first term alone would be computationally efficient, we found that such an approach yields suboptimal results; see the Appendix for details.
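Returning to the threshold selection of Equation 12, the procedure can be sketched as a search over candidate thresholds. The text does not state how the minimization is carried out, so the finite candidate grid below is an assumption, as are the function names and toy matrices.

```python
def threshold_grm(k, t):
    """Keep GRM entries larger than t; set all other entries to zero."""
    return [[v if v > t else 0.0 for v in row] for row in k]

def squared_frobenius_distance(a, b):
    """Sum of squared differences between corresponding matrix entries."""
    return sum((x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))

def choose_threshold(k_rest, k_left_out, candidates):
    """Pick the candidate t whose thresholded GRM (built from all other
    chromosomes) is closest to the GRM of the left-out chromosome."""
    return min(candidates,
               key=lambda t: squared_frobenius_distance(
                   k_left_out, threshold_grm(k_rest, t)))

# Toy matrices (hypothetical): one related pair, one unrelated individual.
k_rest = [[1.0, 0.5, 0.03], [0.5, 1.0, -0.02], [0.03, -0.02, 1.0]]
k_left_out = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
best_t = choose_threshold(k_rest, k_left_out, candidates=[0.0, 0.05, 0.6])
```

In the toy case, $t = 0$ retains the noisy entry 0.03 and $t = 0.6$ discards the true relationship 0.5, so the intermediate candidate gives the smallest deviation.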

We have released Matlab code that implements these methods in a stand-alone software package, FAMBLUP (see Web Resources). Although our implementation uses standard $O(N^3)$-time eigendecomposition-based variance components methods, we have taken care to optimize its central processing unit (CPU) and memory use: for example, FAMBLUP association analysis of N = 20,000 samples requires 16 GB RAM and 1 single-threaded CPU day per chromosome. Memory usage scales with $N^2$, and computation time scales with $N^3$, so for N = 30,000 samples, the requirements are 36 GB RAM and approximately 3.4 CPU days per chromosome. These computations are automatically multithreaded on multi-core machines and can be parallelized across chromosomes.
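The extrapolation from the N = 20,000 reference point follows from the stated quadratic and cubic scaling. A small sketch of the arithmetic (the function name is hypothetical, and pure power-law scaling is assumed):

```python
def extrapolate_requirements(n_new, n_ref=20000, ram_gb_ref=16.0,
                             cpu_days_ref=1.0):
    """Scale memory with N^2 and time with N^3 from a reference run."""
    ratio = n_new / n_ref
    return ram_gb_ref * ratio ** 2, cpu_days_ref * ratio ** 3

ram_gb, cpu_days = extrapolate_requirements(30000)
print(ram_gb, cpu_days)  # prints 36.0 3.375
```

Under pure cubic scaling, the per-chromosome time at N = 30,000 is therefore about 3.4 single-threaded CPU days.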

Extension to Ungenotyped Individuals

In the Appendix, we derive extensions of two-variance-component mixed-model prediction and association testing to make use of data available from additional phenotyped but ungenotyped relatives of genotyped individuals. In this setup, we assume that the full pedigree relationship matrix (containing both typed and untyped individuals) is known, whereas some entries of the SNP GRM—namely, those in rows or columns corresponding to untyped individuals—are unknown. Our procedure amounts to replacing the unobserved GRM with the expected GRM (based on known pedigree),32, 33, 34 similar in spirit to regression imputation; mathematical derivations are presented in the Appendix.

CARe and FHS Datasets

We analyzed 8,367 African-American CARe samples from the ARIC, CARDIA, CFS, JHS, and MESA cohorts, comprising high-quality genotypes at 770,390 SNPs from an Affymetrix 6.0 array; the CARe dataset and quality-control procedures used to obtain the sample and SNP sets we analyzed are described in Lettre et al. and Pasaniuc et al.21, 35 We analyzed all samples in analyses of simulated phenotypes (for which real genotypes were used); in analyses of real CARe phenotypes—including BMI, height, high density lipoprotein cholesterol (HDL), and low density lipoprotein cholesterol (LDL), each available for 5,000–8,000 samples—we removed outlier individuals with phenotype values in the top or bottom 0.1%, individuals younger than 18 years old, and individuals with missing age or sex, given that regression results can be quite sensitive to outliers; we then applied a Box-Cox transformation to remove skewness. We analyzed 7,476 FHS SHARe samples with high-quality genotypes at 413,943 SNPs from an Affymetrix 500K array and with BMI and height phenotypes available; the FHS dataset and quality-control procedures are described in Dawber et al., Splansky et al., and Chen et al.22, 23, 36 Our analyses were performed under the oversight of the Harvard institutional review board.

Genetic Prediction: Simulations with Real Genotypes

To assess the accuracy of genetic prediction methods, we simulated phenotypes based on genotypes from the CARe and FHS datasets; both CARe and FHS are family studies containing many close relatives. Because the CARe individuals are admixed, we projected out the first five principal components (equivalent to including them as fixed-effect covariates29) from genotypes and phenotypes in all analyses of both CARe and FHS data to avoid confounding from population structure.37 We simulated phenotypes by generating causal effects for two subsets of SNPs: a set of $M$ “observed SNPs,” which we used for both phenotype simulation and BLUP prediction, and a set of $M_h$ “untyped SNPs,” which we used for phenotype simulation but did not provide to prediction methods. In this simulation framework, the standard GRM built by MLM methods accurately models variation due to observed SNPs, but direct or inferred pedigree information is necessary to capture variation due to untyped SNPs. We generated effect sizes for observed and untyped SNPs from independent normal distributions $N(0, h_g^2/M)$ and $N(0, (h^2 - h_g^2)/M_h)$, respectively, where $h_g^2$ denotes heritability explained by observed SNPs and $h^2$ denotes total narrow-sense heritability. To build phenotypes, we multiplied the simulated effect sizes with the genotypes and added random noise from $N(0, 1 - h^2)$. We used SNPs on chromosome 1 as untyped SNPs and SNPs on varying subsets of chromosomes 2–22 as observed SNPs so as to simulate different values of $N/M$ (which is a key quantity affecting performance of mixed-model prediction38 and association15) and thereby estimate projected performance at larger $N$. We used $h_g^2 = 0.25$ and $h^2 = 0.5$ as typical values of these parameters.39
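The phenotype simulation described above can be sketched as follows. The function name, the seed handling, and the toy normalized genotypes are hypothetical; the real analyses used CARe and FHS genotypes, which are not reproduced here.

```python
import math
import random

def simulate_phenotypes(w_obs, w_untyped, hg2=0.25, h2=0.5, seed=0):
    """Simulate phenotypes from normalized observed and untyped genotypes.

    Effect sizes are drawn from N(0, hg2/M) for observed SNPs and
    N(0, (h2 - hg2)/M_h) for untyped SNPs; noise is drawn from N(0, 1 - h2).
    Returns (phenotypes, true genetic components).
    """
    rng = random.Random(seed)
    m, m_h = len(w_obs[0]), len(w_untyped[0])
    alpha = [rng.gauss(0.0, math.sqrt(hg2 / m)) for _ in range(m)]
    gamma = [rng.gauss(0.0, math.sqrt((h2 - hg2) / m_h)) for _ in range(m_h)]
    noise_sd = math.sqrt(1.0 - h2)
    phenotypes, genetic = [], []
    for row_obs, row_unt in zip(w_obs, w_untyped):
        g_i = (sum(x * a for x, a in zip(row_obs, alpha))
               + sum(x * c for x, c in zip(row_unt, gamma)))
        genetic.append(g_i)
        phenotypes.append(g_i + rng.gauss(0.0, noise_sd))
    return phenotypes, genetic

# Toy normalized genotypes (hypothetical): 3 individuals, 2 observed and
# 2 untyped SNPs.
w_obs = [[-1.2, 0.3], [0.0, 0.3], [1.2, -0.6]]
w_untyped = [[0.5, -0.7], [-1.0, 1.4], [0.5, -0.7]]
y, g = simulate_phenotypes(w_obs, w_untyped)
```

Returning the genetic component alongside the phenotype makes it possible to score predictions against the noise-free genetic value as well as against the phenotype itself.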

We note that under the above setup, untyped SNPs are completely untagged by typed SNPs, whereas in real data, untyped SNPs might be partially tagged by typed SNPs. In either case, the phenotype can be written as a sum of “genetic value explained by typed SNPs,” “remaining genetic value,” and “environmental value” (with variance parameters $h_g^2$, $h^2 - h_g^2$, and $1 - h^2$ corresponding to the same covariance structures in either case), so we expected that our results would be insensitive to this distinction. To verify this expectation, we also performed a set of simulations in which we selected the set of observed SNPs to be the 90% of SNPs with highest MAF and the set of untyped SNPs to be the 10% of SNPs with lowest MAF (similar to the simulation framework of Yang et al.18) to produce a realistic gap between $h_g^2$ and $h^2$ as a result of untyped rare variants. (The MAF cutoff corresponding to this split was 5.4%.)

Genetic Prediction: Simulations with Simulated Genotypes

To assess the potential performance of genetic prediction methods at extremely large sample sizes, we also simulated genotypes for sets of sib-pairs (relatedness = 0.5) with M = 100 SNPs and N/M = 10,20,…,100. We generated unlinked markers for simplicity by randomly generating MAFs uniformly in the interval [0.05,0.5] and sampling genotypes of unrelated individuals from a binomial distribution with the generated MAF. For sib-pairs, with probability 0.5, the pair shared an allele drawn randomly; otherwise, the alleles for the pair were drawn independently. (We ran this procedure twice per SNP to create diploid genotypes.) We simulated phenotypes as above.
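The sib-pair genotype simulation can be sketched as below. The function names and seed handling are hypothetical; the sketch covers only the sib-pair case described above.

```python
import random

def draw_pair_alleles(maf, rng, share_prob=0.5):
    """Draw one allele for each member of a pair.

    With probability share_prob the pair shares a single randomly drawn
    allele; otherwise the two alleles are drawn independently.
    """
    if rng.random() < share_prob:
        allele = 1 if rng.random() < maf else 0
        return allele, allele
    first = 1 if rng.random() < maf else 0
    second = 1 if rng.random() < maf else 0
    return first, second

def simulate_sib_pairs(n_pairs, n_snps, seed=0):
    """Return (mafs, genotypes); genotypes has 2 * n_pairs rows of 0/1/2."""
    rng = random.Random(seed)
    mafs = [rng.uniform(0.05, 0.5) for _ in range(n_snps)]
    genotypes = []
    for _ in range(n_pairs):
        sib1, sib2 = [], []
        for maf in mafs:
            # Run the allele draw twice per SNP to build diploid genotypes.
            a1, b1 = draw_pair_alleles(maf, rng)
            a2, b2 = draw_pair_alleles(maf, rng)
            sib1.append(a1 + a2)
            sib2.append(b1 + b2)
        genotypes.extend([sib1, sib2])
    return mafs, genotypes

mafs, genotypes = simulate_sib_pairs(n_pairs=5, n_snps=100)
```

Changing `share_prob` gives pairs with other expected sharing proportions.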

Genetic Prediction: Assessing Performance on Real Phenotypes

To compare the predictive performance of the two-variance-component model versus standard BLUP on real phenotypes, we performed cross-validation studies in which we repeatedly selected 10% of the phenotyped samples (from either CARe or FHS) as test data and used the remaining 90% of samples to train each predictor. For each training/test split ($s$), we thus obtained a pair of observed prediction $r^2$ values ($r^2_{\mathrm{2VC},s}$, $r^2_{\mathrm{BLUP},s}$). We then estimated the improvement of the two-variance-component model over standard BLUP as

$$\widehat{r^2_{\mathrm{2VC}} - r^2_{\mathrm{BLUP}}} = \mathrm{mean}\left(r^2_{\mathrm{2VC},s} - r^2_{\mathrm{BLUP},s}\right),$$ (Equation 13)

where the mean is taken over the random splits ($s$). We estimated the SE of this quantity as

$$\mathrm{SE}\left(\widehat{r^2_{\mathrm{2VC}} - r^2_{\mathrm{BLUP}}}\right) \approx \mathrm{SD}\left(r^2_{\mathrm{2VC},s} - r^2_{\mathrm{BLUP},s}\right)/\sqrt{10}.$$ (Equation 14)

The numerator is the SD of the per-split differences in $r^2$ (across random 90% training and 10% test set splits [$s$]), which measures the variability in observed performance differences between the two methods when assessed on 10% of the data. The division by $\sqrt{10}$ accounts for the 10× larger sample size of the full dataset. This estimate is approximate due to the complexities of estimating variance under cross-validation (specifically, the overlap among different test sets and among different training sets); in general, unbiased estimators of variance under cross-validation do not exist.40

Finally, in the Results section, we estimate relative improvements and SEs (i.e., we divide Equations 13 and 14 by the estimated baseline, $\mathrm{mean}(r^2_{\mathrm{BLUP},s})$) to put our absolute estimates in context.
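The improvement estimate, its approximate SE, and the relative versions can be sketched as follows. Two assumptions are made here: the SD is the sample SD (the text does not say which form was used), and the SE divisor is taken as $\sqrt{10}$, the usual scaling of an SE with a 10× larger sample. The function name and the toy $r^2$ values are hypothetical.

```python
import math
import statistics

def cv_improvement(r2_2vc, r2_blup, size_ratio=10):
    """Summarize per-split r^2 values from repeated 90%/10% splits.

    Returns (absolute improvement, its SE, relative improvement, relative SE).
    """
    diffs = [a - b for a, b in zip(r2_2vc, r2_blup)]
    improvement = statistics.mean(diffs)
    # Assumption: sample SD, scaled for the 10x larger full dataset.
    se = statistics.stdev(diffs) / math.sqrt(size_ratio)
    baseline = statistics.mean(r2_blup)
    return improvement, se, improvement / baseline, se / baseline

# Toy per-split values (hypothetical, three splits only).
r2_2vc = [0.08, 0.07, 0.09]
r2_blup = [0.06, 0.06, 0.07]
improvement, se, rel_improvement, rel_se = cv_improvement(r2_2vc, r2_blup)
```

Working with per-split differences, rather than the two sets of $r^2$ values separately, removes the variability that both methods share on a given split.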

Association Testing: Simulations with Simulated Genotypes

We conducted a suite of mixed-model association simulations by using genotypes simulated in a similar manner as above. We systematically varied the number of related individuals, the degree of relatedness, the number of markers ($M$) in the genome, and the SNP heritability ($h_g^2$) and total heritability ($h^2$) of the simulated trait. Specifically, we simulated sets of N = 1,000 diploid individuals, in which $N_{\mathrm{rel}}$ = 50, 125, 250, or 500 pairs of individuals were related (leaving 900, 750, 500, or 0 unrelated individuals, respectively). Each pair of individuals shared a proportion, namely $p$ = 0, 0.1, 0.2, 0.3, 0.4, or 0.5, of their genomes in expectation. Additionally, we varied the number of markers, using M = 1,000, 5,000, 10,000, or 20,000. We generated unlinked markers as above; for pairs of related individuals, with probability equal to the relatedness ($p$), the pair shared an allele drawn randomly; otherwise, the alleles for the pair were drawn independently. (As above, we ran this procedure twice per SNP to create diploid genotypes.) We also generated 100 additional candidate causal SNPs and 500 candidate null SNPs (at which to compute association test statistics) in the same way. We used an infinitesimal model to generate the phenotype: that is, we generated effect sizes for the observed SNPs from $N(0, h_g^2/M)$. We also generated effect sizes for the candidate causal SNPs from $N(0, (h^2 - h_g^2)/100)$. Because these SNPs are distinct from the $M$ SNPs used for model building, they effectively served as untyped causal loci. Finally, we formed the phenotype by multiplying the effect sizes with the genotypes and adding independent noise distributed as $N(0, (1 - h^2)I)$.

Association Testing: Simulations with Real Genotypes

We also assessed mixed-model association methods in simulation studies by using simulated phenotypes based on genotypes from the CARe and FHS datasets. To avoid proximal contamination,15, 30, 31 we tested SNPs on chromosomes 1 and 2 for association and used M observed SNPs on subsets of chromosomes 3–22 to compute GRMs, varying the number of chromosomes used in order to vary N/M. We generated quantitative phenotypes in which observed SNPs collectively explained 25% of variance and 250 causal SNPs from chromosome 1 explained another 25% of variance; all SNPs on chromosome 2 were null SNPs.

Results

Genetic Prediction: Simulations

To analyze the predictive power of the two-variance-component model, we simulated phenotypes based on genotypes from the CARe and FHS datasets as described in Material and Methods. In each simulation, we used the following procedure to measure the prediction accuracies of BLUP with the standard GRM as a single variance component, BLUP with the thresholded GRM as a single variance component, and BLUP with the two-variance-component model. First, we simulated phenotypes for all individuals (independently for each simulation replicate). Second, we randomly split the dataset, setting aside 90% of the individuals for training and 10% for testing. We then used each BLUP method to predict held-out test phenotypes by using the training samples to estimate genetic effects, and we calculated $r^2$ between the predicted phenotypes and the true genetic components of the simulated phenotypes (i.e., eliminating the added noise) on the test samples. (We chose to compute $r^2$ because it is a very widely used metric for assessing prediction accuracy;2, 3, 4, 6, 7, 9 however, other metrics such as mean square error are also sometimes used.5) We call this quantity “prediction $r^2(g)$”; on average, prediction $r^2(g)$ is $1/h^2$ times as large as standard prediction $r^2$, i.e., $r^2$ computed against simulated phenotypes that include both genetic and noise components. Relative performance of prediction methods is the same (on average) whether measured with prediction $r^2$ or prediction $r^2(g)$.
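Both accuracy measures are squared Pearson correlations and differ only in the target they are computed against. A minimal sketch, with a hypothetical function name and toy values:

```python
def prediction_r2(predicted, target):
    """Squared Pearson correlation between predictions and a target.

    Pass true genetic components as the target to obtain prediction r^2(g),
    or full phenotypes to obtain standard prediction r^2.
    """
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov * cov / (var_p * var_t)

r2_example = prediction_r2([1.0, 2.0, 3.0], [1.0, 3.0, 2.0])
print(r2_example)  # prints 0.25
```

Because the squared correlation is invariant to shifting and rescaling the predictions, it measures how well predictions rank and track the target rather than whether they are on the right scale.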

The two-variance-component model provided significant increases in $r^2(g)$ over both standard BLUP and BLUP using the thresholded GRM alone (Table 1), and the improvements were consistent across simulation replicates (Figure S1). We observed much larger prediction $r^2(g)$ values (across all methods) for the FHS simulations than for the CARe simulations, as expected given the much greater number of close relatives in the FHS dataset (18,415 pairs of individuals with genetic relatedness > 0.2 among 7,476 FHS individuals versus 4,954 pairs among 8,367 CARe individuals). However, the relative improvements achieved by the two-variance-component model were fairly similar in these two distinct pedigree structures, and importantly, increasing values of $N/M$ (mimicking larger sample sizes) also yielded similar relative improvements (Table 1). We also observed that the heritability parameter estimated by the standard mixed model was intermediate to $h_g^2$ and $h^2$, whereas the two-variance-component model accurately estimated $h_g^2$ and $h^2 - h_g^2$ (Table S1), as expected in samples with related individuals.17 (We note that because the sum of the entries of the thresholded GRM is nonzero, we used the general formula given in Speed et al.41 to estimate heritability parameters.) We also verified that in simulations with no untyped causal SNPs, the two-variance-component model produced no improvement over standard BLUP, indicating that our cross-validation scheme was immune to differences in model complexity (Table S2). Finally, we verified that simulations involving linkage disequilibrium between typed and untyped SNPs (achieved by setting typed SNPs to be the 90% of SNPs with highest MAF and untyped SNPs to be the 10% of SNPs with lowest MAF) produced similar results (Table S3). In these simulations, we also varied the fraction of heritability explained by typed versus untyped SNPs, and we observed that the two-variance-component model achieved larger gains for $h_g^2 \ll h^2$ and smaller gains for $h_g^2$ approaching $h^2$ (Table S3), consistent with our intuition that, if typed SNPs explain most of heritable variance, prediction using only typed SNPs achieves most of the available predictive power.

Table 1. Prediction Accuracy for Simulations Using CARe and FHS Genotypes

Entries are prediction $r^2(g)$, given as mean (SEM).

| Genotypes | Observed SNPs | BLUP | BLUP w/ Thresh. | 2VC BLUP |
| --- | --- | --- | --- | --- |
| CARe | chr 2–22 | 0.062 (0.002) | 0.061 (0.002) | 0.071 (0.002) |
| CARe | chr 3–6 | 0.084 (0.002) | 0.063 (0.002) | 0.094 (0.002) |
| CARe | chr 3–4 | 0.098 (0.002) | 0.059 (0.002) | 0.108 (0.002) |
| FHS | chr 2–22 | 0.225 (0.003) | 0.225 (0.003) | 0.238 (0.003) |
| FHS | chr 3–6 | 0.246 (0.003) | 0.230 (0.003) | 0.269 (0.003) |
| FHS | chr 3–4 | 0.263 (0.003) | 0.231 (0.003) | 0.291 (0.003) |

Phenotypes were simulated to have $h^2 = 0.5$ and $h_g^2 = 0.25$, and we measured prediction $r^2(g)$ by using a random 90% of samples as training data and the remaining 10% as test data. Reported values are mean prediction $r^2(g)$ and SEM over 100 independent simulations (in which phenotypes were re-simulated and training-test splits resampled). “BLUP w/ Thresh.” denotes BLUP prediction using the thresholded relationship matrix instead of the standard approach of using the GRM (denoted simply as “BLUP”). “2VC BLUP” denotes two-variance-component BLUP. “Prediction $r^2(g)$” denotes $r^2$ between predicted phenotypes and true genetic components of the simulated phenotypes.

We further assessed the potential performance of the two-variance-component approach at very large values of $N/M$ (up to 100) by simulating both genotypes and phenotypes (Material and Methods). (We note that human genotyping arrays typically contain 60,000 independent SNPs,15, 42 so $N/M = 8$ in this simulation corresponds to a dataset the size of UK Biobank, N = 500,000; see Web Resources.) In these simulations, we continued to observe gains when using the two-variance-component approach; two-variance-component prediction $r^2$ exceeded $h_g^2$ for very large $N$, whereas standard BLUP prediction $r^2$ was limited to less than $h_g^2$ (Figure S2).

Genetic Prediction: Real Phenotypes

Next, we evaluated the prediction accuracy of each method on CARe phenotypes (BMI, height, LDL, and HDL) and on FHS phenotypes (height and BMI). We adjusted phenotypes for age, sex, study center (for CARe phenotypes), and the top five principal components. (The complexities of the impact of ancestry on genetic prediction are discussed in Chen et al.43) To measure performance, we created 100 independent random 90%/10% splits of the dataset, as before, and calculated $r^2$ between predicted and true phenotypes on the test samples of each split. We observed that, for all phenotypes, the two-variance-component model increased prediction accuracy over both single-variance-component BLUP approaches, with a maximum relative improvement of 20% for height (Table 2); this improvement was consistent across different training/test splits (Figure S4). We observed no significant difference between the performance of the two single-variance-component BLUP approaches (Table 2). As in our simulations, we observed a larger absolute prediction $r^2$ in FHS than in CARe, due to strong relatedness (consistent with de los Campos et al.6), and we observed that the heritability parameter estimated by the standard mixed model was intermediate to the heritability parameters $\hat{h}_g^2$ and $\hat{h}^2$ estimated by the two-variance-component model (Table S4). We verified in the CARe data that evaluating prediction accuracy by using the mean square error metric produced near-identical results (Table S5).

Table 2. Prediction Accuracy for CARe and FHS Phenotypes

The first three value columns give prediction $r^2$; the last two give prediction $r^2$ relative to BLUP (SE).

| Phenotype | BLUP | BLUP w/ Thresh. | 2VC BLUP | BLUP w/ Thresh., relative to BLUP (SE) | 2VC BLUP, relative to BLUP (SE) |
| --- | --- | --- | --- | --- | --- |
| CARe prediction | | | | | |
| BMI | 0.023 | 0.027 | 0.029 | +14% (9%) | +18% (5%) |
| height | 0.063 | 0.067 | 0.079 | +5% (5%) | +20% (3%) |
| LDL | 0.017 | 0.017 | 0.019 | +2% (15%) | +11% (5%) |
| HDL | 0.034 | 0.032 | 0.038 | −7% (10%) | +11% (4%) |
| FHS prediction | | | | | |
| BMI | 0.103 | 0.104 | 0.107 | +1.0% (2.3%) | +3.5% (1.2%) |
| height | 0.344 | 0.342 | 0.354 | −0.7% (1.1%) | +2.9% (0.5%) |
| CARe prediction with genome-wide significant SNPs as fixed-effect covariates | | | | | |
| BMI | 0.023 | 0.026 | 0.028 | +14% (9%) | +19% (5%) |
| height | 0.063 | 0.066 | 0.078 | +5% (5%) | +20% (3%) |
| LDL | 0.038 | 0.039 | 0.041 | +3% (6%) | +6% (2%) |
| HDL | 0.051 | 0.049 | 0.055 | −4% (6%) | +7% (3%) |
| FHS prediction with genome-wide significant SNPs as fixed-effect covariates | | | | | |
| BMI | 0.105 | 0.107 | 0.109 | +1.2% (2.3%) | +3.5% (1.2%) |
| height | 0.344 | 0.341 | 0.354 | −0.8% (1.1%) | +2.8% (0.5%) |

Prediction $r^2$ values are means over 100 random 90% training and 10% test data splits. Relative performance values reported are ratios of means minus 1; SEs are estimated as SDs of per-split differences in $r^2$ (over the random 10% test sets) divided by $\sqrt{10}$ (to account for the 10× larger sample size of the full dataset; see Material and Methods). “BLUP w/ Thresh.” denotes BLUP prediction using the thresholded relationship matrix instead of the standard approach of using the GRM (denoted simply as “BLUP”). “2VC BLUP” denotes two-variance-component BLUP.

For phenotypes with a small number of large-effect loci, methods that explicitly model a non-infinitesimal genetic architecture can have substantially better prediction accuracy than standard BLUP.2 A two-variance-component approach could be combined with such models, and as an initial exploration of this approach, we examined a non-infinitesimal extension of two-variance-component BLUP in which we included large-effect loci as fixed-effect covariates.8 Explicitly, we first identified genome-wide significant SNPs ($p < 5 \times 10^{-8}$) according to a two-variance-component mixed-model association statistic. (As we show below, the standard MLM statistic is miscalibrated in scenarios with pervasive relatedness, precluding its use.) We then added these SNPs as fixed-effect covariates in all of the models we previously compared and recomputed predictions (Table 2). Including large-effect loci resulted in substantial improvements in prediction r2 achieved by each model for the CARe HDL and LDL phenotypes (Table 2), both of which are known to have several large-effect loci.44 As before, for all phenotypes, we observed an increase in r2 when using the two-variance-component model. We expect that the two-variance-component model will provide similar improvements in prediction r2 if incorporated in more sophisticated non-infinitesimal models (e.g., Erbe et al.3 and Zhou et al.5).

Additionally, we explored the scenario in which some phenotypes are available for ungenotyped relatives of genotyped individuals. We simulated data with ungenotyped individuals by randomly masking the genotypes of 25% of the training individuals. Results on simulated and real phenotypes when using this masking were broadly consistent with the results reported above in which all individuals were typed (Tables S6–S9).

Association Testing

We next compared mixed-model association testing using the two-variance-component approach to standard MLM association testing12, 15 in datasets with related individuals, measuring calibration and power for each method. We began by running a suite of tests using simulated genotypes and phenotypes, systematically varying the number of related individuals, the degree of relatedness, the number of markers in the genome, and the heritability of the simulated trait (see Material and Methods). Each simulation included both causal SNPs and “null SNPs,” i.e., SNPs with no phenotypic effect. For null SNPs, Wald statistics computed by mixed-model association tests follow a 1 df chi-square distribution, under the assumption that the mixed model accurately models the phenotypic covariance. If the mixed model does not accurately model the covariance, as we expect for phenotypes with $h_g^2 < h^2$ in datasets containing relatedness, then the distribution of association statistics at null SNPs is miscalibrated, i.e., approximately follows a scaled 1 df chi-square.45 We therefore measured calibration of MLM association methods by computing the mean Wald statistic over null SNPs. We measured power by dividing the mean Wald statistic over causal SNPs by the mean Wald statistic over null SNPs. Computing the ratio in the latter benchmark ensured that all methods, including those susceptible to inflation of test statistics, were equally calibrated before we compared power.
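The two benchmarks just defined are simple enough to state in code. The sketch below is illustrative only: the function names are hypothetical, and the statistics are drawn from a toy model (null z-scores, causal z-scores shifted by an arbitrary amount) rather than from mixed-model fits. It shows why the ratio-based power measure is unaffected by a uniform scaling of the test statistics.

```python
import numpy as np

def mean_null_wald(wald, is_causal):
    """Calibration benchmark: mean Wald statistic over null SNPs (1 if calibrated)."""
    return float(np.mean(wald[~is_causal]))

def calibrated_power(wald, is_causal):
    """Power benchmark: mean Wald over causal SNPs divided by mean Wald over null SNPs."""
    return float(np.mean(wald[is_causal]) / np.mean(wald[~is_causal]))

# Toy statistics (assumption): null SNPs have standard normal z-scores and
# causal SNPs have z-scores shifted by an arbitrary noncentrality of 3.
rng = np.random.default_rng(1)
n_null, n_causal = 50_000, 500
is_causal = np.r_[np.zeros(n_null, dtype=bool), np.ones(n_causal, dtype=bool)]
z = rng.standard_normal(n_null + n_causal)
z[is_causal] += 3.0
wald_calibrated = z ** 2
wald_inflated = 1.10 * wald_calibrated  # scaled chi-square: 10% inflation everywhere

print(mean_null_wald(wald_calibrated, is_causal))   # close to 1
print(mean_null_wald(wald_inflated, is_causal))     # close to 1.10
print(calibrated_power(wald_calibrated, is_causal),
      calibrated_power(wald_inflated, is_causal))   # identical ratios
```

Dividing by the null mean removes any common scale factor, so an inflated method cannot appear more powerful merely because all of its statistics are larger.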

Contrary to previous work suggesting that mixed models fully correct for relatedness,11, 13, 14, 15, 26 we found that for many parameter settings, standard MLM association analysis produced significantly inflated test statistics (up to 11% inflation, increasing with trait heritability, sample size, and extent of relatedness; Figure 1). In contrast, introducing a second variance component—either the thresholded GRM (Figure 1) or the true pedigree (Figure S4)—nearly eliminated the inflation. For all parameter settings, we observed that, compared to standard MLM association, the two-variance-component model maintained or slightly increased power (Figure S4).

Figure 1.


Calibration of Standard and Two-Variance-Component Mixed-Model Association Statistics on Simulated Genotypes and Phenotypes

We computed mean Wald statistics over null SNPs by using the standard mixed-model association test (MLM) and a two-variance-component model (2 var. comp. MLM) using GRM and thresholded GRM (i.e., approximate pedigree) components. Each panel shows results from a set of simulations with selected values of the simulation parameters N/M, h2, and hg2. The set of simulations contained within each panel varies one additional parameter, NS, which measures the amount of relatedness in the simulated data. (S denotes the average squared off-diagonal entry of the pedigree relationship matrix.) Plotted values are mean Wald statistics and SEM over 100 simulations.

Next, we simulated phenotypes based on genotypes from the CARe and FHS datasets (Material and Methods). Consistent with the previous simulations, standard MLM association produced inflated statistics (as measured in test statistics from chromosome 2, simulated to contain no causal SNPs) whereas the two-variance-component model alleviated inflation (Table 3; also see type I errors in Table S10). Importantly, these results suggest that the levels of relatedness that are required for inflation are present in real datasets.

Table 3.

Calibration of Standard and Two-Variance-Component Mixed-Model Association Statistics in CARe and FHS Simulations

| Observed SNPs | No. of SNPs (M) | Standard Mixed Model: Mean Wald | Two Variance Components: Mean Wald | Two Variance Components: Threshold (t) |
|---|---|---|---|---|
| CARe genotypes | | | | |
| chr 3–22 | 615,445 | 1.013 (0.002) | 1.000 (0.002) | 0.024 |
| chr 3–6 | 195,333 | 1.024 (0.002) | 1.002 (0.002) | 0.051 |
| chr 3–4 | 99,690 | 1.028 (0.002) | 1.003 (0.002) | 0.081 |
| chr 22 | 9,713 | 1.036 (0.002) | 1.014 (0.002) | 0.387 |
| FHS genotypes | | | | |
| chr 3–22 | 346,005 | 1.032 (0.003) | 1.003 (0.003) | 0.021 |
| chr 3–6 | 110,203 | 1.071 (0.003) | 1.008 (0.003) | 0.040 |
| chr 3–4 | 55,480 | 1.097 (0.003) | 1.014 (0.003) | 0.055 |
| chr 22 | 5,277 | 1.189 (0.004) | 1.055 (0.003) | 0.258 |

Mean Wald statistics on candidate null SNPs for simulations with CARe or FHS genotypes and a trait with $h^2 = 0.5$, $h_g^2 = 0.25$. Reported values are means and SEM over 100 simulations. The two-variance-component model selected the specified threshold (t) to estimate the relatedness matrix. In simulations using only SNPs on chromosome 22 to compute GRMs, we observed slight inflation when using the two-variance-component model; given the large thresholds (t > 0.25) chosen by the model in these scenarios, we hypothesize that the number of SNPs was too small to distinguish relatedness from noise in the GRM, causing an incomplete correction. For corresponding type I error at different α levels, see Table S10.

Finally, we analyzed MLM association statistics for the CARe and FHS phenotypes (adjusted for covariates as before). Because we do not know the identity of causal and null SNPs in this case, we calculated the average Wald statistic over all SNPs by using leave-one-chromosome-out analysis,15, 30 noting that we expected the statistics to be slightly larger than 1 due to polygenicity.15, 42 Consistent with simulations, the average Wald statistics were higher for standard MLM association than for the two-variance-component method, suggesting that standard MLM statistics are slightly inflated, with an up to 1.05-fold inflation in FHS data (Table 4). Analysis of genomic inflation factors (λGC)46 corroborated these results (Table 4). We also compared our test statistics (which involve approximations, as in previous work13, 15, 16; see Material and Methods) to exact likelihood ratio test statistics under the two-variance-component model and verified that the approximate versus exact statistics were near identical (r2 = 0.999997; Figure S5).
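For readers unfamiliar with the genomic inflation factor, the following sketch computes it under its usual definition: the median observed 1 df test statistic divided by the median of a 1 df chi-square distribution. The function name and the toy statistics are illustrative assumptions and are not taken from the study data.

```python
import numpy as np
from statistics import NormalDist

# Median of a 1 df chi-square: the square of the 75th percentile of a
# standard normal, approximately 0.455.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2

def genomic_inflation_factor(stats):
    """lambda_GC: median observed statistic over the 1 df chi-square median."""
    return float(np.median(stats) / CHI2_1DF_MEDIAN)

# Toy statistics (assumption): squared standard normal draws, then the same
# values scaled by 1.05 to mimic a uniform 5% inflation.
rng = np.random.default_rng(2)
null_stats = rng.standard_normal(100_000) ** 2
print(genomic_inflation_factor(null_stats))          # close to 1
print(genomic_inflation_factor(1.05 * null_stats))   # close to 1.05
```

The median is less sensitive than the mean to a handful of very large statistics at strongly associated SNPs, which is why both summaries are commonly reported side by side.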

Table 4.

Calibration of Standard and Two-Variance-Component Mixed-Model Association Statistics for CARe and FHS Phenotypes

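| Phenotype | N | Standard Mixed Model: Mean Wald | Standard Mixed Model: λGC | Standard Mixed Model: $\hat{h}_g^2$ | Two Variance Components: Mean Wald | Two Variance Components: λGC | Two Variance Components: $\hat{h}_g^2$ | Two Variance Components: $\hat{h}^2$ |
|---|---|---|---|---|---|---|---|---|
| CARe phenotypes | | | | | | | | |
| BMI | 7,987 | 1.044 | 1.044 | 0.35 | 1.029 | 1.027 | 0.17 | 0.46 |
| height | 7,988 | 1.110 | 1.099 | 0.73 | 1.080 | 1.070 | 0.38 | 0.95 |
| LDL | 4,965 | 1.030 | 1.026 | 0.32 | 1.021 | 1.017 | 0.18 | 0.44 |
| HDL | 5,184 | 1.054 | 1.046 | 0.50 | 1.037 | 1.028 | 0.26 | 0.66 |
| FHS phenotypes | | | | | | | | |
| BMI | 7,476 | 1.060 | 1.058 | 0.43 | 1.032 | 1.032 | 0.21 | 0.47 |
| height | 7,476 | 1.126 | 1.112 | 0.81 | 1.070 | 1.058 | 0.39 | 0.87 |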

We report the number of individuals (N) phenotyped for each trait and the mean Wald statistics and heritability parameters computed by the standard and two-variance-component mixed models (averaged over 22 leave-one-chromosome-out runs).

Discussion

We have shown that a mixed model with two variance components, one modeling genetic effects of typed SNPs and the other modeling phenotypic covariance from close relatives, offers increased prediction accuracy over standard BLUP and corrects miscalibration of standard mixed-model association analysis in human datasets containing strong relatedness. For current sample sizes and levels of relatedness, the absolute increase in prediction accuracy is modest (similar to other recent work on improving prediction accuracy for human complex traits,5, 7, 8, 9, 10 in contrast to agricultural traits2, 3, 4) and the inflation of standard mixed-model test statistics is small. However, our simulations suggest that, for larger sample sizes, the effects of relatedness will become more pronounced, so we expect the two-variance-component model to become increasingly relevant as sample sizes increase.

Although we are not aware of prior work in human genetics that involves using two variance components to model effects of typed SNPs as well as additional phenotypic covariance from close relatives, other methods for combining these two sources of information for prediction have been proposed; however, these methods either use only a limited number of genome-wide significant SNPs24 or use only limited information about family history.25 Separately, several studies have applied different multiple-variance-component models to improve mixed-model prediction and association in other ways. Widmer et al.26 recently proposed a two-variance-component model that uses the standard GRM along with a GRM created from selected SNPs (as in FaST-LMM-Select31) that improves association power and calibration in family studies. (We note that, although Widmer et al. observe that standard mixed-model association is properly calibrated in their simulated family datasets, their simulations do not include untyped causal SNPs.) In another direction, Speed et al.7 recently proposed a multiple variance component model that partitions SNPs into contiguous blocks, each used in a distinct variance component, and showed that this approach improves prediction accuracy. Incorporation of a variance component modeling relatedness—either from pedigree, thresholding the GRM, or other approaches47—into these methods or into recently proposed non-infinitesimal models for genetic prediction (e.g., weighted G-BLUP,6 BayesR,3, 10 or BSLMM5) is a possible direction for future research.

A challenge facing all genetic prediction methods is the very large sample sizes that will be required to achieve clinically relevant prediction accuracy.25, 48 Indeed, in absolute terms, the prediction accuracy we achieved on real datasets of up to 8,000 samples was low, similar to other methods when applied to traits without large-effect loci.5, 6, 10 Our simulations show that the two-variance-component approach we have proposed will maintain its relative improvement over standard BLUP as sample sizes increase; however, both of these methods face computational barriers at large N. (Standard BLUP does have the advantage that $O(N^3)$-time computation is required only for fitting the model but not for computing predictions on new samples; in contrast, a straightforward implementation of our two-variance-component method for prediction requires $O(N^3)$ time per REML iteration when estimating variance parameters as well as when computing predictions, a consequence of the need to combine training and target genotypes.) These limitations could potentially be overcome by using a combination of rapid relationship inference,49 fast multiple-variance-component analysis (e.g., as implemented in BOLT-REML50), and iterative solution of the mixed-model equations.51, 52 Similarly, the computational challenge of large-scale two-variance-component association analysis could potentially be addressed by extending fast iterative methods for mixed-model association.16 An alternative, computationally simple solution to inflation of association-test statistics is LD score regression;53 however, this approach might incur slight deflation as a result of attenuation bias.16, 53
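To illustrate what an iterative solution can look like, the sketch below solves a linear system in the two-variance-component covariance with conjugate gradient, using only matrix-vector products so that the N × N GRM is never formed or factorized. This is an illustration under stated assumptions and not the method of the cited references: conjugate gradient is one standard iterative solver chosen here for concreteness, the function names are hypothetical, the variance parameters are arbitrary, and the relatedness component is a small block-diagonal sib-pair matrix standing in for a sparse pedigree or thresholded matrix (kept positive semidefinite so that the solver's assumptions hold).

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve V x = b for symmetric positive-definite V given only v -> V v."""
    x = np.zeros_like(b, dtype=float)
    r = b.astype(float).copy()          # residual b - V x, with x = 0
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Vp = matvec(p)
        step = rs / (p @ Vp)
        x += step * p
        r -= step * Vp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def make_covariance_matvec(W, K_rel, sg2, sh2, se2):
    """Return v -> (sg2 * W W^T / M + sh2 * K_rel + se2 * I) v without forming W W^T."""
    M = W.shape[1]
    def matvec(v):
        return sg2 * (W @ (W.T @ v)) / M + sh2 * (K_rel @ v) + se2 * v
    return matvec

# Toy inputs (assumptions): random normalized genotypes and 100 sib pairs.
rng = np.random.default_rng(3)
N, M = 200, 500
W = rng.standard_normal((N, M))
K_rel = np.kron(np.eye(N // 2), np.array([[1.0, 0.5], [0.5, 1.0]]))  # dense here; sparse in practice
y = rng.standard_normal(N)

matvec = make_covariance_matvec(W, K_rel, sg2=0.3, sh2=0.2, se2=0.5)
x_cg = conjugate_gradient(matvec, y)
```

Each iteration costs O(NM) for the genotype products plus the cost of one product with the relatedness matrix, which is small when that matrix is sparse; the number of iterations depends on the conditioning of the covariance.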

We also note four additional limitations of our two-variance-component approach. First, the method is only applicable to datasets with related individuals for which genotypes are available for analysis; however, large human datasets of this type are now being generated: deCODE Genetics has genotyped >30% of the Icelandic population,54 the UK Biobank will soon have genotypes for N = 500,000 individuals (close to 1% of the UK population; see Web Resources), and 23andMe has assembled an even larger cohort.55 Second, the improved predictive performance of the two-variance-component approach is a function of the relatedness structure. Our parallel work in cattle has reported improved prediction accuracy when using a two-variance-component model incorporating exact pedigree information56 or breed information;57 however, the two-variance-component model did not produce an improvement in analyses of Holstein dairy cattle (Table S11), perhaps because of the very small effective population size of this breed.58 Third, although the intuition behind the two-variance-component model is to capture effects of rare variants not tagged by SNP arrays, our observed gains in prediction accuracy could also be partially explained by the approximate pedigree component picking up shared environment or epistasis; as such, care is needed in interpreting fitted variance parameters as heritability estimates.17 Fourth, our approach does not address case-control ascertainment. Although many large family datasets are not ascertained for phenotype, investigating whether techniques employed by methods that do model ascertainment8 can be integrated into our two-variance-component approach is a possible avenue for future work.

Acknowledgments

We are grateful to N. Zaitlen, B. Vilhjalmsson, S. Rosset, and H. Johnsen for helpful discussions. This research was supported by NIH grants R01 HG006399, R01 GM105857, and R01 GM108348 and by NIH fellowship F32 HG007805.

Published: November 5, 2015

Footnotes

Supplemental Data include eleven tables and can be found with this article online at http://dx.doi.org/10.1016/j.ajhg.2015.10.002.

Appendix A

Formulas for Two-Variance-Component Mixed-Model Prediction

Here, we provide explicit formulas for computing best linear unbiased predictions under the two-variance-component model

$$y \sim N\left(0,\; WW^T\sigma_g^2/M + (WW^T/M)_{>t}\,\sigma_h^2 + I\sigma_e^2\right),$$

combining Equation 9 with the approximation of pedigree by using thresholded IBS, Equation 10, and leaving out fixed effects for simplicity. We assume that we have a set of training individuals (denoted with subscript $-i$) for whom we have both genotype and phenotype information and that we have a set of testing individuals (denoted with subscript $i$) for whom we have only genotype information and wish to predict phenotypes. Thus, under this notation, $W_i$ denotes the submatrix of genotypes from testing individuals and $W_{-i}$ denotes the submatrix of genotypes from training individuals.

Under the assumption that the variance parameters $\sigma_g^2$, $\sigma_h^2$, and $\sigma_e^2$ have already been fitted (e.g., by using REML on the training individuals), the BLUP predictions for the test phenotypes are given by

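$$\hat{y}_i = W_i\hat{\beta} + \sigma_h^2\,(W_iW_{-i}^T/M)_{>t}\left(W_{-i}W_{-i}^T\sigma_g^2/M + (W_{-i}W_{-i}^T/M)_{>t}\,\sigma_h^2 + I\sigma_e^2\right)^{-1}y_{-i},$$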

where

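$$\hat{\beta} = \frac{\sigma_g^2}{M}\,W_{-i}^T\left(W_{-i}W_{-i}^T\sigma_g^2/M + (W_{-i}W_{-i}^T/M)_{>t}\,\sigma_h^2 + I\sigma_e^2\right)^{-1}y_{-i}.$$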

We note that the first term of the formula for $\hat{y}_i$, namely $W_i\hat{\beta}$, has the form of a simple matrix-vector product between genotypes of testing individuals and a vector ($\hat{\beta}$) of SNP weights, as is the case for standard (one-variance-component) genomic BLUP. This term is easy to compute on testing individuals once we have estimated $\hat{\beta}$ by using the training data, whereas the second term of the prediction formula requires more computation.
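The prediction formula can be sketched directly in NumPy. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical; the thresholding operator is assumed to keep entries of the GRM that exceed t and set the rest to zero (Equation 10 is not reproduced in this appendix); the variance parameters and threshold are arbitrary values rather than REML fits; and the SNP weights carry a factor 1/M so that the first term equals the SNP-based covariance between test and training individuals, $\sigma_g^2 W_iW_{-i}^T/M$, applied to the training phenotypes.

```python
import numpy as np

def threshold_grm(K, t):
    """Assumed thresholding operator: keep entries above t, set the rest to 0."""
    return np.where(K > t, K, 0.0)

def two_vc_blup(W_train, y_train, W_test, sg2, sh2, se2, t):
    """Two-variance-component BLUP; returns (predictions, SNP weights)."""
    M = W_train.shape[1]
    K_train = W_train @ W_train.T / M            # GRM among training individuals
    K_cross = W_test @ W_train.T / M             # GRM between test and training
    V = sg2 * K_train + sh2 * threshold_grm(K_train, t) + se2 * np.eye(len(y_train))
    alpha = np.linalg.solve(V, y_train)          # V^{-1} y_{-i}
    beta_hat = (sg2 / M) * (W_train.T @ alpha)   # SNP weights (first term)
    snp_term = W_test @ beta_hat
    relatedness_term = sh2 * (threshold_grm(K_cross, t) @ alpha)
    return snp_term + relatedness_term, beta_hat

# Toy data (assumption): random standardized genotypes and an arbitrary trait.
rng = np.random.default_rng(4)
W = rng.standard_normal((70, 200))
W = (W - W.mean(axis=0)) / W.std(axis=0)
W_train, W_test = W[:60], W[60:]
effects = rng.standard_normal(200) * np.sqrt(0.5 / 200)
y_train = W_train @ effects + np.sqrt(0.5) * rng.standard_normal(60)

pred, beta_hat = two_vc_blup(W_train, y_train, W_test, sg2=0.3, sh2=0.2, se2=0.5, t=0.2)
```

Only the first term can be reduced to fixed per-SNP weights; the second term needs the thresholded relationships between each new individual and the training set, which is why it costs more at prediction time.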

This observation suggests the possibility of performing prediction by using only the first term as a computationally efficient alternative to carrying out the full two-variance-component computation. We tested the performance of this approach on the CARe height phenotype but found no significant difference in its performance versus that of standard BLUP: –1% (SE 3%) change in prediction r2, in contrast to the +20% (SE 3%) change in prediction r2 of the full two-variance-component approach over standard BLUP (Table 2). This observation indicates that the gain in prediction accuracy achieved by the two-variance-component model is largely a result of capturing effects of rare variants and requires the use of both variance components.

Two-Variance-Component Mixed Model with Ungenotyped Individuals

Here, we derive extensions of two-variance-component mixed-model prediction and association testing to make use of data available from additional phenotyped but ungenotyped relatives of genotyped individuals. In this case, we assume that we are given the pedigree relationship matrix (Θ) among all individuals, both typed and untyped.

We will use subscripts u and t to denote submatrices of genotype and relationship matrices corresponding to untyped and typed individuals (e.g., $W_u$ is the matrix of [unobserved] genotypes for the ungenotyped individuals, and $W_t$ is the matrix of genotypes for the typed individuals, so that $W_u$ and $W_t$ together comprise the genotype matrix $W$). Because $W_u$ is unknown, we need a distribution on $W$ to describe the relationship between the genotypes of typed and untyped individuals. For modeling purposes, we assume that normalized SNPs in $W_u$ are independently drawn from a multivariate normal distribution according to the pedigree structure: $N(0, \Theta)$.

To adapt our pedigree-based two-variance-component model (Equation 9),

$$y \sim N\left(Xb,\; WW^T\sigma_g^2/M + \Theta\sigma_h^2 + I\sigma_e^2\right),$$

to deal with the fact that $W$ now has a subset of unknown entries, we would ideally marginalize over the unobserved genotypes, assuming the above distribution on $W_u$. However, this approach leads to an intractable integral. Instead, we assume that y is normally distributed conditional on the observed genotypes, in which case it suffices to compute the mean and covariance of y. It is straightforward to see that the mean of y is Xb (assuming that we observe the covariates for all individuals), and the covariance of y is

$$V := \mathrm{Cov}(y) = \sigma_g^2\,E[WW^T/M \mid W_t] + \sigma_h^2\,\Theta + \sigma_e^2 I.$$

Therefore, this procedure amounts to replacing the unobserved GRM with the expected GRM,

$$E_{W_u}[WW^T/M \mid W_t],$$

where the expectation is over the unobserved genotypes. By using standard properties of the normal distribution, we can compute the required moments

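$$E[W_u \mid W_t] = \Theta_{ut}\Theta_{tt}^{-1}W_t,$$

$$E[W_uW_u^T \mid W_t] = M\Theta_{uu} - M\Theta_{ut}\Theta_{tt}^{-1}\Theta_{tu} + E[W_u \mid W_t]\,E[W_u^T \mid W_t] = M\Theta_{uu} + \Theta_{ut}\left(\Theta_{tt}^{-1}W_tW_t^T\Theta_{tt}^{-1} - M\Theta_{tt}^{-1}\right)\Theta_{tu}.$$

Thus,

$$E_{W_u}[WW^T/M \mid W_t] = \begin{bmatrix} \Theta_{uu} + \Theta_{ut}\left(\Theta_{tt}^{-1}W_tW_t^T\Theta_{tt}^{-1}/M - \Theta_{tt}^{-1}\right)\Theta_{tu} & \Theta_{ut}\Theta_{tt}^{-1}W_tW_t^T/M \\ W_tW_t^T\Theta_{tt}^{-1}\Theta_{tu}/M & W_tW_t^T/M \end{bmatrix}.$$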

Legarra, Misztal, and Aguilar developed the same variance component to incorporate genetic marker information with pedigree information in the context of cattle phenotype prediction.32, 33, 34
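The expected GRM can be assembled block by block from the conditional mean $\Theta_{ut}\Theta_{tt}^{-1}W_t$ and the conditional covariance $\Theta_{uu} - \Theta_{ut}\Theta_{tt}^{-1}\Theta_{tu}$ of the untyped genotypes. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the pedigree (two parents and two children, with the second child untyped) is a toy example, and the typed genotypes are random numbers rather than data simulated from the pedigree.

```python
import numpy as np

def expected_grm(Theta, W_t, typed):
    """E[W W^T / M | W_t], with rows and columns ordered as in Theta.

    Theta : (N, N) pedigree relationship matrix for all individuals
    W_t   : (N_typed, M) normalized genotypes of the typed individuals
    typed : length-N boolean mask marking the typed individuals
    """
    typed = np.asarray(typed, dtype=bool)
    untyped = ~typed
    M = W_t.shape[1]
    Th_tt = Theta[np.ix_(typed, typed)]
    Th_ut = Theta[np.ix_(untyped, typed)]
    Th_uu = Theta[np.ix_(untyped, untyped)]
    G_tt = W_t @ W_t.T / M                     # observed GRM among typed individuals
    B = np.linalg.solve(Th_tt, Th_ut.T).T      # Theta_ut Theta_tt^{-1} (Theta_tt is symmetric)
    H = np.empty(Theta.shape, dtype=float)
    H[np.ix_(typed, typed)] = G_tt
    H[np.ix_(untyped, typed)] = B @ G_tt
    H[np.ix_(typed, untyped)] = G_tt @ B.T
    # Conditional covariance plus outer product of the conditional mean, over M.
    H[np.ix_(untyped, untyped)] = Th_uu - B @ Th_ut.T + B @ G_tt @ B.T
    return H

# Toy pedigree (assumption): parents 0 and 1, children 2 and 3; child 3 untyped.
Theta = np.array([[1.0, 0.0, 0.5, 0.5],
                  [0.0, 1.0, 0.5, 0.5],
                  [0.5, 0.5, 1.0, 0.5],
                  [0.5, 0.5, 0.5, 1.0]])
typed = np.array([True, True, True, False])
rng = np.random.default_rng(5)
W_t = rng.standard_normal((3, 500))            # placeholder typed genotypes
H = expected_grm(Theta, W_t, typed)
```

A useful sanity check on any implementation is that when the observed GRM among typed individuals equals $\Theta_{tt}$ exactly, the expected GRM reduces to $\Theta$ itself.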

The new phenotype model

$$y \sim N\left(Xb,\; \sigma_g^2\,E[WW^T/M \mid W_t] + \sigma_h^2\,\Theta + \sigma_e^2 I\right) = N(Xb, V)$$

immediately enables BLUP prediction as before. However, for association testing, the test SNP $w = [w_u; w_t]$ is not completely specified, so we need a novel association statistic that accounts for the uncertainty in $w$. Assuming that $w \sim N(0, \Theta)$, the model for y simplifies to

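$$y \sim N\left(\begin{bmatrix}\Theta_{ut}\Theta_{tt}^{-1}w_t \\ w_t\end{bmatrix}\beta + Xb,\; \begin{bmatrix}\Theta_{uu} - \Theta_{ut}\Theta_{tt}^{-1}\Theta_{tu} & 0 \\ 0 & 0\end{bmatrix}\beta^2 + V\right),$$

and $\tilde{w} = [\Theta_{ut}\Theta_{tt}^{-1}w_t; w_t]$ can be interpreted as the BLUP imputation of the missing genotypes.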

This distribution for y gives rise to a score statistic as follows. We start from the log likelihood function

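$$\log p(y \mid V, X, W, \tilde{w}, \Theta, \beta) = -0.5\left(\log|V + A\beta^2| + (y - \tilde{w}\beta - Xb)^T(V + A\beta^2)^{-1}(y - \tilde{w}\beta - Xb)\right)$$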

(up to a constant that does not depend on y or β), where

$$A = \begin{bmatrix}\Theta_{uu} - \Theta_{ut}\Theta_{tt}^{-1}\Theta_{tu} & 0 \\ 0 & 0\end{bmatrix}$$

is the adjustment to the variance. Expanding around β=0, the log likelihood simplifies to

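$$-2\log p(y \mid V, X, W, \tilde{w}, \Theta, \beta) = \log|V| + \mathrm{trace}(V^{-1}A)\beta^2 + (y - \tilde{w}\beta - Xb)^TV^{-1}(y - \tilde{w}\beta - Xb) - (y - Xb)^T(V^{-2}A)(y - Xb)\beta^2 + O(\beta^3),$$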
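As a brief sketch of the intermediate step (added for readers who want it; not part of the original derivation), the expansion rests on two standard second-order approximations around β = 0:

$$\log|V + A\beta^2| = \log|V| + \mathrm{trace}(V^{-1}A)\beta^2 + O(\beta^4), \qquad (V + A\beta^2)^{-1} = V^{-1} - V^{-1}AV^{-1}\beta^2 + O(\beta^4).$$

Substituting the second identity into the quadratic form contributes $-(y - Xb)^TV^{-1}AV^{-1}(y - Xb)\beta^2$ at second order, because the cross terms involving $\tilde{w}\beta$ are $O(\beta^3)$. The coefficient matrix written compactly as $V^{-2}A$ coincides with $V^{-1}AV^{-1}$ when V and A commute; in either form its expectation under the null is $\mathrm{trace}(V^{-1}A)$, since $\mathrm{trace}(VV^{-1}AV^{-1}) = \mathrm{trace}(V^{-1}A)$.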

so the score function is

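$$U(\beta) = \frac{\partial \log p(y \mid V, X, W, \tilde{w}, \Theta, \beta)}{\partial\beta} = -\left(\mathrm{trace}(V^{-1}A)\beta - (y - Xb)^TV^{-1}\tilde{w} + \tilde{w}^TV^{-1}\tilde{w}\,\beta - (y - Xb)^T(V^{-2}A)(y - Xb)\beta\right) + O(\beta^2),$$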

and

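$$\frac{\partial U}{\partial\beta} = -\left(\mathrm{trace}(V^{-1}A) + \tilde{w}^TV^{-1}\tilde{w} - (y - Xb)^T(V^{-2}A)(y - Xb)\right) + O(\beta).$$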

Hence, the score statistic to test the hypothesis that β=0 is

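$$\frac{U(0)^2}{-\partial U/\partial\beta\,(0)} = \frac{\left(\tilde{w}^TV^{-1}(y - Xb)\right)^2}{\tilde{w}^TV^{-1}\tilde{w} + \mathrm{trace}(V^{-1}A) - (y - Xb)^T(V^{-2}A)(y - Xb)},$$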

where the nuisance parameters ($b, \sigma_g^2, \sigma_h^2, \sigma_e^2$) are set to their maximum likelihood values when β = 0. The score statistic is asymptotically distributed as $\chi^2$ with 1 df. It is possible to adjust the statistic slightly to produce a nearly equivalent statistic that is easier to compute and precisely $\chi^2$-distributed under the null distribution. Observe that

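$$E\left[(y - Xb)^T(V^{-2}A)(y - Xb)\right] = E\left[\mathrm{trace}\left((y - Xb)^T(V^{-2}A)(y - Xb)\right)\right] = \mathrm{trace}\left(E\left[(y - Xb)(y - Xb)^T\right]V^{-2}A\right) = \mathrm{trace}(VV^{-2}A) = \mathrm{trace}(V^{-1}A),$$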

resulting in the simplified statistic

$$\frac{\left(\tilde{w}^TV^{-1}(y - Xb)\right)^2}{\tilde{w}^TV^{-1}\tilde{w}},$$

which is $\chi^2$-distributed under the null distribution. Notably, when V only incorporates pedigree information (i.e., $V = \Theta\sigma_h^2 + I\sigma_e^2$), we recover a prospective analog to the retrospective MASTOR statistic.28
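The simplified statistic needs only the imputed test SNP, the fitted covariance V, and the phenotype residual. The sketch below illustrates it under stated assumptions: the function names are hypothetical, the pedigree is a toy trio with an untyped child, and V uses the pedigree-only form with arbitrary variance parameters rather than fitted values.

```python
import numpy as np

def impute_test_snp(Theta, w_t, typed):
    """BLUP imputation of a test SNP: untyped entries are Theta_ut Theta_tt^{-1} w_t."""
    typed = np.asarray(typed, dtype=bool)
    Th_tt = Theta[np.ix_(typed, typed)]
    Th_ut = Theta[np.ix_(~typed, typed)]
    w_tilde = np.empty(Theta.shape[0], dtype=float)
    w_tilde[typed] = w_t
    w_tilde[~typed] = Th_ut @ np.linalg.solve(Th_tt, w_t)
    return w_tilde

def simplified_score_statistic(w_tilde, V, residual):
    """(w~^T V^{-1} r)^2 / (w~^T V^{-1} w~), where r = y - Xb."""
    Vinv_r = np.linalg.solve(V, residual)
    Vinv_w = np.linalg.solve(V, w_tilde)
    return float((w_tilde @ Vinv_r) ** 2 / (w_tilde @ Vinv_w))

# Toy trio (assumption): two typed, unrelated parents and one untyped child.
Theta = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.5],
                  [0.5, 0.5, 1.0]])
typed = np.array([True, True, False])
w_t = np.array([1.0, -1.0])                   # normalized genotypes of the parents
w_tilde = impute_test_snp(Theta, w_t, typed)  # child is imputed as the parental average

V = 0.4 * Theta + 0.6 * np.eye(3)             # pedigree-only covariance, arbitrary parameters
residual = np.array([0.5, -0.2, 0.1])         # stand-in for y - Xb
stat = simplified_score_statistic(w_tilde, V, residual)
```

The statistic is unchanged if the SNP is rescaled, so the normalization applied to the genotypes does not affect the test.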

Alternatively, in a similar manner, we can compute a retrospective statistic by considering the score statistic produced by $\log p(w \mid y)$ instead. Analogous manipulations yield the retrospective statistic

$$\frac{\left(\tilde{w}^TV^{-1}(y - Xb)\right)^2}{(y - Xb)^TV^{-1}\,\Theta_{\cdot t}\Theta_{tt}^{-1}\Theta_{t\cdot}\,V^{-1}(y - Xb)}.$$

Given that the results from the retrospective model and prospective model are similar (data not shown), we focused on the more commonly used prospective approach.

Web Resources

The URLs for data presented herein are as follows:

Supplemental Data

Document S1. Tables S1–S11
Document S2. Article plus Supplemental Data

References

1. Henderson C.R. Best linear unbiased estimation and prediction under a selection model. Biometrics. 1975;31:423–447.
2. Meuwissen T.H., Hayes B.J., Goddard M.E. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157:1819–1829. doi: 10.1093/genetics/157.4.1819.
3. Erbe M., Hayes B.J., Matukumalli L.K., Goswami S., Bowman P.J., Reich C.M., Mason B.A., Goddard M.E. Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J. Dairy Sci. 2012;95:4114–4129. doi: 10.3168/jds.2011-5019.
4. Habier D., Fernando R.L., Garrick D.J. Genomic BLUP decoded: a look into the black box of genomic prediction. Genetics. 2013;194:597–607. doi: 10.1534/genetics.113.152207.
5. Zhou X., Carbonetto P., Stephens M. Polygenic modeling with bayesian sparse linear mixed models. PLoS Genet. 2013;9:e1003264. doi: 10.1371/journal.pgen.1003264.
6. de los Campos G., Vazquez A.I., Fernando R., Klimentidis Y.C., Sorensen D. Prediction of complex human traits using the genomic best linear unbiased predictor. PLoS Genet. 2013;9:e1003608. doi: 10.1371/journal.pgen.1003608.
7. Speed D., Balding D.J. MultiBLUP: improved SNP-based prediction for complex traits. Genome Res. 2014;24:1550–1557. doi: 10.1101/gr.169375.113.
8. Golan D., Rosset S. Effective genetic-risk prediction using mixed models. Am. J. Hum. Genet. 2014;95:383–393. doi: 10.1016/j.ajhg.2014.09.007.
9. Maier R., Moser G., Chen G.-B., Ripke S., Coryell W., Potash J.B., Scheftner W.A., Shi J., Weissman M.M., Hultman C.M., Cross-Disorder Working Group of the Psychiatric Genomics Consortium. Joint analysis of psychiatric disorders increases accuracy of risk prediction for schizophrenia, bipolar disorder, and major depressive disorder. Am. J. Hum. Genet. 2015;96:283–294. doi: 10.1016/j.ajhg.2014.12.006.
10. Moser G., Lee S.H., Hayes B.J., Goddard M.E., Wray N.R., Visscher P.M. Simultaneous discovery, estimation and prediction analysis of complex traits using a bayesian mixture model. PLoS Genet. 2015;11:e1004969. doi: 10.1371/journal.pgen.1004969.
11. Yu J., Pressoir G., Briggs W.H., Vroh Bi I., Yamasaki M., Doebley J.F., McMullen M.D., Gaut B.S., Nielsen D.M., Holland J.B. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 2006;38:203–208. doi: 10.1038/ng1702.
12. Chen W.-M., Abecasis G.R. Family-based association tests for genomewide association scans. Am. J. Hum. Genet. 2007;81:913–926. doi: 10.1086/521580.
13. Kang H.M., Sul J.H., Service S.K., Zaitlen N.A., Kong S.Y., Freimer N.B., Sabatti C., Eskin E. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 2010;42:348–354. doi: 10.1038/ng.548.
14. Zhou X., Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 2012;44:821–824. doi: 10.1038/ng.2310.
15. Yang J., Zaitlen N.A., Goddard M.E., Visscher P.M., Price A.L. Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 2014;46:100–106. doi: 10.1038/ng.2876.
16. Loh P.-R., Tucker G., Bulik-Sullivan B.K., Vilhjálmsson B.J., Finucane H.K., Salem R.M., Chasman D.I., Ridker P.M., Neale B.M., Berger B. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 2015;47:284–290. doi: 10.1038/ng.3190.
17. Zaitlen N., Kraft P., Patterson N., Pasaniuc B., Bhatia G., Pollack S., Price A.L. Using extended genealogy to estimate components of heritability for 23 quantitative and dichotomous traits. PLoS Genet. 2013;9:e1003520. doi: 10.1371/journal.pgen.1003520.
18. Yang J., Benyamin B., McEvoy B.P., Gordon S., Henders A.K., Nyholt D.R., Madden P.A., Heath A.C., Martin N.G., Montgomery G.W. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010;42:565–569. doi: 10.1038/ng.608.
19. de los Campos G., Sorensen D., Gianola D. Genomic heritability: what is it? PLoS Genet. 2015;11:e1005048. doi: 10.1371/journal.pgen.1005048.
20. Visscher P.M., Hill W.G., Wray N.R. Heritability in the genomics era--concepts and misconceptions. Nat. Rev. Genet. 2008;9:255–266. doi: 10.1038/nrg2322.
21. Lettre G., Palmer C.D., Young T., Ejebe K.G., Allayee H., Benjamin E.J., Bennett F., Bowden D.W., Chakravarti A., Dreisbach A. Genome-wide association study of coronary heart disease and its risk factors in 8,090 African Americans: the NHLBI CARe Project. PLoS Genet. 2011;7:e1001300. doi: 10.1371/journal.pgen.1001300.
22. Dawber T.R., Meadors G.F., Moore F.E., Jr. Epidemiological approaches to heart disease: the Framingham Study. Am. J. Public Health Nations Health. 1951;41:279–281. doi: 10.2105/ajph.41.3.279.
23. Splansky G.L., Corey D., Yang Q., Atwood L.D., Cupples L.A., Benjamin E.J., D’Agostino R.B., Sr., Fox C.S., Larson M.G., Murabito J.M. The third generation cohort of the National Heart, Lung, and Blood Institute’s Framingham Heart Study: design, recruitment, and initial examination. Am. J. Epidemiol. 2007;165:1328–1335. doi: 10.1093/aje/kwm021.
24. So H.-C., Kwan J.S., Cherny S.S., Sham P.C. Risk prediction of complex diseases from family history and known susceptibility loci, with applications for cancer screening. Am. J. Hum. Genet. 2011;88:548–565. doi: 10.1016/j.ajhg.2011.04.001.
25. Chatterjee N., Wheeler B., Sampson J., Hartge P., Chanock S.J., Park J.-H. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat. Genet. 2013;45:400–405.e1–e3. doi: 10.1038/ng.2579.
26. Widmer C., Lippert C., Weissbrod O., Fusi N., Kadie C., Davidson R., Listgarten J., Heckerman D. Further improvements to linear mixed models for genome-wide association studies. Sci. Rep. 2014;4:6874. doi: 10.1038/srep06874.
27. Kang H.M., Zaitlen N.A., Wade C.M., Kirby A., Heckerman D., Daly M.J., Eskin E. Efficient control of population structure in model organism association mapping. Genetics. 2008;178:1709–1723. doi: 10.1534/genetics.107.080101.
28. Jakobsdottir J., McPeek M.S. MASTOR: mixed-model association mapping of quantitative traits in samples with related individuals. Am. J. Hum. Genet. 2013;92:652–666. doi: 10.1016/j.ajhg.2013.03.014.
29. Patterson H.D., Thompson R. Recovery of inter-block information when block sizes are unequal. Biometrika. 1971;58:545–554.
30. Lippert C., Listgarten J., Liu Y., Kadie C.M., Davidson R.I., Heckerman D. FaST linear mixed models for genome-wide association studies. Nat. Methods. 2011;8:833–835. doi: 10.1038/nmeth.1681.
31. Listgarten J., Lippert C., Kadie C.M., Davidson R.I., Eskin E., Heckerman D. Improved linear mixed models for genome-wide association studies. Nat. Methods. 2012;9:525–526. doi: 10.1038/nmeth.2037.
32. Legarra A., Aguilar I., Misztal I. A relationship matrix including full pedigree and genomic information. J. Dairy Sci. 2009;92:4656–4663. doi: 10.3168/jds.2009-2061.
33. Misztal I., Legarra A., Aguilar I. Computing procedures for genetic evaluation including phenotypic, full pedigree, and genomic information. J. Dairy Sci. 2009;92:4648–4655. doi: 10.3168/jds.2009-2064.
34. Aguilar I., Misztal I., Johnson D.L., Legarra A., Tsuruta S., Lawlor T.J. Hot topic: a unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of Holstein final score. J. Dairy Sci. 2010;93:743–752. doi: 10.3168/jds.2009-2730.
35. Pasaniuc B., Zaitlen N., Lettre G., Chen G.K., Tandon A., Kao W.H., Ruczinski I., Fornage M., Siscovick D.S., Zhu X. Enhanced statistical tests for GWAS in admixed populations: assessment using African Americans from CARe and a Breast Cancer Consortium. PLoS Genet. 2011;7:e1001371. doi: 10.1371/journal.pgen.1001371.
36. Chen C.-Y., Pollack S., Hunter D.J., Hirschhorn J.N., Kraft P., Price A.L. Improved ancestry inference using weights from external reference panels. Bioinformatics. 2013;29:1399–1406. doi: 10.1093/bioinformatics/btt144.
37. Price A.L., Patterson N.J., Plenge R.M., Weinblatt M.E., Shadick N.A., Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 2006;38:904–909. doi: 10.1038/ng1847.
38. Wray N.R., Yang J., Hayes B.J., Price A.L., Goddard M.E., Visscher P.M. Pitfalls of predicting complex traits from SNPs. Nat. Rev. Genet. 2013;14:507–515. doi: 10.1038/nrg3457.
39. Visscher P.M., Brown M.A., McCarthy M.I., Yang J. Five years of GWAS discovery. Am. J. Hum. Genet. 2012;90:7–24. doi: 10.1016/j.ajhg.2011.11.029.
40. Bengio Y., Grandvalet Y. No unbiased estimator of the variance of k-fold cross-validation. J. Mach. Learn. Res. 2004;5:1089–1105.
41. Speed D., Hemani G., Johnson M.R., Balding D.J. Improved heritability estimation from genome-wide SNPs. Am. J. Hum. Genet. 2012;91:1011–1021. doi: 10.1016/j.ajhg.2012.10.010.
42. Yang J., Weedon M.N., Purcell S., Lettre G., Estrada K., Willer C.J., Smith A.V., Ingelsson E., O’Connell J.R., Mangino M., GIANT Consortium. Genomic inflation factors under polygenic inheritance. Eur. J. Hum. Genet. 2011;19:807–812. doi: 10.1038/ejhg.2011.39.
43. Chen C.-Y., Han J., Hunter D.J., Kraft P., Price A.L. Explicit modeling of ancestry improves polygenic risk scores and BLUP prediction. Genet. Epidemiol. 2015;39:427–438. doi: 10.1002/gepi.21906.
44. Willer C.J., Schmidt E.M., Sengupta S., Peloso G.M., Gustafsson S., Kanoni S., Ganna A., Chen J., Buchkovich M.L., Mora S., Global Lipids Genetics Consortium. Discovery and refinement of loci associated with lipid levels. Nat. Genet. 2013;45:1274–1283. doi: 10.1038/ng.2797.
45. Svishcheva G.R., Axenovich T.I., Belonogova N.M., van Duijn C.M., Aulchenko Y.S. Rapid variance components-based method for whole-genome association analysis. Nat. Genet. 2012;44:1166–1170. doi: 10.1038/ng.2410.
46. Devlin B., Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004. doi: 10.1111/j.0006-341x.1999.00997.x.
47. Crossett A., Lee A.B., Klei L., Devlin B., Roeder K. Refining genetically inferred relationships using treelet covariance smoothing. Ann. Appl. Stat. 2013;7:669–690. doi: 10.1214/12-AOAS598.
  • 48.Dudbridge F. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 2013;9:e1003348. doi: 10.1371/journal.pgen.1003348. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Manichaikul A., Mychaleckyj J.C., Rich S.S., Daly K., Sale M., Chen W.-M. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010;26:2867–2873. doi: 10.1093/bioinformatics/btq559. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Loh, P.-R., Bhatia, G., Gusev, A., Finucane, H., Bulik-Sullivan, B., Pollack, S., P.-S. W., de Candia, T., Lee, S., Wray, N., et al.; Schizophrenia Working Group Psychiatric Genomics Consortium. Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance components analysis. Nat. Genet. Published online November 2, 2015. http://dx.doi.org/10.1038/ng.3431. [DOI] [PMC free article] [PubMed]
  • 51.Legarra A., Misztal I. Technical note: Computing strategies in genome-wide selection. J. Dairy Sci. 2008;91:360–366. doi: 10.3168/jds.2007-0403. [DOI] [PubMed] [Google Scholar]
  • 52.VanRaden P.M. Efficient methods to compute genomic predictions. J. Dairy Sci. 2008;91:4414–4423. doi: 10.3168/jds.2007-0980. [DOI] [PubMed] [Google Scholar]
  • 53.Bulik-Sullivan B.K., Loh P.-R., Finucane H.K., Ripke S., Yang J., Patterson N., Daly M.J., Price A.L., Neale B.M., Schizophrenia Working Group of the Psychiatric Genomics Consortium LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 2015;47:291–295. doi: 10.1038/ng.3211. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Styrkarsdottir U., Thorleifsson G., Sulem P., Gudbjartsson D.F., Sigurdsson A., Jonasdottir A., Jonasdottir A., Oddsson A., Helgason A., Magnusson O.T. Nonsense mutation in the LGR4 gene is associated with several human diseases and other traits. Nature. 2013;497:517–520. doi: 10.1038/nature12124. [DOI] [PubMed] [Google Scholar]
  • 55.Do C.B., Tung J.Y., Dorfman E., Kiefer A.K., Drabant E.M., Francke U., Mountain J.L., Goldman S.M., Tanner C.M., Langston J.W. Web-based genome-wide association study identifies two novel loci and a substantial genetic component for Parkinson’s disease. PLoS Genet. 2011;7:e1002141. doi: 10.1371/journal.pgen.1002141. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Haile-Mariam M., Nieuwhof G.J., Beard K.T., Konstatinov K.V., Hayes B.J. Comparison of heritabilities of dairy traits in Australian Holstein-Friesian cattle from genomic and pedigree data and implications for genomic evaluations. J. Anim. Breed. Genet. 2013;130:20–31. doi: 10.1111/j.1439-0388.2013.01001.x. [DOI] [PubMed] [Google Scholar]
  • 57.Khansefid M., Pryce J.E., Bolormaa S., Miller S.P., Wang Z., Li C., Goddard M.E. Estimation of genomic breeding values for residual feed intake in a multibreed cattle population. J. Anim. Sci. 2014;92:3270–3283. doi: 10.2527/jas.2014-7375. [DOI] [PubMed] [Google Scholar]
  • 58.Kemper K.E., Goddard M.E. Understanding and predicting complex traits: knowledge from cattle. Hum. Mol. Genet. 2012;21(R1):R45–R51. doi: 10.1093/hmg/dds332. [DOI] [PubMed] [Google Scholar]

Supplementary Materials

Document S1. Tables S1–S11
mmc1.pdf (285.9KB, pdf)
Document S2. Article plus Supplemental Data
mmc2.pdf (618.3KB, pdf)
