Bioinformatics. 2018 Jun 27;34(13):i187–i194. doi: 10.1093/bioinformatics/bty253

A scalable estimator of SNP heritability for biobank-scale data

Yue Wu, Sriram Sankararaman
PMCID: PMC6022682  PMID: 29950019

Abstract

Motivation

Heritability, the proportion of variation in a trait that can be explained by genetic variation, is an important parameter in efforts to understand the genetic architecture of complex phenotypes as well as in the design and interpretation of genome-wide association studies. Attempts to understand the heritability of complex phenotypes attributable to genome-wide single nucleotide polymorphism (SNP) variation data have motivated the analysis of large datasets as well as the development of sophisticated tools to estimate heritability in these datasets. Linear mixed models (LMMs) have emerged as a key tool for heritability estimation where the parameters of the LMMs, i.e. the variance components, are related to the heritability attributable to the SNPs analyzed. Likelihood-based inference in LMMs, however, poses serious computational burdens.

Results

We propose a scalable randomized algorithm for estimating variance components in LMMs. Our method is based on a method-of-moment estimator that has a runtime complexity $O(NMB)$ for $N$ individuals and $M$ SNPs (where $B$ is a parameter that controls the number of random matrix-vector multiplications). Further, by leveraging the structure of the genotype matrix, we can reduce the time complexity to $O\left(\frac{NMB}{\max(\log_3 N,\, \log_3 M)}\right)$.

We demonstrate the scalability and accuracy of our method on simulated as well as on empirical data. On standard hardware, our method computes heritability on a dataset of 500 000 individuals and 100 000 SNPs in 38 min.

Availability and implementation

The RHE-reg software is made freely available to the research community at: https://github.com/sriramlab/RHE-reg.

1 Introduction

A central question in biology is to understand how much of the variation in a trait (phenotype) can be explained by genetics as opposed to environmental factors. The heritability of a trait is a central notion in quantifying the contribution of genetics to the variation in a trait. The heritability of a trait refers to the proportion of variation in the trait that can be explained by genetic variation (Visscher et al., 2008). The narrow-sense heritability ($h^2$) refers to the proportion of trait variation that can be explained by a linear function of genetic variation (Almasy and Blangero, 1998). Beyond understanding the genetic basis of a phenotype, heritability determines the power of genetic association studies to detect genetic variants associated with a phenotype, the accuracy of using genetic data to predict phenotypes, as well as the response of a phenotype to natural and artificial selection (Houle, 1992).

While family-based studies enabled the estimation of heritability of a wide variety of traits, the availability of genome-wide genetic variation data has enabled a direct estimation of the heritability associated with genotyped single nucleotide polymorphisms (SNPs), termed SNP heritability. Initial attempts to estimate heritability from genomic data focused on the variation in a trait that could be explained by SNPs that were discovered to be significantly associated with the trait in a genome-wide association study (GWAS). These estimates were found to severely under-estimate the narrow-sense heritability, a phenomenon known as missing heritability. A major insight into the mystery of missing heritability emerged in Yang et al. (2010), who showed that using all genotyped SNPs jointly to explain variation in a trait led to a substantially larger estimate of heritability than from SNPs that were found to be associated in GWAS. Subsequent analyses suggest that much of missing heritability could be explained by the presence of a large number of SNPs of weak effects, which has, in turn, motivated analyses of larger datasets.

Linear mixed models (LMMs) have emerged as a key analytical technique for estimating the heritability of complex traits using genome-wide SNP variation data. Beyond their application in estimating SNP heritability, LMMs are widely used in association tests where they are used to control for population stratification (Kang et al., 2008a; Lippert et al., 2011; Loh et al., 2015b; Yu et al., 2006; Zhou and Stephens, 2014), in phenotype and disease risk prediction (Makowsky et al., 2011; Speed et al., 2012; Wray et al., 2013; Yang et al., 2010; Zhou et al., 2013), and in understanding the relative contribution of genomic regions to variation in a trait of interest (Makowsky et al., 2011; Wray et al., 2013; Yang et al., 2010). A key step in the application of LMMs is the estimation of their parameters, often referred to as variance components. Estimation of variance components is a computationally challenging problem on genomic datasets containing large numbers of individuals and SNPs. The most commonly used method for variance components estimation in LMMs relies on maximizing the likelihood of the parameters. Often, a related estimator, known as the restricted maximum likelihood (REML) estimator, is preferred due to a reduced bias relative to maximum likelihood estimators. Both maximum likelihood and REML estimation, however, rely on computationally intensive optimization problems. While a number of methods have been proposed to improve the computational efficiency of REML estimators (Kang et al., 2008b; Lippert et al., 2011; Loh et al., 2015a, b; Pirinen et al., 2013; Yang et al., 2011), all of these methods rely on iterative optimization algorithms that do not scale well to biobank-scale datasets consisting of millions of individuals genotyped at tens of millions of SNPs. Further, REML has been shown to yield biased estimates of heritability in ascertained case-control studies (Chen, 2014; Golan et al., 2014).

1.1 Our contributions

We propose a scalable randomized algorithm to estimate variance components of an LMM. Our method is based on Haseman–Elston (HE) regression (Bulik-Sullivan, 2015; Chen et al., 2004; Elston et al., 2000; Haseman and Elston, 1972), a method-of-moment (MoM) estimator of the heritability of a phenotype. The HE-regression estimator, like other MoM estimators, tends to be statistically less efficient compared to REML. On the other hand, HE-regression is computationally attractive as it leads to a set of linear equations in the variance components that can be solved analytically. While this property of HE-regression is appealing, a key computational bottleneck in the application of HE-regression is the computation of an $N \times N$ matrix that summarizes the relationship between all pairs of the $N$ individuals in the dataset. As a result, the computation and memory requirements of HE scale quadratically with the number of individuals.

Our randomized HE-regression (RHE-reg) estimator relies on the observation that the key bottleneck in HE-regression can be replaced by multiplying the $N \times M$ (individuals × SNPs) matrix of genotypes with a small number, $B$, of random vectors. This leads to a randomized estimator with runtime $O(NMB)$ and memory requirements $O(NM)$. Further, we leverage the observation that the genotype matrix has entries in a finite set, i.e. $\{0,1,2\}$, so that the time complexity of the matrix-vector multiplications reduces to $O\left(\frac{NMB}{\max(\log_3 N,\, \log_3 M)}\right)$ (Liberty and Zucker, 2009). This additional gain in efficiency can be substantial when the number of SNPs or individuals is large. For example, in the UK Biobank, $N$ is of the order of $10^5$ while $M$ is of the order of $10^6$. Thus, we propose an estimator of variance components with runtime $O\left(\frac{NMB}{\max(\log_3 N,\, \log_3 M)} + NM\right)$ and memory requirement $O(NM)$.

We apply the RHE-reg estimator to the problem of estimating SNP heritability. We show that our method yields unbiased SNP heritability estimates. While our method is statistically inefficient compared to REML (both because it is moment-based and because of the added randomization), we show in practice that the statistical inefficiency is minimal, particularly for large sample sizes. Further, our method is substantially more computationally efficient so that it can be effectively applied to whole-genome genotype data from hundreds of thousands of individuals. Finally, REML has been shown to yield biased estimates of heritability in ascertained case-control studies (Chen, 2014; Golan et al., 2014), whereas the RHE-reg estimator can be applied in this setting.

Finally, since variance component analysis is of interest beyond heritability estimation, the RHE-reg estimator can enable rapid estimation of variance components in all of the settings in which LMMs are used.

2 Materials and methods

We observe genotypes from $N$ individuals at $M$ SNPs. The genotype vector for individual $i$ is a length-$M$ vector denoted by $g_i \in \{0,1,2\}^M$. The $j$th entry of $g_i$ denotes the number of minor alleles carried by individual $i$ at SNP $j$. Let $G$ be the $N \times M$ genotype matrix whose rows are the genotype vectors, $G = \begin{bmatrix} g_1^T \\ \vdots \\ g_N^T \end{bmatrix}$. $X$ is an $N \times M$ matrix of standardized genotypes obtained by centering and scaling each column of $G$ so that $\sum_n x_{n,m} = 0$ and $\frac{1}{N}\sum_n x_{n,m}^2 = 1$ for all $m \in \{1,\ldots,M\}$. Let $y$ be an $N$-vector of phenotypes and $\beta$ be an $M$-vector of SNP effect sizes.

2.1 Linear mixed model

We assume the vector of phenotypes y is related to the genotypes by a LMM:

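$$y \mid \epsilon, \beta = X\beta + \epsilon \tag{1}$$
$$\epsilon \mid \sigma_e^2 \sim \mathcal{N}(0, \sigma_e^2 I_N) \tag{2}$$
$$\beta \mid \sigma_g^2 \sim \mathcal{N}\left(0, \frac{\sigma_g^2}{M} I_M\right). \tag{3}$$

Here $y$ is centered so that $\sum_n y_n = 0$. $\sigma_e^2$ is the residual variance while $\sigma_g^2$ is the variance component corresponding to the $M$ SNPs. The SNP heritability is defined as $h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}$.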

In this model, we have E[y]=0 while the population covariance of the phenotype vector y is:

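$$\mathrm{cov}(y) = E[yy^T] - E[y]E[y]^T = \sigma_g^2 \frac{XX^T}{M} + \sigma_e^2 I_N \tag{4}$$
$$= \sigma_g^2 K + \sigma_e^2 I_N. \tag{5}$$

Here $K = \frac{1}{M}XX^T$ is the genetic relatedness matrix (GRM) computed from all SNPs. One approach to estimate the SNP heritability is HE-regression (Haseman and Elston, 1972), which is a MoM estimator obtained by equating the population covariance to the empirical covariance [several variants of HE-regression have been proposed; what we consider here is HE-CP (Sham and Purcell, 2001)]. The empirical covariance of the phenotype vector $y$ is estimated by $yy^T$. The MoM estimator is obtained by solving the following ordinary least squares (OLS) problem (see Appendix A1 for details):

$$(\hat{\sigma}_g^2, \hat{\sigma}_e^2) = \operatorname*{argmin}_{\sigma_g^2, \sigma_e^2} \left\| yy^T - (\sigma_g^2 K + \sigma_e^2 I) \right\|_F^2. \tag{6}$$

The MoM estimator satisfies the normal equations:

$$\begin{bmatrix} \mathrm{tr}[K^2] & \mathrm{tr}[K] \\ \mathrm{tr}[K] & N \end{bmatrix} \begin{bmatrix} \hat{\sigma}_g^2 \\ \hat{\sigma}_e^2 \end{bmatrix} = \begin{bmatrix} y^T K y \\ y^T y \end{bmatrix}. \tag{7}$$

Solving the normal equations requires computing $\mathrm{tr}[K^2] = \sum_{i,j} K_{i,j}^2$, $\mathrm{tr}[K] = \sum_i K_{i,i}$, $y^T K y = \sum_{i,j} K_{i,j} y_i y_j$ and $y^T y = \sum_{n=1}^{N} y_n^2$. The GRM $K$ can be computed in time $O(MN^2)$ and requires $O(N^2)$ memory. Given the GRM, computing each of the coefficients for the normal equations requires $O(N^2)$ time. Finally, given each of the coefficients, we can solve analytically for $\hat{\sigma}_g^2$ and $\hat{\sigma}_e^2$. Indeed, using $\mathrm{tr}[K] = N$ for standardized genotypes, we can write

$$\hat{\sigma}_g^2 = \frac{y^T (K - I) y}{\mathrm{tr}[K^2] - N}. \tag{8}$$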

Thus, the key bottleneck in solving the HE-regression lies in computing the GRM.
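As a point of reference for the bottleneck described above, the following is a minimal NumPy sketch of exact HE-regression that forms the GRM explicitly and solves Equation (7). It is an illustration only, not the authors' implementation: the function name `he_regression`, the toy sizes, the allele-frequency range and the residual variance of $1 - h^2$ are all assumptions made for the example.

```python
import numpy as np

def he_regression(X, y):
    """Exact HE-regression: form the GRM and solve the normal equations (7)."""
    N, M = X.shape
    y = y - y.mean()                      # center the phenotype
    K = X @ X.T / M                       # GRM: O(M N^2) time, O(N^2) memory
    tr_K2 = np.sum(K * K)                 # tr[K^2] = sum_ij K_ij^2 (K is symmetric)
    tr_K = np.trace(K)
    A = np.array([[tr_K2, tr_K], [tr_K, float(N)]])
    rhs = np.array([y @ K @ y, y @ y])
    sigma_g2, sigma_e2 = np.linalg.solve(A, rhs)
    return sigma_g2, sigma_e2

# Toy data (hypothetical sizes, chosen only so the dense GRM fits in memory).
rng = np.random.default_rng(0)
N, M, h2_true = 500, 1000, 0.5
freqs = rng.uniform(0.1, 0.9, size=M)     # assumed range, avoids monomorphic SNPs
G = rng.binomial(2, freqs, size=(N, M))
X = (G - G.mean(axis=0)) / G.std(axis=0)  # standardized genotypes
beta = rng.normal(0.0, np.sqrt(h2_true / M), size=M)
y = X @ beta + rng.normal(0.0, np.sqrt(1.0 - h2_true), size=N)

sigma_g2, sigma_e2 = he_regression(X, y)
h2_hat = sigma_g2 / (sigma_g2 + sigma_e2)
print(f"h2 estimate: {h2_hat:.3f}")
```

The `X @ X.T` product is the step whose time and memory grow quadratically in $N$, which is why this version is only practical for small samples.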

2.2 RHE-reg: a randomized estimator of heritability

Given that $K = \frac{1}{M}XX^T$, we can compute the quantities $\mathrm{tr}[K] = \frac{1}{M}\sum_{i,j} X_{i,j}^2$, $w = X^T y$ and $y^T K y = \frac{1}{M}\sum_{m=1}^{M} w_m^2$. For standardized genotypes, $\mathrm{tr}[K] = N$ while $y^T K y$ can be computed in $O(MN)$ time.

The one remaining quantity that we need to compute efficiently is $\mathrm{tr}[K^2]$. Given an $N \times N$ matrix $A$ and a random vector $z$ with mean zero and covariance $I_N$, we use the following identity to construct a randomized estimator of the trace of matrix $A$ (see Appendix A2 for a proof):

$$E[z^T A z] = \mathrm{tr}[A]. \tag{9}$$

Equation (9) leads to the following unbiased estimator of the trace of $K^2$ given $B$ random vectors, $z_1, \ldots, z_B$, drawn independently from a distribution with zero mean and identity covariance matrix $I_N$:

$$L_B \equiv \widehat{\mathrm{tr}[K^2]} = \frac{1}{B}\sum_b z_b^T K K z_b = \frac{1}{B}\frac{1}{M^2}\sum_b z_b^T X X^T X X^T z_b = \frac{1}{B}\frac{1}{M^2}\sum_b \left\| X X^T z_b \right\|_2^2. \tag{10}$$

In practice, we draw each entry of $z$ independently from a standard normal distribution. We note that the estimator $L_B$ involves two matrix-vector multiplications with an $N \times M$ matrix, repeated $B$ times, for a total runtime of $O(NMB)$.

The RHE-reg estimator $(\tilde{\sigma}_g^2, \tilde{\sigma}_e^2)$ is obtained by solving the normal equations [Equation (7)] after replacing $\mathrm{tr}[K^2]$ with $L_B$:

$$\begin{bmatrix} L_B & \mathrm{tr}[K] \\ \mathrm{tr}[K] & N \end{bmatrix} \begin{bmatrix} \tilde{\sigma}_g^2 \\ \tilde{\sigma}_e^2 \end{bmatrix} = \begin{bmatrix} y^T K y \\ y^T y \end{bmatrix}. \tag{11}$$

The RHE-reg estimator of the SNP heritability is then obtained by $\tilde{h}^2_{\mathrm{rhe}} = \frac{\tilde{\sigma}_g^2}{s_y^2}$ where $s_y^2 = \frac{y^T y}{N-1}$ is the unbiased estimator of the phenotypic variance.
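The estimator in Equations (10) and (11) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the RHE-reg software itself: the function name `rhe_reg`, the use of dense matrix products instead of the Mailman algorithm, and the toy data are all hypothetical choices.

```python
import numpy as np

def rhe_reg(X, y, B=100, seed=0):
    """Randomized HE-regression that never forms the N x N GRM."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    y = y - y.mean()
    # Equation (10): L_B = (1 / (B M^2)) * sum_b ||X X^T z_b||^2
    Z = rng.standard_normal((N, B))       # B random vectors, stored as columns
    XXtZ = X @ (X.T @ Z)                  # two passes over X per vector, O(NMB)
    L_B = np.sum(XXtZ ** 2) / (B * M ** 2)
    w = X.T @ y
    yKy = w @ w / M                       # y^T K y in O(NM)
    tr_K = np.sum(X ** 2) / M             # equals N for standardized X
    A = np.array([[L_B, tr_K], [tr_K, float(N)]])
    sigma_g2, sigma_e2 = np.linalg.solve(A, np.array([yKy, y @ y]))
    h2 = sigma_g2 / (y @ y / (N - 1))     # divide by s_y^2
    return sigma_g2, sigma_e2, h2, L_B

# Toy data (hypothetical sizes and allele-frequency range).
rng = np.random.default_rng(1)
N, M = 400, 800
G = rng.binomial(2, rng.uniform(0.1, 0.9, size=M), size=(N, M))
X = (G - G.mean(axis=0)) / G.std(axis=0)
y = X @ rng.normal(0.0, np.sqrt(0.5 / M), size=M) + rng.normal(0.0, np.sqrt(0.5), size=N)

sigma_g2, sigma_e2, h2, L_B = rhe_reg(X, y, B=100)
print(f"L_B = {L_B:.1f}, h2 estimate = {h2:.3f}")
```

Stacking the $B$ random vectors as columns of one matrix turns the $2B$ matrix-vector products into two matrix-matrix products, which is a convenience of this sketch and does not change the $O(NMB)$ operation count.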

2.3 Sub-linear computations

The key bottleneck in RHE-reg is the computation of $L_B$, which involves repeated multiplication of the normalized genotype matrix $X$ by a real-valued vector. Leveraging the fact that each element of the genotype matrix $G$ takes values in the set {0, 1, 2}, we can improve the complexity of these multiplication operations from $O(NM)$ to $O\left(\frac{NM}{\max(\log_3 N,\, \log_3 M)}\right)$ using the Mailman algorithm (Liberty and Zucker, 2009).

2.3.1 The Mailman algorithm

Consider an $M \times N$ matrix $A^T$ whose entries take values in {0, 1, 2}. Assume that the number of SNPs $M = \log_3(N)$. The naive way to compute the product $A^T b$ for any real-valued vector $b$ takes $O(N \log_3 N)$ time.

The Mailman algorithm decomposes the matrix as $A^T = U_n P$. $U_n$ is a $\log_3(N) \times N$ matrix whose columns contain all possible vectors over {0, 1, 2} of length $\log_3(N)$, and $P$ is an indicator matrix, where entry $P_{i,j} = 1$ if the $j$th column of $A^T$ is the same as the $i$th column of $U_n$. The decomposition of matrix $A$ takes $O(N \log_3 N)$ time. The desired product $A^T b$ is computed in two steps as $c = Pb$ followed by $U_n c$, each of which can be computed in only $O(N)$ operations (Liberty and Zucker, 2009).

For a matrix $A^T$ with $M > \log_3(N)$, we partition $A^T$ into $\frac{M}{\log_3(N)}$ sub-matrices, each of size $\log_3(N) \times N$ and each of which can be multiplied in time $O(N)$, for a total computational cost of $O\left(\frac{NM}{\log_3 N}\right)$.
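The two steps of the Mailman algorithm can be sketched as follows. This is a hedged illustration of the idea, not the authors' code: the names `mailman_encode` and `mailman_matvec` are hypothetical, the matrix `At` plays the role of $A^T$ (rows are SNPs, columns are individuals), and the block height is taken as $\lfloor \log_3 n \rfloor$ as an assumption. The encoding is done once and reused for every vector.

```python
import numpy as np

def mailman_encode(At):
    """Split At (entries in {0,1,2}) into row blocks and encode each column
    of a block as a base-3 integer. This plays the role of the matrix P."""
    n_rows, n = At.shape
    m = max(1, int(np.log(n) / np.log(3)))        # block height, about log_3(n)
    blocks = []
    for start in range(0, n_rows, m):
        block = np.asarray(At[start:start + m], dtype=np.int64)
        powers = 3 ** np.arange(block.shape[0] - 1, -1, -1)
        blocks.append((powers @ block, block.shape[0]))
    return blocks

def mailman_matvec(blocks, b):
    """Compute At @ b from the encoded blocks."""
    out = []
    for idx, m in blocks:
        # c = P b: add up the entries of b whose columns share a pattern.
        c = np.bincount(idx, weights=b, minlength=3 ** m)
        # U c: peel off the most significant base-3 digit, one row at a time.
        for _ in range(m):
            thirds = c.reshape(3, -1)             # rows: leading digit 0, 1, 2
            out.append(thirds[1].sum() + 2.0 * thirds[2].sum())
            c = thirds.sum(axis=0)                # drop the leading digit
    return np.array(out)

rng = np.random.default_rng(0)
At = rng.integers(0, 3, size=(25, 200))           # 25 SNPs, 200 individuals
b = rng.standard_normal(200)
result = mailman_matvec(mailman_encode(At), b)
print(np.allclose(result, At @ b))
```

Each block costs $O(n + 3^m) = O(n)$ operations after encoding, instead of $O(mn)$ for the naive product, which is where the logarithmic saving comes from.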

2.3.2 Application of the Mailman algorithm to RHE-reg

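Now consider the standardized genotype matrix $X$, which can be written as $X = (G - \mathbf{M})\Sigma$, where $\mathbf{M}$ is an $N \times M$ matrix whose $i$th column contains the sample mean of the $i$th SNP ($\mathbf{M} = 1_N \bar{g}^T$; the bold symbol distinguishes it from the number of SNPs $M$), and $\Sigma$ is an $M \times M$ diagonal matrix with the inverse of the standard deviation of each SNP as the diagonal entries.

Thus, when we compute $y^T K y = \frac{1}{M} y^T X X^T y = \frac{1}{M}\left\| \Sigma (G^T y - \mathbf{M}^T y) \right\|_2^2$ in Equation (11), computing $G^T y$ using the Mailman algorithm takes $O\left(\frac{NM}{\max(\log_3 M,\, \log_3 N)}\right)$ operations. Similarly, for each term in the sum of the randomized estimator of $\mathrm{tr}[K^2]$ [Equation (10)], we can substitute $X^T z_b$ with $\Sigma G^T z_b - \Sigma \mathbf{M}^T z_b$. The first term $\Sigma G^T z_b$ can again be computed in $O\left(\frac{NM}{\max(\log_3 M,\, \log_3 N)}\right)$ operations using the Mailman algorithm, and the second term $\Sigma \mathbf{M}^T z_b = \Sigma \bar{g}\,(1_N^T z_b)$ amounts to summing the entries of $z_b$ and scaling the $M$-vector of SNP means, which can be computed in time $O(N + M)$.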
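The decomposition above means that products with $X$ and $X^T$ never require the standardized matrix to be stored. A minimal NumPy sketch, with hypothetical function names and with a plain `G.T @ z` standing in for the Mailman-accelerated product, is:

```python
import numpy as np

def standardized_rmatvec(G, z):
    """X^T z = Sigma (G^T z - gbar * sum(z)), without forming X."""
    gbar = G.mean(axis=0)                 # per-SNP sample means
    inv_sd = 1.0 / G.std(axis=0)          # diagonal of Sigma
    Gtz = G.T @ z                         # the product Mailman would accelerate
    return inv_sd * (Gtz - gbar * z.sum())

def standardized_matvec(G, v):
    """X v = G (Sigma v) - 1_N (gbar^T Sigma v), without forming X."""
    gbar = G.mean(axis=0)
    inv_sd = 1.0 / G.std(axis=0)
    u = inv_sd * v
    return G @ u - gbar @ u               # scalar gbar @ u is broadcast over rows

rng = np.random.default_rng(0)
N, M = 200, 50
G = rng.binomial(2, rng.uniform(0.1, 0.9, size=M), size=(N, M))
X = (G - G.mean(axis=0)) / G.std(axis=0)  # explicit X, only for comparison
z = rng.standard_normal(N)
v = rng.standard_normal(M)
print(np.allclose(standardized_rmatvec(G, z), X.T @ z),
      np.allclose(standardized_matvec(G, v), X @ v))
```

In practice the means and standard deviations would be computed once and cached; they are recomputed here only to keep each function self-contained.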

2.4 Computing the standard error

We show in Appendix A4 that the variance of the RHE-reg estimator of σg2 can be approximated by the variance of the exact HE-regression estimator with an additional contribution due to the randomization:

$$\mathrm{Var}[\tilde{\sigma}_g^2] \approx \mathrm{Var}[\hat{\sigma}_g^2] + \frac{1}{B}\,\frac{\sigma_g^4\,\mathrm{tr}[K^2]}{(\mathrm{tr}[K^2] - N)^2}.$$

Here $B$ is the number of random vectors used, each drawn with mean zero and identity covariance matrix. For samples with low levels of relatedness, we can assume $K \approx I$ and use our estimates of $\sigma_g^2$ and $\mathrm{tr}[K^2]$ to estimate the variance. Further, we show in Appendix A4 that we can estimate the variance (and hence, the standard error) of the RHE-reg estimator in sub-linear time without assuming that $K \approx I$.

2.5 Some remarks on the RHE-reg estimator

  1. The RHE-reg is biased as we show in Appendix A3 with a bias that decreases with B. In practice, the bias appears to be small (see Fig. 1).

  2. Equation (3) assumes an infinitesimal model for the phenotype. However, all our results only depend on the second moment of the SNP effect sizes. Thus, the RHE-reg estimator can yield valid estimates for non-infinitesimal architectures.

  3. In a number of settings, it is desirable to include covariates, such as age or sex, in the analysis. This changes the model in Equation (1) to:
    $$y \mid \epsilon, \beta = W\alpha + X\beta + \epsilon. \tag{12}$$

Fig. 1.

RHE-reg accurately estimates heritability: in the first series (a–c), we simulated genotypes with varying sample size while fixing the number of SNPs to 10 000. The phenotype in each of (a), (b) and (c) is simulated with true heritability of 0.2, 0.5 and 0.8, respectively. The second series (d–f) considers genotype data with varying number of SNPs while the number of samples is fixed at 10 000. All three methods that we evaluated (GCTA, HE-reg and RHE-reg) have similar accuracies. GCTA, which estimates the REML, has smaller standard errors when the heritability is large ($h^2 = 0.80$). For lower values of true heritability ($h^2 = 0.20$, $h^2 = 0.50$), the estimates from REML, HE-regression and RHE-reg are comparable. HE and RHE-reg have similar variance, suggesting that randomization only makes a minor contribution to the statistical accuracy

Here $W$ is an $N \times C$ matrix of covariates while $\alpha$ is a $C$-vector of coefficients. In this setting, we transform Equation (12) by multiplying by the projection matrix $V = I_N - W(W^T W)^{-1} W^T$:

$$Vy = VX\beta + V\epsilon. \tag{13}$$

The RHE-reg estimator applied to Equation (13) then must satisfy the following moment conditions:

$$\begin{bmatrix} J_B & \mathrm{tr}[VK] \\ \mathrm{tr}[VK] & N - C \end{bmatrix} \begin{bmatrix} \tilde{\sigma}_g^2 \\ \tilde{\sigma}_e^2 \end{bmatrix} = \begin{bmatrix} y^T V K V y \\ y^T V y \end{bmatrix}. \tag{14}$$

Here $J_B$ is a randomized estimator of $\mathrm{tr}[VKVK]$ analogous to Equation (10). The cost of computing the RHE-reg estimator now includes the cost of computing the inverse of $W^T W$ as well as multiplying $W$ by a real-valued vector, for an added computational cost of $O(C^3 + NC)$. Typically, the number of covariates $C$ is small (tens to hundreds) so that the presence of covariates does not greatly increase the computational burden.

  4. The variance components model [Equations (3) and (5)] can be extended in a straightforward manner to more than two variance components. A number of recent studies have explored the utility of these models to partition heritability based on functional annotations as well as other categories.

  5. The accuracy and the runtime of RHE-reg depend on the choice of the number of random vectors $B$. In practice, we find that the estimator is highly accurate with a small $B \approx 100$ even for moderate sample sizes $N \approx 5000$, as we show empirically (Fig. 2). Further, for larger sample sizes, even smaller values of $B$ should be adequate. It is also possible to choose increasing values of $B$ and to terminate when the estimate of $\mathrm{tr}[K^2]$ does not change considerably. We have not explored this option in detail in this work.
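For remark 3 (covariates), the moment conditions of Equation (14) can be sketched without ever forming the $N \times N$ projection matrix $V$. The code below is an illustrative assumption, not the authors' implementation: it applies $V$ through a thin QR factorization of $W$ (which assumes $W$ has full column rank), and it uses the identity $\mathrm{tr}[VKVK] = E\|VKVz\|_2^2$, which holds because $V$ is symmetric and idempotent. The names `make_projector` and `rhe_reg_covariates` are hypothetical.

```python
import numpy as np

def make_projector(W):
    """Return a function applying V = I - W (W^T W)^{-1} W^T without forming V."""
    Q, _ = np.linalg.qr(W)                # thin QR; assumes W has full column rank
    return lambda a: a - Q @ (Q.T @ a)

def rhe_reg_covariates(X, y, W, B=100, seed=0):
    """Solve the moment conditions of Equation (14) with a randomized J_B."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    C = W.shape[1]
    apply_V = make_projector(W)
    Vy = apply_V(y)
    VZ = apply_V(rng.standard_normal((N, B)))
    VKVZ = apply_V(X @ (X.T @ VZ)) / M    # columns are V K V z_b
    J_B = np.sum(VKVZ ** 2) / B           # estimates tr[VKVK]
    tr_VK = np.sum(apply_V(X) ** 2) / M   # tr[VK] = ||V X||_F^2 / M
    w = X.T @ Vy
    A = np.array([[J_B, tr_VK], [tr_VK, float(N - C)]])
    rhs = np.array([w @ w / M, Vy @ Vy])  # y^T V K V y and y^T V y
    sigma_g2, sigma_e2 = np.linalg.solve(A, rhs)
    return sigma_g2, sigma_e2, J_B

# Toy data: an intercept plus two made-up covariates.
rng = np.random.default_rng(2)
N, M = 300, 600
G = rng.binomial(2, rng.uniform(0.1, 0.9, size=M), size=(N, M))
X = (G - G.mean(axis=0)) / G.std(axis=0)
W = np.column_stack([np.ones(N), rng.standard_normal((N, 2))])
y = (W @ np.array([1.0, 0.5, -0.3])
     + X @ rng.normal(0.0, np.sqrt(0.5 / M), size=M)
     + rng.normal(0.0, np.sqrt(0.5), size=N))

sigma_g2, sigma_e2, J_B = rhe_reg_covariates(X, y, W)
print(f"J_B = {J_B:.1f}, sigma_g2 = {sigma_g2:.3f}, sigma_e2 = {sigma_e2:.3f}")
```

Computing `apply_V(X)` costs $O(NMC)$ in this sketch; it is written this way for readability rather than to match the cost quoted in the text.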

Fig. 2.

Impact of the number of random vectors on the accuracy of RHE-reg: we ran RHE-reg with a different number of random vectors $B$, and compared the point estimate and standard error to GCTA. The gray area indicates the standard error computed by GCTA. As RHE-reg uses more random vectors, the estimate converges. In fact, even with 10 random vectors, the point estimate is accurate

3 Results

3.1 Simulations

We performed simulations to compare the performance of RHE-reg with other methods for heritability estimation in terms of accuracy, running time and memory usage. We compared RHE-reg to methods for computing REML estimates, such as GCTA (Yang et al., 2011) (which implements an exact numerical optimization algorithm to compute the REML), as well as implementations of HE-regression.

3.2 Accuracy

In our first set of simulations, we compared the accuracy of RHE-reg to our implementation of exact HE-regression as well as GCTA, an implementation that computes the REML. We simulated genotypes assuming each SNP is drawn independently from a binomial distribution with allele frequency that is sampled uniformly from the interval (0, 1). Given the genotypes, we simulated phenotypes under an infinitesimal model, i.e. with effect size at each SNP drawn independently from a normal distribution with mean zero and variance equal to the heritability divided by the number of SNPs. We considered three values for the true SNP heritability of the phenotype: 0.2, 0.5 and 0.8.
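The simulation design described above can be sketched as follows. Several details are assumptions added for the example and are not stated in the text: the residual variance is set to $1 - h^2$ so that the phenotypic variance is close to 1, SNPs that happen to be monomorphic in the sample are dropped so that standardization is defined, and the function name `simulate` is hypothetical.

```python
import numpy as np

def simulate(N, M, h2, seed=0):
    """Simulate genotypes and an infinitesimal-model phenotype."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.0, 1.0, size=M)         # allele frequencies in (0, 1)
    G = rng.binomial(2, freqs, size=(N, M))       # independent SNPs
    G = G[:, G.std(axis=0) > 0]                   # assumption: drop monomorphic SNPs
    X = (G - G.mean(axis=0)) / G.std(axis=0)      # standardized genotypes
    M_kept = X.shape[1]
    beta = rng.normal(0.0, np.sqrt(h2 / M_kept), size=M_kept)
    eps = rng.normal(0.0, np.sqrt(1.0 - h2), size=N)  # assumption: Var(eps) = 1 - h2
    y = X @ beta + eps
    return G, X, y

G, X, y = simulate(N=200, M=300, h2=0.5, seed=1)
print(G.shape, X.shape, y.shape)
```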

In our first series of experiments, we fixed the number of SNPs at M = 10 000 and varied the number of individuals N = 1k, 2k, …, 10k. In the second series of experiments, we varied the number of SNPs M = 1k, 2k, …, 10k while fixing the number of individuals to be N = 10 000. We repeated each experiment 100 times in order to assess the variance of each of the estimators. We estimated heritability using RHE-reg with B = 100 random vectors.

Figure 1 compares the estimates of each of the three methods (RHE-reg, HE-regression and GCTA) to the true heritability. First, we observe that all three methods obtain estimates of heritability that are quite close to each other as well as to the true heritability across the range of parameters explored. Second, RHE-reg and HE-regression are virtually indistinguishable in the variance of their estimates in each configuration. This suggests that the randomization makes a negligible contribution to the statistical accuracy of the MoM estimators. In some cases, RHE-reg even has a smaller variance than HE-regression. Third, as expected, REML obtains estimates that are closer to the true heritability compared to either of the MoM estimators for a high value of true heritability. For lower values of true heritability ($h^2 = 0.20$, $h^2 = 0.50$), the estimates from REML, HE-regression and RHE-reg are comparable. This result is also expected given that REML is asymptotically equivalent to MoM when the phenotypic correlation between individuals is small (Sham et al., 2000; Sham and Purcell, 2001). Finally, the sample size has a bigger effect than the number of SNPs on the accuracy of each of the methods, consistent with theory (Visscher et al., 2014).

3.3 Computational efficiency

In the second set of simulations, we compared the runtime and memory usage of different methods. We compared RHE-reg to two REML methods, GCTA (Yang et al., 2011) and BOLT-REML (Loh et al., 2015a) (a computationally efficient approximate method to compute the REML) as well as an exact MoM method MMHE (Ge et al., 2017). In this experiment, we simulated genotype data consisting of 100 000 SNPs over sample sizes of N=10k,20k,30k,50k,100k and 500 k and then simulated phenotypes corresponding to the genotype data. For each dataset, we ran RHE-reg with B = 100 random vectors. We performed all comparisons on an Intel(R) Xeon(R) CPU 2.10 GHz server with 128 GB RAM. All computations were restricted to a single core, capped to a maximum runtime of 12 h and a maximum memory of 128 GB.

Figure 3 shows that both GCTA and MMHE do not scale to large sample sizes due to the requirement of computing and operating on a GRM that scales quadratically with N. GCTA could not complete its computation when running on N=100K individuals while MMHE did not complete its computation on N=50K. BOLT-REML and RHE-reg scale linearly with sample size. However, RHE-reg is an order of magnitude faster than BOLT-REML. For example, on a dataset of a size of 500 K individuals, RHE-reg computed the heritability in about 30 min compared to 400 min for BOLT-REML. Figure 3 shows that RHE-reg is memory efficient as well.

Fig. 3.

RHE-reg is efficient: we measured the run time and memory usage of methods for heritability estimation as a function of the number of samples while fixing the number of SNPs to 100 000. We performed all comparisons on an Intel(R) Xeon(R) CPU 2.10 GHz server with 128 GB RAM. All computations were restricted to a single core, capped to a maximum runtime of 12 h and a maximum memory of 128 GB. In (a), GCTA could not finish computation on 100 K samples. For MMHE, the computation stopped at sample size of 50 k due to memory constraints. Although BOLT-REML scales linearly, RHE-reg is significantly faster. In (b), we observe RHE-reg and BOLT-REML have scalable memory requirements

3.4 Application to real data

We compared the statistical accuracy and runtime of BOLT-REML, GCTA and RHE-reg on the Northern Finland Birth Cohort (NFBC) dataset. The NFBC dataset contains 315 529 SNPs and 5326 individuals after applying standard filters (minor allele frequency >0.05 and Hardy–Weinberg equilibrium P-value <0.01) (Sabatti et al., 2009). We applied these methods to estimate the heritability of three phenotypes that were assayed in this dataset: triglycerides (TGs), high-density lipoprotein (HDL) and body mass index (BMI).

We compared the runtime, point estimates of the heritability as well as standard errors for each of the three methods. We computed RHE-reg with B = 100 random vectors. As shown in Table 1, the heritability estimates of RHE-reg are concordant with the other methods while being an order of magnitude faster to compute. We note that the NFBC dataset has a sample size $N \approx 5000$ so that we expect RHE-reg to be more accurate on larger datasets. The standard error estimates can also be computed in sub-linear time (see Appendix A4).

Table 1.

The estimates of heritability from RHE-reg are consistent with those from GCTA and BOLT-REML on the NFBC data while RHE-reg is substantially faster

| Phenotype | GCTA runtime (min) | GCTA $h_g^2$ (SE) | BOLT-REML runtime (min) | BOLT-REML $h_g^2$ (SE) | RHE-reg runtime (min) | RHE-reg $h_g^2$ (SE) |
|---|---|---|---|---|---|---|
| TG | 11.28 | 0.145 (0.051) | 8.87 | 0.148 (0.051) | 1.61 | 0.145 (0.052) |
| HDL | 10.81 | 0.325 (0.051) | 9.72 | 0.326 (0.051) | 1.30 | 0.349 (0.052) |
| BMI | 10.85 | 0.237 (0.051) | 9.29 | 0.235 (0.051) | 1.29 | 0.200 (0.052) |

Note: We estimate the heritability of phenotypes such as triglycerides (TGs), high-density lipoprotein (HDL) and body mass index (BMI) in the NFBC data set.

3.5 Understanding the computational efficiency of RHE-reg

Our implementation of RHE-reg relies on two ideas to obtain computational efficiency: (i) the use of a randomized estimator of the trace, and (ii) the Mailman algorithm for fast matrix-vector multiplication. To explore the contribution of each of these ideas, we compared the runtimes of a MoM estimator with no randomization (HE-reg), RHE-reg using standard matrix-vector multiplication and RHE-reg using the Mailman algorithm. Table 2 shows the runtimes of each of these variants on the NFBC data. We see that the biggest runtime gain arises from applying the randomized estimator (faster by a factor of roughly 10–14 relative to HE-reg) while the application of the Mailman algorithm reduces the runtime further by a factor of about 2 (Table 2).

Table 2.

The major gain in computational efficiency arises from the application of the randomized trace estimate

| Phenotype | RHE-reg runtime (min) | No Mailman (min) | No randomized trace estimate (min) |
|---|---|---|---|
| TG | 1.61 | 3.70 | 38.5 |
| HDL | 1.30 | 2.60 | 36.2 |
| BMI | 1.29 | 2.68 | 36.7 |

Note: We compare the run time for HE-reg as well as the run time for RHE-reg that does not rely on the Mailman algorithm.

3.6 Accuracy of RHE-reg as a function of the number of random vectors B

To explore the impact of the choice of the number of random vectors B on the accuracy of RHE-reg, we compared the heritability estimates of RHE-reg to those obtained from GCTA for the TG phenotype as a function of B. We find good concordance between the estimates from RHE-reg and GCTA even for values of B as low as 10 suggesting that RHE-reg could be even faster in practice with little loss in accuracy (see Fig. 2).

4 Discussion

We proposed a scalable estimator of heritability which is a randomized version of the Haseman–Elston regression (RHE-reg) estimator. The RHE-reg estimator is based on performing a small number of multiplications of the genotype matrix with random vectors with mean zero and identity covariance. Using the properties of the genotype matrix, we can compute this estimator using the Mailman algorithm in $O\left(\frac{NMB}{\max(\log_3 N,\, \log_3 M)}\right)$ time on a dataset containing $N$ individuals, $M$ SNPs and with a small number $B$ of random vectors. We show that this estimator achieves similar accuracy as REML-based methods on both simulated and real data. RHE-reg can be effectively applied to whole-genome genotype data of hundreds of thousands of individuals for rapid variance components estimation. Furthermore, RHE-reg is an unbiased estimator and thus can also be applied to ascertained case-control studies.

Acknowledgements

We thank Xiang Zhou and the reviewers for their valuable feedback.

Funding

This work was supported in part by NIH grants R00GM111744, R35GM125055, NSF Grant III-1705121, an Alfred P. Sloan Research Fellowship, and a gift from the Okawa Foundation.

Conflict of Interest: none declared.

Appendix

A1. Method-of-Moments

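The MoM principle obtains estimates of the model parameters such that the theoretical moments match the sample moments. In our model, the first theoretical moment, $E[y]$, is 0 by definition while the corresponding sample moment is also zero since we centered the phenotypes. The second sample moment is $yy^T$ and the second theoretical moment is $\mathrm{cov}(y) = \sigma_g^2 K + \sigma_e^2 I_N$. Thus, the MoM estimator of $(\sigma_g^2, \sigma_e^2)$ is obtained by searching for values of $\sigma_g^2, \sigma_e^2$ such that the sample and theoretical moments are close, i.e. by solving an ordinary least squares (OLS) problem:

$$(\hat{\sigma}_g^2, \hat{\sigma}_e^2) = \operatorname*{argmin}_{\sigma_g^2, \sigma_e^2} \left\| yy^T - (\sigma_g^2 K + \sigma_e^2 I) \right\|_F^2.$$

Since the Frobenius norm of a matrix $A$ is $\|A\|_F = \sqrt{\mathrm{tr}[AA^T]}$, the OLS problem can be re-written as:

$$(\hat{\sigma}_g^2, \hat{\sigma}_e^2) = \operatorname*{argmin}_{\sigma_g^2, \sigma_e^2} \mathrm{tr}\left[ \left(yy^T - (\sigma_g^2 K + \sigma_e^2 I)\right)\left(yy^T - (\sigma_g^2 K + \sigma_e^2 I)\right)^T \right]$$

which, after setting the derivatives with respect to $\sigma_g^2$ and $\sigma_e^2$ to zero, leads to the normal equations in Equation (7).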
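For completeness, the step from the trace form of the OLS objective to the normal equations can be written out. The derivation below is a sketch added for illustration; the symbol $R$ for the residual matrix is introduced here and does not appear in the text. It uses only the symmetry of $K$ and the linearity of the trace.

```latex
\begin{aligned}
R &\equiv yy^T - \sigma_g^2 K - \sigma_e^2 I, \qquad
f(\sigma_g^2, \sigma_e^2) = \mathrm{tr}[R R^T] = \mathrm{tr}[R^2], \\
\frac{\partial f}{\partial \sigma_g^2} &= -2\,\mathrm{tr}[K R] = 0
  \;\Longrightarrow\;
  \sigma_g^2\,\mathrm{tr}[K^2] + \sigma_e^2\,\mathrm{tr}[K] = \mathrm{tr}[K yy^T] = y^T K y, \\
\frac{\partial f}{\partial \sigma_e^2} &= -2\,\mathrm{tr}[R] = 0
  \;\Longrightarrow\;
  \sigma_g^2\,\mathrm{tr}[K] + \sigma_e^2\,N = \mathrm{tr}[yy^T] = y^T y.
\end{aligned}
```

Stacking the two conditions gives the $2 \times 2$ linear system of Equation (7).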

A2. Randomized estimator of trace of a matrix

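For an $N \times N$ matrix $A$, a randomized estimator of $\mathrm{tr}[A]$ is $\widehat{\mathrm{tr}[A]} \equiv \frac{1}{B}\sum_b z_b^T A z_b$, where $z_b$ are i.i.d. random vectors with each entry drawn from a standard normal distribution. To see this:

$$\begin{aligned}
E[z^T A z] &= E[\mathrm{tr}(z^T A z)] && z^T A z \text{ is a scalar} \\
&= E[\mathrm{tr}[z z^T A]] && \text{cyclic property of the trace} \\
&= \mathrm{tr}[E[z z^T A]] && \text{trace and expectation are linear} \\
&= \mathrm{tr}[E[z z^T] A] && A \text{ is fixed} \\
&= \mathrm{tr}[A] && \text{using the distributional assumptions on } z.
\end{aligned}$$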
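The identity can be checked numerically. The sketch below is illustrative only: `randomized_trace` is a hypothetical helper that takes a function computing $Az$, so the matrix itself never needs to be stored, and the test matrix is an arbitrary symmetric example.

```python
import numpy as np

def randomized_trace(matvec, n, B=100, seed=0):
    """Estimate tr[A] as the mean of z_b^T (A z_b) over B standard normal vectors."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(B):
        z = rng.standard_normal(n)
        total += z @ matvec(z)            # one sample of z^T A z
    return total / B

rng = np.random.default_rng(0)
n = 200
S = rng.standard_normal((n, n))
A = S @ S.T / n                           # arbitrary symmetric test matrix
estimate = randomized_trace(lambda z: A @ z, n, B=500)
print(f"estimate = {estimate:.2f}, exact = {np.trace(A):.2f}")
```

Passing a `matvec` callable rather than the matrix mirrors how RHE-reg applies $K^2$ through products with $X$ and $X^T$ instead of forming $K$.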

A3. Bias of the RHE-reg estimator

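Our estimator of $\mathrm{tr}[K^2]$ is $L_B \equiv \widehat{\mathrm{tr}[K^2]} = \frac{1}{B}\sum_b z_b^T K K z_b$. The RHE-reg estimators for $(\sigma_g^2, \sigma_e^2)$ are given by:

$$\begin{bmatrix} \tilde{\sigma}_g^2 \\ \tilde{\sigma}_e^2 \end{bmatrix} = A^{-1} \begin{bmatrix} y^T K y \\ y^T y \end{bmatrix} \quad \text{where} \quad A = \begin{bmatrix} L_B & N \\ N & N \end{bmatrix}.$$

We first compute the expectation of this estimator:

$$E\begin{bmatrix} \tilde{\sigma}_g^2 \\ \tilde{\sigma}_e^2 \end{bmatrix} = E\left[ A^{-1} \begin{bmatrix} y^T K y \\ y^T y \end{bmatrix} \right] = E[A^{-1}]\, E\begin{bmatrix} y^T K y \\ y^T y \end{bmatrix}$$

since the random vectors $z_b$ and $y$ are independent.

We know that $E[yy^T] = \mathrm{cov}(y) = \sigma_g^2 K + \sigma_e^2 I$. We can compute $E[y^T K y]$:

$$\begin{aligned}
E[y^T K y] &= E[\mathrm{tr}[y^T K y]] && y^T K y \text{ is a scalar} \\
&= E[\mathrm{tr}[y y^T K]] && \text{cyclic property of the trace} \\
&= \mathrm{tr}[E[y y^T K]] && \text{expectation and trace are linear} \\
&= \mathrm{tr}[E[y y^T] K] && \text{as } K \text{ is constant} \\
&= \mathrm{tr}[\sigma_g^2 K^2 + \sigma_e^2 K] \\
&= \sigma_g^2\, \mathrm{tr}[K^2] + N \sigma_e^2 && \text{using } \mathrm{tr}[K] = N.
\end{aligned}$$

And for $E[y^T y]$, we have:

$$\begin{aligned}
E[y^T y] &= E[\mathrm{tr}[y^T y]] && y^T y \text{ is a scalar} \\
&= E[\mathrm{tr}[y y^T]] && \text{cyclic property of the trace} \\
&= \mathrm{tr}[E[y y^T]] && \text{expectation and trace are linear} \\
&= \mathrm{tr}[K]\,\sigma_g^2 + N \sigma_e^2 \\
&= N \sigma_g^2 + N \sigma_e^2.
\end{aligned}$$

Defining $b \equiv E\left[\frac{1}{L_B - N}\right]$ and computing

$$A^{-1} = \begin{bmatrix} \frac{1}{L_B - N} & -\frac{1}{L_B - N} \\ -\frac{1}{L_B - N} & \frac{L_B}{N(L_B - N)} \end{bmatrix},$$

we have

$$E\begin{bmatrix} \tilde{\sigma}_g^2 \\ \tilde{\sigma}_e^2 \end{bmatrix} = E[A^{-1}]\, E\begin{bmatrix} y^T K y \\ y^T y \end{bmatrix} = \begin{bmatrix} b & -b \\ -b & \frac{1}{N} + b \end{bmatrix} \begin{bmatrix} \sigma_g^2\, \mathrm{tr}[K^2] + N \sigma_e^2 \\ N \sigma_g^2 + N \sigma_e^2 \end{bmatrix} = \begin{bmatrix} b\,(\mathrm{tr}[K^2] - N)\,\sigma_g^2 \\ b\,(N - \mathrm{tr}[K^2])\,\sigma_g^2 + \sigma_g^2 + \sigma_e^2 \end{bmatrix}.$$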

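We approximate $b = E\left[\frac{1}{L_B - N}\right]$ using a second-order Taylor expansion, $f(y) \approx f(x) + f'(x)(y - x) + \frac{1}{2} f''(x)(y - x)^2$. Let $X \equiv L_B - N$ (a scalar random variable in this appendix, not the genotype matrix), and thus $\mu_x = E[L_B - N] = \mathrm{tr}[K^2] - N$. We have $f(x) = \frac{1}{x}$, $f'(x) = -\frac{1}{x^2}$, $f''(x) = \frac{2}{x^3}$.

Thus:

$$\begin{aligned}
b = E[f(X)] &\approx E\left[ f(\mu_x) + f'(\mu_x)(X - \mu_x) + \frac{1}{2} f''(\mu_x)(X - \mu_x)^2 \right] \\
&= f(\mu_x) - \frac{1}{\mu_x^2} E[X - \mu_x] + \frac{1}{2}\,\frac{2}{\mu_x^3}\, E[(X - \mu_x)^2] \\
&= \frac{1}{\mu_x} + \frac{1}{\mu_x}\,\frac{\sigma_x^2}{\mu_x^2}
\end{aligned}$$

where $\sigma_x^2 = \mathrm{var}(X)$.

Thus $E\left[\frac{\mu_x}{X}\right] \approx 1 + \frac{\sigma_x^2}{\mu_x^2}$. Thus $E[\tilde{\sigma}_g^2] \approx \sigma_g^2 + \frac{\sigma_x^2}{\mu_x^2}\sigma_g^2$, $E[\tilde{\sigma}_e^2] \approx \sigma_e^2 - \frac{\sigma_x^2}{\mu_x^2}\sigma_g^2$ and $E[\tilde{\sigma}_g^2 + \tilde{\sigma}_e^2] = \sigma_g^2 + \sigma_e^2$.

For $\sigma_x^2$, we have:

$$\begin{aligned}
\sigma_x^2 = E[(L_B - \mathrm{tr}[K^2])^2] &= \mathrm{var}(L_B) = \mathrm{var}\left( \frac{1}{B}\sum_b z_b^T K^2 z_b \right) \\
&= \frac{1}{B^2}\sum_b \mathrm{var}(z_b^T K^2 z_b) && z_b \text{ are independent} \\
&= \frac{1}{B}\,\mathrm{var}\Big( \sum_{i,j} K_i^T K_j z_i z_j \Big) && z_b \text{ are identically distributed} \\
&= \frac{1}{B}\sum_i \|K_i\|^2 && \text{elements of } z \text{ are independent} \\
&= \frac{1}{B}\,\mathrm{tr}[K^2].
\end{aligned}$$

Here $K_i$ is the $i$th column of $K$.

Thus, substituting $\mu_x$ and $\sigma_x^2$, we get $E[\tilde{\sigma}_g^2] \approx \sigma_g^2 + \frac{1}{B}\,\frac{\mathrm{tr}[K^2]}{(\mathrm{tr}[K^2] - N)^2}\,\sigma_g^2 = \sigma_g^2 + \frac{1}{B}\,\frac{1}{\mathrm{tr}[K^2] - 2N + \frac{N^2}{\mathrm{tr}[K^2]}}\,\sigma_g^2$. The bias of the estimator decreases with a larger number of random vectors $B$.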

A4. Standard error estimate for the RHE-reg estimator

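We define $\mathrm{var}(y) \equiv \Sigma = \sigma_g^2 K + \sigma_e^2 I$. As we know, $\tilde{\sigma}_g^2 = \frac{y^T (K - I) y}{L_B - N}$. Let $\tilde{\sigma}_g^2 \equiv \frac{A}{B}$ where $A \equiv y^T (K - I) y$ and $B \equiv L_B - N$. Define $\mu_A \equiv E[A]$, $\mu_B \equiv E[B]$, $\sigma_A^2 \equiv \mathrm{var}(A)$ and $\sigma_B^2 \equiv \mathrm{var}(B)$. From Lemma 2 (Appendix A5), we have

$$\mathrm{var}(\tilde{\sigma}_g^2) = \mathrm{var}\left(\frac{A}{B}\right) \approx \frac{1}{(\mu_B)^2}\sigma_A^2 - \frac{2\mu_A}{(\mu_B)^3}\mathrm{cov}(A, B) + \frac{(\mu_A)^2}{(\mu_B)^4}\sigma_B^2 = \frac{1}{(\mu_B)^2}\sigma_A^2 + \frac{(\mu_A)^2}{(\mu_B)^4}\sigma_B^2$$

as $A$ and $B$ are independent. By using Lemma 1 (Appendix A5), we have:

$$\begin{aligned}
\mu_A &= E[y^T (K - I) y] = (\mathrm{tr}[K^2] - N)\,\sigma_g^2 \\
\sigma_A^2 &= \mathrm{var}(y^T (K - I) y) = 2\,\mathrm{tr}[\Sigma (K - I) \Sigma (K - I)] \\
\mu_B &= \mathrm{tr}[K^2] - N \\
\sigma_B^2 &= \frac{\mathrm{tr}[K^2]}{B},
\end{aligned}$$

where the $B$ in the last denominator is the number of random vectors.

Thus we have:

$$SE(\tilde{\sigma}_g^2) = \frac{1}{\mathrm{tr}[K^2] - N}\sqrt{2\,\mathrm{tr}[\Sigma (K - I) \Sigma (K - I)] + \frac{1}{B}(\sigma_g^2)^2\,\mathrm{tr}[K^2]}.$$

In order to estimate the standard error of $\tilde{\sigma}_g^2$, we use the plug-in estimator:

$$\widehat{SE(\tilde{\sigma}_g^2)} = \frac{1}{L_B - N}\sqrt{2\,\mathrm{tr}[y y^T (K - I) \Sigma (K - I)] + \frac{1}{B}(\tilde{\sigma}_g^2)^2 L_B}. \tag{15}$$

Each term in this estimator can be efficiently computed in $O\left(\frac{NMB}{\max(\log_3 N,\, \log_3 M)}\right)$ time.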
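One way to see why Equation (15) needs only matrix-vector products is to note that $\mathrm{tr}[yy^T (K - I)\Sigma(K - I)] = u^T \Sigma u$ with $u = (K - I)y$. The sketch below is a hedged transcription of Equation (15), not the authors' implementation: it assumes that $\Sigma$ is replaced by the plug-in $\tilde{\sigma}_g^2 K + \tilde{\sigma}_e^2 I$, and the function name `rhe_se` and the toy data are hypothetical.

```python
import numpy as np

def rhe_se(X, y, sigma_g2, sigma_e2, L_B, B):
    """Plug-in standard error of Equation (15) using only products with X."""
    N, M = X.shape
    y = y - y.mean()
    u = X @ (X.T @ y) / M - y                     # u = (K - I) y
    # u^T Sigma u with the assumed plug-in Sigma = sigma_g2 * K + sigma_e2 * I
    quad = sigma_g2 * np.sum((X.T @ u) ** 2) / M + sigma_e2 * (u @ u)
    return np.sqrt(2.0 * quad + sigma_g2 ** 2 * L_B / B) / (L_B - N)

# Toy data and RHE-reg point estimates (hypothetical sizes).
rng = np.random.default_rng(1)
N, M, B = 400, 400, 100
G = rng.binomial(2, rng.uniform(0.1, 0.9, size=M), size=(N, M))
X = (G - G.mean(axis=0)) / G.std(axis=0)
y = X @ rng.normal(0.0, np.sqrt(0.5 / M), size=M) + rng.normal(0.0, np.sqrt(0.5), size=N)
y = y - y.mean()
Z = rng.standard_normal((N, B))
L_B = np.sum((X @ (X.T @ Z)) ** 2) / (B * M ** 2)
w = X.T @ y
A = np.array([[L_B, N], [N, N]], dtype=float)
sigma_g2, sigma_e2 = np.linalg.solve(A, np.array([w @ w / M, y @ y]))

se = rhe_se(X, y, sigma_g2, sigma_e2, L_B, B)
print(f"sigma_g2 = {sigma_g2:.3f} (SE {se:.3f})")
```

If either variance component estimate is negative, the quantity under the square root is not guaranteed to be non-negative, so a practical implementation would need to handle that case.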

A5. Useful identities

Lemma 1. For a random vector z that is distributed according to a multivariate normal distribution: zN(0,C) and for symmetric matrices A and B.

cov(zTAz,zTBz)=2tr[CACB].

Thus

E[(zTAz)(zTBz)]=2tr[CACB]+E[(zTAz)]E[(zTBz)]=2tr[CACB]+tr[AC]tr[BC].

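Lemma 2. For two random variables $A$ and $B$, where $B$ is either discrete or has support $(0, \infty)$, and $\mathbb{E}[A] = \mu_A$, $\mathbb{E}[B] = \mu_B$, we have

$$\operatorname{var}\left(\frac{A}{B}\right) \approx \frac{1}{\mu_B^2}\operatorname{var}(A) - \frac{2\mu_A}{\mu_B^3}\operatorname{cov}(A, B) + \frac{\mu_A^2}{\mu_B^4}\operatorname{var}(B).$$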
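To get a feel for the accuracy of the approximation in Lemma 2, the following sketch compares it with a simulated variance of $A/B$ for independent $A$ and $B$, the case used in Appendix A4. The distributions (a normal numerator and a gamma denominator with small relative spread) and the function names are assumptions chosen purely for illustration.

```python
import numpy as np


def delta_method_var_ratio(mu_a, mu_b, var_a, var_b, cov_ab=0.0):
    """Approximate var(A/B) using the expansion in Lemma 2."""
    return (var_a / mu_b ** 2
            - 2.0 * mu_a * cov_ab / mu_b ** 3
            + mu_a ** 2 * var_b / mu_b ** 4)


def monte_carlo_var_ratio(n_samples, rng):
    """Simulated var(A/B) for independent A and B (illustrative choice)."""
    a = rng.normal(5.0, 1.0, n_samples)      # mean 5, variance 1
    b = rng.gamma(400.0, 0.025, n_samples)   # mean 10, variance 0.25, support (0, inf)
    return np.var(a / b)


if __name__ == "__main__":
    approx = delta_method_var_ratio(5.0, 10.0, 1.0, 0.25)
    simulated = monte_carlo_var_ratio(200_000, np.random.default_rng(0))
    print(approx, simulated)
```

The approximation is most accurate when $B$ is concentrated around its mean; for a denominator with large relative variance the neglected higher-order terms can matter.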

References

  1. Almasy L., Blangero J. (1998) Multipoint quantitative-trait linkage analysis in general pedigrees. Am. J. Hum. Genet., 62, 1198–1211.
  2. Bulik-Sullivan B. (2015) Relationship between LD score and Haseman-Elston regression. bioRxiv, 018283.
  3. Chen G.-B. (2014) Estimating heritability of complex traits from genome-wide association studies using IBS-based Haseman–Elston regression. Front. Genet., 5, 107.
  4. Chen W.-M. et al. (2004) Quantitative trait linkage analysis by generalized estimating equations: unification of variance components and Haseman-Elston regression. Genet. Epidemiol., 26, 265–272.
  5. Elston R.C. et al. (2000) Haseman and Elston revisited. Genet. Epidemiol., 19, 1–17.
  6. Ge T. et al. (2017) Phenome-wide heritability analysis of the UK Biobank. PLoS Genet., 13, e1006711.
  7. Golan D. et al. (2014) Measuring missing heritability: inferring the contribution of common variants. Proc. Natl. Acad. Sci., 111, E5272–E5281.
  8. Haseman J., Elston R. (1972) The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet., 2, 3–19.
  9. Houle D. (1992) Comparing evolvability and variability of quantitative traits. Genetics, 130, 195–204.
  10. Kang H.M. et al. (2008a) Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Genetics, 180, 1909–1925.
  11. Kang H.M. et al. (2008b) Efficient control of population structure in model organism association mapping. Genetics, 178, 1709–1723.
  12. Liberty E., Zucker S.W. (2009) The Mailman algorithm: a note on matrix–vector multiplication. Inf. Process. Lett., 109, 179–182.
  13. Lippert C. et al. (2011) FaST linear mixed models for genome-wide association studies. Nat. Methods, 8, 833–835.
  14. Loh P.-R. et al. (2015a) Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis. Nat. Genet., 47, 1385.
  15. Loh P.-R. et al. (2015b) Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet., 47, 284.
  16. Makowsky R. et al. (2011) Beyond missing heritability: prediction of complex traits. PLoS Genet., 7, e1002051.
  17. Pirinen M. et al. (2013) Efficient computation with a linear mixed model on large-scale data sets with applications to genetic studies. Ann. Appl. Stat., 7, 369–390.
  18. Sabatti C. et al. (2009) Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nat. Genet., 41, 35–46.
  19. Sham P., Purcell S. (2001) Equivalence between Haseman-Elston and variance-components linkage analyses for sib pairs. Am. J. Hum. Genet., 68, 1527–1532.
  20. Sham P. et al. (2000) Power of linkage versus association analysis of quantitative traits, by use of variance-components models, for sibship data. Am. J. Hum. Genet., 66, 1616–1630.
  21. Speed D. et al. (2012) Improved heritability estimation from genome-wide SNPs. Am. J. Hum. Genet., 91, 1011–1021.
  22. Visscher P.M. et al. (2008) Heritability in the genomics era? Concepts and misconceptions. Nat. Rev. Genet., 9, 255.
  23. Visscher P.M. et al. (2014) Statistical power to detect genetic (co)variance of complex traits using SNP data in unrelated samples. PLoS Genet., 10, e1004269.
  24. Wray N.R. et al. (2013) Pitfalls of predicting complex traits from SNPs. Nat. Rev. Genet., 14, 507.
  25. Yang J. et al. (2010) Common SNPs explain a large proportion of the heritability for human height. Nat. Genet., 42, 565.
  26. Yang J. et al. (2011) GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet., 88, 76–82.
  27. Yu J. et al. (2006) A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet., 38, 203.
  28. Zhou X., Stephens M. (2014) Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nat. Methods, 11, 407.
  29. Zhou X. et al. (2013) Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genet., 9, e1003264.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
