Author manuscript; available in PMC: 2018 Sep 26.
Published in final edited form as: Stat Appl Genet Mol Biol. 2017 Sep 26;16(4):259–273. doi: 10.1515/sagmb-2016-0076

Confidence intervals for heritability via Haseman-Elston regression

Tamar Sofer 1
PMCID: PMC5857391  NIHMSID: NIHMS922922  PMID: 28862991

Abstract

Heritability is the proportion of phenotypic variance in a population that is attributable to individual genotypes. Heritability is considered an important measure in both evolutionary biology and in medicine, and is routinely estimated and reported in genetic epidemiology studies. In population-based genome-wide association studies (GWAS), mixed models are used to estimate variance components, from which a heritability estimate is obtained. The estimated heritability is the proportion of the model’s total variance that is due to the genetic relatedness matrix (kinship measured from genotypes). Current practice is to use bootstrapping, which is slow, or normal asymptotic approximation to estimate the precision of the heritability estimate; however, this approximation fails to hold near the boundaries of the parameter space or when the sample size is small. In this paper we propose to estimate variance components via a Haseman-Elston regression, find the asymptotic distribution of the variance components and proportions of variance, and use them to construct confidence intervals (CIs). Our method is further developed to estimate unbiased variance components and construct CIs by meta-analyzing information from multiple studies. We demonstrate our approach on data from the Hispanic Community Health Study/Study of Latinos (HCHS/SOL).

1 Introduction

Heritability is the proportion of phenotypic variance that is due to genetic variation among individuals in a population. Heritability is often estimated using mixed models (Zaitlen and Kraft, 2012), where the genetic relatedness between any two individuals in a given study population is estimated (e.g. kinship coefficients may be calculated from GWAS data, or inferred from pedigrees) and then taken as fixed. Then, a variance component due to genetic variation is estimated, and the estimated heritability is the ratio between this variance component and the total variance in the model.

Inference about heritability when estimated from mixed models, and more generally, about other proportions of variance, usually relies on asymptotic normal approximation to the distribution of the estimators. However, multiple studies showed (e.g., Burch (2011), Kruijer et al. (2015)) that such confidence intervals are inaccurate, and may yield values that are not permissible (e.g. negative values). Recently, Schweiger et al. (2016) proposed a bootstrap approach that does not rely on asymptotic normality for estimating confidence intervals for heritability, and a numerical approximation that does not require bootstrapping under a specific way of calculating the genetic relatedness matrix. While they show that their confidence intervals are accurate, their method is limited by computation time, by requiring a single modeled variance parameter, and by requiring a specific form for the genetic relatedness matrix when using the numerical approximation. In addition, current meta-analysis approaches for heritability estimates rely on the inaccurate normal asymptotic approximation.

In this work we propose to use Haseman-Elston regression for estimating variance components. This approach entails regressing multiplied residuals against entries of covariance matrices. We find the distribution of the variance component estimators as well as the distributions of the proportions of variance, in a general model that allows for multiple sources of variation. We provide an algorithm to estimate the confidence intervals, and to obtain an unbiased meta-analytic estimator of heritability that accurately combines information from multiple studies. In the case where genetic relatedness (or kinship) is the only source of variation, our algorithm is very quick, with the computationally demanding step being the calculation of eigenvalues from a sub-matrix of the kinship matrix. We demonstrate our method by estimating heritability and proportion of variance attributed to household and community sharing for 47 health outcomes in the Hispanic Community Health Study/Study of Latinos.

2 Materials and methods

2.1 Haseman-Elston regression

Suppose that a quantitative trait Y, measured on n individuals, follows the regression model

$$y_i = w_i^T\beta + b_{i,a} + \ldots + b_{i,k} + e_i = w_i^T\beta + \varepsilon_i, \quad i = 1, \ldots, n$$

with β a vector of fixed effects of a covariate vector $w_i$; $b_{i,l}$, $l = a, \ldots, k$, $i = 1, \ldots, n$ are mean-zero random effects with $b_l = (b_{1,l}, \ldots, b_{n,l})^T$ and $\mathrm{cov}(b_l) = \sigma_l^2 L$, so that $\sigma_a^2, \ldots, \sigma_k^2$ are variances corresponding to the independent sources of variation a, …, k, and A, …, K are n × n matrices with i, j entries $a_{i,j}, \ldots, k_{i,j}$ modeling the correlation between the individuals’ random effects. Also, $e_i$, i = 1, …, n are independent errors with variance $\sigma_e^2$. In genetic association studies one of these matrices, say K, is a kinship, or genetic relatedness, matrix. Then

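$$E[y_i - w_i^T\beta] = E[\varepsilon] = 0,$$
$$\mathrm{var}[\varepsilon] = \sigma_e^2 I_{n \times n} + \sigma_a^2 A + \ldots + \sigma_k^2 K = \Sigma, \text{ and}$$
$$E[\varepsilon_i\varepsilon_j] = \mathrm{cov}(\varepsilon_i, \varepsilon_j) = \sigma_e^2 1(i = j) + \sigma_a^2 a_{i,j} + \ldots + \sigma_k^2 k_{i,j},$$

where here $\sigma_k^2/(\sigma_a^2 + \ldots + \sigma_k^2 + \sigma_e^2) = \sigma_k^2/\sigma_T^2$ is the heritability.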

Let β̂ be an unbiased estimator of β, and let $\hat\varepsilon_i = y_i - w_i^T\hat\beta$ be an estimator of $\varepsilon_i$, i = 1, …, n. We estimate the variance components in a residual regression, i.e. by taking the vector of all unique products of residual pairs $\hat\varepsilon_i\hat\varepsilon_j$, $i \le j$ (the upper triangular part of $\hat\varepsilon\hat\varepsilon^T$, including the diagonal), denoted by $\tilde\varepsilon_d$, and regressing it according to the above model. The regression design matrix is given by:

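$$
X = \begin{pmatrix}
1 & a_{1,1} & \cdots & k_{1,1} \\
0 & a_{1,2} & \cdots & k_{1,2} \\
\vdots & \vdots & & \vdots \\
0 & a_{1,n} & \cdots & k_{1,n} \\
1 & a_{2,2} & \cdots & k_{2,2} \\
0 & a_{2,3} & \cdots & k_{2,3} \\
\vdots & \vdots & & \vdots \\
0 & a_{2,n} & \cdots & k_{2,n} \\
\vdots & \vdots & & \vdots \\
1 & a_{n-1,n-1} & \cdots & k_{n-1,n-1} \\
0 & a_{n-1,n} & \cdots & k_{n-1,n} \\
1 & a_{n,n} & \cdots & k_{n,n}
\end{pmatrix}
= \begin{pmatrix}
1 & 1 & \cdots & 1 \\
0 & a_{1,2} & \cdots & k_{1,2} \\
\vdots & \vdots & & \vdots \\
0 & a_{1,n} & \cdots & k_{1,n} \\
1 & 1 & \cdots & 1 \\
0 & a_{2,3} & \cdots & k_{2,3} \\
\vdots & \vdots & & \vdots \\
0 & a_{2,n} & \cdots & k_{2,n} \\
\vdots & \vdots & & \vdots \\
1 & 1 & \cdots & 1 \\
0 & a_{n-1,n} & \cdots & k_{n-1,n} \\
1 & 1 & \cdots & 1
\end{pmatrix}
$$

where the second equality is because $a_{i,i} = \ldots = k_{i,i} = 1$ for all i. Denote the vector of variance components estimated from the Haseman-Elston regression by $\hat\sigma^2 = (\hat\sigma_e^2, \hat\sigma_a^2, \ldots, \hat\sigma_k^2)^T$.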
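The construction of the response vector and design matrix can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the function names (`he_design`, `solve`, `he_regression`) and the toy data are hypothetical, matrices are plain nested lists, and the small Gauss-Jordan solver stands in for any linear algebra library.

```python
def he_design(resid, mats):
    """Build the Haseman-Elston response and design matrix.

    resid: list of n residuals.
    mats:  list of n x n correlation matrices (nested lists, unit diagonal).
    Returns (X, y) with one row per unique pair i <= j.
    """
    n = len(resid)
    X, y = [], []
    for i in range(n):
        for j in range(i, n):
            # First column is the indicator 1(i = j) for the error variance.
            X.append([1.0 if i == j else 0.0] + [M[i][j] for M in mats])
            y.append(resid[i] * resid[j])
    return X, y


def solve(A, b):
    """Solve the square system A x = b (Gauss-Jordan, partial pivoting)."""
    n = len(A)
    M = [list(map(float, A[r])) + [float(b[r])] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n] for row in M]


def he_regression(resid, mats):
    """Least squares estimate (X^T X)^{-1} X^T y of (sigma2_e, sigma2_a, ...)."""
    X, y = he_design(resid, mats)
    p = len(X[0])
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    return solve(XtX, Xty)


# Toy example (made-up numbers): two related pairs among four individuals.
K = [[1.0, 0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.5],
     [0.0, 0.0, 0.5, 1.0]]
resid = [1.0, 0.2, -1.2, -0.3]
sigma2_e, sigma2_k = he_regression(resid, [K])
```

In this toy example the two estimates sum to the mean of the squared residuals, which is the behaviour described for the total variance in the next subsection.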

2.2 Properties of the variance components and proportions of variance estimators

Complete mathematical derivations are provided in the Supplementary Information. Below are statements of some of the results to provide intuition to the findings and methods.

Lemma 2

Variance component estimators corresponding to the matrices A, …, K depend only on the between-observation multiplied residuals of the form $\hat\varepsilon_i\hat\varepsilon_j$ for $i \neq j$.

Lemma 3

Denote by $\sigma_T^2 = \sigma_e^2 + \sigma_a^2 + \ldots + \sigma_k^2$. Then $\hat\sigma_T^2 = \frac{1}{n}\sum_{i=1}^n \hat\varepsilon_i^2$.

Theorem

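We say that two matrices C1 and C2 are orthogonal in the trace inner product, or “trace orthogonal”, if tr(C1C2) = 0. Let $\tilde{L}$ denote the matrix L with all diagonal values set to 0. If $\tilde{L}$ is trace orthogonal to all other matrices in the set {A, …, K} \ L (with their diagonal values likewise set to 0), then

$$\hat\sigma_l^2 = \frac{1}{\sum_{j>i} l_{i,j}^2}\sum_{j>i} l_{i,j}\hat\varepsilon_i\hat\varepsilon_j = \frac{\hat\varepsilon^T\tilde{L}\hat\varepsilon}{\mathrm{tr}(\tilde{L}\tilde{L})},$$

and the estimator of the proportion of variance modeled in L is the ratio between two quadratic forms given by:

$$\frac{\hat\sigma_l^2}{\hat\sigma_T^2} = \frac{\hat\varepsilon^T\tilde{L}\hat\varepsilon}{\frac{1}{n}\mathrm{tr}(\tilde{L}\tilde{L})\,\hat\varepsilon^T\hat\varepsilon}.$$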

The above theorem provides a closed form estimator for a variance component and the proportion of variance corresponding to a correlation matrix L when it represents either the only modeled source of variation in the model, or when it is orthogonal to all other modeled sources of variation. The formula in the theorem explicitly shows that in Haseman-Elston regression the estimator of the total variance of an observation equals the “natural” estimator, the mean sum of squares of the marginal regression residuals, and that variance estimators corresponding to correlation matrices depend only on between-sample correlations (and not within-sample variances). Lemma 2 in the supplementary material shows that the same holds when the various correlation matrices are not trace orthogonal (after setting the diagonal values to zero).
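The closed form in the Theorem is easy to check numerically. The sketch below assumes a single correlation matrix (so the orthogonality condition holds trivially); the function names and the toy numbers are hypothetical. It computes the estimator once as a sum over pairs and once as a quadratic form in the zero-diagonal version of the matrix.

```python
def he_single_matrix(resid, L):
    """Closed-form HE estimates for one correlation matrix L (nested lists).

    Returns (sigma2_l, sigma2_T, proportion of variance).
    """
    n = len(resid)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    num = sum(L[i][j] * resid[i] * resid[j] for i, j in pairs)
    den = sum(L[i][j] ** 2 for i, j in pairs)
    sigma2_l = num / den
    sigma2_T = sum(e * e for e in resid) / n  # mean of squared residuals
    return sigma2_l, sigma2_T, sigma2_l / sigma2_T


def quad_form_version(resid, L):
    """Same estimator written as e^T L0 e / tr(L0 L0), L0 = L with zero diagonal."""
    n = len(resid)
    L0 = [[0.0 if i == j else L[i][j] for j in range(n)] for i in range(n)]
    quad = sum(resid[i] * L0[i][j] * resid[j] for i in range(n) for j in range(n))
    trace = sum(L0[i][j] * L0[j][i] for i in range(n) for j in range(n))
    return quad / trace


# Toy example (made-up numbers).
L = [[1.0, 0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.5],
     [0.0, 0.0, 0.5, 1.0]]
resid = [1.0, 0.2, -1.2, -0.3]
sigma2_l, sigma2_T, proportion = he_single_matrix(resid, L)
```

Both versions agree because the quadratic form counts every off-diagonal pair twice in the numerator and in the trace, so the factor of two cancels.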

In general, the estimators of variance components are quadratic forms. It is possible to obtain closed form expressions in the more complicated case of multiple modeled sources of variation that are not orthogonal, but the form of the estimator is not as simple as in the case of orthogonality. We here provide intuition for these estimators. Suppose that the correlation between two matrices is the Pearson correlation between the vectorized matrices. Consider the symmetric matrix $X^TX$, where X is the design matrix of the Haseman-Elston regression, with the l, m entry (l, m = 1, …, k) being equal to tr(LM). It describes relationships between the matrices in our model (e.g. if L and M are trace orthogonal, the l, m entry will be zero). Moreover, its inverse matrix $(X^TX)^{-1}$ could be referred to as the “precision matrix” (Li and Gui, 2006), a term we adopt from the Gaussian graphical models literature. Here it means that the l, m entry ($l \neq m$) of $(X^TX)^{-1}$ represents the partial correlation between the matrices L and M given all other matrices in the model, so that if L and M are uncorrelated given the other correlation matrices, the corresponding entry of $(X^TX)^{-1}$ would be equal to zero. The quadratic form used for obtaining the estimator of the variance component corresponding to a matrix L ∈ {A, …, K} is of the form $Q = w_aA + \ldots + w_kK$, with $w_m$, m = 1, …, k, being equal to the l, m entry of $(X^TX)^{-1}$, or the partial correlation between L and M given all other matrices in the model.

2.3 Computation

2.3.1 Variance component estimators

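While any unbiased estimator of β suffices to generate residuals ε̂ and use them to obtain variance component estimators as $(X^TX)^{-1}X^T\tilde\varepsilon_d$, a more efficient estimator iterates between estimating β and σ² as follows:

  1. Initialization step: set $\hat\beta^{(0)} = (W^TW)^{-1}W^Ty$.

  2. Iteration step:
    1. Given the kth estimator of β, $\hat\beta^{(k)}$, set $\hat\varepsilon = y - W\hat\beta^{(k)}$. Let $\tilde\varepsilon$ denote the vector formed from the upper triangular part (including the diagonal) of $\hat\varepsilon\hat\varepsilon^T$. Set $\hat\sigma^{2,(k)} = (\hat\sigma_e^{2,(k)}, \hat\sigma_a^{2,(k)}, \ldots, \hat\sigma_k^{2,(k)})^T = (X^TX)^{-1}X^T\tilde\varepsilon$.
    2. Given the kth estimator of σ², $\hat\sigma^{2,(k)}$, let $\hat\Sigma^{(k)} = \hat\sigma_e^{2,(k)}I_{n \times n} + \hat\sigma_a^{2,(k)}A + \ldots + \hat\sigma_k^{2,(k)}K$ with inverse $\hat\Sigma^{-1,(k)}$. Set $\hat\beta^{(k+1)} = (W^T\hat\Sigma^{-1,(k)}W)^{-1}W^T\hat\Sigma^{-1,(k)}y$.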

The iteration step repeats until convergence.
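The two steps above can be sketched as follows for the special case of a single kinship matrix, where the variance components have the closed form given in the Theorem. This is an illustrative sketch under that assumption, not the author's code: the function names, the stopping tolerance, the iteration cap and the toy data are all hypothetical, and the hand-written solver is only there to keep the example dependency-free.

```python
def solve(A, B):
    """Solve A X = B for a matrix right-hand side (Gauss-Jordan, partial pivoting)."""
    n = len(A)
    M = [list(map(float, A[r])) + list(map(float, B[r])) for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]


def gls(W, y, Sigma):
    """beta = (W^T Sigma^{-1} W)^{-1} W^T Sigma^{-1} y."""
    n, p = len(W), len(W[0])
    Z = solve(Sigma, [W[i] + [y[i]] for i in range(n)])  # Sigma^{-1} [W | y]
    A = [[sum(W[i][a] * Z[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    b = [[sum(W[i][a] * Z[i][p] for i in range(n))] for a in range(p)]
    return [row[0] for row in solve(A, b)]


def he_step(resid, K):
    """Closed-form HE estimates (sigma2_e, sigma2_k) for one kinship matrix."""
    n = len(resid)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    s2k = (sum(K[i][j] * resid[i] * resid[j] for i, j in pairs)
           / sum(K[i][j] ** 2 for i, j in pairs))
    s2T = sum(e * e for e in resid) / n
    return s2T - s2k, s2k


def iterate_he(W, y, K, max_iter=50, tol=1e-10):
    """Alternate between HE variance components and GLS fixed effects."""
    n = len(y)
    identity = [[float(i == j) for j in range(n)] for i in range(n)]
    beta = gls(W, y, identity)  # initialization: ordinary least squares
    for _ in range(max_iter):
        resid = [y[i] - sum(w * b for w, b in zip(W[i], beta)) for i in range(n)]
        s2e, s2k = he_step(resid, K)
        # Sigma = s2e * I + s2k * K, with K having a unit diagonal
        Sigma = [[s2k * K[i][j] + (s2e if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
        new_beta = gls(W, y, Sigma)
        converged = max(abs(a - b) for a, b in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta, (s2e, s2k)


# Toy data (made-up numbers): intercept plus one covariate, three related pairs.
W = [[1.0, float(x)] for x in range(6)]
y = [1.0, 2.5, 2.0, 4.5, 4.0, 6.5]
K = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
K[0][1] = K[1][0] = 0.5
K[2][3] = K[3][2] = 0.25
K[4][5] = K[5][4] = 0.5
beta_hat, (sigma2_e_hat, sigma2_k_hat) = iterate_he(W, y, K)
```

The sketch does not guard against a singular or indefinite estimated covariance matrix, which can occur in small samples because the Haseman-Elston estimates are not constrained to be non-negative.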

2.3.2 Confidence intervals for the variance components

From Lemma 4 in the Supplementary Information, any variance component estimator (or sum of variance component estimators) is given as a quadratic form. Let Q be the matrix of the quadratic form corresponding to a variance component estimate $\hat\sigma_l^2$, such that $\hat\sigma_l^2 = \hat\varepsilon^TQ\hat\varepsilon$. This $\hat\sigma_l^2$ is distributed as a weighted sum of independent $\chi^2_{(1)}$ variables, $\sum_{i=1}^n \lambda_i\chi^2_{(1)}$, where λ1, …, λn are the eigenvalues of $\mathrm{cov}(\hat\varepsilon)^{1/2}\,Q\,\mathrm{cov}(\hat\varepsilon)^{1/2}$. In practice, for cov(ε̂) we use the estimated $\hat\Sigma = \hat\Sigma(\hat\sigma_e^2, \ldots, \hat\sigma_k^2)$. Functions in the R package CompQuadForm (Duchesne and de Micheaux, 2010) calculate the probability function (or survival function) of this quadratic form based on λ1, …, λn. While it takes time to compute the eigenvalues, once they are computed, calculating the probabilities associated with the quadratic form is quick. We can test the hypothesis $H_0: \sigma_l^2 = 0$ by calculating the probability

$$\Pr(\varepsilon^TQ\varepsilon = 0) = 1 - \Pr(\varepsilon^TQ\varepsilon > 0),$$

and calculate two-sided confidence intervals for $\hat\sigma_l^2$ by calculating the appropriate quantiles of the survival probability. For example, for a 95% confidence interval we take the values (c1, c2) for which

$$c_1 = u : \Pr(\varepsilon^TQ\varepsilon > u) = 0.025$$
$$c_2 = u : \Pr(\varepsilon^TQ\varepsilon > u) = 0.975.$$

We find these values using a binary search on the interval $[0, \hat\sigma_T^2]$.
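The binary search itself is generic: it only needs a survival function that can be evaluated at any point. The sketch below is an illustration under stated assumptions. The function names are hypothetical, and instead of calling CompQuadForm it uses the one special case with a simple exact form, two equal eigenvalues λ, for which the weighted sum is λ times a chi-square variable with 2 degrees of freedom and the survival function is exp(-u/(2λ)). The search interval [0, 50] is also chosen only for this toy case.

```python
import math


def find_quantile(survival, target, lo, hi, tol=1e-10):
    """Binary search for u in [lo, hi] such that survival(u) = target.

    Assumes survival is continuous and non-increasing, with
    survival(lo) >= target >= survival(hi).
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if survival(mid) > target:
            lo = mid  # still too much mass to the right of mid: move right
        else:
            hi = mid
    return 0.5 * (lo + hi)


def survival_two_equal(lam):
    """Exact survival function of lam*chi2_(1) + lam*chi2_(1) = lam*chi2_(2)."""
    return lambda u: math.exp(-u / (2.0 * lam)) if u > 0 else 1.0


surv = survival_two_equal(0.5)               # S(u) = exp(-u)
c1 = find_quantile(surv, 0.025, 0.0, 50.0)   # point with survival probability 0.025
c2 = find_quantile(surv, 0.975, 0.0, 50.0)   # point with survival probability 0.975
```

For general eigenvalues the `survival` argument would be replaced by a numerical routine for weighted sums of chi-square variables, such as those the text attributes to CompQuadForm.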

We comment here that cov(ε̂) is in fact given by $\hat\Sigma - W(W^T\hat\Sigma^{-1}W)^{-1}W^T$ because $\hat\varepsilon = y - W\hat\beta$. In practice, we compared the coverage of confidence intervals when using this expression and when using Σ̂, and the results were essentially the same.

2.3.3 Computing heritability estimates and their confidence intervals

Suppose that the variance component corresponding to the kinship matrix is $\sigma_k^2$, with quadratic form denoted by $Q_k$. We estimate heritability as $\hat h_k = \hat\sigma_k^2/\hat\sigma_T^2$. However, we cannot use the confidence intervals for $\sigma_k^2$ to construct confidence intervals for $h_k$. Instead, we note that the point estimate $\hat h_k$ is given by:

$$\hat h_k = \frac{\hat\varepsilon^TQ_k\hat\varepsilon}{\frac{1}{n}\hat\varepsilon^TI\hat\varepsilon} \sim \frac{x^T\hat\Sigma^{1/2}Q_k\hat\Sigma^{1/2}x}{\frac{1}{n}x^T\hat\Sigma x} = \frac{x^TFx}{x^TGx}$$

where x ~ 𝒩(0, I), for $F = \hat\Sigma^{1/2}Q_k\hat\Sigma^{1/2}$ and $G = \hat\Sigma/n$. Thus, it is a ratio between two quadratic forms in (what we assume are) normal variables. For the square root $\hat\Sigma^{1/2}$, we use the Cholesky decomposition of Σ̂.

We use the saddlepoint approximation for the distribution of a ratio of quadratic forms in normal variables, proposed by Lieberman (1994). Complete details are provided in the Supplementary Information. In brief, for each potential value of $h_k$, say $h_k^*$, we can calculate the survival probability $\Pr(\hat h_k \ge h_k^*)$ using the saddlepoint approximation. Each such calculation requires as input $d_1, \ldots, d_n$, the eigenvalues of the matrix $D = F - h_k^*G$. We apply a binary search on the potential values $h_k^* \in [0, 1]$ to find end points c1 and c2 for the confidence intervals, as was done for calculating confidence intervals for $\sigma_k^2$.
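A short rearrangement shows why the eigenvalues of D are the natural input. The sketch below assumes, as in the text, that x ~ 𝒩(0, I), that F and G are symmetric, and that G is positive definite so the denominator is positive. The symbols h (a candidate heritability value), U and z are notation introduced here for illustration only.

$$
\Pr\left(\frac{x^TFx}{x^TGx} > h\right)
= \Pr\left(x^TFx - h\,x^TGx > 0\right)
= \Pr\left(x^T(F - hG)x > 0\right)
= \Pr\left(\sum_{i=1}^n d_i z_i^2 > 0\right),
$$

where $F - hG = U\,\mathrm{diag}(d_1, \ldots, d_n)\,U^T$ is an eigendecomposition with U orthogonal, and $z = U^Tx \sim \mathcal{N}(0, I)$. The survival probability of the ratio at h is therefore the probability that a weighted sum of independent $\chi^2_{(1)}$ variables, with weights $d_1, \ldots, d_n$ that may be positive or negative, exceeds zero.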

2.3.4 Fast computation when genetic relatedness is the only modeled source of correlation

If we have only a single kinship matrix K modeling the phenotypic variance, we can compute the eigenvalues λ1, …, λn of the matrix K once, and then transform these eigenvalues to obtain the eigenvalues $d_1(h_k^*), \ldots, d_n(h_k^*)$ for each candidate heritability value $h_k^*$, and save computation time. To see this, suppose that u is an eigenvector of K with an eigenvalue λ. Then, by definition, Ku = λu. Since $\Sigma = \sigma_k^2(K + I) + \sigma_e^2I$ (here K has its diagonal values set to zero), it is straightforward to see that u is also an eigenvector of Σ:

$$\Sigma u = [\sigma_k^2(K + I) + \sigma_e^2I]u = (\sigma_k^2\lambda + \sigma_k^2 + \sigma_e^2)u.$$

Similarly, u is an eigenvector of $\Sigma^{1/2}$ with an eigenvalue $\sqrt{\sigma_k^2\lambda + \sigma_k^2 + \sigma_e^2}$, which finally leads to the transformation between an eigenvalue $\lambda_i$ of K and an eigenvalue of $D(h_k^*)$:

$$d_i(h_k^*, \lambda_i) = \frac{1}{2\sum_{j<m}k_{j,m}^2}\lambda_i(\lambda_i\sigma_k^2 + \sigma_k^2 + \sigma_e^2) - h_k^*(\lambda_i\sigma_k^2 + \sigma_k^2 + \sigma_e^2)/n.$$

As before, we use the estimated $\hat\sigma_k^2, \hat\sigma_e^2$ instead of the true unknown values.
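The transformation amounts to a one-line function applied to each eigenvalue of K. In the sketch below the function name and argument names are hypothetical; `tr_kk` stands for twice the sum of squared off-diagonal kinship coefficients (the trace of the squared zero-diagonal kinship matrix), and the numbers in the example call are made up.

```python
def transformed_eigenvalues(lams, h, sigma2_k, sigma2_e, n, tr_kk):
    """Map eigenvalues of the zero-diagonal kinship matrix K to those of D(h).

    lams:  eigenvalues of K (diagonal set to zero).
    h:     candidate heritability value.
    tr_kk: 2 * sum over pairs i < j of k_ij^2, i.e. tr(K K).
    """
    out = []
    for lam in lams:
        s = lam * sigma2_k + sigma2_k + sigma2_e  # eigenvalue of Sigma
        out.append(lam * s / tr_kk - h * s / n)   # eigenvalue of F minus h times that of G
    return out


# Example call with made-up inputs.
d = transformed_eigenvalues([0.5, -0.5], h=0.2, sigma2_k=40.0, sigma2_e=100.0,
                            n=10, tr_kk=2.0)
```

Because only this cheap map depends on the candidate value, the eigendecomposition of K is done once and reused at every step of the binary search.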

2.3.5 Meta-analysis across studies when kinship is the only source of correlation

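Suppose that there are S studies that we want to combine in meta-analysis. We assume that kinship is the only source of correlation. Each study has a vector of residuals $\hat\varepsilon_s = (\hat\varepsilon_{s,1}, \ldots, \hat\varepsilon_{s,n_s})^T$, s = 1, …, S. Consider the Haseman-Elston regression, but incomplete, so that only the pairs of multiplied residuals within study $\hat\varepsilon_{s,i}\hat\varepsilon_{s,j}$ are regressed against entries of the kinship covariance matrix, but not $\hat\varepsilon_{s,i}\hat\varepsilon_{t,j}$ for $s \neq t$. For this, we do not need to assume that a participant in one study is genetically unrelated to a participant in another study. The meta-analytic estimator of $\sigma_T^2$ is given by $\hat\sigma_T^2 = \sum_{s=1}^S\sum_{i=1}^{n_s}\hat\varepsilon_{s,i}^2 / \sum_{s=1}^S n_s$. Let $\hat\varepsilon = (\hat\varepsilon_1^T, \ldots, \hat\varepsilon_S^T)^T$. The meta-analytic kinship variance component estimator is given by

$$\hat\sigma_k^2 = \frac{1}{\mathrm{tr}(K_SK_S)}\hat\varepsilon^TK_S\hat\varepsilon$$

where $K_S$ is the block diagonal matrix that has all the study-specific kinship matrices (without their diagonal values) arranged diagonally, as

$$K_S = \begin{pmatrix} K_1 & 0 & \cdots & 0 \\ 0 & K_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & K_S \end{pmatrix}$$

To see that this meta-analytic estimator of $\sigma_k^2$ is unbiased, note first that $\mathrm{cov}(\hat\varepsilon) = (\sigma_e^2 + \sigma_k^2)I + \sigma_k^2K$, where now K has kinship coefficients for individuals across studies (and diagonals set to zero). From characteristics of quadratic forms:

$$
\begin{aligned}
E[\hat\sigma_k^2] &= E\left[\frac{1}{\mathrm{tr}(K_SK_S)}\hat\varepsilon^TK_S\hat\varepsilon\right] = \frac{1}{\mathrm{tr}(K_SK_S)}\mathrm{tr}(K_S\,\mathrm{cov}(\hat\varepsilon)) \\
&= \frac{1}{\mathrm{tr}(K_SK_S)}\mathrm{tr}\{K_S[(\sigma_e^2 + \sigma_k^2)I + \sigma_k^2K]\} = \frac{1}{\mathrm{tr}(K_SK_S)}\mathrm{tr}(K_S\sigma_k^2K) \\
&= \frac{1}{\mathrm{tr}(K_SK_S)}\mathrm{tr}(K_S\sigma_k^2K_S) = \sigma_k^2.
\end{aligned}
$$

Let $K = K_S + K_C$, where $K_C$ is the matrix of cross-study relatedness. Although the variance component estimates and their ratios depend only on $K_S$, their distribution depends on $K_C$ as well.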

Computing the meta-analytic heritability estimator and confidence intervals

Suppose that each of S independent studies calculated the residuals from a “null model” (a regression model without genetic fixed effects other than PCs). Each study s reports:

  1. $\mathcal{K}_s = 2\sum_{i<j}k_{ij}^2$,

  2. $\hat\sigma_{k,s}^2$,

  3. $\hat\sigma_{T,s}^2$,

  4. The number of participants in the study $n_s$,

  5. The eigenvalues $\lambda_1^s, \ldots, \lambda_{n_s}^s$ of its matrix $K_s$.

The meta-analysis estimates of the kinship variance components and the total variance are:

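$$\hat\sigma_k^2 = \frac{\sum_{s=1}^S\mathcal{K}_s\hat\sigma_{k,s}^2}{\sum_{s=1}^S\mathcal{K}_s}$$
$$\hat\sigma_T^2 = \frac{\sum_{s=1}^Sn_s\hat\sigma_{T,s}^2}{\sum_{s=1}^Sn_s}.$$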

The error variance component is taken to be the difference $\hat\sigma_e^2 = \hat\sigma_T^2 - \hat\sigma_k^2$, and the eigenvalues of the across-studies matrix K (= $K_S$ under independence between studies) are taken to be $\lambda_1^1, \ldots, \lambda_{n_1}^1, \ldots, \lambda_1^S, \ldots, \lambda_{n_S}^S$. Using these, the central location calculates heritability estimates and confidence intervals. These estimators are the estimators that one would have obtained if all regression residuals and kinship values were available at the central location and individuals were unrelated between studies. Interestingly, the estimator of the kinship variance is the weighted average of the kinship variance estimators from the various studies, weighted by the sum of squared entries of the study-specific kinship matrices with diagonal values set to zero. Therefore, a study with larger values of relatedness overall will make a stronger contribution to the kinship variance estimator. The estimator of the total variance is simply weighted by the sample sizes. Thus, as when estimating variance components from a single study, all residuals contribute equally to the estimator of the total variance, while multiplied residual pairs with greater corresponding kinship coefficients have a larger influence on the kinship variance estimator.
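The pooling step at the central location uses only the five reported summaries. The sketch below is a minimal illustration; the class name `StudySummary`, its field names, the function `meta_analyze` and the numbers in the example are hypothetical and are not taken from the HCHS/SOL analysis.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StudySummary:
    kin_weight: float         # 2 * sum over pairs i < j of k_ij^2
    sigma2_k: float           # study-specific kinship variance estimate
    sigma2_T: float           # study-specific total variance estimate
    n: int                    # number of participants
    eigenvalues: List[float]  # eigenvalues of the study kinship matrix


def meta_analyze(studies):
    """Combine per-study summaries into pooled variance components."""
    total_weight = sum(s.kin_weight for s in studies)
    total_n = sum(s.n for s in studies)
    sigma2_k = sum(s.kin_weight * s.sigma2_k for s in studies) / total_weight
    sigma2_T = sum(s.n * s.sigma2_T for s in studies) / total_n
    return {
        "sigma2_k": sigma2_k,
        "sigma2_T": sigma2_T,
        "sigma2_e": sigma2_T - sigma2_k,
        "h2": sigma2_k / sigma2_T,
        # Under independence between studies the pooled matrix is block
        # diagonal, so its eigenvalues are the union of the study eigenvalues.
        "eigenvalues": [lam for s in studies for lam in s.eigenvalues],
    }


# Two made-up studies.
studies = [
    StudySummary(kin_weight=2.0, sigma2_k=30.0, sigma2_T=120.0, n=100,
                 eigenvalues=[0.1, -0.1]),
    StudySummary(kin_weight=6.0, sigma2_k=50.0, sigma2_T=160.0, n=300,
                 eigenvalues=[0.2, -0.2]),
]
pooled = meta_analyze(studies)
```

The pooled eigenvalues and variance components would then feed the same confidence interval computation used for a single study.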

2.4 The Hispanic Community Health Study/Study of Latinos

The HCHS/SOL (LaVange et al., 2010, Sorlie et al., 2010) is a community based cohort study, following self-identified Hispanic individuals from four field centers (Chicago, IL; Miami, FL; Bronx, NY; and San Diego, CA). Individuals were sampled via a two-stage sampling scheme, in which households were randomly sampled from sampled community block units. Almost 13,000 study participants consented to genotyping. Correlation matrices to model environmental variance due to households and community block units were generated so that the i, j entry of a given matrix was set to 1 if the i and j individuals live in the same household (or community block unit), and 0 otherwise.

HCHS/SOL individuals were classified into ‘genetic analysis groups’: Central American, Cuban, Dominican, Mexican, Puerto Rican, and South American. These groups are based on self reported ethnicities and genetic similarity (Conomos et al., 2016a). This study was approved by the institutional review boards at each field center, where all participants gave written informed consent. The HCHS/SOL genotype and phenotype data are available on dbGaP under accession numbers phs000880.v1.p1 and phs000810.v1.p1.

2.4.1 Genotyping, imputation and quality control

Blood samples from HCHS/SOL individuals were genotyped on a custom array consisting of Illumina Omni 2.5M content plus ~150,000 custom markers selected to include ancestry-informative markers, variants characteristic of Amerindian populations, known GWAS hits and other candidate gene polymorphisms. Quality control was similar to the procedure described in Laurie et al. (2010) and included checks for sample identity, batch effects, missing call rate, chromosomal anomalies (Laurie et al., 2012), deviation from Hardy-Weinberg equilibrium, Mendelian errors, and duplicate sample discordance. 12,784 samples passed quality control, and 2,232,944 SNPs passed quality filters. Pairwise kinship coefficients and principal components reflecting ancestry were estimated in an iterative procedure which accounts for admixture (Conomos et al., 2016a). All common variants were used to estimate kinship coefficients. Finally, we removed some individuals at random to generate a set of 10,255 individuals without any pair having kinship coefficient higher than $2^{-11}$.

2.4.2 Heritability and proportion of variance estimation in the HCHS/SOL

Due to the sampling structure of the HCHS/SOL, the correlation between individuals is modeled via a kinship matrix, and two matrices modeling environmental effects: household and community block unit matrices. For each investigated trait we estimated variance components corresponding to the three correlation matrices via the Haseman-Elston regression. We estimated 95% confidence intervals for heritability, and for the proportion of variance explained by both modeled environmental effects together.

We first consider 47 traits for which a GWAS was previously performed (though not necessarily published) in the HCHS/SOL. The traits were, by group: white and red blood cell counts and indices: eosinophils (EOS), hemoglobin (HB), lymphocytes (LYMPH), neutrophils (NEUT), total white blood cell count (WBC), monocytes (MONO), total red blood cell count (RBC), hematocrit (HCT), mean corpuscular hemoglobin (MCH), mean corpuscular volume (MCV), mean corpuscular hemoglobin concentration (MCHC), red cell distribution width (RDW), and platelet count (PLT); anthropometric measures: BMI, waist circumference adjusted for BMI (WCadjBMI), waist-to-hip ratio adjusted for BMI (WHRadjBMI), hip circumference adjusted for BMI (HIPadjBMI), and height; ECG measures: heart rate and its variability (HR, HRV_SD and HRV_RMS), and QT and PR intervals (QT, PR); lipid measures: LDL and HDL cholesterol (LDL, HDL), total cholesterol (TC) and triglycerides (TG); measures of lung function: forced vital capacity (FVC), forced expiratory volume in one second (FEV1), and their ratio (FEV1_FVC_ratio); blood pressure measures: systolic and diastolic blood pressure (SBP, DBP), mean arterial pressure (MAP) and pulse pressure (PP); iron measures: ferritin, total iron binding capacity (TIBC), transferrin and its saturation (iron, Saturation); glycemic control, kidney and other metabolic traits: fasting insulin, ankle-brachial index (ABI), estimated glomerular filtration rate (eGFR), urine albumin to creatinine ratio (ACR), and glycated hemoglobin (HbA1c); dental traits: periodontitis (PERIO), approximated by the cube root of the mean attachment loss in interproximal teeth areas, and measures of dental caries (counts of cavities) on teeth surfaces (TS) and teeth (TT); and a depression score (known as CESD10, a sum of ten questionnaire items related to depression in the week prior to the clinic visit).

All regression models were adjusted (via the design matrix W) for the first 5 principal components, study center, age, sex, and genetic analysis group. For some traits we used additional covariates.

We also studied the use of our method for meta-analysis when there are some related individuals across studies, on a subset of five traits. We first generated a restricted data set of 7,848 individuals, no two of whom lived in the same household. We then treated each of the genetic analysis groups as a separate study, and used the proposed procedure for calculating heritability in each of the genetic analysis groups and in meta-analysis. We also compared these analyses to the pooled analysis that modeled all 7,848 individuals together. Note that for this exercise we neglected block unit correlation, i.e. assumed that it does not contribute to the phenotypic variance.

2.5 Simulation studies

We study the accuracy of the proposed method for calculating confidence intervals in simulations, and compare it to other methods for obtaining estimates and confidence intervals. All methods under investigation are those that use pre-defined between-individual correlation matrices. We used correlation matrices from the HCHS/SOL corresponding to kinship, household, and community block unit, to generate quantitative outcomes with realistic correlation structures. In the Supplementary Information we provide additional simulation studies with correlation matrices that are not directly based on the HCHS/SOL. In any given simulation, data were sampled by first generating an error vector $e_{ind}$ from a standard normal distribution. We simulated the covariance structure

$$\mathrm{cov}(e) = \sigma_e^2I + \sigma_k^2K + \sigma_h^2H + \sigma_c^2C = \Sigma$$

by taking $e = \Sigma^{1/2}e_{ind}$. The matrices K, H, and C were the kinship, household, and community matrices in the HCHS/SOL. The outcomes were simulated by

$$y = 2 + 3\,PC_1 + e,$$

where PC1 is the first principal component of the HCHS/SOL data. All simulations were performed 1,000 times.
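The data-generating step can be sketched as below. This is an illustration under stated assumptions, not the simulation code used in the paper: the function names are hypothetical, the lower-triangular Cholesky factor is used as the matrix square root (it satisfies the covariance requirement, since the factor times its transpose equals Σ), and the tiny identity matrices and principal component values in the example are made up in place of the HCHS/SOL matrices.

```python
import random


def cholesky(S):
    """Lower-triangular C with C C^T = S, for a positive definite S."""
    n = len(S)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = S[i][j] - sum(C[i][m] * C[j][m] for m in range(j))
            C[i][j] = acc ** 0.5 if i == j else acc / C[j][j]
    return C


def simulate_trait(K, H, Cm, pc1, s2e, s2k, s2h, s2c, rng):
    """Draw y = 2 + 3*PC1 + e with cov(e) = s2e*I + s2k*K + s2h*H + s2c*Cm."""
    n = len(pc1)
    Sigma = [[s2k * K[i][j] + s2h * H[i][j] + s2c * Cm[i][j]
              + (s2e if i == j else 0.0) for j in range(n)] for i in range(n)]
    root = cholesky(Sigma)
    e_ind = [rng.gauss(0.0, 1.0) for _ in range(n)]  # independent standard normals
    e = [sum(root[i][m] * e_ind[m] for m in range(i + 1)) for i in range(n)]
    return [2.0 + 3.0 * pc1[i] + e[i] for i in range(n)]


# Toy call with made-up 3 x 3 matrices and principal component values.
I3 = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
y_sim = simulate_trait(I3, I3, I3, [0.1, -0.2, 0.05],
                       s2e=100.0, s2k=40.0, s2h=15.0, s2c=2.0,
                       rng=random.Random(1))
```

With unit diagonals in K, H and C, the variance of each simulated error is 100 + 40 + 15 + 2 = 157, which matches the denominators shown in the column headers of Table 1.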

In the first simulation study we set $\sigma = (\sigma_e^2, \sigma_k^2, \sigma_h^2, \sigma_c^2) = (100, 40, 15, 2)$, and studied our method in settings of small sample size (n = 1,500) and large sample size (n = 12,784). We compared the Haseman-Elston approach to a REML approach, with confidence intervals that rely on normal approximation. For this we used the GENESIS R package (Conomos et al., 2016b) that can estimate multiple variance components. In a second simulation study we set $\sigma = (\sigma_e^2, \sigma_k^2, \sigma_h^2, \sigma_c^2) = (100, \sigma_k^2, 0, 0)$, with $\sigma_k^2 \in \{0, 40\}$, so that kinship is the only source of correlation. In these settings we are able to compare additional methods: a combination of REML with the GENESIS R package and the ALBI package (Schweiger et al., 2016) for estimating bootstrap confidence intervals, and the REML implementation in the heritability R package (Kruijer et al., 2016) with confidence intervals based on asymptotic normal approximation of either the variance components themselves, or their log. Here we also considered small and large sample sizes, and in addition, we randomly divided the large dataset into 5 subgroups, to generate data mimicking five different studies with possible genetic relatedness between participants of different studies, and studied our meta-analysis approach in this scenario. We randomly partitioned the data into subgroups four times, to make sure that results did not depend on a specific partition.

3 Results

3.1 Simulation studies

Table 1 provides simulation results in terms of root-mean-squared-error (RMSE) of the variance proportion estimator, where RMSE is given by

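$$\sqrt{\frac{1}{n_{sim}}\sum_{i=1}^{n_{sim}}\left(\frac{\hat\sigma_l^2}{\hat\sigma_T^2} - \frac{\sigma_l^2}{\sigma_T^2}\right)^2}$$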

and $n_{sim}$ = 1,000 is the number of simulations; coverage, which is the proportion of simulations in which the true variance proportion is within the confidence interval; and width, which is the average width of the 95% confidence interval. Table 1 is divided into two parts. On the left side, it provides results from simulation settings that included three different correlation matrices, mimicking the HCHS/SOL, for large and small sample sizes. This part only provides results computed using the proposed Haseman-Elston (HE) procedure and using the asymptotic normal approximation based on REML as implemented in the GENESIS R package. The right side of the table provides results from simulation settings in which only a single kinship matrix was used, again in large and small sample sizes. For these, results from all compared methods are provided.
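The three summary measures can be computed from the per-simulation estimates and intervals as in the sketch below. The function name and the three made-up simulation results are hypothetical and serve only to illustrate the definitions.

```python
import math


def summarize(estimates, intervals, truth):
    """Return (RMSE, coverage, average width) over a set of simulations.

    estimates: estimated variance proportions, one per simulation.
    intervals: (lower, upper) confidence limits, one pair per simulation.
    truth:     the true variance proportion used to generate the data.
    """
    nsim = len(estimates)
    rmse = math.sqrt(sum((est - truth) ** 2 for est in estimates) / nsim)
    coverage = sum(lo <= truth <= hi for lo, hi in intervals) / nsim
    width = sum(hi - lo for lo, hi in intervals) / nsim
    return rmse, coverage, width


# Three made-up simulation results for a true proportion of 0.3.
rmse, coverage, width = summarize(
    estimates=[0.2, 0.3, 0.4],
    intervals=[(0.1, 0.3), (0.25, 0.45), (0.35, 0.5)],
    truth=0.3,
)
```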

Table 1.

Simulation results comparing the various methods for estimating proportions of variance and 95% confidence intervals. HE is the Haseman-Elston regression with the proposed procedure for obtaining confidence intervals. HE-meta is the implementation of the meta-analysis procedure when the data set is randomly divided into multiple studies. In this case there are correlated individuals between the studies, while the meta-analysis procedure ignores these correlations. GENESIS is an R package that implements an average-information REML procedure. GENESIS-asymp provides confidence intervals (CIs) based on asymptotic normality. When these CIs contained inadmissible values, they were truncated to include only admissible values. The size of the confidence intervals for GENESIS-asymp was calculated using only simulations that had a non-zero estimate of the variance component. GENESIS-ALBI provides confidence intervals calculated in a parametric bootstrap procedure implemented in the ALBI package. ‘P heritability’ is the R package ‘heritability’ implementation of the REML procedure. CIs are based on asymptotic normality. When ‘log’ is specified, the asymptotic normality is calculated on the log transformed variance components, and a value of 0 is assumed when an estimate is smaller than 0.001. Note that ‘P heritability’ had convergence problems in the large sample size, null scenario, so that a small number of simulations was used to calculate parameters.

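Columns 1 to 6: 3 correlation matrices (large sample, then small sample). Columns 7 to 10: only kinship matrix (large sample, then small sample). The true proportion of variance is given in parentheses.

| Measure | Method | Large: Community (2/157) | Large: HH (15/157) | Large: kinship (40/157) | Small: Community (2/157) | Small: HH (15/157) | Small: kinship (40/157) | Large: kinship (0/140) | Large: kinship (40/140) | Small: kinship (0/140) | Small: kinship (40/140) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RMSE | HE | 0.004 | 0.016 | 0.030 | 0.016 | 0.092 | 0.180 | 0.015 | 0.028 | 0.132 | 0.182 |
| RMSE | HE - meta | | | | | | | 0.054 | 0.153 | | |
| RMSE | GENESIS | 0.004 | 0.014 | 0.026 | 0.016 | 0.089 | 0.180 | 0.015 | 0.025 | 0.134 | 0.181 |
| RMSE | P heritability | | | | | | | 0.026 | 0.025 | 0.140 | 0.177 |
| Coverage | HE | 0.92 | 0.96 | 0.95 | 0.98 | 0.98 | 0.99 | 0.98 | 0.94 | 0.97 | 1.00 |
| Coverage | HE - meta | | | | | | | 0.95 | 0.96 | | |
| Coverage | GENESIS asymp | 0.94 | 0.96 | 0.95 | 0.72 | 0.78 | 0.86 | 0.98 | 0.95 | 0.98 | 0.91 |
| Coverage | GENESIS - ALBI | | | | | | | 0.95 | 0.96 | 1.00 | 0.96 |
| Coverage | P heritability | | | | | | | 0.92 | 0.96 | 0.98 | 0.97 |
| Coverage | P heritability - log | | | | | | | 0.79 | 0.96 | 0.79 | 0.99 |
| Width | HE | 0.02 | 0.06 | 0.12 | 0.06 | 0.32 | 0.63 | 0.05 | 0.11 | 0.46 | 0.68 |
| Width | HE - meta | | | | | | | 0.14 | 0.21 | | |
| Width | GENESIS asymp | 0.01 | 0.06 | 0.10 | 0.06 | 0.32 | 0.64 | 0.06 | 0.10 | 0.52 | 0.64 |
| Width | GENESIS - ALBI | | | | | | | 0.05 | 0.10 | 0.55 | 0.57 |
| Width | P heritability | | | | | | | 0.06 | 0.10 | 0.48 | 0.63 |
| Width | P heritability - log | | | | | | | 0.42 | 0.10 | 0.90 | 0.73 |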

The HE estimates of proportion of variance are very similar to those obtained using REML procedures, though often slightly less efficient (usually slightly larger RMSE when compared to the GENESIS estimates). The confidence intervals obtained from the HE regression are better than the REML normal distribution based confidence intervals when the sample size is small, but are similar otherwise. Additional simulation results in the Supplementary Information demonstrate that the normal approximation based confidence intervals perform poorly also when the actual values in the correlation matrix are small, and when multiple correlation matrices are somewhat correlated. Asymptotic REML-based confidence intervals that use the log-transform do not perform well. The bootstrap confidence intervals (GENESIS-ALBI) perform well and tend to be slightly narrower than other confidence intervals.

The meta-analysis procedure that ignores between-study relatedness had proper coverage of the proportion of variance, but had less efficient estimates, as seen by the large RMSE and wide confidence intervals. This is expected because we discarded some information compared to the procedure that used the entire data.

3.2 Heritability estimation in the HCHS/SOL

Figure 1 provides the estimated heritability and proportion of variance due to modeled environmental factors (household and community) for the 47 traits examined in the HCHS/SOL, together with 95% confidence intervals. The results are ordered by the estimated heritability. Height has the largest estimated heritability (almost 60%, consistent with other estimates from GWAS), while the heritability estimates of iron (transferrin), periodontitis and the depression score are close to 0, with confidence intervals containing zero. Interestingly, the proportion of variance of periodontitis explained by household and community sharing was very high, larger than 20%, and that of CESD10 was also statistically significant at the 0.05 level. The trait with the largest proportion of variance attributable to modeled environmental factors was MCHC (a measure of hemoglobin concentration in red blood cells). Perhaps this is due to environmental exposure that varies among households. For instance, it is known that smoking is associated with MCHC levels (Asif et al., 2013). While smoking status (never, past, current) was used as a covariate in the MCHC model, this variable may not have captured passive smoking, which may vary by household.

Figure 1.


Estimated proportion of variance due to kinship (i.e., heritability) and due to modeled environmental factors (household and community sharing) for 47 traits in the HCHS/SOL data of 10,255 individuals. The investigated traits are related to blood cell count and indices (EOS, HB, LYMPH, NEUT, WBC, MONO, RBC, HCT, MCH, MCV, MCHC, RDW, PLT), anthropometric measures (BMI, height, WCadjBMI, HIPadjBMI, WHRadjBMI), ECG traits (HR, HRV_SD, HRV_RMS, QT and PR), lipid measures (LDL, HDL, TC, TG), measures of lung function (FVC, FEV1, FEV1_FVC_ratio), blood pressure measures (SBP, DBP, MAP, PP), dental traits (perio, Dental caries (TS, TT)), iron (iron, ferritin, TIBC, saturation), depression score, and other metabolic, glycemic and kidney traits (fasting_insulin, ABI, eGFR, ACR, HBA1C).

Figure 2 provides results from studying the meta-analysis procedure. For the five investigated traits, it provides estimated proportions of variance (heritability and environmental variance) and 95% confidence intervals from the Full data set, which included environmentally correlated individuals (and was used for Figure 1), and from the restricted data set, which did not include environmentally correlated individuals. We used the restricted data set to compare a pooled analysis, analyses specific to each genetic analysis group, and a meta-analysis that ignores the correlation between individuals from different genetic analysis groups, to mimic meta-analysis across different studies. In the restricted data set, the analyses of specific genetic analysis groups yielded wide confidence intervals, which often included zero. This is expected due to low power. In addition, the meta-analyses that did not account for the correlations between the genetic analysis groups had wider confidence intervals than the corresponding pooled analyses.

Figure 2.

Estimated proportions of variance from the various subsets of the HCHS/SOL data. The Full data set included 10,255 individuals with mutual kinship coefficient smaller than 2^(−11/2). Using Full, we estimated both heritability and the proportion of variance that is due to modeled environmental effects: the sum of the variance components corresponding to household and community block unit sharing. A restricted data set included 7,848 individuals from separate households and was used to compare meta and pooled analysis heritability estimates, where the meta-analysis used information from each of the genetic analysis groups. Dental caries (TS) is a measure of teeth damage on teeth surfaces. Depression score is a summation of responses to questions related to depressive behavior or feelings in the week prior to a participant’s clinic visit. FEV1 is a measure of lung function. SBP is systolic blood pressure.

4 Discussion

In this manuscript we investigate the properties of Haseman-Elston regression estimators of variance components. We derive closed-form expressions for the variance component estimators, and use them to characterize the distribution of the estimated variance components and proportions of variance, and to compute confidence intervals. Our confidence intervals require normality of the residuals from the trait regression model after adjusting for covariates. We further show how to obtain unbiased estimates of the variance components and proportions of variance by meta-analyzing information from multiple studies. In this case, the heritability estimates are unbiased even if individuals are related between studies, but the asymptotic distribution of the estimators depends on the unknown (and unestimated) kinship coefficients between individuals from different studies.
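To make the closed-form estimator concrete, the following is a minimal Python sketch of a Haseman-Elston regression with a single kinship matrix. It is an illustration under stated assumptions, not the author's R implementation: it assumes mean-zero, covariate-adjusted residuals y and the working model Cov(y) = σ²_g K + σ²_e I, and all function and variable names are hypothetical. Regressing every product y_i y_j on K_ij and on the indicator of i = j leads to two normal equations that can be solved directly.

```python
def haseman_elston(y, K):
    """Least-squares fit of y_i * y_j on K_ij and the indicator of i == j.

    y: mean-zero, covariate-adjusted residuals (list of floats); assumed input.
    K: symmetric kinship matrix (list of lists), rows ordered as y.
    Returns (sigma2_g, sigma2_e) with no non-negativity constraint.
    """
    n = len(y)
    tr_kk = sum(K[i][j] * K[i][j] for i in range(n) for j in range(n))
    tr_k = sum(K[i][i] for i in range(n))
    y_k_y = sum(y[i] * K[i][j] * y[j] for i in range(n) for j in range(n))
    y_y = sum(v * v for v in y)

    # Normal equations: [[tr_kk, tr_k], [tr_k, n]] @ [g, e] = [y_k_y, y_y]
    det = tr_kk * n - tr_k * tr_k
    if det == 0:
        raise ValueError("K is proportional to the identity; "
                         "the two components cannot be separated")
    sigma2_g = (n * y_k_y - tr_k * y_y) / det
    sigma2_e = (tr_kk * y_y - tr_k * y_k_y) / det
    return sigma2_g, sigma2_e


def heritability(sigma2_g, sigma2_e):
    """Proportion of variance due to kinship, with negative components
    simply set to zero (a simplification made for this sketch)."""
    g = max(sigma2_g, 0.0)
    e = max(sigma2_e, 0.0)
    if g + e == 0.0:
        raise ValueError("total variance is zero")
    return g / (g + e)


# Toy input: individuals 1 and 2 have kinship 0.5, individual 3 is unrelated.
K = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
y = [1.0, 0.4, -1.5]
sigma2_g, sigma2_e = haseman_elston(y, K)
print(sigma2_g, sigma2_e, heritability(sigma2_g, sigma2_e))
```

For this toy input the estimates are approximately 0.8 and 0.337, giving an estimated heritability of about 0.70. With several covariance matrices (e.g. household and community), the same idea would give one normal equation per matrix.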

The Haseman-Elston regression does not naturally constrain the variance component estimators to be non-negative. In practice, if an estimate of a variance component becomes negative during the algorithm iteration process, it is set to zero, as is also done in REML estimation. However, unlike REML with asymptotic confidence intervals, where no uncertainty is associated with a parameter that was set to zero, here we can still obtain a confidence interval for the variance component, because we can still estimate quantiles of the distribution of the quadratic form. The solution is still somewhat ad hoc: we constrain the confidence interval to have 0 as its lower end point, and take the estimated 97.5% quantile as its upper end point. Still, in simulations with null heritability values and values close to 0, the confidence intervals had good coverage.
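The truncation described above can be illustrated with a short Python sketch. It assumes that the estimator is a quadratic form in normal residuals, so that its distribution is a weighted sum of independent chi-square variables with one degree of freedom each. The weights below are hypothetical, and plain Monte Carlo is swapped in for the numerical methods one would normally use to compute such quantiles.

```python
import random


def quadratic_form_quantiles(weights, probs, n_draws=20000, seed=1):
    """Monte Carlo quantiles of sum_k weights[k] * Z_k**2, Z_k ~ N(0, 1).

    weights: eigenvalue-like weights of the quadratic form; these may be
             negative (hypothetical inputs for this sketch).
    probs:   probabilities at which quantiles are requested.
    """
    rng = random.Random(seed)
    draws = sorted(
        sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        for _ in range(n_draws)
    )
    return [draws[min(int(p * n_draws), n_draws - 1)] for p in probs]


def truncated_interval(lower, upper):
    """Clip an interval for a variance component to the range [0, inf)."""
    return (max(lower, 0.0), max(upper, 0.0))


# Hypothetical weights with mixed signs: the lower quantile is negative,
# so the reported interval starts at 0 and ends at the 97.5% quantile.
lower, upper = quadratic_form_quantiles([0.5, -0.3], [0.025, 0.975])
print(truncated_interval(lower, upper))
```

The fixed seed only makes the sketch reproducible. The number of draws controls the Monte Carlo error of the quantile estimates.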

In the simulation studies, we compared our proposed approach to the approach that calculates confidence intervals based on the asymptotic normal approximation to the distribution of the variance components obtained by maximizing the restricted likelihood (REML). Using the latter method to obtain confidence intervals is attractive, because it is straightforward to implement and has almost no computational cost. However, the normal approximation does not hold close to the boundary of the parameter space, or when the information is low, e.g. when the values of the kinship matrix are small, as shown in simulations in the Supplementary Information. In contrast, the proposed Haseman-Elston based confidence intervals perform well, and are almost as efficient (have similar width) as the normal approximation based ones when the sample size is large. The computational cost of calculating the proposed confidence intervals is large when using multiple matrices to model the covariance structure of the outcomes. However, this cost can be substantially reduced using recent developments in algorithms for fast calculation of the largest eigenvalues of matrices (e.g. Lumley et al. (2016)).
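The boundary problem of the normal approximation can be seen from simple arithmetic. The sketch below computes a normal-approximation (Wald) interval in Python. The heritability estimate and standard error are hypothetical values chosen for illustration, not results from this study.

```python
def wald_interval(estimate, std_error, z=1.959964):
    """Two-sided 95% normal-approximation interval: estimate +/- z * SE."""
    return (estimate - z * std_error, estimate + z * std_error)


# Hypothetical heritability estimate close to the boundary of [0, 1].
lower, upper = wald_interval(0.03, 0.04)
print(lower, upper)  # the lower end point is negative
```

With these inputs the lower end point falls below zero, outside the parameter space. This is one symptom of the approximation failing when the true heritability is near 0.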

Our approach for heritability estimation and for obtaining confidence intervals relies on having a pre-defined kinship matrix that models the relatedness between individuals. The same is true for other mixed-model based approaches that use individual-level data. Other, relatively new approaches such as LD-score regression (Bulik-Sullivan et al., 2015) and MQS (Zhou (2016), unpublished manuscript) use summary statistics from GWAS and a reference panel (for calculating confidence intervals, MQS also proposes a combination of the two approaches). These methods therefore use actual estimated effect sizes and LD between variants, instead of genetic correlation between individuals. It is a topic of future research to study the relative advantages (e.g. power under various settings) of these two strategies for estimating heritability: using genetic relatedness between individuals without estimating variants’ effect sizes, versus estimating effect sizes and using correlation between the variants.

We show in simulations based on the HCHS/SOL correlation structure that the coverage of our confidence intervals is good both in pooled analysis and in meta-analysis (even when individuals are related between studies), while being quite conservative when the sample size is small. More work is needed to study the analytic properties of the confidence intervals in meta-analysis when individuals are related between studies and when some individuals belong to multiple studies. We expect such a scenario to cause a larger deviation between the estimated and the actual distribution of the kinship variance component and heritability, potentially leading to worse coverage of the estimated confidence intervals, depending on how many such overlaps in study participants exist.

5 Software

R code for estimating heritability (or proportions of variance due to other modeled factors) and their confidence intervals, together with an example script, sample code and instructions for running simulation studies, can be found at https://github.com/tamartsi/Heritability_CIs.

Supplementary Material

Supplemental material

Acknowledgments

The author thanks Dr. Bruce Weir and Dr. Bill Hill for reviewing earlier versions of the manuscript, the anonymous reviewers, and the staff and participants of HCHS/SOL for their important contributions. This work was supported in part by NHLBI HHSN268201300005C. The Hispanic Community Health Study/Study of Latinos was carried out as a collaborative study supported by contracts from the National Heart, Lung, and Blood Institute (NHLBI) to the University of North Carolina (N01-HC65233), University of Miami (N01-HC65234), Albert Einstein College of Medicine (N01-HC65235), Northwestern University (N01-HC65236), and San Diego State University (N01-HC65237). The following Institutes/Centers/Offices contribute to the HCHS/SOL through a transfer of funds to the NHLBI: National Institute on Minority Health and Health Disparities, National Institute on Deafness and Other Communication Disorders, National Institute of Dental and Craniofacial Research, National Institute of Diabetes and Digestive and Kidney Diseases, National Institute of Neurological Disorders and Stroke, NIH Institution-Office of Dietary Supplements.

References

  1. Asif M, Karim S, Umar Z, Malik A, Ismail T, Chaudhary A, Alqahtani MH, Rasool M. Effect of cigarette smoking based on hematological parameters: comparison between male smokers and nonsmokers. Turkish Journal of Biochemistry–Turk J Biochem. 2013;38:75–80.
  2. Bulik-Sullivan B, Finucane HK, Anttila V, Gusev A, Day FR, Loh P-R, Duncan L, Perry JR, Patterson N, Robinson EB, et al. An atlas of genetic correlations across human diseases and traits. Nature genetics. 2015 doi: 10.1038/ng.3406.
  3. Burch BD. Assessing the performance of normal-based and REML-based confidence intervals for the intraclass correlation coefficient. Computational Statistics & Data Analysis. 2011;55:1018–1028.
  4. Conomos MP, Laurie CA, Stilp AM, Gogarten SM, McHugh CP, Nelson SC, Sofer T, Fernández-Rhodes L, Justice AE, Graff M, Young KL, Seyerle A, Avery C, Taylor K, Rotter J, Talavera G, Daviglus M, Wassertheil-Smoller S, Schneiderman N, Heiss G, Kaplan R, Franceschini N, Reiner A, Shaffer G, John R, Barr, Kerr K, Browning S, Browning B, Weir B, Avilés-Santa L, Papanicolaou G, Lumley T, Szpiro A, North K, Rice K, Thornton T, Laurie C. Genetic Diversity and Association Studies in US Hispanic/Latino Populations: Applications in the Hispanic Community Health Study/Study of Latinos. The American Journal of Human Genetics. 2016a;98:165–184. doi: 10.1016/j.ajhg.2015.12.001.
  5. Conomos MP, Thornton T, Gogarten SM. GENESIS: GENetic EStimation and Inference in Structured samples (GENESIS): Statistical methods for analyzing genetic data from samples with population structure and/or relatedness. 2016b. R package version 2.5.2.
  6. Duchesne P, de Micheaux PL. Computing the distribution of quadratic forms: Further comparisons between the liu-tang-zhang approximation and exact methods. Computational Statistics and Data Analysis. 2010;54:858–862.
  7. Kruijer W, Boer MP, Malosetti M, Flood PJ, Engel B, Kooke R, Keurentjes JJ, van Eeuwijk FA. Marker-based estimation of heritability in immortal populations. Genetics. 2015;199:379–398. doi: 10.1534/genetics.114.167916.
  8. Kruijer W, Flood P, Kooke R. heritability: Marker-Based Estimation of Heritability Using Individual Plant or Plot Data. 2016. URL http://CRAN.R-project.org/package=heritability. R package version 1.2.
  9. Laurie C, et al. Quality control and quality assurance in genotypic data for genome-wide association studies. Genetic Epidemiology. 2010;34:591–602. doi: 10.1002/gepi.20516.
  10. Laurie CC, Laurie CA, Rice K, Doheny KF, Zelnick LR, McHugh CP, Ling H, Hetrick KN, Pugh EW, Amos C, et al. Detectable clonal mosaicism from birth to old age and its relationship to cancer. Nature Genetics. 2012;44:642–650. doi: 10.1038/ng.2271.
  11. LaVange LM, Kalsbeek WD, Sorlie PD, Avilés-Santa LM, Kaplan RC, Barnhart J, Liu K, Giachello A, Lee DJ, Ryan J, et al. Sample design and cohort selection in the hispanic community health study/study of latinos. Annals of epidemiology. 2010;20:642–649. doi: 10.1016/j.annepidem.2010.05.006.
  12. Li H, Gui J. Gradient directed regularization for sparse gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics. 2006;7:302–317. doi: 10.1093/biostatistics/kxj008.
  13. Lieberman O. Saddlepoint approximation for the distribution of a ratio of quadratic forms in normal variables. Journal of the American Statistical Association. 1994;89:924–928.
  14. Lumley T, Brody JA, Peloso G, Rice K. Sequence kernel association tests for large sets of markers: tail probabilities for large quadratic forms. bioRxiv. 2016 doi: 10.1002/gepi.22136. URL http://www.biorxiv.org/content/early/2016/11/04/085639.
  15. Schweiger R, Kaufman S, Laaksonen R, Kleber ME, März W, Eskin E, Rosset S, Halperin E. Fast and Accurate Construction of Confidence Intervals for Heritability. The American Journal of Human Genetics. 2016;98:1181–1192. doi: 10.1016/j.ajhg.2016.04.016. URL http://dx.doi.org/10.1016/j.ajhg.2016.04.016.
  16. Sorlie PD, Avilés-Santa LM, Wassertheil-Smoller S, Kaplan RC, Daviglus ML, Giachello AL, Schneiderman N, Raij L, Talavera G, Allison M, LaVange L, Chambless LE, Heiss G. Design and implementation of the hispanic community health study/study of latinos. Annals of epidemiology. 2010;20:629–641. doi: 10.1016/j.annepidem.2010.03.015.
  17. Zaitlen N, Kraft P. Heritability in the genome-wide association era. Human genetics. 2012;131:1655–1664. doi: 10.1007/s00439-012-1199-6.
  18. Zhou X. A unified framework for variance component estimation with summary statistics in genome-wide association studies. bioRxiv. 2016 doi: 10.1214/17-AOAS1052. URL http://biorxiv.org/content/early/2016/03/08/042846.
