Analytical and computational solution for the estimation of SNP-heritability in biobank-scale and distributed datasets

Guo-An Qi; Qi-Xin Zhang; Jingyu Kang; Tianyuan Li; Xiyun Xu; Zhe Zhang; Zhe Fan; Siyang Liu; Guo-Bo Chen

doi:10.1371/journal.pcbi.1013568

2025 Oct 21;21(10):e1013568. doi: 10.1371/journal.pcbi.1013568


Editor: Androniki Psifidi⁸

PMCID: PMC12539748 PMID: 41118419

Author summary

For a complex trait, heritability ( $h^{2}$ ) quantifies the proportion of its variation that is genetically determined. With the emergence of biobank-scale data, more powerful methods are needed to estimate $h^{2}$ . Building on the framework of randomized Haseman-Elston regression (RHE-reg), which integrates a fast randomization algorithm, $h^{2}$ can be estimated very efficiently for biobank-scale data such as UK Biobank (UKB). Furthermore, we present an analytical solution that balances computational cost against the precision of the estimate, a property that is important when dealing with biobank-scale data. We investigated the performance of RHE-reg in simulated data and applied it to 81 UKB quantitative traits; in UKB data of nearly 300,000 unrelated individuals, an estimation took about 4.5 hours on average when using 10 CPUs. We also extended RHE-reg to distributed datasets without compromising privacy. In both UKB and simulated data, RHE-reg estimated $h^{2}$ accurately. The software for estimating SNP-heritability in biobank-scale data has been released.

Abstract

Estimation of heritability is routine in statistical genetics, and it has become more demanding with increasing sample sizes, as in biobank-scale data, and with distributed datasets, for which privacy is a growing concern. Recently, a randomized Haseman-Elston regression (RHE-reg) was proposed to estimate SNP-heritability; given a sufficient number of iterations ( $B$ ), RHE-reg can handle biobank-scale data, such as UK Biobank (UKB), very efficiently. In this study, we present an analytical solution that relates the number of iterations $B$ to the precision of the RHE-reg estimate, which resolves the convergence of RHE-reg with high precision. We applied the method to 81 UKB quantitative traits and estimated their SNP-heritability and test statistics precisely. Furthermore, we extended RHE-reg to distributed datasets and demonstrated its utility in real and simulated data.

Introduction

Estimating heritability has been one of the central tasks in statistical genetics [1]. With increasing sequencing capability, high-throughput genetic data are emerging at biobank scale [2], which challenges statistical computation, in particular the estimation of heritability for complex traits. Conventional methods for estimating heritability, such as REML under the linear mixed model, often have a computational cost of $O (n^{2} m + n^{3})$ , where $n$ is the sample size and $m$ is the number of markers. These costs can become infeasible for biobank-scale data. Haseman-Elston regression (HE-reg) was originally proposed for linkage analysis [3]. After the correlation between sib pairs in nuclear families is replaced by the linkage disequilibrium (LD) among unrelated samples, the modified HE-reg can be used to estimate heritability and is much faster than REML [4]. However, because $m$ is often greater than $n$ in current data, any calculation that relies on the genetic relationship matrix (GRM) is unfavorable even for HE-reg. To reduce the computational cost of estimating heritability, a randomized estimation of heritability has been introduced [5], called randomized Haseman-Elston regression (RHE-reg), which is a promising method that can be used for both single-trait and bi-trait analyses [6,7].

RHE-reg is built on a hybrid framework that combines the favorable analytical properties of Haseman-Elston regression with a feasible computational cost of $O (n m B)$ for biobank-scale data, where $B$ is the number of iterations of RHE-reg. As pointed out by a recent systematic review, iteration control poses one of the challenges for RHE-reg [8]; the original report by Wu and Sankararaman did not give a clear solution for choosing the number of iterations [5]. In this study, we investigated RHE-reg and found an analytical procedure to control $B$ , which provides a customized number of iterations for a given dataset.

A current trend is that genomic cohorts are mushrooming, such as the emerging non-invasive prenatal testing cohorts [9,10], but the bottleneck is how to share genomic data without compromising personal privacy [11]. As recently practiced, once genotypes have been masked by randomization, randomized methods have proven reliable in addressing genetic problems for distributed data, such as searching for relatives [12,13]. Following this idea, we found that after the randomization step RHE-reg can be modified to estimate heritability for distributed datasets, reminiscent of vertical or horizontal federated learning [14].

Method description

Wu and Sankararaman proposed a randomized implementation of Haseman-Elston regression (RHE-reg), which dramatically reduced the computational time for $\mathrm{tr}(K^{2})$ from $O (n^{2} m)$ to $O (n m B)$ [5]; $K$ is the genetic relationship matrix for $n$ individuals on $m$ markers, and its detailed definition is given in the section below. A large $B$ , meaning more iterations of the algorithm, clearly improves precision, but it has remained unresolved how to choose $B$ and how $B$ determines the bounds of key statistics, such as the standard error of the randomized estimator [8]. This work is in general consistent with Wu and Sankararaman’s work, but we present the analytical sampling variance of the estimated $h^{2}$ and its corresponding test statistics after correcting some technical errors in their original work. We can consequently evaluate how $B$ influences the estimation of heritability and its corresponding $z$ score; as data can be very large, the control of $B$ is of theoretical as well as practical importance. The analytical solution translates into a computational procedure, and we further extend the method to two new scenarios: vertical-RHE-reg, which is a global implementation of LD score regression [15], and horizontal-RHE-reg, which, in the manner of federated learning, estimates heritability in distributed data without compromising privacy [14].

Materials and methods

A framework for Randomized Haseman-Elston regression (RHE-reg)

In essence, Haseman-Elston regression is a kind of method of moments (MoM) estimation for heritability, and can provide equivalent estimates of heritability for complex traits after IBD is replaced with IBS [4,11]. As we extend the work of Wu and Sankararaman [5], we similarly assume that

y = X β + e; \quad β \sim N (0, \frac{h^{2}}{m} I_{m}); \quad e \sim N (0, σ_{e}^{2} I_{n})

in which $y$ is the standardized phenotype of the trait of interest, $X$ is the standardized genotype matrix of $n$ individuals and $m$ biallelic markers, $β$ is the vector of marker effects, $e$ is the vector of residual effects, $I_{m}$ is an $m \times m$ identity matrix, $h^{2}$ is the SNP heritability, $I_{n}$ is an $n \times n$ identity matrix, and $σ_{e}^{2}$ is the residual variance. Under the general assumption for a polygenic trait, it is easy to see that

\mathrm{var}(y) = E (y y^{T}) - E (y) E (y^{T}) = \frac{h^{2}}{m} X X^{T} + σ_{e}^{2} I_{n} = h^{2} K + σ_{e}^{2} I_{n}

where $K = \frac{1}{m} X X^{T}$ is the genetic relationship matrix (GRM). The moment estimator, on which randomized Haseman-Elston regression is built, minimizes $Q = \mathrm{tr} \{ [y y^{T} - (h^{2} K + σ_{e}^{2} I_{n})]^{2} \}$ . Differentiating $Q$ with respect to $h^{2}$ and $σ_{e}^{2}$ , respectively, and setting the derivatives to zero gives the following normal equations:

\begin{bmatrix} \mathrm{tr}(K^{2}) & \mathrm{tr}(K) \\ \mathrm{tr}(K) & n \end{bmatrix} \begin{bmatrix} \hat{h}^{2} \\ \hat{σ}_{e}^{2} \end{bmatrix} = \begin{bmatrix} y^{T} K y \\ y^{T} I_{n} y \end{bmatrix}

(1)

The preliminary estimators for ${\hat{h}}^{2}$ and ${\hat{σ}}_{e}^{2}$ are given as

\begin{bmatrix} \hat{h}^{2} \\ \hat{σ}_{e}^{2} \end{bmatrix} = \begin{bmatrix} \frac{y^{T} [n K - \mathrm{tr}(K) I_{n}] y}{n [\mathrm{tr}(K^{2}) - n]} \\ \frac{y^{T} [\mathrm{tr}(K^{2}) I_{n} - \mathrm{tr}(K) K] y}{n \mathrm{tr}(K^{2}) - n^{2}} \end{bmatrix}

(2)
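As a small-scale illustration of Eq 1 and Eq 2, the following NumPy sketch solves the normal equations with an explicitly constructed GRM. The function name `he_reg_exact`, the toy sample sizes, and the simulated independent markers are assumptions made for illustration only; forming $K$ explicitly is feasible only for small $n$.

```python
import numpy as np

def he_reg_exact(y, X):
    """Solve the 2x2 normal equations with an explicit GRM (small n only)."""
    n, m = X.shape
    K = X @ X.T / m                            # GRM, costs O(n^2 m)
    tr_K, tr_K2 = np.trace(K), np.sum(K * K)   # tr(K^2) = sum of squared elements
    A = np.array([[tr_K2, tr_K], [tr_K, n]])
    rhs = np.array([y @ K @ y, y @ y])
    h2_hat, sigma_e2_hat = np.linalg.solve(A, rhs)
    return h2_hat, sigma_e2_hat, tr_K2

# Toy data (hypothetical sizes): independent markers, polygenic trait, h2 = 0.5
rng = np.random.default_rng(1)
n, m, h2_true = 500, 2000, 0.5
p = rng.uniform(0.1, 0.5, m)
G = rng.binomial(2, p, size=(n, m)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)       # standardized genotypes
y = X @ rng.normal(0.0, np.sqrt(h2_true / m), m) \
    + rng.normal(0.0, np.sqrt(1 - h2_true), n)
y = (y - y.mean()) / y.std()                   # standardized phenotype, y^T y = n

h2_hat, sigma_e2_hat, tr_K2 = he_reg_exact(y, X)
# Closed form of Eq 2 when tr(K) = n and y^T y = n
h2_closed = (np.sum((X.T @ y) ** 2) / m - n) / (tr_K2 - n)
```

With standardized $y$ and $X$, $\mathrm{tr}(K) = n$ and $y^{T} y = n$ hold exactly, so the two estimates sum to one and the solved ${\hat{h}}^{2}$ agrees with the closed form.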

For ease of discussion, we focus on the expression without adjustment for covariates. The denominator involves $\mathrm{tr}(K^{2})$ , a higher-order function of the GRM. By the trace property of a symmetric matrix, $\mathrm{tr}(K^{2}) = \sum_{i, j}^{n} K_{i j}^{2}$ , the sum of the squares of all elements of $K$ . We proved that $E [\mathrm{tr}(K^{2})] = \frac{n (n + 1)}{m_{e}} + n$ , where $m_{e}$ is the effective number of markers, which summarizes the average squared Pearson’s correlation among all genomic markers, as often used for measuring linkage disequilibrium [12,16,17]; a brief sketch of how $\mathrm{tr}(K^{2})$ can be converted into $m_{e}$ is presented in the section “Estimation for the effective number of markers” below. Therefore, the preliminary estimator is approximately ${\hat{h}}^{2} \approx \frac{m_{e}}{n^{2}} (y^{T} K y - n)$ , with expectation $E ({\hat{h}}^{2}) = \frac{\bar{r}_{m q}^{2}}{\bar{r}_{m}^{2}} h^{2}$ for a typical polygenic trait, as established [4,18]; $\bar{r}_{m q}^{2}$ is the averaged LD between a marker and a causal variant, and $\bar{r}_{m}^{2} = \frac{1}{m_{e}} = \frac{\sum_{k, l}^{m} ρ_{k l}^{2}}{m^{2}}$ is the averaged LD between any pair of markers, including the LD of a marker with itself. At first glance, Eq 2 seems to require computing $K$ , whose computational cost of $O (n^{2} m)$ is substantial for a large sample, such as UKB with about 500,000 samples. Instead, we estimate $\mathrm{tr}(K^{c})$ by randomization, using the properties of matrix algebra; the exponent $c$ takes the value 1, 2, 3, or 4 depending on the application in this study.

\begin{matrix} L_{c, B} = \frac{1}{B} \sum_{b = 1}^{B} z_{b}^{T} K^{c} z_{b} \\ E (L_{c, B}) = \mathrm{tr}(K^{c}) \\ \mathrm{var}(L_{c, B}) = \frac{2 \mathrm{tr}(K^{2 c})}{B} \end{matrix}

(3)

where $L_{c, B}$ is a linear estimator for $\mathrm{tr}(K^{c})$ , $z_{b}$ is a vector of length $n$ whose elements are sampled from the standard normal distribution, and $B$ is the number of iterations. Eq 3 is the most innovative part of the work of Wu and Sankararaman, and it is known as the Girard-Hutchinson estimator for stochastic trace estimation [19, 20]. Of note, $\mathrm{var}(L_{2, B}) = \frac{2 \mathrm{tr}(K^{4})}{B}$ , which was incorrectly derived as $\mathrm{var}(L_{2, B}) = \frac{2 \mathrm{tr}(K^{2})}{B}$ in Wu and Sankararaman’s work [5]; fixing this problem directly led to the present work. As will be shown below, $\mathrm{tr}(K^{4})$ serves as a plug-in parameter in the analysis, and we suggest a robust estimation of $\mathrm{tr}(K^{4})$ from $L_{4, B} = \frac{1}{B} \sum_{b = 1}^{B} z_{b}^{T} K^{4} z_{b}$ rather than from $\frac{B}{2} \mathrm{var}(L_{2, B})$ .
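The randomized trace estimator of Eq 3 can be sketched in NumPy as follows. The function name `trace_power_randomized` and the toy dimensions are illustrative assumptions; the point is that $K$ is never formed, because each product $K v$ is evaluated as $X (X^{T} v) / m$ , giving a cost of $O (c \, n m B)$ .

```python
import numpy as np

def trace_power_randomized(X, c, B, rng):
    """Estimate tr(K^c), K = X X^T / m, from B Gaussian probes without forming K."""
    n, m = X.shape
    Z = rng.standard_normal((n, B))      # columns are the probe vectors z_b
    V = Z.copy()
    for _ in range(c):                   # V <- K V through two thin products
        V = X @ (X.T @ V) / m
    return np.sum(Z * V) / B             # (1/B) * sum_b z_b^T K^c z_b

# Toy check against the explicit GRM (hypothetical sizes)
rng = np.random.default_rng(2)
n, m = 300, 1000
G = rng.binomial(2, rng.uniform(0.1, 0.5, m), size=(n, m)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)
K = X @ X.T / m
exact_tr_K2 = np.sum(K * K)
approx_tr_K = trace_power_randomized(X, c=1, B=400, rng=rng)
approx_tr_K2 = trace_power_randomized(X, c=2, B=400, rng=rng)
```

In this toy setting the randomized values should fall within a few percent of $n$ and of the exact $\mathrm{tr}(K^{2})$ , in line with the sampling variance given in Eq 3.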

Randomized estimation for $h^{2}$ via RHE-Reg

When there is random mating, $E [t r (K)] = n$ , and substituting the expressions given as Eq 3 into Eq 2, a randomized estimator of heritability is

\hat{h}^{2} = \frac{y^{T} [n K - \mathrm{tr}(K) I_{n}] y}{n [L_{2, B} - n]} \approx \frac{y^{T} [K - I_{n}] y}{L_{2, B} - n} = \frac{y^{T} K y - n}{L_{2, B} - n}

(4)

The component $L_{2, B} = \frac{1}{B} \sum_{b = 1}^{B} z_{b}^{T} K^{2} z_{b}$ in the denominator reflects the randomized nature of the estimation, with $B$ rounds of resampling.
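A minimal NumPy sketch of the randomized estimator in Eq 4 follows. The function name `rhe_reg` and the simulated toy data are assumptions for illustration and do not reproduce the released software; the explicit GRM is built only to check the randomized denominator.

```python
import numpy as np

def rhe_reg(y, X, B, rng):
    """Randomized HE-reg (Eq 4) for standardized y (length n) and X (n x m)."""
    n, m = X.shape
    yKy = np.sum((X.T @ y) ** 2) / m       # y^T K y = ||X^T y||^2 / m, O(nm)
    Z = rng.standard_normal((n, B))        # B Gaussian probes as columns
    KZ = X @ (X.T @ Z) / m                 # K z_b for every probe, O(nmB)
    L2B = np.sum(KZ * KZ) / B              # z_b^T K^2 z_b = ||K z_b||^2, averaged
    return (yKy - n) / (L2B - n), L2B

rng = np.random.default_rng(3)
n, m, h2_true = 300, 1000, 0.5
G = rng.binomial(2, rng.uniform(0.1, 0.5, m), size=(n, m)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)
y = X @ rng.normal(0.0, np.sqrt(h2_true / m), m) \
    + rng.normal(0.0, np.sqrt(1 - h2_true), n)
y = (y - y.mean()) / y.std()

h2_rand, L2B = rhe_reg(y, X, B=400, rng=rng)
K = X @ X.T / m                            # explicit GRM, only for checking
tr_K2 = np.sum(K * K)
h2_exact = (y @ K @ y - n) / (tr_K2 - n)   # non-randomized counterpart
```

The randomized and exact estimates differ only through the denominator, so their discrepancy is governed by how closely $L_{2, B}$ approximates $\mathrm{tr}(K^{2})$ .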

Sampling variance of RHE-reg

In Eq 4, let $a = y^{T} [K - I_{n}] y$ and $b = L_{2, B} - n$ ; with $Σ = \mathrm{var}(y) = h^{2} K + σ_{e}^{2} I_{n}$ , their respective means and variances are

\begin{matrix} μ_{a} = E (y^{T} K y - n) = [\mathrm{tr}(K^{2}) - n] h^{2}, & σ_{a}^{2} = \mathrm{var}[y^{T} (K - I_{n}) y] = 2 \mathrm{tr}[Σ (K - I_{n}) Σ (K - I_{n})] \\ μ_{b} = E (L_{2, B}) - n = \mathrm{tr}(K^{2}) - n = \frac{n (n + 1)}{m_{e}}, & σ_{b}^{2} = \frac{2}{B} \mathrm{tr}(K^{4}) \end{matrix}

The randomized estimator of $h^{2}$ can be seen as the ratio $\frac{a}{b}$ , in which both $a$ and $b$ are random variables. According to the Delta method, its sampling variance can be expressed as $\mathrm{var}(\frac{a}{b}) = \frac{1}{μ_{b}^{2}} σ_{a}^{2} - 2 \frac{μ_{a}}{μ_{b}^{3}} \mathrm{cov}(a, b) + \frac{μ_{a}^{2}}{μ_{b}^{4}} σ_{b}^{2}$ , in which the covariance term is zero in this scenario because $z_{b}$ is generated independently of $y$ [21]. So, we can obtain

\mathrm{var}(\hat{h}^{2}) = 2 \left(\frac{m_{e}}{n^{2}}\right)^{2} \left(Λ_{1} + \frac{\mathrm{tr}(K^{4})}{B} \cdot h^{4}\right)

(5)

For the definition of $Λ_{1}$ please refer to the section “Estimation for key parameters”. As $L_{2, B}$ is a random variable, the expectation of ${\hat{h}}^{2}$ can be approximated by $E ({\hat{h}}^{2}) = E (a) E (\frac{1}{b})$ , in which a second-order Taylor expansion of $\frac{1}{b}$ around $μ_{b}$ gives $E (\frac{1}{b}) \approx \frac{1}{μ_{b}} - \frac{1}{μ_{b}^{2}} E (b - μ_{b}) + \frac{1}{μ_{b}^{3}} E [(b - μ_{b})^{2}] = \frac{1}{μ_{b}} + \frac{σ_{b}^{2}}{μ_{b}^{3}}$ ; in practice $μ_{b}$ is replaced by its realized value $L_{2, B} - n$ . Consequently,

E (\hat{h}^{2}) = E (a) E (\frac{1}{b}) = h^{2} + 2 \left(\frac{m_{e}}{n^{2}}\right)^{2} \cdot \frac{\mathrm{tr}(K^{4})}{B} \cdot h^{2}

(6)

in which the second term is the bias of the RHE-reg estimator. At the same time, we can also find the mean squared error (MSE), the summation of the sampling variance and squared bias, for ${\hat{h}}^{2}$ as below

\mathrm{MSE}(\hat{h}^{2}) = \mathrm{var}(\hat{h}^{2}) + [E (\hat{h}^{2}) - h^{2}]^{2} = 2 \left(\frac{m_{e}}{n^{2}}\right)^{2} \left(Λ_{1} + \frac{\mathrm{tr}(K^{4})}{B} \cdot h^{4}\right) + 4 \left(\frac{m_{e}}{n^{2}}\right)^{4} \cdot \left[\frac{\mathrm{tr}(K^{4})}{B}\right]^{2} \cdot h^{4}

(7)

In this polynomial expression, as will be shown in the simulation and real data analysis, $\mathrm{MSE}({\hat{h}}^{2})$ is dominated by the sampling variance, which can be further reduced with sufficient iterations ( $B$ ). As will be shown for the UK Biobank examples, the required $B$ ranges from about 10 to 200, or even greater, depending on many factors.
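To make the roles of $B$ in Eq 5, Eq 6, and Eq 7 concrete, the sketch below evaluates the sampling variance, the bias, and the MSE for given plug-in values. The function name `rhe_var_bias_mse` and all numerical inputs are hypothetical and chosen only for illustration.

```python
import math

def rhe_var_bias_mse(h2, m_e, n, lambda1, tr_k4, B):
    """Sampling variance (Eq 5), bias (second term of Eq 6), and MSE (Eq 7)."""
    scale = 2.0 * (m_e / n**2) ** 2
    var = scale * (lambda1 + tr_k4 / B * h2**2)   # h^4 = (h^2)^2
    bias = scale * tr_k4 / B * h2
    return var, bias, var + bias**2

# Hypothetical plug-in values, for illustration only
h2, m_e, n, lambda1, tr_k4 = 0.25, 5000.0, 10000, 1.0e4, 5.0e4
var_10, bias_10, mse_10 = rhe_var_bias_mse(h2, m_e, n, lambda1, tr_k4, B=10)
var_100, bias_100, mse_100 = rhe_var_bias_mse(h2, m_e, n, lambda1, tr_k4, B=100)
var_floor = 2.0 * (m_e / n**2) ** 2 * lambda1    # limit of Eq 5 as B grows
```

Increasing $B$ tenfold shrinks the bias tenfold, whereas the variance approaches the floor $2 (\frac{m_{e}}{n^{2}})^{2} Λ_{1}$ that no number of iterations can remove.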

Constructing test statistics

Given the estimation of heritability, we can construct the z-score statistic below:

z_{1} = \frac{\hat{h}^{2}}{\hat{σ}_{h^{2}}} = \left(\frac{n^{2}}{\sqrt{2} m_{e}}\right) \frac{\hat{h}^{2}}{\sqrt{Λ_{1}} \sqrt{1 + \frac{η}{B}}}

(8)

in which $η = \frac{\mathrm{tr}(K^{4}) h^{4}}{Λ_{1}}$ , so that the term $\frac{η}{B}$ vanishes after sufficient iterations, and ${\hat{σ}}_{h^{2}}$ can be estimated from Eq 5. When $B$ is large enough, the optimal z score is the following:

z_{2} = \left(\frac{n^{2}}{\sqrt{2} m_{e}}\right) \cdot \frac{1}{\sqrt{Λ_{1}}} \cdot \hat{h}^{2} \quad (B \to \infty)

(9)

There is a direct relationship between the two z scores in Eq 8 and Eq 9 (practically $B \approx 50$ ). Given $z_{1}$ we can predict the optimal test statistic, denoted $z_{3}$ , as below:

z_{3} = z_{1} \sqrt{1 + \frac{η}{B}}

(10)

It means that after $B$ iterations the expected value of the optimal test statistic is predictable to a certain degree.

In summary, given $B$ iterations the observed test statistic is $z_{1}$ , which is subject to the realized values of $m_{e}$ , $Λ_{1}$ , and ${\hat{h}}^{2}$ . $z_{2}$ is the expected optimal test statistic when $B$ is very large and all uncertainty due to iteration has vanished. $z_{3}$ is a reconstruction from $z_{1}$ , and $\frac{z_{3}}{z_{1}} = \sqrt{1 + \frac{η}{B}}$ , indicating how much a larger $B$ would gain, for example in a more significant p-value.
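The relationship among the three test statistics in Eq 8, Eq 9, and Eq 10 can be sketched as below. The function name `z_scores` and the numerical inputs are hypothetical; in practice $m_{e}$ , $Λ_{1}$ , and $\mathrm{tr}(K^{4})$ would be the estimated plug-in values.

```python
import math

def z_scores(h2_hat, m_e, n, lambda1, tr_k4, B):
    """z1 (Eq 8), z2 (Eq 9), z3 (Eq 10), and the ratio eta (Eq 13)."""
    eta = tr_k4 * h2_hat**2 / lambda1
    z2 = n**2 / (math.sqrt(2.0) * m_e) * h2_hat / math.sqrt(lambda1)
    z1 = z2 / math.sqrt(1.0 + eta / B)     # observed after B iterations
    z3 = z1 * math.sqrt(1.0 + eta / B)     # predicted optimal statistic
    return z1, z2, z3, eta

# Hypothetical plug-in values, for illustration only
z1, z2, z3, eta = z_scores(h2_hat=0.25, m_e=5000.0, n=10000,
                           lambda1=1.0e4, tr_k4=5.0e4, B=10)
```

When the same plug-in values are used throughout, $z_{3}$ recovers $z_{2}$ exactly, and $z_{1}$ is smaller than $z_{2}$ by the factor $\sqrt{1 + \frac{η}{B}}$ .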

Estimation for key parameters

There are several key quantities/parameters involved in the above equations for RHE-reg, and we present how to estimate them. These parameters are $m_{e}$ – effective number of markers, $t r (K^{4})$ the trace of fourth-order GRM, and $Λ_{1}$ .

Estimation for the effective number of markers ( $m_{e}$ )

E [\mathrm{tr}(K^{2})] = E \left\{ \frac{1}{m^{2}} \sum_{i, j}^{n} \left[\sum_{k}^{m} x_{i k} x_{j k}\right]^{2} \right\} = E \left\{ \frac{1}{m^{2}} \sum_{i, j}^{n} \left[\sum_{k}^{m} x_{i k} x_{j k}\right] \left[\sum_{l}^{m} x_{i l} x_{j l}\right] \right\}

The double sum $\frac{1}{m^{2}} \sum_{i, j}^{n} [\sum_{k}^{m} x_{i k} x_{j k}] [\sum_{l}^{m} x_{i l} x_{j l}]$ can be decomposed into four terms, $E [\mathrm{tr}(K^{2})] = \frac{1}{m^{2}} E [\sum_{i}^{n} \sum_{k}^{m} x_{i k}^{4} + \sum_{i}^{n} \sum_{k \neq l}^{m} x_{i k}^{2} x_{i l}^{2} + \sum_{i \neq j}^{n} \sum_{k}^{m} x_{i k}^{2} x_{j k}^{2} + \sum_{i \neq j}^{n} \sum_{k \neq l}^{m} x_{i k} x_{j k} x_{i l} x_{j l}]$ , according to whether $i = j$ or $i \neq j$ and whether $k = l$ or $k \neq l$ . Evaluating these four terms according to Isserlis’s theorem [22], we have

E [\mathrm{tr}(K^{2})] = \frac{1}{m^{2}} \left[3 n m + n \sum_{k \neq l}^{m} (1 + 2 ρ_{k l}^{2}) + n (n - 1) m + n (n - 1) \sum_{k \neq l}^{m} ρ_{k l}^{2}\right] = n (n + 1) \frac{\sum_{k, l}^{m} ρ_{k l}^{2}}{m^{2}} + n = \frac{n (n + 1)}{m_{e}} + n

in which $m_{e} = \frac{m^{2}}{\sum_{k, l}^{m} ρ_{k l}^{2}}$ is the effective number of markers and $ρ_{k l}^{2}$ is the squared Pearson’s correlation (LD) between a pair of SNPs [23]. In general $m_{e} \leq m$ , and $m_{e} = m$ if all markers are in linkage equilibrium (see the note of Table 1). Here, $m_{e}$ is a population parameter, a summary statistic that encompasses the allelic frequencies and linkage disequilibrium of the markers. Because, according to Eq 3, $E (L_{2, B}) = \mathrm{tr}(K^{2})$ , whose expectation is $\frac{n (n + 1)}{m_{e}} + n$ , we consequently propose a randomization algorithm that estimates $m_{e}$ as below (Eq 11)

Table 1. High-order moments for different genotype coding schemes.

Genotype $x_{i, k} x_{i, l}$	Coding scheme¹,²,³	Frequencies for $x_{i, k} x_{i, l}$ ( $f_{v_{1} v_{2}}$ )⁴
$A_{k} A_{k} B_{l} B_{l}$	$α_{1} β_{1}$	$f_{1, 1} = p_{k}^{2} R_{k l}^{2} = p_{k}^{2} p_{l}^{2} + 2 p_{k} p_{l} D_{k l} + D_{k l}^{2}$
$A_{k} A_{k} B_{l} b_{l}$	$α_{1} β_{2}$	$f_{1, 2} = p_{k}^{2} \cdot 2 R_{k l} \bar{R}_{k l} = 2 p_{k}^{2} p_{l} q_{l} + 2 p_{k} (q_{l} - p_{l}) D_{k l} - 2 D_{k l}^{2}$
$A_{k} A_{k} b_{l} b_{l}$	$α_{1} β_{3}$	$f_{1, 3} = p_{k}^{2} \bar{R}_{k l}^{2} = p_{k}^{2} q_{l}^{2} - 2 p_{k} q_{l} D_{k l} + D_{k l}^{2}$
$A_{k} a_{k} B_{l} B_{l}$	$α_{2} β_{1}$	$f_{2, 1} = 2 p_{k} q_{k} R_{k l} \bar{r}_{k l} = 2 p_{k} q_{k} p_{l}^{2} + 2 p_{l} (q_{k} - p_{k}) D_{k l} - 2 D_{k l}^{2}$
$A_{k} a_{k} B_{l} b_{l}$	$α_{2} β_{2}$	$f_{2, 2} = 2 p_{k} q_{k} (\bar{R}_{k l} \bar{r}_{k l} + R_{k l} r_{k l}) = 4 p_{k} q_{k} p_{l} q_{l} + 2 (p_{k} - q_{k}) (p_{l} - q_{l}) D_{k l} + 4 D_{k l}^{2}$
$A_{k} a_{k} b_{l} b_{l}$	$α_{2} β_{3}$	$f_{2, 3} = 2 p_{k} q_{k} \bar{R}_{k l} r_{k l} = 2 p_{k} q_{k} q_{l}^{2} + 2 q_{l} (p_{k} - q_{k}) D_{k l} - 2 D_{k l}^{2}$
$a_{k} a_{k} B_{l} B_{l}$	$α_{3} β_{1}$	$f_{3, 1} = q_{k}^{2} \bar{r}_{k l}^{2} = q_{k}^{2} p_{l}^{2} - 2 q_{k} p_{l} D_{k l} + D_{k l}^{2}$
$a_{k} a_{k} B_{l} b_{l}$	$α_{3} β_{2}$	$f_{3, 2} = q_{k}^{2} \cdot 2 \bar{r}_{k l} r_{k l} = 2 q_{k}^{2} p_{l} q_{l} + 2 q_{k} (p_{l} - q_{l}) D_{k l} - 2 D_{k l}^{2}$
$a_{k} a_{k} b_{l} b_{l}$	$α_{3} β_{3}$	$f_{3, 3} = q_{k}^{2} r_{k l}^{2} = q_{k}^{2} q_{l}^{2} + 2 q_{k} q_{l} D_{k l} + D_{k l}^{2}$


¹For the additive effect, under the coding scheme of 0 ( $a a$ ), 1 ( $A a$ ), and 2 ( $A A$ ), which counts the number of reference alleles ( $A$ ) with allele frequency $p$ ; $q = 1 - p$ is the frequency of the alternative allele. After standardizing each genotype, we have $[α_{1}, α_{2}, α_{3}] = [\frac{2 q_{k}}{\sqrt{2 p_{k} q_{k}}}, \frac{q_{k} - p_{k}}{\sqrt{2 p_{k} q_{k}}}, \frac{- 2 p_{k}}{\sqrt{2 p_{k} q_{k}}}]$ for $A A$ , $A a$ , and $a a$ , and $[β_{1}, β_{2}, β_{3}] = [\frac{2 q_{l}}{\sqrt{2 p_{l} q_{l}}}, \frac{q_{l} - p_{l}}{\sqrt{2 p_{l} q_{l}}}, \frac{- 2 p_{l}}{\sqrt{2 p_{l} q_{l}}}]$ for $B B$ , $B b$ , and $b b$ . It leads to $\sum_{v_{1}, v_{2}}^{3} f_{v_{1} v_{2}} α_{v_{1}} β_{v_{2}} = \frac{2 D_{k l}}{\sqrt{2 p_{k} q_{k} \cdot 2 p_{l} q_{l}}} = ρ_{k l}$ , in which the subscript $v$ indexes the three genotypes of a locus.

²For the dominance effect, under the coding scheme of 0 ( $a a$ ), $2 p_{l}$ ( $A a$ ), and $4 p_{l} - 2$ ( $A A$ ) for 0, 1, and 2 reference alleles, we have $[α_{1}, α_{2}, α_{3}] = [\frac{- 2 q_{k}^{2}}{\sqrt{4 p_{k}^{2} q_{k}^{2}}}, \frac{2 p_{k} q_{k}}{\sqrt{4 p_{k}^{2} q_{k}^{2}}}, \frac{- 2 p_{k}^{2}}{\sqrt{4 p_{k}^{2} q_{k}^{2}}}]$ for $A A$ , $A a$ , and $a a$ , and $[β_{1}, β_{2}, β_{3}] = [\frac{- 2 q_{l}^{2}}{\sqrt{4 p_{l}^{2} q_{l}^{2}}}, \frac{2 p_{l} q_{l}}{\sqrt{4 p_{l}^{2} q_{l}^{2}}}, \frac{- 2 p_{l}^{2}}{\sqrt{4 p_{l}^{2} q_{l}^{2}}}]$ for $B B$ , $B b$ , and $b b$ . It leads to $\sum_{v_{1}, v_{2}}^{3} f_{v_{1} v_{2}} α_{v_{1}} β_{v_{2}} = \frac{4 D_{k l}^{2}}{\sqrt{4 p_{k}^{2} q_{k}^{2} \cdot 4 p_{l}^{2} q_{l}^{2}}} = ρ_{k l}^{2}$ .

³For alternative dominance coding scheme of 0, 1, and 0 for 0, 1, and 2 reference alleles, we have $[α_{1}, α_{2}, α_{3}] = [- \sqrt{\frac{2 p_{k} q_{k}}{1 - 2 p_{k} q_{k}}}, \sqrt{\frac{1 - 2 p_{k} q_{k}}{2 p_{k} q_{k}}}, - \sqrt{\frac{2 p_{k} q_{k}}{1 - 2 p_{k} q_{k}}}]$ and $[β_{1}, β_{2}, β_{3}] = [- \sqrt{\frac{2 p_{l} q_{l}}{1 - 2 p_{l} q_{l}}}, \sqrt{\frac{1 - 2 p_{l} q_{l}}{2 p_{l} q_{l}}}, - \sqrt{\frac{2 p_{l} q_{l}}{1 - 2 p_{l} q_{l}}}]$ . It leads to $\sum_{v_{1}, v_{2}}^{3} f_{v_{1} v_{2}} α_{v_{1}} β_{v_{2}} = ρ_{k l} \frac{(p_{k} - q_{k}) (p_{l} - q_{l})}{\sqrt{(1 - 2 p_{k} q_{k}) (1 - 2 p_{l} q_{l})}} + ρ_{k l}^{2} \sqrt{\frac{2 p_{l} q_{l} \cdot 2 p_{k} q_{k}}{(1 - 2 p_{k} q_{k}) (1 - 2 p_{l} q_{l})}}$ .

⁴The four elements $r_{k l} = q_{l} + \frac{D_{k l}}{q_{k}}$ , $\bar{r}_{k l} = p_{l} - \frac{D_{k l}}{q_{k}}$ , $\bar{R}_{k l} = q_{l} - \frac{D_{k l}}{p_{k}}$ , and $R_{k l} = p_{l} + \frac{D_{k l}}{p_{k}}$ represent the conditional probabilities for the four haplotypes $a_{k} b_{l}$ , $a_{k} B_{l}$ , $A_{k} b_{l}$ , and $A_{k} B_{l}$ , respectively. See Note V in S1 Text for detailed calculation.

\begin{matrix} \hat{m}_{e} = \frac{n (n + 1)}{L_{2, B} - n} \\ \mathrm{var}(\hat{m}_{e}) = \frac{2 m_{e}^{4}}{n^{4}} \frac{\mathrm{tr}(K^{4})}{B} \end{matrix}

(11)

A more detailed estimation procedure for $m_{e}$ can be found in our recent work [16]. See Note I in S1 Text for more details.
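The estimator of $m_{e}$ in Eq 11 can be illustrated with a short NumPy sketch. The function name `effective_markers` and the toy data are assumptions for illustration. Duplicating every marker doubles the marker count without changing $K$ , so the estimated $m_{e}$ should stay near the number of independent markers rather than follow the nominal count.

```python
import numpy as np

def effective_markers(X, B, rng):
    """Randomized estimate of m_e (Eq 11) from standardized genotypes X (n x m)."""
    n, m = X.shape
    Z = rng.standard_normal((n, B))
    KZ = X @ (X.T @ Z) / m                 # K z_b without forming K
    L2B = np.sum(KZ * KZ) / B              # randomized tr(K^2)
    return n * (n + 1) / (L2B - n)

rng = np.random.default_rng(4)
n, m = 300, 1000
G = rng.binomial(2, rng.uniform(0.1, 0.5, m), size=(n, m)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)   # independent markers
X_dup = np.repeat(X, 2, axis=1)            # 2m markers in perfect pairwise LD

m_e_indep = effective_markers(X, B=400, rng=rng)
m_e_dup = effective_markers(X_dup, B=400, rng=rng)
```

In this toy example both estimates are expected to be close to $m = 1000$ , even though `X_dup` contains 2,000 markers.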

Estimation for $\mathrm{tr}(K^{4})$

The benchmark estimation for $\mathrm{tr}(K^{4})$ is $\mathrm{tr}(K^{4}) = \sum_{i = 1}^{n} λ_{i}^{4}$ , the sum of the fourth powers of the eigenvalues of $K$ . However, it is computationally expensive when $X$ is large. There are two alternative choices to estimate $\mathrm{tr}(K^{4})$ . Method I: $\widehat{\mathrm{tr}(K^{4})} = \frac{B}{2} \mathrm{var}(L_{2, B})$ , whose precision depends on $B$ . Method II: $\widehat{\mathrm{tr}(K^{4})} = \frac{1}{B} \sum_{b = 1}^{B} z_{b}^{T} K^{4} z_{b}$ , which uses the fourth-order randomized estimation in Eq 3. Both Method I and Method II can be realized via Eq 3. As will be shown below, Method II provides more stable estimates than Method I.
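The two randomized estimates of $\mathrm{tr}(K^{4})$ can be compared against the eigenvalue benchmark in a small example. The function name `tr_k4_estimates` and the toy data are illustrative assumptions; Method I is computed here from the sample variance of the per-probe quadratic forms $z_{b}^{T} K^{2} z_{b}$ , which equals $B \cdot \mathrm{var}(L_{2, B})$ .

```python
import numpy as np

def tr_k4_estimates(X, B, rng):
    """Method I and Method II estimates of tr(K^4) from the same B probes."""
    n, m = X.shape
    Z = rng.standard_normal((n, B))
    KZ = X @ (X.T @ Z) / m                      # K z_b
    K2Z = X @ (X.T @ KZ) / m                    # K^2 z_b
    q2 = np.sum(KZ * KZ, axis=0)                # z_b^T K^2 z_b for each probe
    method1 = 0.5 * np.var(q2, ddof=1)          # (B/2) var(L_2B)
    method2 = np.mean(np.sum(K2Z * K2Z, axis=0))  # mean of z_b^T K^4 z_b
    return method1, method2

rng = np.random.default_rng(5)
n, m = 300, 1000
G = rng.binomial(2, rng.uniform(0.1, 0.5, m), size=(n, m)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)

lam = np.linalg.eigvalsh(X @ X.T / m)           # benchmark, feasible for small n
tr_k4_exact = np.sum(lam ** 4)
tr_k4_m1, tr_k4_m2 = tr_k4_estimates(X, B=400, rng=rng)
```

Method II averages the quantity of interest directly, whereas Method I relies on a sample variance, which is typically noisier for the same $B$ .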

About $Λ_{1}$ : high-dimensional structure of genetic architecture

$Λ_{1}$ is defined as $Λ_{1} = \mathrm{tr} \{ [Σ (K - I_{n})]^{2} \} = \mathrm{tr} \{ Σ (K - I_{n}) Σ (K - I_{n}) \}$ , in which $Σ = K h^{2} + I_{n} σ_{e}^{2}$ . Because $K$ is too expensive to construct, as aforementioned, one of the two copies of $Σ$ is replaced by its empirical counterpart $y y^{T}$ , and we consequently have

\begin{matrix} Λ_{1} \approx y^{T} (K - I_{n}) K (K - I_{n}) y \, \hat{h}^{2} + y^{T} (K - I_{n}) (K - I_{n}) y \, \hat{σ}_{e}^{2} \\ \approx [L_{3, 0} - 2 L_{2, 0} + L_{1, 0}] \hat{h}^{2} + [L_{2, 0} - 2 L_{1, 0} + n] \hat{σ}_{e}^{2} = L_{h^{2}} \hat{h}^{2} + L_{σ_{e}^{2}} \hat{σ}_{e}^{2} \end{matrix}

(12)

$L_{3, 0} = y^{T} K^{3} y$ can be computed as in Eq 3 if $z$ is replaced by $y$ , the phenotype itself; the same holds for $L_{2, 0} = y^{T} K^{2} y$ and $L_{1, 0} = y^{T} K y$ . They reflect the high-dimensional structure between $y$ and $X$ . So the sampling variance of ${\hat{h}}^{2}$ is related not only to $h^{2}$ itself but also to the high-order structure between $y$ and $X$ . See Note II in S1 Text for more discussion about $Λ_{1}$ .
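A minimal sketch of Eq 12 follows, using only matrix-vector products so that $K$ is never formed. The function name `lambda1_hat` is hypothetical, and the sketch assumes a standardized phenotype so that ${\hat{σ}}_{e}^{2} = 1 - {\hat{h}}^{2}$ ; the explicit-GRM computation is included only as a check on small toy data.

```python
import numpy as np

def lambda1_hat(y, X, h2_hat):
    """Eq 12 from L_{1,0}, L_{2,0}, L_{3,0}; assumes sigma_e^2 = 1 - h2_hat."""
    n, m = X.shape
    Ky = X @ (X.T @ y) / m                     # K y
    K2y = X @ (X.T @ Ky) / m                   # K^2 y
    L1, L2, L3 = y @ Ky, Ky @ Ky, Ky @ K2y     # y'Ky, y'K^2y, y'K^3y
    L_h2 = L3 - 2.0 * L2 + L1
    L_e = L2 - 2.0 * L1 + y @ y                # y'y = n for standardized y
    return L_h2 * h2_hat + L_e * (1.0 - h2_hat)

rng = np.random.default_rng(6)
n, m, h2_true = 300, 1000, 0.5
G = rng.binomial(2, rng.uniform(0.1, 0.5, m), size=(n, m)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)
y = X @ rng.normal(0.0, np.sqrt(h2_true / m), m) \
    + rng.normal(0.0, np.sqrt(1 - h2_true), n)
y = (y - y.mean()) / y.std()

h2_plug = 0.4                                  # hypothetical plug-in estimate
lam1 = lambda1_hat(y, X, h2_plug)
K = X @ X.T / m                                # explicit GRM, only for checking
M = K - np.eye(n)
lam1_direct = h2_plug * (y @ M @ K @ M @ y) + (1.0 - h2_plug) * (y @ M @ M @ y)
```

The matrix-free version costs $O (n m)$ per product, in contrast to the $O (n^{2} m)$ needed to build $K$ for the direct evaluation.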

About $η$ : the term that determines the number of iterations $B$

We define the ratio $η$ as below

η = \frac{t r (K^{4}) h^{4}}{Λ_{1}}

(13)

in which $\mathrm{tr}(K^{4})$ can be estimated as $\widehat{\mathrm{tr}(K^{4})} = L_{4, B} = \frac{1}{B} \sum_{b = 1}^{B} z_{b}^{T} K^{4} z_{b}$ as above. However, it should be noticed that $h^{4}$ imposes a heavy penalty for higher heritability; for example, compared with $h^{2} = 0.01$ , $h^{2} = 0.1$ leads to a 100-fold larger numerator in Eq 13. The number of iterations $B$ needed to reach a preset ratio $η_{0} = \frac{η}{B}$ then follows directly:

B = \frac{η}{η_{0}} = \frac{1}{η_{0}} \frac{t r (K^{4}) h^{4}}{Λ_{1}}

(14)

In practice, $η_{0}$ can take the value of 0.1 or 0.05 as in our simulation and real data analysis below.
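The stopping rule of Eq 14 amounts to a one-line calculation, sketched below. The function name `iterations_needed`, the rounding up to an integer, and the numerical inputs are assumptions made for illustration.

```python
import math

def iterations_needed(tr_k4, h2, lambda1, eta0=0.1):
    """Number of iterations B = eta / eta0 (Eq 14), rounded up (a choice made here)."""
    eta = tr_k4 * h2**2 / lambda1              # Eq 13, h^4 = (h^2)^2
    return max(1, math.ceil(eta / eta0)), eta

# Hypothetical plug-in values, for illustration only
B_needed, eta = iterations_needed(tr_k4=5.0e4, h2=0.5, lambda1=1.0e4, eta0=0.1)

# The h^4 penalty: with all else equal, h2 = 0.1 versus h2 = 0.01
_, eta_high = iterations_needed(tr_k4=5.0e4, h2=0.1, lambda1=1.0e4)
_, eta_low = iterations_needed(tr_k4=5.0e4, h2=0.01, lambda1=1.0e4)
```

Here $η = 1.25$ , so about 13 iterations are needed for $η_{0} = 0.1$ . In practice $Λ_{1}$ also depends on $h^{2}$ , so the 100-fold ratio applies to the numerator of Eq 13 only.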

Extended utilities for distributed GWAS datasets

Because datasets are often distributed across institutes, we consider two scenarios for the application of RHE-reg to distributed datasets. In the first scenario, as the estimator of $h^{2}$ (Eq 4) can be split into a numerator and a denominator, the two are estimated from two different sources. In the second scenario, the whole dataset is distributed in slices across $s$ different institutes. We call the first scenario vertical RHE-reg and the second horizontal RHE-reg.

Vertical RHE-reg

Estimation for $h^{2}$ can be implemented with summary statistics, in that the numerator and the denominator can come from different sources [18]. We denote the corresponding heritability ${\tilde{h}}^{2}$ to mark this subtle difference, and likewise use tilded symbols for all quantities derived from the genotypes of a reference panel. Eq 4 can then be rewritten as

{\hat{\tilde{h}}}^{2} = \frac{{\tilde{n}}^{2} (y^{T} K y - n)}{n^{2} ({\tilde{L}}_{2, B} - \tilde{n})} = {\tilde{m}}_{e} \cdot \frac{(y^{T} K y - n)}{n^{2}}

(15)

The denominator involves ${\tilde{L}}_{2, B} = \frac{1}{B} \sum_{b = 1}^{B} z_{b}^{T} {\tilde{K}}^{2} z_{b}$ , where $\tilde{K} = \frac{1}{m} \tilde{X} {\tilde{X}}^{T}$ and $\tilde{X}$ , of dimension $\tilde{n} \times m$ , is the genotype matrix of the reference sample employed to estimate ${\tilde{L}}_{2, B}$ . So, $\mathrm{var}(\hat{\tilde{h}}^{2})$ is (assuming $n \approx \tilde{n}$ )

\mathrm{var}(\hat{\tilde{h}}^{2}) = \frac{2}{[\mathrm{tr}(\tilde{K}^{2}) - \tilde{n}]^{2}} \cdot \left\{ Λ_{1} + \frac{[\mathrm{tr}(K^{2}) - n]^{2}}{[\mathrm{tr}(\tilde{K}^{2}) - \tilde{n}]^{2}} \cdot \tilde{h}^{4} \cdot \frac{\mathrm{tr}(\tilde{K}^{4})}{B} \right\} = 2 \left(\frac{\tilde{m}_{e}}{\tilde{n}^{2}}\right)^{2} \cdot \left\{ Λ_{1} + \left(\frac{\tilde{m}_{e}}{m_{e}}\right)^{2} \cdot \tilde{h}^{4} \cdot \frac{\mathrm{tr}(\tilde{K}^{4})}{B} \right\}

The expectation is $E [\hat{\tilde{h}}^{2}] = h^{2} + 2 \cdot \left(\frac{\tilde{m}_{e}}{\tilde{n}^{2}}\right)^{2} \cdot \frac{\mathrm{tr}(\tilde{K}^{4})}{B} \cdot h^{2}$ , in which the second term is the bias and vanishes as $B$ increases (see Note III in S1 Text for more general situations). The corresponding test statistic is

\tilde{z}_{1} = \left(\frac{\tilde{n}^{2}}{\sqrt{2} \tilde{m}_{e}}\right) \frac{\tilde{h}^{2}}{\sqrt{Λ_{1}} \sqrt{1 + \left(\frac{\tilde{m}_{e}}{m_{e}}\right)^{2} \frac{\tilde{η}}{B}}}

in which $\tilde{η} = \frac{\mathrm{tr}(\tilde{K}^{4}) {\tilde{h}}^{4}}{Λ_{1}}$ . For populations of similar ancestry, the ratio $\frac{\tilde{m}_{e}}{m_{e}} \approx 1$ , and the term involving $\frac{\tilde{η}}{B}$ vanishes after sufficient iterations, which leads to

\tilde{z}_{2} = \left(\frac{\tilde{n}^{2}}{\sqrt{2} \tilde{m}_{e}}\right) \frac{\tilde{h}^{2}}{\sqrt{Λ_{1}}}
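Vertical RHE-reg (Eq 15) can be sketched as below: the numerator comes from the study sample and $\tilde{m}_{e}$ comes from a separate reference panel. The function names, the toy sizes, and the assumption that both samples are drawn from the same allele frequencies (similar ancestry) are illustrative only.

```python
import numpy as np

def standardize(G):
    """Column-standardize a genotype matrix."""
    return (G - G.mean(axis=0)) / G.std(axis=0)

def vertical_rhe(y, X, X_ref, B, rng):
    """Eq 15: numerator from the study sample, m_e from a reference panel."""
    n, m = X.shape
    n_ref = X_ref.shape[0]
    yKy = np.sum((X.T @ y) ** 2) / m           # study-sample numerator term
    Z = rng.standard_normal((n_ref, B))
    KZ = X_ref @ (X_ref.T @ Z) / m             # reference-panel GRM times probes
    L2B_ref = np.sum(KZ * KZ) / B
    m_e_ref = n_ref**2 / (L2B_ref - n_ref)     # tilde m_e, as used in Eq 15
    return m_e_ref * (yKy - n) / n**2, m_e_ref

rng = np.random.default_rng(7)
n, n_ref, m, h2_true = 400, 300, 1000, 0.5
p = rng.uniform(0.1, 0.5, m)                   # shared allele frequencies
X = standardize(rng.binomial(2, p, size=(n, m)).astype(float))
X_ref = standardize(rng.binomial(2, p, size=(n_ref, m)).astype(float))
y = X @ rng.normal(0.0, np.sqrt(h2_true / m), m) \
    + rng.normal(0.0, np.sqrt(1 - h2_true), n)
y = (y - y.mean()) / y.std()

h2_vertical, m_e_ref = vertical_rhe(y, X, X_ref, B=400, rng=rng)
```

Only the scalar $\tilde{m}_{e}$ has to travel from the reference panel to the study site, so no individual-level reference genotypes are exchanged.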

Horizontal RHE-reg

For this application, it is assumed that the entire dataset is divided among $s$ institutes, indexed by the subscript $v$ for $y$ and $X$ . Consequently, $y^{T} = [y_{1}^{T} ⋮ y_{2}^{T} ⋮ \dots ⋮ y_{s}^{T}]$ and $X^{T} = [X_{1}^{T} ⋮ X_{2}^{T} ⋮ \dots ⋮ X_{s}^{T}]$ , in which $y_{v}$ has length $n_{v}$ and $X_{v}$ has dimension $n_{v} \times m$ , where $n_{v}$ is the number of individuals in the $v^{t h}$ institute. To standardize $y$ and $X$ globally, one only needs to receive the mean and sum of squares of each $y_{v}$ and, similarly, the allele frequencies of the $m$ reference alleles of each $X_{v}$ . So, after scaling $y_{v}$ and $X_{v}$ ,

\hat{h}^{2} = \frac{\frac{1}{m} ‖ y^{T} X ‖_{F}^{2} - n}{\frac{1}{B m^{2}} ‖ Z X X^{T} ‖_{F}^{2} - n} = \frac{\frac{1}{m} ‖ \sum_{v = 1}^{s} y_{v}^{T} X_{v} ‖_{F}^{2} - n}{\frac{1}{B m^{2}} \sum_{u = 1}^{s} ‖ (\sum_{v = 1}^{s} Z_{v} X_{v}) X_{u}^{T} ‖_{F}^{2} - n}

(16)

in which $Z = [Z_{1} ⋮ Z_{2} ⋮ \dots ⋮ Z_{s}]$ and $Z_{v}$ , a $B \times n_{v}$ matrix, is generated from $N (0, 1)$ by each institute. Each institute can consequently generate $y_{v}^{T} X_{v}$ and $Z_{v} X_{v}$ independently; once the aggregated $\sum_{v = 1}^{s} Z_{v} X_{v}$ is available, each institute multiplies it by its own $X_{u}^{T}$ , so that only aggregated products, rather than individual-level genotypes, are shared, without compromise of privacy. The subscript $F$ indicates the Frobenius norm of a matrix, ${‖ A ‖}_{F}^{2} = \mathrm{tr}(A A^{T})$ . Depending on the precision requirement, after $B$ rounds of iterations $η$ can be calculated to evaluate whether further iterations are needed. Unlike vertical RHE-reg, horizontal RHE-reg is identical to RHE-reg under this simple scenario. An R script is attached for its detailed implementation (S1 Data).
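The equivalence between horizontal RHE-reg and pooled RHE-reg can be checked with a short NumPy sketch. This is not the attached R script: the function name `horizontal_rhe`, the two-pass aggregation, and the scaling factors $\frac{1}{m}$ and $\frac{1}{B m^{2}}$ (chosen so that the numerator and denominator match Eq 4) are assumptions of this sketch, and for brevity the data are standardized before being split into sites.

```python
import numpy as np

def horizontal_rhe(y_parts, X_parts, Z_parts):
    """RHE-reg from site-level products; Z_parts[v] is a B x n_v probe matrix."""
    m = X_parts[0].shape[1]
    n = sum(len(yv) for yv in y_parts)
    B = Z_parts[0].shape[0]
    yX = sum(yv @ Xv for yv, Xv in zip(y_parts, X_parts))   # sum_v y_v^T X_v
    ZX = sum(Zv @ Xv for Zv, Xv in zip(Z_parts, X_parts))   # sum_v Z_v X_v, B x m
    num = np.sum(yX ** 2) / m - n                           # y^T K y - n
    # Second pass: each site u multiplies the aggregate by its own X_u^T
    den = sum(np.sum((ZX @ Xu.T) ** 2) for Xu in X_parts) / (B * m**2) - n
    return num / den

rng = np.random.default_rng(8)
n, m, s, B, h2_true = 300, 1000, 3, 200, 0.5
G = rng.binomial(2, rng.uniform(0.1, 0.5, m), size=(n, m)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)
y = X @ rng.normal(0.0, np.sqrt(h2_true / m), m) \
    + rng.normal(0.0, np.sqrt(1 - h2_true), n)
y = (y - y.mean()) / y.std()

X_parts = np.array_split(X, s)                 # s hypothetical institutes
y_parts = np.array_split(y, s)
Z_parts = [rng.standard_normal((B, len(Xv))) for Xv in X_parts]
h2_horizontal = horizontal_rhe(y_parts, X_parts, Z_parts)

# Pooled RHE-reg with the same probes, for comparison
Z = np.hstack(Z_parts)                         # B x n
KZt = X @ (X.T @ Z.T) / m                      # K z_b as columns
L2B = np.sum(KZt * KZt) / B
h2_pooled = (np.sum((X.T @ y) ** 2) / m - n) / (L2B - n)
```

With the same random probes, the distributed and pooled estimates agree up to floating-point error.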

Summary for RHE-reg

We now discuss some computational issues of RHE-reg. Because the randomization enters every estimate, $B$ affects all results of RHE-reg, and the focus here is how $B$ affects RHE-reg, in particular the stability of ${\hat{h}}^{2}$ and the z scores. All the above analyses are based on three computational units, $y$ , $X$ , and, if covariates are taken into account, $W$ ; the operations among them make up the whole computational procedure, and their elementary operations can be implemented hierarchically (Table 2), which serves as an atlas of the computational route. With $w$ covariates, the covariate matrix $W$ has dimension $n \times w$ . After inclusion of the covariates, the equations for the stopping rules can be updated accordingly (see Note IV in S1 Text). This completes the description of the statistical approaches, and we now turn to their applications.

Table 2. Analytical results for RHE-reg.

	Individual-level data	Vertical RHE-reg
$h^{2}$ estimation	$\hat{h}^{2} = \frac{y^{T} K y - n}{L_{2, B} - n}$ ; $\mathrm{var}(\hat{h}^{2}) = 2 (\frac{m_{e}}{n^{2}})^{2} (Λ_{1} + \frac{\mathrm{tr}(K^{4})}{B} h^{4})$	$\hat{\tilde{h}}^{2} = \frac{\tilde{n}^{2} (y^{T} K y - n)}{n^{2} (\tilde{L}_{2, B} - \tilde{n})}$ ; $\mathrm{var}(\hat{\tilde{h}}^{2}) = 2 (\frac{\tilde{m}_{e}}{\tilde{n}^{2}})^{2} \{ Λ_{1} + (\frac{\tilde{m}_{e}}{m_{e}})^{2} \frac{\mathrm{tr}(\tilde{K}^{4})}{B} \tilde{h}^{4} \}$
Intermediate parameters	$Λ_{1} = L_{h^{2}} h^{2} + L_{σ_{e}^{2}} σ_{e}^{2}$ ; $η = \frac{\mathrm{tr}(K^{4}) \cdot h^{4}}{Λ_{1}}$	$\tilde{Λ}_{1} = L_{h^{2}} \tilde{h}^{2} + L_{σ_{e}^{2}} \tilde{σ}_{e}^{2}$ ; $\tilde{η} = \frac{\mathrm{tr}(\tilde{K}^{4}) \cdot \tilde{h}^{4}}{Λ_{1}}$
Randomization	$\widehat{\mathrm{tr}(K^{2})} = L_{2, B} = \frac{1}{B} \sum_{b}^{B} z_{b}^{T} K^{2} z_{b}$ ; $\widehat{\mathrm{tr}(K^{4})} = L_{4, B} = \frac{1}{B} \sum_{b}^{B} z_{b}^{T} K^{4} z_{b}$	$\widehat{\mathrm{tr}(\tilde{K}^{2})} = \tilde{L}_{2, B} = \frac{1}{B} \sum_{b}^{B} z_{b}^{T} \tilde{K}^{2} z_{b}$ ; $\widehat{\mathrm{tr}(\tilde{K}^{4})} = \tilde{L}_{4, B} = \frac{1}{B} \sum_{b}^{B} z_{b}^{T} \tilde{K}^{4} z_{b}$
$m_{e}$ estimation	$\hat{m}_{e} = \frac{n (n + 1)}{L_{2, B} - n}$ ; $\mathrm{var}(\hat{m}_{e}) = 2 (\frac{m_{e}}{n})^{4} \cdot \frac{\mathrm{tr}(K^{4})}{B}$	$\hat{\tilde{m}}_{e} = \frac{\tilde{n} (\tilde{n} + 1)}{\tilde{L}_{2, B} - \tilde{n}}$ ; $\mathrm{var}(\hat{\tilde{m}}_{e}) = 2 (\frac{\tilde{m}_{e}}{\tilde{n}})^{4} \cdot \frac{\mathrm{tr}(\tilde{K}^{4})}{B}$
$L_{h^{2}}$ and $L_{σ_{e}^{2}}$ (both settings)	$\hat{L}_{h^{2}} = \frac{1}{m^{3}} y^{T} (X X^{T})^{3} y - \frac{2}{m^{2}} y^{T} (X X^{T})^{2} y + \frac{1}{m} y^{T} X X^{T} y$ ; $\hat{L}_{σ_{e}^{2}} = \frac{1}{m^{2}} y^{T} (X X^{T})^{2} y - \frac{2}{m} y^{T} X X^{T} y + n$


Notes: the left and right parts of the table show how to implement RHE-reg directly on individual-level data and as vertical RHE-reg, respectively. To distinguish the two, tilded symbols indicate quantities computed from the genotypes of a reference panel. For example, $\tilde{L}_{2, B} = \frac{1}{B} \sum_{b = 1}^{B} z_{b}^{T} \tilde{K} \tilde{K}^{T} z_{b}$, in which $\tilde{K} = \frac{1}{m} \tilde{X} \tilde{X}^{T}$, and $\tilde{X}$, of dimension $\tilde{n} \times m$, is from the reference panel.
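The individual-level column of Table 2 can be illustrated with a minimal NumPy sketch. It is not the released software: the function names, the toy sizes, and the simulated independent SNPs are assumptions made for illustration, and $K z$ is evaluated as $X (X^{T} z) / m$ so that $K$ is never formed inside the estimator.

```python
import numpy as np

def randomized_traces(X, B, rng):
    """Estimate tr(K^2) and tr(K^4) for K = X X^T / m without forming K."""
    n, m = X.shape
    Z = rng.standard_normal((n, B))          # B random vectors z_b ~ N(0, I_n)
    KZ = X @ (X.T @ Z) / m                   # columns are K z_b
    K2Z = X @ (X.T @ KZ) / m                 # columns are K^2 z_b
    L2 = np.mean(np.sum(KZ * KZ, axis=0))    # mean of z_b^T K^2 z_b (K is symmetric)
    L4 = np.mean(np.sum(K2Z * K2Z, axis=0))  # mean of z_b^T K^4 z_b
    return L2, L4

def rhe_estimates(X, y, L2):
    """Plug L_{2,B} into the Table 2 formulas for h^2 and m_e."""
    n, m = X.shape
    xty = X.T @ y
    yKy = xty @ xty / m                      # y^T K y
    h2 = (yKy - n) / (L2 - n)
    m_e = n * (n + 1) / (L2 - n)
    return h2, m_e

# Toy data: hypothetical sizes and independent SNPs, far below biobank scale.
rng = np.random.default_rng(2025)
n, m, true_h2 = 500, 1000, 0.25
p = rng.uniform(0.1, 0.5, m)
G = rng.binomial(2, p, size=(n, m)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)     # standardized genotypes
beta = rng.normal(0.0, np.sqrt(true_h2 / m), m)
y = X @ beta + rng.normal(0.0, np.sqrt(1 - true_h2), n)
y = (y - y.mean()) / y.std()                 # standardized phenotype

L2, L4 = randomized_traces(X, B=50, rng=rng)
h2_hat, m_e_hat = rhe_estimates(X, y, L2)

K = X @ X.T / m                              # formed only to check the toy case
exact_tr_k2 = np.sum(K * K)
```

In this toy setting `L2` can be compared with `exact_tr_k2`; at biobank scale the explicit $K$ in the last two lines would not be affordable, which is the motivation for the randomized route.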

Software

We have developed software that implements the biobank-scale algorithm presented in this study. The software reads genotypes in the binary format defined by tools such as PLINK. For fast vector-matrix multiplication, the Mailman algorithm is employed [24]; we adapted its implementation from Agrawal et al.'s fast PCA project [25]. With the Mailman algorithm, the cost of the vector-matrix multiplication in $L_{2, B}$ is reduced from $O (n m B)$ to $O (\frac{n m B}{max ({log}_{3} n, {log}_{3} m)})$. There is no conceptual obstacle to applying the method to genotype data in dosage format, but the Mailman algorithm cannot be used in that scenario. Because the procedure involves many matrix multiplications, we suggest an order of operations for programming them (S1 Fig).
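To give a feel for the stated complexity reduction, the short Python sketch below evaluates the factor $max ({log}_{3} n, {log}_{3} m)$ for the UK Biobank dimensions reported in this study (292,223 individuals and 525,460 SNPs). The function name is hypothetical, and the number is an asymptotic operation-count ratio, not a measured speed-up.

```python
import math

def mailman_reduction_factor(n, m):
    """Denominator of the stated Mailman complexity, max(log3 n, log3 m)."""
    return max(math.log(n, 3), math.log(m, 3))

# Dimensions of the UK Biobank analysis reported in the text.
factor = mailman_reduction_factor(292_223, 525_460)
print(f"operation count reduced by a factor of about {factor:.1f}")
```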

Results

Simulation results

We conducted simulations to evaluate the aforementioned theoretical results under various parameters. The reference allele frequencies were sampled uniformly between 0.1 and 0.5, $h^{2}$ was set to 0, 0.1, or 0.25, and all SNPs were considered causal under a typical polygenic model, with effects following a normal distribution. The scenarios varied in three further parameters: 1) the linkage disequilibrium (Lewontin’s $D^{'}$ ) between each pair of consecutive SNPs, $D^{'} =$ 0, 0.2, 0.4, 0.6, or 0.8; 2) the number of unrelated samples, $n =$ 1,000, 5,000, or 10,000; and 3) the number of SNPs, $m =$ 10,000, 50,000, or 100,000. Together these settings gave 45 simulation scenarios for each $h^{2}$ , or 135 in total, which were generated by our in-house simulation code; its detailed implementation can be found in Zhang et al. [16]. For each simulation scenario, we set $B$ to 10, 20, and 50 in order to find a proper $B$ , and $n$ , $m$ (as well as the allele frequencies), $D^{'}$ , and $h^{2}$ were all considered in investigating how to determine $B$ . Although neither $n$ nor $m$ reaches the scale of real biobank data, we investigate and summarize certain properties of RHE-reg under these 135 scenarios in the results below. The biobank-scale test is carried out in the UK Biobank examples.
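The scenario count can be checked by enumerating the design grid. The sketch below only lists the parameter combinations described above; it does not reproduce the in-house simulation code, and the variable names are illustrative.

```python
from itertools import product

h2_values = [0, 0.1, 0.25]
ld_values = [0, 0.2, 0.4, 0.6, 0.8]        # Lewontin's D' between consecutive SNPs
n_values = [1_000, 5_000, 10_000]          # unrelated samples
m_values = [10_000, 50_000, 100_000]       # SNPs
b_values = [10, 20, 50]                    # iterations tried in each scenario

scenarios = list(product(h2_values, ld_values, n_values, m_values))
per_h2 = len(scenarios) // len(h2_values)
runs = len(scenarios) * len(b_values)
print(per_h2, len(scenarios), runs)
```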

Result 1: Randomized estimation for $t r (K^{4})$

As shown in the Methods section, $\mathrm{tr}(K^{4})$ is one of the key parameters determining the sampling performance of RHE-reg. Direct calculation of $\mathrm{tr}(K^{4})$ from the eigenvalues of $K$ , $\mathrm{tr}(K^{4}) = \sum_{i = 1}^{n} λ_{i}^{4}$ , served as the gold standard, against which we compared Method I, $\widehat{\mathrm{tr}(K^{4})} = \frac{B}{2} \mathrm{var}(L_{2, B})$ , and Method II, $\widehat{\mathrm{tr}(K^{4})} = L_{4, B}$ , across the above 135 simulation scenarios (Fig 1). For Method I, increasing $B$ from 10 to 50 improved the precision of the estimation. In contrast, Method II showed consistently high precision regardless of the sample size, and increasing $B$ from 10 to 50 brought little further improvement. The advantage of Method II probably arises because $L_{4, B}$ estimates $\mathrm{tr}(K^{4})$ as a mean, whereas Method I estimates it through the sampling variance $\mathrm{var}(L_{2, B})$ . Hereafter we therefore used Method II, $L_{4, B}$ , to estimate $\mathrm{tr}(K^{4})$ . Of note, computing $\mathrm{tr}(K^{4}) = \sum_{i = 1}^{n} λ_{i}^{4}$ directly is computationally expensive when $K$ is large, so only limited sample sizes and SNP numbers were tested in these 135 simulations; however, the principal results should hold for larger sample sizes, and thus larger $K$ , albeit at a higher computational cost in solving the eigenvalues. The biobank-scale performance of the proposed method is illustrated with the 81 UK Biobank traits in the real data analysis below.
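The comparison between the two methods can be sketched on a small toy matrix. The data, sizes, and seed below are hypothetical; in particular, estimating $\mathrm{var}(L_{2, B})$ from the sample variance of the $B$ per-vector quadratic forms within a single run is an assumption of this sketch and may differ from how Method I was evaluated in the simulations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, B = 300, 600, 50
p = rng.uniform(0.1, 0.5, m)
G = rng.binomial(2, p, size=(n, m)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)
K = X @ X.T / m

# Benchmark: sum of the fourth powers of the eigenvalues of K.
eigvals = np.linalg.eigvalsh(K)
tr_k4_exact = np.sum(eigvals ** 4)

Z = rng.standard_normal((n, B))
KZ = K @ Z
K2Z = K @ KZ
q2 = np.sum(KZ * KZ, axis=0)       # z_b^T K^2 z_b for each b
q4 = np.sum(K2Z * K2Z, axis=0)     # z_b^T K^4 z_b for each b

# Method I: var(L_{2,B}) is approximated by var(q2) / B, so (B/2) var(L_{2,B}) = var(q2) / 2.
method1 = 0.5 * np.var(q2, ddof=1)
# Method II: L_{4,B}, the mean of the K^4 quadratic forms.
method2 = q4.mean()

print(tr_k4_exact, method1, method2)
```

Method I rests on a sample variance of only $B$ values, whereas Method II averages $B$ unbiased draws, which is in line with the explanation given above for its higher precision.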

Fig 1 — The x-axis represents benchmark estimation for $t r (K^{4})$ directly, and y-axis represents the estimation of $t r (K^{4})$ using Method I or Method II respectively. The diagonal line (solid black) is for comparison. Each fitted line shows the correlation between all 135 estimations with their benchmark estimation $t r (K^{4})$ .

Result 2: MSE of RHE-reg

In Eq 7, $\mathrm{MSE}(\hat{h}^{2}) = 2 (\frac{m_{e}}{n^{2}})^{2} (Λ_{1} + \frac{\mathrm{tr}(K^{4}) \cdot h^{4}}{B}) + 4 (\frac{m_{e}}{n^{2}})^{4} \cdot [\frac{\mathrm{tr}(K^{4})}{B}]^{2} \cdot h^{4}$ , and we defined $R = \frac{\mathrm{tr}(K^{4}) \cdot h^{4} / B}{Λ_{1}} = \frac{η}{B}$ according to Eq 14. Fig 2 shows how MSE and $R$ were reduced by $B$ in 90 simulated scenarios; the 45 scenarios with $n = 1,000$ were excluded because that sample size was too small for efficient convergence, so only the results for $n = 5,000$ and 10,000 are illustrated. The top row of Fig 2 illustrates how MSE was reduced by $B$ : a larger $B$ reduced MSE because the term $\frac{\mathrm{tr}(K^{4}) \cdot h^{4}}{B}$ shrank. The bias term $4 (\frac{m_{e}}{n^{2}})^{4} \cdot [\frac{\mathrm{tr}(K^{4})}{B}]^{2} \cdot h^{4}$ carried little weight in MSE, which was dominated by the variance term $2 (\frac{m_{e}}{n^{2}})^{2} (Λ_{1} + \frac{\mathrm{tr}(K^{4}) \cdot h^{4}}{B})$ ; the variance term was at least one to two orders of magnitude larger than the bias term. In the second row of Fig 2, $R = η / B$ reflects how quickly $\frac{\mathrm{tr}(K^{4}) \cdot h^{4}}{B}$ vanished as $B$ increased. Neither LD nor $h^{2}$ played an important role in determining MSE for the simulated scenarios, but the ratio between $n$ and $m$ mattered: under the same sample size, more SNPs always inflated MSE.
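The two MSE terms and the ratio $R$ can be evaluated directly from the formulas quoted above. The following sketch transcribes them into Python; the function names and the input values are hypothetical placeholders chosen only to show the arithmetic, not values taken from the simulations.

```python
def mse_terms(m_e, n, lambda1, tr_k4, h2, B):
    """Return the variance term and the bias term of MSE(h2_hat) as written in Eq 7."""
    scale = m_e / n ** 2
    h4 = h2 ** 2
    variance = 2 * scale ** 2 * (lambda1 + tr_k4 * h4 / B)
    bias_sq = 4 * scale ** 4 * (tr_k4 / B) ** 2 * h4
    return variance, bias_sq

def convergence_ratio(lambda1, tr_k4, h2, B):
    """R = (tr(K^4) h^4 / B) / Lambda_1 = eta / B, following Eq 14."""
    return tr_k4 * h2 ** 2 / (B * lambda1)

# Hypothetical inputs, for illustration only.
m_e, n, lambda1, tr_k4, h2 = 10_000, 5_000, 1_000.0, 3_000.0, 0.25
var10, bias10 = mse_terms(m_e, n, lambda1, tr_k4, h2, B=10)
var50, bias50 = mse_terms(m_e, n, lambda1, tr_k4, h2, B=50)
r10 = convergence_ratio(lambda1, tr_k4, h2, B=10)
r20 = convergence_ratio(lambda1, tr_k4, h2, B=20)
```

Because $R$ is inversely proportional to $B$, doubling $B$ halves it, and under these formulas the smallest $B$ achieving $R \le η_{0}$ is the first $B$ with $B \ge η / η_{0}$.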

Fig 2 — The top row (A-C) compares MSE under different $B$ for the 90 simulated scenarios, and the bottom row (D-F) compares the ratio $R$ of $\frac{\mathrm{tr}(K^{4}) \cdot h^{4}}{B}$ to $Λ_{1}$ . In each panel, the 90 simulated scenarios are split into 6 groups given the different combinations of sample size ( $n$ = 5,000 and 10,000) and SNP number ( $m$ = 10,000, 50,000, and 100,000). Within each group, the 15 points are ordered from left to right into 5 subgroups by LD level ( $D^{'}$ = 0, 0.2, 0.4, 0.6, and 0.8), and each LD subgroup has three simulated $h^{2}$ (0, 0.1, and 0.25), respectively. In each panel, the grey line indicates the mean of the investigated values for the corresponding 15 scenarios.

Result 3: Randomized estimation for $h^{2}$ and z-score

In Result 3, we studied how $B$ influences ${\hat{h}}^{2}$ and its z score. Because the sampling variance of ${\hat{h}}^{2}$ decreases with the sample size ( $Λ_{0} = \frac{2 m_{e}}{n^{2}}$ under the null hypothesis) and with $B$ , the simulations showed, as expected, that a greater $n$ and a greater $B$ yielded a more stable estimate of $h^{2}$ (Fig 3A-C). Taking ${\hat{h}}^{2}$ from $B = 50$ as the benchmark, when the sample size was $n = 10,000$ the estimates of ${\hat{h}}^{2}$ were highly consistent even at $B = 20$ (Fig 3C). Of note, $\frac{2 m_{e}}{n^{2}}$ is the sampling variance of REML when $h^{2} = 0$ [26].

Fig 3 — Each plot illustrates the comparison of the estimated heritability (**A-C**) and z-score (**D-F**) given $B = 10$ (x-axis) vs $B =$ 20 and 50 (y-axis) under different sample sizes $n = 1,000$ , 5,000, and 10,000, respectively. The black solid line is the reference line of y = x, and the coloured solid line is the fitted regression, which is printed in each plot. In each plot, there are 45 points in each colour as simulated.

The availability of the $z$ score of the estimated heritability is important for statistical inference, so we evaluated the influence of $B$ on the performance of the randomized algorithm (Fig 3D-F). From the above analysis, $E (z_{h^{2}}) \approx \frac{n^{2}}{\sqrt{2} m_{e}} \frac{{\hat{h}}^{2}}{\sqrt{L_{h^{2}} {\hat{h}}^{2} + L_{σ_{e}^{2}} {\hat{σ}}_{e}^{2}}}$ , so once the estimate of $h^{2}$ became stable the test statistic was stable too. Accordingly, $z_{1}$ was relatively stable when $n = 5,000$ (Fig 3E) or $n = 10,000$ (Fig 3F). When the sample size was sufficiently large, a few iterations could guarantee high accuracy of the estimation. In addition, we tested the estimation with $h^{2} = 0.5$ , 0.75, and 0.9, respectively, and the results were consistent with what our theory predicts.

Result 4: Application of horizontal RHE-reg

This analysis aimed to show that heritability can be estimated from distributed data as precisely as from a single pooled dataset. Two cohorts with $n_{1} =$ 4,000 and $n_{2} =$ 6,000 individuals, respectively, were generated to verify h-RHE-reg. $h^{2}$ was set to 0, 0.1, or 0.25. The effects of all $m =$ 10,000 SNPs were sampled from the distribution $N (0, \frac{h^{2}}{m})$ . Heritability and z scores were estimated using individual-level RHE-reg as well as h-RHE-reg, with $B$ set to 10, 20, and 50. The genotypes of the two simulated cohorts were standardized by $\tilde{x}_{1 j} = \frac{x_{1 j} - 2 \bar{p}_{j}}{\sqrt{2 \bar{p}_{j} (1 - \bar{p}_{j})}}$ and $\tilde{x}_{2 j} = \frac{x_{2 j} - 2 \bar{p}_{j}}{\sqrt{2 \bar{p}_{j} (1 - \bar{p}_{j})}}$ for the $j$ -th locus, where $\bar{p}_{j} = \frac{n_{1}}{n_{1} + n_{2}} \bar{p}_{1 j} + \frac{n_{2}}{n_{1} + n_{2}} \bar{p}_{2 j}$ was the sample-size-weighted average allele frequency. The phenotypes of the two cohorts were standardized by $\tilde{y}_{1} = \frac{y_{1} - \bar{y}}{σ_{y}}$ and $\tilde{y}_{2} = \frac{y_{2} - \bar{y}}{σ_{y}}$ , where $\bar{y} = \frac{n_{1}}{n_{1} + n_{2}} \bar{y}_{1} + \frac{n_{2}}{n_{1} + n_{2}} \bar{y}_{2}$ and $σ_{y}^{2} = \frac{n_{1}}{n_{1} + n_{2} - 1} \overline{y_{1}^{2}} + \frac{n_{2}}{n_{1} + n_{2} - 1} \overline{y_{2}^{2}} - \frac{n_{1} + n_{2}}{n_{1} + n_{2} - 1} \bar{y}^{2}$ . Each simulation scenario had 10 repeats (see Source1.R and Source2.R in S1 Data for its implementation).
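The phenotype standardization above needs only three numbers from each cohort: its size, its mean, and its mean of squares. The Python sketch below is an illustration and is not the R code in S1 Data; the function names and the simulated phenotypes are hypothetical. It checks that the pooled mean and standard deviation built from these summaries match those computed on the concatenated data.

```python
import numpy as np

def cohort_summary(y):
    """Summaries a cohort would share: sample size, mean, and mean of squares."""
    return len(y), y.mean(), np.mean(y ** 2)

def pooled_mean_sd(summaries):
    """Pooled mean and standard deviation from per-cohort summaries."""
    n_tot = sum(n for n, _, _ in summaries)
    mean = sum(n * ybar for n, ybar, _ in summaries) / n_tot
    var = (sum(n * y2bar for n, _, y2bar in summaries) / (n_tot - 1)
           - n_tot / (n_tot - 1) * mean ** 2)
    return mean, np.sqrt(var)

rng = np.random.default_rng(0)
y1 = rng.normal(1.0, 2.0, 4_000)   # cohort 1, hypothetical phenotype values
y2 = rng.normal(1.5, 2.0, 6_000)   # cohort 2, hypothetical phenotype values

mean, sd = pooled_mean_sd([cohort_summary(y1), cohort_summary(y2)])
y1_std, y2_std = (y1 - mean) / sd, (y2 - mean) / sd   # done locally in each cohort

y_all = np.concatenate([y1, y2])   # pooled data, used here only for the check
```

The same weighting applies to the allele frequencies: each cohort reports its own frequency and size, and the standardization of genotypes then proceeds locally with the pooled value.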

The estimates of heritability and its z score were consistent using individual-level RHE-reg and h-RHE-reg in all scenarios when the random vectors were the same (Fig 4). We also split the data into $n_{1} =$ 2,000 and $n_{2} =$ 8,000 individuals, and, as expected, the results were nearly identical and unbiased.

Fig 4 — Estimated heritability (**A-C**) and z-scores (**D-F**) obtained from RHE-reg ( $x$ axis) and h-RHE-reg ( $y$ axis) under different settings of $B =$ 10, 20, and 50, respectively. Point colors represent the simulated heritability. Each scenario was repeated 10 times. The dashed line represents the identity line ( $y = x$ ).

Real data analysis for UK Biobank

We selected 292,223 unrelated white British individuals, with no kinship identified according to the genetic kinship provided by the UK Biobank (field 22021), for the real data test [2]. The inclusion criteria for quality control were MAF > 0.01, missing call rate < 0.05, Hardy-Weinberg proportion test p-value > 1e-6, and genotype call rate > 0.95; after quality control, 525,460 autosomal SNPs were included for analysis. We estimated the heritability of 81 quantitative traits, with the top two principal components and sex included as covariates.

We used two strategies to estimate heritability. In strategy I, denoted as the B+ strategy hereafter, we set $B_{0} = 10$ as a warm-up step to evaluate $\mathrm{tr}(K^{4})$ , and the convergence threshold $η_{0}$ was set to 0.05. After the warm-up of $B_{0}$ iterations, we increased the number of iterations in steps of 10 until the convergence ratio reached $η_{0} = 0.05$ , and then estimated the final realized $η$ , $m_{e}$ , $h^{2}$ , and three kinds of $z$ scores; however, we set a hard stop at $B_{1} = 200$ even if $η$ was still greater than 0.05. In strategy II, we directly set $B_{0} = 10$ , 20, or 50 without any additional iterations; these are denoted as the $B 10$ , $B 20$ , and $B 50$ strategies hereafter.
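The control flow of the B+ strategy can be sketched as a simple loop. This is an interpretation, not the released implementation: it assumes the stopping criterion is the realized ratio $\frac{\mathrm{tr}(K^{4}) \cdot h^{4}}{B Λ_{1}}$ falling to $η_{0}$ , and `ratio_at` is a hypothetical callable standing in for the re-estimation done with $B$ random vectors.

```python
def b_plus(ratio_at, b0=10, step=10, eta0=0.05, b_max=200):
    """Grow B until the convergence ratio drops to eta0 or the hard stop is hit.

    ratio_at(B) is a hypothetical callable returning the realized ratio
    tr(K^4) h^4 / (B * Lambda_1) estimated with B random vectors.
    """
    B = b0
    ratio = ratio_at(B)
    while ratio > eta0 and B < b_max:
        B += step
        ratio = ratio_at(B)
    return B, ratio

# Toy stand-ins: pretend eta is known exactly, so the ratio is eta / B.
B_needed, ratio_reached = b_plus(lambda B: 4.3 / B)    # converges before the hard stop
B_capped, ratio_capped = b_plus(lambda B: 50.0 / B)    # stopped by the hard stop
print(B_needed, B_capped)
```

In the first toy case the loop stops at the first multiple of 10 that brings the ratio below 0.05; in the second it ends at the hard stop with the ratio still above the threshold, mirroring the two outcomes described above.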

Comparison of the UKB results between strategy I and II

Although a couple of traits reached the hard stop at $B_{1} = 200$ , the estimated $\hat{η}$ for the 81 traits had a mean of 0.0518, very close to the preset $η_{0} = 0.05$ (Fig 5A1). This indicates that our theory works in controlling the precision of the sampling variance of RHE-reg. The trait “age of diabetes diagnosed” had ${\hat{h}}^{2} = 1.21 \pm 0.533$ , an extremely large standard error compared with the other traits, because it had the smallest sample size, $n = 12,658$ (Fig 5B1). In Fig 5C1, we obtained three z scores: the score $z_{1} = \sqrt{\frac{{\hat{Λ}}_{0}}{2 {\hat{Λ}}_{1}}} \frac{{\hat{h}}^{2}}{\sqrt{1 + \frac{\hat{η}}{B_{1}}}}$ calculated directly given $B_{1}$ iterations (green, via Eq 8); the optimal z score $z_{2} = \sqrt{\frac{{\hat{Λ}}_{0}}{2 {\hat{Λ}}_{1}}} {\hat{h}}^{2}$ when $B$ is infinite (blue, via Eq 9); and the predicted score $z_{3} = \hat{z} \sqrt{1 + \frac{\hat{η}}{B}}$ (pink, via Eq 10).
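The relation between the three z scores can be made concrete with a small numerical sketch. It uses the formulas as written above, with a hypothetical optimal z score of 10 and hypothetical values $η = 0.5$ and $B = 10$ , so that $η / B = 0.05$ ; the function names are illustrative.

```python
import math

def z_given_b(z_opt, eta, B):
    """Finite-B z score implied by the optimal (B -> infinity) score."""
    return z_opt / math.sqrt(1 + eta / B)

def z_predicted(z_b, eta, B):
    """Predicted optimal score recovered from a finite-B score."""
    return z_b * math.sqrt(1 + eta / B)

z_opt = 10.0                              # hypothetical optimal z score
z1 = z_given_b(z_opt, eta=0.5, B=10)      # eta / B = 0.05
z3 = z_predicted(z1, eta=0.5, B=10)
print(round(z1, 3), round(z3, 3))
```

With a ratio of 0.05 the finite-iteration score is about 2.4% below the optimal one, since $\sqrt{1.05} \approx 1.0247$ .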

Fig 5 — A1-D1) The performance of RHE-reg given the respective number of iterations $B$ for each trait: $η$ , the effective number of markers from randomized estimation ( $m_{e}$ ), ${\hat{h}}^{2}$ (the vertical lines cover the 95% confidence intervals), and $z$ scores estimated by three methods. Three z scores are plotted: the green z scores are directly estimated given $B$ iterations for each trait (Eq 8), the pink z scores are the optimal z scores (Eq 9), and the blue z scores are estimated as $z \sqrt{1 + η}$ (Eq 10). A2-A4) Comparison of $η$ between $B +$ and $B = 10$ , 20, and 50, respectively. B2-B4) Comparison of $m_{e}$ between $B +$ and $B = 10$ , 20, and 50, respectively; the vertical and horizontal lines are the means of $m_{e}$ on the x-axis and y-axis, respectively. C2-C4) Comparison of $h^{2}$ between $B +$ and $B = 10$ , 20, and 50, respectively; the fitted line is printed in the top left corner of each plot. D2-D4) Comparison of the three pairwise $z$ scores. The green z scores are estimated by Eq 8 given B+ and the number of $B$ shown on the x-axis label, the pink z scores by Eq 9, and the blue ones by Eq 10, respectively.

For comparison, we examined the corresponding statistics estimated under $B 10$ , $B 20$ , and $B 50$ , respectively. In strategy II, a larger $B_{0}$ led to a smaller $η$ , as expected (Fig 5A2-4). Interestingly, regardless of the change of $B_{0}$ in strategy II, the estimates ${\hat{h}}^{2}$ were very consistent with those from strategy I, as shown by the slopes of the fitted regression lines being very close to 1 (Fig 5B2-B4). Three types of z scores were compared (Fig 5C2-C4), and the optimal z scores from both strategies agreed nearly perfectly (blue points and blue dashed lines). As shown in Fig 5C4, the three kinds of z scores matched nearly completely.

In addition, the estimates were consistent with our previous results obtained with a less efficient method [27]; see S1 Table for more details. The heritability estimated by the randomization algorithm showed a relatively high correlation (Pearson’s correlation coefficient of 0.77) with the previous estimates for the 81 traits. The estimated $m_{e}$ was nearly consistent with the previous GRM-based estimates, with an average deviation of 1.38% after 10 iterations that decreased further to 1.23% after 50 iterations (S1 Table).

We also compared the computational efficiency of RHE-reg with that of GCTA [28] and BOLT-REML [29] in estimating the heritability of BMI. The comparison was conducted on a UKB sub-dataset of 10,000 randomly selected individuals and 523,945 SNP markers after filtration. The results indicate a substantial efficiency improvement for RHE-reg in estimating the heritability of complex traits, with computation times reduced by 96.6% and 83.8% compared with GCTA and BOLT-REML, respectively (S2 Table). More benchmark comparisons of computational performance can be found in earlier studies [18,5]. Even with the complete dataset, RHE-reg could complete heritability estimation within an acceptable time (S3 Table). For the 81 UKB traits tested, with 10 threads, it took on average 453 mins to finish the analysis of a trait, with an average of $B = 90$ iterations.

Application of vertical RHE-reg

In Eq 15, ${\hat{\tilde{h}}}^{2} = {\tilde{m}}_{e} \cdot \frac{(y^{T} K y - n)}{n^{2}}$ , the two factors ${\tilde{m}}_{e}$ and $\frac{(y^{T} K y - n)}{n^{2}}$ can come from two independent sources. Consequently, we split each UKB trait evenly into halves to test v-RHE-reg, which gave four possible combinations for Eq 15: 1) split 1/1: both ${\tilde{m}}_{e}$ and $\frac{(y^{T} K y - n)}{n^{2}}$ were estimated from split 1; 2) split 2/1: ${\tilde{m}}_{e}$ was estimated from split 2 and $\frac{(y^{T} K y - n)}{n^{2}}$ from split 1; 3) split 1/2: ${\tilde{m}}_{e}$ was estimated from split 1 and $\frac{(y^{T} K y - n)}{n^{2}}$ from split 2; 4) split 2/2: both ${\tilde{m}}_{e}$ and $\frac{(y^{T} K y - n)}{n^{2}}$ were estimated from split 2. So, we had the four estimators below.

$\begin{matrix} h_{1, 1}^{2} = {[{\tilde{m}}_{e}]}_{1} \cdot {[\frac{(y^{T} K y - n)}{n^{2}}]}_{1}, \text{ split 1/1} \\ h_{1, 2}^{2} = {[{\tilde{m}}_{e}]}_{1} \cdot {[\frac{(y^{T} K y - n)}{n^{2}}]}_{2}, \text{ split 1/2} \\ h_{2, 1}^{2} = {[{\tilde{m}}_{e}]}_{2} \cdot {[\frac{(y^{T} K y - n)}{n^{2}}]}_{1}, \text{ split 2/1} \\ h_{2, 2}^{2} = {[{\tilde{m}}_{e}]}_{2} \cdot {[\frac{(y^{T} K y - n)}{n^{2}}]}_{2}, \text{ split 2/2} \end{matrix}$

For each trait, heritability estimates and z score tests could be constructed within each split and between splits by exchanging the $L_{B}$ estimates, which yields v-RHE-reg. Following Eq 15, we compared the results for $B$ = 10, 20, and 50, respectively, and observed consistent results between split 1 and split 2 (the within-split estimators), and between split 1/2 and split 2/1.
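The four-way exchange can be sketched on simulated data. The sketch below is an illustration under stated assumptions, not the UKB analysis: the sizes, seed, and independent SNPs are hypothetical, each half is standardized on its own, and `h2[(a, b)]` takes $\tilde{m}_{e}$ from split a and the quadratic term from split b.

```python
import numpy as np

def split_stats(G, y, B, rng):
    """Return (m_e, (y^T K y - n) / n^2) for one split, standardized within the split."""
    n, m = G.shape
    X = (G - G.mean(axis=0)) / G.std(axis=0)
    y = (y - y.mean()) / y.std()
    Z = rng.standard_normal((n, B))
    KZ = X @ (X.T @ Z) / m
    L2 = np.mean(np.sum(KZ * KZ, axis=0))      # randomized estimate of tr(K^2)
    m_e = n * (n + 1) / (L2 - n)
    xty = X.T @ y
    quad = (xty @ xty / m - n) / n ** 2
    return m_e, quad

rng = np.random.default_rng(7)
n, m, true_h2, B = 2_000, 1_000, 0.25, 20      # hypothetical toy sizes
p = rng.uniform(0.1, 0.5, m)
G = rng.binomial(2, p, size=(n, m)).astype(float)
Xs = (G - 2 * p) / np.sqrt(2 * p * (1 - p))
y = Xs @ rng.normal(0.0, np.sqrt(true_h2 / m), m) + rng.normal(0.0, np.sqrt(1 - true_h2), n)

halves = {1: slice(0, n // 2), 2: slice(n // 2, n)}
stats = {k: split_stats(G[s], y[s], B, rng) for k, s in halves.items()}
# h2[(a, b)]: m_e from split a, quadratic term from split b.
h2 = {(a, b): stats[a][0] * stats[b][1] for a in (1, 2) for b in (1, 2)}
print(h2)
```

Because both halves are drawn from the same population, the two $\tilde{m}_{e}$ values are close, so exchanging them changes the estimates only slightly; with a genuinely external reference panel the agreement would depend on how well its LD matches that of the target sample.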

Fig 6 shows the results of these four estimators under different $B$ . It plots the pairwise estimates ${\hat{h}}_{1, 1}^{2}$ against ${\hat{h}}_{2, 2}^{2}$ , and ${\hat{h}}_{1, 2}^{2}$ against ${\hat{h}}_{2, 1}^{2}$ ; the pairwise estimates were quite consistent with each other both within and between splits.

Discussion

The present study builds on the randomized Haseman-Elston regression for the estimation of SNP-heritability recently proposed by Wu and Sankararaman (2018) [5]. They adopted a randomization approach, Girard-Hutchinson estimation, which substantially reduces the computational cost of estimating $\mathrm{tr}(K^{2})$ from $O (n^{2} m)$ to $O (n m B)$ [20,19]. However, a drawback of their method is that the role of $B$ is unclear, which in turn obscures the sampling variance of the estimated heritability. As discussed in a recent review, the original RHE-reg provided no closed-form solution to quantify the connection between $B$ and the estimation procedure [8]. By integrating analytical results for Haseman-Elston regression into this randomized framework [4], we present here a closed-form solution for RHE-reg. With the sampling variance in hand, we are able to evaluate precisely how $B$ influences the estimation procedure of RHE-reg. A key element is the sampling variance of $L_{2, B}$ , which is proportional to $\frac{2 \mathrm{tr}(K^{4})}{B}$ . It should be noted that $\mathrm{var}({\hat{h}}^{2}) = \frac{2 m_{e}}{n^{2}}$ under the null hypothesis that $h^{2} = 0$ , as established previously [4,26,18]; this quantity is identical to the sampling variance of REML under the null hypothesis and to that of the modified Haseman-Elston regression [26,4,18]. Of note, the present study focuses on a typical polygenic architecture, because counterexamples, albeit pathological, can be found when causal variants are not randomly distributed, as discussed previously [4,27].

A natural extension of the method is to include multiple components, such as one for each chromosome. The method for deriving the sampling variance can be extended to multi-component estimation if the corresponding $X_{i}$ and $X_{j}$ are in global linkage equilibrium, or nearly so, which is often the case for human populations [17]. More advanced numerical tools, such as condition numbers, are needed to evaluate the approximation of the randomized algorithm [30]. Some inconsistency between GRM-based estimation and randomized estimation, such as the overall correlation of 0.77 for estimated heritability between Xu et al.’s results and the current results, may arise from the different covariates chosen [27]. In Xu et al.’s work, heritability was estimated after correcting for the first two PCs, whereas the current randomized method additionally included sex as a covariate; this may cause the observed discrepancy, especially for sex-related traits.

In summary, the purpose of the present study is two-fold. First, we provide a method that balances the number of iterations against the precision of estimation, yielding an improved implementation of RHE-reg. Second, we extend RHE-reg to the estimation of SNP-heritability for distributed data, which uses the controlled $B$ to synchronize the estimation across datasets. As genomic cohorts grow in number but remain distributed across institutes, there is a trend towards computational solutions that do not compromise privacy [31]. The enhanced RHE-reg framework consequently has both computational and analytical merits, and we further extend its utility to vertical and horizontal RHE-reg, as demonstrated in this study. Given the increasing demand for genomic privacy, both vertical and horizontal RHE-reg will be meaningful in securing genomic information. However, given the strongly quantitative tradition of statistical genetics, statistical routines may offer solutions that are competitive with, if not superior to, those derived from available information technology [32,12].

It is straightforward to apply the estimation procedure to dominance variance components, both for individual-level data and summary statistics. The only update, in the equation $h_{d}^{2} = \frac{y^{T} K_{d} y - n}{\mathrm{tr}(K_{d}^{2}) - n}$ , is to replace $K$ with $K_{d} = \frac{1}{m} X_{d} X_{d}^{T}$ . For the $l$ -th SNP, $X_{d, l}$ is coded 0, $2 p_{l}$ , and $4 p_{l} - 2$ for genotypes carrying 0, 1, and 2 reference alleles, respectively, and is further scaled as $\frac{X_{d, l} - 2 p_{l}^{2}}{2 p_{l} (1 - p_{l})}$ [33,34]. So for a pair of individuals $i$ and $j$ , $K_{d [i, j]} = \frac{1}{m} \sum_{l = 1}^{m} \frac{(X_{d, i, l} - 2 p_{l}^{2}) (X_{d, j, l} - 2 p_{l}^{2})}{4 p_{l}^{2} {(1 - p_{l})}^{2}}$ . After replacing $K$ with $K_{d}$ , all the above estimation procedures can be applied to $h_{d}^{2}$ . Furthermore, $\mathrm{tr}(K_{d}^{2}) = n (n + 1) \frac{\sum_{k, l}^{m} ρ_{k l}^{4}}{m^{2}} + n = \frac{n (n + 1)}{m_{e . d}} + n$ . The effective number of markers in terms of $X_{d}$ is $m_{e . d} = \frac{m^{2}}{m + \sum_{k \neq l}^{m} ρ_{k l}^{4}}$ , which involves a fourth-power (tetradic) form of LD for each pair of SNPs.
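The dominance coding and its scaling can be checked with a few lines of Python. The sketch below is illustrative: the function name and the allele frequency are hypothetical, and Hardy-Weinberg genotype frequencies are assumed. Under that assumption the coded variable has mean $2 p^{2}$ and standard deviation $2 p (1 - p)$ , so the scaled value has mean 0 and variance 1.

```python
import numpy as np

def dominance_code(g, p):
    """Code genotype counts 0/1/2 as 0, 2p, 4p - 2, then centre and scale."""
    raw = np.where(g == 0, 0.0, np.where(g == 1, 2 * p, 4 * p - 2))
    return (raw - 2 * p ** 2) / (2 * p * (1 - p))

p = 0.3                                   # hypothetical reference allele frequency
g = np.array([0, 1, 2])                   # the three possible genotype counts
hwe = np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])   # Hardy-Weinberg frequencies

x_d = dominance_code(g, p)
mean = np.sum(hwe * x_d)                  # expectation under Hardy-Weinberg
var = np.sum(hwe * x_d ** 2) - mean ** 2  # variance under Hardy-Weinberg
print(x_d, mean, var)
```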

Supporting information

S1 Text. Technical details on some mathematical derivations.

“Effective number of markers” (Note I); “Discussion of $Λ_{1}$ ” (Note II); “Sampling variance for vertical RHE-reg” (Note III); “Adjustment for covariates” (Note IV); “Coding scheme and LD” (Note V).

(DOCX)

pcbi.1013568.s001.docx (54.1KB, docx)

S1 Fig. This figure gives the computational tips on how to program the software.

Some sequential operation of the matrix is suggested to make the program easy to write.

(TIFF)

pcbi.1013568.s002.tiff (3.8MB, tiff)

S1 Data. Implementation for Horizontal RHE-reg (R code: Source1.R and Source2.R).

(ZIP)

pcbi.1013568.s003.zip (3.8KB, zip)

S1 Table. SNP-heritability estimation for 81 UKB traits (XLSX).

(XLSX)

pcbi.1013568.s004.xlsx (30.1KB, xlsx)

S2 Table. The time cost for heritability estimation on BMI for RHE-reg, BOLT-REML and GCTA.

(XLSX)

pcbi.1013568.s005.xlsx (11.6KB, xlsx)

S3 Table. Computational time for 81 continuous traits using RHE-reg.

(XLSX)

pcbi.1013568.s006.xlsx (14.6KB, xlsx)

Acknowledgments

We thank the participants of the included cohorts and of UK Biobank for making this work possible (UKB application 41376).

Data Availability

All codes for simulation study and practical protocol are available on GitHub (https://github.com/gc5k/gear2). The genotype-phenotype data used in our analyses are available from UK Biobank (https://www.ukbiobank.ac.uk). The UKB data can be accessed following successful application at https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access.

Funding Statement

This work was supported by the National Natural Science Foundation of China (32272832 to ZZ, and 31771392 to GBC), Shenzhen Basic Research Foundation (20220818100717002 to SL), Guangdong Basic and Applied Basic Research Foundation (2022B1515120080 to SL). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Visscher PM, Hill WG, Wray NR. Heritability in the genomics era--concepts and misconceptions. Nat Rev Genet. 2008;9(4):255–66. doi: 10.1038/nrg2322
2. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–9. doi: 10.1038/s41586-018-0579-z
3. Haseman JK, Elston RC. The investigation of linkage between a quantitative trait and a marker locus. Behav Genet. 1972;2(1):3–19. doi: 10.1007/BF01066731
4. Chen G-B. Estimating heritability of complex traits from genome-wide association studies using IBS-based Haseman-Elston regression. Front Genet. 2014;5:107. doi: 10.3389/fgene.2014.00107
5. Wu Y, Sankararaman S. A scalable estimator of SNP heritability for biobank-scale data. Bioinformatics. 2018;34(13):i187–94. doi: 10.1093/bioinformatics/bty253
6. Wu Y, et al. Fast estimation of genetic correlation for biobank-scale data. Am J Hum Genet. 2022;24:24–32.
7. Patel V, et al. Scientific applications leveraging randomized linear algebra. arXiv. 2025. doi: 2506.16457
8. Tang M, Wang T, Zhang X. A review of SNP heritability estimation methods. Brief Bioinform. 2022;23(3):bbac067. doi: 10.1093/bib/bbac067
9. Yang Z, Hu L, Zhen J, Gu Y, Liu Y, Huang S, et al. Genetic basis of pregnancy-associated decreased platelet counts and gestational thrombocytopenia. Blood. 2024;143(15):1528–38. doi: 10.1182/blood.2023021925
10. Xiao H, et al. Genetic analysis of 104 pregnancy phenotypes in 39, 194 Chinese women. medRxiv. 2023. doi: 23298979
11. Chen G-B, Liu S, Zhang L, Huang T, Tang X, Li Y, et al. Building and sharing medical cohorts for research. Innovation (Camb). 2024;5(3):100623. doi: 10.1016/j.xinn.2024.100623
12. Zhang Q-X, Liu T, Guo X, Zhen J, Yang M-Y, Khederzadeh S, et al. Searching across-cohort relatives in 54,092 GWAS samples via encrypted genotype regression. PLoS Genet. 2024;20(1):e1011037. doi: 10.1371/journal.pgen.1011037
13. Chen G-B, Lee SH, Robinson MR, Trzaskowski M, Zhu Z-X, Winkler TW, et al. Across-cohort QC analyses of GWAS summary statistics from complex traits. Eur J Hum Genet. 2016;25(1):137–46. doi: 10.1038/ejhg.2016.106
14. McMahan HB, et al. Communication-efficient learning of deep networks from decentralized data. arXiv. 2017. doi: 1602.05629
15. Bulik-Sullivan BK, Loh P-R, Finucane HK, Ripke S, Yang J, Schizophrenia Working Group of the Psychiatric Genomics Consortium, et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet. 2015;47(3):291–5. doi: 10.1038/ng.3211
16. Zhang Q-X, Jayasinghe D, Zhang Z, Lee SH, Xu H-M, Chen G-B. Precise estimation of in-depth relatedness in biobank-scale datasets using deepKin. Cell Rep Methods. 2025;5(6):101053. doi: 10.1016/j.crmeth.2025.101053
17. Huang X, Zhu T-N, Liu Y-C, Qi G-A, Zhang J-N, Chen G-B. Efficient estimation for large-scale linkage disequilibrium patterns of the human genome. Elife. 2023;12:RP90636. doi: 10.7554/eLife.90636
18. Zhou X. A unified framework for variance component estimation with summary statistics in genome-wide association studies. Ann Appl Stat. 2017;11(4):2027–51. doi: 10.1214/17-AOAS1052
19. Hutchinson MF. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Commun Stat Comput. 1989;18:1059–76.
20. Girard A. A fast "Monte-Carlo cross-validation" procedure for large least squares problems with noisy data. Numer Math. 1989;56(1):1–23. doi: 10.1007/bf01395775
21. Lynch M, Walsh B. Genetics and analysis of quantitative traits. Sunderland, MA, USA: Sinauer Associates, Inc; 1998.
22. Isserlis L. On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables. Biometrika. 1918;12(1/2):134. doi: 10.2307/2331932
23. Goddard M. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica. 2009;136(2):245–57. doi: 10.1007/s10709-008-9308-0
24. Liberty E, Zucker SW. The mailman algorithm: A note on matrix-vector multiplication. Inf Process Lett. 2009;109:179–82.
25. Agrawal A, Chiu AM, Le M, Halperin E, Sankararaman S. Scalable probabilistic PCA for large-scale genetic variation data. PLoS Genet. 2020;16(5):e1008773. doi: 10.1371/journal.pgen.1008773
26. Visscher PM, Hemani G, Vinkhuyzen AAE, Chen G-B, Lee SH, Wray NR, et al. Statistical power to detect genetic (co)variance of complex traits using SNP data in unrelated samples. PLoS Genet. 2014;10(4):e1004269. doi: 10.1371/journal.pgen.1004269
27. Xu T, Qi G-A, Zhu J, Xu H-M, Chen G-B. Subsampling Technique to Estimate Variance Component for UK-Biobank Traits. Front Genet. 2021;12:612045. doi: 10.3389/fgene.2021.612045
28. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011;88(1):76–82. doi: 10.1016/j.ajhg.2010.11.011
29. Loh P-R, Bhatia G, Gusev A, Finucane HK, Bulik-Sullivan BK, Pollack SJ, et al. Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis. Nat Genet. 2015;47(12):1385–92. doi: 10.1038/ng.3431
30. Horn RA, Johnson CR. Matrix Analysis. 2 ed. New York: Cambridge University Press. 1994.
31. Elhussein A, Baymuradov U, NYGC ALS Consortium, Elhadad N, Natarajan K, Gürsoy G. A framework for sharing of clinical and genetic data for precision medicine applications. Nat Med. 2024;30(12):3578–89. doi: 10.1038/s41591-024-03239-5
32. Wang S, Kim M, Li W, Jiang X, Chen H, Harmanci A. Privacy-aware estimation of relatedness in admixed populations. Brief Bioinform. 2022;23(6):bbac473. doi: 10.1093/bib/bbac473
33. Zhu Z, Bakshi A, Vinkhuyzen AAE, Hemani G, Lee SH, Nolte IM, et al. Dominance genetic variation contributes little to the missing heritability for human complex traits. Am J Hum Genet. 2015;96(3):377–85. doi: 10.1016/j.ajhg.2015.01.001
34. Vitezica ZG, Legarra A, Toro MA, Varona L. Orthogonal estimates of variances for additive, dominance, and epistatic effects in populations. Genetics. 2017;206(3):1297–307. doi: 10.1534/genetics.116.199406

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013568.r001

Decision Letter 0

Androniki Psifidi

10 Jun 2025

PCOMPBIOL-D-25-00059

Analytical and computational solution for the estimation of SNP-heritability in biobank-scale and distributed datasets

PLOS Computational Biology

Dear Dr. Chen,

Thank you for submitting your manuscript to PLOS Computational Biology. After careful consideration, we feel that it has merit but does not fully meet PLOS Computational Biology's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 60 days (by Aug 10 2025 11:59PM). If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at ploscompbiol@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pcompbiol/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. This file does not need to include responses to formatting updates and technical items listed in the 'Journal Requirements' section below.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, competing interests statement, or data availability statement, please make these updates within the submission form at the time of resubmission. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Androniki Psifidi, DVM, PhD

Guest Editor

PLOS Computational Biology

Ilya Ioshikhes

Section Editor

PLOS Computational Biology

Journal Requirements:

1) Please provide an Author Summary. This should appear in your manuscript between the Abstract (if applicable) and the Introduction, and should be 150-200 words long. The aim should be to make your findings accessible to a wide audience that includes both scientists and non-scientists. Sample summaries can be found on our website under Submission Guidelines: https://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-parts-of-a-submission

2) Please upload all main figures as separate Figure files in .tif or .eps format. For more information about how to convert and format your figure files please see our guidelines: https://journals.plos.org/ploscompbiol/s/figures

3) We have noticed that you have uploaded Supporting Information files, but you have not included a list of legends. Please add a full list of legends for your Supporting Information files after the references list.

4) Please note that your Data Availability Statement is currently missing the repository name, and the DOI/accession number of each dataset OR a direct link to access each dataset. If your manuscript is accepted for publication, you will be asked to provide these details on a very short timeline. We therefore suggest that you provide this information now, though we will not hold up the peer review process if you are unable.

5) Please provide a detailed Financial Disclosure statement. This is published with the article. It must therefore be completed in full sentences and contain the exact wording you wish to be published.

1) Please clarify all sources of financial support for your study. List the grants, grant numbers, and organizations that funded your study, including funding received from your institution. Please note that suppliers of material support, including research materials, should be recognized in the Acknowledgements section rather than in the Financial Disclosure.

2) State the initials, alongside each funding source, of each author to receive each grant. For example: "This work was supported by the National Institutes of Health (####### to AM; ###### to CJ) and the National Science Foundation (###### to AM)."

3) State what role the funders took in the study. If the funders had no role in your study, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

4) If any authors received a salary from any of your funders, please state which authors and which funders.

If you did not receive any funding for this study, please simply state: "The authors received no specific funding for this work."

6) Your current Financial Disclosure states, "The author(s) received no specific funding for this work."

However, your funding information on the submission form indicates National Natural Science Foundation of China 31771392 to Dr. Guo-Bo Chen, Natural Science Foundation of Jilin Province 32102503 to Zhe Zhang, Shenzhen Basic Research Foundation 20220818100717002 to Siyang Liu and Basic and Applied Basic Research Foundation of Guangdong Province 2022B1515120080 to Siyang Liu.

Please indicate by return email the full and correct funding information for your study and confirm the order in which funding contributions should appear. Please be sure to indicate whether the funders played any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

7) Please send a completed 'Competing Interests' statement, including any COIs declared by your co-authors. If you have no competing interests to declare, please state "The authors have declared that no competing interests exist". Otherwise please declare all competing interests beginning with the statement "I have read the journal's policy and the authors of this manuscript have the following competing interests"

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In the manuscript "Analytical and computational solution for the estimation of SNP-heritability in biobank-scale and distributed datasets" by Qi et al., the authors describe an improved implementation of the randomized Haseman-Elston regression (RHE-reg) to estimate SNP heritability on biobank-scale data (i.e., genomic analysis with hundreds of thousands of samples genotyped with hundreds of thousands of SNPs). The methodology is extended to distributed data (i.e., analysis on genomic data coming from different sources under privacy). The work presents an analytical procedure to control the number of iterations of the RHE-reg, which has been shown to be a limitation of the RHE-reg. A free software is mentioned (https://github.com/gc5k/gear2), albeit without further information on the software. The work builds on previous works on the RHE-reg and tries to address the limitation on the number of iterations needed to provide accurate solutions. Moreover, given the plethora of available software that can handle biobank-scale genomic data, the novelty of the work is questionable.

Overall, the topic is of interest to the readership of PLOS Computational Biology in the field of genetics/genomics. The manuscript is well written. Despite this, there are some critical points that need to be addressed:

The authors stated that the aim of the work is to estimate SNP-heritability for biobank-scale data. However, in the simulation the number of samples was set to 1,000, 5,000, and 10,000, and the number of SNPs to 10,000, 50,000, and 100,000. None of these combinations is close to what is known as biobank-scale. Moreover, more details should be given for the simulation scenario, e.g., software used and relationship among samples (were related or unrelated individuals simulated?). Further, the authors simulated that all SNPs were considered causal after a typical polygenic model. This is not enough information. Exact distributions used to sample SNP effects should be provided. Were the 10k SNPs included in the 50k and 100k sets? If yes, did they have the same effects? Did all SNPs have an effect, even a small one, or was the effect of some SNPs set to zero? Values of heritability were set to 0, 0.1, and 0.25. Is there any reason that medium to high heritability was not tested? I would strongly recommend also testing model performance with h2 of 0.5, 0.75 and 1 (or close to 1). Regarding the case of distributed data, how does the model perform in the case of unbalanced data across institutes?

Some extra information that is missing and is of interest is the capacity of the computer used to run the analysis, the time required to run each analysis and a comparison with at least two independent and well-known software, as a base-line, that can analyse biobank-scale genomic data. Moreover, as mentioned above, although a github repository is mentioned for the software, no information is provided on the software. It is not clear finally how many iterations are needed to run biobank-scale analysis. In the real data analysis, could you explain the reason to use only unrelated individuals?

The Discussion is too short. All Tables and Figures need to be self-explanatory. Regarding the analysis with the UK-Biobank data (ExData2), there are some discrepancies that need to be discussed. Overall, the standard errors of the h2 estimates are higher compared to those reported in Xu et al. For the trait “Age of primiparous women at birth of child” no estimates were provided. In some cases, there are considerable differences in h2 estimates between Xu et al. and the current study, e.g., 0.42 vs 0.17 for the trait “Trunk predicted mass” and 0.73 vs 0.50 for the trait “Trunk predicted mass”. Could you provide explanations?

Overall, I support the publication of this manuscript, but only after addressing all my comments and/or suggestions point by point. Thus, my conclusion is major revision.

Minor comments

• In all equations double check the correct use of “~” and “^” for the predicted and estimated values.

• L 44 “much faster than REML” – be more precise

• L 46 “recently a randomized” – I think that a work back in 2018 cannot be considered as a recent work.

• L 63 “horizontal federated learning” – could you elaborate more?

• L 68 “a large B” – be more precise

• L 69 “boundaries of key statistics” – could you explain more?

• L 95 “of the square of each element” – replace with “of the square of each diagonal element”

• L95 “We proved that” – citation is missing

• L 97 “correlations” – you mean pearson correlations?

• L 100 tr(Kc) could you provide with the space of the values for c?

• Equation 3 – denote “L”

• L 101 “of of L2,B” remove second “of”

• L 108 “random mating or little inbreeding“ – up to which degree of inbreeding? Be more precise

• L 114, what is a and b?

• L 115, explain μ in the equation

• L 118 replace “Tylor” with “Taylor”

• L 126 “sufficient iterations” – how many?

• L 127 “large enough” – do you mean going to infinity?

• L 152 in the equation explain i and j.

• L 202 “c institutes” – consider to change the letter “c” in order not to be confused with “c” used in previous equations

• L 204-205 “yν” and “Xν”, explain subscript “ν”

• L 206 explain “F”

• L 217 “Furthermore, if we have c covariates, and the covariate matrix W is of n x c dimensions” – double check this sentence

• L 258 “because = 1,000 was too small a sample size here” – what does this mean? Were there any convergence issues?

• L 265 “much higher h2 took a much greater B” – please be more precise

• Figure 2 -consider to use same y-axis scale for fair comparisons

• Figure 3 – explain the blue and red lines. Consider to change h2(B20/B50) to h2(B20) – h2(B50) or h2(B20) / h2(B50) etc. Why negative h2 values are reported?

• Figure 4 – what are the negative h2 values?

• L 337 “a high degree” – consider to change to “a relatively high degree”, since the Pearson correlation reported is of 0.77

• Figure 5 – what is the meaning of h2 estimates > 1?

• L 348 and 349 should 30 be replaced with 50?

• Figure 6 - what is the meaning of h2 estimates > 1?

• L 385 – 391 are coming “out of the blue” in the discussion. Consider to make a subheading.

Reviewer #2: In this manuscript, the authors developed an analytical solution for scalable estimator of SNP heritability. They conduct simulations and real data application to illustrate the accuracy of their methods. The manuscript is well structured. I have a few questions and suggestions for the authors, which I listed below.

Major:

1. Equations and Notation. I have several questions regarding the notation used throughout the manuscript. Please review the notation carefully to ensure consistency and clarity across the entire text.

a. It would be helpful to define all notations at their first occurrence. For instance, the symbol c in Equation (3), the distribution of z_b, and the variables listed in Table 1 (e.g., q, v) are not clearly defined.

b. Line 100: I believe the term m^c in the denominator of L_{c,b} should be removed, given that K is defined as XX^t / m. This also differs from Equation (10) in Wu and Sankararaman (2018). Please check for similar inconsistencies in Table 2 and elsewhere in the manuscript.

c. Line 158: I find the notation in Equation (11) confusing—particularly the use of tilde m_e, which appears to correspond to vertical RHE-reg in Table 2. Also, is the second part of Equation (11) meant to represent the variance of the estimator \hat{m_e}? If so, the corresponding entry in Table 2 should be updated accordingly. Please use hats (^) to indicate estimated values throughout the text.

d. Line 242: Should the expression be B/2 rather than 1/B?

2. It would strengthen the manuscript to include a comparison between your method and existing approaches, such as that of Wu (2018), in your simulations. Comparing metrics like mean squared error (MSE), computation time, and memory usage would be especially informative.

3. Figures. The figure captions should be more self-explanatory. Please clarify what the lines and data points represent.

a. Figure 2. I was expecting to see a direct comparison of MSE across the simulation scenarios (or with methods from prior work such as Wu's), which would be more informative than only comparing Lambda_1, Lambda_2, and Lambda_3.

b. Line 261-262, The caption states that Figure 2 shows how quickly B can reduce Lambda_2/B, but the y-axis shows only Lambda_2. Please clarify.

c. Line 262-266. Please elaborate on how Lambda_1, Lambda_2, and Lambda_3 relate to MSE, and what insights Figure 2 provides. The differing axis scales make it hard to interpret your conclusions.

Minor:

Line 287. “The fitted regression” — of what? It would be better to plot the data point and provide a more detailed explanation in the caption. Plus, the color legend of B is missing.

Figure 4. Including a 45-degree reference line would be helpful. Also, please explain what the data points represent.

Line 359. Please define what “split 1/2” and “split 2/1” refer to.

Reviewer #3: Qi et al. developed an analytical and computational solution for estimating SNP-heritability in biobank-scale data, which is an extremely important problem but often challenging because of the computational burden. I very much appreciate the authors’ theoretical effort to attack this important problem and the work is also impressive. My major comments are trying to help the presentation of the manuscript.

I found the current manuscript is not easy to understand. In the main text, there are a lot of mathematical formulas and derivations, but many steps have been missed, which makes them difficult to follow. I suggest the authors present the final formula with clear notation definitions and leave detailed mathematical derivations in a Supplementary Note. This should improve readability. I will try to give my specific suggestions in my comments. I also suggest the authors carefully check all the mathematical formulas and make sure they are correct.

Line 96-97, it would be good to point out where the proofs of $E(\mathrm{tr}(K^{2}))$ and $E(\hat{h}^{2})$ are. It is also confusing why $E(\hat{h}^{2})$ still involves $y$. It would be good to add more details for the derivations, such as a Supplementary note.

Line 100, in equation (3), it would be better to introduce $L_{c,B}$ first before writing the equation. From my calculation, the current definition of $L_{c,B}$ leads to $E(L_{c,B}) = \frac{1}{m^{c}}\mathrm{tr}(K^{c})$, which is inconsistent with the second equation. I think $m^{c}$ is only present when $K$ is written as $XX^{T}$. I think the current definition of $L_{c,B}$ (the first equation in equations (3)) should not have the term $1/m^{c}$.

In addition, the equation $\mathrm{var}(L_{c,B}) = \frac{2\,\mathrm{tr}(K^{2c})}{B}$ in equations (3) is not the same as the equation in Wu and Sankararaman (2018). I believe the authors’ equation is correct, but I think the authors should point out the inconsistency if this is true.

Again, line 110, $L_{2,B}$ should not have $1/m^{2}$.
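The point raised in the comments above about the $1/m^{c}$ factor can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes Gaussian random vectors $z_b \sim N(0, I)$, simulated standardized genotypes, and a hypothetical helper name `randomized_trace`. With $K = XX^{T}/m$ already scaled by $m$, the average of $z_b^{T}K^{c}z_b$ over $B$ draws should land near $\mathrm{tr}(K^{c})$ without any further division by $m^{c}$, and its sampling variance should be close to $2\,\mathrm{tr}(K^{2c})/B$.

```python
import numpy as np

def randomized_trace(K, c, B, rng):
    """Estimate tr(K^c) as the mean of z_b^T K^c z_b over B draws z_b ~ N(0, I).

    Hypothetical helper for illustration; K^c is never formed explicitly.
    """
    n = K.shape[0]
    Z = rng.standard_normal((n, B))   # one column per random vector z_b
    V = Z.copy()
    for _ in range(c):
        V = K @ V                     # after the loop, V = K^c Z
    return float(np.mean(np.sum(Z * V, axis=0)))

rng = np.random.default_rng(0)
n, m, B, c = 200, 500, 2000, 2

# Simulated genotype-like matrix, standardized per marker (an assumption).
X = rng.standard_normal((n, m))
X = (X - X.mean(axis=0)) / X.std(axis=0)
K = X @ X.T / m                       # K is already divided by m

exact = float(np.trace(np.linalg.matrix_power(K, c)))
estimate = randomized_trace(K, c, B, rng)

# Theoretical standard deviation of the estimator: sqrt(2 tr(K^{2c}) / B).
theory_sd = float(np.sqrt(2.0 * np.trace(np.linalg.matrix_power(K, 2 * c)) / B))

print(f"exact tr(K^{c})        = {exact:.3f}")
print(f"randomized estimate   = {estimate:.3f}")
print(f"theoretical sd        = {theory_sd:.3f}")
print(f"estimate / m^c        = {estimate / m**c:.6f}")
```

Dividing the estimate by $m^{c}$ a second time shrinks it by several orders of magnitude relative to the exact trace, which is the inconsistency the reviewers describe.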

Equations (5), (6) and (7) need additional details. Again, a supplementary note will be good.

In equation (8), what does $\sigma_{h^{2}}$ refer to? There is no definition.

Line 129 and equation (10), why is z3 needed? Should z3 be the same as z2? This is confusing.

Line 151, the title is “Estimation for the effective number of markers…”. But it is actually for estimating $\mathrm{tr}(K^{2})$.

Line 152, it should be $E(\mathrm{tr}(K^{2}))$ rather than $\mathrm{tr}(K^{2})$. The derivation should add additional detail, perhaps in a Supplementary note.

Line 153, the fourth term summation should be m rather than n. I guess this term comes from $\sum_{i\neq j}^{n}\sum_{k\neq l}^{m} x_{i,k}^{2}x_{j,l}^{2}$. Since $i\neq j$, the expectation of this term should be $n(n-1)m(m-1)$. I am not clear whether this difference will affect the conclusion.
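The count suggested in the comment above can be written out as a short worked step. This is a sketch under assumptions that the comment itself does not state: genotypes are standardized so that $E(x_{i,k}^{2}) = 1$, and genotypes of distinct individuals $i \neq j$ are independent (unrelated samples).

$$
E\left(\sum_{i\neq j}^{n}\sum_{k\neq l}^{m} x_{i,k}^{2}x_{j,l}^{2}\right)
= \sum_{i\neq j}^{n}\sum_{k\neq l}^{m} E(x_{i,k}^{2})\,E(x_{j,l}^{2})
= n(n-1)\,m(m-1)
$$

There are $n(n-1)$ ordered pairs of distinct individuals and $m(m-1)$ ordered pairs of distinct markers, and each term contributes 1 under the stated assumptions.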

Line 176, y _b is not defined.

Line 191, what does the numerator of Eq 5 refer to?

Line 194, additional detail will be good.

Line 233-234, how was LD simulated? Was the LD randomly sampled from 0, 0.2, …, 0.8?

Line 163-264, it states “Λ3 was seemed a less important… and vanished much faster than Λ2”. Why did it vanish?

Figure 2 includes many symbols which cannot be recognized. This is possibly due to the format.

Line 273, it states “the sample size Λ0=…”. Why does it refer to sample size?

Figure 3 only plotted the regression lines. Should all the points be added so the readers can see how many data points were used in the regressions?

Line 336-338, it states the Pearson’s correlation coefficient 0.77. Does this correlation suggest a discrepancy between the current estimates and those by Xu et al.?

I think the authors should add computational time for different B to see the improvement.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Christos Dadousis

Reviewer #2: No

Reviewer #3: No

Figure resubmission:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. If there are other versions of figure files still present in your submission file inventory at resubmission, please replace them with the PACE-processed versions.

Reproducibility:

To enhance the reproducibility of your results, we recommend that authors of applicable studies deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols


PLoS Comput Biol. 2025 Oct 21;21(10):e1013568. doi: 10.1371/journal.pcbi.1013568.r002

Author response to Decision Letter 1

14 Jul 2025

Attachment

Submitted filename: PLoSCompBiol_Review.Jul13.cgb.docx

pcbi.1013568.s007.docx^{(517.8KB, docx)}

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013568.r003

Decision Letter 1

Androniki Psifidi

10 Sep 2025

PCOMPBIOL-D-25-00059R1

Analytical and computational solution for the estimation of SNP-heritability in biobank-scale and distributed datasets

PLOS Computational Biology

Dear Dr. Chen,

Please submit your revised manuscript within 30 days Nov 10 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at ploscompbiol@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pcompbiol/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Androniki Psifidi, DVM, PhD

Guest Editor

PLOS Computational Biology

Ilya Ioshikhes

Section Editor

PLOS Computational Biology

Additional Editor Comments:

Reviewer #1:

Reviewer #2:

Reviewer #3:

Journal Requirements:

If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.

1) We noticed that you used the phrase 'data not shown' in the manuscript. We do not allow these references, as the PLOS data access policy requires that all data be either published with the manuscript or made available in a publicly accessible database. Please amend the supplementary material to include the referenced data or remove the references.

2) Please amend your detailed Financial Disclosure statement. This is published with the article. It must therefore be completed in full sentences and contain the exact wording you wish to be published.

1) Please clarify all sources of financial support for your study. List the grants, grant numbers, and organizations that funded your study, including funding received from your institution. Please note that suppliers of material support, including research materials, should be recognized in the Acknowledgements section rather than in the Financial Disclosure

2) State the initials, alongside each funding source, of each author to receive each grant. For example: "This work was supported by the National Institutes of Health (####### to AM; ###### to CJ) and the National Science Foundation (###### to AM)."

3) State what role the funders took in the study. If the funders had no role in your study, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

4) If any authors received a salary from any of your funders, please state which authors and which funders.

If you did not receive any funding for this study, please simply state: "The authors received no specific funding for this work."

3) Your current Financial Disclosure states, "Yes ↳ Please add funding details. national natural science foundation of China ↳ Please select the country of your main research funder (please select carefully as in some cases this is used in fee calculation). CHINA - CN".

However, your funding information on the submission form indicates different funders.

Please indicate by return email the full and correct funding information for your study and confirm the order in which funding contributions should appear. Please be sure to indicate whether the funders played any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Dear authors,

I would like to thank you very much for your work and apologise for any delays in the reviewing process.

There are few grammar errors that I believe will be corrected during final checks from PLOS computational biology.

At line 235 replace c institutes with s institutes.

Reviewer #2: Thank you for your efforts in revising the manuscript. My previous concerns have been greatly addressed. I just have one remaining question regarding new Figure 2.

Could you clarify why the scale of λ₂/B in the current Figure 2 reaches into the hundreds or thousands, while λ₂ was previously shown to be in the range of hundredths or lower (a similar concern applies to λ₃)? Also, the color coding appears off. In the fourth panel, the green dashed line is supposed to indicate the mean along the x-axis, but there are no green dots on its right side. Panel 1 has the same issue.

As in my previous comment, it would be better to draw the MSE value across the simulation scenarios.

And it would be better to add a color legend in Figure 3.

Reviewer #3: The authors addressed my concerns well. Thanks.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Christos Dadousis

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

Figure resubmission:

While revising your submission, we strongly recommend that you use PLOS’s NAAS tool (https://ngplosjournals.pagemajik.ai/artanalysis) to test your figure files. NAAS can convert your figure files to the TIFF file type and meet basic requirements (such as print size, resolution), or provide you with a report on issues that do not meet our requirements and that NAAS cannot fix.

After uploading your figures to PLOS’s NAAS tool - https://ngplosjournals.pagemajik.ai/artanalysis, NAAS will process the files provided and display the results in the "Uploaded Files" section of the page as the processing is complete. If the uploaded figures meet our requirements (or NAAS is able to fix the files to meet our requirements), the figure will be marked as "fixed" above. If NAAS is unable to fix the files, a red "failed" label will appear above. When NAAS has confirmed that the figure files meet our requirements, please download the file via the download option, and include these NAAS processed figure files when submitting your revised manuscript.

Reproducibility:

PLoS Comput Biol. 2025 Oct 21;21(10):e1013568. doi: 10.1371/journal.pcbi.1013568.r004

Author response to Decision Letter 2

15 Sep 2025

Attachment

Submitted filename: LetterRebuttle.docx

pcbi.1013568.s008.docx^{(93.7KB, docx)}

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013568.r005

Decision Letter 2

Androniki Psifidi

29 Sep 2025

Dear Dr. Chen,

We are pleased to inform you that your manuscript 'Analytical and computational solution for the estimation of SNP-heritability in biobank-scale and distributed datasets' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Androniki Psifidi, DVM, PhD

Guest Editor

PLOS Computational Biology

Ilya Ioshikhes

Section Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013568.r006

Acceptance letter

Androniki Psifidi

PCOMPBIOL-D-25-00059R2

Analytical and computational solution for the estimation of SNP-heritability in biobank-scale and distributed datasets

Dear Dr Chen,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Anita Estes

Associated Data

Supplementary Materials

S1 Text. Technical details on some mathematical derivations.

(DOCX)

pcbi.1013568.s001.docx (54.1KB, docx)

S1 Fig. Computational tips for programming the software.

A sequence of matrix operations is suggested to make the program easier to write.

(TIFF)

pcbi.1013568.s002.tiff (3.8MB, tiff)

S1 Data. Implementation of horizontal RHE-reg (R code: Source1.R and Source2.R).

(ZIP)

pcbi.1013568.s003.zip (3.8KB, zip)

S1 Table. SNP-heritability estimates for 81 UKB traits.

(XLSX)

pcbi.1013568.s004.xlsx (30.1KB, xlsx)

S2 Table. Time cost of heritability estimation for BMI using RHE-reg, BOLT-REML, and GCTA.

(XLSX)

pcbi.1013568.s005.xlsx (11.6KB, xlsx)

S3 Table. Computational time for 81 continuous traits using RHE-reg.

(XLSX)

pcbi.1013568.s006.xlsx (14.6KB, xlsx)

Attachment

Submitted filename: PLoSCompBiol_Review.Jul13.cgb.docx

pcbi.1013568.s007.docx (517.8KB, docx)

Attachment

Submitted filename: LetterRebuttle.docx

pcbi.1013568.s008.docx (93.7KB, docx)

Data Availability Statement

