Abstract
Narrow sense heritability is a key concept in quantitative genetics, as it expresses the proportion of the observed phenotypic variation that is transmissible from parents to offspring. determines the resemblance among relatives, and the rate of response to artificial and natural selection. Classical methods for estimating use random samples of individuals with known relatedness, as well as response to artificial selection, when it is called realized heritability. Here, we present a method for estimating realized based on a simple assessment of a random-mating population with no artificial manipulation of the population structure, and derive SE of the estimates. This method can be applied to arbitrary phenotypic segments of the population (for example, the top-ranking p parents and offspring), rather than random samples. It can thus be applied to nonpedigreed random mating populations, where relatedness is determined from molecular markers in the p selected parents and offspring, thus substantially saving on genotyping costs. Further, we assessed the method by stochastic simulations, and, as expected from the mathematical derivation, it provides unbiased estimates of We compared our approach to the regression and maximum-likelihood approaches utilizing Galton’s dataset on human heights, and all three methods provided identical results.
Keywords: quantitative genetics, Hardy-Weinberg equilibrium, panmictic population
NARROW sense heritability is a key concept in quantitative genetics. is technically defined as the ratio of additive genetic variance (the variance of breeding values) to the total phenotypic variance and represents the fraction of the phenotypic variation of a quantitative trait that is transmissible from one generation to the next. determines the degree of resemblance between relatives, and the rate of response to artificial and natural selection; therefore, estimating is often the first step in applied plant and animal breeding programs as well as evolutionary genetics studies. is also the upper limit for the accuracy of predicting phenotypes from molecular marker data (genomic prediction), and, hence, is required knowledge for common diseases in humans in the context of precision medicine (Yang et al. 2010).
Because determines the degree of resemblance among relatives, classical methods for estimating use the observed phenotypic correlation among closely related individuals, and their average coefficient of relationship, to estimate One such classical method is regression of offspring phenotypic values on midparent phenotypic values (offspring-parent regression), for which the regression coefficient, b, provides an unbiased estimate of Other classical methods involve ANOVA of full and half-sibling families, and analysis of relatives with different degrees of relatedness using maximum-likelihood-based approaches. These methods require designed experiments in which pedigrees are constructed by mating specific males and females, or natural matings for which at least one parent is known are utilized. These designs can maximize the precision of the estimates given sample size constraints by manipulating both the numbers of families and family sizes.
However, these designs are not always applicable. For example, pedigrees are not usually known in natural populations. In this case, can be estimated using molecular markers to reconstruct pedigrees (Lynch and Walsh 1998; Aykanat et al. 2014). However, this is followed by variance component analysis based on the reconstructed pedigree, as is done for designed experiments, and, hence, also requires genotyping a large random sample of the population. More recently, “genomic heritability” has been estimated in populations of individuals that have been genotyped for large numbers of molecular markers using whole genome marker regressions (Yang et al. 2010). Genomic heritability is the fraction of the genetic variance that can be explained by regression on the markers, and will only be equivalent to the true when all causal variants are genotyped, such as with sequence data; or when the true causal variants are in perfect linkage disequilibrium with the markers (de los Campos et al. 2015). A major disadvantage of all marker-based methods for estimating is the expense of genotyping, so methods that can reduce this cost are desirable.
Here, we propose a method to estimate from natural populations that is related to the concept of realized Traditionally, realized is estimated by comparing the response to selection to the selection differential The estimate of requires specific matings. We show here that the realized concept can be generalized to a randomly mating natural population in which phenotypic values of a quantitative trait have been scored for the parental and offspring generations. The is estimated from the proportion of offspring from a defined range of phenotypes (for example, the top p offspring) that were produced by the p parents from the same defined phenotypic range. Thus, one only needs to determine relatedness of the selected subset of p parents and p offspring from the same truncated fraction of the phenotypic distribution, saving substantially on genotyping costs. Of course, one needs to assume neutrality of the DNA markers (such as highly polymorphic SSRs) to estimate pedigree within the truncated subsets. Assuming this approach, we derive the estimate and its associated SE.
Methods
We assume a population in Hardy-Weinberg equilibrium, where the offspring () are derived from the parental population (P) by random union of gametes, and there is no selection. We assume the phenotype of an individual for a quantitative trait (X) is the sum of an independent additive genetic value (a) and environmental deviation (e), and that a and e are normally distributed in the population. It follows that the phenotypic variance of X corresponds to and thus We now rank the individuals in the P and populations by their respective phenotypic values. From this ranking, we can determine the top-ranking proportion p of the population, as well as r, the percentage of cases when one or both parents of the top-ranking offspring individuals belong to the corresponding truncated p fraction of the P population (Figure 1a). It should be emphasized that the parents in P mate randomly, producing the offspring, and, therefore, this scenario differs from conventional truncation selection as the top p proportion of the parental population were not “selected” and constrained to mate among themselves. Rather, both top-ranking parents and offspring are only “inspected” to calculate r. We postulate that may be calculated based on the three parameters: p, r, and
It follows that the case for truncation selection (Figure 1a) can be generalized to the case of a two-sided truncation with (Figure 1b).
This approach differs from classical methods to estimate in that it does not rely on the degree of resemblance among relatives or variance component decomposition. This method resembles a realized estimate, but it differs from classical realized because it is applicable to panmictic populations, and does not require experimentally controlled crosses among selected parents. This strategy was motivated by a mathematical analysis of the effect of phenotypic preselection on reducing contamination rate in seed orchards of forest trees (Lstibůrek et al. 2012). This approach is similar to methodology developed to estimate for binomial (threshold) traits (Crittenden 1961; Falconer 1965), but here we are concerned with normally distributed quantitative traits.
Mathematical derivation
We begin with the derivation of the more general two-sided truncation scenario. Let and then we follow the transformation properties of the normal distribution and where
(1) |
For given and one may calculate the area p of the truncated distribution (Figure 1) as
(2) |
where and are the cumulative distribution functions (cdf) of the normal distribution The corresponding truncated distribution is then (Johnson et al. 1994)
(3) |
The selection intensity (i) is then defined as
(4) |
where and are the probability density functions of the standardized normal distribution, and the coefficient k is
(5) |
We can trace the likelihood that one or both parents of a given offspring originate from the truncated subset p. The number of parents satisfying this origin is binomially distributed as when tracking one side of the parentage, and when tracking both sides of the parentage. In these cases, the respective likelihoods that one parent or both parents, respectively, originate from p are enumerated in Table 1.
Table 1. Likelihoods and coefficients.
B | ||
---|---|---|
p | 0.5 | |
1 |
and are the number of parents originating from the truncated subset p when tracing one or two sites of the parentage, respectively. is the corresponding likelihood, and B is the coefficient introduced in Equation 6.
Then, for one of the two above scenarios, the expected value and variance of the offspring in are
(6) |
where the coefficient B is provided in Table 1.
Let be the proportion of the above offspring (Equation 6) belonging to the interval of In our remaining analyses, we utilize the central limit theorem, i.e., the normal approximation of the above distribution. Utilizing the same transformation properties of the normal distribution (Equation 1), the area is given by
(7) |
Let N be the size of the population. Then, is the number of individuals in the interval is the subset of where the parent(s) originate from the interval in P, according to Y. Next, we introduce a variable r, the relative frequency of offspring in the truncated proportion in with parent(s) originating from the truncated proportion of P, according to Y, where
(8) |
Next, we utilize the earlier binomial expansion, and
(9) |
With the knowledge of p and r, we can ascertain the unknown thus
(10) |
Therefore, is the implicit function of r, p, and (Additional variables in the equation are all functions of these three input parameters.) The above formulation is general, and could be used with mathematical solver software to calculate an estimate of Computer code to solve for using Equation 10 written in R language (R Core Team 2013) is provided in a public repository (Lstibůrek 2017). Next, we derive analytical solutions (i.e., deterministic equations) to estimate for two specific cases.
Case 1: centered two-sided truncation:
Provided both and are symmetrically allocated around the mean of P and is calculated as (see Supplemental Material, File S1 for derivation)
(11) |
where
Case 2: right-sided truncation:
When (Figure 1), one may calculate as (see File S1 for derivation)
(12) |
where There are two solutions of the above equation, and verification using Equation S1.7 provides the unique one.
Variance of the estimate:
The SE associated with the general estimate (Equation 10) is (see File S1 for derivation)
(13) |
where n is the number of progeny in p, under symmetric two-sided truncation, and a and b are
(14) |
The SE of the estimate for the scenario of right-sided truncation (Equation 12) is
(15) |
Equations 13 and 15 are both products of two components, where the first (leftmost) is associated with the type and intensity of truncation, and the second (rightmost) depicts the variance of r, including the sample size n.
Utilizing the mathematical derivation outlined above, we claim that estimates of and the corresponding SE are asymptotically unbiased.
The SE are provided for the two likelihoods ( and ) and one- or two-sided truncations in Figure 2 and Figure 3. The SE is comparable to existing methods, i.e., regression analysis; thus, the number of samples providing reliable estimate of is in the magnitude of “hundreds.”
Data availability
The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.
Computer simulation
As noted above, our formal theoretical derivations using normal distributions are not strictly true with truncation. Therefore, we verified our analytical approach by stochastic simulation, where was estimated for a hypothetical panmictic population as follows. Additive polygenetic effects of 10,000 unrelated and noninbred parental individuals were sampled from the normal distribution assuming the infinitesimal genetic model. The environmental deviation was drawn from Subsequently, 10,000 individual offspring were generated. Parents were randomly mated and polygenic additive genetic values in offspring were drawn from where is the respective midparental additive genetic value. The environmental deviation was assigned as for the parental population.
Next, 1000 offspring with trait’ values were inspected, and r was calculated for each of the likelihood scenarios by evaluating the trait value of parents. was then estimated by solving Equation 10. Simulations were repeated for 100 independent stochastic iterations, and means and variances were calculated across all runs, and compared to the true parameter Computer code to perform the simulation written in R language is provided in a public repository (Lstibůrek 2017).
The simulations showed that the normal approximation of the assumed distribution (Equation 6) is justified, since the expected value of estimates were nearly identical to the true parameters across 100 independent stochastic simulations for all assumed scenarios ( from 0 to 1), 0, 0.5, and 1, and across the two likelihoods (Table 2).
Table 2. Results of computer simulations.
SD | SD | P value | SD | SD | P value | ||
0.0 | 0.762 | 0.577 | |||||
0.2 | 0.329 | 0.139 | |||||
0.4 | 0.185 | 0.323 | |||||
0.6 | 0.463 | 0.461 | |||||
0.8 | 0.661 | 0.739 | |||||
1.0 | 0.117 | 0.067 | |||||
0.0 | 0.316 | 0.595 | |||||
0.2 | 0.775 | 0.753 | |||||
0.4 | 0.596 | 0.946 | |||||
0.6 | 0.907 | 0.793 | |||||
0.8 | 0.741 | 0.631 | |||||
1.0 | 0.250 | 0.612 | |||||
0.0 | 0.317 | 0.624 | |||||
0.2 | 0.835 | 0.876 | |||||
0.4 | 0.213 | 0.052 | |||||
0.6 | 0.894 | 0.792 | |||||
0.8 | 0.953 | 0.545 | |||||
1.0 | 0.630 | 0.957 |
and are the number of parents originating from the truncated subset p when tracing one or two sites of the parentage, respectively. α denotes the left truncation, is the input narrow-sense heritability (simulation parameter), and are the corresponding estimates of r and in a given scenario, reported with respective SD (calculated across independent stochastic iterations), and P-value is the respective level of significance in comparing and
Demonstration using human height
We used Galton’s well-known dataset of human heights (Galton 1886). The dataset includes 205 families with adult heights of 934 children, in family sizes range from 1 to 15 children. The dataset is publicly available as R package (Friendly et al. 2017), and we used it for our demonstration. Hanley (2004) analyzed this original dataset, and he was particularly addressing the issue of Galton’s original adjustment for the gender, i.e., all heights of women were multiplied by 1.08. First, we reproduced Hanley’s calculations, and received identical estimates of the regression coefficient b (Table 1 in Hanley 2004). Then, we considered Galton’s original adjustment, where = 0.71 (regression of the heights of children on midparent values), which provides the direct estimate of We used restricted maximum-likelihood approach to verify these findings using the animal genetic model in ASReml (Gilmour et al. 2009), where the random pedigree effect was the single factor in the model.
As expected, we received identical estimate of and SE of 0.04 (identical to the SE of b in our regression analysis). Then, we followed our approach and analyzed offspring in the right site of the distribution (truncation >0.5 of the standardized normal distribution), i.e., approximately one-third of Galton’s dataset. We assumed tracking one side of the parentage, as it leads to the lower SE (compare Figure 2b and Figure 3b). All children values (adjusted dataset by Galton) were sorted. Then we started assessing the tallest child and compared, whether the mother’s height was above the same truncation. We recorded all positive values (as 1s); the sum for one parentage across all truncated children (278) was 122 and for the second parentage 137. We averaged these values and divided the result by 278 and obtained the value of Then, assuming Equation 10, we estimated to be 0.73 with SE 0.07 (the error is actually <0.07 since we averaged two values).
Discussion
The main methodological message of this study is that it is possible to estimate within a panmictic population without a planned experiment, and without performing a complete pedigree reconstruction across the entire phenotypic distribution of parental and offspring populations in order to implement traditional variance component estimation. Our approach provides a way to estimate by genotyping only a small proportion of the parental and offspring populations (e.g., the right-hand tail of the distributions), and estimating the proportion of that subset of offspring that arose from that subset of parents. It is thus an estimate of realized derived from quantifying the relatedness among exact phenotypic subsets of parental and offspring populations.
Much formal quantitative genetics theory follows a simple biological principle: the additive genetic value of an offspring is the midparent additive genetic value plus the Mendelian sampling term. The phenotype is further determined by adding an error term, where each offspring is allocated a random environmental variate from the corresponding normal distribution. This reasoning is the key to understanding the approach presented here: parents are sampled from the population P independently and randomly (with replacement) and they form the corresponding normal distribution Thus, all parental combinations occur with the same probability, i.e., random mating. Variation in family sizes is therefore an integral part of the model and the model is thus experimentally tailored to truly panmictic populations.
The classical methods for estimating using offspring–parent or offspring–midparent regression, and ANOVA-based approaches promote experimental efficiency by artificially manipulating the family structure with respect to the actual family numbers and sizes, so that the SE are minimized. Similarly, we can assess the experimental parameters α, β, and progeny sample size n that minimize SE determined by our method. An extension of the classical methods is to use restricted maximum likelihood approach (Patterson and Thompson 1971) for variance decomposition, both in synthetic and wild populations, provided variance-covariance of phenotypic records is specified by use of a relationship matrix (i.e., complete pedigree linking respective records to the base population), and assumptions of the infinitesimal model hold (Sorensen and Kennedy 1984).
The approach outlined here is different, as it allows to be estimated based on the truncated, rather than a random, subset, thus pedigree relationship is needed only within the truncated subsets of the distribution. Regression based on a truncated subset produces a biased estimate because the entire population sample (and particularly extreme values) are needed to estimate the slope of the regression line. It is possible to modify the regression procedure where most of the effort is applied to families of parents with phenotypes at both ends of the distribution, which would yield a precise estimate (Hill 1990). The increase in efficiency is because parents with phenotypes near the mean provide little information on the slope of the regression. However, we show here that can be estimated based on either phenotypes near the mean, or from phenotypic observations in only one extreme of the distribution. These arbitrary truncations applied to the normal distribution are different from the traditional approaches and could offer experimental scenarios that have not yet been considered in genetic studies. Our approach is based on determining the means and variances of P and populations and the relative frequency of offspring in the truncated proportion in with parent(s) originating from the truncated proportion of P. Provided the panmictic assumptions hold, our approach is therefore experimentally very simple and general.
First, it is clear that the precision of estimate will be much higher for one-sided truncation than for centered two-sided truncation. Comparing Figure 2 and Figure 3, we see that the SE from centered two-sided truncation are on the order of 5–10 those from one-sided truncation. In fact, close inspection of Equations 13 and 15 reveals that the minimum SE is obtained from the one-side truncation scenario. Next, we can assess the optimum α, the standardized truncation point. For (tracking only one side of the parentage), the optimum α will generally be between 1 and 2 SD (Figure 2b). For (tracking both sides of the parentage), the optimum α is between 0 and 1 SD across a range of (Figure 3b).
We now consider how this method could be implemented in practice. The first step is to calculate phenotypic means and variances in both the P and populations. The second step is to genotype a random sample of individuals in the right-truncated tail of the (say in the phenotypic range from SD above the mean up to ) as well as parents belonging to the equivalent phenotypic range in the P population. The pedigree can then be inferred from these genotype data. The coefficient r can then be calculated by evaluating if i.e., tracking one side of the parentage. We can then use Equation 12 to calculate If the true then the SE (Equation 15 and Figure 2b). If α is 2 SD above the mean, the corresponding SE i.e., identical, but the amount of genotyping effort in the parental population is greatly reduced. Note that traditional regression on one parent provides SE, and that on midparent values gives SE (Falconer and Mackay 1996).
We will now explain how (in theory) Galton could have utilized our approach to collect the dataset. First, he would need estimates of the mean and variance of the trait in a population (we assume that records would be available without much investment). Then, he could choose to measure 278 children >0.5 SD above the average and check in the same manner parental values. With unknown parentage, pedigree assembly could be reduced from 934 to 278 children, reducing significantly the cost of sample collection, DNA extraction, and genotyping. Reduced sample size could also be possible with regression analysis, assuming a random sample (of say 278 children) across the entire range of the distribution with higher SE. However, the main methodological message of the current study is that we can estimate based on the tail of the distribution. Regression in such a case would result in a biased estimate (e.g., the same truncated dataset would produce ).
We could speculate that our approach is less sensitive to the presence of outliers as all individuals above the truncation contribute equally to r. In regression analysis, outliers are most influential in either extreme of the distribution with respect to the slope of the regression line. In the ASReml analysis (Gilmour et al. 2009) that we performed on the same dataset, three possible outliers were suggested. When we removed these from the analysis, changed to 0.72 (SE was identical).
In summary, we propose an alternative formulation of Under the assumptions of Hardy-Weinberg equilibrium, the of a quantitative trait in a given population is directly related to the likelihood of a phenotypic subset of parents passing their respective alleles onto the corresponding phenotypic subset of offspring. Future work is needed to compare the efficiency of this method to other existing methods; to estimate nonadditive genetic effects; to assess robustness with respect to the genetic architecture of the trait; and to assess the sensitivity of the approach to assumptions of Hardy-Weinberg equilibrium.
Supplementary Material
Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.117.300508/-/DC1.
Acknowledgments
We thank Jan Stejskal for his assistance with the data analysis. M.L. is supported by grant “EXTEMIT - K,” No. CZ.02.1.01/0.0/0.0/15_003/0000433 financed by Operational Programme Research, Development, and Education (OP RDE). The research of J.P. is supported by the Czech Science Foundation, project No. 15-00243S. G.R.H. is supported by Camcore, Department of Forestry and Environmental Resources, North Carolina State University.
Footnotes
Communicating editor: G. Churchill
Literature Cited
- Aykanat T., Johnston S. E., Cotter D., Cross T. F., Poole R., et al. , 2014. Molecular pedigree reconstruction and estimation of evolutionary parameters in a wild Atlantic salmon river system with incomplete sampling: a power analysis. BMC Evol. Biol. 14: 1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Crittenden L. B., 1961. An interpretation of familial aggregation based on multiple genetic and environmental factors. Ann. N. Y. Acad. Sci. 91: 769–780. [DOI] [PubMed] [Google Scholar]
- de los Campos G., Sorensen D., Gianola D., 2015. Genomic heritability: what is it? PLoS Genet. 11: e1005048. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Falconer D. S., 1965. The inheritance of liability to certain diseases, estimated from the incidence among relatives. Ann. Hum. Genet. 29: 51–76. [Google Scholar]
- Falconer D. S., Mackay T. F. C., 1996. The Introduction to Quantitative Genetics, Ed. 4 Longmans Green, Harlow, UK. [Google Scholar]
- Friendly, M., S. Dray, H. Wickham, J. Hanley, D. Murphy et al., 2017 HistData: data sets from the history of statistics and data visualization. R package version 0.8–2, dataset ’GaltonFamilies’. Available at: https://cran.r-project.org/web/packages/HistData/HistData.pdf. Accessed September 25th, 2017.
- Galton F., 1886. Regression towards mediocrity in hereditary stature. J. Anthropol. Inst. G. B. Irel. 15: 246–263. [Google Scholar]
- Gilmour A. R., Gogel B., Cullis B., Thompson R., Butler D., et al. , 2009. ASReml User Guide Release 3.0. VSN International Ltd, Hemel Hempstead, UK. [Google Scholar]
- Hanley J. A., 2004. “Transmuting” women into men: Galton’s family data on human stature. Am. Stat. 58: 237–243. [Google Scholar]
- Hill W. G., 1990. Considerations in the design of animal breeding experiments, in Advances in Statistical Methods for Genetic Improvement of Livestock (Advanced Series in Agricultural Sciences), chapter 4, pp. 59–76, edited by Gianola D., Hammond K. Springer-Verlag, Berlin. [Google Scholar]
- Johnson N. L., Kotz S., Balakrishnan N., 1994. Continuous Univariate Distributions. Wiley Series in Probability and Mathematical Statistics, Ed. 2 John Wiley & Sons, New York. [Google Scholar]
- Lstibůrek M., 2017. Estimating realized h2 in panmictic populations, computer code. Available at: https://github.com/mlstiburek/genetics-heritability.git. Accessed September 25th, 2017.
- Lstibůrek M., Klápště J., Kobliha J., El-Kassaby Y. A., 2012. Breeding without breeding: effect of gene flow on fingerprinting effort. Tree Genet. Genomes 8: 873–877. [Google Scholar]
- Lynch M., Walsh B., 1998. Genetics and Analysis of Quantitative Traits, Vol. 1 Sinauer, Sunderland MA. [Google Scholar]
- Patterson H. D., Thompson R., 1971. Recovery of inter-block information when block sizes are unequal. Biometrika 58: 545–554. [Google Scholar]
- R Core Team , 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna. [Google Scholar]
- Sorensen D., Kennedy B., 1984. Estimation of genetic variances from unselected and selected populations. J. Anim. Sci. 59: 1213–1223. [Google Scholar]
- Yang J., Benyamin B., McEvoy B. P., Gordon S., Henders A. K., et al. , 2010. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42: 565–569. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.