Author manuscript; available in PMC: 2014 Jan 1.
Published in final edited form as: Ann Hum Genet. 2012 Nov 20;77(1):80–84. doi: 10.1111/j.1469-1809.2012.00733.x

Impact on Modes of Inheritance and Relative Risks of using Extreme Sampling When Designing Genetic Association Studies

Gang Zheng 1,*, Jinfeng Xu 2, Ao Yuan 3, Colin O Wu 1
PMCID: PMC3535545  NIHMSID: NIHMS409141  PMID: 23163532

Summary

Using extreme phenotypes for association studies can improve statistical power. We study the impact of using samples with extremely high or low traits on the alternative model space, the genotype relative risks, and the genetic models in association studies. We prove the following results: when the risk allele causes high trait values, the more extreme the high traits, the larger the genotype relative risks, which is not always true for using extreme low traits; we also prove that a genetic model theoretically changes with more extreme traits, except for the recessive and dominant models. Practically, however, the impact of deviations from the true genetic model at a functional locus due to selective sampling is virtually negligible. The implications of our findings are discussed. Numerical values are reported for illustration.

Keywords: Association studies, Extreme sampling, Genetic models, Genotype relative risks, Replication

Introduction

Designs that use extreme high and low traits have been employed in both genetic linkage and association studies (Risch & Zhang, 1995; Forest & Feingold, 2000; Zheng et al., 2006). It has been shown that using extreme traits can improve statistical power for genetic association studies compared to the random sampling of all traits (Slatkin, 1999; Abecasis et al., 2001; Xiong et al., 2002; Chen et al., 2005; Chen & Li, 2011). Extreme high and low traits can also be used to design a case-control association study based on a threshold model. One such example is described in Sims et al. (2008), who studied Wnt pathway genes for bone mass density (BMD) using 170 cases and 174 controls selected with high BMDs (Z-score from 1.51 to 3.97) and low BMDs (Z-score from −1.5 to −3.33), respectively, and demonstrated that extreme sampling can robustly detect genes of relevant effect sizes. A comparison between the use of extreme sampling and case-control data was reported by Yang et al. (2010). Extreme sampling has recently been applied to detect genes for rare variants (Guey et al., 2011; Li et al., 2011).

In this paper, we study the impact of extreme sampling on the alternative model space, the genotype relative risks (GRRs), and the modes of inheritance (genetic models). The GRRs are used in designing genetic association studies, and genetic models play important roles in testing associations. The additive model, which counts the number of minor alleles in the genotype, is commonly used. Common non-additive models include the recessive and dominant models. When a genetic model is correctly specified, an optimal test is obtained and applied; the same test is sub-optimal when the model is mis-specified (Freidlin et al., 2002). If the genetic model changes substantially using extreme traits, which is plausible because it is defined based on both phenotypes and genotypes, it may affect the power to detect association even though samples with extreme traits are used. Moreover, it may also affect the interpretation of replication results when extreme traits are used in the replication.

Methods

Notation

Assume the marker of interest is in complete linkage disequilibrium (LD) with a disease locus. The alleles of the marker are denoted as b and B. Without loss of generality (WLG), when the marker is associated with a quantitative trait, let b be the risk allele, which causes a higher trait value. Three genotypes are denoted as Gi, where i = 0, 1, 2 counts allele b in the genotype. The frequency of Gi is denoted as gi = pr(Gi) (i = 0, 1, 2). Let X be a quantitative trait, given by X|G = μ + g + e, where μ is the overall mean, g is the genetic value for the genotype G, and e is a non-genetic random error with mean E(e) = 0 and variance var(e) = 1 (WLG). Assume that G and e are independent. The value of g is given by g = −a, d and a (a > 0) for G = G0, G1 and G2, respectively, where d = −a, 0, or a if the genetic model under the random sampling is recessive, additive, or dominant, respectively. We do not consider any under-dominant (d < −a) or over-dominant (d > a) models. WLG, let μ = 0. Denote E(X|Gi) = μi and (μ0, μ1, μ2) = (−a, d, a). Denote the conditional distribution of X given Gi as X|Gi ~ F(x − μi). The marginal distribution of X is X ~ ∑ gi F(x − μi), with the sum taken over i = 0, 1, 2. The null hypothesis of no association is given by H0: a = d = 0, under which μ0 = μ1 = μ2 = 0.

We consider a threshold model with the truncation points (u, v), that is, we only sample individuals with X > v or X < u, where u < v are pre-specified in the design stage. If the distribution of X can be estimated using previous data (Xu et al., 1999), u and v can be chosen as its 100c1th and 100(1 - c2)th percentiles, respectively, where c1 (c2) can be, say, 0.01, 0.10, or 0.20. If an estimation is not available, one may consider using the extreme rank selection, which does not require an underlying distribution of X (Chen et al., 2005; Zheng et al., 2006). Define the study population as XS = {X < u or X > v, u < v}. Only individuals whose traits belong to XS are genotyped. We call the design with XS as extreme sampling. Sampling only high traits is a special case by letting u → −∞. Denote the penetrance for Gi as pi = pr(X > v|Gi, XS) = (1 - Fv,i)/(Fu,i + 1 - Fv,i) (i = 0, 1, 2), where Fx,i = F (x - μi) for x = u, v. The GRRs are given by λj = pj/p0 (j = 1, 2). Under H0, a = d = 0 is equivalent to λ1 = λ2 = 1 or p0 = p1 = p2.
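The penetrance and GRR definitions above can be evaluated directly once F is fixed. The following is a minimal sketch, assuming F is the standard normal distribution function (the choice the paper makes for its numerical work); the function names and the threshold values u = −1 and v = 1 are hypothetical and serve only as an illustration.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # assumption: F is the standard normal CDF


def penetrance(mu_i, u, v):
    """p_i = pr(X > v | G_i, X < u or X > v) for X | G_i ~ N(mu_i, 1)."""
    upper = 1.0 - PHI(v - mu_i)  # 1 - F_{v,i}
    lower = PHI(u - mu_i)        # F_{u,i}
    return upper / (lower + upper)


def grrs(a, d, u, v):
    """Return (lambda_1, lambda_2) for genotype means (-a, d, a)."""
    p0, p1, p2 = (penetrance(m, u, v) for m in (-a, d, a))
    return p1 / p0, p2 / p0


# Hypothetical thresholds, chosen only for illustration.
lam1, lam2 = grrs(a=0.1, d=0.0, u=-1.0, v=1.0)
null1, null2 = grrs(a=0.0, d=0.0, u=-1.0, v=1.0)
print(lam1, lam2)    # both exceed 1 under the additive alternative
print(null1, null2)  # both equal 1 under H0 (a = d = 0)
```

Setting u to a very large negative number approximates the special case of sampling only high traits.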

The model space under extreme sampling

Under the alternative hypothesis H1, one is more interested in the three common genetic models: recessive (μ0 = μ1 < μ2), additive (μ0 < μ1 < μ2 and μ1 = (μ0 + μ2)/2) and dominant (μ0 < μ1 = μ2), and any model between the recessive and dominant models. That is, without any under- and over-dominant models, the model space formed by (μ0, μ1, μ2) under the random sampling of all traits can be written as M = {(μ0, μ1, μ2): μ0 ≤ μ1 ≤ μ2, μ0 < μ2}, which includes the three common genetic models. Define θ = (μ1 − μ0)/(μ2 − μ0) under H1. Hence, μ1 = (1 − θ)μ0 + θμ2 and M can be indexed by θ as M = {θ ∈ [0, 1]: μ1 = (1 − θ)μ0 + θμ2}. The model space M under random sampling is constrained because θ only belongs to [0, 1]. Test statistics are more powerful under the constrained model space than under the unconstrained model space, in which H1 only requires that μ0, μ1 and μ2 are not all equal (Zheng et al., 2009a). Next, we study whether the model space in terms of GRRs (λ1, λ2) or penetrances (p0, p1, p2) would be further constrained under extreme sampling.

Denote the density function of X given Gi as fx,i = f(x-μi) (i = 0, 1, 2) and rx,i = fx,i/(1-Fx,i). Under extreme sampling, conditional on G, we assume (i) rx,i is an increasing function of x, (ii) fx,i is symmetric with respect to μi, i.e., f(x - μi) = f(−(x - μi)), and (iii) fx,i is a decreasing function of x > μi. These assumptions are all satisfied if the trait (after log or power transformation) follows a normal distribution. It can be shown that (iv) fx,i/Fx,i is a decreasing function of x < μi as follows. For x < y < μi, fx,i/Fx,i = f(x - μi)/F (x - μi) = f(μi - x)/{1 - F (μi - x)} > f(μi - y)/{1 - F (μi - y)} = fy,i/Fy,i due to μi - x > μi - y and (i).
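Assumption (i) and property (iv) can be checked numerically for the normal case. The sketch below assumes the standard normal distribution and a hypothetical grid from −4 to 4; it is a finite check on that grid, not a proof.

```python
from statistics import NormalDist

N = NormalDist()  # assumption: standard normal trait distribution


def hazard(x):
    """r(x) = f(x) / {1 - F(x)}, the ratio in assumption (i)."""
    return N.pdf(x) / (1.0 - N.cdf(x))


def reversed_hazard(x):
    """f(x) / F(x), the ratio in property (iv)."""
    return N.pdf(x) / N.cdf(x)


grid = [i / 10 for i in range(-40, 41)]  # x from -4 to 4 in steps of 0.1
r = [hazard(x) for x in grid]
s = [reversed_hazard(x) for x in grid]

hazard_increasing = all(b > a for a, b in zip(r, r[1:]))
reversed_decreasing = all(b < a for a, b in zip(s, s[1:]))
print(hazard_increasing, reversed_decreasing)
```

By symmetry of the normal density, f(x)/F(x) equals r(−x), which is why the second check mirrors the first.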

Let (μ0, μ1, μ2) belong to M. Then, under extreme sampling, we have u − μ0 ≥ u − μ1 ≥ u − μ2 and v − μ0 ≥ v − μ1 ≥ v − μ2. For i > j, 1 − Fv,i ≥ 1 − Fv,j and Fu,i ≤ Fu,j. It follows that Fu,j(1 − Fv,i) ≥ Fu,i(1 − Fv,j), which is equivalent to pi ≥ pj. Thus, if we define M′ = {(p0, p1, p2): p0 ≤ p1 ≤ p2 and p0 < p2} = {(λ1, λ2): λ2 ≥ λ1 ≥ 1 and λ2 > 1}, then for any (μ0, μ1, μ2) ∈ M, (p0, p1, p2) ∈ M′ under extreme sampling, which leads to Result 1.

Result 1. The constraints on the mean traits (μ0, μ1, μ2) under the random sampling hold on the GRRs under extreme sampling. Thus, using extreme sampling does not further reduce the model space compared to the random sampling.
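The equivalence between the product inequality and the ordering of the penetrances, used in the argument leading to Result 1, can be spelled out as follows. This is a worked restatement using only the definition of pi given earlier; all denominators are positive, so cross-multiplication preserves the inequality.

```latex
\begin{aligned}
p_i \ge p_j
&\iff \frac{1-F_{v,i}}{F_{u,i}+1-F_{v,i}} \ge \frac{1-F_{v,j}}{F_{u,j}+1-F_{v,j}} \\
&\iff (1-F_{v,i})\,(F_{u,j}+1-F_{v,j}) \ge (1-F_{v,j})\,(F_{u,i}+1-F_{v,i}) \\
&\iff F_{u,j}\,(1-F_{v,i}) \ge F_{u,i}\,(1-F_{v,j}),
\end{aligned}
```

where the last step cancels the common term (1 − Fv,i)(1 − Fv,j) from both sides.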

Impact on the GRRs under extreme sampling

Why would using extreme traits improve power of association studies? Result 1 shows it is not due to a smaller model space. Hence, we study how GRRs change under extreme sampling. Write the GRRs as

$$\lambda_i(u,v) = \frac{(1-F_{v,i})\,(F_{u,0}+1-F_{v,0})}{(1-F_{v,0})\,(F_{u,i}+1-F_{v,i})}, \quad \text{for } i = 1, 2.$$

Taking the partial derivatives of log λi with respect to u and v, we have, for any u < v,

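$$
\begin{aligned}
\frac{\partial}{\partial u}\log\lambda_i &= \frac{f_{u,0}}{F_{u,0}+1-F_{v,0}} - \frac{f_{u,i}}{F_{u,i}+1-F_{v,i}}, \\
\frac{\partial}{\partial v}\log\lambda_i &= \frac{f_{v,0}}{1-F_{v,0}} - \frac{f_{v,i}}{1-F_{v,i}} + \frac{f_{v,i}}{F_{u,i}+1-F_{v,i}} - \frac{f_{v,0}}{F_{u,0}+1-F_{v,0}} = (1-p_0)\,r_{v,0} - (1-p_i)\,r_{v,i}.
\end{aligned}
$$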
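The closed form for the derivative with respect to v, (1 − p0)rv,0 − (1 − pi)rv,i, can be compared against a central finite difference. The sketch below assumes a standard normal F; the parameter values (a = 0.1, u = −1, v = 1.2) and the step size are hypothetical, chosen only for illustration.

```python
from math import log
from statistics import NormalDist

N = NormalDist()  # assumption: F is the standard normal CDF


def penetrance(mu, u, v):
    """p = (1 - F(v - mu)) / (F(u - mu) + 1 - F(v - mu))."""
    upper = 1.0 - N.cdf(v - mu)
    return upper / (N.cdf(u - mu) + upper)


def log_grr(mu_i, mu_0, u, v):
    """log lambda_i(u, v) = log(p_i / p_0)."""
    return log(penetrance(mu_i, u, v) / penetrance(mu_0, u, v))


def hazard(x):
    """r(x) = f(x) / {1 - F(x)}."""
    return N.pdf(x) / (1.0 - N.cdf(x))


def dlog_grr_dv(mu_i, mu_0, u, v):
    """Closed form: (1 - p_0) r_{v,0} - (1 - p_i) r_{v,i}."""
    p0 = penetrance(mu_0, u, v)
    pi = penetrance(mu_i, u, v)
    return (1 - p0) * hazard(v - mu_0) - (1 - pi) * hazard(v - mu_i)


a, u, v, h = 0.1, -1.0, 1.2, 1e-5  # illustrative values only
numeric = (log_grr(a, -a, u, v + h) - log_grr(a, -a, u, v - h)) / (2 * h)
analytic = dlog_grr_dv(a, -a, u, v)
print(numeric, analytic)  # the two values should agree closely
```

A positive value is consistent with the claim that the GRR increases in v for fixed u.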

We prove ∂/∂u log λi < 0 under some conditions and ∂/∂v log λi > 0 for any u < v. For any u < μ0, by property (iv), fu,0/Fu,0 < fu,i/Fu,i. Then lim_{v→∞} ∂/∂u log λi = fu,0/Fu,0 − fu,i/Fu,i < 0. Thus, there exists V > μ2 such that for any v > V and the given u < μ0, ∂/∂u log λi < 0. For the second one, since 1 − p0 > 1 − pi and rv,0 > rv,i, we have ∂/∂v log λi > 0 for any u < v. In the above results, either u or v is fixed. When they both change, let u = u(v) and denote u′v = ∂u/∂v ≤ 0. Then, using 1 − pi = Fu,i/(Fu,i + 1 − Fv,i),

$$\frac{\partial}{\partial v}\log\lambda_i = (1-p_0)\left(r_{v,0} + \frac{f_{u,0}}{F_{u,0}}\,u'_v\right) - (1-p_i)\left(r_{v,i} + \frac{f_{u,i}}{F_{u,i}}\,u'_v\right).$$

The coefficient of u′v is ∂/∂u log λi. When it is negative, any u′v ≤ 0 gives ∂/∂v log λi > 0. When it is positive, in order to have ∂/∂v log λi > 0, we only need u′v > −{(1 − p0)rv,0 − (1 − pi)rv,i}/{(1 − p0)fu,0/Fu,0 − (1 − pi)fu,i/Fu,i}. That is, u cannot decrease much faster than v increases. This leads to Result 2.

Result 2. When f(x - μ)/{1 - F (x - μ)} is an increasing function of x and f(x - μ) is symmetric about μ, the GRRs monotonically increase either when v increases for a fixed u < v, or when u decreases for a fixed v > u that is large enough, or when u decreases and v increases simultaneously provided that v > u is large enough.

The result shows that when v is not extreme enough, going extreme on u alone (extreme low trait values) does not necessarily increase the GRRs. Practically, it implies that, if one chooses the 100(1 − c2)th upper percentile for extremely high traits and the 100c1th lower percentile for extremely low traits for association studies, one should let c2 be smaller than c1 unless the cost of screening is a concern. On the other hand, using the thresholds X < u and X > v, the probability that an individual will be selected for genotyping decreases as u becomes smaller and/or v becomes larger, and converges to 0 as u → −∞ and v → ∞. Hence, there is a trade-off between the power/sample size and the cost of screening extreme samples.

We plot the values of the GRR λ2 due to extreme sampling with d = 0, u = K⁻¹(c1) and v = K⁻¹(1 − c2), where K(x) = q²F(x + a) + 2pqF(x) + p²F(x − a), p is the frequency of allele b, and q = 1 − p. We choose F(x) = Φ(x), the standard normal distribution function. The GRR λ2 = p2/p0 given a = 0.1, d = 0 and p = 0.3 is presented in Figure 1, where c1 (c2) = 0.01 (more extreme), 0.10, 0.20 (moderately extreme), 0.30 and 0.50. Figure 1 shows that the GRR increases with extreme high traits (larger v), but not necessarily with extreme low traits (smaller u); the latter depends on how extreme v is. The plots for other parameter values, including various allele frequencies (e.g., p = 0.15 or 0.45 and other a and d), are similar (the results are not presented here).
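The computation described above can be sketched in a few lines. The code below assumes F = Φ and inverts K by bisection; the general middle term F(x − d) reduces to the paper's F(x) when d = 0. The function names and the bisection bracket [−10, 10] are hypothetical choices made for this illustration.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # F = standard normal CDF, as chosen in the text


def mixture_cdf(x, a, d, p):
    """K(x) = q^2 F(x + a) + 2pq F(x - d) + p^2 F(x - a)."""
    q = 1.0 - p
    return q * q * PHI(x + a) + 2 * p * q * PHI(x - d) + p * p * PHI(x - a)


def mixture_quantile(c, a, d, p, lo=-10.0, hi=10.0):
    """K^{-1}(c) by bisection; K is continuous and strictly increasing."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, a, d, p) < c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def penetrance(mu, u, v):
    """p = (1 - F(v - mu)) / (F(u - mu) + 1 - F(v - mu))."""
    upper = 1.0 - PHI(v - mu)
    return upper / (PHI(u - mu) + upper)


def grr2(c1, c2, a=0.1, d=0.0, p=0.3):
    """lambda_2 = p_2 / p_0 with u = K^{-1}(c1) and v = K^{-1}(1 - c2)."""
    u = mixture_quantile(c1, a, d, p)
    v = mixture_quantile(1.0 - c2, a, d, p)
    return penetrance(a, u, v) / penetrance(-a, u, v)


# One row per c1; columns are c2 = c1, 0.01, 0.10, 0.50.
for c1 in (0.01, 0.10, 0.20, 0.30, 0.50):
    print(c1, [round(grr2(c1, c2), 3) for c2 in (c1, 0.01, 0.10, 0.50)])
```

Bisection is used here only to keep the sketch free of external dependencies; any root finder would serve.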

Figure 1. A plot of GRR λ2 given a = 0.1, d = 0, p = 0.3, u = K⁻¹(c1) and v = K⁻¹(1 − c2) for c1 (c2) = 0.01 (extreme truncation), 0.10, 0.20, 0.30, and 0.50.

The numerical values are also reported in Table 1 for three scenarios: (a) c1 = c2 and both change, (b) c1 is fixed but c2 changes, and (c) c2 is fixed but c1 changes. When p = 0.3, a = 0.1 and u = K⁻¹(0.20), λ2 increases monotonically from 1.126 with v = K⁻¹(0.50) to 1.496 with v = K⁻¹(0.90), and up to 2.154 with v = K⁻¹(0.99). In contrast, for fixed v = K⁻¹(0.90), λ2 rises slightly from 1.517 at u = K⁻¹(0.50) to 1.527 at u = K⁻¹(0.30) and then falls to 1.072 at u = K⁻¹(0.01). Thus, choosing more extreme low traits for a given v does not necessarily increase the GRRs.

Table 1. GRR λ2 under extreme sampling given a = 0.1, d = 0, p = 0.3, u = K⁻¹(c1) and v = K⁻¹(1 − c2). The smaller c1, the more extreme are the low trait values. The smaller c2, the more extreme are the high trait values. Results are reported for the three different scenarios (a)-(c).

GRR λ2 by c1 (rows) and c2 (columns)

c1      c2 = c1   c2 = 0.01   c2 = 0.10   c2 = 0.50
0.01    1.614     1.614       1.072       1.012
0.10    1.387     2.201       1.387       1.082
0.20    1.303     2.154       1.496       1.126
0.30    1.248     2.087       1.527       1.150
0.50    1.167     1.966       1.517       1.167

GRR λ2 by c2 (rows) and c1 (columns)

c2      c1 = 0.01   c1 = 0.10   c1 = 0.50
0.01    1.614       2.201       1.966
0.10    1.072       1.387       1.517
0.20    1.034       1.214       1.358
0.30    1.022       1.144       1.268
0.50    1.012       1.082       1.167

Impact on genetic models under extreme sampling

The genetic model under the random sampling is indexed by θ such that μ1 = (1-θ)μ0+θμ2 (i.e., d = −(1-θ)a +θa). Under the recessive, additive, and dominant models, θ = 0, 1/2, 1, respectively. We study how the genetic models change from random sampling to extreme sampling. When θ = 0 (or θ = 1), i.e., d = −a (or d = a), it is straightforward to show p0 = p1 (or p1 = p2). However, in general, when μ1 = (1-θ)μ0 +θμ2 for θ ∈ (0, 1), we do not have p1 = (1-θ)p0 +θp2 with the same genetic model θ. The above arguments are summarized in Result 3.

Result 3. For any u < v, the recessive (θ = 0) or dominant (θ = 1) models under the random sampling will be retained under the extreme sampling. However, the genetic model indexed by θ ∈ (0, 1) under the random sampling will not be retained under the extreme sampling.

The above result shows that the additive model under the random sampling would generally not be the additive model under the extreme sampling; whether it is depends on how (u, v) is chosen and on the shape of the distribution of X. In particular, to retain the additive model under the extreme sampling from the random sampling, u = K⁻¹(c1) and v = K⁻¹(1 − c2) have to satisfy 1 − F(u) = F(v), under which p1 = {1 − F(v − μ1)}/{F(u − μ1) + 1 − F(v − μ1)} = {1 − F(v)}/{F(u) + 1 − F(v)} = 1/2 (as μ1 = 0 under the additive model), p0 = F(u − a)/{F(u − a) + F(u + a)} and p2 = F(u + a)/{F(u − a) + F(u + a)}. Thus, p0 + p2 = 1, i.e., p1 = (p0 + p2)/2, which is the additive model under the extreme sampling.
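The condition 1 − F(u) = F(v) is equivalent to v = −u when F is symmetric about 0, so the retained additivity can be checked numerically. The sketch below assumes F = Φ and sets u and v directly rather than through K⁻¹; the values a = 0.1 and u = −1.3 are hypothetical.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # assumption: F is the standard normal CDF


def penetrance(mu, u, v):
    """p = (1 - F(v - mu)) / (F(u - mu) + 1 - F(v - mu))."""
    upper = 1.0 - PHI(v - mu)
    return upper / (PHI(u - mu) + upper)


a = 0.1          # illustrative effect size
u = -1.3         # illustrative lower threshold
v = -u           # symmetric thresholds, so that 1 - F(u) = F(v)

# Additive model under random sampling: (mu0, mu1, mu2) = (-a, 0, a).
p0, p1, p2 = (penetrance(m, u, v) for m in (-a, 0.0, a))
print(p1, (p0 + p2) / 2)  # both should equal 1/2 up to rounding

# With asymmetric thresholds the midpoint identity no longer holds exactly.
q0, q1, q2 = (penetrance(m, u, 2.0) for m in (-a, 0.0, a))
print(q1 - (q0 + q2) / 2)
```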

To examine the deviation of the genetic model under the extreme sampling from the genetic model under the random sampling, we define an induced model as θ* = (λ1 − 1)/(λ2 − 1) = (p1 − p0)/(p2 − p0) and compare it to θ. The numerical values are reported in Table 2, which shows that θ* is actually quite close to θ, especially when c1 = c2 = c, which corresponds to u = K⁻¹(c) and v = K⁻¹(1 − c). Allele frequency has little impact on the difference between the genetic models under the random sampling and extreme sampling. Hence, the power of testing association using extreme traits would be affected little or not at all when θ is used as the genetic model under extreme sampling even though θ* is the true model.
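The induced model θ* can be computed from the penetrances in the same way as the GRRs. The sketch below assumes F = Φ and takes K with middle term F(x − d), where d = (1 − θ)(−a) + θa; the function names and the bisection bracket are hypothetical choices for this illustration.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # assumption: F is the standard normal CDF


def mixture_cdf(x, a, d, p):
    """K(x) = q^2 F(x + a) + 2pq F(x - d) + p^2 F(x - a)."""
    q = 1.0 - p
    return q * q * PHI(x + a) + 2 * p * q * PHI(x - d) + p * p * PHI(x - a)


def mixture_quantile(c, a, d, p, lo=-10.0, hi=10.0):
    """K^{-1}(c) by bisection; K is continuous and strictly increasing."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, a, d, p) < c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def penetrance(mu, u, v):
    """p = (1 - F(v - mu)) / (F(u - mu) + 1 - F(v - mu))."""
    upper = 1.0 - PHI(v - mu)
    return upper / (PHI(u - mu) + upper)


def theta_star(theta, c1, c2, a=0.1, p=0.3):
    """Induced model (p1 - p0) / (p2 - p0) under extreme sampling."""
    d = (1 - theta) * (-a) + theta * a
    u = mixture_quantile(c1, a, d, p)
    v = mixture_quantile(1.0 - c2, a, d, p)
    p0, p1, p2 = (penetrance(m, u, v) for m in (-a, d, a))
    return (p1 - p0) / (p2 - p0)


for theta in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(theta, round(theta_star(theta, c1=0.5, c2=0.5), 3))
```

The endpoints θ = 0 and θ = 1 return θ* = 0 and θ* = 1, in line with Result 3.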

Table 2. The induced genetic model θ* based on the GRRs under extreme sampling given a = 0.1, d = (1 − θ)(−a) + θa, p = 0.15, 0.30 and 0.45, u = K⁻¹(c1) and v = K⁻¹(1 − c2). The entries are the values of θ* given θ.

p      c1     c2     θ = 0.1   θ = 0.3   θ = 0.5   θ = 0.7   θ = 0.9
0.15   0.01   0.01   0.105     0.316     0.524     0.724     0.911
0.15   0.01   0.30   0.136     0.381     0.593     0.774     0.930
0.15   0.01   0.50   0.132     0.371     0.582     0.766     0.927
0.15   0.30   0.01   0.071     0.230     0.413     0.624     0.816
0.15   0.30   0.30   0.101     0.305     0.507     0.707     0.903
0.15   0.30   0.50   0.109     0.321     0.525     0.721     0.909
0.15   0.50   0.01   0.074     0.236     0.421     0.631     0.869
0.15   0.50   0.30   0.093     0.284     0.481     0.685     0.894
0.15   0.50   0.50   0.100     0.301     0.502     0.702     0.901
0.30   0.01   0.01   0.102     0.308     0.514     0.715     0.908
0.30   0.01   0.30   0.136     0.380     0.592     0.774     0.930
0.30   0.01   0.50   0.132     0.371     0.581     0.766     0.927
0.30   0.30   0.01   0.071     0.229     0.411     0.623     0.865
0.30   0.30   0.30   0.100     0.302     0.503     0.704     0.902
0.30   0.30   0.50   0.108     0.320     0.524     0.720     0.909
0.30   0.50   0.01   0.073     0.235     0.420     0.630     0.869
0.30   0.50   0.30   0.093     0.283     0.480     0.684     0.893
0.30   0.50   0.50   0.100     0.301     0.501     0.701     0.901
0.45   0.01   0.01   0.098     0.299     0.503     0.707     0.905
0.45   0.01   0.30   0.136     0.379     0.591     0.773     0.930
0.45   0.01   0.50   0.132     0.371     0.581     0.765     0.927
0.45   0.30   0.01   0.070     0.228     0.410     0.621     0.865
0.45   0.30   0.30   0.099     0.300     0.501     0.702     0.901
0.45   0.30   0.50   0.108     0.318     0.522     0.719     0.908
0.45   0.50   0.01   0.073     0.235     0.420     0.630     0.869
0.45   0.50   0.30   0.092     0.282     0.479     0.682     0.893
0.45   0.50   0.50   0.100     0.300     0.500     0.701     0.900

Discussion

We have shown that, when using extreme high or low traits for association studies, the power is increased because the GRRs are increased and not because the model space (the space for the alternative hypothesis of association) is reduced. Although the genetic models generally change from random sampling to extreme sampling, except for the recessive and dominant models, the changes are small, so the statistical power for association studies is not expected to be affected. Therefore, when the true genetic model is known, the same statistic used for analyzing the association under the random sampling should also be used for analyzing the association with extreme sampling. On the other hand, the true genetic model for many complex traits is rarely known. Genetic association studies based on a single genetic model may not be robust. Hence, robust tests (Freidlin et al., 2002; So & Sham, 2011) may be applied to detect association. Our results imply that the same robust tests under the random sampling can also be applied under extreme sampling. For replication studies, as far as the modes of inheritance and the model space are concerned, our results imply that it is valid to use samples with extreme traits, especially the higher traits, to replicate the results obtained based on random samples or less extreme traits. Finally, a limitation of our results is that they are derived assuming that the marker is in complete LD with a functional locus. In practice, however, markers are likely to be in incomplete LD with the functional loci. In this case, the genetic models at the markers of interest are more complicated and may not be the same as those of the functional loci (Zheng et al., 2009b).

Acknowledgments

The work of J Xu was partially supported by Research Grant No. R-155-000-112-112 of National University of Singapore. The work of A Yuan was supported in part by the National Center for Research Resources at NIH grant 2G12RR003048. The authors have no conflict of interest.

References

  1. Abecasis GR, Cookson WO, Cardon LR. The power to detect linkage disequilibrium with quantitative traits in selected samples. Am J Hum Genet. 2001;68:1463–1474. doi: 10.1086/320590.
  2. Chen HY, Li M. Improving power and robustness for detecting genetic association with extreme-value sampling design. Genet Epidemiol. 2011;35:823–830. doi: 10.1002/gepi.20631.
  3. Chen Z, Zheng G, Ghosh K, Li Z. Linkage disequilibrium mapping of quantitative-trait loci by selective genotyping. Am J Hum Genet. 2005;77:661–669. doi: 10.1086/491658.
  4. Forest WF, Feingold E. Composite statistics for QTL mapping with moderately discordant sibling pairs. Am J Hum Genet. 2000;66:1642–1660. doi: 10.1086/302897.
  5. Freidlin B, Zheng G, Li Z, Gastwirth JL. Trend tests for case-control studies of genetic markers: power, sample size and robustness. Hum Hered. 2002;53:146–152. doi: 10.1159/000064976.
  6. Guey LT, Kravic J, Melander O, Burtt NP, Laramie JM, Lyssenko V, Jonsson A, Lindholm E, Tuomi T, Isomaa B, Nilsson P, Almgren P, Kathiresan S, Groop L, Seymour AB, Altshuler D, Voight BF. Power in the phenotype extremes: a simulation study of power in discovery and replication of rare variants. Genet Epidemiol. 2011;35:236–246. doi: 10.1002/gepi.20572.
  7. Li D, Lewinger JP, Gauderman WJ, Elizabeth CE, Conti D. Using extreme phenotype sampling to identify the rare causal variants of quantitative traits in association studies. Genet Epidemiol. 2011;35:790–799. doi: 10.1002/gepi.20628.
  8. Risch N, Zhang H. Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science. 1995;268:1584–1589. doi: 10.1126/science.7777857.
  9. Sims AM, Shephard N, Carter K, Doan T, Dowling A, Duncan EL, Eisman J, Jones G, Nicholson G, Prince R, Seeman E, Thomas G, Wass JA, Brown MA. Genetic analyses in a sample of individuals with high or low BMD shows association with multiple Wnt pathway genes. J Bone Mineral Res. 2008;23:499–506. doi: 10.1359/jbmr.071113.
  10. Slatkin M. Disequilibrium mapping of a quantitative trait locus in an expanding population. Am J Hum Genet. 1999;64:1765–1773. doi: 10.1086/302413.
  11. So H-C, Sham PC. Robust association tests under different genetic models, allowing for binary or quantitative traits and covariates. Behav Genet. 2011;41:768–775. doi: 10.1007/s10519-011-9450-9.
  12. Xiong M, Fan RZ, Jin L. Linkage disequilibrium mapping of quantitative trait loci under truncation selection. Hum Hered. 2002;53:158–172. doi: 10.1159/000064978.
  13. Xu X, Rogus JJ, Terwedow HA, Yang J, Wang Z, Chen C, Niu T, Wang B, Xu H, Weiss S, Schork NJ, Fang Z. An extreme-sib-pair genome scan for genes regulating blood pressure. Am J Hum Genet. 1999;64:1694–1701. doi: 10.1086/302405.
  14. Yang J, Wray NR, Visscher PM. Comparing apples and oranges: equating the power of case-control and quantitative trait association studies. Genet Epidemiol. 2010;34:254–257. doi: 10.1002/gepi.20456.
  15. Zheng G, Ghosh K, Chen Z, Li Z. Extreme rank selection for linkage analysis of quantitative trait loci using selected sib-pairs. Ann Hum Genet. 2006;70:857–866. doi: 10.1111/j.1469-1809.2006.00268.x.
  16. Zheng G, Joo J, Yang Y. Pearson’s test, trend test, and MAX are all trend tests with different types of scores. Ann Hum Genet. 2009a;73:133–140. doi: 10.1111/j.1469-1809.2008.00500.x.
  17. Zheng G, Joo J, Zaykin D, Wu CO, Geller NL. Robust tests in genome-wide scans under incomplete linkage disequilibrium. Stat Sci. 2009b;24:503–516. doi: 10.1214/09-sts314.
