Author manuscript; available in PMC: 2025 Apr 9.
Published in final edited form as: Stat Appl Genet Mol Biol. 2007 Apr 17;6:Article12. doi: 10.2202/1544-6115.1264

Sequential Quantitative Trait Locus Mapping in Experimental Crosses

Jaya M Satagopan 1, Saunak Sen 2, Gary A Churchill 3
PMCID: PMC11980321  NIHMSID: NIHMS2071553  PMID: 17474878

Abstract

The etiology of complex diseases is heterogeneous. The presence of risk alleles in one or more genetic loci affects the function of a variety of intermediate biological pathways, resulting in the overt expression of disease. Hence, there is an increasing focus on identifying the genetic basis of disease by systematically studying phenotypic traits pertaining to the underlying biological functions. In this paper we focus on identifying genetic loci linked to quantitative phenotypic traits in experimental crosses. Such genetic mapping methods often use a one stage design by genotyping all the markers of interest on the available subjects. A genome scan based on single locus or multi-locus models is used to identify the putative loci. Since the number of quantitative trait loci (QTLs) is very likely to be small relative to the number of markers genotyped, a one-stage selective genotyping approach is commonly used to reduce the genotyping burden, whereby markers are genotyped solely on individuals with extreme trait values. This approach is powerful in the presence of a single quantitative trait locus (QTL) but may result in substantial loss of information in the presence of multiple QTLs. Here we investigate the efficiency of sequential two stage designs to identify QTLs in experimental populations. Our investigations for backcross and F2 crosses suggest that genotyping all the markers on 60% of the subjects in Stage 1 and genotyping the chromosomes significant at 20% level using additional subjects in Stage 2 and testing using all the subjects provides an efficient approach to identify the QTLs and utilizes only 70% of the genotyping burden relative to a one stage design, regardless of the heritability and genotyping density. Complex traits are a consequence of multiple QTLs conferring main effects as well as epistatic interactions. We propose a two-stage analytic approach where a single-locus genome scan is conducted in Stage 1 to identify promising chromosomes, and interactions are examined using the loci on these chromosomes in Stage 2. We examine settings under which the two-stage analytic approach provides sufficient power to detect the putative QTLs.

Keywords: two-stage design, Fisher information, backcross, F2 cross

Introduction

Complex diseases have a heterogeneous etiology. Risk alleles present in a genetic locus or their simultaneous presence in multiple loci may affect a variety of intermediate biological functions. Changes in any one or more intermediate functions result in the overt expression of disease. Modern genetic studies are, therefore, increasingly focusing on systematically evaluating specific disease pathways in an effort to identify disease risk factors in an efficient manner (Thomas 2005). Quantitative (or qualitative) measurements corresponding to a specific phenotype underlying the disease are obtained on all the study subjects. The genetic loci linked to the phenotypic traits are identified to determine those conferring disease risk through specific biological mechanisms. Suppose the disease of interest is heart disease. Some examples of intermediate phenotypic traits are cholesterol level, blood pressure, body mass index, and urinary free cortisol. A heart disease patient is very likely to have abnormal levels of at least one of these phenotypic traits. Genetic loci linked to these traits may provide insights into biological mechanisms underlying the etiology of heart disease. Due to recent developments in biotechnology, it is now becoming increasingly feasible to simultaneously examine a large number of phenotypes at the molecular level (for example, gene expressions). In this paper we consider identifying genetic loci related to a single quantitative phenotypic trait in experimental crosses. We focus on sequential methods to identify the trait loci (or, equivalently, disease loci) in an efficient manner.

Quantitative trait studies in experimental crosses proceed by first obtaining a desired cross (for example, backcross or F2) using two parental strains. Genotypes at several loci and the phenotypic trait are measured on multiple progeny from this cross. The quantitative trait loci (QTLs) linked to the trait are identified using a relevant analytic approach such as interval mapping (Lander and Botstein 1989). A one-stage genotyping approach is often utilized, whereby an ensemble of loci (markers) is genotyped on all the n available subjects (progeny). The genome is scanned for the presence of QTLs by fitting a single QTL model at various loci. The genotyping cost (or genotyping burden) of this strategy is proportional to genotyping all the markers on the n subjects. Often the number of QTLs may be small relative to the number of markers genotyped. Hence, it may be pragmatic to consider a strategy that would require less genotyping to identify the QTLs with adequate power.

Selective genotyping is a one-stage approach commonly used to minimize the genotyping burden (Lander and Botstein 1989; Darvasi and Soller 1992, 1994). This is an outcome-based sampling approach where all the markers are genotyped on subjects with extreme trait values alone. When a single QTL is associated with the phenotypic trait, genotyping only a quarter of individuals from each extreme (i.e., one half of the n subjects) provides most of the linkage information compared to genotyping all the n subjects. However, selective genotyping may not be an efficient strategy when the variation in the trait is explained by multiple QTLs, at least one of which has a large effect (Sen et al. 2005). In such cases, the fraction of missing information can be as high as 50% if only one-half of the subjects having extreme trait values are genotyped. It is, therefore, important to devise a genotyping approach, whereby all the subjects (i.e., those with extreme as well as intermediate trait values) are genotyped, but without having to genotype every marker on every progeny, unless warranted otherwise.

Here we propose a two-stage sequential genotyping approach that requires less genotyping than a one-stage approach. Under the proposed method, several markers not related to the trait can be eliminated early on in the study by first evaluating all the markers on a subset of the available progeny. Only those markers showing promising evidence for association with the trait are further genotyped on the remaining progeny to identify the QTLs. Such cost-efficient two stage genotyping designs have been proposed to identify disease loci in human linkage and association studies (Elston 1994; Elston et al. 1996, 2007; Satagopan et al. 2002, 2004; Satagopan and Elston 2003; Wang et al. 2006). In this paper we develop this method for QTL mapping in experimental crosses. Any sequential approach can result in loss of power when the total sample size is fixed since a marker linked to the trait may be incorrectly eliminated in the first stage. This issue can be addressed by genotyping an appropriate subset of individuals in the first stage and by setting the corresponding significance level to identify the promising markers for further evaluation in the second stage. We show that the expected Fisher information of the two-stage design relative to a one-stage design has a simple form that can be used to identify the sample size and significance level for the first stage such that the loss of power and the genotyping burden relative to a one-stage design are minimized. Epistasis or interactions between multiple QTLs form an important characteristic of complex phenotypic traits. Conducting a genome scan using single QTL models may not be a powerful approach to identify all the QTLs (Broman and Speed 1999). The genome must be scanned using multi-locus models to simultaneously identify relevant QTLs, which can quickly become an arduous task. Two-stage analysis may be useful for circumventing this issue (Marchini et al. 2005). Here we examine a two-stage analytic approach where a genome scan using single locus models is conducted in the first stage to identify promising chromosomal regions. Interactions between loci in these regions alone are evaluated in the second stage using multi-locus models.

The goals of this paper are to investigate the power trade-offs between one-stage versus two-stage methods and to evaluate the optimal strategies. In the next section we describe the characteristics of a one stage design by providing the overall significance level, power, and the genotyping burden of this approach. Next we develop the two stage design, derive simple equations to obtain the optimal design parameters, and evaluate the operating characteristics. These methods are first developed for a backcross, and subsequently described for an F2 intercross, assuming a single locus model. This is followed by an evaluation of two stage designs for a backcross when two unlinked QTLs are associated with the trait. Finally, we examine a two-stage analytic approach to identify multiple QTLs conferring main effects and epistatic effects on the trait.

One Stage Design

Consider a genome of interest with $C$ chromosomes, and $K_c$ markers on chromosome $c$ ($1 \le c \le C$). The total number of markers is $K = \sum_{c=1}^{C} K_c$. Let $L$ denote the length of the genome, and $\Delta$ represent the average genotyping density (i.e., distance between the markers), both in centiMorgan units. Under a one stage design all the $K$ markers are genotyped on all the $n$ available subjects. We first outline the general concept in the context of a backcross population using a single locus model, assuming that a single QTL is associated with the trait.

Let $g_i$ and $y_i$ denote the genotype at the QTL and the phenotypic trait, respectively, in a backcross subject $i$ ($i = 1, \ldots, n$). Without loss of generality, the genotype at a locus is coded 0 or 1, each having marginal probability 1/2. The single locus model is given by:

$$y_i = \delta(2g_i - 1) + \epsilon_i, \tag{1}$$

where $\delta$ is the effect of the QTL. Further, $\epsilon_i \sim N(0, \sigma^2)$ is the random error, where $\sigma^2$ is the error variance. We shall assume $\sigma^2 = 1$. While $\sigma^2$ is estimated in a practical data analysis setting, the assumption of unit error variance during study design is simply a matter of scaling the QTL effect. Hence, $\delta/\sigma$ is interpreted as the standardized QTL effect or the effect size. In a design setting, the investigator strives to determine the power or sample size to detect a desired effect $\delta$ under some assumed $\sigma^2$. This is equivalent to designing a study to detect a desired effect size $\delta/\sigma$. Hence, without loss of generality, we consider $\sigma^2 = 1$ throughout this paper.

The phenotypic trait is marginally distributed as Gaussian with mean 0 and variance $\tau^2 = 1 + \delta^2$. We assume that the parameter estimates and test statistics are calculated based on the interval mapping approach (Lander and Botstein 1989). In practice we observe the marker genotypes but not the QTL genotypes. Suppose we investigate evidence for the presence of a QTL at a locus flanked by two markers such that the recombination fraction between the locus and the left flanking marker is $r_1$. Let $r$ be the recombination fraction between the flanking markers. Both $r$ and $r_1$ can be calculated as a function of genetic distance using the Haldane mapping function (Ott 1991). Denote $\phi(\cdot)$ as the probability density of a standard normal distribution. Suppose we are testing for the presence of a QTL at a particular genetic locus. Let $q_i$ denote the conditional probability that the QTL genotype at that locus is 1, given the flanking marker genotypes. The values of $r_1$, $r$ and, hence, $q_i$ can be easily obtained and treated as known during the analysis of that locus. The log-likelihood corresponding to model (1), evaluated at the locus under investigation, is given by:

$$l(\delta; r_1) = \sum_{i=1}^{n} \log\left\{ q_i\,\phi(y_i - \delta) + (1 - q_i)\,\phi(y_i + \delta) \right\}. \tag{2}$$

The score statistic, denoted Z, for testing the null hypothesis of no QTL (δ=0) is given by:

$$Z = \frac{l'(\delta = 0; r_1)}{\sqrt{I(\delta = 0; r_1, n)}}, \tag{3}$$

where $l'(\delta = 0; r_1)$ is the derivative of the log-likelihood with respect to $\delta$, evaluated at $\delta = 0$, and $I(\delta = 0; r_1, n)$ is the Fisher information corresponding to $\delta$, also evaluated at $\delta = 0$ using a sample of $n$ individuals. Following the general theory of likelihoods and score statistics (Cox and Hinkley 1974), $Z$ is distributed as $N\left(\delta\sqrt{I(\delta = 0; r_1, n)},\, 1\right)$. [The mean of $Z$ is outlined in the Appendix.] Hence, $Z^2$ has a $\chi^2$ distribution with 1 degree of freedom and non-centrality parameter

$$\lambda(\delta, n) = I(\delta = 0; r_1, n)\,\delta^2. \tag{4}$$

Clearly, the non-centrality parameter is 0 under the null hypothesis. The information depends upon the distance between the flanking markers and the location of the QTL within the marker interval, and information is the least when the QTL is located in the middle of the marker interval (Sen et al. 2005). Therefore, throughout this paper we consider designing a QTL study under the assumption that the QTL is located in the center of the marker interval. The Fisher information is the expected value of the negative second derivative of equation (2) with respect to $\delta$, given by $I(\delta = 0; r_1, n) = nQ_r$, where $Q_r = [1 - 4q(1-q)](1-r)$ and $q(1-q) = r_1^2(1-r_1)^2 / \left[r_1^2 + (1-r_1)^2\right]^2$ [see Sen et al. 2005]. Here $r$ and $r_1$ are the recombination fractions corresponding to distances of $\Delta$ and $\Delta/2$ centiMorgans, respectively. Below we describe the overall significance level, power, and the associated genotyping burden.
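The quantities in this subsection can be sketched numerically. The following is a minimal illustration, not the authors' software: the function names (`haldane_r`, `q_r`, `cond_prob`, `simulate_backcross`, `score_z`) are hypothetical, the simulation assumes no crossover interference (consistent with the Haldane map function), and the score at $\delta = 0$ is written as $\sum_i (2q_i - 1)y_i$, which follows from differentiating equation (2).

```python
import math
import random

def haldane_r(d_cm):
    """Recombination fraction for a distance in centiMorgans (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def q_r(spacing_cm):
    """Per-subject information Q_r for a QTL midway between markers spacing_cm apart."""
    r = haldane_r(spacing_cm)
    r1 = haldane_r(spacing_cm / 2.0)
    q1q = r1 ** 2 * (1 - r1) ** 2 / (r1 ** 2 + (1 - r1) ** 2) ** 2
    return (1 - 4 * q1q) * (1 - r)

def cond_prob(left, right, r1):
    """P(QTL genotype = 1 | flanking marker genotypes), QTL at the interval midpoint."""
    if left != right:
        return 0.5  # recombinant flanking markers carry no information
    same, diff = (1 - r1) ** 2, r1 ** 2
    return same / (same + diff) if left == 1 else diff / (same + diff)

def simulate_backcross(n, delta, spacing_cm, rng):
    """Simulate traits from model (1) and return (traits, conditional probabilities q_i)."""
    r1 = haldane_r(spacing_cm / 2.0)
    y, q = [], []
    for _ in range(n):
        g = rng.randint(0, 1)
        left = g if rng.random() >= r1 else 1 - g    # recombination on the left
        right = g if rng.random() >= r1 else 1 - g   # independent recombination on the right
        y.append(delta * (2 * g - 1) + rng.gauss(0.0, 1.0))
        q.append(cond_prob(left, right, r1))
    return y, q

def score_z(y, q, spacing_cm):
    """Score statistic of equation (3) with information n * Q_r."""
    score = sum((2 * qi - 1) * yi for qi, yi in zip(q, y))
    return score / math.sqrt(len(y) * q_r(spacing_cm))

y, q = simulate_backcross(500, 0.2, 10.0, random.Random(1))
print(round(q_r(10.0), 4), round(score_z(y, q, 10.0), 2))
```

The seed, sample size and 10 cM spacing in the last two lines are arbitrary choices made only to show a call.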

Overall Significance Level:

We calculate test statistics $Z^2(t)$ at every locus $t$ on the genome. The overall significance level $\alpha$ is the probability that the genome-wide maximum test statistic, $\max_t Z^2(t)$, exceeds a threshold $b^2(\alpha)$ under the null hypothesis. Denote $P_0(\cdot)$ as the probability of an event under the null hypothesis of no QTL. The significance level can be approximated as (Feingold et al. 1993; Dupuis and Siegmund 1999):

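$$\alpha = P_0\left\{\max_t Z^2(t) > b^2(\alpha)\right\} = 1 - \exp\left\{-C\left[1 - \mathcal{X}_1^2(b^2)\right] - 0.04\,L\,b\,\phi(b)\,v\!\left(b\sqrt{0.04\Delta}\right)\right\}, \tag{5}$$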

where $\mathcal{X}_s^2(x)$ is the cumulative probability of a $\chi^2$ distribution with $s$ degrees of freedom corresponding to critical value $x$, and $v(u) = \exp\{-0.583u\}$. We have denoted $b^2(\alpha)$ simply as $b^2$ in the second step of the above equation. Although this approximation was derived based on the likelihood ratio statistic, it is applicable to the current setting where $Z(t)$ is the score statistic due to the asymptotic equivalence between $Z^2(t)$ and the likelihood ratio statistic evaluated at locus $t$ (Cox and Hinkley 1974).
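As an illustration of how equation (5) might be evaluated and inverted for the threshold $b$, here is a minimal sketch. It assumes $L$ and $\Delta$ are both entered in centiMorgans exactly as the equation is written, uses the identity $1 - \mathcal{X}_1^2(b^2) = 2[1 - \Phi(b)]$, and solves for $b$ by bisection. The function names are hypothetical, and the thresholds it returns depend on these scaling assumptions, so they are not claimed to reproduce any value quoted in the text.

```python
import math
from statistics import NormalDist

_STD = NormalDist()

def genomewide_alpha(b, n_chrom, genome_cm, spacing_cm):
    """Approximate P0(max_t Z^2(t) > b^2) following equation (5)."""
    upper_tail = 2.0 * (1.0 - _STD.cdf(b))  # 1 - chi-square (1 df) CDF at b^2
    nu = math.exp(-0.583 * b * math.sqrt(0.04 * spacing_cm))
    rate = n_chrom * upper_tail + 0.04 * genome_cm * b * _STD.pdf(b) * nu
    return 1.0 - math.exp(-rate)

def threshold(alpha, n_chrom, genome_cm, spacing_cm, lo=1.0, hi=10.0):
    """Solve genomewide_alpha(b) = alpha for b by bisection on [lo, hi].

    The approximation is decreasing in b for b >= 1, so the search starts there.
    """
    if genomewide_alpha(lo, n_chrom, genome_cm, spacing_cm) <= alpha:
        raise ValueError("alpha is too large for the search interval")
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if genomewide_alpha(mid, n_chrom, genome_cm, spacing_cm) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical genome: 20 chromosomes of 100 cM each, markers every 1 cM.
b = threshold(0.05, 20, 2000.0, 1.0)
print(round(b, 3), round(genomewide_alpha(b, 20, 2000.0, 1.0), 4))
```

Bisection is used only because it needs no derivative; any one-dimensional root finder would serve equally well.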

Power:

Suppose our goal is to identify the QTL having effect $\delta$ on the trait with power $1-\beta$ using a single locus model (equation 1). Consider the following two events. Event 1 corresponds to $\max_t Z^2(t)$ exceeding $b^2$ in the presence of a QTL, and Event 2 corresponds to the test statistic $Z^2$ at the QTL position exceeding $b^2$. Let $P_A(\cdot)$ denote the probability of an event under the alternative hypothesis. Clearly, $P_A(\text{Event 2}) \le P_A(\text{Event 1})$. Our working definition of power is the probability of Event 2 under the alternative hypothesis. Therefore, the power $1-\beta$ can be written as

$$1 - \beta = P_A\left\{Z^2 > b^2(\alpha)\right\} \approx 1 - \Phi\left(b(\alpha) - \sqrt{\lambda(\delta, n)}\right), \tag{6}$$

where Φ(.) is the cumulative probability of a standard normal distribution.

The sample size $n$, significance level $\alpha$ and power $1-\beta$ are related through the following equation:

$$n = \frac{1}{\delta^2 \times [1 - 4q(1-q)] \times (1 - r)} \times \left[b(\alpha) - \Phi^{-1}(\beta)\right]^2. \tag{7}$$
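Equations (6) and (7) can be checked with a few lines of code. The sketch below is only illustrative: the function names are hypothetical, the threshold $b$ is supplied by the user rather than derived, and the example inputs ($\delta = 0.20$, $b = 3.95$ at $\Delta = 1$ cM, $b = 3.90$ at $\Delta = 20$ cM) are the values the text quotes later when discussing why $\Delta$ does not affect the optimal parameters.

```python
import math
from statistics import NormalDist

_STD = NormalDist()

def haldane_r(d_cm):
    """Recombination fraction for a distance in centiMorgans (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def q_r(spacing_cm):
    """Q_r = [1 - 4q(1-q)](1 - r) for a QTL at the midpoint of the marker interval."""
    r = haldane_r(spacing_cm)
    r1 = haldane_r(spacing_cm / 2.0)
    q1q = r1 ** 2 * (1 - r1) ** 2 / (r1 ** 2 + (1 - r1) ** 2) ** 2
    return (1 - 4 * q1q) * (1 - r)

def sample_size(delta, b, beta, spacing_cm):
    """Equation (7): n needed to detect effect delta with power 1 - beta at threshold b."""
    return (b - _STD.inv_cdf(beta)) ** 2 / (delta ** 2 * q_r(spacing_cm))

def power(delta, n, b, spacing_cm):
    """Equation (6): approximate power at the QTL position."""
    lam = n * q_r(spacing_cm) * delta ** 2  # non-centrality parameter, equation (4)
    return 1.0 - _STD.cdf(b - math.sqrt(lam))

print(round(sample_size(0.20, 3.95, 0.20, 1.0)))   # about 580
print(round(sample_size(0.20, 3.90, 0.20, 20.0)))  # about 700
```

Both results agree with the sample sizes of 580 and 700 reported in the text for these inputs.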

Genotyping Burden:

Under a one stage design a total of $K$ markers are genotyped on $n$ subjects. Therefore, the genotyping burden, i.e., the amount of genotyping, is:

$$T_1 = n \times K. \tag{8}$$

Two Stage Sequential Genotyping Design

First obtain the sample size $n$ to detect a desired QTL effect with power $1-\beta$ and significance level $\alpha$ using equation (7) as if we were conducting a one-stage design. This will be our given sample size. Obtain the trait values of these $n$ subjects. Our goal is to detect the QTL by genotyping these $n$ subjects in an efficient manner without having to genotype all the markers on all the subjects. The two stage sequential genotyping design proceeds as follows. Given the trait values of these $n$ subjects, genotype all the desired markers on a random subset of $n\theta_1$ ($0 < \theta_1 < 1$) individuals in Stage 1. Identify chromosomes significant at level $\alpha_1$. The markers on these chromosomes are genotyped on the remaining $n(1-\theta_1)$ subjects in Stage 2 and tested using all the $n$ individuals at level $\alpha_2$. The sampling fraction $\theta_1$ and the significance levels $\alpha_1$ and $\alpha_2$ are unknown, and form the two stage design parameters. This two stage approach involves less genotyping than a one stage design. Hence, if the cost of genotyping were linearly related to the amount of genotyping, this design would provide a cost-effective genotyping approach. However, any sequential approach would result in loss of power relative to a one stage approach. For an appropriate choice of $\theta_1$, $\alpha_1$, and $\alpha_2$, it may be feasible to design a study involving substantially less genotyping but minimum loss of power, while ensuring an overall significance level of $\alpha$. Below we investigate the operating characteristics of two stage designs to identify such $\theta_1$, $\alpha_1$ and $\alpha_2$. Our investigations focus on the following two questions: (1) Given the phenotypic traits of $n$ individuals, what subgroup size should be used for genotyping in Stage 1? (2) How should the chromosomes be prioritized for further evaluation in the next stage so that the putative QTLs can be identified efficiently with minimum genotyping burden?

Overall Significance Level of the Two Stage Design

The overall significance level $\alpha$ is fixed at the outset, and is the probability of finding a false positive QTL, i.e., the probability that the genome-wide maximum test statistic exceeds a significance threshold at the end of Stage 2 under the null hypothesis. Since the $C$ chromosomes are unlinked, we can apply the Bonferroni correction and test each chromosome at level $\alpha/C$. The significance level $\alpha_1$ for testing a single chromosome in Stage 1 represents the probability that the maximum test statistic on a chromosome exceeds a critical value $b_1^2$ under the null hypothesis. The significance level in Stage 2, denoted $\alpha_2$, is the conditional probability under the null hypothesis that the maximum test statistic on chromosome $c$ exceeds a critical value $b_2^2$ at the end of Stage 2, given that the chromosome was declared significant in Stage 1. The following results are derived in the Appendix.

Result 1 For any chromosome, the significance level for testing in Stage 1 is greater than the overall significance level: $\alpha_1 > \alpha/C$.

Result 2 The Stage 2 significance level is given by: $\alpha_2 = \alpha/(C \times \alpha_1)$.

Result 3 When the critical value for Stage 2 is equal to the critical value of a one stage design, i.e., when $b_2^2 = b^2$, the overall significance level of the two-stage design is at most $\alpha$.

Result 1 is consistent with the intuition that the study may not have sufficient power to detect a QTL at significance level $\alpha/C$ when genotyping only $n\theta_1$ ($< n$) subjects. Hence, the significance testing must be conducted at level $\alpha_1 > \alpha/C$ at Stage 1 in order to have sufficient power to identify the chromosomes containing QTLs. Result 2 suggests that $\alpha_2$ is determined once $\alpha_1$ is known. Consequently, the only unknown parameters defining a two stage design are $\theta_1$ (the sampling fraction for Stage 1), and $\alpha_1$ (the per-chromosome significance level in Stage 1). Result 3 suggests that it is not necessary to derive a new critical value for testing the chromosomes in Stage 2 corresponding to significance level $\alpha_2$. After the completion of Stage 1, the chromosomes genotyped in Stage 2 can be evaluated using the same critical value as that under a one stage design in order to obtain a per-chromosome overall significance level of at most $\alpha/C$.
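A small numerical illustration of Results 1 and 2 follows. The helper name `stage2_level` is hypothetical; it simply encodes the requirement $\alpha_1 > \alpha/C$ and the relation between the per-chromosome level $\alpha/C$, $\alpha_1$ and $\alpha_2$.

```python
def stage2_level(alpha, n_chrom, alpha1):
    """Conditional Stage 2 level implied by Result 2 for a given Stage 1 level."""
    per_chrom = alpha / n_chrom  # Bonferroni level for one chromosome
    if not per_chrom < alpha1 < 1.0:
        raise ValueError("Result 1 requires alpha/C < alpha1 < 1")
    return per_chrom / alpha1

# With alpha = 0.05, C = 20 and alpha1 = 0.20, the per-chromosome level is 0.0025.
print(stage2_level(0.05, 20, 0.20))  # 0.0125, up to floating-point rounding
```

By Result 3, this value need not be converted into a new critical value in practice, because the one stage threshold can be reused in Stage 2.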

Power of the Two Stage Design

The power $P^*$ of the proposed two stage design is the probability that the test statistic at the QTL exceeds $b^2$ at the end of Stage 2. Let $Z_1$ and $Z_2$ denote the score statistics calculated at the QTL position in Stages 1 and 2 using $n\theta_1$ and $n$ individuals, respectively. Further, $Z_1 \sim N\left(\sqrt{n\theta_1 Q_r}\,\delta,\, 1\right)$ and $Z_2 \sim N\left(\sqrt{nQ_r}\,\delta,\, 1\right)$. These score statistics are equivalent to the test statistics $S_1 = \sqrt{n\theta_1 Q_r}\,Z_1$ and $S_2 = \sqrt{nQ_r}\,Z_2$.

In a one-stage design we would observe the test statistic $S_2$ (equivalently, $Z_2$) at the QTL position. Under a two-stage design, the QTL region will be evaluated in Stage 2 only if the corresponding chromosome is selected in Stage 1. The test statistic at the QTL position obtained in Stage 2 is $S_2$. If the chromosome is not selected in Stage 1, we will only calculate the test statistic $S_1$. A chromosome is evaluated in Stage 2 if the maximum score statistic (or the maximum chi-squared statistic) on this chromosome exceeds $b_1$ (or $b_1^2$) in Stage 1. The power of Stage 1 is $1-\beta_1$. Therefore, the chromosome harboring the QTL will be evaluated in Stage 2 with probability (at least) $1-\beta_1$. With probability $\beta_1$, a test statistic at the QTL will be evaluated in Stage 1 but not in Stage 2. Therefore, the test statistic $S$ at the QTL position obtained at the end of the two-stage procedure is either $S_1$ or $S_2$, depending upon whether the QTL position has been evaluated in Stage 1 alone or in Stage 2 as well. That is, $S$ is a mixture over $S_1$ and $S_2$, given by $S = S_1\,\mathbb{1}\!\left(\max_t Z_t < b_1\sqrt{nQ_r\theta_1}\right) + S_2\,\mathbb{1}\!\left(\max_t Z_t > b_1\sqrt{nQ_r}\right)$, where $\mathbb{1}(\cdot)$ is an indicator function. The power of the two-stage procedure is $P^* = P_A\left(S > b\sqrt{nQ_r}\right)$, and requires evaluating a double integral.

The expected value of $S$ is $E(S) = \beta_1 E(S_1) + (1-\beta_1)E(S_2)$, which clearly depends upon $1-\beta_1$ and $\theta_1$. The non-centrality parameter of $S$, denoted $\lambda_2(\delta, n)$, is the difference between the expected values of $S$ evaluated under the alternative and null hypotheses. Since the expected value of $S$ under the null hypothesis is 0, the non-centrality parameter is $E(S) = \lambda_2(\delta, n)$, written as:

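$$\lambda_2(\delta, n) = \beta_1\,\frac{\lambda(\delta, n\theta_1)}{\delta} + (1 - \beta_1)\,\frac{\lambda(\delta, n)}{\delta}, \tag{9}$$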

where $\lambda(\cdot)$ is the non-centrality parameter given by equation (4). Note that $\lambda(\delta, n\theta_1)/\delta$ and $\lambda(\delta, n)/\delta$ are the expected values of $S_1$ and $S_2$, respectively. Since the $n$ individuals are independent, the test statistic $S_2$ can be written as the sum of independent contributions from the $n\theta_1$ individuals genotyped in Stage 1 and the $n(1-\theta_1)$ individuals newly genotyped in Stage 2. Hence, $S_2 = S_1 + X$, where $S_1$ and $X$ are independent. Further, $X$ has a normal distribution with mean $n(1-\theta_1)Q_r\delta$ and variance $n(1-\theta_1)Q_r$. The variance of $S$ is $\mathrm{Var}(S) = n\theta_1 Q_r + n(1-\theta_1)Q_r(1-\beta_1)^2$. Suppose we fix $\theta_1$. Intuitively $1-\beta_1 \to 1$ as $\alpha_1 \to 1$. The limiting case as $\alpha_1 \to 1$ will be a one-stage design since an increasing number of chromosomes will be genotyped in Stage 2. For a fixed $\theta_1$, $E(S) \to E(S_2)$ and $\mathrm{Var}(S) \to \mathrm{Var}(S_2)$ as $1-\beta_1 \to 1$. These imply weak convergence of $S$ to $S_2$ when $1-\beta_1 \to 1$ for a given $\theta_1$. Therefore, identifying a two-stage design with non-centrality parameter close to that of a one-stage design, i.e., $\lambda_2(\delta, n)$ close to $\lambda(\delta, n)/\delta$, would provide a design having power close to that of a one-stage design.

The relative non-centrality parameter or, equivalently, the relative information is defined as the ratio of the non-centrality parameters of the test statistics at the QTL locus under the two-stage and one-stage designs:

$$\frac{\lambda_2(\delta, n)}{\{\lambda(\delta, n)/\delta\}} = \beta_1\theta_1 + (1 - \beta_1) = 1 - \beta_1(1 - \theta_1). \tag{10}$$

The right hand side does not involve $\delta$. Since $\beta_1 \le 1$ and $\theta_1 \le 1$, the relative information is $\le 1$. Equivalently, $P^* < 1-\beta$, where $1-\beta = P_A\left(S_2/\sqrt{nQ_r} > b\right)$ is the power of a one-stage design. Were we to identify $\theta_1$ and $\alpha_1$ (and, hence, $1-\beta_1$) such that $\beta_1(1-\theta_1)$ is small, then the relative information will be close to 1.
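Equation (10) is simple enough to evaluate directly. The sketch below uses a hypothetical helper `relative_information` and, as example inputs, the two parameter pairs reported in Table 1 for $\Delta = 20$ (Stage 1 power 0.80 with $\theta_1 = 0.50$, and Stage 1 power 0.90 with $\theta_1 = 0.60$).

```python
def relative_information(stage1_power, theta1):
    """Equation (10): 1 - beta1 * (1 - theta1), with beta1 = 1 - Stage 1 power."""
    beta1 = 1.0 - stage1_power
    return 1.0 - beta1 * (1.0 - theta1)

for stage1_power, theta1 in [(0.80, 0.50), (0.90, 0.60)]:
    print(stage1_power, theta1, round(relative_information(stage1_power, theta1), 3))
```

These inputs give 0.90 and 0.96, in line with the relative information of 0.903 and 0.961 listed in Table 1; the small differences reflect the rounding of the tabulated parameters.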

The Stage 1 sampling fraction, $\theta_1$, can be written using equation (7) as:

$$\theta_1 = \frac{\left[b_1 - \Phi^{-1}(\beta_1)\right]^2}{\left[b - \Phi^{-1}(\beta)\right]^2}. \tag{11}$$

The overall significance level $\alpha$ (and, hence, $b^2$), and the overall power $1-\beta$ are specified at the outset, and $b_1$ depends upon $\alpha_1$. Therefore, $\theta_1$ depends upon $\alpha_1$ and $1-\beta_1$. Further, given $1-\beta_1$, $\theta_1$ does not depend upon the QTL effect. Finally, the relative information (equation 10) depends upon $\theta_1$ and $\beta_1$.
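Equation (11) can be evaluated as follows. This is a sketch with a hypothetical function name; the thresholds in the example ($b_1 = 2.51$, $b = 3.90$, both powers 0.80) are the values the text quotes later for $\Delta = 20$ and $\alpha_1 = 0.16$, supplied here as plain inputs.

```python
from statistics import NormalDist

def stage1_fraction(b1, stage1_power, b, overall_power):
    """Equation (11): ratio of the Stage 1 sample size to the one stage sample size."""
    z = NormalDist().inv_cdf
    numerator = (b1 - z(1.0 - stage1_power)) ** 2
    denominator = (b - z(1.0 - overall_power)) ** 2
    return numerator / denominator

print(round(stage1_fraction(2.51, 0.80, 3.90, 0.80), 2))  # 0.5
```

The QTL effect $\delta$ and $Q_r$ cancel in the ratio, which is why neither appears as an argument.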

Genotyping Burden

The total amount of genotyping in a two stage design, denoted $T_2$, comprises genotyping all the $K = \sum_{c=1}^{C} K_c$ markers in $n\theta_1$ subjects in Stage 1, and genotyping the markers from the promising chromosomes on the remaining subjects in Stage 2. Note that every null chromosome has probability $\alpha_1$ of being evaluated in Stage 2. Further, every chromosome containing a QTL will be evaluated in Stage 2 with probability at least $1-\beta_1$. Let $D$ denote the total number of chromosomes carrying a QTL. When there is a single QTL, $D = 1$. Therefore, $T_2$ is given by:

$$T_2 = n\theta_1 \sum_{c=1}^{C} K_c + n(1 - \theta_1)\left[(1 - \beta_1)\sum_{c=1}^{D} K_c + \alpha_1 \sum_{j=1}^{C-D} K_j\right]. \tag{12}$$

The relative genotyping burden, $T_2/T_1$, is the ratio of the genotyping burdens of the two stage and one stage designs:

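$$\frac{T_2}{T_1} = \theta_1 + (1 - \theta_1)\left[(1 - \beta_1)\,\frac{\sum_{c=1}^{D} K_c}{\sum_{j=1}^{C} K_j} + \alpha_1\,\frac{\sum_{j=1}^{C-D} K_j}{\sum_{j=1}^{C} K_j}\right], \tag{13}$$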

with $\theta_1$ given by equation (11). Therefore, $T_2/T_1$ is independent of the model parameters for given $\alpha_1$ and $1-\beta_1$. Clearly $T_2 \le T_1$, i.e., less genotyping is undertaken in a two stage design relative to a one stage design.
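The genotyping burdens of equations (8), (12) and (13) can be sketched as below. The function names and the representation of the genome as a list of per-chromosome marker counts are illustrative assumptions. The example uses a hypothetical genome of 20 chromosomes with 100 markers each, one of which carries the QTL, together with the approximate optimal parameters reported in Table 1, and treats the "at least $1-\beta_1$" selection probability as an equality, as equation (12) does.

```python
def two_stage_burden(n, theta1, stage1_power, alpha1, markers_per_chrom, qtl_chroms):
    """Equation (12): expected number of genotypes under the two stage design."""
    total = sum(markers_per_chrom)
    qtl = sum(markers_per_chrom[c] for c in qtl_chroms)  # markers on QTL chromosomes
    null = total - qtl                                    # markers on null chromosomes
    stage1 = n * theta1 * total
    stage2 = n * (1.0 - theta1) * (stage1_power * qtl + alpha1 * null)
    return stage1 + stage2

def relative_burden(theta1, stage1_power, alpha1, markers_per_chrom, qtl_chroms):
    """Equation (13): T2 / T1, which does not depend on n."""
    t1 = 1.0 * sum(markers_per_chrom)  # equation (8) with n = 1
    t2 = two_stage_burden(1.0, theta1, stage1_power, alpha1, markers_per_chrom, qtl_chroms)
    return t2 / t1

genome = [100] * 20  # hypothetical: 20 chromosomes, 100 markers each
print(round(relative_burden(0.50, 0.80, 0.16, genome, [0]), 3))  # 0.596
print(round(relative_burden(0.60, 0.90, 0.21, genome, [0]), 3))  # 0.698
```

The two results round to the 60% and 70% relative genotyping burdens that these parameter sets are associated with in Table 1.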

Optimal Two Stage Design

The two stage design involves less genotyping than a one stage design, but incurs a loss of power. Therefore, our goal is to identify a two stage design so that the loss of power, $1-\beta-P^*$, is minimized. Equivalently, we shall identify a two-stage design such that the relative information (equation 10) is close to 1. In doing so, we must ensure that the overall significance level is maintained at a desired level $\alpha$. The number of chromosomes $C$, the genome length $L$, the desired effect $\delta$, the average genotyping density $\Delta$, the desired overall significance level $\alpha$, and power $1-\beta$ are specified at the design stage. First obtain the sample size $n$ to detect some desired QTL effect $\delta$ using equation (7) as if we were conducting a one-stage design. This will be our fixed sample size. The phenotypic traits will be measured on all these $n$ subjects. Therefore, given the trait values of these $n$ subjects, our goal is to identify $\alpha_1$ and $1-\beta_1$ (and, hence, $\theta_1$) so that (i) the overall significance level is $\alpha$, (ii) $1 - \lambda_2(\delta, n)/\{\lambda(\delta, n)/\delta\} < \epsilon$ for some small $\epsilon$, i.e., the loss of relative information (equation 10) is below $\epsilon$, and (iii) $T_2/T_1$ is minimum. Here $\epsilon$ is a user-specified quantity indicating the acceptable amount of relative information loss. For example, setting $\epsilon = 0.05$ would indicate that the desired relative information is at least 95%. Following Results 1, 2, and 3, the overall significance level will be maintained at level $\alpha$ by choosing $\alpha_1 \in (\alpha/C, 1)$ and setting the Stage 2 critical value equal to that of a one stage design. The optimal parameters can be obtained through the following steps.

  1. Fix a relative genotyping burden T2/T1.

  2. For a value of $\alpha_1 \in (\alpha/C, 1)$, obtain the critical value $b_1$ by setting $C = 1$ and using $L/C$ in place of $L$ in equation (5).

  3. Obtain 1-β1 from equation (13) using the Newton-Raphson method. In order to perform this calculation, θ1 given by equation (11) must be substituted into equation (13).

  4. Obtain the sampling fraction θ1 from equation (11), and the relative information using equation (10).

  5. Repeat the above steps for various values of $\alpha_1 \in (\alpha/C, 1)$, and find the $\alpha_1$ minimizing $1 - \lambda_2(\delta, n)/\{\lambda(\delta, n)/\delta\}$.

  6. Repeat this procedure for various values of $T_2/T_1$, and identify the smallest $T_2/T_1$ for which $1 - \lambda_2(\delta, n)/\{\lambda(\delta, n)/\delta\} < \epsilon$.

Since equations (10) and (13) are independent of the QTL effect δ and equation (11) is independent of δ once β1 is given, the optimal solution is also independent of δ, once n is given.
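Steps 1 to 5 above can be sketched as a small grid search. This is an illustrative implementation under several assumptions that are not part of the original: all function names are hypothetical, every chromosome is taken to carry the same number of markers, $\alpha_1$ is searched over the grid 0.01, 0.02, ..., 0.49, equation (5) is evaluated with $L$ and $\Delta$ in centiMorgans as written, and bisection replaces the Newton-Raphson method named in step 3. The default genome (20 chromosomes of 100 cM, $\Delta = 1$) follows the hypothetical genome used for Figure 1; the output is not claimed to reproduce Table 1 exactly.

```python
import math
from statistics import NormalDist

STD = NormalDist()

def alpha_of_b(b, n_chrom, length_cm, spacing_cm):
    """Equation (5), using 1 - chi-square(1 df) CDF at b^2 = 2 * (1 - Phi(b))."""
    tail = 2.0 * (1.0 - STD.cdf(b))
    nu = math.exp(-0.583 * b * math.sqrt(0.04 * spacing_cm))
    return 1.0 - math.exp(-(n_chrom * tail + 0.04 * length_cm * b * STD.pdf(b) * nu))

def bisect(f, lo, hi, iters=200):
    """Root of a function that is positive at lo and negative at hi."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def threshold(alpha, n_chrom, length_cm, spacing_cm):
    """Critical value b solving equation (5); searched on [1, 10]."""
    if alpha_of_b(1.0, n_chrom, length_cm, spacing_cm) <= alpha:
        raise ValueError("alpha too large for the search interval")
    return bisect(lambda b: alpha_of_b(b, n_chrom, length_cm, spacing_cm) - alpha, 1.0, 10.0)

def design_for(alpha1, burden, b, beta, n_chrom, genome_cm, spacing_cm, qtl_frac):
    """Steps 2-4: Stage 1 threshold, then beta1, theta1 and relative information."""
    b1 = threshold(alpha1, 1, genome_cm / n_chrom, spacing_cm)
    denom = (b - STD.inv_cdf(beta)) ** 2

    def theta1(beta1):  # equation (11)
        return (b1 - STD.inv_cdf(beta1)) ** 2 / denom

    def excess(beta1):  # equation (13) minus the target burden
        t = theta1(beta1)
        stage2 = (1.0 - beta1) * qtl_frac + alpha1 * (1.0 - qtl_frac)
        return t + (1.0 - t) * stage2 - burden

    lo, hi = 1e-9, STD.cdf(b1)
    if excess(lo) <= 0 or excess(hi) >= 0:
        return None  # the target burden cannot be met with this alpha1
    beta1 = bisect(excess, lo, hi)
    t = theta1(beta1)
    return {"alpha1": alpha1, "power1": 1.0 - beta1, "theta1": t,
            "rel_info": 1.0 - beta1 * (1.0 - t)}  # equation (10)

def optimal_design(burden, alpha=0.05, power=0.80, n_chrom=20,
                   genome_cm=2000.0, spacing_cm=1.0, n_qtl_chrom=1):
    """Step 5: search the alpha1 grid for the largest relative information."""
    b = threshold(alpha, n_chrom, genome_cm, spacing_cm)
    qtl_frac = n_qtl_chrom / n_chrom  # assumes equal marker counts per chromosome
    best = None
    for k in range(1, 50):
        alpha1 = k / 100.0
        if alpha1 <= alpha / n_chrom:
            continue
        d = design_for(alpha1, burden, b, 1.0 - power, n_chrom, genome_cm, spacing_cm, qtl_frac)
        if d is not None and (best is None or d["rel_info"] > best["rel_info"]):
            best = d
    return best

print(optimal_design(0.70))
```

Step 6 amounts to calling `optimal_design` for an increasing sequence of burdens and stopping at the first one whose `1 - rel_info` falls below the chosen $\epsilon$.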

We assess the operating characteristics of the two stage design under various parametric configurations to identify the optimal design. The overall significance level and power of a one stage design are $\alpha = 0.05$ and $1-\beta = 0.80$, unless indicated otherwise. Figure 1 illustrates the optimal two stage design as a function of the relative genotyping burden for a hypothetical genome with $C = 20$ chromosomes, each of length 100 centiMorgans, and marker density $\Delta = 1$ centiMorgan. It is evident that the relative information increases as the relative genotyping burden increases. The relative information is $\ge 95\%$ ($\epsilon \le 0.05$) when the relative genotyping burden $T_2/T_1 \ge 70\%$. Equivalently, a two stage design that utilizes only 70% of the genotyping burden results in less than 5% loss of information relative to a one stage design. Likewise, less than 10% of the information is lost by conducting a two stage design with 60% relative genotyping burden. This result holds regardless of the value of $\Delta$ (figures for other values of $\Delta$ are similar and, hence, not shown).

Figure 1:

Optimal relative genotyping burden of the two stage approach for a backcross genome described in the text. The horizontal axis is the relative genotyping burden (equation 13). The vertical axis is the relative non-centrality parameter (equation 10), equivalently the relative information. The horizontal lines indicate 90% ($\epsilon = 0.10$) and 95% ($\epsilon = 0.05$) relative information.

Table 1 provides the optimal two stage design parameters under various choices of $\Delta$, and $T_2/T_1 = 0.60$ and $0.70$. When $T_2/T_1 = 70\%$, the optimal parameters are approximately $\alpha_1 = 0.21$, $1-\beta_1 = 0.90$, and $\theta_1 = 0.60$, and the maximum relative information is 0.96 (i.e., $\epsilon < 0.05$), regardless of the value of $\Delta$. When the relative genotyping burden is 60%, the optimal two stage design parameters are approximately $\alpha_1 = 0.16$, $1-\beta_1 = 0.80$, and $\theta_1 = 0.50$, and the maximum relative information is 0.90 (i.e., $\epsilon \le 0.10$) for all choices of $\Delta$.

Table 1:

Optimal two stage design parameters for a backcross under a single locus model with 60% and 70% relative genotyping burden. The columns provide the average marker distance (Δ), relative genotyping burden (RGB), Stage 1 significance level α1, Stage 1 power 1-β1, sampling fraction θ1, and the relative information (RI).

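| Δ (cM) | RGB | α1 | 1-β1 | θ1 | RI |
|---|---|---|---|---|---|
| 20 | 0.60 | 0.16 | 0.80 | 0.50 | 0.903 |
| 20 | 0.70 | 0.21 | 0.90 | 0.60 | 0.961 |
| 10 | 0.60 | 0.16 | 0.80 | 0.50 | 0.903 |
| 10 | 0.70 | 0.21 | 0.90 | 0.60 | 0.961 |
| 5 | 0.60 | 0.16 | 0.80 | 0.50 | 0.903 |
| 5 | 0.70 | 0.21 | 0.90 | 0.60 | 0.961 |
| 1 | 0.60 | 0.15 | 0.80 | 0.51 | 0.902 |
| 1 | 0.70 | 0.21 | 0.90 | 0.60 | 0.961 |
| 0.10 | 0.60 | 0.17 | 0.80 | 0.50 | 0.900 |
| 0.10 | 0.70 | 0.21 | 0.90 | 0.60 | 0.960 |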

Figure 2 illustrates the behavior of the two stage design parameters when the relative genotyping burden is fixed at 70%. Choosing α1 between 10% and 35% provides a relative information of at least 95%. The quadratic pattern for the relative information and 1-β1 can be explained as follows. Small values for α1(<10%) imply that few chromosomes will be declared significant at the end of Stage 1 for further evaluation in Stage 2. While this permits us to genotype more individuals in Stage 1 (θ1 between 65% and 70%), small α1 also implies small Stage 1 power 1-β1. Consequently, the relative information is small. When α1 is large (for example, α1>50%), this implies that more chromosomes will be declared significant in Stage 1 and, hence, genotyped in Stage 2. Therefore, when the genotyping burden is fixed, the fraction of individuals to be genotyped in Stage 1 is reduced, thus compromising the Stage 1 power to detect the chromosomes harboring the QTL. This, in turn, results in small relative information. Again, this result holds regardless of the value of Δ.

Figure 2:

Characteristics of the two stage design with 70% relative genotyping burden. The horizontal axis is the Stage 1 significance level, α1. The vertical axis takes values between 0 and 1, and corresponds to the relative information (bold line), the Stage 1 power 1-β1 (dotted line), and the sampling fraction θ1 for Stage 1 (dashed line).

General Guideline:

These results indicate that, as a general guideline, genotyping all the markers on θ1=60% of the individuals in Stage 1, and genotyping the markers on chromosomes significant at α1=20% level using the remaining individuals in Stage 2 results in minimal loss of information (ϵ between 0.05 and 0.10) and requires nearly 30% less genotyping than a one stage design. In particular, first obtain sample size n as if a one-stage design has been planned. Treating this as the fixed sample size, now conduct the study using a two-stage design with the above guidelines. Similar guidelines were obtained for association studies in humans (Satagopan et al. 2004). This general guideline is applicable to genomes of any size, and can be used to identify a QTL via single locus models.

Before proceeding further, it will be useful to understand why Δ does not impact the optimal parameters. The critical value b and the sample size n, obtained at the outset using equation (7) to detect a desired QTL effect δ, will depend upon Δ. Suppose the desired QTL effect is δ=0.20, and the overall power and significance level are 1-β=0.80 and α=0.05, respectively. The critical values under Δ = 20 and 1 are b=3.90 and 3.95, respectively, yielding corresponding sample sizes of n=700 and 580. Suppose we desire a relative genotyping burden of 0.60 under a two-stage design. When θ1=0.50, the Stage 1 sample sizes are 350 and 290, respectively. When α1=0.16, the critical values to test a single chromosome with Δ = 20 and 1 are b1=2.51 and 2.56, respectively. The power of Stage 1 is 1-β1=0.80, the relative information is approximately 0.90, and the relative genotyping burden is 0.60 under both values of Δ. Hence, we have the same Stage 1 power and relative information under both values of Δ precisely because the underlying sample sizes are different. However, the choice of optimal sampling fraction and significance level for Stage 1 does not depend upon the QTL effect and Δ once n is obtained corresponding to a given Δ and is treated as the fixed available sample size.
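The arithmetic in this example can be checked with a short sketch. It assumes that the relative genotyping burden has the form θ1 + (1-θ1)[D(1-β1)/C + α1(1-D/C)] for D QTLs, which parallels the relative computational burden in equation (21); this form is an assumption made for illustration and should be checked against equation (13). The function name and the choice of C=20 chromosomes are likewise assumptions.

```python
def relative_genotyping_burden(theta1, alpha1, stage1_power, n_chrom, n_qtl=1):
    """Expected genotyping of a two-stage design relative to a one-stage design.

    Assumed form (an illustration, not quoted from the article):
        theta1 + (1 - theta1) * [n_qtl * power / C + alpha1 * (1 - n_qtl / C)]
    """
    # Expected fraction of chromosomes carried forward to Stage 2:
    # QTL chromosomes are selected with probability `stage1_power`,
    # null chromosomes with probability `alpha1`.
    carried_forward = (n_qtl * stage1_power / n_chrom
                       + alpha1 * (1 - n_qtl / n_chrom))
    return theta1 + (1 - theta1) * carried_forward


# Values from the example: theta1 = 0.50, alpha1 = 0.16, 1 - beta1 = 0.80.
# C = 20 chromosomes is assumed here.
rgb = relative_genotyping_burden(0.50, 0.16, 0.80, n_chrom=20)

# Stage 1 sample sizes for n = 700 (Delta = 20) and n = 580 (Delta = 1).
stage1_n = [round(0.50 * n) for n in (700, 580)]

print(round(rgb, 3), stage1_n)
```

Under these assumptions the burden evaluates to roughly 0.60, in line with the value quoted in the text, and it does not involve Δ once the Stage 1 power is given.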

F2 Crosses

We now investigate optimal two stage genotyping for F2 crosses having one of three possible genotypes at each locus (homozygous corresponding to one of the two parental types, or heterozygous). Dominant and additive effects of QTLs can be examined using F2 crosses. At any locus, the dominant effect, denoted δd, is the difference between the trait value of the heterozygotes and the average trait value of the homozygotes. The additive effect, denoted δa, is the average difference between the trait values of the homozygotes. Without loss of generality, the marker and QTL genotypes are coded as −1 (homozygous for one parental type), 0 (heterozygote), and 1 (homozygous for the other parental type), and have respective marginal probabilities 1/4, 1/2, and 1/4. Denoting 1(·) as an indicator function, the phenotypic trait of individual i given the QTL genotype gi is modeled as:

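$$y_i = \left(-\delta_a - \frac{\delta_d}{2}\right)\mathbf{1}(g_i=-1) + \frac{\delta_d}{2}\,\mathbf{1}(g_i=0) + \left(\delta_a - \frac{\delta_d}{2}\right)\mathbf{1}(g_i=1) + \epsilon_i.$$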
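As a quick consistency check, the sketch below reads the model as assigning genotype means −δa − δd/2, δd/2, and δa − δd/2 to genotypes −1, 0, and 1, and verifies that these means reproduce the stated definitions of the dominant and additive effects and give a marginal trait mean of zero. The function name and the example effect sizes are illustrative choices.

```python
def f2_genotype_means(delta_a, delta_d):
    """Mean trait value for each F2 genotype code (-1, 0, 1)."""
    return {
        -1: -delta_a - delta_d / 2,
        0: delta_d / 2,
        1: delta_a - delta_d / 2,
    }


# Marginal F2 genotype probabilities from the text.
probs = {-1: 0.25, 0: 0.5, 1: 0.25}
means = f2_genotype_means(delta_a=0.33, delta_d=0.33)

# Dominant effect: heterozygote mean minus the average homozygote mean.
dominance = means[0] - (means[-1] + means[1]) / 2
# Additive effect: half the difference between the two homozygote means.
additive = (means[1] - means[-1]) / 2
# Marginal mean of the trait over genotypes.
marginal_mean = sum(probs[g] * means[g] for g in probs)

print(dominance, additive, marginal_mean)
```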

A single QTL model can be fit using the interval mapping approach, and a score statistic Z(t) can be obtained at every locus t to test the null hypothesis H0: δa = 0 = δd against the alternative HA: δa ≠ 0 or δd ≠ 0. Z(t)² has a χ² distribution with 2 degrees of freedom. The non-centrality parameter under the alternative hypothesis is given in the Appendix. The overall significance level of a one stage design for an F2 cross can be approximated as (Dupuis and Siegmund 1999):

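$$\alpha = P_0\left\{\max_t Z(t)^2 > b^2\right\} = 1 - \exp\left\{-\left[C + 0.03\,b^2 L\,\nu\!\left(b\sqrt{0.06\Delta}\right)\right] \times \exp(-b^2/2)\right\}. \tag{14}$$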

The power 1-β of a one stage design is $P_A(Z^2 > b^2)$, where Z is the test statistic calculated at the QTL position.
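The critical value can be obtained numerically from this approximation. The sketch below is illustrative only and rests on several assumptions: equation (14) is read as 1 − exp{−[C + 0.03 b² L ν(b√(0.06Δ))] exp(−b²/2)} with L and Δ in centiMorgans, and the function ν is replaced by the familiar approximation ν(x) ≈ exp(−0.583x), which is reasonable only for small arguments (hence Δ = 1 centiMorgan in the example). The function names are hypothetical.

```python
import math


def nu(x):
    # Assumed small-argument approximation to the discreteness correction.
    return math.exp(-0.583 * x)


def f2_genomewide_alpha(b2, n_chrom, total_cm, spacing_cm):
    """Approximate significance level for critical value b^2 (2 df scan)."""
    b = math.sqrt(b2)
    rate = n_chrom + 0.03 * b2 * total_cm * nu(b * math.sqrt(0.06 * spacing_cm))
    return 1.0 - math.exp(-rate * math.exp(-b2 / 2.0))


def solve_b2(alpha, n_chrom, total_cm, spacing_cm, lo=4.0, hi=60.0):
    """Bisection for b^2; the approximation decreases in b^2 on [lo, hi]."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if f2_genomewide_alpha(mid, n_chrom, total_cm, spacing_cm) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Genome-wide threshold: C = 20 chromosomes of 100 cM each, alpha = 0.05.
b2_genome = solve_b2(0.05, n_chrom=20, total_cm=2000.0, spacing_cm=1.0)
# Chromosome-wise Stage 1 threshold: C = 1, L/C = 100 cM, alpha1 = 0.20.
b2_chrom = solve_b2(0.20, n_chrom=1, total_cm=100.0, spacing_cm=1.0)

print(b2_genome, b2_chrom)
```

The second call mirrors the Stage 1 recipe of setting C=1 and using L/C in place of L, which gives a much smaller critical value than the genome-wide one.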

Under a two stage design, for α1 ∈ (α/C, 1), the critical value $b_1^2$ can be obtained from equation (14) by setting C=1 and using L/C in place of L. The relative information and the relative genotyping burden have the same form as equations (10) and (13), respectively. The optimal two-stage design parameters can be obtained using the algorithm outlined for a backcross. The following result holds for small δa and δd when the sample size n is fixed (see Appendix for proof).

Result 4 When n is fixed and δa and δd are small, θ1 is independent of the QTL effects and phenotypic variance.

Consider an experiment with a sample size of 320 F2 individuals segregating a single QTL with additive and dominance effects δa=0.33=δd (heritability = 5%), and a hypothetical genome consisting of C=20 chromosomes, each of length 100 centiMorgans, with Δ = 1 centiMorgan. For this configuration, the power to detect the QTL is approximately 1-β=80% at an overall significance level of α=0.05 under a single locus model. Figure 3 illustrates the operating characteristics of the two stage design. The relative information is 95% (i.e., ϵ=0.05) and the relative genotyping burden is 70% when θ1=60% and α1=0.20. Relative information of 95% is also attained when (θ1, α1)=(0.65, 0.12). This suggests that, while optimal two stage design parameters can be obtained, the solution may not be unique. However, as a general guideline, genotyping all the markers on θ1=60% of the individuals in Stage 1 and genotyping the chromosomes significant at level α1=0.20 on the remaining subjects in Stage 2 provides 95% relative information with 70% relative genotyping burden. The operating characteristics of this general guideline are illustrated in Table 2 under a variety of parametric configurations for hypothetical genomes with C=20 and 10 chromosomes, each of length 100 centiMorgans, and different inter-marker distances Δ. The heritability is h² = κ²/(1+κ²), where κ² = (3/16)(δa + δd/2)² + (1/16)(δd/2)² + (3/16)(δa − δd/2)². The operating characteristics are shown for various values of h². It is evident that the general guideline provides an optimal QTL mapping strategy, regardless of the QTL effects and the genotyping density. That the optimal design does not depend upon the QTL effects is consistent with Result 4. The relative information is at least 95% (ϵ ≤ 0.05) for a genome with C=20 chromosomes. For a genome with C=10 chromosomes, the relative information is between 93% and 95%. This general guideline is consistent with the optimal design recommendation identified for a backcross based on a single locus model.
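The heritability values used for the F2 configurations can be reproduced with a few lines of code. The sketch below reads the heritability formula as h² = κ²/(1+κ²) with κ² = (3/16)(δa + δd/2)² + (1/16)(δd/2)² + (3/16)(δa − δd/2)²; the function name is an illustrative choice, and the effect sizes are the four (δa, δd) pairs listed in Table 2.

```python
def f2_heritability(delta_a, delta_d):
    """Heritability h^2 = kappa^2 / (1 + kappa^2) for the F2 single-QTL model."""
    kappa2 = (3 / 16 * (delta_a + delta_d / 2) ** 2
              + 1 / 16 * (delta_d / 2) ** 2
              + 3 / 16 * (delta_a - delta_d / 2) ** 2)
    return kappa2 / (1 + kappa2)


# (delta_a, delta_d) pairs as listed in Table 2.
effects = [(0.15, 0.15), (0.0, 0.33), (0.33, 0.0), (0.33, 0.33)]
table = {pair: round(f2_heritability(*pair), 2) for pair in effects}

for pair, h2 in table.items():
    print(pair, h2)
```

With this reading, the rounded values agree with the heritabilities reported in Table 2 (0.01, 0.01, 0.04, and 0.05).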

Figure 3:

Characteristics of the two stage design for an F2 cross with C=20 chromosomes, and Δ = 1 centiMorgan. The one-stage design has 80% power. Shown are the contour plots corresponding to the relative information (left panel) and the relative genotyping burden (right panel) as a function of the sampling fraction θ1 (horizontal axis) and Stage 1 significance level α1 (vertical axis). The dotted lines indicate design parameters where 95% relative information is attained.

Table 2:

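Two-stage design parameters for an F2 cross based on a single locus model with θ1=0.60 and α1=0.20. The columns are the number of chromosomes (C), heritability (h²; with δa and δd in parentheses), the sample size (n) such that the power of a one-stage design is approximately 80% at level α=0.05, inter-marker distance (Δ), Stage 1 power 1-β1, relative genotyping burden (RGB), and relative information (RI).

| C | h² (δa, δd) | n | Δ | 1-β1 | RGB | RI |
|---|---|---|---|---|---|---|
| 20 | 0.01 (0.15, 0.15) | 1500 | 20 | 0.90 | 0.69 | 0.96 |
| 20 | 0.01 (0.15, 0.15) | 1550 | 1 | 0.88 | 0.70 | 0.95 |
| 20 | 0.01 (0, 0.33) | 825 | 20 | 0.90 | 0.69 | 0.96 |
| 20 | 0.01 (0, 0.33) | 950 | 1 | 0.87 | 0.69 | 0.95 |
| 20 | 0.04 (0.33, 0) | 500 | 20 | 0.90 | 0.69 | 0.96 |
| 20 | 0.04 (0.33, 0) | 475 | 1 | 0.87 | 0.69 | 0.96 |
| 20 | 0.05 (0.33, 0.33) | 310 | 20 | 0.91 | 0.69 | 0.96 |
| 20 | 0.05 (0.33, 0.33) | 320 | 1 | 0.88 | 0.70 | 0.95 |
| 10 | 0.01 (0.15, 0.15) | 1400 | 20 | 0.88 | 0.71 | 0.95 |
| 10 | 0.01 (0.15, 0.15) | 1420 | 1 | 0.84 | 0.71 | 0.94 |
| 10 | 0.01 (0, 0.33) | 760 | 20 | 0.87 | 0.71 | 0.95 |
| 10 | 0.01 (0, 0.33) | 880 | 1 | 0.84 | 0.71 | 0.94 |
| 10 | 0.04 (0.33, 0) | 460 | 20 | 0.88 | 0.71 | 0.95 |
| 10 | 0.04 (0.33, 0) | 440 | 1 | 0.84 | 0.71 | 0.93 |
| 10 | 0.05 (0.33, 0.33) | 290 | 20 | 0.88 | 0.71 | 0.95 |
| 10 | 0.05 (0.33, 0.33) | 295 | 1 | 0.84 | 0.71 | 0.94 |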

Two QTL Models

The methods described so far consider a single locus model at various genomic locations to identify a QTL associated with the trait. Complex traits are very likely influenced by multiple QTLs. Suppose two QTLs are associated with the phenotypic trait. Let Pj denote the power to identify QTL j(j=1,2) using a single locus model. The power to detect both the QTLs is P1×P2. If P1=0.80=P2, then the genome scan based on single locus model has only 64% power to detect both the loci. Genome scans using multi-locus models can provide a more powerful approach to identify the putative loci. Here we consider the simple case where D=2 QTLs are associated with the phenotypic trait of a backcross individual, and examine the operating characteristics of a two stage design. Under a one-stage design, two-locus models are fit by considering every pair of loci on the genome. The null hypothesis of no QTL is tested against the alternative hypothesis of two QTLs. The two-stage design proceeds as follows. In Stage 1, all the markers are genotyped on a random subset of nθ1 individuals. A genome scan is conducted using two-locus models. On every chromosome c, we calculate test statistics corresponding to two-locus models where one locus is from chromosome c and the second locus is either from c or from a different chromosome. The maximum test statistic on chromosome c is the maximum of the test statistics from two-locus models where at least one locus is on c. Each chromosome-specific maximum test statistic is tested at significance level α1. The markers on the significant chromosomes are genotyped on the remaining subjects. A genome scan based on two-locus models is conducted using these chromosomes and genotype data from all the n subjects to identify the two QTLs.

Given the genotypes at the two QTLs, the trait of a backcross subject i is modeled as:

$$y_i = \delta_1(2g_{i1}-1) + \delta_2(2g_{i2}-1) + \epsilon_i. \tag{15}$$

The score statistic, denoted Z(t1, t2), can be calculated based on the above model at any two genomic locations t1 and t2 to test the null hypothesis of no QTL (δ1 = 0 = δ2) at those two loci. The square of the test statistic has a chi-square distribution with 2 degrees of freedom and non-centrality parameter:

$$\lambda(\delta_1,\delta_2,n) = n\,[1-4q(1-q)]\,(1-r)\,(\delta_1^2+\delta_2^2). \tag{16}$$

The overall significance level α is approximated as (Dupuis and Siegmund 1999):

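$$\alpha = P_0\left\{\max_{t_1,t_2} Z^2(t_1,t_2) > b^2\right\} = 1 - \exp\left\{-\frac{(C+L/\Delta)^2}{2} \times \left[\tau\!\left(\sqrt{\Delta b^2 \zeta/2}\right)\right]^2 \times \left[1-\mathcal{X}_2^2(b^2)\right]\right\}, \tag{17}$$

where ζ=2 and $\tau(x) = \exp\left\{-2\sum_{s=1}^{\infty} \Phi(-x\sqrt{s})/s\right\}$.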

The relative information and the relative genotyping burden have the same form as equations (10) and (13). The optimal θ1 and α1 ∈ (α/C, 1) (and the corresponding 1-β1) can be obtained as described earlier. We examined the optimal two-stage design under various parametric configurations when D=2 QTLs are associated with the trait. Heritability, or the proportion of variation in the trait explained by the two QTLs, is h² = κ²/(1+κ²), where κ² = δ1² + δ2². The optimal two-stage design was examined for various values of δ1² + δ2² (equivalently, h²). The results indicate that regardless of the heritability h² and the genotyping density Δ, 95% or more relative information and, hence, near-optimal power is obtained when (θ1, α1)=(0.60, 0.20). The relative information is at least 90% when (θ1, α1)=(0.60, 0.10). The operating characteristics of these two stage designs are described in Table 3 under various parametric configurations. Our investigations suggest that, as a general guideline, genotyping all the chromosomes on θ1=60% of the subjects in Stage 1 and genotyping the chromosomes significant at α1=20% provides most of the information and utilizes only 70% of the genotyping relative to a one-stage design. This is consistent with the general guideline obtained under a single QTL model for backcross and F2 crosses.
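The correspondence between heritability and the combined effect size δ1² + δ2² is easy to move in both directions. The following sketch uses the relation h² = κ²/(1+κ²) with κ² = δ1² + δ2²; the function names are illustrative, and the numerical inputs are the values that appear in Table 3.

```python
def two_qtl_heritability(kappa2):
    """h^2 for a backcross with two additive QTLs, kappa2 = delta1^2 + delta2^2."""
    return kappa2 / (1.0 + kappa2)


def kappa2_from_heritability(h2):
    """Inverse relation: combined squared effect needed for a target h^2."""
    return h2 / (1.0 - h2)


# Effect sizes from Table 3 mapped to heritability.
h2_values = [round(two_qtl_heritability(k), 2) for k in (0.33, 0.18)]
# Target heritabilities mapped back to combined squared effects.
kappa2_values = [round(kappa2_from_heritability(h), 2) for h in (0.25, 0.15)]

print(h2_values, kappa2_values)
```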

Table 3:

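Characteristics of a two stage design based on a two-QTL model for a backcross segregating two QTLs. The columns indicate the choice of θ1 and α1, the number of chromosomes (C), heritability (h²; with δ1² + δ2² in parentheses), sample size (n) such that a one-stage design has 80% power, Δ, 1-β1, relative genotyping burden (RGB), and relative information (RI).

| (θ1, α1) | C | h² (δ1² + δ2²) | n | Δ | 1-β1 | RGB | RI |
|---|---|---|---|---|---|---|---|
| (0.60, 0.20) | 20 | 0.25 (0.33) | 360 | 20 | 0.96 | 0.71 | 0.98 |
| (0.60, 0.20) | 20 | 0.25 (0.33) | 370 | 1 | 0.92 | 0.71 | 0.97 |
| (0.60, 0.20) | 20 | 0.15 (0.18) | 1200 | 20 | 0.96 | 0.71 | 0.98 |
| (0.60, 0.20) | 20 | 0.15 (0.18) | 1230 | 1 | 0.92 | 0.71 | 0.97 |
| (0.60, 0.20) | 10 | 0.25 (0.33) | 320 | 20 | 0.93 | 0.74 | 0.97 |
| (0.60, 0.20) | 10 | 0.25 (0.33) | 335 | 1 | 0.88 | 0.73 | 0.95 |
| (0.60, 0.20) | 10 | 0.15 (0.18) | 1100 | 20 | 0.94 | 0.74 | 0.97 |
| (0.60, 0.20) | 10 | 0.15 (0.18) | 1130 | 1 | 0.88 | 0.73 | 0.95 |
| (0.60, 0.10) | 20 | 0.25 (0.33) | 360 | 20 | 0.92 | 0.67 | 0.97 |
| (0.60, 0.10) | 20 | 0.25 (0.33) | 370 | 1 | 0.87 | 0.67 | 0.95 |
| (0.60, 0.10) | 20 | 0.15 (0.18) | 1200 | 20 | 0.92 | 0.67 | 0.97 |
| (0.60, 0.10) | 20 | 0.15 (0.18) | 1230 | 1 | 0.87 | 0.67 | 0.95 |
| (0.60, 0.10) | 10 | 0.25 (0.33) | 320 | 20 | 0.88 | 0.70 | 0.95 |
| (0.60, 0.10) | 10 | 0.25 (0.33) | 335 | 1 | 0.82 | 0.70 | 0.93 |
| (0.60, 0.10) | 10 | 0.15 (0.18) | 1100 | 20 | 0.89 | 0.70 | 0.95 |
| (0.60, 0.10) | 10 | 0.15 (0.18) | 1130 | 1 | 0.82 | 0.70 | 0.93 |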

Two Stage Analytic Approach

Complex traits may be a consequence of multiple QTLs conferring effects individually (main effects) or solely through epistasis (interaction effects), or both. Conducting a genome scan using single QTL models may not provide adequate power to detect all the QTLs. Fitting multi-locus models to identify the QTLs can, however, be a tedious task, particularly under dense genotyping. A sequential analytic strategy can be employed to identify multiple QTLs in such cases. Here we examine a two-stage analytic approach when a dense set of markers is genotyped on all the n available individuals.

We consider the case where the trait is generated through the following model consisting of the main effects and interaction between two unlinked QTLs:

$$y_i = \delta_1(2g_{i1}-1) + \delta_2(2g_{i2}-1) + \delta_3(4g_{i1}g_{i2}-1) + \epsilon_i. \tag{18}$$

Here δ3 is the interaction effect. The phenotypic trait has marginal mean 0 and marginal variance τ² = 1 + δ1² + δ2² + 3δ3² + δ1δ3 + δ2δ3. In our investigations below, we assume δ1=δ2. The heritability based on the above model is $h^2 = h_{12}^2 + h_3^2$, where $h_{12}^2 = (\delta_1^2+\delta_2^2)/\tau^2$ and $h_3^2 = \delta_3(\delta_1+\delta_2+3\delta_3)/\tau^2$. The quantity $\eta_{12} = h_{12}^2/h^2$ can be interpreted as the fraction of heritability conferred solely by the main effects of the two QTLs. Therefore, η12=1 implies that δ3=0, and $1-\eta_{12} = h_3^2/h^2 = 1$ implies that δ1=0=δ2.
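These quantities can be computed directly from the three effect sizes. The sketch below takes the variance and heritability expressions exactly as stated in the text (τ² = 1 + δ1² + δ2² + 3δ3² + δ1δ3 + δ2δ3, with the main-effect and interaction components of heritability defined relative to τ²); the function name is an illustrative choice, and the example effect sizes are taken from the first block of Table 4.

```python
def interaction_model_summary(delta1, delta2, delta3):
    """Return (tau2, h2, eta12) for the two-QTL model with interaction."""
    main = delta1 ** 2 + delta2 ** 2                        # numerator of h12^2
    interaction = delta3 * (delta1 + delta2 + 3 * delta3)   # numerator of h3^2
    tau2 = 1 + main + interaction                           # marginal trait variance
    h2 = (main + interaction) / tau2                        # h12^2 + h3^2
    # Fraction of heritability due to main effects; defined only if h2 > 0.
    eta12 = main / (main + interaction) if (main + interaction) > 0 else float("nan")
    return tau2, h2, eta12


# Pure main-effect configuration: delta1 = delta2 = 0.30, delta3 = 0.
tau2_a, h2_a, eta_a = interaction_model_summary(0.30, 0.30, 0.0)
# Configuration with interaction: delta1 = delta2 = 0.28, delta3 = 0.03.
tau2_b, h2_b, eta_b = interaction_model_summary(0.28, 0.28, 0.03)

print(round(h2_a, 3), round(eta_a, 3))
print(round(h2_b, 3), round(eta_b, 3))
```

Both configurations give a heritability close to 0.15, and the second gives a main-effect fraction close to 0.90, matching the corresponding rows of Table 4 up to rounding of the tabulated effect sizes.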

Suppose we conduct a one-stage genome scan using single QTL models. As described in the previous section, the power to detect the two QTLs under this approach is P1×P2. Now suppose that we conduct a one-stage genome scan using two QTL models, where two main effects terms and a pairwise interaction term are considered as in equation (18). In this setting, there is no QTL under the null hypothesis, and there are two QTLs under the alternative. The test statistic, denoted Z², calculated at the two QTL positions using the above model has a χ² distribution with 3 degrees of freedom and non-centrality parameter given by:

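$$\lambda(\delta_1,\delta_2,\delta_3,n) = n \times \left(\delta_1^2+\delta_2^2+3\delta_3^2+\delta_1\delta_3+\delta_2\delta_3\right). \tag{19}$$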

The overall significance level α is given by equation (17) with ζ=4/3. The power of the one-stage genome scan using two-QTL models is $1-\beta = 1-\Phi\left[b-\sqrt{\lambda(\delta_1,\delta_2,\delta_3,n)}\right]$. The computational burden of this one-stage approach is the total number of chromosomes that will be considered for the pair-wise genome scan. Since two-locus models will be fit using markers on all the chromosomes, the computational burden is CB1=C, the total number of chromosomes.

Consider the following two-stage analytic approach. In Stage 1, conduct a genome scan using single QTL models and test each chromosome at significance level α1. In Stage 2, conduct a two-locus scan by applying model (18) to pairs of loci on the chromosomes identified as being significant in Stage 1 and test the maximum test statistic at significance level α2. Let 1-β1 and 1-β2 denote the powers to detect, in Stage 1, the chromosomes containing the first and the second QTL, respectively. The test statistic in Stage 1 is based on a single QTL model, while that in Stage 2 is obtained using a two-QTL model. The power, P*, of the two-stage approach is the joint probability that the chromosomes containing the QTLs are identified in Stage 1, and the two locus test statistic Z², calculated at the QTL positions in Stage 2, exceeds b². Therefore, P* can be written as:

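$$\begin{aligned}
P^* &= P_A(\text{the two QTL chromosomes are selected in Stage 1 and } Z^2 > b^2) \\
&= P_A(\text{the two QTL chromosomes are selected in Stage 1}) \times P_A(Z^2 > b^2 \mid \text{QTL chromosomes are selected in Stage 1}) \\
&= (1-\beta_1) \times (1-\beta_2) \times P_A(Z^2 > b^2 \mid \text{QTL chromosomes are selected in Stage 1}) \\
&= (1-\beta_1) \times (1-\beta_2) \times P_A(Z^2 > b^2 \mid \text{test statistic is calculated at the two QTL positions}) \\
&= (1-\beta_1) \times (1-\beta_2) \times (1-\beta).
\end{aligned} \tag{20}$$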

A two locus model will be fit at the two putative loci only when the relevant chromosomes are selected in Stage 1. Therefore, P* can be substantially smaller than 1-β. The computational burden of the proposed two stage design is $CB_2 = \sum_{j=1}^{D}(1-\beta_j) + \alpha_1(C-D)$. Clearly, CB2 ≤ CB1. Therefore, the relative computational burden is given by:

$$\frac{CB_2}{CB_1} = \frac{\sum_{j=1}^{D}(1-\beta_j)}{C} + \alpha_1 \times \left(1-\frac{D}{C}\right). \tag{21}$$

Our goal is to identify the two stage parameter α1 so that (i) the overall significance level is α, (ii) the loss of power, 1-β-P*, is smaller than ϵ for some desired ϵ, and (iii) the relative computational burden is minimized.

Since δ1=δ2, we have 1-β1=1-β2. The optimal two-stage parameters can be identified as follows. The values of α, 1-β, C, L, n, and the desired effect size δ1=δ2, and δ3 are specified when designing the study. The value of ϵ is specified by the user.

  1. Choose α1 ∈ (α/C, 1).

  2. Obtain the critical value $b_1^2$ from equation (5) by setting C=1 and using L/C in place of L on the right hand side.

  3. Calculate the power of Stage 1, 1-β1, using equation (6), noting that the test statistic at a QTL locus in Stage 1 has a marginal χ2 distribution with 1 degree of freedom and non-centrality parameter n×δ12/τ2-1-δ12.

  4. Calculate 1-β-P* = (1-β) - (1-β1)² × (1-β). Check if this value is ≤ ϵ.

  5. Calculate the relative computational burden CB2/CB1 using equation (21).

  6. Repeat the above procedure for various choices of α1 to identify the α1 providing 1-β-P* ≤ ϵ and the smallest relative computational burden.
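Steps 4 to 6 of this search can be sketched as follows. The sketch treats the Stage 1 power for each candidate α1 as an input, since steps 2 and 3 require equations (5) and (6); the Stage 1 powers in the example are hypothetical placeholders, and the function names, the value of ϵ, and the grid of α1 values are all illustrative assumptions.

```python
def two_stage_power(stage1_powers, one_stage_power):
    """P* as in equation (20): product of Stage 1 powers times 1 - beta."""
    power = one_stage_power
    for p in stage1_powers:
        power *= p
    return power


def relative_computational_burden(stage1_powers, alpha1, n_chrom):
    """CB2 / CB1 as in equation (21)."""
    n_qtl = len(stage1_powers)
    return sum(stage1_powers) / n_chrom + alpha1 * (1 - n_qtl / n_chrom)


def choose_alpha1(stage1_power_by_alpha1, one_stage_power, n_chrom, epsilon, n_qtl=2):
    """Pick the alpha1 with power loss <= epsilon and the smallest burden."""
    best = None
    for alpha1, s1 in sorted(stage1_power_by_alpha1.items()):
        powers = [s1] * n_qtl  # delta1 = delta2 implies equal Stage 1 powers
        loss = one_stage_power - two_stage_power(powers, one_stage_power)
        burden = relative_computational_burden(powers, alpha1, n_chrom)
        if loss <= epsilon and (best is None or burden < best[1]):
            best = (alpha1, burden, loss)
    return best


# Hypothetical Stage 1 powers for three candidate alpha1 values (placeholders).
candidates = {0.10: 0.97, 0.20: 0.99, 0.25: 0.99}
best = choose_alpha1(candidates, one_stage_power=0.90, n_chrom=20, epsilon=0.03)
print(best)
```

With these placeholder inputs, α1=0.10 is rejected because its power loss exceeds ϵ, and α1=0.20 is preferred over α1=0.25 because it meets the power constraint with a smaller relative computational burden.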

We examined the power to detect two QTLs under three methods: a one-stage genome scan using single locus models, a one-stage genome scan using two-locus models with main effects and a pair-wise interaction term, and the proposed two-stage analytic approach. Consider a hypothetical genome with C=20 densely genotyped chromosomes, each of length 100 centiMorgans. The desired overall significance level is α=0.05. Table 4 gives the power of the three methods. Column 1 gives the heritability h², with sample size n in parentheses. Column 2 is η12, the fraction of heritability conferred solely by the main effects of the two QTLs. Column 3 is the main effect of the two QTLs. Column 4 is the interaction effect. Column 5 is the power of a one-stage analysis to detect both the QTLs based on a genome scan with single-locus models. Columns 6, 7, and 8 are the power of the two-stage analytic approach under α1=0.10 (under column denoted (a)), 0.20 (b), and 0.25 (c). Column 9 is 1-β, the power of the one-stage genome scan using two-locus models. The relative computational burdens are approximately 18%, 28%, and 32% for α1=0.10, 0.20, and 0.25, respectively, regardless of h², n, and η12. The results indicate that over a broad range of values of h², the two-stage approach provides sufficient power to identify both the QTLs so long as a reasonable fraction of heritability is explained by the main effects. As a general rule, a two-stage approach with α1=0.25 provides sufficient power to detect both the QTLs when η12 ≥ 0.75. The loss of power is ϵ ≤ 10%. This result holds for all of the configurations examined, regardless of the value of h². The power of this design is given in Column 8 of Table 4, and should be compared with the power of the one-stage genome scan using two-locus models given in Column 9. When α1=0.25, the relative computational burden is 32%, i.e., this two-stage analytic approach requires evaluating 68% fewer two-locus models to identify the two QTLs than a one-stage approach. Column 5 provides the power to detect the two loci using a one-stage genome scan with single locus models. Comparing this with Columns 6, 7, and 8, it is evident that the proposed two-stage analytic approach is substantially more powerful than a one-stage single locus genome scan.

Table 4:

Characteristics of the two stage analytic approach to detect two QTLs. The power of a one-stage genome scan using two-QTL models is 1-β ≈ 0.90. The columns are explained in the text. The power of the two-stage analytic approach with α1=0.10, 0.20, and 0.25 is given under columns (a), (b), and (c), respectively.

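| h² (n) | η12 | δ1=δ2 | δ3 | P1×P2 | (a) | (b) | (c) | 1-β |
|---|---|---|---|---|---|---|---|---|
| 0.15 (260) | 1 | 0.30 | 0 | 0.41 | 0.85 | 0.88 | 0.88 | 0.90 |
| 0.15 (260) | 0.90 | 0.28 | 0.03 | 0.26 | 0.78 | 0.83 | 0.85 | 0.89 |
| 0.15 (260) | 0.80 | 0.27 | 0.05 | 0.20 | 0.77 | 0.82 | 0.84 | 0.90 |
| 0.15 (260) | 0.75 | 0.26 | 0.06 | 0.15 | 0.70 | 0.78 | 0.80 | 0.89 |
| 0.15 (260) | 0.60 | 0.23 | 0.10 | 0.05 | 0.53 | 0.64 | 0.68 | 0.89 |
| 0.10 (410) | 1 | 0.24 | 0 | 0.50 | 0.86 | 0.88 | 0.89 | 0.91 |
| 0.10 (410) | 0.90 | 0.224 | 0.022 | 0.35 | 0.80 | 0.84 | 0.85 | 0.89 |
| 0.10 (410) | 0.80 | 0.211 | 0.041 | 0.24 | 0.74 | 0.81 | 0.82 | 0.89 |
| 0.10 (410) | 0.75 | 0.204 | 0.05 | 0.18 | 0.70 | 0.78 | 0.80 | 0.89 |
| 0.10 (410) | 0.60 | 0.183 | 0.08 | 0.08 | 0.56 | 0.68 | 0.72 | 0.91 |
| 0.05 (900) | 1 | 0.162 | 0 | 0.58 | 0.87 | 0.89 | 0.89 | 0.91 |
| 0.05 (900) | 0.90 | 0.154 | 0.015 | 0.48 | 0.84 | 0.88 | 0.89 | 0.91 |
| 0.05 (900) | 0.80 | 0.145 | 0.028 | 0.34 | 0.80 | 0.85 | 0.86 | 0.91 |
| 0.05 (900) | 0.75 | 0.140 | 0.034 | 0.27 | 0.76 | 0.72 | 0.84 | 0.90 |
| 0.05 (900) | 0.60 | 0.126 | 0.052 | 0.13 | 0.63 | 0.73 | 0.76 | 0.91 |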

Discussion

In this paper we have outlined two-stage sequential methods for QTL mapping in experimental crosses. Recent developments in molecular technology enable us to conduct dense genotyping. Sequential genotyping methods can provide a cost-efficient strategy to search for QTLs without having to genotype all the markers on all the study subjects. Our investigations show that when a single QTL is associated with the phenotypic trait, genotyping all the markers on only 60% of the individuals in Stage 1 and genotyping the markers on chromosomes significant at the 20% level using additional individuals in Stage 2 provides near-optimal power to identify the putative locus and utilizes only 70% of the genotyping burden of a one stage design. When two QTLs are associated with the trait, this guideline continues to provide an efficient approach to identify the QTLs when both Stages 1 and 2 involve genome scans using two locus models. The optimal parameters are independent of the heritability and genotyping density. When planning a two-stage design, the sample size n is initially calculated to detect a QTL with some desired power and significance level, assuming a genotyping density of Δ centiMorgans, as if one were conducting a one-stage design. Genotyping is then conducted using a two-stage approach by treating this n as the fixed available sample size. Therefore, while the sample size will depend upon several parameters including Δ, the choice of sampling fraction and significance level for Stage 1 does not depend upon the QTL effect and Δ once the sample size is fixed. Our investigations based on a single QTL model indicate that the general guideline is applicable to both backcross and F2 crosses. If the cost of genotyping is linear in the total number of markers, this two stage design provides a cost-effective genotyping strategy. These guidelines have parallels to optimal two-stage genotyping strategies for association studies in human populations.

Note that once the promising markers are genotyped on additional individuals in Stage 2, the analysis of these markers is based on the entire sample, i.e., the individuals genotyped in Stage 1 as well as Stage 2. While Stage 2 may be viewed as a replication study and the analysis may be conducted solely based on Stage 2 data, this would result in an inefficient genetic mapping strategy relative to a joint analysis of Stage 1 and Stage 2 samples (Skol et al. 2006). This is consistent with intuition since the joint analysis utilizes a larger sample size than an analysis solely based on Stage 2 data. This article implicitly assumes that the cost of genotyping a single marker is the same in the two stages. Wang et al. (2006) examine differential costs for genotyping a large set of markers in Stage 1 and a smaller subset in Stage 2, conclude that two-stage designs provide an optimal genotyping strategy by examining the power function, and provide optimal two-stage design parameters under different cost settings.

The contribution to the likelihood from each individual is based on a mixture model. A reviewer pointed out that the asymptotic normality of the score statistic and the asymptotic chi-squared distribution of the likelihood ratio statistic may be violated for mixture models. This violation occurs when the mixing proportion qi is 1/2 for all the individuals, resulting in a singular information matrix (see for example, Quinn et al. 1987). Consider a backcross. Note that qi=1/2 only when the flanking markers recombine, and $q_i = 1 - r_1^2/(1-r)$ or $r_1^2/(1-r)$ when they do not recombine. In fact, if we observe that qi=1/2 for all the subjects, this is an indication that all the study subjects have recombining flanking markers, which should not occur in a well-designed study. A randomized study should yield both recombinant and non-recombinant pairs of markers. Hence, qi ≠ 1/2 for all the backcross subjects in a well-designed study. Any chromosome with genotyped markers should not have this issue, and we consider testing for the presence of a QTL at specific locations in intervals flanked by a pair of genotyped markers. A similar argument holds for other crosses as well. Hence, the expected information is non-singular and the asymptotic distributions of the score statistic and the likelihood ratio statistic are valid.

We have used the approximation to the tail probability of the test statistic given by equation (5) in our calculations. This approximation is valid for equally spaced markers, and the tail probability would be over-estimated when the inter-marker distances are not the same (Malley et al. 2002). Our main focus here is study design, where the investigator typically makes some assumption about average inter-marker distances to evaluate the sample size and power. The approximation to the significance level would be reasonable for addressing design considerations under such settings. Since the tail probability is over-estimated when inter-marker distances vary, the resulting sample size n would be conservative. While the approximation to the significance level (equation 5) may be reasonable for study designs, it may be pragmatic to calculate p-values using a resampling approach during data analysis. Resampling methods have been described by Churchill and Doerge (1994) and Malley et al. (2002) for a one-stage design, and by Lin (2006) for a two-stage design. In this paper we have defined power as the probability that the test statistic at the QTL position(s) exceeds a desired critical value. This is our working definition of power, and it may have some limitations. Alternatively, one may define power as the probability that the QTL position is within the confidence interval for the position corresponding to the maximum test statistic. Such a definition may offer a better chance of detecting the QTL than our working definition. Two-stage designs under this alternative definition of power have yet to be examined, and this has not been attempted here.

The number of QTLs is unknown at the outset. Further, it is unknown whether multiple loci are associated with the trait solely through main effects or confer effects through interactions. It is, therefore, natural to conduct an initial genome scan using single locus models to identify the promising chromosomes in Stage 1. Multi-locus models can then be fit using the markers on these chromosomes in Stage 2 to identify the QTLs. Our investigations indicate that this approach provides substantial computational efficiency to identify multiple QTLs with sufficient power so long as at least 75% of the heritability is due to the main effects of the QTLs. This is consistent with the intuition that a genome scan based on single locus models in Stage 1 may not have sufficient power to detect the chromosomes containing the putative QTLs when two loci confer a substantial interaction effect. Posterior summaries for the presence of a QTL at a locus (for example, Sen and Churchill 2001) obtained using hierarchical models can provide a more powerful genome scan strategy to identify QTLs. Such methods can be used for genome scanning in Stage 1 to identify the promising chromosomes. Further research is required to assess the efficiency of two stage designs that employ such analytic approaches in Stage 1, along with the resulting cost-efficiency and gains in computational burden; this has not been attempted here.

The proposed two stage design is conceptually similar to group sequential designs used in clinical trials (Jennison and Turnbull 2000). In a group sequential trial, a set of individuals are recruited into the trial and randomly assigned to have or not have the treatment. Outcomes from the treatment and control groups are compared, and the study proceeds to the next stage if the comparison is not statistically significant. Otherwise, the trial is stopped. Under the proposed two stage design, the study proceeds to the next stage if a chromosome is declared significant. While one or a few treatments are evaluated in a group sequential clinical trial, a QTL study involves testing multiple genetic loci. The overall significance error in a group sequential clinical trial is written as a sum of type I errors over multiple stages. Under the proposed two stage design the probability of not finding a false positive (equation 22 in the Appendix) is written as the sum of the corresponding probabilities in successive stages. This provides insight into the choice of significance level α1 for Stage 1, and the choice of significance level α2 and the corresponding critical value in Stage 2, as given by Results 1, 2, and 3.

DNA pooling (Churchill et al. 1993) is another cost-effective genotyping strategy whereby n given subjects are grouped into m pools consisting of k individuals each such that n=m×k. DNA is isolated en masse from each pool. Thus, the total number of genotyping assays per marker is m<n. Two-stage and multi-stage designs involving DNA pooling in Stage 1 and individual genotyping of the promising loci in subsequent stages have been examined for population-based association studies in humans (Zuo et al. 2006; Prentice and Qi 2007). Further research is needed to examine the optimal properties of two-stage or multi-stage QTL mapping studies involving DNA pooling. Cost-effective genetic mapping methods have also been investigated from phenotyping perspectives, particularly when multiple phenotypes are measured and/or when phenotyping is more expensive than genotyping. Jin et al. (2004) proposed a selective phenotyping strategy using a criterion that maximizes the genetic diversity of the phenotyped subjects. Medugorac and Soller (2001) considered cost-information tradeoffs under selective phenotyping with a main trait of interest and a correlated trait. This article focuses solely on the case where genotyping, but not phenotyping, cost can be substantial. The role of two-stage designs under selective phenotyping remains to be examined, and has not been considered here.

Our investigations where a single locus or multi-locus model is used in both stages make use of the fact that the non-centrality parameter is a mixture of the corresponding quantities in the two stages, thus providing a direct approach to evaluate the optimal two stage design without the need to examine the power. Two stage designs have been used for mapping disease susceptibility loci in experimental organisms and human populations. Sugiyama et al. (2001) used a novel two stage design for mapping traits associated with salt-induced hypertension in 250 mice. A selective genotyping approach was used in Stage 1, by genotyping all the markers on 92 mice with extreme trait values to identify the promising chromosomal regions. In Stage 2, the recombinant regions were densely genotyped on all the 250 mice if the flanking markers recombined. The rationale behind this approach is that the genotypes of the intermediate loci are completely known if the flanking markers do not recombine. The design proposed in this article assumes that, once a chromosome is declared significant in Stage 1, all the markers on this chromosome are genotyped on additional individuals in Stage 2. Alternatively, one may genotype markers from only those intervals having a significant LOD score, instead of having to genotype all the markers on that chromosome. This approach can further reduce the genotyping burden. The proposed two stage design can be extended to encompass designs of this kind. However, the critical value $b_2^2$ for Stage 2 may be different from $b^2$, and further work is needed to obtain the appropriate $b_2^2$. Maraganore et al. (2005) employed a novel approach to identify genetic loci associated with Parkinson disease using family-based as well as population-based case-control samples. A dense set of single nucleotide polymorphisms (SNPs) was genotyped on discordant siblings in Stage 1. The promising SNPs were genotyped in Stage 2 on unrelated case-control samples. Data on these promising SNPs from the discordant siblings and case-control samples were used to identify the putative disease loci. The rationale behind this approach is that the use of discordant siblings reduces population stratification issues and, hence, false positive associations, and the use of case-control samples provides improved power to identify the putative loci. Likewise, different crosses (such as backcross, F2 or recombinant inbred lines) may be used in different stages to fine map the disease loci. The direct approach for evaluating power loss can provide a framework for investigating the operating characteristics of such designs. The efficiency of two stage designs employing different samples or crosses in different stages remains to be explored. Computer programs written using the R programming language (http://cran-r-project.org) are available from the authors to perform the two-stage design calculations described in this article.

Acknowledgments

This work was supported by NIH grants GM060457 (JMS), GM074244 (SS), and GM070683 (GAC). The authors thank two anonymous reviewers for their helpful comments.

Appendix

Mean of the score statistic: Single QTL Model

Here we describe the mean of the score statistic $Z$ for a single QTL model based on the likelihood given by equation (2). The first derivative of the log-likelihood with respect to $\delta$, evaluated at $\delta=0$, is given by $l'(\delta=0; r_1, n) = \sum_{i=1}^{n} y_i(2q_i-1)$. The expected value of $l'(\delta=0; r_1, n)$ based on $n$ individuals with independently distributed phenotypic traits and independent and identically distributed genotypes is given by:

$$
\begin{aligned}
E\left[l'(\delta=0; r_1, n)\right] &= n\,E\left[(2q_i-1)\,E(y_i \mid q_i)\right] \\
&= n\delta\,E\left[(2q_i-1)^2\right] \\
&= n\delta\,E\left[1-4q_i(1-q_i)\right] \\
&= n\delta \sum_{(m_{1i},\,m_{2i})} P(m_{1i}, m_{2i})\left[1-4\,P(g_i=1 \mid m_{1i}, m_{2i}) \times P(g_i=0 \mid m_{1i}, m_{2i})\right].
\end{aligned}
$$

The summation in the last row is over the four possible flanking marker genotypes $(m_{1i}, m_{2i}) \in \{(1,1),(1,0),(0,1),(0,0)\}$. Since the QTL is assumed to be located in the center of the marker interval, it can be easily seen that the expected value is $n\delta(1-r)[1-4q(1-q)] = \delta I(\delta=0; r_1, n)$. Hence, the mean of the score statistic is $E(Z) = \delta I(\delta=0; r_1, n)$.
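The reduction of the four-term sum to the closed form can be checked numerically. The sketch below is illustrative only and rests on assumptions not stated in this appendix: a backcross with no crossover interference, a hypothetical argument `r_half` for the recombination fraction between the QTL and each flanking marker, $r = 2\,r_{\text{half}}(1-r_{\text{half}})$ as the recombination fraction between the two flanking markers, and $q$ as the probability that the QTL genotype is 1 when both flanking markers are 1. The function names are hypothetical.

```python
from itertools import product


def expected_score_enumerated(n, delta, r_half):
    """n * delta * sum over flanking genotypes of P(m) * [1 - 4 P(g=1|m) P(g=0|m)].

    Assumes a backcross, a QTL midway between the markers, and no interference.
    """
    def joint(m1, g, m2):
        # P(m1, g, m2) = P(g) * P(m1 | g) * P(m2 | g), with P(g) = 1/2 in a backcross
        p = 0.5
        p *= (1 - r_half) if m1 == g else r_half
        p *= (1 - r_half) if m2 == g else r_half
        return p

    total = 0.0
    for m1, m2 in product((1, 0), repeat=2):
        p_m = joint(m1, 1, m2) + joint(m1, 0, m2)  # P(m1, m2)
        if p_m == 0.0:
            continue  # genotype class cannot occur (e.g. recombinants when r_half = 0)
        q = joint(m1, 1, m2) / p_m  # P(g = 1 | m1, m2)
        total += p_m * (1 - 4 * q * (1 - q))
    return n * delta * total


def expected_score_closed_form(n, delta, r_half):
    """n * delta * (1 - r) * [1 - 4 q (1 - q)] under the same assumptions."""
    r = 2 * r_half * (1 - r_half)        # recombination between flanking markers
    q = (1 - r_half) ** 2 / (1 - r)      # P(g = 1 | both flanking markers are 1)
    return n * delta * (1 - r) * (1 - 4 * q * (1 - q))


if __name__ == "__main__":
    print(expected_score_enumerated(100, 0.5, 0.1))
    print(expected_score_closed_form(100, 0.5, 0.1))
```

The two recombinant marker classes contribute nothing because the QTL genotype is equally likely to be 0 or 1 there, which is why only the factor $(1-r)$ survives.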

Derivation of Results 1 – 4

Results 1 – 4 are derived below. These results are applicable to any QTL model (for example, one-QTL or two-QTL models) and any cross (for example, backcross or F2).

Proof of Results 1 and 2:

On any chromosome $c$, let $W_1(c)$ and $W_2(c)$ denote the chromosome-wide maximum test statistic calculated using $n\theta_1$ and $n$ individuals, respectively. We observe $W_1(c)$ in Stage 1 and $W_2(c)$ in Stage 2 (if the chromosome is evaluated in Stage 2). The significance level in Stage 1, denoted $\alpha_1$, is the probability that $W_1(c)$ exceeds the critical value $b_1^2$ under the null hypothesis. The significance level in Stage 2, denoted $\alpha_2$, is the probability that $W_2(c)$ exceeds $b_2^2$ in Stage 2 conditional upon the fact that $W_1(c)$ exceeded $b_1^2$ in Stage 1. The probability of finding no false positive association on a chromosome $c$ under the null hypothesis is:

$$
\begin{aligned}
1-\frac{\alpha}{C} &= P_0(\text{there is no false positive association at the end of the study}) \\
&= P_0(\text{there is no false positive association in Stage 1}) \\
&\quad + P_0(\text{there is no false positive association in Stage 2 and there is a false positive association in Stage 1}) \\
&= P_0\left(W_1(c)<b_1^2\right) + P_0\left(W_2(c)<b_2^2,\ W_1(c)>b_1^2\right) \\
&= P_0\left(W_1(c)<b_1^2\right) + P_0\left(W_2(c)<b_2^2 \mid W_1(c)>b_1^2\right) \times P_0\left(W_1(c)>b_1^2\right) \\
&= (1-\alpha_1) + (1-\alpha_2)\times\alpha_1.
\end{aligned}
\tag{22}
$$

Since $\alpha_1$ and $\alpha_2 \in (0,1)$, this equation indicates that $1-\alpha_1 < 1-\alpha/C$, i.e., $\alpha_1 > \alpha/C$, proving Result 1. Further, it follows from the above equation that $\alpha_2 = \alpha/(C\times\alpha_1)$, proving Result 2.
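As a small worked illustration of Results 1 and 2, the sketch below computes the Stage 2 level from the genome-wide level, the number of chromosomes and the Stage 1 level. The function name `stage2_level` and the values $\alpha = 0.05$ and $C = 20$ are illustrative assumptions; $\alpha_1 = 0.2$ corresponds to the 20% Stage 1 level discussed in the article.

```python
def stage2_level(alpha, n_chrom, alpha1):
    """Stage 2 significance level alpha2 = alpha / (C * alpha1) (Result 2).

    Result 1 requires alpha1 > alpha / C; otherwise alpha2 would not lie in (0, 1).
    """
    per_chrom = alpha / n_chrom  # chromosome-wise level alpha / C
    if not per_chrom < alpha1 < 1:
        raise ValueError("alpha1 must satisfy alpha / C < alpha1 < 1")
    return per_chrom / alpha1


if __name__ == "__main__":
    alpha, n_chrom, alpha1 = 0.05, 20, 0.2  # illustrative values
    alpha2 = stage2_level(alpha, n_chrom, alpha1)
    # The two sides of equation (22) should agree up to rounding error.
    print(alpha2, (1 - alpha1) + (1 - alpha2) * alpha1, 1 - alpha / n_chrom)
```

With these inputs the Stage 2 level is approximately 0.0125, so a liberal Stage 1 screen is offset by a correspondingly stricter conditional test in Stage 2.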

Proof of Result 3:

The overall significance level of the two-stage design is the probability that a chromosome is declared significant in Stage 1 and Stage 2. Denoting $b_1^2$ and $b_2^2$ as the critical values in Stages 1 and 2, respectively, the overall significance level of the two-stage design is defined as $P_0\left(W_1(c)>b_1^2,\ W_2(c)>b_2^2\right)$. Further, we can write $P_0\left(W_2(c)>b_2^2\right)$ as

$$
\begin{aligned}
P_0\left(W_2(c)>b_2^2\right) &= P_0\left(W_2(c)>b_2^2,\ W_1(c)<b_1^2\right) + P_0\left(W_2(c)>b_2^2,\ W_1(c)>b_1^2\right) \\
&\ge P_0\left(W_2(c)>b_2^2,\ W_1(c)>b_1^2\right).
\end{aligned}
$$

Let $b^2$ denote the critical value of a one-stage design. The overall significance level of a one-stage design is $\alpha/C$, and is defined as the probability that the test statistic $W_2(c)$ exceeds $b^2$ under the null hypothesis, i.e., $\alpha/C = P_0\left(W_2(c)>b^2\right)$. Our goal is to design a two-stage procedure with an overall significance level of at most $\alpha/C$. Equivalently, we need the right hand side of the above equation to be at most $\alpha/C$. From the above equation it can be easily seen that when the critical value of Stage 2 is set equal to $b^2$, we have $P_0\left(W_2(c)>b^2,\ W_1(c)>b_1^2\right) \le \alpha/C$, proving Result 3.
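The inequality behind Result 3 can be illustrated with a small Monte Carlo experiment. The sketch below is a deliberately simplified stand-in for the chromosome-wide statistics: it assumes a single locus with a 1 degree of freedom statistic, Stage 1 data nested within the full sample, and unit-variance score contributions under the null hypothesis. The function name and the critical values 1.642 and 3.841 (the 20% and 5% points of a chi-square distribution with 1 degree of freedom) are illustrative assumptions, not thresholds from the article.

```python
import math
import random


def simulate_null(n, theta1, b1_sq, b_sq, n_rep=20000, seed=1):
    """Estimate P0(W2 > b^2) and P0(W2 > b^2, W1 > b1^2) at a single locus.

    W1 uses the first round(n * theta1) individuals; W2 uses all n individuals.
    """
    rng = random.Random(seed)
    n1 = int(round(n * theta1))
    if not 0 < n1 <= n:
        raise ValueError("theta1 must give a Stage 1 sample size in 1..n")
    one_stage = 0
    two_stage = 0
    for _ in range(n_rep):
        # A sum of k independent N(0, 1) score terms is N(0, k) under the null.
        s1 = rng.gauss(0.0, math.sqrt(n1))
        s2 = rng.gauss(0.0, math.sqrt(n - n1))
        w1 = s1 * s1 / n1            # Stage 1 statistic
        w2 = (s1 + s2) ** 2 / n      # statistic based on all individuals
        if w2 > b_sq:
            one_stage += 1
            if w1 > b1_sq:
                two_stage += 1
    return one_stage / n_rep, two_stage / n_rep


if __name__ == "__main__":
    print(simulate_null(n=200, theta1=0.6, b1_sq=1.642, b_sq=3.841))
```

Because the two-stage rejection event is a subset of the one-stage rejection event, the second estimate can never exceed the first; the gap between them indicates how conservative the choice $b_2^2 = b^2$ is in this simplified setting.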

Proof of Result 4:

Under a one-stage design, the power to detect a QTL under the alternative model is given by $1-\beta = P_A\left(Z>b_2^2\right)$, where $Z$ is a random variable distributed as non-central chi-square with 2 degrees of freedom and non-centrality parameter $\lambda = \lambda(\delta_a, \delta_d, n)$. The probability density of $Z$ is given by (Johnson, Kotz, and Balakrishnan 1995)

$$
f(z) = \exp\{-\lambda/2\}\sum_{r=0}^{\infty}\left(\frac{\lambda}{2}\right)^{r}\frac{1}{r!}\,g(z; 1+r, 1/2),
$$

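where $g(z; 1+r, 1/2)$ is the probability density of a gamma distribution with shape $1+r$ and rate $1/2$ (that is, a central chi-square distribution with $2+2r$ degrees of freedom), having cumulative distribution:

$$
G\left(z; 1+r, \tfrac{1}{2}\right) = \int_0^z \frac{1}{\Gamma(1+r)}\left(\frac{1}{2}\right)^{1+r} u^{r}\exp\left(-\frac{u}{2}\right)du.
$$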

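When the sample size is fixed and the QTL effects $\delta_a$ and $\delta_d$ are small, resulting in a small non-centrality parameter $\lambda$, $\beta = P_A\left(Z<b_2^2\right)$ can be written as:

$$
\begin{aligned}
\beta &= \int_0^{b_2^2} f(z)\,dz \\
&= \exp\{-\lambda/2\}\sum_{r=0}^{\infty}\left(\frac{\lambda}{2}\right)^{r}\frac{1}{r!}\,G\left(b_2^2; 1+r, 1/2\right) \\
&\approx \left(1-\frac{\lambda}{2}\right)\times\left(G_1+\frac{\lambda}{2}G_2\right).
\end{aligned}
$$

The second step follows by taking the integral inside the summation, since $G(\cdot)$ is a cumulative distribution and hence is in the interval (0, 1). The third step utilizes the fact that $\lambda$ is small and, hence, the first two terms are used to approximate the infinite sum. In the last step $G_1 = G\left(b_2^2; 1, 1/2\right)$ and $G_2 = G\left(b_2^2; 2, 1/2\right)$. For small $\lambda$, it can be easily seen that $\beta < G_1$. Therefore, solving the above quadratic equation for a given $\beta$ provides: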

$$
\lambda = \frac{G_2-G_1+\sqrt{(G_2-G_1)^2+4\,G_2\,(G_1-\beta)}}{G_2}.
$$

The right hand side depends upon the power $1-\beta$ and the critical value $b_2^2$, and is independent of the QTL effects. A similar expression can be written for $\lambda(\delta_a, \delta_d, n\theta_1)$, the non-centrality parameter of Stage 1, as a function of the Stage 1 power $1-\beta_1$ and the corresponding critical value $b_1^2$. Define $G_{21} = G\left(b_1^2; 2, 1/2\right)$ and $G_{11} = G\left(b_1^2; 1, 1/2\right)$. Therefore:

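$$
\theta_1 = \frac{\lambda(\delta_a, \delta_d, n\theta_1)}{\lambda(\delta_a, \delta_d, n)} = \frac{G_{21}-G_{11}+\sqrt{(G_{21}-G_{11})^2+4\,G_{21}\,(G_{11}-\beta_1)}}{G_2-G_1+\sqrt{(G_2-G_1)^2+4\,G_2\,(G_1-\beta)}}\times\frac{G_2}{G_{21}}.
$$

The right hand side, and hence $\theta_1$, is independent of the QTL effects.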
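The chain of steps in the proof of Result 4 can be traced numerically. The sketch below is a minimal illustration under stated assumptions: it uses the closed form of the gamma cumulative distribution for integer shape and rate 1/2, obtains $\lambda$ as the positive root of the quadratic in $\lambda/2$ implied directly by the two-term approximation $\beta \approx (1-\lambda/2)(G_1+(\lambda/2)G_2)$, and takes $\theta_1$ as the ratio of the Stage 1 and one-stage values of $\lambda$. All function names are hypothetical, and the critical values 9.21 and 3.22 used in the example are pointwise 1% and 20% points of a chi-square distribution with 2 degrees of freedom, not chromosome-wide thresholds from the article.

```python
import math


def gamma_cdf(z, shape):
    """CDF at z of a gamma distribution with integer shape and rate 1/2."""
    half = z / 2.0
    tail = sum(half ** j / math.factorial(j) for j in range(shape))
    return 1.0 - math.exp(-half) * tail


def beta_exact(lam, crit, terms=60):
    """Type II error from the Poisson-mixture series, truncated after `terms` terms."""
    x = lam / 2.0
    series = sum(x ** r / math.factorial(r) * gamma_cdf(crit, 1 + r) for r in range(terms))
    return math.exp(-x) * series


def beta_approx(lam, crit):
    """Small-lambda approximation (1 - lam/2) * (G1 + (lam/2) * G2)."""
    x = lam / 2.0
    return (1.0 - x) * (gamma_cdf(crit, 1) + x * gamma_cdf(crit, 2))


def lambda_from_beta(beta, crit):
    """Positive root in lambda of beta = (1 - lam/2) * (G1 + (lam/2) * G2)."""
    g1, g2 = gamma_cdf(crit, 1), gamma_cdf(crit, 2)
    if not beta < g1:
        raise ValueError("the approximation requires beta < G1")
    disc = (g2 - g1) ** 2 + 4.0 * g2 * (g1 - beta)
    return (g2 - g1 + math.sqrt(disc)) / g2


def stage1_fraction(beta, beta1, crit, crit1):
    """theta1 as the ratio of the Stage 1 to the one-stage non-centrality parameter."""
    return lambda_from_beta(beta1, crit1) / lambda_from_beta(beta, crit)


if __name__ == "__main__":
    lam, theta = 0.2, 0.6                    # illustrative values
    beta = beta_approx(lam, 9.21)            # one-stage type II error
    beta1 = beta_approx(lam * theta, 3.22)   # Stage 1 type II error
    print(stage1_fraction(beta, beta1, 9.21, 3.22))
    print(beta_exact(lam, 9.21), beta)       # exact series versus approximation
```

Only the type II errors and the critical values enter `stage1_fraction`, which mirrors the statement that $\theta_1$ does not depend on the QTL effects. The two-term approximation is accurate only when $\lambda$ is small, so the sketch should not be relied upon for designs with high power.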

Non-centrality parameter of F2 cross

The test statistic at the QTL position of an F2 cross has a $\chi^2$ distribution with 2 degrees of freedom. The non-centrality parameter can be obtained using the second derivative of the log-likelihood. The QTL is in the middle of a marker interval of length $\Delta$ centiMorgans. Using symbolic calculations, the non-centrality parameter is given by $A\delta_a^2 + D\delta_d^2$, where:

$$
\begin{aligned}
A &= [1-4q(1-q)](1-r), \\
D &= \frac{A}{2}\times\frac{a_1}{a_2}, \\
a_1 &= 6r^4-12r^3+10r^2-4r+1, \\
a_2 &= 8r^4-16r^3+12r^2-4r+1.
\end{aligned}
$$

Contributor Information

Jaya M. Satagopan, Memorial Sloan-Kettering Cancer Center

Saunak Sen, University of California, San Francisco.

Gary A. Churchill, The Jackson Laboratory

References

  1. Broman K and Speed T (1999). A review of methods for identifying QTLs in experimental crosses. In Seillier-Moiseiwitsch F (Ed.), Statistics in genetics and molecular biology, Volume 33 of IMS Lecture Notes - Monograph Series, pp. 114–142. Institute of Mathematical Statistics and American Mathematical Society, Hayward, CA.
  2. Churchill GA and Doerge RW (1994). Empirical threshold values for quantitative trait mapping. Genetics 138, 963–971.
  3. Churchill GA, Giovannoni JJ, and Tanksley SD (1993). Pooled-sampling makes high-resolution mapping practical with DNA markers. Proceedings of the National Academy of Sciences 90, 16–20.
  4. Cox D and Hinkley D (1974). Theoretical Statistics. Chapman and Hall, London.
  5. Darvasi A and Soller M (1992). Selective genotyping for determination of linkage between a marker locus and a quantitative trait locus. Theoretical and Applied Genetics 85, 353–359.
  6. Darvasi A and Soller M (1994). Optimum spacing of genetic markers for determining linkage between marker loci and quantitative trait locus. Theoretical and Applied Genetics 89, 351–357.
  7. Dupuis J and Siegmund D (1999). Statistical methods for mapping quantitative trait loci from a dense set of markers. Genetics 151, 373–386.
  8. Elston RC (1994). P value, power, and pitfalls in the linkage analysis of psychiatric disorders. In Gershon E and Cloninger C (Eds.), Genetic Approaches to Mental Disorders: Proceedings of the Annual Meeting of the American Psychopathological Association, pp. 3–21. American Psychiatric Press, Washington D.C.
  9. Elston RC, Guo X, and Williams LV (1996). Two-stage global search designs for linkage analysis using pairs of affected relatives. Genetic Epidemiology 13, 535–558.
  10. Elston RC, Lin D, and Zheng G (2007). Multi-stage sampling for genetic studies. Annual Review of Genomics and Human Genetics 00, 00–00.
  11. Feingold E, Brown PO, and Siegmund D (1993). Gaussian models for genetic linkage analysis using complete high-resolution maps of identity by descent. American Journal of Human Genetics 53, 234–251.
  12. Jennison C and Turnbull BW (2000). Group sequential methods with application to clinical trials. Chapman and Hall, New York.
  13. Jin C, Lan H, Attie AD, Churchill GA, and Yandell BS (2004). Selective phenotyping for increased efficiency in genetic mapping studies. Genetics 168, 2285–2293.
  14. Johnson N, Kotz S, and Balakrishnan N (1995). Continuous univariate distributions, Volume 2. John Wiley and Sons, New York.
  15. Lander ES and Botstein D (1989). Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121, 185–199.
  16. Lin DY (2006). Evaluating statistical significance in two-stage genomewide association studies. American Journal of Human Genetics 78, 505–509.
  17. Malley JD, Naiman DQ, and Bailey-Wilson JE (2002). A comprehensive method for genome scans. Human Heredity 54, 174–185.
  18. Maraganore DM, de Andrade M, Lesnick TG, et al. (2005). High-resolution whole-genome association study of Parkinson disease. American Journal of Human Genetics 77, 685–693.
  19. Marchini J, Donnelly P, and Cardon LR (2005). Genome-wide strategies for detecting multiple loci that influence complex diseases. Nature Genetics 37, 413–417.
  20. Medugorac I and Soller M (2001). Selective genotyping with a main trait and correlated trait. Journal of Animal Breeding and Genetics 118, 285–295.
  21. Ott J (1991). Analysis of Human Genetic Linkage. The Johns Hopkins University Press, Baltimore.
  22. Prentice RL and Qi L (2007). Aspects of the design and analysis of high-dimensional SNP studies for disease risk estimation. Biostatistics 7, 339–354.
  23. Quinn BG, McLachlan GJ, and Hjort NL (1987). A note on the Aitkin-Rubin approach to hypothesis testing in mixture models. Journal of the Royal Statistical Society – Series B 49, 311–314.
  24. Satagopan JM and Elston RC (2003). Optimal two-stage genotyping in population-based association studies. Genetic Epidemiology 25, 149–157.
  25. Satagopan JM, Venkatraman ES, and Begg CB (2004). Two-stage designs for gene-disease association studies with sample size constraints. Biometrics 60, 589–597.
  26. Satagopan JM, Verbel DA, Venkatraman ES, Offit KE, and Begg CB (2002). Two-stage designs for gene-disease association studies. Biometrics 58, 163–170.
  27. Sen Ś and Churchill GA (2001). A statistical framework for quantitative trait mapping. Genetics 159, 371–387.
  28. Sen Ś, Satagopan JM, and Churchill GA (2005). Quantitative trait locus study design from an information perspective. Genetics 170, 447–464.
  29. Skol A, Scott LJ, Abecasis GR, and Boehnke M (2006). Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nature Genetics 38, 209–213.
  30. Sugiyama F, Churchill GA, Higgins DC, Johns C, et al. (2001). Concordance of murine quantitative trait loci for salt-induced hypertension with rat and human loci. Genomics 71, 70–77.
  31. Thomas DC (2005). The need for a systematic approach to complex pathways in molecular epidemiology. Cancer Epidemiology Biomarkers and Prevention 14, 557–559.
  32. Wang H, Thomas DC, Peer I, and Stram DO (2006). Optimal two-stage genotyping designs for genome-wide association scans. Genetic Epidemiology 30, 356–358.
  33. Zuo Y, Zuo G, and Zhao H (2006). Two-stage designs in case-control association analysis. Genetics 173, 1747–1760.
