Summary.
Gene–disease association studies based on case–control designs are often used to identify candidate polymorphisms (markers) conferring disease risk. If a large number of markers are studied, genotyping all markers on all samples is inefficient in resource utilization. Here, we propose an alternative two-stage method to identify disease-susceptibility markers. In the first stage all markers are evaluated on a fraction of the available subjects. The most promising markers are then evaluated on the remaining individuals in Stage 2. This approach can be cost effective since markers unlikely to be associated with the disease can be eliminated in the first stage. Using simulations we show that, when the markers are independent and when they are correlated, the two-stage approach provides a substantial reduction in the total number of marker evaluations for a minimal loss of power. The power of the two-stage approach is evaluated when a single marker is associated with the disease, and in the presence of multiple disease-susceptibility markers. As a general guideline, the simulations over a wide range of parametric configurations indicate that evaluating all the markers on 50% of the individuals in Stage 1 and evaluating the most promising 10% of the markers on the remaining individuals in Stage 2 provides near-optimal power while resulting in a 45% decrease in the total number of marker evaluations.
Keywords: Linkage disequilibrium, Optimal design, Order statistic, Power
1. Introduction
Linkage or pedigree analyses have been widely used in genetic epidemiologic studies to identify genetic regions contributing to Mendelian disorders (Ott, 1991; Weir, 1996). These methods utilize information from recombination between the markers (i.e., candidate polymorphisms) to estimate the location of the risk-conferring gene. Disease-susceptibility loci can generally be localized to fairly large intervals (for example, 10–20 Mb size intervals) using these methods (Jorde, 2000), and refinement of the region would be necessary for practical use such as positional cloning. This, however, requires analyses using very large pedigrees in order to identify sufficient numbers of recombination events (Boehnke, 1994). Alternatively, population-based gene–disease association studies using a large number of polymorphic markers have been proposed as useful methods to identify disease-susceptibility loci (Risch, 2000). These studies are based on the underlying assumption that a population-based sample of unrelated individuals is only unrelated in a relative sense, i.e., the genomes of unrelated individuals will be more distantly related than samples ascertained from pedigrees, thus providing a higher chance that recombination events have taken place (Nordborg and Tavare, 2002).
Association studies are carried out using the whole-genome approach or the candidate gene approach. The whole-genome approach evaluates genetic loci spaced throughout the genome to identify the markers associated with disease. Under this approach, the markers are not preselected with regard to their function or possible contributions to disease etiology. Genetic epidemiologic studies, on the other hand, may be based on candidate genes or genetic pathways contributing to disease incidence or progression. Rather than evaluate markers evenly spaced along the entire genome, these studies use rigorous epidemiologic principles and a hypothesis-driven approach to select candidate genes that point to functional pathways of interest (Tabor, Risch, and Myers, 2002). Specific polymorphisms in these candidate regions are evaluated to identify association with disease. Thus, these studies utilize a priori knowledge (or putative regions identified using linkage analyses) to investigate specific candidate polymorphisms or markers, and constitute a viable hypothesis-driven strategy to clarify the genetic mechanisms underlying complex diseases. These studies, as a result, would realistically involve investigating fewer candidate polymorphisms or genetic loci in specific candidate regions of interest.
In this article, we focus on candidate gene association studies where the candidate markers are evaluated on affected cases and unaffected controls. When the cases and controls are unrelated, a chi-square test for trend (Armitage, 1955; Sasieni, 1997) can be used to compare the allele or genotype frequencies between the cases and controls at every marker locus. Under a one-stage design, every candidate marker is genotyped on all the cases and controls. This one-stage strategy, however, may not be an efficient use of the available resources when studying many candidate markers, since markers not exhibiting substantial evidence for association with the disease can be eliminated early on in the study. Recently, we have shown that a two-stage design is an optimal strategy to search for disease genes when the total number of marker evaluations, not the total number of subjects, is the primary study constraint (Satagopan et al., 2002). In this approach, at Stage 1 all the markers are genotyped by ascertaining some cases and controls. Thus, only a fraction of the total marker evaluations (equivalently, the total available resources) are utilized at Stage 1. At Stage 2, the promising markers are further evaluated using additional cases and controls. Using simulations we showed that this two-stage approach is more powerful than a one-stage approach. In particular, a rule-of-thumb strategy providing a near-optimal power would be to first screen all the markers using 75% of the marker evaluations, and screen the most promising 10% of the markers with the remaining resources. This rule-of-thumb two-stage design would enable genotyping 225% more individuals than a one-stage design, thus resulting in a substantial increase in power to detect the true markers of association. Here, power is defined as the probability that at least one of the true marker(s) of association will have test statistics larger than the test statistics of the null markers.
In practice, however, the total number of subjects will often be a limiting factor. A two-stage approach can also provide an efficient genotyping strategy in this setting where at Stage 1 all the markers are evaluated on a subset of cases and controls. The most promising markers are further evaluated at Stage 2 on the remaining individuals, and tested using all the individuals. Satagopan and Elston (2003) used a hypothesis-testing paradigm to evaluate this two-stage design when the markers are independent. At every marker, the null hypothesis of no association is tested against the alternative that the marker is associated with the disease. Under this paradigm, the statistical power has the conventional definition of identifying at least one true marker of association. In other words, the statistical power is the probability that the test statistic of at least one true marker exceeds a desired critical value under the alternative hypothesis. It was shown that the two-stage design can nearly halve the cost of the study when the markers are independent.
In this article, we show how to derive an optimal two-stage design for association studies when the total number of subjects (n) is fixed. We consider the scenario where one or more markers are associated with the disease, and evaluate the optimal two-stage design when the markers are independent. If the markers are densely spaced they may be correlated due to linkage disequilibrium. Thus, we also examine the setting in which the markers are correlated. Following our prior work in Satagopan et al. (2002), we define power as the probability that true markers associated with the disease have the largest test statistics. Denoting D (≥1) as the total number of disease loci, our goal is to identify a desired number of d (1 ≤ d ≤ D) disease loci. Section 2 describes the two-stage design and the cost functions of the one- and two-stage designs. In Section 3, we discuss the power functions of the two-stage design for independent as well as correlated markers. Comparisons between the power of the one- and two-stage designs under various parametric configurations are presented in Section 4. We show that, as before, near-optimal power and a considerable reduction in the number of marker evaluations can be achieved by performing a two-stage design irrespective of the extent of correlation between the markers. Finally, we show that when the markers are independent and when the total number of markers m is large, this approach is equivalent to the hypothesis-testing framework, i.e., evaluating the top i proportion of markers at Stage 2 is equivalent to testing all the m markers at significance level i at Stage 1.
2. Design
Consider a genome of interest marked with a total of m genetic markers, and a fixed number of n subjects available for the study. We assume that there are equal numbers of cases and controls. Consider the situation when D (≥1) of the m markers are the true markers of risk, and the remaining m – D markers are not associated with risk. Here, we assume that these D markers are in complete linkage disequilibrium with the D disease loci (i.e., the D markers themselves are the disease loci). The goal of the association study is to identify a minimum desired number d (1 ≤ d ≤ D) of these true markers. For simplicity, we assume a unit cost per genetic marker evaluation. At every marker locus one would compute a test statistic, for example a chi-square statistic based on a 2 × 2 table. The decision rule then would be to select the markers with the d largest test-statistic values as the putative risk-conferring loci. This is a one-stage design. This design would result in a total of T1 = n × m marker evaluations. In the absence of any constraints on the number of marker evaluations, the one-stage approach would be the most powerful strategy to identify the true disease-susceptibility loci. Here, we define power as the probability that d true markers have the largest d test statistics at the end of the study. However, this approach can be inefficient in resource utilization since it requires evaluating a large number of markers that can be identified early on in the study as being extremely unlikely to be associated with the disease.
Consider, instead, the following two-stage approach. At Stage 1, we evaluate all m markers on a total of n1 individuals and calculate a test statistic corresponding to every marker. Let n1 = nj, where j denotes the proportion of the total available sample size (n) that will be used in Stage 1. Rank the markers based on their test-statistic values. Select the top i (0 < i < 1) proportion of these markers, i.e., select the top mi markers. At Stage 2, we evaluate these mi markers on the remaining n2 = n – n1 individuals. Calculate the same test statistic for each of the mi markers based on the information from all the n individuals. Rank the mi markers based on their test-statistic values, and select the d markers with the largest statistics. This two-stage approach requires T2 marker evaluations given by
T2 = mnj + (mi)n(1 − j) = mn{j + i(1 − j)}.   (1)
Clearly, T2 < T1 since 0 < i, j < 1. The reduction in the number of marker evaluations in performing a two-stage relative to a one-stage design is then given by
(T1 − T2)/T1 = 1 − {j + i(1 − j)} = (1 − i)(1 − j),   (2)
i.e., 100(1 – j)(1 – i)%. Hence, if the two-stage design were to provide power close to that of a one-stage design, then it would be more cost effective to perform a two-stage design.
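To make the cost comparison concrete, the short sketch below (Python; the function name is ours) evaluates equations (1) and (2). For the rule-of-thumb design discussed later (i = 0.10, j = 0.50), the reduction in marker evaluations works out to 45%.

```python
def marker_evaluations(m, n, i, j):
    """Marker evaluations for one- and two-stage designs.

    m: number of markers; n: number of subjects;
    i: proportion of markers carried to Stage 2 (0 < i < 1);
    j: proportion of subjects genotyped in Stage 1 (0 < j < 1).
    """
    t1 = m * n                                   # one-stage: all markers on all subjects
    t2 = m * (n * j) + (m * i) * (n * (1 - j))   # equation (1)
    reduction = 1 - t2 / t1                      # equation (2): (1 - i)(1 - j)
    return t1, t2, reduction

# Rule-of-thumb two-stage design: screen everyone on half the sample,
# then genotype the best 10% of markers on the other half.
t1, t2, reduction = marker_evaluations(m=3000, n=1000, i=0.10, j=0.50)
```

With these values, T1 = 3,000,000 and T2 = 1,650,000 evaluations, a 45% saving.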
3. Power
3.1. Power of a One-Stage Design
The power of a one-stage design, denoted as P*, can be written as follows. We define power as the probability that at least d out of the D true markers of association have test statistics that are larger than all of the null markers. Any typical test statistic for association at a marker locus derived from n individuals will have an asymptotic normal distribution N(nμ, nσ²), where μ and σ² are the mean and variance of the distribution, respectively. The mean μ would be zero when there is no association between the disease and the marker. The variance σ² may depend upon factors such as allele frequency, age of the population, age of the mutation, and recombination rate between the marker and the disease-susceptibility locus (or the recombination between the marker and its flanking markers). By appropriately scaling the test statistic it can be assumed, without loss of generality, that σ² = 1. Thus, the test statistic at a marker locus derived from n individuals has an asymptotic N(nμ, n) distribution. We assume for simplicity that the strengths of the signal scaled in this fashion are identical for the D true markers. Henceforth, we will consider standardized test statistics. Therefore, the test statistic at a marker locus has an asymptotic N(μn^1/2, 1) distribution.
Independent markers.
For simplicity, we first assume that the markers are independent and hence, their test statistics are independent. Let X1, X2, … , XD denote the test statistics of the D true markers of association. Then Xs ~ N(μn^1/2, 1), s = 1, … , D. Let Y1, Y2, … , Ym–D denote the test statistics of the m – D null markers. As the null markers will not exhibit a signal indicating association with the disease, the mean of these test statistics is μ = 0. Therefore, Yt ~ N(0, 1), t = 1, … , m – D.
Let Y(m–D) denote the maximum of the m – D null marker test statistics. Then, the random variable Gm–D = bm–D(Y(m–D) – am–D) has a Gumbel distribution with probability distribution given by P(Gm–D ≤ g) = exp{−e^−g} (Johnson, Kotz, and Balakrishnan, 1995), where the normalizing constants are

bm–D = {2 log(m − D)}^1/2 and am–D = bm–D − [log log(m − D) + log(4π)] / (2bm–D).
The test statistics of the true markers can be written as Xs = Zs + μn^1/2, where Zs ~ N(0, 1), s = 1, … , D. Let ϕ(·) denote the probability density function of a standard normal distribution. The power of a one-stage design, P*, is the probability that at least d of the D true markers have test statistics larger than Y(m–D). Because the markers are assumed to be independent, P* can be written using a binomial probability as
P* = Σ_{u=d}^{D} (D choose u) π^u (1 − π)^{D−u},   (3)
where π is the probability that a single true marker has a test statistic larger than Y(m–D), given by

π = ∫_{−∞}^{∞} {Φ(z + μn^1/2)}^{m−D} ϕ(z) dz,

where Φ(·) denotes the cumulative distribution function of a standard normal distribution.
For given values of m, D, d, n, and μ, we can evaluate π, and hence P*, using numerical integration.
Correlated markers.
In practice, the markers may be correlated as a result of linkage disequilibrium (LD). Several measures are used to quantify the extent of LD between a pair of markers (Devlin and Risch, 1995). All of these measures are functions of the difference between the joint probability of the alleles at the two loci (i.e., haplotype probability) and the product of their marginal probabilities. Numerous phenomena influence the correlation between markers. These include the recombination rate, the age of the mutation, mutation rate, selection, and population growth (Ardlie, Kruglyak, and Seielstad, 2002). These factors will influence both the mean and the variance of the test statistics at every locus under investigation. More specifically, in the presence of LD, a null marker may no longer have a test statistic with mean 0. Instead, the mean of the test statistic of the null marker will be influenced by the mean of a disease-susceptibility locus (in its neighborhood).
A simple framework treats the decay of LD in the neighborhood of a disease locus as a function of the genetic distance (or, equivalently, recombination under dense mapping) between the disease and marker loci (Devlin and Risch, 1995; Abecasis et al., 2001). Under this framework, the LD between the disease and marker loci decays exponentially as a function of recombination between the two loci (and the age of the mutation). Therefore, the expected value of the test statistic has a maximum at the disease-susceptibility locus, and decreases exponentially as the genetic distance between the disease and marker loci increases (Tang and Siegmund, 2001).
Here, we take ρ to represent the correlation between the test statistics of two adjacent markers; this correlation decreases as the distance between the two markers increases. Therefore, when ρ is near 1, the distance (or recombination) between the two markers is small. Likewise, when ρ is near 0, the two markers are spaced widely apart. Let ρ0 represent the correlation induced by other factors such as the allele frequencies of the markers, the age of the disease mutation(s), and selection (or a distant relationship between the cases and controls). We treat ρ0 as a background correlation that influences the covariance of the marker test statistics. In other words, ρ0 is an overdispersion parameter that corresponds to variance inflation due to factors in addition to the genetic distance or recombination between the markers. The mean and covariance matrix of the test statistics are modeled as follows. We treat the markers as equally spaced within a candidate genomic region (or regions) of interest, and assume, for simplicity, that ρ is the same for every adjacent pair of markers (recognizing that this will always be, at best, a crude approximation). Suppose there is a single true marker at locus s having signal μ and two null markers at the adjacent loci s − 1 and s + 1; then the signals of these null markers are proportional to μρ. The signals of the null markers at loci s − 2 and s + 2 are proportional to μρ². In general, we can write the signal of a null marker at locus t as being proportional to μρ^∣t–s∣, where ∣t – s∣ denotes the number of intervals between the true marker at locus s and the null marker at locus t.
Suppose we number the m markers consecutively 1, … , m in a linear order, and markers a1, … , aD are the identifiers of the D true markers associated with the disease and the remaining m – D markers with identifiers u1, … , um–D are null markers (note that the identifiers a1, … , aD, u1, … , um–D are a permutation of 1, … , m). The expected signal of a null marker at locus number t is then proportional to μ Σ_{k=1}^{D} ρ^∣t – ak∣, t = u1, … , um–D. Following Tang and Siegmund (2001), the covariance matrix of the test statistics is given by Σ = (1 − ρ0)AR(ρ) + ρ0 11′, where 1 is a vector containing 1’s of length m and AR(ρ) denotes the m × m matrix with (k, l)th entry ρ^∣k – l∣.
Note that this covariance matrix has unit variances. Obtaining an analytic formula for the power function is not straightforward when the markers are correlated. Let X(1), X(2), … , X(D) denote the ordered test statistics of the D true markers. Therefore, while X(u) denotes the uth smallest test statistic, the notation X(D–u+1) will be used to denote the uth largest test statistic (u = 1, … , D). In other words, when u = 1, we have X(D–u+1) = X(D), the largest test statistic, and when u = 2 it is the second largest test statistic X(D–1), etc. The power of the one-stage design is then given by
P* = P{X(D–d+1) > Y(m–D)},   (4)
and can be approximated using Monte Carlo simulations.
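A Monte Carlo approximation to equation (4) can be sketched as follows (Python). This is an illustrative sketch under stated assumptions: the true loci are placed evenly through the region (a hypothetical choice), the null means decay as μρ raised to the distance from each true locus, and the covariance is Σ = (1 − ρ0)AR(ρ) + ρ0 11′ as above.

```python
import numpy as np

def one_stage_power_corr(m, D, d, n, mu, rho, rho0, reps=2000, seed=0):
    """Monte Carlo estimate of P* = P{X(D-d+1) > Y(m-D)} for correlated markers."""
    rng = np.random.default_rng(seed)
    idx = np.arange(m)
    # Hypothetical placement: D true loci spread evenly through the region.
    true_loc = np.round(np.linspace(0, m - 1, D + 2)[1:-1]).astype(int)
    null_mask = np.ones(m, dtype=bool)
    null_mask[true_loc] = False
    # Means: mu*sqrt(n) at the true loci; null means decay as rho^distance.
    means = np.zeros(m)
    for a in true_loc:
        means += mu * np.sqrt(n) * rho ** np.abs(idx - a)
    means[true_loc] = mu * np.sqrt(n)
    # Covariance (1 - rho0)*AR(rho) + rho0*11' has unit variances.
    dist = np.abs(idx[:, None] - idx[None, :])
    sigma = (1 - rho0) * rho ** dist + rho0
    chol = np.linalg.cholesky(sigma + 1e-9 * np.eye(m))
    xs = means[:, None] + chol @ rng.standard_normal((m, reps))
    dth_true = np.sort(xs[~null_mask], axis=0)[-d]  # dth largest true statistic
    max_null = xs[null_mask].max(axis=0)            # largest null statistic
    return float(np.mean(dth_true > max_null))
```

Under independence (ρ = ρ0 = 0), this reduces to simulating m independent normals, matching the setting of Section 3.1.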
3.2. Power of a Two-Stage Design
The power of a two-stage design, denoted as P, is the probability of selecting a desired number of d disease-susceptibility loci at the end of the study, and can be written as follows. Let P1 denote the probability that the outcomes of at least d disease loci are among the top mi marker outcomes in Stage 1. Let P2 denote the probability that the outcomes of d of these disease loci are the largest among the mi markers at the end of Stage 2, given that each of these true markers is selected at the end of Stage 1 for further evaluation in Stage 2. Then, P = P1 × P2. These probabilities can be written as follows.
For s = 1, … , D, let X1s denote the test statistic of the sth true marker in Stage 1 derived from n1 = nj individuals, and X2s denote the test statistic of the same marker derived from the aggregate information provided by all n individuals at the end of Stage 2. Then X1s ~ N(μ(nj)^1/2, 1) and X2s ~ N(μn^1/2, 1). When the markers are independent, the test statistics of the m – D null markers have mean 0 in both Stages 1 and 2. When the markers are correlated, the test statistic of a null marker at locus t has mean (nk)^1/2 μt in Stage k = 1, 2, with nk = n1 in Stage 1 and nk = n in Stage 2, and the mean parameter μt is as described earlier.
We write the power of the two-stage design using the ranked test statistics as follows. Let X(1,D) ≥ X(1,D–1) ≥ ⋯ ≥ X(1,1) denote the ordered test statistics of the D disease loci in Stage 1. Further, let Z(1, m) ≥ Z(1,m–1) ≥ ⋯ ≥ Z(1, 1) represent the ordered test statistics of all the m markers (true and null markers combined) in Stage 1. In Stage 1 we need at least d of the D disease loci to be among the top mi outcomes. Hence, P1 can be written as
P1 = P{X(1,D–d+1) ≥ Z(1,m–mi+1)},   (5)

i.e., the dth largest test statistic among the disease loci is at least as large as the mith largest test statistic overall in Stage 1.
We identify mi markers at the end of Stage 1 for further evaluation in Stage 2. Suppose u of these are true disease-susceptibility loci, where d ≤ u ≤ min(D, mi). In other words, denoting I(·) as the indicator function, we select u = Σ_{s=1}^{D} I{X1s ≥ Z(1,m–mi+1)} disease loci in Stage 1. Let X(2,u) ≥ X(2,u–1) ≥ ⋯ ≥ X(2,1) denote the order statistics of these u true markers in Stage 2. Further, let Z(2,mi) ≥ Z(2,mi–1) ≥ ⋯ ≥ Z(2,1) represent the order statistics of all the mi markers (true and null markers combined) at the end of Stage 2. We need d of the disease loci to have the largest test statistics in Stage 2. Therefore, P2 is given by
P2 = P{X(2,u–d+1) ≥ Z(2,mi–d+1)},   (6)

i.e., the dth largest test statistic among the selected disease loci is among the d largest test statistics overall at the end of Stage 2.
Both P1 and P2 are functions of the signal of the true markers, total sample size, the total number of markers, the proportion of samples considered in Stage 1 (j), and the proportion of markers evaluated in Stage 2 (i). These probabilities can be evaluated using Monte Carlo simulations. The goal is to identify i and j such that the difference between P and P* is small (i.e., within 1–10%). We do this by dividing the unit square (corresponding to 0 < i < 1 and 0 < j < 1) into several fine grids. For every point on the grid (which corresponds to a value of i and j) we calculate the power P of the two-stage design using Monte Carlo simulations. We find values of i and j for which P is within 1–10% of P*. Finally, we calculate the corresponding cost fraction T2 / T1 from equation (2) to find the smallest cost fraction for which the power of the two-stage design is close to the power of a one-stage design.
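The Monte Carlo evaluation of P over a grid of (i, j) values can be sketched as follows for independent markers (Python; the function name and simulation sizes are ours). The full-data statistic is formed as √j · X1 + √(1 − j) · X2new, where X2new is the independent statistic from the Stage-2 subjects; this combination reproduces the N(μn^1/2, 1) distribution of the statistic based on all n individuals.

```python
import numpy as np

def two_stage_power(m, D, d, n, mu, i, j, reps=2000, seed=0):
    """Monte Carlo estimate of P = P1 * P2 for independent markers.

    The first D marker indices play the role of the true disease loci."""
    rng = np.random.default_rng(seed)
    mi = int(round(m * i))
    mean1 = np.zeros(m)
    mean1[:D] = mu * np.sqrt(n * j)              # X1s ~ N(mu*(nj)^(1/2), 1)
    hits = 0
    for _ in range(reps):
        x1 = rng.standard_normal(m) + mean1
        keep = np.argsort(x1)[-mi:]              # top mi markers enter Stage 2
        mean_new = (keep < D) * mu * np.sqrt(n * (1 - j))
        x_new = rng.standard_normal(mi) + mean_new
        x2 = np.sqrt(j) * x1[keep] + np.sqrt(1 - j) * x_new  # full-data statistic
        top = keep[np.argsort(x2)[-d:]]          # d largest combined statistics
        hits += int(np.sum(top < D) == d)        # all d selected are true loci
    return hits / reps
```

Scanning this function over a fine grid of (i, j) and comparing against the one-stage power gives the cost-power trade-off described above.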
In this formulation of the problem, we have defined power as the probability that a desired number of true markers will have the maximum test statistics at the end of the study, as opposed to the conventional statistical definition where power is the probability that a true marker’s test statistic exceeds a critical value (at a desired overall significance level α). The sample size n1 at Stage 1 is a fraction j of the total sample size n, i.e., n1 = nj. Alternatively, one could consider the sample size n1 to be such that the statistical power to detect the true marker(s) of association at Stage 1 is 1 – β1 at significance level α1. The markers significant at level α1 will then be evaluated at Stage 2 using the remaining n2 = n – n1 individuals (Satagopan and Elston, 2003). When the markers are independent, the expected number of markers carried over to Stage 2 is given by m1 = (m – D)α1 + D(1 – β1). Here we have m1 = mi. Therefore,
α1 = {mi − D(1 − β1)}/(m − D),

and α1 → i as m → ∞. Hence, when the markers are independent, selecting the top i proportion of markers at the end of Stage 1 is equivalent to testing all the m markers at significance level i.
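This limit is easy to verify numerically; the helper below (hypothetical name) solves the expected-count equation m1 = (m − D)α1 + D(1 − β1) = mi for α1:

```python
def stage1_alpha(m, i, D, power1):
    """Solve (m - D)*alpha1 + D*power1 = m*i for alpha1,
    where power1 = 1 - beta1 is the Stage-1 power."""
    return (m * i - D * power1) / (m - D)

# alpha1 approaches i = 0.10 as the number of markers m grows.
vals = [stage1_alpha(m, 0.10, D=5, power1=0.9) for m in (100, 1000, 10000)]
```

For m = 100, 1000, and 10000 the values are approximately 0.058, 0.096, and 0.0996, approaching i = 0.10.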
4. Results
The power of the one- and two-stage designs was evaluated for combinations of m = 3000, 1000, 500, 200, and 100 markers and n = 1000, 200, 100, and 50 individuals. The correlation matrix was evaluated for ρ = 0 (independent markers), 0.3, 0.6, 0.8, 0.9, and 0.95, and ρ0 = 0.1 when ρ > 0. The values of the signal μ were calculated such that the power of a one-stage design is approximately 80%. We considered a total of D = 1, 5, and 10 true disease-susceptibility loci when m > 100. We evaluated the power to detect d = 1, 2, and 5 loci when D = 5, and the power to detect d = 1, 2, 5, and 10 loci when D = 10. When m = 100, we considered D = 1 and 2 disease loci, and evaluated the power to detect d = 1 and 2 disease loci when D = 2.
4.1. Independent Markers
We first describe the results corresponding to D = 1 disease locus with independent markers. The simulations indicate that, when the markers are independent and when there is a single disease-susceptibility locus, a two-stage design provides near-optimal power for a range of values of the design parameters i and j while providing a substantial relative decrease in cost. Table 1 gives the power of two-stage designs for m = 3000 markers and choices of design parameters i = 10%, 15%, 20%, and j = 25%, 30%, 50% with D = 1 disease locus. Column 1 gives the sample size n. Column 2 provides the signal μ. Column 3 gives the estimated power of a one-stage design. Column 4 gives the power of a two-stage design corresponding to the values of i and j given in parentheses. Finally, column 5 shows the cost decrease, i.e., the decrease in the cost of performing a two-stage design relative to a one-stage design (equation [2]). It can be seen that a two-stage design provides near-optimal power for various values of i and j. In particular, the decrease in power is between 1% and 2% for i = 0.20 and j = 0.30, and at most 1% for i = 0.10 and j = 0.50. The substantial decrease in the relative number of marker evaluations in a two-stage approach is evident from column 5.
Table 1.
Power of one- and two-stage designs for m = 3000 markers, n = 100 and 1000 samples, and D = 1 disease-susceptibility locus. The signal μ of the disease locus corresponds to a one-stage design with approximately 80% power. The design parameters i and j corresponding to the power of the two-stage design are given in parentheses in column 4. Column 5 provides the decrease in cost of a two-stage relative to a one-stage design as given by equation (2). The cost fraction is, therefore, 100 – cost decrease.
| n | μ | One-stage power | Two-stage power (i, j) | Cost decrease (%) |
|---|---|---|---|---|
| 1000 | 0.14 | 0.80 | 0.71 (0.10, 0.25) | 67.5 |
|  |  |  | 0.75 (0.15, 0.25) | 63.75 |
|  |  |  | 0.78 (0.20, 0.30) | 56.0 |
|  |  |  | 0.80 (0.10, 0.50) | 45.0 |
|  |  |  | 0.80 (0.15, 0.50) | 42.5 |
|  |  |  | 0.80 (0.20, 0.50) | 40.0 |
| 100 | 0.445 | 0.81 | 0.72 (0.10, 0.25) | 67.5 |
|  |  |  | 0.75 (0.15, 0.25) | 63.75 |
|  |  |  | 0.79 (0.20, 0.30) | 56.0 |
|  |  |  | 0.80 (0.10, 0.50) | 45.0 |
|  |  |  | 0.80 (0.15, 0.50) | 42.5 |
|  |  |  | 0.80 (0.20, 0.50) | 40.0 |
The simulations using independent markers for various combinations of m and n with D = 1 suggest that for i ∈ (0.10, 0.25) and j ∈ (0.30, 0.60), the loss of power from performing a two-stage design is between 1% and 10% (Table 1 shows partial results for m = 3000). From equation (2) it can be seen that specifying i and j is equivalent to specifying i and the cost fraction T2/T1. (Note that a low cost fraction represents the selection of fewer markers for evaluation in the second stage.) Therefore, the loss of power is between 1% and 10% when i ∈ (0.10, 0.25) and the cost fraction T2/T1 ∈ (0.37, 0.70) (using equation [2]). Figure 1 gives the power of a two-stage design as a function of i for T2/T1 between 0.25 and 0.60 in steps of 0.05, m = 3000 (solid lines) and 100 markers (dashed lines), n = 1000, D = 1 disease locus, and signal μ chosen such that a one-stage design has approximately 80% power. Several patterns emerge from this figure. The power increases as the cost fraction increases. When T2/T1 > 0.50, sufficient power can be achieved for i in the range 0.10–0.30. Therefore, when at least 55% of the total marker evaluations are feasible, a judicious choice of i in the range 0.10–0.30 will lead to near-optimal power. When T2/T1 is smaller, the maximum power is still usually achieved in the range i = 0.10–0.20. However, this maximum is substantially smaller than the power of the one-stage design, and the shortfall is more pronounced for studies with a small number of markers (m). The increase in power as T2/T1 increases is also illustrated in Figure 2 for various values of m when n = 1000. As above, this figure suggests that near-optimal power is achieved when T2/T1 is at least 0.55 for large m. Based on these results, as a general guideline, sufficient power can be obtained when i = 0.10 and j = 0.50.
This translates to a rule-of-thumb two-stage design where all the markers are screened on 50% of the available individuals in Stage 1, and the best 10% of the markers are genotyped on the remaining individuals in Stage 2. The decrease in the number of marker evaluations in this rule-of-thumb design relative to a one-stage design is 45%.
Figure 1.

Power of a two-stage design for m independent markers as a function of the proportion of markers (i) carried over to Stage 2 for various values of T2/T1. There is a single disease locus (D = 1). The signal μ is chosen such that the one-stage design has 80% power to identify the disease locus. The sample size is n = 1000. Dashed lines correspond to m = 100, and solid lines correspond to m = 3000 markers. The pairs of curves (for m = 100 and 3000) given from bottom to top correspond to the following values of T2/T1: 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, and 0.60, respectively. The cost fraction corresponding to every pair of curves is indicated in the figure. For example, the top pair of curves gives the power of a two-stage design for m = 100 (dashed line) and 3000 (solid line) when T2/T1 = 0.60 for various values of i (shown on the horizontal axis).
Figure 2.

The maximum power of a two-stage design as a function of the proportion of marker evaluations (T2/T1) for m = 3000 (solid), 1000 (dot), 500 (dash), 200 (dash-dot), and 100 (long dash) independent markers, for sample size n = 1000. Power is shown for D = 1 single disease locus. The signal μ is such that the corresponding one-stage designs have 80% power to identify the disease locus.
We further evaluated the power of a two-stage design for D > 1 for independent markers. The above rule-of-thumb two-stage design provides near-optimal power for various choices of D and d when the markers are independent. The results for m = 3000, n = 1000 and m = 1000, n = 100 are given in Table 2 under the column with ρ = 0. It can be seen that the power of the rule-of-thumb two-stage design is close to the power of the one-stage design. For example, when m = 3000, n = 1000 and there are D = 5 disease loci with signal μ = 0.113, a one-stage design has 80% power to detect d = 2 of these disease loci. The rule-of-thumb two-stage design has 79% power in this case. (Additional results corresponding to other parametric configurations are available from our web page http://www.mskcc.org/biostatistics.) The simulations indicate that the rule-of-thumb two-stage design provides a cost-effective approach to identify markers associated with the disease in the presence of multiple disease loci with independent markers.
Table 2.
Power of one- and two-stage designs for (m, n) = (3000, 1000) and (1000, 100), and D = 1, 5, and 10 disease loci. Power to detect a desired number d of the disease genes is given for ρ = 0 (independence), 0.3, 0.8, and 0.95. The value of ρ0 is 0.1 when ρ > 0. The signal μ corresponds to a one-stage design with approximately 80% power. For every combination of (m, n, D, d), the power of the rule-of-thumb two-stage design is given in the top row, and the power of the one-stage design in the bottom row.
| D | d | μ (m = 3000, n = 1000) | ρ = 0 | ρ = 0.3 | ρ = 0.8 | ρ = 0.95 | μ (m = 1000, n = 100) | ρ = 0 | ρ = 0.3 | ρ = 0.8 | ρ = 0.95 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0.140 | 0.80 | 0.79 | 0.70 | 0.39 | 0.420 | 0.81 | 0.81 | 0.69 | 0.36 |
| | | | 0.80 | 0.80 | 0.70 | 0.39 | | 0.82 | 0.82 | 0.69 | 0.37 |
| 5 | 1 | 0.094 | 0.78 | 0.78 | 0.62 | 0.31 | 0.266 | 0.78 | 0.77 | 0.57 | 0.26 |
| | | | 0.79 | 0.78 | 0.62 | 0.31 | | 0.78 | 0.77 | 0.57 | 0.27 |
| | 2 | 0.113 | 0.79 | 0.78 | 0.33 | 0.04 | 0.326 | 0.78 | 0.76 | 0.28 | 0.03 |
| | | | 0.80 | 0.78 | 0.33 | 0.04 | | 0.79 | 0.77 | 0.28 | 0.03 |
| | 5 | 0.170 | 0.81 | 0.77 | 0.02 | 0.00 | 0.509 | 0.80 | 0.74 | 0.01 | 0.00 |
| | | | 0.82 | 0.78 | 0.02 | 0.00 | | 0.82 | 0.77 | 0.01 | 0.00 |
| 10 | 1 | 0.082 | 0.81 | 0.80 | 0.58 | 0.25 | 0.225 | 0.79 | 0.77 | 0.52 | 0.21 |
| | | | 0.81 | 0.81 | 0.59 | 0.26 | | 0.79 | 0.78 | 0.52 | 0.21 |
| | 2 | 0.095 | 0.80 | 0.78 | 0.30 | 0.03 | 0.269 | 0.78 | 0.75 | 0.24 | 0.02 |
| | | | 0.80 | 0.78 | 0.30 | 0.03 | | 0.79 | 0.76 | 0.24 | 0.02 |
| | 5 | 0.120 | 0.77 | 0.73 | 0.03 | 0.00 | 0.348 | 0.75 | 0.68 | 0.02 | 0.00 |
| | | | 0.78 | 0.74 | 0.03 | 0.00 | | 0.77 | 0.70 | 0.02 | 0.00 |
| | 10 | 0.180 | 0.81 | 0.71 | 0.00 | 0.00 | 0.541 | 0.80 | 0.64 | 0.00 | 0.00 |
| | | | 0.82 | 0.72 | 0.00 | 0.00 | | 0.83 | 0.66 | 0.00 | 0.00 |
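The rule-of-thumb design for independent markers can be illustrated with a small Monte Carlo sketch. This is a deliberately simplified model, not the exact procedure studied in this article: marker test statistics are taken to be independent standard normals, marker 0 is the disease locus with a statistic of mean μ√k when computed from k subjects, Stage 1 genotypes all m markers on half the subjects, the top 10% of markers are carried forward, and the final statistic pools both stages. The function name and parameters are illustrative, not from the paper.

```python
import numpy as np

def two_stage_power(m=3000, n=1000, mu=0.140, pi1=0.5, frac=0.10,
                    reps=2000, seed=0):
    """Monte Carlo power of a simplified two-stage design (a sketch,
    not the paper's exact model).  "Success" means the disease locus
    (marker 0) survives Stage 1 and has the largest pooled statistic."""
    rng = np.random.default_rng(seed)
    n1 = int(pi1 * n)                                # Stage 1 sample size
    keep = max(1, int(frac * m))                     # markers carried to Stage 2
    w1, w2 = np.sqrt(n1 / n), np.sqrt((n - n1) / n)  # pooling weights
    mean1 = np.zeros(m); mean1[0] = mu * np.sqrt(n1)
    mean2 = np.zeros(m); mean2[0] = mu * np.sqrt(n - n1)
    hits = 0
    for _ in range(reps):
        z1 = rng.normal(mean1, 1.0)        # Stage 1 statistics, all m markers
        top = np.argsort(z1)[-keep:]       # most promising markers
        if 0 not in top:
            continue                       # disease locus eliminated at Stage 1
        z2 = rng.normal(mean2[top], 1.0)   # independent Stage 2 increment
        zc = w1 * z1[top] + w2 * z2        # pooled statistic, mean mu * sqrt(n)
        if top[np.argmax(zc)] == 0:
            hits += 1
    return hits / reps

# With the Table 2 defaults, the estimate lands near the 80% one-stage
# power under this simplified model, while the design performs only
# pi1 + (1 - pi1) * frac = 0.55 of the one-stage marker evaluations.
print(two_stage_power())
```

Because the pooled statistic reuses the Stage 1 data, null markers that survive Stage 1 carry inflated statistics into Stage 2, which is what keeps the power slightly below, rather than above, the one-stage power.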
4.2. Correlated Markers
The power of the two-stage design was evaluated for correlated markers under various parametric configurations as above. Table 2 provides the results of a one-stage design and the rule-of-thumb two-stage design (additional results are available from our web page given above). First, as one would expect, the power of the one-stage design decreases as the correlation between the markers increases. In general, the power of the one-stage design to detect multiple true markers of association is very small when the correlation is high (ρ = 0.60 or above). However, the power of the above rule-of-thumb two-stage design is very close to the power of the one-stage design even when the markers are correlated. For example, when m = 3000, n = 1000 and there are D = 10 disease loci with signal μ = 0.082, a one-stage design provides only 59% power to detect d = 1 of these disease loci when ρ = 0.8. The rule-of-thumb two-stage design provides 58% power in this setting, which, although small, is close to the power of a one-stage design. In general, the loss of power under the rule-of-thumb design is minimal (1–2%) when testing a large number of markers, and increases to approximately 5% when the sample size is small and when fewer markers are tested. The simulations suggest that the rule-of-thumb two-stage design provides power close to that of a one-stage design across various levels of correlations and various values of m, even when the power of a one-stage design is small.
5. Concluding Remarks
Here, we have derived a two-stage design for association studies when evaluating m candidate markers with a given sample size. Every marker can be a single nucleotide polymorphism (SNP), or a "marker" could correspond to haplotype information in a candidate gene. In the former case, multiple SNPs evaluated in a candidate region would very likely be correlated due to linkage disequilibrium between the loci. In the latter case, if we consider a marker to be one haplotype for every candidate gene, we may treat the various haplotypes as independent. The haplotypes could also be treated as correlated, although correlations between haplotypes are generally likely to be smaller than those between SNPs. Our simulations indicate that the rule-of-thumb design is applicable for both independent and correlated markers.
Here, we have defined power as the probability of detecting the true disease-susceptibility loci. Note that the power decreases as the correlation between adjacent loci increases. This is because, when the candidate loci are correlated, the signal at a genetic locus adjacent to the disease locus can be high, leading to a false positive detection and a loss of power. However, were we to define power as the probability of selecting a locus within a certain centiMorgan distance of, or in sufficient linkage disequilibrium with, the disease locus, the decrease in power with increasing correlation would be small. Further, the cost function we have considered here is based on a unit cost of genotyping one marker on one individual. Alternatively, one could consider an initial capital cost of designing the primers and probes for all the markers, followed by the cost of genotyping the markers on the individuals. We believe that the results of our two-stage approach would continue to hold as long as the capital cost is not a large proportion of the total cost of the study.
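The effect of a capital cost on the savings can be made concrete with a back-of-the-envelope calculation. The cost model below is hypothetical (the function name and the `capital_share` parameter are illustrative, not from the paper): a fixed setup cost for all m markers, paid under either design, plus a unit cost per marker evaluation.

```python
def cost_ratio(pi1=0.5, frac=0.10, capital_share=0.0):
    """Two-stage cost relative to one-stage cost under a hypothetical
    model: a fixed setup cost (primers/probes for all m markers, paid
    in either design) plus a unit cost per marker evaluation.
    capital_share = setup cost as a fraction of the one-stage total."""
    evals = pi1 + (1 - pi1) * frac   # fraction of the m*n evaluations performed
    return capital_share + (1 - capital_share) * evals

print(round(cost_ratio(), 4))                   # 0.55 -> the 45% saving in the text
print(round(cost_ratio(capital_share=0.3), 4))  # 0.685 -> savings shrink with setup cost
```

As `capital_share` grows, the cost ratio moves toward 1, which is why the savings claimed here presuppose that the capital cost is a modest fraction of the total.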
In this article, we have focused on candidate gene association studies. Whole-genome association studies, on the other hand, evaluate much larger numbers of markers throughout the genomes of cases and controls. The simulations presented in this article indicate that a rule-of-thumb two-stage design provides near-optimal power irrespective of the total number of markers evaluated in the study. Therefore, this two-stage approach can also be used in whole-genome association studies. However, when evaluating much larger numbers of markers (say, 300,000) along the genome, adjacent pairs of markers are likely to be highly correlated. In this scenario, testing for marker–disease association one marker at a time will not be a powerful approach to identify the disease-susceptibility loci. Instead, a multipoint method that evaluates association between haplotypes and disease will be preferable. Recent studies indicate that linkage disequilibrium along the genome occurs in blocks. The pattern of such haplotype blocks, however, depends upon the population under study, and the regions at which haplotypes need to be evaluated in an association study are not known at the outset. An alternative two-stage approach can be considered in which, at Stage 1, every marker is evaluated individually using a subset of cases and controls to identify the promising regions; at Stage 2, haplotypes in these promising regions are evaluated using the additional individuals to identify the disease-susceptibility genes.
DNA pooling is becoming a cost-effective genotyping approach for association studies (Shaw et al., 1998). Unlike individual genotyping, DNA pooling provides allele frequency information from the multiple individuals in a pool rather than the genotype of every individual; it thus undermines the assumption that cost is approximately proportional to the number of individual marker evaluations. Recently, Wang, Kidd, and Zhao (2003) showed that DNA pooling with two individuals per pool is a cost-effective approach to estimating haplotype frequencies, by considering a cost function that aggregates the cost of ascertaining n individuals and the cost of genotyping up to two SNPs in p pools with n_p individuals per pool (i.e., n = p × n_p). A two-stage approach can be considered in this scenario: at Stage 1, all the markers are evaluated on a subset of cases and controls using the DNA pooling approach of Wang et al. (2003), and promising regions are identified by testing association between haplotypes and disease; at Stage 2, markers in the promising regions are genotyped individually on all the cases and controls, and association between the haplotypes in these regions and the disease is tested to identify the disease-susceptibility genes. Further research is required to explore the optimality of two-stage designs of this nature.
In summary, the simulations show that for a given sample size, a two-stage design can provide near-optimal power to detect the true markers conferring disease risk while substantially reducing the total number of marker evaluations. In particular, the rule-of-thumb two-stage approach, in which half of the subjects are genotyped on all markers in the first stage and the top 10% of the markers are carried to the second stage, provides a practical, cost-effective strategy to search for disease-susceptibility genes in association studies based on case–control designs.
Acknowledgements
This research was supported in part by National Institutes of Health grants GM60457, CA73848, and CA098438.
References
- Abecasis GR, Noguchi E, Heinzmann A, et al. (2001). Extent and distribution of linkage disequilibrium in three genomic regions. American Journal of Human Genetics 68, 191–197.
- Ardlie KG, Kruglyak L, and Seielstad M (2002). Patterns of linkage disequilibrium in the human genome. Nature Reviews Genetics 3, 299–309.
- Armitage P (1955). Tests for linear trends in proportions and frequencies. Biometrics 11, 375–386.
- Boehnke M (1994). Limits of resolution of genetic linkage studies: Implications for the positional cloning of human genetic diseases. American Journal of Human Genetics 55, 379–390.
- Devlin B and Risch N (1995). A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29, 311–322.
- Johnson NL, Kotz S, and Balakrishnan N (1995). Continuous Univariate Distributions, Volume 2. New York: John Wiley and Sons.
- Jorde LB (2000). Linkage disequilibrium and the search for complex disease genes. Genome Research 10, 1435–1444.
- Nordborg M and Tavare S (2002). Linkage disequilibrium: What history has to tell us. Trends in Genetics 18, 83–90.
- Ott J (1991). Analysis of Human Genetic Linkage. Baltimore, Maryland: The Johns Hopkins University Press.
- Risch N (2000). Searching for genetic determinants in the new millennium. Nature 405, 847–856.
- Sasieni P (1997). From genotypes to genes: Doubling the sample size. Biometrics 53, 1253–1261.
- Satagopan JM and Elston RC (2003). Optimal two-stage genotyping in population-based association studies. Genetic Epidemiology 25, 149–157.
- Satagopan JM, Verbel DA, Venkatraman ES, Offit KE, and Begg CB (2002). Two-stage designs for gene-disease association studies. Biometrics 58, 163–170.
- Shaw SH, Carrasquillo MM, Kashuk C, et al. (1998). Allele frequency distributions in pooled DNA samples: Applications to mapping complex disease genes. Genome Research 8, 111–123.
- Tabor HK, Risch NJ, and Myers RM (2002). Candidate-gene approaches for studying complex genetic traits: Practical considerations. Nature Reviews Genetics 3, 1–7.
- Tang H and Siegmund DO (2001). Mapping quantitative trait loci in oligogenic models. Biostatistics 2, 147–162.
- Wang S, Kidd KK, and Zhao H (2003). On the use of DNA pooling to estimate haplotype frequencies. Genetic Epidemiology 24, 74–82.
- Weir BS (1996). Genetic Data Analysis II, 2nd edition. Sunderland, Massachusetts: Sinauer Associates.
