Author manuscript; available in PMC: 2022 Apr 4.
Published in final edited form as: Biometrics. 2002 Mar;58(1):163–170. doi: 10.1111/j.0006-341x.2002.00163.x

Two-Stage Designs for Gene–Disease Association Studies

Jaya M Satagopan 1,*, David A Verbel 1, E S Venkatraman 1, Kenneth E Offit 2, Colin B Begg 1
PMCID: PMC8978151  NIHMSID: NIHMS1790763  PMID: 11890312

Summary.

The goal of this article is to describe a two-stage design that maximizes the power to detect gene–disease associations when the principal design constraint is the total cost, represented by the total number of gene evaluations rather than the total number of individuals. In the first stage, all genes of interest are evaluated on a subset of individuals. The most promising genes are then evaluated on additional subjects in the second stage. This will eliminate wastage of resources on genes unlikely to be associated with disease based on the results of the first stage. We consider the case where the genes are correlated and the case where the genes are independent. Using simulation results, it is shown that, as a general guideline when the genes are independent or when the correlation is small, utilizing 75% of the resources in stage 1 to screen all the markers and evaluating the most promising 10% of the markers with the remaining resources provides near-optimal power for a broad range of parametric configurations. This translates to screening all the markers on approximately one quarter of the required sample size in stage 1.

Keywords: Cost constraint, Gaussian approximation, Optimal design, Power

1. Introduction

Studies of gene–disease association are used commonly to investigate candidate regions by saturating areas of the genome with a large number of marker loci (such as single nucleotide polymorphisms, or SNPs) in order to identify genes conferring risk for a disease of interest. This approach has been proposed to identify candidate cancer susceptibility alleles utilizing case–control association studies. Since large numbers of markers will need to be genotyped for each individual, a major question of importance relates to resource utilization. It is therefore desirable to optimize the design of such studies to maximize the power to identify loci of true association for a given cost of the study.

Once information about marker alleles (or genotypes) is obtained, a test statistic can be calculated for every marker to determine the significance of the association between that marker and disease outcome (e.g., a Fisher’s exact or chi-square test for a dichotomous outcome such as presence or absence of disease). A corresponding p-value, after correcting for multiple comparisons, can then be used to determine the significance of association between the marker and disease (Schaid, 1996; Martin, Kaplan, and Weir, 1997; Schaid and Rowland, 1998; Teng and Risch, 1999). While considerable research has focused on developing test statistics for association studies using case–control samples (and nuclear family samples), designing such studies has received very little notice. Attention to the study design can have important ramifications, especially in the setting described here, where one is interested in searching for association in the presence of a large number of genetic markers.

In this article, we discuss an optimal design to identify markers conferring increased risk in case–control studies. We propose that, if the total cost, determined by the total number of gene evaluations performed, is the primary limitation on resources, as opposed to the total number of individuals, considerable statistical efficiency will be gained by performing a two-stage design, where, in the first stage, all markers are evaluated on each individual, and in the second stage, only the most promising markers from the first stage are evaluated on additional individuals.

In the following sections, we describe a specific application of this design in breast cancer and formulate the problem mathematically. We also characterize quantitatively the trade-offs in efficiency of alternative designs and derive a power function. We use the term power in this context to represent the probability that the true marker is identified by the study, in contrast with its conventional usage in the context of statistical hypothesis testing. Finally, we discuss simulation results and provide general guidelines for designing such case–control association studies of candidate regions.

2. The Breast Cancer Polymorphism Study

It is known that mutations in two genes, BRCA1 and BRCA2, account for a majority of, but not all, kindreds with inherited breast and ovarian cancer (Ford et al., 1998). Estimates of the lifetime risk for breast cancer due to, e.g., a BRCA2 mutation vary from 26% (Satagopan et al., 2001) to 84% (Ford et al., 1998). It is hypothesized that polymorphisms in other genes may modify penetrance and account for some of the variation that has been observed. In addition, these polymorphisms may constitute low penetrance alleles associated with breast cancer risk. To address these questions, investigators have planned a study to compare the frequencies of polymorphisms among breast cancer patients with and without an inherited BRCA mutation. The study subjects will be women affected with breast cancer. The association between a BRCA mutation and an SNP will be determined by recording the presence of the mutation and the polymorphism in these study subjects. This test will be carried out for many SNPs. Approximately 1500 anonymous SNPs and several known candidate polymorphisms will be analyzed and their frequencies compared among the two groups of individuals (patients with and without a BRCA mutation). The polymorphic variants to be studied are determined by the clinical investigators based on prior work about the penetrance of breast cancer due to inherited BRCA mutations and the role and function of these mutations as components of the DNA damage response pathway.

The major costs related to association studies involve ascertainment of the study subjects and genotyping. Since individual tissues required for this study are available from a related project of the clinical investigators, the primary cost is associated with genotyping the large number of SNPs and other candidate polymorphisms in these tissues. Several biotechnology companies offer genotyping services for such large-scale projects. These companies typically have a flat rate charge for the first few (1–3) genotypes on some (90–180) individuals. Any additional genotyping will be charged on a per genotype per individual basis.

A total of 2000 breast cancer cases (with and without a BRCA mutation) will be available for this study, requiring genotyping of several hundred markers on all these individuals. Genotyping all the polymorphisms on all of these tissues may not be a cost effective way of identifying the true genes of association. Hence, the fundamental task at the experimental design stage of this study is to identify an appropriate method to screen the polymorphisms in these individuals under cost constraints such that the power (as defined in the previous section) to identify the true markers (or polymorphisms) of association is maximized.

3. Design

In the following, we introduce the concept of a two-stage design and define optimality conditions assuming that the total number of genetic evaluations or resources is the fixed constraint rather than the number of individuals. We assume unit cost per gene evaluation and denote the total number of genetic evaluations or total cost by T. Consider any genome of interest marked with a total of m genetic loci. For simplicity, consider the situation when only one of these m genes is the true gene conferring risk and none of the other m − 1 genes are associated with risk. The goal is to identify this true gene. In the absence of a cost constraint, the optimal strategy would be to evaluate all m markers on all N available individuals for a total cost of mN. At each genetic locus, one would test the association between an allele at that locus and disease status based on a 2 × 2 table. This could be a chi-square test for association between the allele of interest at that genetic locus and disease. The decision rule would be to select the gene corresponding to the largest test statistic.

If the primary constraint is in the total cost T and if T is smaller than mN, screening all markers on all the subjects will not be feasible. In this case, the one-stage design would involve evaluating all the m markers on T/m individuals. However, this can be inefficient in resource utilization since it may require large numbers of evaluations of genes that can be identified early in the study as extremely unlikely to be the true disease gene.

Consider, instead, optimization of the following two-stage design. In stage 1, screen all m genes on a set of n1 individuals using the test statistic where the numbers of cases and controls in this subset of n1 are chosen in proportion to their relative frequency in the full set of available subjects. Rank the genes based on the absolute value of the test statistic. Select the top ith proportion of these genes, i.e., select the top mi genes. In stage 2, evaluate these mi genes on subsequent individuals until the total number of available genetic evaluations, T, is used up, again selecting cases and controls in constant relative proportions. Rank the genes based on the same test for risk and select the gene with the highest test statistic. In the following sections, we show how to optimize this design by determining the values of n1 and i that lead to the maximum probability (power) of selecting the true gene at the conclusion of the study.
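The selection steps just described can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names `select_for_stage2` and `pick_final` are hypothetical, the per-gene statistics are assumed to be computed already, and ranking by absolute value at both stages is an assumption made here for consistency.

```python
import numpy as np


def select_for_stage2(stage1_stats, i):
    """Return indices of the top i-th proportion of genes by |statistic|."""
    stats = np.asarray(stage1_stats, dtype=float)
    k = max(1, int(round(stats.size * i)))          # number carried forward (mi)
    order = np.argsort(-np.abs(stats), kind="stable")
    return np.sort(order[:k])


def pick_final(stage2_stats, carried):
    """Return the gene index with the largest |statistic| after stage 2.

    stage2_stats[r] is the combined-sample statistic of gene carried[r].
    """
    stats = np.asarray(stage2_stats, dtype=float)
    return int(np.asarray(carried)[np.argmax(np.abs(stats))])


# Toy example with 10 genes and i = 0.2 (two genes carried forward).
stage1 = [0.1, -3.0, 2.0, 0.5, -0.2, 1.0, 0.0, 0.3, -0.4, 0.6]
carried = select_for_stage2(stage1, i=0.2)
winner = pick_final([1.5, -4.0], carried)
```

In the toy example the two genes with the largest absolute stage-1 statistics are carried forward, and the final choice is made among those two only.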

4. Power of a Two-Stage Design

Define power, P, as the probability that the true gene is selected at the end of the study and let n2 be the number of patients used in stage 2. Then

$$ T = n_1 m + n_2 m i. \tag{1} $$

The goal is to maximize P with respect to i and n1 for fixed values of T and m. Note that, since T/m is fixed, choosing i and n1 automatically determines n2. Therefore, optimizing P with respect to i and n1 is equivalent to determining the proportion, denoted j, of the total resource to be used in stage 1, where j = n1m/T, and the proportion, i, of genes to be carried forward to the second stage for further testing. The ratio of the total number of patients required for the two-stage design to the number required for a single-stage design is given by j + (1 − j)/i.

The power P can be written as follows. Let the probability that the true gene is among the top ith proportion in stage 1 be given by P1. Let P2 be the probability that the true gene has a higher observed association with disease in the study than every null gene, given that the true gene and these null genes are included in stage 2. Then the power is given by P = P1 × P2. This quantity can be calculated using a Gaussian approximation. We first optimize the design under the assumption that the mutational profiles of all the genes are mutually uncorrelated, and then we modify the model to accommodate the influence of linkage disequilibrium on the correlations of adjacent genes.

5. Power Function for Independent Gene Outcomes

In this section, we present the power function under the assumption that the gene outcomes are independent within a subject. By outcome we refer to a subject’s contribution to the test statistic generated by that gene. Any typical test statistic for association has an asymptotic normal distribution; i.e., if the test statistic is computed from n independent subjects, then its distribution is given by N(nμ, nσ²), where μ and σ² are the asymptotic mean and variance, respectively. Thus, by appropriate scaling of the statistic, we can, without loss of generality, assume that the test statistics have unit variance, i.e., σ = 1. The asymptotic mean μ of the statistic is zero if there is no association. From these arguments, we can assume that the gene outcomes for each subject have a normal distribution with mean μ and unit variance. Note that the accuracy of the normal approximation depends on factors such as overall sample size and case prevalence rate. Where the approximation is poor, measures such as Yates’ continuity correction for the chi-square statistic will provide a better approximation.

Let X1 denote the test statistic computed from the n1 subjects obtained in stage 1 and let X2 denote the test statistic from the combined n (= n1 + n2) subjects from stages 1 and 2 for the true gene. Similarly, let Y1 and Y2 denote the corresponding test statistics for any null gene. Then X1 ~ N(μn1, n1), X2 ~ N(μn, n), Y1 ~ N(0, n1), and Y2 ~ N(0, n). Let ϕ(·) and Φ(·), respectively, denote the probability density and distribution functions of a standard normal distribution. Denote f1(x; μn1, n1) = (n1)^(−1/2) ϕ[(x − μn1)/(n1)^(1/2)] as the probability density function of X1 with corresponding distribution function F1(x; μn1, n1) = Φ[(x − μn1)/(n1)^(1/2)]. Similarly, let f2(y; n1) = (n1)^(−1/2) ϕ[y/(n1)^(1/2)] represent the probability density function of Y1 with corresponding distribution function F2(y; n1) = Φ[y/(n1)^(1/2)]. The total number of individuals in stage 1, n1, is given by

$$ n_1 = jT/m. \tag{2} $$

Using equations (1) and (2), the total number of individuals used in the study is given by

$$ n = n_1 + n_2 = \left[ j + \frac{1-j}{i} \right] T/m. \tag{3} $$
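Equations (1) to (3) amount to simple bookkeeping, which the following sketch makes explicit. The function name `allocate` and the example values (T/m = 1000, i = 0.2, j = 0.5) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def allocate(T, m, i, j):
    """Return (n1, n2, n) implied by equations (1)-(3), unit cost per evaluation."""
    if not (0 < i <= 1 and 0 < j <= 1):
        raise ValueError("i and j must lie in (0, 1]")
    n1 = j * T / m                  # equation (2)
    n2 = (T - n1 * m) / (m * i)     # equation (1) solved for n2
    n = n1 + n2                     # equals [j + (1 - j)/i] * T/m, equation (3)
    return n1, n2, n


# Hypothetical budget: m = 1000 markers, T/m = 1000 subjects in a one-stage design.
n1, n2, n = allocate(T=1000 * 1000, m=1000, i=0.2, j=0.5)
```

With these illustrative inputs, half of the budget screens all markers on 500 subjects, and the other half evaluates the retained 20% of markers on 2500 further subjects, for 3000 subjects in total.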

Since the subjects are assumed to be independent and thus contribute independent gene outcomes, the summary outcome for the true gene combined over stages 1 and 2, X2, can be written as the outcome in stage 1, X1, plus an independent normal component. Hence, the covariance between X1 and X2 is the variance of X1, and similarly, the covariance between Y1 and Y2 is the variance of Y1. Therefore, (X1, X2) has a bivariate normal distribution with mean (μn1, μn) and covariance matrix Σ, where

$$ \Sigma = \begin{pmatrix} n_1 & n_1 \\ n_1 & n \end{pmatrix}. \tag{4} $$

Similarly, (Y1, Y2) has a bivariate normal distribution with mean (0, 0) and covariance matrix Σ.

In stage 1, P1 is the probability that X1 is among the top mi gene outcomes. Denote Y(m − mi) as the (m − mi)th ordered null gene outcome having probability density function given by

$$ g(y) = \frac{(m-1)!}{(mi-2)!\,(m-mi)!}\,[F_2(y; n_1)]^{mi-2} \times [1 - F_2(y; n_1)]^{m-mi}\, f_2(y; n_1). \tag{5} $$

The probability P1 can be written as

$$ P_1 = P(X_1 > Y_{(m-mi)}) = \int_{-\infty}^{\infty} g(y)\,[1 - F_1(y; \mu n_1, n_1)]\,dy. \tag{6} $$

In stage 2, P2 is the probability that X2 is greater than each of the mi − 1 null gene outcomes, conditional upon the results from stage 1. It can be easily shown that the conditional distribution of X2 | X1 = x1 is normal with mean x1 + μn2 and variance n2. The conditional distribution of Y2 | Y1 = y1 is normal with mean y1 and variance n2. Hence, denoting Y21, …, Y2,mi−1 as the mi − 1 null gene outcomes in stage 2 and their corresponding outcomes in stage 1 as Y11, …, Y1,mi−1, P2 can be written as

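$$
\begin{aligned}
P_2 &= P\left( X_2 > Y_{21}, \ldots, X_2 > Y_{2,mi-1} \mid Y_{11} > Y_{(m-mi)}, \ldots, Y_{1,mi-1} > Y_{(m-mi)},\; X_1 > Y_{(m-mi)} \right) \\
&= \int \left[ P(Y_2 < X_2 \mid Y_1 > Y_{(m-mi)}) \right]^{mi-1} \times dP(X_2 \mid X_1 > Y_{(m-mi)}).
\end{aligned}
\tag{7}
$$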

The first expression in the integrand of the above equation can be written as

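$$
\begin{aligned}
P(Y_2 < X_2 \mid Y_1 > Y_{(m-mi)})
&= \frac{P(Y_2 < X_2,\; Y_1 > Y_{(m-mi)})}{P(Y_1 > Y_{(m-mi)})} \\
&= \frac{\int P(Y_2 < X_2,\; Y_1 > y)\, g(y)\, dy}{\int P(Y_1 > y)\, g(y)\, dy} \\
&= \frac{\int P(Y_2 < X_2 \mid Y_1 > y)\, P(Y_1 > y)\, g(y)\, dy}{\int g(y)\,[1 - F_2(y; n_1)]\, dy} \\
&= \frac{\int g(y) \int_y^{\infty} f_2(y_1; n_1)\, P(Y_2 < X_2 \mid y_1)\, dy_1\, dy}{\int g(y)\,[1 - F_2(y; n_1)]\, dy} \\
&= \frac{\int g(y) \int_y^{\infty} f_2(y_1; n_1)\, F_1(X_2; y_1, n_2)\, dy_1\, dy}{\int g(y)\,[1 - F_2(y; n_1)]\, dy}
\end{aligned}
$$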

since Y2 and X2 are independent and Y2 | Y1 has a normal distribution with mean y1 and variance n2. Further, the second expression in the integrand of equation (7) can be written as

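$$ P(X_2 \mid X_1 > Y_{(m-mi)}) = \frac{1}{P_1} \int g(y) \times \left\{ \int_y^{\infty} F_1(X_2; x_1 + \mu n_2, n_2) \times f_1(x_1; \mu n_1, n_1)\, dx_1 \right\} dy. $$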

The probabilities P1 and P2 (equations (6) and (7)) can be evaluated using Monte Carlo simulation for given values of i, j, and μ.
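One way to carry out such a Monte Carlo evaluation is sketched below. It is an illustration under stated assumptions, not the authors' program: the function names and arguments are hypothetical, the statistics are drawn directly from the Gaussian model of this section, and genes are ranked on the signed statistic as in equations (6) and (7). The product P1 × P2 is estimated jointly as the fraction of replicates in which the true gene advances and then ranks first.

```python
import numpy as np


def two_stage_power(mu, m, n_per_marker, i, j, reps=2000, seed=0):
    """Monte Carlo estimate of the power P for independent gene outcomes.

    mu: per-subject mean outcome of the true gene (null genes have mean 0).
    n_per_marker: T/m, the one-stage sample size under the same budget.
    """
    rng = np.random.default_rng(seed)
    k = max(1, int(round(m * i)))          # genes carried to stage 2 (mi)
    n1 = j * n_per_marker                  # equation (2)
    n2 = (1.0 - j) * n_per_marker / i      # from equation (1)
    hits = 0
    for _ in range(reps):
        x1 = rng.normal(mu * n1, np.sqrt(n1))           # X1 ~ N(mu*n1, n1)
        y1 = rng.normal(0.0, np.sqrt(n1), size=m - 1)   # Y1 ~ N(0, n1)
        # The true gene advances only if fewer than k null genes exceed it.
        if np.count_nonzero(y1 > x1) >= k:
            continue
        if k == 1:                          # the true gene advanced alone
            hits += 1
            continue
        # Given that the true gene advanced, the remaining k - 1 slots are
        # filled by the k - 1 largest null statistics.
        top = np.argpartition(y1, -(k - 1))[-(k - 1):]
        # Stage 2 adds an independent increment from n2 new subjects.
        x2 = x1 + rng.normal(mu * n2, np.sqrt(n2))
        y2 = y1[top] + rng.normal(0.0, np.sqrt(n2), size=k - 1)
        hits += int(x2 > y2.max())
    return hits / reps


def one_stage_power(mu, m, n_per_marker, reps=2000, seed=0):
    """One-stage design: all m genes on T/m subjects (i = 1, j = 1)."""
    return two_stage_power(mu, m, n_per_marker, 1.0, 1.0, reps, seed)


if __name__ == "__main__":
    # Illustrative configuration: m = 100, T/m = 100, mu = 0.275.
    print(one_stage_power(0.275, 100, 100))
    print(two_stage_power(0.275, 100, 100, i=0.10, j=0.75))
```

Setting i = 1 and j = 1 reduces the procedure to the one-stage design, which gives a convenient baseline under the same budget. A grid search over i and j with this function would be one simple way to look for the optimum.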

6. Power Function for Correlated Gene Outcomes

In practice, the assumption of independent gene outcomes within subjects may not be even approximately true when testing multiple markers. Gene outcomes can be correlated due to various phenomena such as genetic linkage and loss of heterozygosity (evolutionary causes) and allele frequency and marker density (recombination). Correlation (denoted ρ) due to recombination can be easily quantified (Feller, 1966). Here we focus only on the aggregate correlation rather than correlation due to specific causes.

Under the assumption of independence, the true gene outcomes have a mean of μ, while the null gene outcomes have a mean of zero. However, when we cannot assume independence, the null genes in the neighborhood of the true genetic locus need not have a mean of zero since the mean outcome will be influenced by the correlation between the null and the true genes. Therefore, the mean outcome of the null genes will reduce to zero as a function of correlation as one moves away from the neighborhood of the true gene.

As defined in the previous section, let (X1, X2) denote the true gene outcomes in stages 1 and 2, respectively, normally distributed with mean (μn1, μn) and covariance matrix Σ (given by equation (4)). Further, let Y1,u, u = 1, …, m − 1, denote the linear ordering on the genome of the null gene outcomes under stage 1. Similarly, let Y2,u, u = 1, …, m − 1, denote the outcomes of the selected null genes in their linear order to be evaluated in stage 2. The true gene (having outcomes X1 and X2 in stages 1 and 2, respectively) can be located anywhere along the genome. When addressing the design question, we consider the simple case where we assume that the correlations between adjacent pairs of loci are equal. As stated earlier, the true gene has mean μn1 and variance n1 in stage 1 and mean μn and variance n after stage 2. Therefore, the uth null gene away from the true gene has mean μn1ρ^u and variance n1 in stage 1 and mean μnρ^u and variance n after stage 2. The mean of each of the null genes approaches zero as the correlation between the true and the null genes decreases.

The power P = P1 × P2 for this setting can be described as follows. In stage 1, P1 is the probability that X1 is among the top mi gene outcomes. Let g*(·) denote the density of Y(m − mi), which denotes the (m − mi)th ordered null gene outcome in stage 1. The density g*(·) depends on the mean μ, sample size n1, and the correlation ρ.

Therefore, the probability P1 can be written as

$$ P_1 = \int g^{*}(y)\,[1 - F_1(y; \mu n_1, n_1)]\,dy. \tag{8} $$

P2 is the probability that X2 is greater than each of the m − mi null gene outcomes in stage 2, conditional upon the results of stage 1. Hence, P2 can be written as

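$$ P_2 = P\left( X_2 > \max_{1 \le u \le m-mi} \{Y_{2,u}\} \;\middle|\; X_1 > Y_{(m-mi)},\; \min_{1 \le u \le m-mi} \{Y_{1,u}\} > Y_{(m-mi)} \right). \tag{9} $$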

As in the previous case, the probabilities P1 and P2 can be evaluated using a Monte Carlo simulation for varying values of i, j, and μ.
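A possible simulation for the correlated case is sketched below. Several choices are assumptions of this sketch and are not stated in the text: the correlation between genes a and b is taken to be ρ^|a − b| (a first-order autoregressive structure, consistent with null-gene means decaying as ρ^u), the true gene is placed at a uniformly random position, and genes are ranked on the signed statistic. The names `ar1_noise` and `correlated_power` are hypothetical.

```python
import numpy as np


def ar1_noise(rng, m, rho):
    """Unit-variance Gaussian sequence with corr(e[a], e[b]) = rho**|a - b|."""
    z = rng.standard_normal(m)
    e = np.empty(m)
    e[0] = z[0]
    s = np.sqrt(1.0 - rho * rho)
    for u in range(1, m):
        e[u] = rho * e[u - 1] + s * z[u]
    return e


def correlated_power(mu, m, n_per_marker, i, j, rho, reps=1000, seed=0):
    """Monte Carlo power when adjacent genes have correlation rho."""
    rng = np.random.default_rng(seed)
    k = max(1, int(round(m * i)))          # genes carried to stage 2 (mi)
    n1 = j * n_per_marker
    n2 = (1.0 - j) * n_per_marker / i
    hits = 0
    for _ in range(reps):
        t = int(rng.integers(m))                        # position of the true gene
        decay = rho ** np.abs(np.arange(m) - t)         # rho**u, equal to 1 at u = 0
        # Stage 1: mean mu*n1*rho**u, variance n1, correlated along the genome.
        s1 = mu * n1 * decay + np.sqrt(n1) * ar1_noise(rng, m, rho)
        top = np.argpartition(s1, -k)[-k:]              # top k genes advance
        if t not in top:
            continue
        # Stage 2: add an independent increment from n2 new subjects.
        inc = mu * n2 * decay + np.sqrt(n2) * ar1_noise(rng, m, rho)
        s2 = s1[top] + inc[top]
        hits += int(top[np.argmax(s2)] == t)
    return hits / reps


if __name__ == "__main__":
    for rho in (0.0, 0.4, 0.9):
        print(rho, correlated_power(0.275, 100, 100, 0.10, 0.75, rho))
```

With ρ = 0 the noise is independent and all null means are zero, so the function reduces to the independent-outcome setting of the previous section.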

7. Results

7.1. Optimal Two-Stage Design for a Single True Gene

The power function discussed in the previous section can be used to provide guidelines for optimizing the study design. The power function can be maximized with respect to i, the proportion of genes selected for validation, and j, the proportion of resources allocated for stage 1, for given values of T, m, and μ. Further clarification of resource allocation is possible by expressing it in the context of the total sample size and the proportion of individuals allocated to stage 1. The number of individuals in a one-stage design is given by T/m, and that of a two-stage design is [j + (1 − j)/i]T/m. The ratio of the number of individuals required for a two-stage design to the number required for the one-stage design, for fixed T and m, is thus given by j + (1 − j)/i. Note that the proportion of individuals in a two-stage design allocated to stage 1 is given by ij/(ij + 1 − j).
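The two expressions above follow directly from equations (2) and (3). A short derivation is given here, with a worked instance using i = 0.10 and j = 0.75; these are simply the guideline values quoted in the Summary, used for illustration.

$$
\frac{n}{T/m} = j + \frac{1-j}{i}, \qquad
\frac{n_1}{n} = \frac{jT/m}{\left[j + (1-j)/i\right] T/m} = \frac{ij}{ij + 1 - j}.
$$

$$
i = 0.10,\ j = 0.75: \qquad
j + \frac{1-j}{i} = 0.75 + \frac{0.25}{0.10} = 3.25, \qquad
\frac{ij}{ij + 1 - j} = \frac{0.075}{0.325} \approx 0.23.
$$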

In our simulations, power is calculated for ρ = 0, 0.10, 0.20, 0.40, 0.60, 0.80, 0.90, and 0.98, where ρ is the correlation between adjacent genes and ρ = 0 corresponds to the case of independent gene outcomes. For the purpose of our simulations, the signal μ is calculated for cases where a one-stage design testing independent markers will have 30, 40, 50, or 60% power. Table 1 summarizes the results of these simulations for m = 3000 and T/m = 5000. Row (a) gives the maximum power of the two-stage design. The numbers in parentheses in row (b) give the design parameters i and j at which the maximum power is obtained. Figures 1 and 2 show the corresponding results for m = 1000, T/m = 500, μ = 0.120 and for m = 100, T/m = 100, μ = 0.275, respectively. The bold line in the figures gives the maximum power of the two-stage design. The design parameters i and j at which the maximum power is obtained are shown below the horizontal axis. As correlation between the genes decreases, the power of the optimal two-stage design tends toward that of the independent gene outcomes for all combinations of T, m, and μ. Further, for fixed correlation, power increases as the signal (μ) increases.

Table 1.

Power of one- and two-stage designs for m = 3000, T/m = 5000, and values of μ = 0.042, 0.045, 0.050, and 0.055 for increasing values of correlation between adjacent markers. Row (a) gives the maximum power of the optimal two-stage design. Row (b) gives the optimal parameters (i, j). Row (c) gives the power corresponding to a rule-of-thumb two-stage design (when i = 0.10 and j = 0.75). Row (d) gives the power of a one-stage design. Row (e) gives the power (and percentage increase in cost) when using a one-stage design where the total number of individuals is fixed.

| μ | Row | ρ = 0.00 | ρ = 0.10 | ρ = 0.20 | ρ = 0.40 | ρ = 0.60 | ρ = 0.80 | ρ = 0.90 | ρ = 0.98 |
|---|---|---|---|---|---|---|---|---|---|
| 0.042 | (a) | 0.88 | 0.90 | 0.89 | 0.88 | 0.84 | 0.68 | 0.50 | 0.19 |
| | (b) | (0.09, 0.71) | (0.09, 0.74) | (0.14, 0.64) | (0.09, 0.75) | (0.09, 0.72) | (0.03, 0.75) | (0.02, 0.76) | (0.01, 0.70) |
| | (c) | 0.88 | 0.89 | 0.88 | 0.88 | 0.83 | 0.58 | 0.36 | 0.10 |
| | (d) | 0.30 | 0.28 | 0.26 | 0.27 | 0.24 | 0.14 | 0.04 | 0.00 |
| | (e) | 0.98 (655%) | 0.97 (605%) | 0.95 (535%) | 0.94 (588%) | 0.94 (639%) | 0.96 (1514%) | 0.90 (2127%) | 0.65 (5117%) |
| 0.045 | (a) | 0.92 | 0.94 | 0.93 | 0.93 | 0.88 | 0.73 | 0.56 | 0.19 |
| | (b) | (0.09, 0.74) | (0.09, 0.81) | (0.14, 0.66) | (0.09, 0.76) | (0.10, 0.68) | (0.03, 0.75) | (0.01, 0.78) | (0.01, 0.75) |
| | (c) | 0.91 | 0.93 | 0.92 | 0.91 | 0.87 | 0.65 | 0.41 | 0.11 |
| | (d) | 0.40 | 0.38 | 0.36 | 0.35 | 0.29 | 0.14 | 0.05 | 0.00 |
| | (e) | 0.99 (605%) | 0.96 (487%) | 0.97 (515%) | 0.97 (571%) | 0.95 (647%) | 0.96 (1514%) | 0.98 (3797%) | 0.64 (4292%) |
| 0.050 | (a) | 0.96 | 0.97 | 0.97 | 0.97 | 0.93 | 0.81 | 0.66 | 0.25 |
| | (b) | (0.10, 0.75) | (0.15, 0.66) | (0.12, 0.76) | (0.09, 0.81) | (0.09, 0.72) | (0.03, 0.75) | (0.01, 0.78) | (0.01, 0.76) |
| | (c) | 0.96 | 0.96 | 0.96 | 0.96 | 0.93 | 0.71 | 0.44 | 0.12 |
| | (d) | 0.50 | 0.47 | 0.49 | 0.46 | 0.41 | 0.24 | 0.08 | 0.00 |
| | (e) | 0.99 (542%) | 0.99 (488%) | 0.98 (460%) | 0.98 (487%) | 0.98 (639%) | 0.99 (1514%) | 0.99 (3797%) | 0.68 (4127%) |
| 0.055 | (a) | 0.98 | 0.99 | 0.99 | 0.99 | 0.97 | 0.88 | 0.75 | 0.29 |
| | (b) | (0.11, 0.76) | (0.10, 0.82) | (0.15, 0.74) | (0.12, 0.74) | (0.09, 0.74) | (0.02, 0.82) | (0.01, 0.72) | (0.01, 0.62) |
| | (c) | 0.98 | 0.98 | 0.98 | 0.97 | 0.96 | 0.76 | 0.47 | 0.14 |
| | (d) | 0.60 | 0.63 | 0.63 | 0.59 | 0.56 | 0.33 | 0.12 | 0.00 |
| | (e) | 0.99 (490%) | 0.99 (437%) | 0.99 (412%) | 0.99 (484%) | 0.99 (605%) | 0.99 (1637%) | 0.99 (4787%) | 0.90 (6437%) |

Figure 1.


Power of one- and two-stage designs for m = 1000, T/m = 500, and values of μ = 0.120 for increasing values of correlation between adjacent markers. The bold line shows the maximum power of the optimal two-stage design. Optimal parameters (i and j) are shown below the horizontal axis. The dotted line shows the rule-of-thumb two-stage design (when i = 0.10 and j = 0.75). The dashed line gives the power of a one-stage design. The value of μ = 0.120 corresponds to a one-stage design with 30% power and independent markers.

Figure 2.


Power of one- and two-stage designs for m = 100, T/m = 100, and values of μ = 0.275 for increasing values of correlation between adjacent markers. The bold line shows the maximum power of the optimal two-stage design. Optimal parameters (i and j) are shown below the horizontal axis. The dotted line shows the rule-of-thumb two-stage design (when i = 0.10 and j = 0.75). The dashed line gives the power of a one-stage design. The value of μ = 0.275 corresponds to a one-stage design with 60% power and independent markers.

The results show that over a broad range of values of T, m, and μ in the case of independent gene outcomes (ρ = 0.0), the optimal design parameters are in the range of i ∈ (9%, 15%) and j ∈ (63%, 76%). The power of this optimal design is very close to a design where i = 10% and j = 75%. Therefore, as a general rule, when the genes are independent, sufficient power can be obtained by allocating approximately 75% (j) of the resources for screening in stage 1 and by validating the top 10% of the genes (i) in stage 2. In the case of correlated gene outcomes, the optimal design parameters are in the range of i ∈ (1%, 22%) and j ∈ (52%, 82%). Applying the above rule-of-thumb design to the correlated gene outcome case, we find that this design provides a sufficient approximation to the power of the optimal design for various configurations of T, m, μ, and ρ. This approximation is good for all correlations except very large ones (ρ > 0.60). In terms of total sample size and number of individuals required for stage 1, our rule-of-thumb of i = 0.1 and j = 0.75 corresponds to a design that requires 3.25T/m individuals, of whom 23% are screened for all genes in stage 1. The remaining 77% of the individuals are used to validate the promising 10% of genes in stage 2.

Power of this rule-of-thumb design is displayed for each configuration in row (c) of Table 1. By comparing rows (a) and (c), it is clear that the rule-of-thumb design produces power that is close to optimal for most configurations of T, m, and μ examined. For ρ ≤ 0.60, the difference between the optimal power and the rule-of-thumb power is in the range of 0–1%. However, for ρ > 0.60, this difference is in the range of 3–28%. Row (d) of Table 1 shows the power of the one-stage design using T/m individuals. This should be compared with the power of the optimal two-stage design shown in row (a). It is clear that the two-stage design is more powerful than a one-stage design with fixed T. This is also evident from Figures 1 and 2, comparing the solid and dashed lines.

7.2. Critical Value of Test Statistic

In practice, once we identify the “best” gene having the maximum test statistic value using a two-stage design, the next task is to determine whether this maximum test statistic is large enough for the gene to be significantly associated with the endpoint of interest. This may be done by comparing the test statistic of the best gene with the critical value from the null distribution of the maximum test statistic at, say, 5% significance level, where the null hypothesis is that none of the m genes are associated with the disease outcome. When testing m = 100, 1000, and 3000 independent genes per individual at 5% significance level using a one-stage design, the critical values of the standardized test statistic are 3.51, 4.07, and 4.31, respectively. The corresponding critical values for a two-stage design are 3.34, 4.02, and 4.29, respectively. These critical values are obtained from the null distribution of the maximum test statistic under a rule-of-thumb two-stage design. Notice that the critical values increase as more genes are evaluated for association, reflecting the effect of multiple testing. Critical values for a one-stage design are larger than the corresponding critical values for a two-stage design since the one-stage test statistic is the maximum over all markers, whereas the two-stage statistic is the maximum over markers chosen for stage 2.
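For the one-stage design with independent markers, the critical value can also be approximated analytically, which gives a quick check on the simulated figures. The sketch below assumes two-sided tests on m independent standard normal statistics; that assumption, and the function name `one_stage_critical_value`, are introduced here and are not taken from the authors' simulations.

```python
import math
from statistics import NormalDist


def one_stage_critical_value(m, alpha=0.05):
    """Critical value c with P(max over m of |Z| > c) = alpha, Z iid N(0, 1)."""
    # Per-marker level 1 - (1 - alpha)**(1/m), computed stably for large m.
    per_marker = -math.expm1(math.log1p(-alpha) / m)
    # Two-sided test: half of the per-marker level in each tail.
    return NormalDist().inv_cdf(1.0 - per_marker / 2.0)


critical_values = {m: one_stage_critical_value(m) for m in (100, 1000, 3000)}
```

Under these assumptions the results fall close to, though not exactly on, the one-stage values quoted above, which is what one would expect when comparing an analytic approximation with simulation-based estimates.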

7.3. Optimal Two-Stage Design for More Than One True Gene

A critical assumption to determine the optimal two-stage design so far has been that there is one true genetic marker and that the remaining m − 1 are null genes. In practice, multiple genetic factors could influence disease etiology, and the ultimate goal of any study would be to identify all or as many of the true markers as possible. All these markers may have the same or varying signals. Suppose for simplicity we assume that there are multiple true markers with the same signal and the goal of the study is to identify all the true markers. At the design stage, the actual number of true markers may not be known. Therefore, we may want to design the study such that (a) the power PK to select some prespecified number K markers of association is maximized or (b) the power P1* to select at least one true marker of association is maximized. Note that P1* will be ≥ PK.

We considered a simple case of m = 100, 1000, and 3000 independent markers, with five true markers of association, all with the same signal. The maximum power to identify all the K = 5 true markers was calculated for a two-stage design using Monte Carlo simulation and compared with the power of a corresponding one-stage design. Results similar to those presented in the earlier section were observed. Table 2 illustrates the power to detect K = 5 markers using a two-stage design for T/m = 5000, with m = 3000, 1000, and 100 independent markers and five true markers of association. It can be seen that the two-stage design is more powerful than a one-stage design. Further, the power of the rule-of-thumb two-stage design (with i = 10% and j = 75%) is very close to that of the optimal two-stage design for m = 1000 and 3000. However, unlike the cases of m = 1000 and 3000, the power of the optimal two-stage design is consistently larger than that of the rule-of-thumb design when m = 100. If the goal is to select at least one of these five true markers, the power using this design will be no less than that of selecting all five true markers.

Table 2.

Power to detect all five true genes of association using one- and two-stage designs in the presence of m = 3000, 1000, and 100 independent markers and T/m = 5000. Column (a) gives the maximum power of the optimal two-stage design. Column (b) gives the optimal parameters (i, j). Column (c) gives the power corresponding to a rule-of-thumb two-stage design (when i = 0.10 and j = 0.75). Column (d) gives the power of a one-stage design.

| m | μ | (a) | (b) | (c) | (d) |
|---|---|---|---|---|---|
| 3000 | 0.061 | 0.98 | (0.09, 0.82) | 0.96 | 0.30 |
| | 0.064 | 0.99 | (0.12, 0.80) | 0.98 | 0.40 |
| | 0.066 | 0.99 | (0.12, 0.81) | 0.98 | 0.50 |
| | 0.069 | 0.99 | (0.15, 0.75) | 0.99 | 0.60 |
| 1000 | 0.056 | 0.95 | (0.12, 0.82) | 0.92 | 0.30 |
| | 0.059 | 0.98 | (0.12, 0.80) | 0.95 | 0.40 |
| | 0.062 | 0.98 | (0.19, 0.76) | 0.97 | 0.50 |
| | 0.065 | 0.99 | (0.12, 0.85) | 0.98 | 0.60 |
| 100 | 0.046 | 0.76 | (0.14, 0.81) | 0.56 | 0.30 |
| | 0.0485 | 0.81 | (0.14, 0.85) | 0.67 | 0.40 |
| | 0.051 | 0.87 | (0.14, 0.85) | 0.73 | 0.50 |
| | 0.054 | 0.91 | (0.14, 0.85) | 0.80 | 0.60 |

Our simulations suggest that a rule-of-thumb two-stage design is applicable when the total number of genes to be selected (K) is much smaller than the total number of genes (m). A total of mi genes are evaluated in stage 2. If the goal is to select K genes, then mi must be larger than K. Our simulations indicate that the probability of selecting all K genes at the end of stage 2 increases as mi increases (for a fixed value of i, mi will be large when m is large).

8. Discussion

In our formulation of the problem, we have assumed that the cost constraint is in the genotyping and not in individual ascertainment. With the increasing use of gene microarray technology, the cost per chip or per individual will have to be taken into consideration in addition to genotyping costs. For example, if C is the cost of ascertaining an individual (or baseline cost per chip), then the total cost of the study given by equation (1) would be modified as T = n1m + n2mi + C × (n1 + n2). Maximizing the power using this cost function could alter the optimal design parameters. However, note that T = n1m (1 + C/m) + n2m (i + C/m). The fraction C/m represents the relative cost of ascertaining an individual to the cost of genotyping that individual (m, the total number of markers evaluated per study subject, is the total cost of genotyping an individual, assuming a unit cost for each marker genotype). If C ≪ m, then T ≈ n1m + n2mi, and the results presented in the previous section can be applied. Another issue contributing to C could be the availability of sufficient cases, particularly when the disease is rare.
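The modified cost function is easy to explore numerically. In the sketch below the function names are hypothetical, and defining j as the share of the full budget (genotyping plus ascertainment) spent in stage 1 is an assumption of this illustration rather than something specified in the text.

```python
def total_cost(n1, n2, m, i, C=0.0):
    """Total cost T = n1*m + n2*m*i + C*(n1 + n2), unit cost per genotype."""
    return n1 * m + n2 * m * i + C * (n1 + n2)


def sample_sizes_with_ascertainment(T, m, i, j, C=0.0):
    """Stage sizes when a fraction j of the full budget T is spent in stage 1.

    Each stage-1 subject costs m + C; each stage-2 subject costs m*i + C.
    """
    n1 = j * T / (m + C)
    n2 = (1.0 - j) * T / (m * i + C)
    return n1, n2


# Illustrative budget: m = 3000 markers, T/m = 5000, and a per-subject cost
# C = 30 in units of one marker genotype (a hypothetical value).
T, m, i, j = 3000 * 5000, 3000, 0.10, 0.75
n1_free, n2_free = sample_sizes_with_ascertainment(T, m, i, j, C=0.0)
n1_cost, n2_cost = sample_sizes_with_ascertainment(T, m, i, j, C=30.0)
```

Because the stage-2 term in the factored cost is i + C/m, the approximation T ≈ n1m + n2mi is tightest when C/m is small relative to i as well as to 1.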

If the total number of individuals (N) is fixed (and m, the total number of genes, is given), then the optimal design is to perform all m gene studies on every individual. It is pertinent to pose the following question in this setting. How much power do we lose by using our rule-of-thumb design versus performing all m gene studies on all N individuals? This is evaluated in row (e) of Table 1, which presents the power of an unconstrained one-stage design in which all of the individuals receive all of the tests. The relative cost of this design to the two-stage design is shown in parentheses under row (e). Recall that row (c) gives the power of a rule-of-thumb two-stage design under the cost constraint. Therefore, comparing rows (c) and (e), we can see that, while there is additional power gained by performing all of the gene studies in all of the individuals when correlation is high, the additional proportional increase in the cost of the one-stage design is very large.

The markers of interest may be scattered throughout the genome (low density of markers) or densely located in some candidate regions (high density). The extent of correlation between the markers (due to both genetic distance and evolutionary causes) depends on the density of the markers of interest and the extent of linkage disequilibrium (nonrandom association or correlation) between the markers. Thus, we expect the test statistics of 3000 equally spaced markers to be more correlated than those of 100 equally spaced markers in a fixed genomic region. For studies of isolated populations where linkage disequilibrium extends across a distance of 30–50 kilobases (i.e., loci within 30–50 kilobases of each other are correlated), it can be anticipated that fewer than 100,000 markers will be required to identify candidate regions of gene–disease association (Boehnke, 2000). Therefore, having 3000 equally spaced markers over the entire genome would result in markers with very low correlation. While the actual correlations can only be estimated from the observed data at the end of the study, broader assumptions about the correlations must be used when designing the study. Often these assumptions can be based on a priori knowledge about the markers from previous studies, if such information is available.

Table 1 and Figures 1 and 2 show that the power of the rule-of-thumb design is similar to that of the optimal two-stage design. Furthermore, the one-stage design under the same cost constraint has much lower power. Therefore, when the principal design constraint is total cost, as represented by the total number of gene evaluations, the rule-of-thumb two-stage design offers a pragmatic approach that provides most of the power achieved by an unconstrained one-stage design at a fraction of the cost.

Acknowledgements

The authors would like to thank two referees and an associate editor for their insightful comments. This research was supported in part by National Institutes of Health grants R01 GM60457 and CA73848.

References

  1. Boehnke M (2000). A look at linkage disequilibrium. Nature Genetics 25, 246–247.
  2. Feller W (1966). An Introduction to Probability Theory and Its Applications, Volume 2. New York: Wiley.
  3. Ford D, Easton DF, Stratton M, et al. (1998). Genetic heterogeneity and penetrance analysis of BRCA1 and BRCA2 in breast cancer families. The Breast Cancer Linkage Consortium. American Journal of Human Genetics 62, 676–689.
  4. Martin ER, Kaplan NL, and Weir BS (1997). Tests for linkage and association in nuclear families. American Journal of Human Genetics 61, 439–448.
  5. Satagopan JM, Offit K, Foulkes W, Robson ME, Wacholder S, Eng C, Karp SE, and Begg CB (2001). The lifetime risks of breast cancer in Ashkenazi Jewish carriers of BRCA1 and BRCA2 mutations. Cancer Epidemiology, Biomarkers, and Prevention 10, 467–473.
  6. Schaid DJ (1996). General score tests for associations of genetic markers with disease using cases and their parents. Genetic Epidemiology 13, 423–449.
  7. Schaid DJ and Rowland C (1998). Use of parents, sibs, and unrelated controls for detection of association between genetic markers and disease. American Journal of Human Genetics 63, 1492–1506.
  8. Teng J and Risch N (1999). The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases, II. Individual genotyping. Genome Research 9, 234–241.
