Author manuscript; available in PMC: 2018 Nov 30.
Published in final edited form as: Stat Med. 2018 Jul 12;37(27):4083–4095. doi: 10.1002/sim.7898

A pooling strategy to effectively use genotype data in quantitative traits genome-wide association studies

Wei Zhang 1, Aiyi Liu 1, Paul S Albert 2, Robert D Ashmead 3, Enrique F Schisterman 1, James L Mills 1
PMCID: PMC6204292  NIHMSID: NIHMS988910  PMID: 30003569

Abstract

The goal of quantitative traits genome-wide association studies is to identify associations between a phenotypic variable, such as a vitamin level, and genetic variants, often single-nucleotide polymorphisms. When funding limits the number of assays that can be performed to measure the level of the phenotypic variable, a subgroup of subjects is often randomly selected from the genotype database and the level of the phenotypic variable is then measured for each subject. Because only a proportion of the genotype data can be used, such a simple random sampling method may suffer from substantial loss of efficiency, especially when the number of assays is relatively small and the frequency of the less common variant (minor allele frequency) is low. We propose a pooling strategy in which subjects in a randomly selected reference subgroup are aligned with randomly selected subjects from the remaining study subjects to form independent pools; blood samples from subjects in each pool are mixed; and the level of the phenotypic variable is measured for each pool. We demonstrate that the proposed pooling approach produces considerable gains in efficiency over the simple random sampling method for inference concerning the phenotype-genotype association, resulting in higher precision and power. The methods are illustrated using genotypic and phenotypic data from the Trinity Students Study, a quantitative genome-wide association study.

Keywords: biallelic model, efficiency, estimation, group testing, homozygous minor allele, multiple comparison, phenotypes, pooling, power, random sampling, single nucleotide polymorphism

1 |. INTRODUCTION

The goal of a quantitative traits genome-wide association study (GWAS) is to identify genetic variants, often single-nucleotide polymorphisms (SNPs) that influence levels of a biological compound of interest, such as a vitamin, called the phenotypic variable. The phenotypic variable is generally measured on a continuous scale. The SNP is generally biallelic, that is, there are two possible versions, and each parent contributes one or the other to the study subjects. Thus, the possible combinations are AA (two copies of the major allele), Aa (one copy of the major allele and one copy of the minor allele), and aa (two copies of the minor allele).

Current technology can test an extraordinary number of SNPs, ie, 2.5 million, at a cost of only about $600 per person. In contrast, measuring the phenotypic variable or variables can be very expensive, depending on the compound, or compounds, selected. For example, measuring one phenotypic variable, homocysteine, could cost about $200 per person. Such cost constraints sometimes necessitate measuring phenotypes in randomly selected samples rather than in all genotyped subjects. A random sampling strategy is simple in concept and yields a phenotypic level for each selected subject, so that analysis by standard methods such as the two-sample t-test, ANOVA, and regression is feasible.

However, disadvantages of simple random sampling are also apparent, a major one being the absence of phenotypic data from subjects not selected by the random sampling. Given the low genotype frequencies of the minor alleles in many SNPs, the number of subjects with the homozygous minor allele genotype can be very small or even zero in the subgroup randomly selected for phenotype testing. Thus, there may be insufficient data to detect an effect of the genotype on the phenotypic variables. By chance, the selected subgroup is likely to contain most subjects with the minor allele for some SNPs, but very few subjects with the minor allele for other SNPs, because there is little or no connection between the presence of rare alleles from SNP to SNP. The small number of subjects who have the uncommon genotypes of interest when random sampling is performed may seriously hinder the ability to find important genotypes in a GWAS, especially because of the need to adjust for multiple comparisons.

In the present paper, we propose a pooling strategy to improve the efficiency of an existing data sample obtained from the simple random sampling approach. Suppose that a subgroup of subjects, referred to as the reference group, has been randomly selected from the existing genotypic database. For each subject in the reference group, we randomly select a number of subjects from among the remaining subjects in the genotype database, regardless of their genotypes, and align them with that subject to form a pool. For subjects within each pool, serum samples are mixed and the level of the phenotypic variable is measured. The result is the average of the unobserved individual phenotypic levels of the subjects in the pool. The genotypic-phenotypic association for each SNP is then estimated based on these averages of phenotypic levels across the different pools of samples. A special case to be investigated both analytically and numerically is the so-called one-big-pool approach, also described as the 1 and n − 1 approach by Schisterman et al,1 where individual phenotypic levels are measured for all but one subject selected by simple random sampling, whereas the remaining subjects are all put into one pool to yield an average measure. A unique feature of the proposed pooling strategy is that pooling is carried out so that each pool contains only one subject from the reference group (the subjects already selected by random sampling).

Group testing was first proposed by Dorfman2 to improve screening for a rare disease by pooling of biospecimens. It has drawn considerable attention in the statistical and medical literature for screening and estimation of a binary outcome; see, among others, related works.3–7 Note that, in a group testing framework, the outcome of each pool is the maximum of individual binary outcomes. The group testing method has been extended to measuring continuous variables where the outcome of a pool is the average of the individual observations in the pool; see, among others, other works.8–13

The proposed pooling strategy has some similarities to both Dorfman’s2 group testing concerning a binary outcome and the pooling strategy concerning the assessment of an exposure measured on a continuous scale (eg, the work of Schisterman et al13). By aligning subjects, the chances are increased that a pool contains subjects with homozygous minor allele genotypes of interest, which, in turn, could yield more efficient inference on the association.

The pooling strategy proposed in the present paper, however, differs from conventional pooling methods in one aspect. In the conventional pooling of continuous variable data, pooling is supervised in that the subjects are randomly pooled according to certain characteristics being compared (eg, cases or controls); see, eg, the work of Schisterman et al.13 The assignment of subjects to each pool from the proposed strategy, on the other hand, is unsupervised, not involving genotypes or other characteristics. (Indeed, in a quantitative traits GWAS, it is impossible to assign subjects to a pool according to their genotypes because the number of SNPs is so large.)

The paper is arranged as follows. In Section 2, we describe the Trinity Students Study, a quantitative traits GWAS, which motivated the present research. In Section 3, we discuss estimation and hypothesis testing based on the random sampling strategy in the context of quantitative traits GWAS. In Section 4, we describe the general pooling approach and develop methods for estimation and hypothesis testing with data generated from such a strategy. In Section 5, we focus on the one-big-pool approach and derive least-squares estimates of the parameters of interest. In Section 6, we demonstrate the efficiency gains of the proposed pooling strategy over simple random sampling. Specifically, we present conditions under which pooling yields more efficient estimation than the simple random sampling method. We further develop methods to compute the probability that these conditions hold. The methods are illustrated in Section 7 using data from the Trinity Students GWAS to evaluate the genotype-phenotype association between phenotypic variables and genotypes of interest. Finally, some discussions are presented in Section 8.

2 |. THE TRINITY STUDENTS GWAS: A MOTIVATING EXAMPLE

The Trinity Students (TSS) GWAS was conducted in 2003–2004 by investigators at The Epidemiology Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, Trinity College, Dublin, National Human Genome Research Institute and the Health Research Board of Ireland. From a starting population of a total of 3569 students attending the University of Dublin, Trinity College, 2507 students who had Irish grandparents and no major medical problems, who completed the study questionnaire, and who provided the necessary blood samples constituted the final study population. Written informed consent and IRB approval were obtained.

DNA was collected for the GWAS and over 750 000 SNPs were assayed via the Illumina system. The GWAS analysis is ongoing. As expected, many SNPs have low minor allele frequencies, producing rates of homozygosity for the minor allele of 1%−10%.

Serum samples were collected, processed, and stored until they were shipped to laboratories that measured over 40 target phenotype measures including hematologic factors, liver function tests, B vitamins, and related metabolites.

The large number of subjects makes measuring phenotype measures costly. In fact, financial constraints have made it possible to measure additional phenotypic variables of interest in only half of the study participants. For this reason, the strategy of random sampling was discussed. Random sampling is not optimal because genotype data from subjects not selected by the random sampling approach are wasted. Therefore, pooling was discussed as an alternative, potentially more informative, approach.

Later, we will use the genotype data and phenotype data collected on all subjects from the Trinity Students GWAS to evaluate the efficiency of the proposed pooling strategy, assuming for illustration that the phenotypic data had not been available for all subjects.

3 |. THE SIMPLE RANDOM SAMPLING APPROACH

Let Y denote the phenotypic level and G the genotype of a single biallelic SNP. For simplicity, we consider binary genotypes with G = 1 if the SNP has homozygous minor allele aa and G = 0 if otherwise. Define

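$$\mu_1 = E(Y \mid G = 1),\quad \sigma_1^2 = \mathrm{var}(Y \mid G = 1),\quad \mu_0 = E(Y \mid G = 0),\quad \sigma_0^2 = \mathrm{var}(Y \mid G = 0).$$

Our interest is to estimate $\delta = \mu_1 - \mu_0$ and to test the null hypothesis $H_0: \delta = 0$.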

Suppose that genotypic data are available from a total number $N$ of subjects, denoted by $S = \{1, 2, \ldots, N\}$. Let $S_1 = \{1 \le i \le N : G_i = 1\}$ and $S_0 = \{1 \le i \le N : G_i = 0\}$ be the subsets of subjects with G = 1 and G = 0, respectively. Furthermore, let $N_1$ and $N_0$ be the numbers of subjects in the two subsets, respectively, and $\rho = N_1/N$ be the proportion of subjects with G = 1 in the genotype data set. If the $N$ subjects are randomly selected from the study population, then $\rho$ is an unbiased estimate of the genotype (G = 1) frequency in the population.

For subject $i \in S$, denote by $Y_i$ its phenotypic level and $G_i$ its genotype. Note that genotypes are observed for all subjects. However, the phenotypic level $Y_i$ can be observed for only $n < N$ subjects.

A simple random sampling selects a subset $S^R$ of size $n$ from $S$. For subjects in this reference group, define

$$S_1^R = \{i \in S^R : G_i = 1\},\quad S_0^R = \{i \in S^R : G_i = 0\}$$

to be the (reference) groups of subjects having genotype 1 and 0, respectively, and write $n_1$ and $n_0$ as the numbers of subjects in the two groups, respectively. Let $Y_1^R$ and $Y_0^R$ be the vectors of phenotypic levels for subjects in $S_1^R$ and $S_0^R$, respectively, where throughout vectors refer to column vectors. Then, we have, conditioning on the random number $n_1$,

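$$E\begin{pmatrix} Y_1^R \\ Y_0^R \end{pmatrix} = \begin{pmatrix} 1_{n_1} & 1_{n_1} \\ 0 & 1_{n_0} \end{pmatrix} \begin{pmatrix} \delta \\ \mu_0 \end{pmatrix},\quad \mathrm{var}\begin{pmatrix} Y_1^R \\ Y_0^R \end{pmatrix} = \begin{pmatrix} \sigma_1^2 I_{n_1} & 0 \\ 0 & \sigma_0^2 I_{n_0} \end{pmatrix}, \tag{1}$$

where $1_m$ is the vector of 1's with order $m$, and $I_m$ is the identity matrix of order $m$. Without further distributional assumption, the parameters can be estimated using the least-squares method.14 It follows that the least squares estimate of $\delta$ is simply the difference in the observed average of phenotypic levels between the two genotypes, that is,

$$\hat\delta_R = \bar Y_1^R - \bar Y_0^R,$$

where $\bar Y_1^R = \sum_{i \in S_1^R} Y_i / n_1$, and $\bar Y_0^R = \sum_{i \in S_0^R} Y_i / n_0$. The variance of $\hat\delta_R$ is given by

$$\sigma_R^2 = \frac{1}{n_1}\sigma_1^2 + \frac{1}{n_0}\sigma_0^2 = \frac{1}{n_1}\sigma_1^2 + \frac{1}{n - n_1}\sigma_0^2. \tag{2}$$

The two variances $\sigma_1^2$ and $\sigma_0^2$ can be estimated by their sample variances

$$\hat\sigma_1^2 = \sum_{i \in S_1^R} (Y_i - \bar Y_1^R)^2 / (n_1 - 1),\quad \hat\sigma_0^2 = \sum_{i \in S_0^R} (Y_i - \bar Y_0^R)^2 / (n_0 - 1).$$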

The null hypothesis H0 can then be tested using the Student’s t-test.
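The estimate and test statistic above can be sketched in a few lines of Python. This is only an illustration: the function name `random_sampling_estimate` is hypothetical, and the statistic plugs the two sample variances into Equation (2), which corresponds to an unequal-variance (Welch-type) form of the t statistic; the degrees-of-freedom choice is left to the reader.

```python
from math import sqrt
from statistics import mean, variance


def random_sampling_estimate(y, g):
    """Return (delta_hat, standard_error, t_statistic).

    y : phenotypic levels of the n sampled subjects
    g : their binary genotypes (1 = homozygous minor allele, 0 = otherwise)
    """
    y1 = [yi for yi, gi in zip(y, g) if gi == 1]
    y0 = [yi for yi, gi in zip(y, g) if gi == 0]
    if len(y1) < 2 or len(y0) < 2:
        # With fewer than two subjects in a group the sample variance
        # is undefined and inference on delta is not feasible.
        raise ValueError("need at least two subjects in each genotype group")

    # Difference in group means (least squares estimate of delta).
    delta_hat = mean(y1) - mean(y0)
    # Equation (2) with the sample variances (divisor n - 1) plugged in.
    var_hat = variance(y1) / len(y1) + variance(y0) / len(y0)
    se = sqrt(var_hat)
    return delta_hat, se, delta_hat / se


if __name__ == "__main__":
    print(random_sampling_estimate([1.0, 3.0, 10.0, 14.0], [0, 0, 1, 1]))
```

The `ValueError` branch mirrors the point made in the text that a random sample with too few G = 1 subjects leaves δ inestimable.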

For given $n$, $\sigma_1$, and $\sigma_0$, the variance $\sigma_R^2$ is a function of $n_1$ and decreases as $n_1$ increases until $n_1$ reaches $n\sigma_1/(\sigma_1 + \sigma_0)$. This fact can be easily verified by investigating the derivative of $\sigma_R^2$ with respect to $n_1$. When the genotype frequency of the homozygous minor allele (G = 1) is relatively low, the simple sampling strategy can suffer from substantial loss of efficiency because the number $n_1$ of subjects with genotype G = 1 can be very small. With low frequency of a genotype, it is possible that the random sample may not include any subjects with this genotype, in which case inference on $\delta$ becomes infeasible. The most efficient (minimum variance) estimator will be achieved if $n_1 = N_1$, that is, if all subjects who are homozygous for the minor allele (G = 1) are selected. However, because $N_1$ is much smaller than $n$, the chance that the simple random sampling selects all $N_1$ subjects with G = 1 is very low, noting that the probability that the simple random sampling picks $n_1$ subjects with genotype G = 1 from $S$ is $C_{N_1}^{n_1} C_{N - N_1}^{n - n_1} / C_N^n$, where $C_N^n$ is the binomial coefficient. Moreover, even if the simple random sampling yields an efficient estimation for certain SNPs, it may still lose substantial efficiency for other SNPs in the GWAS analysis because of the variation in genotype frequency across SNPs.
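As a brief check of the monotonicity claim, the derivative of the variance in Equation (2) can be written out, treating $n_1$ as continuous (a simplification made here for illustration):

```latex
\frac{\partial \sigma_R^2}{\partial n_1}
  = -\frac{\sigma_1^2}{n_1^2} + \frac{\sigma_0^2}{(n - n_1)^2} < 0
\iff \frac{n - n_1}{n_1} > \frac{\sigma_0}{\sigma_1}
\iff n_1 < \frac{n\,\sigma_1}{\sigma_1 + \sigma_0}.
```

With equal variances the turning point is $n/2$, so whenever $N_1 < n/2$ the variance keeps decreasing over the whole attainable range $1 \le n_1 \le N_1$.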

Due to the low frequency of G = 1, the variance $\sigma_1^2$ is much more influential than $\sigma_0^2$ on the precision of $\hat\delta_R$. For $N = 2500$ and $N_1 = 250$, reflecting a typical SNP in the Trinity Students GWAS example with 10% frequency for G = 1, Figure 1 presents the variance $\sigma_R^2$ as a function of $1 \le n_1 \le N_1$, for $\sigma_1^2 = 0.5$, 1, and 1.5, respectively, with $n = N/2 = 1250$. In the figure, $\sigma_0$ is set to be 1. The figure shows clearly that the precision of the estimator increases substantially when $n_1$ increases from a relatively small value. It is also worth noting that, for this specific example, the probability is about 0.0005 that the simple random sampling includes at least 150 subjects with G = 1; the probability approaches zero ($\approx 10^{-25}$) if at least 200 such subjects are to be selected.

FIGURE 1. Variance of the estimator by simple random sampling. $\sigma_1^2 = 0.5$ (dashed line), 1 (solid line), and 1.5 (dotted line)
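The hypergeometric tail probabilities quoted for this example can be reproduced with exact integer arithmetic. The sketch below uses only the Python standard library; the function names `prob_at_least` and `sigma_r2` are illustrative and do not come from the paper.

```python
from math import comb


def prob_at_least(m, N, N1, n):
    """P(n1 >= m) when n subjects are drawn without replacement from N
    subjects, N1 of whom have genotype G = 1 (hypergeometric tail)."""
    upper = min(N1, n)
    numerator = sum(comb(N1, j) * comb(N - N1, n - j) for j in range(m, upper + 1))
    # Exact integers are divided once at the end to limit rounding error.
    return numerator / comb(N, n)


def sigma_r2(n1, n, s1_sq, s0_sq):
    """Variance of the random-sampling estimator, Equation (2)."""
    return s1_sq / n1 + s0_sq / (n - n1)


if __name__ == "__main__":
    N, N1, n = 2500, 250, 1250
    print("P(n1 >= 150):", prob_at_least(150, N, N1, n))
    print("P(n1 >= 200):", prob_at_least(200, N, N1, n))
    # The variance falls sharply as n1 grows from a small value.
    for n1 in (10, 50, 125, 250):
        print(n1, sigma_r2(n1, n, 1.0, 1.0))
```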

4 |. A SAMPLE ENRICHMENT APPROACH BY POOLING

Given a reference group, a randomly selected subset $S^R$ of $n$ subjects, as described in the previous section, we are interested in whether the remaining $N - n$ subjects in the subset $S \setminus S^R$ can be used together with those in $S^R$ to produce an estimate of $\delta$ with smaller variance or a test with higher power than the estimate or test based on subjects in $S^R$. Here, we propose a sample enrichment strategy as follows. For each subject $i \in S^R$, we randomly select $k_i - 1$ subjects from $S \setminus S^R$ to form a pool with subject $i$. Thus, $k_i$ is the size of the pool (total number of subjects in the pool), with $k_i \ge 1$ and $\sum_{i=1}^{n} k_i = N$. If $k_i = 1$, then the phenotypic level is measured for subject $i$. However, for subjects in a pool with size $k_i \ge 2$, individual phenotypic levels are not observed. Instead, we observe the average of individual phenotypic levels in the pool, that is,

$$\tilde Y_i^R = \frac{Y_i + \text{sum of the } Y\text{s from subjects aligned to subject } i}{k_i}.$$

Let $k_{1i}$ be the number of subjects with G = 1 in the $i$th pool. Write $r_i = k_{1i}/k_i$ as the proportion of subjects with G = 1 in the pool, and $\theta_i = r_i\sigma_1^2 + (1 - r_i)\sigma_0^2$ as the weighted variance. Then, we have

$$E(\tilde Y_i^R) = \mu_0 + r_i\delta,\quad \mathrm{var}(\tilde Y_i^R) = \frac{\theta_i}{k_i}, \tag{3}$$

which reduces to Equation (1) for individual phenotypic observations if $k_i = 1$.

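An estimate $\hat\delta_E$ of $\delta$ can be obtained using the least squares method. Let $\hat\sigma_1^2$ and $\hat\sigma_0^2$ be consistent estimates of $\sigma_1^2$ and $\sigma_0^2$, respectively, and define $\hat\theta_i = r_i\hat\sigma_1^2 + (1 - r_i)\hat\sigma_0^2$. Then, the two-stage weighted least squares estimate of $\delta$ is given by

$$\hat\delta_E = \frac{\left(\sum \frac{k_i}{\hat\theta_i}\right)\left(\sum \frac{k_{1i}\tilde Y_i^R}{\hat\theta_i}\right) - \left(\sum \frac{k_{1i}}{\hat\theta_i}\right)\left(\sum \frac{k_i\tilde Y_i^R}{\hat\theta_i}\right)}{\left(\sum \frac{k_i}{\hat\theta_i}\right)\left(\sum \frac{k_{1i}r_i}{\hat\theta_i}\right) - \left(\sum \frac{k_{1i}}{\hat\theta_i}\right)^2}, \tag{4}$$

and its variance is asymptotically given by

$$\sigma_E^2 = \frac{\sum \frac{k_i}{\theta_i}}{\left(\sum \frac{k_i}{\theta_i}\right)\left(\sum \frac{k_{1i}r_i}{\theta_i}\right) - \left(\sum \frac{k_{1i}}{\theta_i}\right)^2}, \tag{5}$$

where, unless otherwise stated, the summation is over $\{i : i \in S^R\}$.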
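Equations (4) and (5) are the usual weighted least squares formulas for the regression of the pool averages on $r_i$ with weights $k_i/\theta_i$. A minimal Python sketch follows, assuming the variance values are supplied by the caller (in practice they would be the consistent estimates described in the text); the function name `pooled_wls` is hypothetical.

```python
def pooled_wls(k, k1, y, s1_sq, s0_sq):
    """Weighted least squares estimate of delta (Eq. 4) and its variance (Eq. 5).

    k[i]  : size of pool i
    k1[i] : number of subjects with G = 1 in pool i
    y[i]  : measured average phenotypic level of pool i
    s1_sq, s0_sq : variance values plugged into theta_i
    """
    sw = swr = swrr = swy = swry = 0.0
    for ki, k1i, yi in zip(k, k1, y):
        ri = k1i / ki
        theta = ri * s1_sq + (1.0 - ri) * s0_sq
        w = ki / theta            # weight = 1 / var(pool average)
        sw += w                   # sum of k_i / theta_i
        swr += w * ri             # sum of k_1i / theta_i
        swrr += w * ri * ri       # sum of k_1i * r_i / theta_i
        swy += w * yi
        swry += w * ri * yi
    denom = sw * swrr - swr ** 2
    if denom <= 0.0:
        # Happens when every pool has the same r_i, so delta is not identifiable.
        raise ValueError("delta is not identifiable from these pools")
    delta_hat = (sw * swry - swr * swy) / denom
    var_delta = sw / denom
    return delta_hat, var_delta


if __name__ == "__main__":
    # All pools of size 1: the estimate reduces to the difference in group means.
    print(pooled_wls([1, 1, 1, 1], [0, 0, 1, 1], [1.0, 3.0, 10.0, 14.0], 4.0, 1.0))
```

With every $k_i = 1$ the function returns the difference in group means and the variance in Equation (2), which is a convenient sanity check.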

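With homoscedasticity, that is, $\sigma_1^2 = \sigma_0^2 = \sigma^2$, the least squares estimate and its variance reduce to

$$\hat\delta_E = \frac{\sum k_i(r_i - \rho)\tilde Y_i^R}{\sum k_{1i}(r_i - \rho)},\quad \sigma_E^2 = \frac{\sigma^2}{\sum k_{1i}(r_i - \rho)}.$$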
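The reduction can be verified directly. The short derivation below assumes that every subject is placed in a pool, so that $\sum k_i = N$ and $\sum k_{1i} = N_1 = \rho N$, and substitutes $\theta_i = \sigma^2$ into Equations (4) and (5):

```latex
\begin{aligned}
\text{numerator of (4)} &= \frac{1}{\sigma^4}\Big(N\sum k_{1i}\tilde Y_i^R - N_1\sum k_i\tilde Y_i^R\Big)
   = \frac{N}{\sigma^4}\sum k_i (r_i - \rho)\,\tilde Y_i^R,\\
\text{denominator of (4)} &= \frac{1}{\sigma^4}\Big(N\sum k_{1i} r_i - N_1^2\Big)
   = \frac{N}{\sigma^4}\sum k_{1i}(r_i - \rho),\\
\sigma_E^2 &= \frac{N/\sigma^2}{(N/\sigma^4)\sum k_{1i}(r_i - \rho)}
   = \frac{\sigma^2}{\sum k_{1i}(r_i - \rho)}.
\end{aligned}
```

The first line uses $k_{1i} = k_i r_i$, and the second uses $\rho \sum k_{1i} = N_1^2/N$.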

For low frequency of G = 1, we expect that, for most cases, the enrichment approach will yield smaller variance, that is, $\mathrm{var}(\hat\delta_E) = \sigma_E^2 < \sigma_R^2$, and thus tests with higher power, because it fully utilizes the genotype data.

Consistent estimates of $\sigma_1^2$ and $\sigma_0^2$ are needed for interval estimation and hypothesis testing concerning $\delta$. These estimates can be obtained as follows. Let $\mu_0^*$ and $\delta^*$ be the least squares estimates from (3), derived by assuming that $\sigma_0^2 = \sigma_1^2$. Define $Z_i = k_i(\tilde Y_i^R - \mu_0^* - r_i\delta^*)^2$. Then, it follows from (3) that

$$E(Z_i) \approx r_i\sigma_1^2 + (1 - r_i)\sigma_0^2.$$

The variances can then be estimated by the ordinary least squares estimates from the aforementioned regression equation. This estimation process can also be iterated a few times until the estimates converge, with $\mu_0$ and $\delta$ being estimated using least squares methods in further iterations.
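One pass of this variance-estimation step might look as follows in Python. The sketch is illustrative only: the function name `estimate_variances` is hypothetical, no iteration is performed, and nothing prevents a negative estimate in small samples, which a practical implementation would need to handle.

```python
def estimate_variances(k, k1, y):
    """Return (mu0_star, delta_star, s1_sq_hat, s0_sq_hat) from one pass.

    Stage 1 fits model (3) by least squares under equal variances
    (weights proportional to k_i); stage 2 regresses Z_i on r_i by
    ordinary least squares.
    """
    r = [k1i / ki for k1i, ki in zip(k1, k)]

    # Stage 1: weighted least squares with weights k_i.
    sw = float(sum(k))
    swr = sum(ki * ri for ki, ri in zip(k, r))
    swrr = sum(ki * ri * ri for ki, ri in zip(k, r))
    swy = sum(ki * yi for ki, yi in zip(k, y))
    swry = sum(ki * ri * yi for ki, ri, yi in zip(k, r, y))
    denom = sw * swrr - swr ** 2
    delta_star = (sw * swry - swr * swy) / denom
    mu0_star = (swy - delta_star * swr) / sw

    # Stage 2: OLS of Z_i on r_i; E(Z_i) is approximately
    # sigma_0^2 + r_i * (sigma_1^2 - sigma_0^2).
    z = [ki * (yi - mu0_star - ri * delta_star) ** 2 for ki, ri, yi in zip(k, r, y)]
    m = len(z)
    r_bar = sum(r) / m
    z_bar = sum(z) / m
    sxx = sum((ri - r_bar) ** 2 for ri in r)
    slope = sum((ri - r_bar) * (zi - z_bar) for ri, zi in zip(r, z)) / sxx
    s0_sq_hat = z_bar - slope * r_bar      # fitted value at r = 0
    s1_sq_hat = s0_sq_hat + slope          # fitted value at r = 1
    return mu0_star, delta_star, s1_sq_hat, s0_sq_hat


if __name__ == "__main__":
    print(estimate_variances([1, 1, 1, 1], [0, 0, 1, 1], [1.0, 3.0, 10.0, 14.0]))
```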

Alternatively, we can obtain the maximum likelihood estimates if the phenotypic levels are assumed to be normally distributed. However, the least squares methods are more robust to departures from the normality assumption.

5 |. THE ONE-BIG-POOL APPROACH

The one-big-pool approach is a special case of the sample enrichment, with $k_i = N - n + 1$ for one $i$ and $k_i = 1$ for all others. Recall that subjects in the reference group $S^R$ are selected from the random sampling procedure, among which $n_1$ subjects have genotype G = 1 and $n_0$ subjects are of genotype G = 0. The one-big-pool strategy randomly selects a subject from $S^R$ to be combined with all subjects in $S \setminus S^R$ to form a single pool, hence the name “one-big-pool.” We denote by $\tilde S^R$ the subset obtained by excluding from $S^R$ the subject selected for alignment, and define

$$\tilde n_1 = \#\{i \in \tilde S^R : G_i = 1\},\quad \tilde n_0 = \#\{i \in \tilde S^R : G_i = 0\},$$

to be the numbers of subjects with G = 1 and G = 0 in $\tilde S^R$, respectively, where $G_i$ is the genotype of the $i$th subject. Thus, $\tilde n_1 = n_1$ if the subject selected for alignment is of genotype G = 0 and $\tilde n_1 = n_1 - 1$ if the subject selected for alignment is of genotype G = 1.

Individual phenotypic levels are measured for each subject in $\tilde S^R$. For the one-big-pool, its outcome is denoted by $\tilde Y$, which is the average of the unobserved individual phenotypic levels in the pool. Note that, in this pool, there are $N_1 - \tilde n_1$ subjects with genotype G = 1 and $N_0 - \tilde n_0$ subjects with genotype G = 0. Note that, for $i \in \tilde S^R$, we have $k_i = 1$, $k_{1i} = G_i = r_i$, and $\theta_i = G_i\sigma_1^2 + (1 - G_i)\sigma_0^2$. Define $r = (N_1 - \tilde n_1)/(N - n + 1)$, the proportion of subjects in the pool that have genotype G = 1, and

$$\bar Y_1 = \frac{1}{\tilde n_1}\sum_{i \in \tilde S^R,\, G_i = 1}\tilde Y_i^R,\quad \bar Y_0 = \frac{1}{\tilde n_0}\sum_{i \in \tilde S^R,\, G_i = 0}\tilde Y_i^R.$$

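Then, following tedious yet straightforward algebraic manipulation, we can show that Equation (4) becomes

$$\hat\delta_E = \frac{\frac{\tilde n_1}{\hat\sigma_1^2}\left(\frac{\tilde n_0}{\hat\sigma_0^2} + \frac{N_0 - \tilde n_0}{\hat\theta}\right)\bar Y_1 - \frac{\tilde n_0}{\hat\sigma_0^2}\left(\frac{\tilde n_1}{\hat\sigma_1^2} + \frac{N_1 - \tilde n_1}{\hat\theta}\right)\bar Y_0 + \frac{N - n + 1}{\hat\theta}\left(\frac{r\tilde n_0}{\hat\sigma_0^2} - \frac{(1 - r)\tilde n_1}{\hat\sigma_1^2}\right)\tilde Y}{\frac{\tilde n_0}{\hat\sigma_0^2}\left(\frac{\tilde n_1}{\hat\sigma_1^2} + \frac{r(N_1 - \tilde n_1)}{\hat\theta}\right) + \frac{\tilde n_1(1 - r)^2(N - n + 1)}{\hat\theta\hat\sigma_1^2}},$$

and Equation (5) reduces to

$$\sigma_E^2 = \frac{\frac{\tilde n_1}{\sigma_1^2} + \frac{\tilde n_0}{\sigma_0^2} + \frac{N - n + 1}{\theta}}{\frac{\tilde n_0}{\sigma_0^2}\left(\frac{\tilde n_1}{\sigma_1^2} + \frac{r(N_1 - \tilde n_1)}{\theta}\right) + \frac{\tilde n_1(1 - r)^2(N - n + 1)}{\theta\sigma_1^2}}, \tag{6}$$

where $\theta = r\sigma_1^2 + (1 - r)\sigma_0^2$.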
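The algebra behind Equation (6) can be outlined compactly. The shorthand $a$, $b$, $c$, and $K$ below is introduced here for illustration and does not appear in the original derivation:

```latex
\begin{aligned}
&K = N - n + 1,\qquad a = \frac{\tilde n_1}{\sigma_1^2},\qquad b = \frac{\tilde n_0}{\sigma_0^2},\qquad c = \frac{K}{\theta},\\
&\sum \frac{k_i}{\theta_i} = a + b + c,\qquad
 \sum \frac{k_{1i}}{\theta_i} = a + rc,\qquad
 \sum \frac{k_{1i} r_i}{\theta_i} = a + r^2 c,\\
&(a + b + c)(a + r^2 c) - (a + rc)^2 = b\,(a + r^2 c) + a\,c\,(1 - r)^2 .
\end{aligned}
```

Since $r^2 c = r(N_1 - \tilde n_1)/\theta$, the last expression is the denominator of Equation (6), and $a + b + c$ is its numerator.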

Among all pooling strategies, the one-big-pool approach retains the largest number of individual phenotypic measurements. This is appealing because the sample variances based on these individual phenotypic data provide convenient estimates of the variances.

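With homoscedasticity, the least squares estimate further reduces to

$$\hat\delta_E = \frac{(1 - \rho)\tilde n_1\bar Y_1 - \rho\tilde n_0\bar Y_0 + (N - n + 1)(r - \rho)\tilde Y}{(1 - r)\tilde n_1 + (r - \rho)N_1},$$

with variance

$$\sigma_E^2 = \frac{\sigma^2}{(1 - r)\tilde n_1 + (r - \rho)N_1}. \tag{7}$$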
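The denominator in Equation (7) follows from the general homoscedastic result $\sigma_E^2 = \sigma^2 / \sum k_{1i}(r_i - \rho)$ by separating the individually measured subjects (for whom $k_{1i} = r_i = G_i$) from the single pool, which contains $N_1 - \tilde n_1$ subjects with G = 1:

```latex
\sum k_{1i}(r_i - \rho)
  = \tilde n_1 (1 - \rho) + (N_1 - \tilde n_1)(r - \rho)
  = (1 - r)\,\tilde n_1 + (r - \rho)\,N_1 .
```

Subjects with $G_i = 0$ contribute nothing to the sum because $k_{1i} = 0$ for them.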

6 |. GAINS IN EFFICIENCY

The sample enrichment yields more efficient estimation of $\delta$ and subsequently a more powerful test if $\sigma_E^2 < \sigma_R^2$. Given the genotypic data $S$ and the number $n$, a specific reference group $S^R$ of $n$ subjects is selected by random sampling from $S$. We are interested in whether gains in efficiency can be achieved by enriching the reference group $S^R$ via pooling of samples with subjects in $S^R$. Specifically, we will investigate conditions under which $\sigma_E^2 < \sigma_R^2$, and the probability that such variance inequality holds. Due to the large number of combinations of aligning the subjects, these issues can be addressed analytically for simple cases such as the one-big-pool approach, whereas, for others, they can be assessed via simulations.

For the one-big-pool case, the variances are given in (6) for heteroscedastic error variances and (7) for homoscedastic error variances, respectively. Fixing other components in the variance functions, the random sampling variance $\sigma_R^2 = \sigma_R^2(n_1)$ depends on $n_1$, the number of subjects with G = 1 in $S^R$, and the enrichment variance $\sigma_E^2 = \sigma_E^2(\tilde n_1)$ depends on $\tilde n_1$, the number of subjects with G = 1 not in the pool. Recall that $\tilde n_1$ is either $n_1$ or $n_1 - 1$. The probability of having the sampling outcome $(n_1, \tilde n_1)$ is

$$p(n_1, \tilde n_1) = \begin{cases} \dfrac{C_{N_1}^{n_1} C_{N - N_1}^{n - n_1}}{C_N^n} \times \dfrac{n - n_1}{n}, & \text{if } \tilde n_1 = n_1; \\[2ex] \dfrac{C_{N_1}^{n_1} C_{N - N_1}^{n - n_1}}{C_N^n} \times \dfrac{n_1}{n}, & \text{if } \tilde n_1 = n_1 - 1. \end{cases}$$

Therefore, the probability that the one-big-pool enrichment results in more efficient estimation is given by

$$P\{\sigma_E^2(\tilde n_1) < \sigma_R^2(n_1)\} = \sum_{(n_1, \tilde n_1) \in \Omega} p(n_1, \tilde n_1),$$

where $\Omega = \{(n_1, \tilde n_1) : \sigma_E^2(\tilde n_1) < \sigma_R^2(n_1)\}$.

Explicit conditions in terms of $n_1$ can be derived under which $\sigma_E^2(\tilde n_1) < \sigma_R^2(n_1)$. Note that the inequality $\sigma_E^2(\tilde n_1) < \sigma_R^2(n_1)$ reduces to a quadratic inequality in $n_1$ for either $\tilde n_1 = n_1$ or $\tilde n_1 = n_1 - 1$. Let $L < U$ be the two roots of the corresponding quadratic equation; then $\sigma_E^2(\tilde n_1) < \sigma_R^2(n_1)$ if and only if $n_1 < L$ or $n_1 > U$. For the homoscedastic error variances in particular, the two roots are given by $\rho\{nN \pm \sqrt{nN(N - n + 1)}\}/(N + 1)$ if $\tilde n_1 = n_1$, which is obtained by solving the quadratic equation $\sigma_E^2(\tilde n_1) = \sigma_R^2(n_1)$ with $\sigma_R^2(n_1)$ defined in (2) and $\sigma_E^2(\tilde n_1)$ in (7).
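For the homoscedastic case with $\tilde n_1 = n_1$, the quadratic can be written out explicitly. The shorthand $K = N - n + 1$ is introduced here for convenience. From (2), $\sigma_R^2 = \sigma^2 n / \{n_1(n - n_1)\}$, and from (7) with $r = (N_1 - n_1)/K$ the denominator equals $n_1 - \rho N_1 + (N_1 - n_1)^2/K$, so that

```latex
\begin{aligned}
\sigma_E^2 = \sigma_R^2
 &\iff \frac{n_1 (n - n_1)}{n} = n_1 - \rho N_1 + \frac{(N_1 - n_1)^2}{K}\\
 &\iff (N + 1)\, n_1^2 - 2 n \rho N\, n_1 + \rho^2 n N (n - 1) = 0\\
 &\iff n_1 = \frac{\rho\,\{\, nN \pm \sqrt{nN(N - n + 1)}\,\}}{N + 1}.
\end{aligned}
```

The second line uses $n + K = N + 1$, $N_1 = \rho N$, and $N - K = n - 1$. Because the leading coefficient is positive, $\sigma_E^2 < \sigma_R^2$ holds exactly when $n_1$ lies outside the two roots.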

Assuming that the variances are homoscedastic, we compute the relative efficiency of the one-big-pool enrichment to the random sampling with $N$ and $n$ as in Figure 1 ($N = 2500$, $n = 1250$), but with $N_1 = 250$ and 750, respectively, representing the cases of relatively low and high frequency of G = 1.

FIGURE 2. Relative efficiency of enrichment to random sampling (solid lines correspond to $\tilde n_1 = n_1 - 1$, and dashed lines to $\tilde n_1 = n_1$)

With $N_1 = 250$, if the one subject to be aligned with the remaining subjects has genotype G = 0, then the enrichment approach is more efficient if $n_1 \le 122$ or $n_1 \ge 128$. Otherwise, if the subject has genotype G = 1, then the enrichment is more efficient if $n_1 \le 102$ or $n_1 \ge 148$. The probability is found to be 0.665 that the enrichment is more efficient than the random sampling.

With $N_1 = 750$, if the one subject to be aligned with the remaining subjects has genotype G = 0, then the enrichment approach is more efficient if $n_1 \le 367$ or $n_1 \ge 383$. Otherwise, if the subject has genotype G = 1, then the enrichment is more efficient if $n_1 \le 357$ or $n_1 \ge 393$. Notably, the probability that the enrichment approach is more efficient than the random sampling approach is found to be almost one.
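The thresholds and the efficiency probability for the $N_1 = 250$ example can be checked numerically. The sketch below assumes homoscedastic variances (so that $\sigma^2$ cancels), expresses both variances in units of $\sigma^2$, and skips the outcome $n_1 = 0$, for which the random-sampling estimator is undefined. All function names are illustrative.

```python
from math import comb, sqrt


def sigma_r2(n1, n):
    """Random-sampling variance, Equation (2), in units of sigma^2."""
    return 1.0 / n1 + 1.0 / (n - n1)


def sigma_e2(nt1, N, N1, n):
    """One-big-pool variance, Equation (7), in units of sigma^2.

    nt1 is the number of G = 1 subjects measured individually.
    """
    K = N - n + 1                 # size of the big pool
    r = (N1 - nt1) / K            # proportion of G = 1 in the pool
    rho = N1 / N
    return 1.0 / ((1.0 - r) * nt1 + (r - rho) * N1)


def roots(N, N1, n):
    """Roots L < U of the quadratic for the case nt1 = n1."""
    rho = N1 / N
    half_width = sqrt(n * N * (N - n + 1))
    return rho * (n * N - half_width) / (N + 1), rho * (n * N + half_width) / (N + 1)


def prob_pool_better(N, N1, n):
    """P(sigma_E^2 < sigma_R^2) summed over outcomes (n1, nt1)."""
    total = comb(N, n)
    p = 0.0
    for n1 in range(1, min(N1, n - 1) + 1):
        h = comb(N1, n1) * comb(N - N1, n - n1) / total   # hypergeometric pmf
        vr = sigma_r2(n1, n)
        if sigma_e2(n1, N, N1, n) < vr:        # aligned subject has G = 0
            p += h * (n - n1) / n
        if sigma_e2(n1 - 1, N, N1, n) < vr:    # aligned subject has G = 1
            p += h * n1 / n
    return p


if __name__ == "__main__":
    print(roots(2500, 250, 1250))
    print(prob_pool_better(2500, 250, 1250))
```

For $N = 2500$, $N_1 = 250$, and $n = 1250$, the roots fall between 122 and 123 and between 127 and 128, in line with the integer thresholds quoted in the text.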

Figure 2 presents the variance ratios $\sigma_E^2(n_1)/\sigma_R^2(n_1)$ and $\sigma_E^2(n_1 - 1)/\sigma_R^2(n_1)$ for $1 \le n_1 \le N_1$. For both cases, the enrichment yields more efficient estimation for most values of $n_1$. The enrichment gains slightly more efficiency if the one subject to be aligned with the remaining subjects has genotype G = 0. However, the genotype of this subject becomes less influential when genotype frequency is relatively high. Substantial gains in efficiency occur when $n_1$ is small, that is, when the random sampling selects fewer subjects with G = 1. The figure also indicates that the loss of efficiency due to enrichment, if occurring, is minimal.

We wanted to see how the enrichment performs compared to the random sample for different values of N1, the number of homozygous minor alleles, especially for small values. In order to make this comparison, we calculated the power of each approach as a function of N1, ranging from 100 to 550.

For a given number of subjects with each genotype out of a total of 2500, data were simulated from a normal distribution. To be specific, we generated the phenotype data with G = 0 from the normal distribution N(0, 11), and N(2, 11) for those with G = 1. That suggests that the two populations of genotypes had an effect size of 2/11 = 0.1818, which is similar to the largest effect sizes we saw in the TSS data. Let n = 1250. The significance level was set to be 0.05/500 000 to adjust for multiple comparisons in a GWAS.

Figure 3 shows that, for SNPs with low percentages of homozygous minor alleles, the enrichment gains a fairly large advantage in power over the random sample approach. While the difference in power decreases as the percentage of homozygous minor alleles increases, the one-big-pool approach appears to always yield larger power.

FIGURE 3. Power as a function of the number of homozygous minor alleles

7 |. ILLUSTRATION WITH DATA FROM THE TRINITY STUDENTS STUDY

Since we had individual phenotype values on all subjects, we averaged the individual phenotypic values from groups of subjects to create the value that would be obtained by random sampling and pooling their serum samples before testing. We then examined how the results obtained by analyzing data on these pools compared with the complete result calculated by including all the individual data.

For illustrations, we used a biochemical analyte as the phenotypic variable along with 120 SNPs to test the different approaches. The genotype data were selected from among the genes related to this factor biologically, eg, enzyme genes involved in the metabolism of the biochemical analyte. These genotypic data were supplemented with data on genes that were considered potentially important in other phenotypic variables but were not thought to be related to the biochemical analyte we studied.

A small number of values were missing from the data set, with 0.6% of the phenotypic variable’s entries missing and between 1.6% and 8.4% of entries missing for the different SNPs. Missing values of the phenotypic variable were replaced by the overall average of the observed values, and missing SNPs were randomly assigned a genotype according to the proportion in the sample for that particular SNP.

The two-stage least-squares analysis showed that, in general, the one-big-pool enrichment yielded smaller p-values than the random sample approach for SNPs deemed significant in the complete approach. Depending on the sample drawn, the methods varied in their ability to identify the same significant SNPs as the complete approach. For the comparison of methods, we averaged the p-values over 1000 iterations of random samples from the data. The results are presented in Figure 4.

FIGURE 4. Log averaged p-values of the Trinity Students data for each sampling method (the dashed line corresponds to the type I error correction threshold of log(0.05/120)). Single-nucleotide polymorphisms were sorted by the ascending order of p-values for the complete analysis

We found that, for the SNPs that were at least marginally significant, the average p-value for the one-big-pool approach was considerably smaller than for the random sample. While this was not always the case for an individual sample, on average, the one-big-pool approach gave results closer to the complete.

In this illustration, many of the SNPs had a very low homozygous minor allele percentage. Among the SNPs with the 17 lowest p-values for the complete approach, the percentage of homozygous minor alleles ranged from 0.8% to 5.9%. We saw in Figure 3 that the gains in efficiency were greatest for samples with small percentages of homozygous minor alleles, which corresponds to the relatively large gains in efficiency for the one-big-pool approach that we see in this example.

8 |. DISCUSSION

In this paper, we proposed a sample enrichment approach to use existing genotypic data more effectively to estimate the genotype-phenotype association in a quantitative GWAS. Through technical development, simulation, and illustration with data from the TSS GWAS, we demonstrated that the one-big-pool approach could gain substantial efficiency as compared to the simple random sampling approach. Pooling DNA samples for genotyping has been discussed in the GWAS literature. To the best of our knowledge, this is the first paper on pooling phenotypic measures in the context of quantitative GWAS. We discuss next a few issues and some potential research directions arising from the proposed enrichment design.

8.1 |. Other alignment approaches

The performance of other enrichment strategies deserves further investigation, both analytically and numerically. Unlike the one-big-pool approach, explicit conditions under which other enrichment strategies are more efficient than the random sampling approach are extremely difficult to derive because of the number of possible combinations of aligning subjects. The challenge here is to figure out all possible configurations of the number of subjects in each pool having the homozygous minor allele and the probability that each configuration occurs.

Equation (5) and its special case under homoscedasticity provide the basis for evaluating other designs. Given the reference group, it is possible to produce Figure 2 to compare the random sample design with the one-big-pool design because the variance depends only on whether or not the reference subject in the big pool has the homozygous minor allele ($\tilde n_1 = n_1$ or $\tilde n_1 = n_1 - 1$). For other designs, however, such graphical presentation may not be feasible due to the many different configurations of $\{k_i : i \in S^R\}$. For example, consider the 1–1 design where each subject in the reference group of size $n < N/2$ is aligned with one subject from the remaining subjects. The variance involves the numbers of pools with both subjects having the allele, only one subject having the allele, and neither subject having the allele. Understandably, there are many configurations of these three numbers. Instead of comparing variances for each configuration, we can compare the expected variance $E(\sigma_E^2)$ with the expectation taken over all possible configurations of $\{k_i : i \in S^R\}$.

To demonstrate, consider 1–1 alignment with $N = 2500$, $N_1 = 50$, and $n = 250$, reflecting a rare allele. For each $n_1 \in \{1, 2, \ldots, N_1\}$, we calculate the average variance of the sample enrichment approach based on 5000 simulations (ie, 5000 possible configurations). The ratio of the expected variance of the sample enrichment approach over that of the random sampling approach is 0.928, showing the efficiency of the 1–1 design.

8.2 |. Relative cost efficiency

In the present paper, the relative statistical efficiency, defined as the ratio of variances of estimators, was used to compare different designs. This is a common practice in statistical literature. In some practical applications, both statistical precision and real cost are of concern, in which case, Thomas’ (2007) relative cost efficiency15 can be used, which provides further informative understanding for the comparison of different study designs. According to Thomas (2007), the relative cost efficiency is the relative statistical efficiency multiplied by the ratio of costs.

The present paper deals with a secondary problem following the completion of genotyping; therefore, no additional cost is incurred for genotyping. There are, however, two potential sources of cost in the follow-up study. One is the laboratory assay cost for phenotypic measures for each pool; the total cost of such is proportional to the number of pools. Another is the cost associated with each subject selected for follow-up, possibly including the purchasing cost of the biospecimens and the cost of combining samples for pooling. If only the laboratory assay cost occurs, then the cost for assaying a random sample of size n is the same as for assaying n pools; the relative cost efficiency then reduces to the relative statistical efficiency.

Otherwise, let the cost per laboratory assay be $C_1$ and the cost per subject be $C_2$. Then, the relative cost efficiency of a pooling design with n pools and pool sizes $\{k_i, i = 1, \ldots, n\}$ to a random reference sample of size n is

$$\frac{nC_1 + C_2\sum_{i=1}^{n} k_i}{n(C_1 + C_2)} = \frac{nc + \sum_{i=1}^{n} k_i}{n(c + 1)}$$

multiplied by the variance ratio, where $c = C_1/C_2$. In practice, the assay-specific cost $C_1$ is usually much higher than the subject-specific cost $C_2$, that is, c is usually large. Then, the relative cost efficiency is close to the relative statistical efficiency.
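The cost ratio can be evaluated directly for a given design. The short Python sketch below uses hypothetical function names and illustrative inputs; it assumes the variance ratio is supplied as pooling design over random sampling, so that values below 1 favor pooling.

```python
def cost_ratio(pool_sizes, c):
    """Cost of assaying the pools relative to a random sample of the same size.

    pool_sizes: the pool sizes k_i, one entry per pool.
    c: ratio C1 / C2 of the per-assay cost to the per-subject cost.
    """
    n = len(pool_sizes)
    return (n * c + sum(pool_sizes)) / (n * (c + 1.0))


def relative_cost_efficiency(pool_sizes, c, variance_ratio):
    """Relative statistical efficiency multiplied by the ratio of costs."""
    return cost_ratio(pool_sizes, c) * variance_ratio


if __name__ == "__main__":
    # Illustrative values only: a 1-1 design with 250 pools and c = 20.
    print(relative_cost_efficiency([2] * 250, c=20.0, variance_ratio=0.928))
```

When every pool has size 1, the cost ratio equals 1, and as c grows the cost ratio approaches 1 for any fixed set of pool sizes, which matches the remark that the two efficiency measures nearly coincide when assays dominate the cost.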

Future research is needed to investigate the performance of various enrichment designs and to search for optimal designs based on the relative cost efficiency criterion.

8.3 |. Other remarks

A difficult problem that remains unsolved is whether optimal strategies, in terms of sampling methods and pool sizes, exist. Even if they do exist, the optimal strategy may itself depend on the genotype frequency, making this characterization not practically useful when multiple SNPs are of interest. To this end, methods in the work of Schisterman et al13 for calculating the optimal pool and unpooled samples to maximize efficiency could be extended for this problem.

The two-group comparison we focused on in the present paper applies to recessive and dominant genetic models, which essentially divide subjects into two groups. If the underlying genetic model is additive, then the genotype will be coded as 0, 1, and 2, which in turn yields three groups of subjects. In this case, ANOVA or the Kruskal-Wallis test can be used instead to compare a phenotypic variable among the three groups. It is expected that the enrichment design will continue to outperform the random reference group when the allele frequency is relatively low. Extensions of the enrichment approach to multiple-group comparisons deserve further investigation.

When multiple correlated genotypes are of interest, the problem for association studies becomes one of testing the equality of phenotypic means across all possible genotype (haplotype) combinations. For example, if k genetic markers (loci) are considered in an association study, then there are $2^k$ possible haplotype pairs that are compatible with the observed genotypes, and the null hypothesis is that there is no difference between the phenotypic means of these $2^k$ genotypic groups. One potential method is to directly extend the sample enrichment approach discussed in the present manuscript to the comparison of multiple groups. The correlations between the markers are automatically embedded in the grouping of the genotypes. The effect of genotyping errors on efficiency, and means of correcting for such errors, are also worth further investigation.

Multiple phenotypes are commonly encountered in practice, including in GWAS. Indeed, the TSS GWAS that motivated our interest measured concentration levels of a number of vitamins and biomarkers from each student, although in this study, investigators are interested in each individual phenotypic variable. When multiple phenotypes are being tested simultaneously, the problem becomes one of testing the equality of phenotypic mean vectors between independent groups. This can be accomplished by applying a proper multivariate test such as Hotelling’s $T^2$ or multivariate analysis of variance.

In the present paper, the sampling approach is constructed by assuming a homogeneous population, which is the case in the TSS GWAS (participants are students in a college in Ireland). If stratification exists in the population, then both genotypes and phenotypes need to be adjusted for population stratification; see, among others, the work of Price et al.16 When the sample enrichment design is employed, only the averages of the individual phenotypic levels are available. This would make it extremely difficult to adjust the phenotypes for stratification. Further research is needed along these lines.

Finally, it is worth pointing out that, in the general context of pooling biospecimens to measure the levels of biomarkers on a continuous scale, a key assumption is that the value calculated from a pool of samples is the average of the individual levels of the samples in the pool. While in many cases, this assumption seems well justified, in other cases, caution needs to be exercised when making such an assumption.

ACKNOWLEDGEMENTS

The authors thank the editor and two anonymous referees for their insightful comments and suggestions that substantially improved the manuscript. Research of the authors is supported by the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development. The design, recruitment, and all metabolite and genomic analyses of the TSS cohort were implemented through the collaboration of Prof. John Scott and Dr. Anne Molloy, Trinity College Dublin, Ireland, Dr. Peadar Kirke, Health Research Board, Dublin, Ireland, Dr. Lawrence Brody and Dr. Faith Pangilinan, National Human Genome Research Institute (NHGRI), and Dr. James Mills, Eunice Shriver National Institute for Child Health and Human Development (NICHD) with funding through NICHD contract NO1- HD-3–3348 and with additional financial support from NHGRI and the Health Research Board, Ireland. The authors thank Dr. Yaakov Malinovsky for helpful comments and suggestions, and Mary Conley for help with the TSS GWAS data.

Funding information

Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development; NICHD, Grant/Award Number: NO1- HD-3–3348; NHGRI; Health Research Board, Ireland

APPENDIX

TECHNICAL DERIVATIONS

Equations (4) and (5)

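By Equation (3), we can use a regression model $E(Y) = X\beta$ to relate $\tilde{Y}_i^R$ and $r_i$, i = 1, … , n, where

$$Y = \begin{pmatrix} \tilde{Y}_1^R \\ \vdots \\ \tilde{Y}_n^R \end{pmatrix}, \quad X = \begin{pmatrix} 1 & r_1 \\ \vdots & \vdots \\ 1 & r_n \end{pmatrix}, \quad \text{and} \quad \beta = \begin{pmatrix} \mu_0 \\ \delta \end{pmatrix},$$

with Cov(Y) = Σ, an n × n variance-covariance matrix with diagonal elements $\operatorname{Var}(\tilde{Y}_i^R) = \theta_i/k_i$ and all off-diagonal elements equal to 0.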

Assuming that the $\theta_i = r_i\sigma_1^2 + (1 - r_i)\sigma_0^2$ are known, the weighted least squares estimate of β is given by

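$$\tilde{\beta} = \begin{pmatrix} \tilde{\mu}_0 \\ \tilde{\delta}_E \end{pmatrix} = (X^{\top}\Sigma^{-1}X)^{-1}X^{\top}\Sigma^{-1}Y,$$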

yielding

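$$\tilde{\mu}_0 = \frac{\left(\sum \frac{k_i r_i^2}{\theta_i}\right)\left(\sum \frac{k_i \tilde{Y}_i^R}{\theta_i}\right) - \left(\sum \frac{k_i r_i}{\theta_i}\right)\left(\sum \frac{k_{1i} \tilde{Y}_i^R}{\theta_i}\right)}{\left(\sum \frac{k_i}{\theta_i}\right)\left(\sum \frac{k_{1i} r_i}{\theta_i}\right) - \left(\sum \frac{k_{1i}}{\theta_i}\right)^2},$$

and

$$\tilde{\delta}_E = \frac{\left(\sum \frac{k_i}{\theta_i}\right)\left(\sum \frac{k_{1i} \tilde{Y}_i^R}{\theta_i}\right) - \left(\sum \frac{k_{1i}}{\theta_i}\right)\left(\sum \frac{k_i \tilde{Y}_i^R}{\theta_i}\right)}{\left(\sum \frac{k_i}{\theta_i}\right)\left(\sum \frac{k_{1i} r_i}{\theta_i}\right) - \left(\sum \frac{k_{1i}}{\theta_i}\right)^2}.$$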

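Note that $\operatorname{Cov}(\tilde{\beta}) = (X^{\top}\Sigma^{-1}X)^{-1}$. Thus, we have

$$\operatorname{Var}(\tilde{\delta}_E) = \frac{\sum \frac{k_i}{\theta_i}}{\left(\sum \frac{k_i}{\theta_i}\right)\left(\sum \frac{k_{1i} r_i}{\theta_i}\right) - \left(\sum \frac{k_{1i}}{\theta_i}\right)^2}.$$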

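The estimator $\hat{\delta}_E$ in Equation (4) is then obtained by replacing each $\theta_i$ in $\tilde{\delta}_E$ with its consistent estimator $\hat{\theta}_i$. The (asymptotic) variance of $\hat{\delta}_E$ is given by Equation (5), utilizing the fact that $\operatorname{Var}(\hat{\delta}_E) = \operatorname{Var}(\tilde{\delta}_E)$ in the large-sample sense.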
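As a numerical check on the derivation, the closed-form sums can be compared with the matrix form of the weighted least squares estimate. The Python sketch below is illustrative only: the function names and the toy pool data are assumptions, and the variances are treated as known, as in the derivation.

```python
import numpy as np


def wls_matrix(y, k, k1, sigma1_sq, sigma0_sq):
    """WLS estimate and covariance from the matrix expression."""
    y, k, k1 = (np.asarray(v, dtype=float) for v in (y, k, k1))
    r = k1 / k                                    # carrier proportion per pool
    theta = r * sigma1_sq + (1.0 - r) * sigma0_sq
    X = np.column_stack([np.ones_like(r), r])
    sigma_inv = np.diag(k / theta)                # inverse of Cov(Y)
    cov = np.linalg.inv(X.T @ sigma_inv @ X)
    beta = cov @ X.T @ sigma_inv @ y
    return beta, cov


def wls_closed_form(y, k, k1, sigma1_sq, sigma0_sq):
    """The same quantities written as the explicit sums in the derivation."""
    y, k, k1 = (np.asarray(v, dtype=float) for v in (y, k, k1))
    r = k1 / k
    theta = r * sigma1_sq + (1.0 - r) * sigma0_sq
    s_k = np.sum(k / theta)
    s_k1 = np.sum(k1 / theta)                     # equals sum of k_i r_i / theta_i
    s_k1r = np.sum(k1 * r / theta)                # equals sum of k_i r_i^2 / theta_i
    s_ky = np.sum(k * y / theta)
    s_k1y = np.sum(k1 * y / theta)
    det = s_k * s_k1r - s_k1 ** 2
    mu0 = (s_k1r * s_ky - s_k1 * s_k1y) / det
    delta = (s_k * s_k1y - s_k1 * s_ky) / det
    return mu0, delta, s_k / det


if __name__ == "__main__":
    # Hypothetical pooled measurements: four pools of sizes 1, 2, 3, and 2.
    y, k, k1 = [2.0, 1.4, 0.9, 2.2], [1, 2, 3, 2], [1, 1, 0, 2]
    print(wls_matrix(y, k, k1, 1.5, 1.0))
    print(wls_closed_form(y, k, k1, 1.5, 1.0))
```

The two functions should agree up to floating-point error for any configuration in which the pools do not all share the same carrier proportion.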

Equations (6) and (7)

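Note that the one-big-pool approach is a special case of the sample enrichment design with $k_i = N - n + 1$ for the selected subject and 1 for all others. For the unpooled subjects $i \in S_R$, we observe that $k_i = 1$, $k_{1i} = G_i = r_i$, and $\theta_i = G_i\sigma_1^2 + (1 - G_i)\sigma_0^2$. With the estimators $\hat{\sigma}_1^2$ and $\hat{\sigma}_0^2$, $\theta_i$ can be estimated by $\hat{\theta}_i = G_i\hat{\sigma}_1^2 + (1 - G_i)\hat{\sigma}_0^2$. As denoted earlier, there are $\tilde{n}_1$ subjects with G = 1 and $\tilde{n}_0$ subjects with G = 0 among the unpooled subjects in $S_R$, where $\tilde{n}_1 + \tilde{n}_0 = n - 1$; in the big pool, a total of $N_1 - \tilde{n}_1$ subjects have genotype G = 1 and $N_0 - \tilde{n}_0$ subjects have genotype G = 0. Hence, it follows that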

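$$\begin{aligned}
\sum_{i \in S_R} \frac{k_i}{\hat{\theta}_i} &= \frac{\tilde{n}_1}{\hat{\sigma}_1^2} + \frac{\tilde{n}_0}{\hat{\sigma}_0^2} + \frac{N - n + 1}{\hat{\theta}}, \\
\sum_{i \in S_R} \frac{k_{1i}}{\hat{\theta}_i} &= \frac{\tilde{n}_1}{\hat{\sigma}_1^2} + \frac{N_1 - \tilde{n}_1}{\hat{\theta}}, \\
\sum_{i \in S_R} \frac{k_i \tilde{Y}_i^R}{\hat{\theta}_i} &= \frac{\tilde{n}_1 \bar{Y}_1}{\hat{\sigma}_1^2} + \frac{\tilde{n}_0 \bar{Y}_0}{\hat{\sigma}_0^2} + \frac{(N - n + 1)\tilde{Y}}{\hat{\theta}}, \\
\sum_{i \in S_R} \frac{k_{1i} \tilde{Y}_i^R}{\hat{\theta}_i} &= \frac{\tilde{n}_1 \bar{Y}_1}{\hat{\sigma}_1^2} + \frac{(N_1 - \tilde{n}_1)\tilde{Y}}{\hat{\theta}},
\end{aligned}$$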

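where $\hat{\theta} = [(N_1 - \tilde{n}_1)\hat{\sigma}_1^2 + (N_0 - \tilde{n}_0)\hat{\sigma}_0^2]/(N - n + 1)$. Plugging them into (4) yields

$$\hat{\delta}_E = \frac{\frac{\tilde{n}_1}{\hat{\sigma}_1^2}\left(\frac{\tilde{n}_0}{\hat{\sigma}_0^2} + \frac{N_0 - \tilde{n}_0}{\hat{\theta}}\right)\bar{Y}_1 - \frac{\tilde{n}_0}{\hat{\sigma}_0^2}\left(\frac{\tilde{n}_1}{\hat{\sigma}_1^2} + \frac{N_1 - \tilde{n}_1}{\hat{\theta}}\right)\bar{Y}_0 + \frac{N - n + 1}{\hat{\theta}}\left(\frac{r\tilde{n}_0}{\hat{\sigma}_0^2} - \frac{(1 - r)\tilde{n}_1}{\hat{\sigma}_1^2}\right)\tilde{Y}}{\frac{\tilde{n}_0}{\hat{\sigma}_0^2}\left(\frac{\tilde{n}_1}{\hat{\sigma}_1^2} + \frac{r(N_1 - \tilde{n}_1)}{\hat{\theta}}\right) + \frac{\tilde{n}_1 (1 - r)^2 (N - n + 1)}{\hat{\theta}\hat{\sigma}_1^2}},$$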

where $r = (N_1 - \tilde{n}_1)/(N - n + 1)$. Analogously, Equation (5) becomes

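$$\sigma_E^2 = \frac{\frac{\tilde{n}_1}{\sigma_1^2} + \frac{\tilde{n}_0}{\sigma_0^2} + \frac{N - n + 1}{\theta}}{\frac{\tilde{n}_0}{\sigma_0^2}\left(\frac{\tilde{n}_1}{\sigma_1^2} + \frac{r(N_1 - \tilde{n}_1)}{\theta}\right) + \frac{\tilde{n}_1 (1 - r)^2 (N - n + 1)}{\theta\sigma_1^2}},$$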

where $\theta = r\sigma_1^2 + (1 - r)\sigma_0^2$. Hence, we obtain the expected results in (6) and (7).
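The reduction to the one-big-pool case can also be checked numerically by evaluating the general variance expression with one pool of size N − n + 1 and n − 1 unpooled subjects, and comparing it with the closed form. The Python sketch below uses hypothetical function names and arbitrary illustrative inputs.

```python
import numpy as np


def var_general(k, k1, sigma1_sq, sigma0_sq):
    """General variance of the WLS estimate of delta, written with sums."""
    k = np.asarray(k, dtype=float)
    k1 = np.asarray(k1, dtype=float)
    r = k1 / k
    theta = r * sigma1_sq + (1.0 - r) * sigma0_sq
    s_k = np.sum(k / theta)
    s_k1 = np.sum(k1 / theta)
    s_k1r = np.sum(k1 * r / theta)
    return s_k / (s_k * s_k1r - s_k1 ** 2)


def var_one_big_pool(N, N1, n, n1_tilde, sigma1_sq, sigma0_sq):
    """Closed form for the one-big-pool design.

    n1_tilde is the number of unpooled reference subjects with G = 1.
    """
    n0_tilde = (n - 1) - n1_tilde
    big = N - n + 1                               # size of the big pool
    r = (N1 - n1_tilde) / big                     # carrier proportion in the big pool
    theta = r * sigma1_sq + (1.0 - r) * sigma0_sq
    num = n1_tilde / sigma1_sq + n0_tilde / sigma0_sq + big / theta
    den = (n0_tilde / sigma0_sq) * (
        n1_tilde / sigma1_sq + r * (N1 - n1_tilde) / theta
    ) + n1_tilde * (1.0 - r) ** 2 * big / (theta * sigma1_sq)
    return num / den


if __name__ == "__main__":
    N, N1, n, n1_tilde = 100, 10, 20, 3           # arbitrary illustrative values
    k = [1] * (n - 1) + [N - n + 1]
    k1 = [1] * n1_tilde + [0] * (n - 1 - n1_tilde) + [N1 - n1_tilde]
    print(var_general(k, k1, 2.0, 1.0))
    print(var_one_big_pool(N, N1, n, n1_tilde, 2.0, 1.0))
```

Agreement between the two values for several choices of the inputs supports the algebra, although it is of course not a substitute for the derivation itself.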

Footnotes

CONFLICT OF INTEREST

The authors declare no potential conflict of interests.

REFERENCES

1. Schisterman EF, Vexler A, Ye A, Perkins NJ. A combined efficient design for biomarker data subject to a limit of detection due to measuring instrument sensitivity. Ann Appl Stat. 2011;5:2651–2667.
2. Dorfman D. The detection of defective members of large populations. Ann Math Statist. 1943;14:436–440.
3. Sobel M, Groll PA. Group testing to eliminate efficiently all defectives in a binomial sample. Bell System Tech J. 1959;38:1179–1252.
4. Gastwirth J, Johnson W. Screening with cost-effective quality control: potential applications to HIV and drug testing. J Am Stat Assoc. 1994;89:972–981.
5. Chen CL, Swallow WH. Using group testing to estimate a proportion, and to test the binomial model. Biometrics. 1990;46:1035–1046.
6. Hughes-Oliver JM, Swallow WH. A two-stage adaptive group-testing procedure for estimating small proportions. J Am Stat Assoc. 1994;89:982–993.
7. Tu XM, Litvak E, Pagano M. On the informativeness and accuracy of pooled testing in estimating prevalence of a rare disease: application to HIV screening. Biometrika. 1995;82:287–289.
8. Weinberg CR, Umbach DM. Using pooled exposure assessment to improve efficiency in case-control studies. Biometrics. 1999;55:718–726.
9. Faraggi D, Reiser B, Schisterman EF. ROC curve analysis for biomarkers based on pooled assessments. Statist Med. 2003;22:2515–2527.
10. Kendziorski CM, Zhang Y, Lan H, Attie D. The efficiency of pooling mRNA in microarray experiments. Biostatistics. 2003;4:465–477.
11. Liu A, Schisterman EF. Comparison of diagnostic accuracy of biomarkers with pooled assessments. Biom J. 2003;45:631–644.
12. Yuan M, Yang Y, Zheng G. Two-stage genome-wide association studies with DNA pooling and genetic model selection. Stat Sin. 2009;19:1769–1786.
13. Schisterman EF, Vexler A, Mumford SL, Perkins NJ. Hybrid pooled-unpooled design for cost-efficient measurement of biomarkers. Stat Sin. 2010;29:597–613.
14. Tebbs JM, Swallow WH. Estimating ordered binomial proportions with the use of group testing. Biometrika. 2003;90:471–477.
15. Thomas DC. Multistage sampling for latent variable models. Lifetime Data Anal. 2007;13:565–581.
16. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet. 2006;38(8):904–909.
