Summary
The software package TUNA (Testing UNtyped Alleles) implements a fast and efficient algorithm for testing association of genotyped and ungenotyped variants in genome-wide case-control studies. TUNA uses linkage disequilibrium (LD) information from existing comprehensive variation datasets such as HapMap to construct databases of frequency predictors, each a linear combination of haplotype frequencies of genotyped SNPs. The predictors are used to estimate untyped allele frequencies and to perform association tests. The methods incorporated in TUNA estimate these frequencies accurately, and the software is computationally efficient, with modest memory and CPU requirements.
INTRODUCTION
Genome-wide association studies are now well recognized as powerful tools for finding genetic risk variants for complex traits (e.g., WTCCC, 2007). However, even with the availability of newly developed high-throughput genotyping platforms, it is likely that the disease-causing variants are not directly genotyped. The ability to perform statistical tests on untyped variants based on genotyped data becomes crucial in finding and explaining association signals. There has recently been considerable effort on methods and software designed to tackle this problem (Scheet and Stephens, 2006; Nicolae, 2006b; Servin and Stephens, 2007; Marchini et al., 2007). We introduce here a software package, called TUNA, that provides a fast and accurate solution for inference on untyped variation. The package consists of two computational elements: (i) a predictor database building program that efficiently extracts LD information from a reference population panel data set such as HapMap, where a much larger set of genetic variants is studied; (ii) an analysis program that performs single-SNP statistical testing. The software is written in ANSI C++ and can be compiled and used on all modern operating systems. The programs require only a modest amount of computing resources and run fast enough for routine use in a genome-wide case-control study. In the following sections, we describe the methods and algorithms implemented in TUNA and investigate the performance of the package.
OVERVIEW OF THE METHOD
The method implemented in TUNA aims to estimate allele frequencies for all the markers in HapMap (or another reference database) based on genotype data from a subset of markers (e.g. the Illumina HumanHap300 BeadChip SNP set) in a group of subjects (e.g. the cases in a case-control sample). The frequency is estimated as a linear combination of observed haplotype frequencies. For example, Table 1 shows haplotype frequencies from HapMap CEU for one SNP that is absent from the Illumina HumanHap300 set (rs897623, denoted by T) and three Illumina SNPs (rs10909880, rs3748816 and rs12409348, denoted by A1, A2 and A3 respectively).
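The TUNA estimate of the frequency of allele 1 of T is given by $\hat{p}_T = \hat{h}_{000} + 0.048\,\hat{h}_{101}$, where $\hat{h}_{000}$ and $\hat{h}_{101}$ are the frequencies of the A1-A2-A3 haplotypes 0-0-0 and 1-0-1, estimated from the available genotypes.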
Given a set of genotyped markers (e.g. high quality Illumina HumanHap300 SNPs) and information on the population composition of the samples (e.g. Caucasian), TUNA first builds a database containing, for every SNP in HapMap, the following information:
(i) The amount of information available on the SNP in the genotyped markers. The information is quantified using MD (Nicolae, 2006a), a multi-locus measure of LD that is similar in interpretation to r². MD is defined as the asymptotic relative efficiency (ratio of sample sizes necessary to achieve the same power) of two allele frequency estimators: the direct estimator (as if the marker is genotyped) and the indirect estimator (via LD, based on the available genotypes as described above). Note that pairwise r² might give an incomplete picture of the information available (Nicolae et al., 2006).
(ii) The set of genotyped markers used in the frequency estimation (e.g., the markers A1, A2 and A3). This set is chosen by first finding the N (e.g. four) SNPs that give maximum MD within a pre-defined window (e.g. 400kb). We then remove SNPs that do not contribute significantly to the information on the SNP of interest by computing an adjusted MD, denoted M̄D, which depends on MD, N and n, where n is the number of haplotypes. M̄D penalizes large N, and its definition is inspired by the adjusted R² widely used in regression.
(iii) The weights for each possible haplotype (in Table 1, the haplotype 1-0-1 has weight 0.017/0.350=0.048).
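For reference, the adjusted R² from regression, which the text names as the inspiration for the adjusted MD, is given on the first line below. The second line is only an assumed analogue (writing MD as $M_D$ and the adjusted measure as $\bar{M}_D$), obtained by replacing the number of observations with the number of haplotypes n and the number of regressors with the number of selected SNPs N. The exact expression used in TUNA may differ.

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{m - 1}{m - p - 1}
\qquad \text{($m$ observations, $p$ regressors)}

% Assumed analogue, not confirmed by the text:
\bar{M}_D = 1 - (1 - M_D)\,\frac{n - 1}{n - N - 1}
```

Under a form of this kind, adding a SNP to the predictor set raises the adjusted measure only if the gain in MD outweighs the growth of the factor (n - 1)/(n - N - 1), which is the sense in which large N is penalized.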
Table 1.
| Haplotype | A1 – A2 – A3 – T | Frequency |
|---|---|---|
| H1 | 0 – 0 – 0 – 1 | 0.200 |
| H2 | 0 – 0 – 1 – 0 | 0.117 |
| H3 | 0 – 1 – 0 – 0 | 0.292 |
| H4 | 1 – 0 – 0 – 0 | 0.008 |
| H5 | 1 – 0 – 1 – 1 | 0.017 |
| H6 | 1 – 0 – 1 – 0 | 0.333 |
| H7 | 1 – 1 – 0 – 0 | 0.033 |
The resulting database depends on the set of genotyped markers that will be used and on the population used in inference.
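The haplotype weights and the resulting linear predictor can be reproduced from Table 1 with a few lines of Python. This is a minimal sketch, not TUNA's C++ implementation: the function names `predictor_weights` and `predict_frequency` are hypothetical, and the typed-haplotype frequencies passed to the predictor are assumed to have been estimated already.

```python
from collections import defaultdict

# Table 1: (A1, A2, A3, T) -> haplotype frequency in HapMap CEU
TABLE_1 = {
    (0, 0, 0, 1): 0.200,
    (0, 0, 1, 0): 0.117,
    (0, 1, 0, 0): 0.292,
    (1, 0, 0, 0): 0.008,
    (1, 0, 1, 1): 0.017,
    (1, 0, 1, 0): 0.333,
    (1, 1, 0, 0): 0.033,
}


def predictor_weights(reference):
    """For each typed haplotype, the share of its reference frequency
    that carries allele 1 at the untyped SNP T (last position)."""
    total = defaultdict(float)
    with_allele_1 = defaultdict(float)
    for hap, freq in reference.items():
        typed, t = hap[:-1], hap[-1]
        total[typed] += freq
        if t == 1:
            with_allele_1[typed] += freq
    return {typed: with_allele_1[typed] / total[typed] for typed in total}


def predict_frequency(weights, typed_hap_freqs):
    """Linear combination of typed-haplotype frequencies."""
    return sum(weights.get(h, 0.0) * f for h, f in typed_hap_freqs.items())


weights = predictor_weights(TABLE_1)
# weights[(0, 0, 0)] is 1.0; weights[(1, 0, 1)] is 0.017/0.350,
# about 0.0486 (quoted as 0.048 in the text); all other weights are 0.
```

Applying `predict_frequency` to the typed-haplotype frequencies of a study sample gives the estimated frequency of allele 1 at T in that sample.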
The case-control statistical tests implemented in TUNA aim to find differences in the allele frequencies in the two groups. Note that the linear haplotype predictor obtained in the database construction is used not for frequency estimation, but for defining the null hypothesis to be tested. For example, for the rs897623 case described above, we test the hypothesis that the frequencies $h_{000} + 0.048\,h_{101}$ are equal in cases and controls. This can be done using tools developed for haplotype-based genetic association studies (Nicolae, 2006b; Lin and Zeng, 2006). Our first implementation used a likelihood ratio test where the MLEs were calculated using an ECM algorithm (Meng and Rubin, 1993; Kim and Taylor, 1995). Because the ECM algorithm occasionally converged to local maxima, we adopted a different strategy for assessing significance. We use the squared difference of the estimated allele frequencies (in cases and controls) as the test statistic and estimate its variance using two methods: a direct estimation based on the asymptotic interpretation of MD, and a resampling-based evaluation. After standardization by the estimated variance, both test statistics are expected to have an asymptotic χ² distribution with one degree of freedom under the null. A permutation-based assessment of significance is also implemented in the software.
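The two ways of assessing significance can be illustrated with a simplified Python sketch. It rests on assumptions that do not hold in TUNA itself: haplotypes over the typed SNPs are taken as already phased (TUNA estimates haplotype frequencies from unphased genotypes), the variance of the indirect estimator is approximated as the binomial variance divided by MD (following the relative-efficiency interpretation of MD), and the reference distribution is taken to be χ² with one degree of freedom. All function names are hypothetical.

```python
import math
import random


def predictor_frequency(individuals, weights):
    """Average predictor weight over all haplotypes carried by a group.

    `individuals` is a list of (haplotype, haplotype) pairs, each
    haplotype a tuple of alleles at the typed SNPs (assumed phased).
    """
    haps = [h for pair in individuals for h in pair]
    return sum(weights.get(h, 0.0) for h in haps) / len(haps)


def squared_difference(cases, controls, weights):
    diff = (predictor_frequency(cases, weights)
            - predictor_frequency(controls, weights))
    return diff * diff


def permutation_pvalue(cases, controls, weights, n_perm=1000, seed=0):
    """Permute case/control labels across individuals and compare the
    squared frequency difference with its observed value."""
    rng = random.Random(seed)
    observed = squared_difference(cases, controls, weights)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = squared_difference(pooled[:n_cases], pooled[n_cases:], weights)
        if stat >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def md_scaled_test(p_case, p_control, n_case_haps, n_control_haps, md):
    """Squared difference divided by an MD-scaled variance (assumption:
    variance of the indirect estimator = p(1 - p) / (md * n))."""
    pooled = ((p_case * n_case_haps + p_control * n_control_haps)
              / (n_case_haps + n_control_haps))
    variance = (pooled * (1.0 - pooled)
                * (1.0 / n_case_haps + 1.0 / n_control_haps) / md)
    stat = (p_case - p_control) ** 2 / variance
    # Upper tail of a chi-square distribution with one degree of freedom
    return stat, math.erfc(math.sqrt(stat / 2.0))
```

Permuting whole individuals rather than single haplotypes keeps the two chromosomes of each subject together, which matches how case-control labels are assigned.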
APPLICATION
For prediction methods, the most important issues to consider are the validity of the model assumptions and the accuracy of the imputation. Most existing software is computationally expensive for model evaluation, and we implement in TUNA a simple and fast algorithm that addresses this. For each genotyped SNP we estimate two allele frequencies: a direct estimate from the observed genotypes and an indirect estimate using the methods described above. Comparing the two sets of results provides a direct assessment of the accuracy of the procedure. This feature is especially important when the reference database is not well matched with the studied population (e.g. using the Caucasian HapMap to construct predictors for non-Caucasian data).
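The comparison of direct and indirect estimates can be summarized numerically. The sketch below is one hedged way to do so, using the mean absolute difference and the Pearson correlation; the choice of these two summaries and the function name `compare_estimates` are assumptions, not part of TUNA.

```python
import math


def compare_estimates(direct, indirect):
    """Return (mean absolute difference, Pearson correlation) between
    direct and indirect allele frequency estimates for the same SNPs."""
    n = len(direct)
    mad = sum(abs(d - i) for d, i in zip(direct, indirect)) / n
    mean_d = sum(direct) / n
    mean_i = sum(indirect) / n
    cov = sum((d - mean_d) * (i - mean_i) for d, i in zip(direct, indirect))
    var_d = sum((d - mean_d) ** 2 for d in direct)
    var_i = sum((i - mean_i) ** 2 for i in indirect)
    return mad, cov / math.sqrt(var_d * var_i)


# Made-up numbers for illustration only, not TUNA output
mad, r = compare_estimates([0.10, 0.25, 0.40], [0.12, 0.24, 0.43])
```

A large mean absolute difference or a low correlation across genotyped SNPs would suggest that the reference panel is poorly matched to the study sample.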
We applied TUNA to two datasets from the Illumina iControl Database. We downloaded data on 248 Caucasian samples and 75 Latino samples that were genotyped using the HumanHap300 platform. For both samples, the HapMap CEU population is used as the LD reference panel. Out of the 2.557M polymorphic SNPs in HapMap, 2.114M are well-tagged (MD ≥ 0.7) by SNPs in this Illumina set, and 177K are well-tagged only when multi-marker predictors are used (MD ≥ 0.7 and max r² ≤ 0.5). Figure 1 shows the comparison of the direct and indirect allele frequency estimators for all SNPs on chromosome 22 and for the well-tagged (MD ≥ 0.7) subset of SNPs. Note that the method is more accurate on the Caucasian controls because they match better with the reference panel, and is more accurate when more information (as measured by MD) is available. All computations (database construction and frequency estimation) took only 300 seconds of CPU time and 10 MB of memory on a 2.6 GHz AMD Opteron Linux system. In general, one can analyze a full genome scan on a regular workstation in a few hours.
Fig. 1.

The performance of the TUNA allele frequency estimation is shown for Caucasian and Latino samples from the Illumina iControl Database. In all plots, the frequency estimated from the observed genotypes is shown on the x-axis, and the frequency estimated using TUNA is shown on the y-axis. Well-tagged SNPs are those for which MD ≥ 0.7.
DISCUSSION
We introduce in this note a novel software package, TUNA, that provides a powerful computational tool for inference on ungenotyped variants in genome-wide association studies. Studying hidden information has many advantages including: (i) an increased power to detect genetic associations, because we can effectively turn a scan based on a few hundred thousand markers into something that more closely approximates a scan of all the SNPs in the HapMap; (ii) a clear interpretation of the detected associations, because every statistical test that is performed corresponds to one (typed or untyped) marker; and (iii) a simple way to integrate data from different platforms, because each marker in HapMap can be assigned a p-value for association.
There is currently a lot of research on methods for association testing at untyped variation, and many statistical and computational issues are still under investigation. For example, it is clear that if we have no prior information on risk variation, a larger number of markers leads to an increase in power even after adjusting for multiple comparisons. If we test untyped alleles only for the markers where the prediction/imputation is perfect, the same conclusion applies: there is an increase in power. Inaccurate prediction/imputation is equivalent to a reduction in sample size for a genotyped marker (Nicolae, 2006a), and more research is needed on the prediction-accuracy thresholds that must be imposed to guarantee an increase in power. Another important issue is the choice of the reference panel used in constructing the prediction database. Our procedure is designed to guard against an increase in false positives when the reference panel comes from a different population than the samples under investigation. We use the predictors to define null hypotheses that are tested in the sample, and an incorrect prediction leads to testing an “uninteresting” null hypothesis. The price for this robustness is a possible decrease in power, because testing “uninteresting” hypotheses adds to the multiple comparison adjustment. We believe that this robustness and the computational efficiency are the benefits of this algorithm over methods in which individual genotypes are imputed.
We would like to end by noting that the software and methods described in this paper can also be applied to indirect association testing of other genomic variants such as insertions/deletions and copy number polymorphisms. The only requirement is the availability of reference databases that contain the typed SNPs and genotypes on the genomic variants that can be used to construct the haplotype predictors.
Acknowledgments
The research was supported in part by NIH grants HL084715, DK62429, DK077489 and HL087665.
References
- Kim DK, Taylor JMG. The Restricted EM Algorithm for Maximum Likelihood Estimation Under Linear Restrictions on the Parameters. Journal of the American Statistical Association. 1995;90(430):708–716.
- Lin DY, Zeng D. Likelihood-based inference on haplotype effects in genetic association studies (with discussion). Journal of the American Statistical Association. 2006;101(473):89–118.
- Marchini J, Howie B, Myers S, McVean G, Donnelly P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genetics. 2007;39(7):906–913. doi: 10.1038/ng2088.
- Meng XL, Rubin DB. Maximum Likelihood Estimation Via the ECM Algorithm: A General Framework. Biometrika. 1993;80:267–278.
- Nicolae DL. Quantifying the amount of missing information in genetic association studies. Genet Epidemiol. 2006a;30(8):703–717. doi: 10.1002/gepi.20181.
- Nicolae DL. Testing untyped alleles (TUNA)-applications to genome-wide association studies. Genet Epidemiol. 2006b;30(8):718–727. doi: 10.1002/gepi.20182.
- Nicolae DL, Wen X, Voight BF, Cox NJ. Coverage and characteristics of the Affymetrix GeneChip Human Mapping 100K SNP set. PLoS Genetics. 2006;2(5):e67. doi: 10.1371/journal.pgen.0020067.
- Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006;78(4):629–644. doi: 10.1086/502802.
- Servin B, Stephens M. Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genetics. 2007;3(7):e114. doi: 10.1371/journal.pgen.0030114.
- Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447(7145):661–678. doi: 10.1038/nature05911.
