Abstract
Imputation, the process of inferring genotypes for untyped variants, is used to identify and refine genetic association findings. Inaccuracies in imputed data can distort the observed association between variants and a disease. Many statistics are used to assess accuracy; some compare imputed to genotyped data and others are calculated without reference to true genotypes. Prior work has shown that the Imputation Quality Score (IQS), which is based on Cohen’s kappa statistic and compares imputed genotype probabilities to true genotypes, appropriately adjusts for chance agreement; however, it is not commonly used. To identify differences in accuracy assessment, we compared IQS with concordance rate, squared correlation, and accuracy measures built into imputation programs. Genotypes from the 1000 Genomes reference populations (AFR N = 246 and EUR N = 379) were masked to match the typed single nucleotide polymorphism (SNP) coverage of several SNP arrays and were imputed with BEAGLE 3.3.2 and IMPUTE2 in regions associated with smoking behaviors. Additional masking and imputation were conducted for sequenced subjects from the Collaborative Genetic Study of Nicotine Dependence and the Genetic Study of Nicotine Dependence in African Americans (N = 1,481 African Americans and N = 1,480 European Americans). Our results offer further evidence that concordance rate inflates accuracy estimates, particularly for rare and low frequency variants. For common variants, squared correlation, BEAGLE R2, IMPUTE2 INFO, and IQS produce similar assessments of imputation accuracy. However, for rare and low frequency variants, the other statistics tend to be more liberal than IQS in their assessment of accuracy. IQS is important to consider when evaluating imputation accuracy, particularly for rare and low frequency variants.
Introduction
In genomic analyses, high-quality data are crucial to accurate statistical inferences. Data accuracy can typically be assessed by different methods and measures.
Genetic imputation provides an informative scenario for examining how the use of different accuracy measures can influence the assessment of accuracy. Genotype imputation is a valuable tool in association studies and meta-analyses. This process infers “in silico” genotypes for untyped variants in a study sample by matching genotyped variants in the study to corresponding haplotypes in a comprehensively genotyped reference panel [1–8]. Therefore, imputation accuracy is influenced by haplotype frequencies in the reference panel [9–10] and the typed single nucleotide polymorphism (SNP) coverage of the study sample [11–12]. Once untyped variants are inferred, statistics that measure imputation accuracy are calculated to identify poorly imputed SNPs.
Imputation accuracy statistics can be classified into two types: (1) statistics that compare imputed to genotyped data and (2) statistics produced without reference to true genotypes. Concordance rate, squared correlation, and Imputation Quality Score (IQS) [13] are examples of the first type. Because imputed SNPs usually do not have genotyped data for comparison, statistics of the second type are usually provided by imputation programs and are commonly relied upon in practice. However, a direct comparison of imputed and genotyped data can be made possible by masking a percentage of variants that were genotyped in the study sample [9, 14–15].
Lin et al. (2010) introduced IQS, which is based on Cohen’s kappa statistic for agreement [13]. Because it does not account for chance agreement, concordance rate (i.e., the proportion of agreement) can lead to incorrect assessments of accuracy for rare and low frequency variants, whereas IQS adjusts for chance agreement [13]. Furthermore, Lin et al. (2010) used simulated data to show that requiring an IQS threshold > 0.9 removed all false positive association signals, while concordance rate > 0.99 still resulted in many false positives. Despite this evidence, IQS is not widely used in accuracy assessment.
This work builds upon previous studies by comparing IQS with commonly used accuracy measures—concordance rate, squared correlation, and built-in accuracy statistics—with the goal of identifying situations in which the choice of accuracy measure leads to differing assessments of accuracy. We compared imputed and genotyped data via masking, and used African-ancestry and European-ancestry populations to evaluate imputation accuracy in genomic regions associated with nicotine dependence and smoking behavior, some of which have also been implicated in lung cancer and chronic obstructive pulmonary disease (COPD).
Methods
We examined differences and similarities in accuracy assessment as measured by IQS, squared correlation, concordance rate and built-in accuracy statistics using: (1) 1000 Genomes as the sample and the reference, and (2) data from nicotine dependence studies as the sample and 1000 Genomes as the reference. Below we describe both approaches, beginning with analyses involving 1000 Genomes as the sample and the reference.
Masking and Imputation using 1000 Genomes Data
Because IQS adjusts for chance agreement [13], we used IQS as a benchmark for accuracy estimation. Calculating IQS, concordance rate, and squared correlation requires genotyped data for comparison with imputed data. We created a study sample for imputation by masking genotypes in the reference panel to mimic the typed SNP coverage of commercially available SNP arrays (Affymetrix: Affy 500 and Affy 6; Illumina: Duo, Omni, and Quad), with array SNPs matched by genomic position using Build 37.3/hg19. We used 1000 Genomes African (AFR) and European (EUR) continental reference panels with 246 and 379 individuals, respectively (S1 Table) [16]. All data analyzed here are de-identified, publicly available data from the 1000 Genomes (1000G) Project, which provides these data as a resource for the scientific community. Participants provided informed consent to the 1000G Project for broad use and broad data release in databases [16–17]. We also have Washington University Human Research Protection Office approval for analyses of de-identified data.
The process of creating the study sample is described in Fig 1 and the numbers of typed variants are presented in S2 Table. Fig 1 illustrates several key characteristics of our masking approach. The reference panel individuals were the same as the study sample individuals. Our approach is expected to give an upper bound on accuracy because of the ideal match between the reference panel and study sample; the “correct” haplotype for each individual being imputed is present in the reference. Using population-specific reference panels (AFR and EUR) rather than a cosmopolitan reference panel maximizes the matching between the reference panel and study sample. Also, this design allowed us to compare accuracy estimates for variants not found on a SNP array. This sample data set was then imputed and the results were used to calculate accuracy statistics.
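The masking step can be sketched in a few lines of Python. This is a minimal illustration only, not the pipeline used in the study: the function name `mask_to_array`, the dictionary layout, and the toy positions are all hypothetical. It splits a fully genotyped panel into variants kept as "typed" input (positions assumed to be on the array) and variants that are hidden, imputed, and later compared with their true genotypes.

```python
def mask_to_array(panel, array_positions):
    """Split a fully genotyped panel into typed and masked variants.

    panel: dict mapping genomic position -> list of genotypes (0/1/2),
           one entry per individual (hypothetical layout).
    array_positions: set of positions assumed to be present on the SNP array.
    """
    typed = {pos: g for pos, g in panel.items() if pos in array_positions}
    masked = {pos: g for pos, g in panel.items() if pos not in array_positions}
    return typed, masked


# Toy panel with three variants and four individuals (made-up values).
panel = {
    101: [0, 1, 2, 0],
    205: [0, 0, 1, 0],
    330: [2, 1, 1, 0],
}

# Position 999 is on the hypothetical array but absent from the panel.
typed, masked = mask_to_array(panel, array_positions={101, 330, 999})

print(sorted(typed))   # positions passed to the imputation program
print(sorted(masked))  # positions imputed and then compared with the truth
```

Matching by position rather than by SNP identifier mirrors the description above; array positions that are absent from the panel are simply ignored in this sketch.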
Imputation Programs
BEAGLE (version 3.3.2) [2, 8] and IMPUTE2 [1, 4–5] were used to obtain imputed genotype probabilities. We obtained the BEAGLE R2 and IMPUTE2 INFO accuracy measures for each SNP; neither of these makes use of true genotypes. The BEAGLE R2 and IMPUTE2 INFO accuracy measures are well established [3, 15]. BEAGLE R2 approximates the squared correlation between the most likely genotype and the true unobserved allele dosage [2, 8]. IMPUTE2 INFO considers allele frequency as well as the observed and expected allele dosage [15]. We include their formulas for completeness in Eqs 1 and 2. Here, for sample n at a particular SNP, $g_n$ represents the observed dosage and $e_n$ represents the expected allele dosage; $\hat{\theta}$ represents the sample allele frequency, with $0 < \hat{\theta} < 1$; and n ranges from 1 to N, the total number of individuals. Additionally, $z_n$ represents the genotype with the highest posterior probability from imputation, i.e. 0, 1, or 2 corresponding to the number of copies of the coded allele. Finally, $f_n = p_{n1} + 4p_{n2}$, where $p_{nk}$ represents the imputed probability of genotypic class k (0, 1, or 2) for the nth sample.
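$$\text{BEAGLE } R^2 = \frac{\left[\sum_{n=1}^{N} z_n e_n - \frac{1}{N}\sum_{n=1}^{N} z_n \sum_{n=1}^{N} e_n\right]^2}{\left[\sum_{n=1}^{N} f_n - \frac{1}{N}\left(\sum_{n=1}^{N} e_n\right)^2\right]\left[\sum_{n=1}^{N} z_n^2 - \frac{1}{N}\left(\sum_{n=1}^{N} z_n\right)^2\right]} \tag{1}$$

$$\text{IMPUTE2 INFO} = 1 - \frac{\sum_{n=1}^{N}\left(f_n - e_n^2\right)}{2N\hat{\theta}\left(1 - \hat{\theta}\right)} \tag{2}$$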
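As a rough illustration of how these two built-in statistics behave, the sketch below computes them from per-sample genotype probabilities. It assumes the commonly published forms of the BEAGLE allelic R2 and the IMPUTE2 INFO score, and it assumes that the sample allele frequency is estimated as the mean expected dosage divided by two. The function names are hypothetical, and the programs' own implementations may differ in detail.

```python
def beagle_r2(probs):
    """Approximate BEAGLE allelic R2 for one SNP (assumed published form).

    probs: list of (p0, p1, p2) imputed genotype probabilities, one per sample.
    """
    n = len(probs)
    z = [max(range(3), key=lambda k: p[k]) for p in probs]  # best-guess genotype
    e = [p[1] + 2 * p[2] for p in probs]                    # expected dosage
    f = [p[1] + 4 * p[2] for p in probs]                    # expected squared dosage
    cov = sum(zi * ei for zi, ei in zip(z, e)) - sum(z) * sum(e) / n
    var_dosage = sum(f) - sum(e) ** 2 / n
    var_best = sum(zi * zi for zi in z) - sum(z) ** 2 / n
    if var_dosage <= 0 or var_best <= 0:
        return float("nan")  # undefined when either term has no variance
    return cov ** 2 / (var_dosage * var_best)


def impute2_info(probs):
    """Approximate IMPUTE2 INFO for one SNP (assumed published form)."""
    n = len(probs)
    e = [p[1] + 2 * p[2] for p in probs]
    f = [p[1] + 4 * p[2] for p in probs]
    theta = sum(e) / (2 * n)  # assumption: allele frequency from mean dosage
    if theta <= 0 or theta >= 1:
        return 1.0            # convention assumed for monomorphic calls
    return 1 - sum(fi - ei ** 2 for fi, ei in zip(f, e)) / (2 * n * theta * (1 - theta))


# Fully certain calls versus calls that carry no information beyond frequency.
certain = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 0, 0)]
flat = [(0.25, 0.5, 0.25)] * 4
print(beagle_r2(certain), impute2_info(certain))
print(impute2_info(flat))
```

Neither function takes true genotypes as input, which is why these statistics can be reported for every imputed SNP but cannot, on their own, detect agreement that arises by chance.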
Imputed probabilities produced by BEAGLE and the corresponding accuracy statistics showed variability, so we focus on these results. Analyses using IMPUTE2 were less informative in this matched sample-reference setting; this program appears to identify the matching individual in the reference and assign imputed data accordingly. The result was highly accurate imputation in this special context. Since we aim to compare concordance rate, squared correlation, and IQS in efforts to identify scenarios where these statistics produce similar or divergent conclusions regarding accuracy estimation, the variation produced by using BEAGLE for imputation allows us to address our question of interest.
Statistics that Compare Genotyped and Imputed Data
The imputed genotype probabilities produced by BEAGLE and IMPUTE2 were used to calculate concordance rate, squared correlation and IQS. These imputed genotype probabilities, one for each genotype class (e.g. AA, AB, or BB), are transformed to a dosage value by weighting the three class probabilities by 0, 1, and 2, respectively, and summing. IQS is calculated from genotype probabilities while squared correlation uses dosage values. Note that a specific dosage value can correspond to multiple sets of genotypic probabilities, but only one dosage value can result from a specific set of genotypic probabilities. Although the most likely (best guess) genotype for each variant can be used to calculate these statistics, it is not recommended because the discrete classification of each individual’s genotype does not consider the probabilistic nature of imputation [18].
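The relationship between genotype probabilities, dosage, and the best-guess genotype can be illustrated with a short sketch. The function names are hypothetical, and the coding assumes that class AA carries 0 copies of the coded allele, AB carries 1, and BB carries 2.

```python
def dosage(p_aa, p_ab, p_bb):
    """Expected number of copies of the coded allele (0 to 2)."""
    return 0 * p_aa + 1 * p_ab + 2 * p_bb


def best_guess(p_aa, p_ab, p_bb):
    """Genotype class (0, 1, or 2) with the highest imputed probability."""
    probs = (p_aa, p_ab, p_bb)
    return max(range(3), key=lambda k: probs[k])


# Three different probability sets that all collapse to the same dosage.
examples = [(0.0, 1.0, 0.0), (0.5, 0.0, 0.5), (0.25, 0.5, 0.25)]
for p in examples:
    print(p, dosage(*p), best_guess(*p))
```

All three probability sets give a dosage of 1, although the second assigns no probability at all to the heterozygous class. This is the many-to-one mapping noted above, and it is one reason a statistic based on probabilities (IQS) and a statistic based on dosages (squared correlation) can disagree.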
The incorporation of the genotypic classes into the IQS calculation is represented in Table 1, where each cell is the sum of the genotype probabilities for each genotyped and imputed genotypic class combination. The IQS calculation is demonstrated in Eq 3. IQS considers both the observed proportion of agreement (concordance rate or Po, shown in Eq 4) and chance agreement (Pc in Eq 5). Concordance rate (Po) is the sum of probabilities for each matching genotypic class divided by the total sum of all genotype probabilities. Chance agreement is evaluated as the sum of the products of the marginal frequencies. An IQS of one indicates that the data matched perfectly, while a negative IQS indicates that the SNP was imputed worse than expected by chance [13]. Mathematically, the value of IQS will always be less than or equal to the value of concordance rate whenever Pc < 1: because Po ≤ 1, Po·Pc ≤ Pc, so Po − Pc ≤ Po − Po·Pc = Po(1 − Pc); dividing both sides by 1 − Pc > 0 gives (Po − Pc)/(1 − Pc) ≤ Po, that is, IQS ≤ Po. Some statistics can be confounded with Hardy-Weinberg equilibrium (HWE) if they assume HWE to calculate "expected" genotype counts [19]. IQS avoids this concern since it uses imputed and experimentally determined genotypes.
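Table 1. Calculating concordance (Po) and IQS from imputed genotype probabilities and actual genotypes.

| | | Actual | | | |
|---|---|---|---|---|---|
| | | AA | AB | BB | Total |
| Imputed | AA | $n_{11}$ | $n_{12}$ | $n_{13}$ | $n_{1\cdot}$ |
| | AB | $n_{21}$ | $n_{22}$ | $n_{23}$ | $n_{2\cdot}$ |
| | BB | $n_{31}$ | $n_{32}$ | $n_{33}$ | $n_{3\cdot}$ |
| | Total | $n_{\cdot 1}$ | $n_{\cdot 2}$ | $n_{\cdot 3}$ | N |

Each cell $n_{ij}$ is the sum of the imputed probabilities for genotypic class i over individuals whose actual genotype is class j; $n_{i\cdot}$ and $n_{\cdot j}$ are the row and column totals.

$$\text{IQS} = \frac{P_o - P_c}{1 - P_c} \tag{3}$$

$$P_o = \frac{n_{11} + n_{22} + n_{33}}{N} \tag{4}$$

$$P_c = \frac{n_{1\cdot}\,n_{\cdot 1} + n_{2\cdot}\,n_{\cdot 2} + n_{3\cdot}\,n_{\cdot 3}}{N^2} \tag{5}$$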
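The calculation can be made concrete with a small sketch. The code below builds the 3 × 3 table of summed genotype probabilities and evaluates concordance (Po), chance agreement (Pc), and IQS = (Po − Pc)/(1 − Pc), following the verbal definitions given above. The function names and the toy data are hypothetical and are not taken from the study.

```python
def agreement_table(probs, truth):
    """3 x 3 table: rows are imputed classes, columns are actual classes.

    probs: list of (p0, p1, p2) imputed genotype probabilities per sample.
    truth: list of actual genotypes coded 0, 1, or 2.
    """
    table = [[0.0] * 3 for _ in range(3)]
    for p, g in zip(probs, truth):
        for k in range(3):
            table[k][g] += p[k]
    return table


def concordance_and_iqs(probs, truth):
    """Return (Po, Pc, IQS) for one SNP."""
    table = agreement_table(probs, truth)
    n = sum(sum(row) for row in table)
    po = sum(table[k][k] for k in range(3)) / n
    row_tot = [sum(table[k]) for k in range(3)]
    col_tot = [sum(table[i][k] for i in range(3)) for k in range(3)]
    pc = sum(r * c for r, c in zip(row_tot, col_tot)) / n ** 2
    iqs = (po - pc) / (1 - pc) if pc < 1 else float("nan")
    return po, pc, iqs


# Toy rare variant: 1 heterozygote among 100 people, and an imputation
# that calls everyone homozygous for the major allele with certainty.
truth = [1] + [0] * 99
probs = [(1.0, 0.0, 0.0)] * 100
po, pc, iqs = concordance_and_iqs(probs, truth)
print(po, pc, iqs)
```

In this toy case the single carrier is missed entirely, yet concordance is 0.99 because 99 of 100 genotypes match. Chance agreement is also 0.99, so IQS is 0: the imputation did no better than always guessing the common genotype. This is the mechanism by which concordance rate overstates accuracy for rare variants.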
Squared correlation is the square of the Pearson correlation coefficient between the imputed and genotyped dosage for each SNP. It is calculated using Eqs 6–11, where $x_n$ and $y_n$ are the imputed and genotyped dosage values for the nth sample, respectively. It represents the proportion of the variability in the imputed data that can be explained by the least-squares regression model.
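$$r^2 = \frac{S_{xy}^2}{S_{xx}\,S_{yy}} \tag{6}$$

$$S_{xy} = \sum_{n=1}^{N}\left(x_n - \bar{x}\right)\left(y_n - \bar{y}\right) \tag{7}$$

$$S_{xx} = \sum_{n=1}^{N}\left(x_n - \bar{x}\right)^2 \tag{8}$$

$$S_{yy} = \sum_{n=1}^{N}\left(y_n - \bar{y}\right)^2 \tag{9}$$

$$\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n \tag{10}$$

$$\bar{y} = \frac{1}{N}\sum_{n=1}^{N} y_n \tag{11}$$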
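A minimal sketch of the squared Pearson correlation between imputed and genotyped dosages is shown below. The function name is hypothetical, and returning NaN when either variable has no variance is a convention assumed here rather than one stated in the text.

```python
def squared_correlation(x, y):
    """Squared Pearson correlation between imputed (x) and true (y) dosages."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        return float("nan")  # correlation is undefined without variance
    return s_xy ** 2 / (s_xx * s_yy)


true_dosage = [0, 1, 2, 1]
print(squared_correlation([0.0, 1.0, 2.0, 1.0], true_dosage))  # identical dosages
print(squared_correlation([0.0, 0.5, 1.0, 0.5], true_dosage))  # halved dosages
print(squared_correlation([0.0, 0.0, 0.0, 0.0], true_dosage))  # constant calls
```

Because correlation is invariant to linear rescaling, the halved dosages score as highly as the identical ones, even though every non-zero dosage is underestimated. The constant calls have no variance, so squared correlation is undefined for them.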
Evaluating Accuracy across MAF and LD
Imputation accuracy is influenced by a variant’s minor allele frequency (MAF) and linkage disequilibrium (LD) with genotyped variants (measured by pairwise squared correlation r2). We examined imputation accuracy in relation to these properties. The MAFs used here were based on the allele frequencies found in the genotyped data. We will use the terminology “rare” to denote variants with MAF ≤ 1%; and “low frequency” to refer to variants with 1% < MAF ≤ 5%. For each imputed SNP, the genotyped SNP in the region with the highest LD was used to define the maximum r2 LD with a genotyped SNP (denoted by max r2 LD). PLINK was used to generate the LD values [20]. Bins for maximum r2 LD and MAF were defined in 0.01 increments [13]. For each bin, the mean and one standard deviation of the values produced by each accuracy statistic were calculated.
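The binning step can be sketched as follows. This is an illustration under stated assumptions, not the analysis code: the function names are hypothetical, bins are taken to be closed on the left (for example, 0.01 ≤ MAF < 0.02), a value at the upper limit is placed in the last bin, and a bin with a single variant is given a standard deviation of 0.

```python
from collections import defaultdict
from statistics import mean, stdev


def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as 0, 1, or 2 copies of the coded allele."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)


def summarize_by_bin(values, scores, width=0.01, upper=0.5):
    """Mean, standard deviation, and count of an accuracy statistic per bin.

    values: per-SNP MAF (or max r2 LD, with upper=1.0).
    scores: per-SNP accuracy statistic, such as IQS.
    """
    n_bins = round(upper / width)
    bins = defaultdict(list)
    for v, s in zip(values, scores):
        # The small offset guards against floating-point error at bin edges.
        idx = min(int(v / width + 1e-9), n_bins - 1)
        bins[idx].append(s)
    summary = {}
    for idx in sorted(bins):
        vals = bins[idx]
        sd = stdev(vals) if len(vals) > 1 else 0.0
        summary[idx] = (mean(vals), sd, len(vals))
    return summary


# Made-up MAF values and accuracy scores for four SNPs.
mafs = [0.005, 0.007, 0.013, 0.5]
scores = [0.2, 0.4, 0.9, 1.0]
for idx, (m, sd, count) in summarize_by_bin(mafs, scores).items():
    print(f"bin {idx}: mean={m:.3f} sd={sd:.3f} n={count}")
```

The same function can be reused for max r2 LD by passing `upper=1.0`. Reporting the count alongside the mean and standard deviation helps in reading sparsely populated bins.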
Examining Regions Associated with Nicotine Dependence
We examined the imputation accuracy of two genomic regions known to be associated with nicotine dependence and smoking behavior. These regions were the nicotinic receptor subunit gene clusters on chromosome 15 (CHRNA5-CHRNA3-CHRNB4) and chromosome 8 (CHRNB3-CHRNA6) [21–26]. These signals were identified through genome-wide association studies (GWAS) and meta-analyses for smoking behavior, with the chromosome 15 region being the most significantly associated. We imputed 3 Mb on each chromosome: a 2 Mb region used for analysis plus two 500 kb flanking buffer regions, according to Build 37.3/hg19. We focused our analyses on polymorphic variants with dbSNP identifiers in each 2 Mb region.
Masking and Imputation in a Real Data Application using a Nicotine Dependence Sample
A comparison of accuracy statistics was also conducted using nicotine dependence data as the study samples (N = 1,481 African Americans and N = 1,480 European Americans who were sequenced) and 1000 Genomes as the reference. The study sample was masked and imputed separately by race. This analysis provided a more conventional imputation scenario for comparison with the patterns found in the 1000 Genomes analyses.
The sequenced subjects in this applied analysis were from the Collaborative Genetic Study of Nicotine Dependence (COGEND) and the Genetic Study of Nicotine Dependence in African Americans (AAND). These studies are cross-sectional and contain extensive smoking behavior phenotypes in African Americans and European Americans [21]. These individuals were between the ages of 25–44 years old and were assessed for dependence as measured by the Fagerstrom Test for Nicotine Dependence (FTND) and cigarettes-per-day (CPD) [27]. The study protocol was approved by the appropriate Institutional Review Boards and written informed consent was obtained from all subjects.
The Center for Inherited Disease Research (CIDR) performed next-generation targeted sequencing on genomic regions previously associated with smoking behaviors, using COGEND and AAND DNA samples derived from blood. Genotypic data that passed initial quality control at CIDR were released to the Quality Assurance/Quality Control analysis team at the University of Washington Genetics Coordinating Center. These data had mean on-target coverage of 180X, with more than 96% of on-target bases having a depth greater than 20X. A total of 1,481 African Americans and 1,480 European Americans were used in the analysis.
These sequencing data were masked to match the typed SNP coverage of the Omni 2.5 SNP array in a 500 kb region on chromosome 15. The cosmopolitan reference panel, composed of individuals from a variety of ancestries, was used for imputation since it has been shown to produce the best accuracy estimates [9]. The imputation was performed using BEAGLE and IMPUTE2 to evaluate whether observed trends in accuracy were consistent across imputation programs. The imputed probabilities were compared to the masked sequencing data and accuracy statistics were calculated. We focused our analyses on polymorphic variants.
Results
We compared IQS with squared correlation, concordance rate, and BEAGLE R2 to examine changes in accuracy assessment using 1000 Genomes as the study sample in Figs 2–5. IQS is our benchmark because it adjusts for chance agreement, in contrast to concordance rate, which inflates assessments of accuracy [13]. We focus here on the results for the AFR reference population using Omni 2.5M typed coverage on chromosome 15 (13,442 imputed SNPs). We emphasize Omni 2.5 because it has the greatest genotyped SNP coverage in the region (S2 Table).
Results for 1000 Genomes Imputation with Matching Reference
Results produced using BEAGLE and the AFR reference population are shown. Results for different chromosomal regions and populations were similar and are shown in S6–S8 Figs.
To help interpret results that are displayed by MAF and max r2 LD bin, S1 Fig shows the number of imputed variants in each MAF bin in panel A and max r2 LD bin in panel B. This figure indicates that most of the imputed variants were rare and low frequency variants. There were 6,480 (48.21%) rare and low frequency rsID SNPs in the AFR population. The bins ranged in size from 7 variants (0.49 ≤ MAF < 0.50) to 2,371 variants (0.01 ≤ MAF < 0.02).
Concordance Rate and BEAGLE R2 Inflate Assessments of Accuracy for Rare Variants
Results show that the choice of statistic is important when examining the imputation accuracy of rare and low frequency variants. Fig 2 displays the mean accuracy and one standard deviation in each MAF bin, after imputing from Omni 2.5M coverage. IQS (Panel A) and squared correlation (Panel B) produced similar means and standard deviations in each bin, though this does not necessarily represent similarity of values for particular SNPs. For rare and low frequency variants, both concordance rate (Panel C) and BEAGLE R2 (Panel D) produce inflated assessments of accuracy. The higher concordance rate and BEAGLE R2 values could mislead a researcher into assuming that these variants were imputed well, and that accuracy is best measured using concordance rate and BEAGLE R2. IQS and squared correlation also show low accuracy for rare variants using other SNP array coverages (S2 Fig).
A MAF bin can have a wide range in accuracy values. Fig 2 shows variability within MAF bins across all MAF values. Standard deviations for IQS, squared correlation and BEAGLE R2 can be sizeable for both rare and common variants (panels A, B and D); concordance rate does not reflect this as it classifies most variants as well imputed (panel C).
Rare and Low Frequency Variants can be Well Tagged but Poorly Imputed
We examined max r2 LD, the maximum LD r2 between imputed and genotyped SNPs, to understand the relationship between typed SNP coverage and imputation accuracy as measured by these accuracy statistics. Fig 3 displays the mean accuracy and one standard deviation in each max r2 LD bin after imputing from Omni 2.5M coverage; results for additional arrays are in S3 Fig. Mean accuracy tends to increase with increasing max r2 LD, as expected. For low to moderate max r2 LD, we observed substantial variability in IQS as well as squared correlation and BEAGLE R2 values; however, at high max r2 LD, the variability decreases. IQS and squared correlation show a surprisingly wide standard deviation for variants in the highest max r2 LD bin (0.99 < max r2 LD ≤ 1) as well as the max r2 LD bin 0.5 < max r2 LD ≤ 0.51. Upon investigation, we found that the variability was due to rare variants: after limiting to SNPs with MAF > 5%, these standard deviations were comparable to those of the other bins (S4 Fig). This pattern suggests that even rare variants that are well tagged (as measured by max r2 LD) can be poorly imputed.
Concordance Classifies Most Variants as Well Imputed
Concordance differs from IQS, squared correlation, and BEAGLE R2 in that it indiscriminately classifies most variants as well imputed, across MAF (Fig 2) and r2 LD bins (Fig 3). The results in Figs 2 and 3 support prior concerns regarding concordance rate [13] and led us to focus the rest of our evaluation on IQS, squared correlation, and BEAGLE R2.
For Rare Variants, IQS and Squared Correlation Produce Different Assessments of Accuracy
Although squared correlation and IQS appeared similar overall in their assessment of imputation accuracy when examined using means and standard deviations by bin (Figs 2 and 3), further investigation showed that on an individual SNP level, these statistics produce divergent assessments of accuracy for rare and low frequency variants. We compared accuracy estimates produced by IQS and squared correlation in Fig 4 for each SNP. Panel A shows results for all variants, and panel B displays results for variants with MAF > 5%. A comparison of these panels is useful to identify divergent trends for common variants versus rare and low-frequency variants. For most SNPs, IQS and squared correlation produced similar assessments of accuracy as seen by the many observations on and near the y = x line in panels A and B. This is consistent with the accuracy patterns observed for IQS and squared correlation in Figs 2 and 3. However, discrepancies in accuracy assessment do occur, with squared correlation generally being more liberal in assigning high accuracy compared to IQS. This is indicated by the sparseness of observations above the y = x line in panels A and B. The points below the y = x line indicate SNPs for which squared correlation values were higher than IQS. Panel B shows that widely discrepant values for IQS and squared correlation are attributable to rare and low frequency SNPs: filtering out SNPs with MAF ≤ 5% removes the widely discrepant observations.
To further examine trends in the discrepancies between these statistics, we subtracted squared correlation from IQS for each variant and displayed this result across all MAF values in S5 Fig. Thus negative differences denote that squared correlation was greater than IQS (i.e. squared correlation more liberal) while positive differences indicate that IQS was greater than squared correlation. Large discrepancies occur over all MAF values with squared correlation tending to be higher than IQS, especially for SNPs with higher MAFs.
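A toy calculation shows how such a negative difference can arise for a rare variant. The numbers below are invented for illustration and are not data from this study; the function names are hypothetical. One heterozygous carrier among 100 individuals is imputed with only 40% probability of being heterozygous, while all non-carriers are imputed correctly with certainty.

```python
def squared_correlation(x, y):
    """Squared Pearson correlation between two dosage vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    s_xx = sum((a - mx) ** 2 for a in x)
    s_yy = sum((b - my) ** 2 for b in y)
    return s_xy ** 2 / (s_xx * s_yy)


def iqs(probs, truth):
    """IQS = (Po - Pc) / (1 - Pc) from genotype probabilities and true genotypes."""
    table = [[0.0] * 3 for _ in range(3)]
    for p, g in zip(probs, truth):
        for k in range(3):
            table[k][g] += p[k]  # rows: imputed class, columns: actual class
    n = sum(sum(row) for row in table)
    po = sum(table[k][k] for k in range(3)) / n
    rows = [sum(table[k]) for k in range(3)]
    cols = [sum(table[i][k] for i in range(3)) for k in range(3)]
    pc = sum(r * c for r, c in zip(rows, cols)) / n ** 2
    return (po - pc) / (1 - pc)


truth = [1] + [0] * 99                                  # one carrier, MAF = 0.5%
probs = [(0.6, 0.4, 0.0)] + [(1.0, 0.0, 0.0)] * 99      # carrier called weakly
imputed_dosage = [p[1] + 2 * p[2] for p in probs]

r2 = squared_correlation(imputed_dosage, truth)
score = iqs(probs, truth)
print(round(r2, 3), round(score, 3), round(score - r2, 3))
```

The imputed dosage is a scaled copy of the true dosage, so squared correlation is 1, whereas IQS is about 0.57 because the carrier's most probable imputed genotype is wrong. The difference (IQS minus squared correlation) is therefore about −0.43, a point well below the y = x line.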
For Common Variants, IQS and BEAGLE R2 Provide Similar Assessments of Accuracy
For common variants, BEAGLE R2 produces an assessment of imputation accuracy similar to that of IQS, but BEAGLE R2 can differ dramatically from squared correlation. In Fig 5, we compared BEAGLE R2 to IQS (panels A and C) and squared correlation to BEAGLE R2 (panels B and D). For many variants, squared correlation and BEAGLE R2 differ in accuracy assessment as seen by the variants above the y = x line in panel B. Although most of these variants are rare, there are still many common variants for which this trend is true (panel D). Large differences between IQS and BEAGLE R2 occur mostly when rare variants are examined.
Results are Similar in Different Genomic Regions and Populations
Figs 2–5 displayed results for the AFR reference population and Omni 2.5M typed coverage in the chromosome 15 region. Results similar to those described above were also observed using the AFR reference on chromosome 8 (S6 Fig) as well as using the EUR reference panel for chromosomes 15 and 8 (S7 and S8 Figs respectively). In particular, low IQS values do occur for rare variants that have high squared correlation or high BEAGLE R2. The number of variants for each imputation subset can be found in S3 Table.
Results are Consistent in Application to Nicotine Dependence Study Sample
Fig 6 shows results produced using African American individuals from the nicotine dependence data as the study sample and a 1000 Genomes cosmopolitan reference panel imputed using BEAGLE. These data show discrepancies in accuracy assessment between statistics. If IQS and squared correlation are compared, squared correlation tends to be similar to or higher than (i.e. more liberal than) IQS. In the applied scenario, we observed some variants with high IQS and low squared correlation (Fig 6, panel A, upper left quadrant), which was not observed for the upper bound values from the 1000 Genomes analysis (Fig 4, panel A); however, these discrepancies are few, and mostly among rare and low frequency variants (see Fig 6, panel D). When comparing IQS to BEAGLE R2, the applied scenario showed IQS to be similar to or less than BEAGLE R2 (Fig 6, panel B), which recapitulates patterns seen in 1000 Genomes (Fig 5, panel A).
In European Americans from the nicotine dependence data, we observed the same patterns as in African Americans, with squared correlation assigning accuracy more liberally than IQS (S9 Fig). These results were also consistent using IMPUTE2 with African American and European American study samples (S10 and S11 Figs, respectively). This confirms that these patterns are not limited to specific populations, chromosomes, or imputation programs.
Discussion
Genotype imputation is used to improve the density of genomic coverage and increase power by combining datasets [28], in efforts to identify and refine genetic variants associated with disease. We investigated how assessment of imputation accuracy changes when concordance rate, squared correlation and BEAGLE R2 are compared to IQS, focusing on two genomic regions associated with smoking behavior.
Results showed that the choice of accuracy statistic matters for rare variants more than for common variants. This is important given that researchers are increasingly interested in imputing rare and low frequency variants [29–31]. While it has been recognized that rare variants are more difficult to impute accurately, our work here goes further by highlighting that choice of accuracy measure has an important role.
For common variants, squared correlation, IMPUTE2 INFO, and BEAGLE R2 produce assessments of imputation accuracy similar to those of IQS. For rare and low frequency variants, we observed varying assessments of accuracy compared to IQS. Our results also showed that discrepancies between IQS and squared correlation are most likely to occur at rare and low frequency variants, where squared correlation is more liberal in assigning higher accuracy as compared to IQS. An evaluation of nicotine dependence samples also showed discrepancies between IQS and squared correlation. We recommend calculating IQS to confirm imputation accuracy, especially for rare or low frequency variants.
The variability observed within a MAF or max r2 LD bin is a reminder that not all variants that share the same MAF or max r2 LD value can be imputed with the same level of accuracy. This is consistent with the expectation that the inference of untyped variants depends on haplotype block structure and not simply the pairwise relationships between the genotyped and untyped variants. For rare variants, high LD with a genotyped SNP may not guarantee high imputation accuracy. Still, overall, a high max r2 LD usually implies high accuracy, as we observed increasing mean accuracy along with decreasing variability within max r2 LD bins as max r2 LD increases.
We applied this approach to genomic regions associated with our phenotype of interest, smoking behavior, using an upper bound scenario and a nicotine dependence sample. Thus, one limitation is that rather than comprehensively examining the genome, we focused only on selected genomic regions. Furthermore, we focused on certain populations (European and African ancestry). Nevertheless, different regions (on chromosomes 8 and 15), different imputation programs, and different populations showed similar overall patterns, suggesting that our observations are relevant throughout the genome and across multiple populations.
In our masking process using only the 1000 Genomes reference data, the reference panel individuals were the same as the study sample individuals, and our masked SNPs are not limited to a SNP array, making our approach different from the two most common masking processes. One common masking method removes the genotypes for a portion of markers (e.g. 10%) found amongst the typed variants on a study sample SNP array. This method can provide accuracy comparisons only for SNPs on the array. Our approach is able to provide accuracy assessments for SNPs not on the array.
Another commonly used masking method is the “leave-one-out” masking of a comprehensively genotyped reference panel, in which one individual is imputed using the remaining reference panel members. Our study design differed from the leave-one-out method since all individuals in the reference panel and study sample were the same. Our approach was expected to give an upper bound on accuracy because of the ideal match between the reference and study sample; the “correct” genotype for each individual at each variant was present in the reference panel.
Our results provide further evidence that concordance rate inflates accuracy estimates particularly for rare and low frequency variants [13, 32]. These observations highlight a need to account for chance agreement not only when assessing imputation accuracy, but also more broadly in other situations for which concordance is traditionally used to assess accuracy, such as checking genotype agreement across duplicate samples [33–34]. Concordance rate will always produce a value greater than or equal to IQS due to their mathematical relationship (see Methods for proof).
IQS is important to consider, as it is designed to identify variants for which imputation accuracy is better than can be expected by chance; accordingly, other measures were generally more liberal in assigning high accuracy. Our analyses indicate that especially for rare and low frequency variants, IQS may be important to avoid overly liberal assessments of imputation quality. In practice, IQS can be computed by the leave-one-out method. Databases that provide per-SNP "imputability," such as that created by Duan et al. [35], would have increased usefulness if they included IQS values. As imputation methodology continues to develop and reference panels become more comprehensive, we expect that imputation will become increasingly accurate. However, it will be important to take chance agreement into account when assessing this accuracy, and IQS provides a means to do so.
Supporting Information
Acknowledgments
The authors thank John Rice and Laura Bierut for helpful comments and discussion. The authors also thank Cynthia Helms for assistance in data analysis.
Data Availability
Data used are from the 1000 Genomes Project (http://www.1000genomes.org/data). The reference panels used were obtained from the University of Michigan Center for Statistical Genetics (http://csg.sph.umich.edu/abecasis/MACH/download/). The nicotine dependence studies (Collaborative Genetic Study of Nicotine Dependence and the Genetic Study of Nicotine Dependence in African Americans) are available from NCBI dbGAP (accession number phs000813).
Funding Statement
NLS, WD, JZ, SR, and RC were supported by R01DA026911 from the National Institute on Drug Abuse (NIDA). RC was also supported by R21DA033827 from NIDA. NLS, WD, and RC were also supported by P01DA035825 from NIDA. DBH and EOJ were supported by R01DA035825 from NIDA. EOJ was also supported by R01DA025888 from NIDA. EO was supported by T32GM07200, UL1TR000448, F30AA023685, and TL1TR000449 from the National Institutes of Health. This work was also supported by P01CA089392 from NCI and HHSN268201100011I from NIH. LC was supported by K08DA030398 from NIH. SMH was supported by K08DA032680 from NIDA. TSA was supported by the Division of Intramural Research at National Human Genome Research Institute, National Institutes of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Howie BN, Donnelly P,Marchini J (2009) A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies. PLoS Genet 5: e1000529 10.1371/journal.pgen.1000529 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Browning BL, Browning SR (2009) A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet 84: 210–223. 10.1016/j.ajhg.2009.01.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Marchini J, Howie B (2010) Genotype imputation for genome-wide association studies. Nat Rev Genet 11: 499–511. 10.1038/nrg2796 [DOI] [PubMed] [Google Scholar]
- 4. Howie B, Marchini J, Stephens M (2011) Genotype Imputation with Thousands of Genomes. G3: Genes|Genomes|Genetics 1: 457–470. 10.1534/g3.111.001198 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet 44: 955–959. 10.1038/ng.2354 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Liu EY, Li M, Wang W and Li Y (2013) MaCH-Admix: Genotype Imputation for Admixed Populations. Genetic epidemiology 37: 25–37. 10.1002/gepi.21690 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR (2010) MaCH: Using Sequence and Genotype Data to Estimate Haplotypes and Unobserved Genotypes. Genetic epidemiology 34: 816–834. 10.1002/gepi.20533 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Browning SR (2006) Multilocus Association Mapping Using Variable-Length Markov Chains. American Journal of Human Genetics 78: 903–913. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Hancock DB, Levy JL, Gaddis NC, Bierut LJ, Saccone NL, Page GP, et al. (2012) Assessment of Genotype Imputation Performance Using 1000 Genomes in African American Studies. PLoS One 7: e50610 10.1371/journal.pone.0050610 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Sung YJ, Gu CC, Tiwari HK, Arnett DK, Broeckel U, Rao DC (2012) Genotype Imputation for African Americans Using Data From HapMap Phase II Versus 1000 Genomes Projects. Genetic epidemiology 36: 508–516. 10.1002/gepi.21647 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Johnson EO, Hancock DB, Levy JL, Gaddis NC, Saccone NL, Bierut LJ, et al. (2013) Imputation across genotyping arrays for genome-wide association studies: assessment of bias and a correction strategy. Human genetics 132: 509–522. 10.1007/s00439-013-1266-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Nelson SC, Doheny KF, Pugh EW, Romm JM, Ling H, Laurie CA, et al. (2013) Imputation-Based Genomic Coverage Assessments of Current Human Genotyping Arrays. G3: Genes|Genomes|Genetics 3: 1795–1807. 10.1534/g3.113.007161 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Lin P, Hartz SM, Zhang Z, Saccone SF, Wang J, Tischfield JA, et al. (2010) A New Statistic to Evaluate Imputation Reliability. PLoS One 5: e9697 10.1371/journal.pone.0009697 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Shriner D, Adeyemo A, Chen G, Rotimi CN (2010) Practical Considerations for Imputation of Untyped Markers in Admixed Populations. Genetic epidemiology 34: 258–265. 10.1002/gepi.20457 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Chanda P, Yuhki N, Li M, Bader JS, Hartz A, Boerwinkle E, et al. (2012) Comprehensive evaluation of imputation performance in African Americans. Journal of human genetics 57: 411–421. 10.1038/jhg.2012.43 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. The 1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65. 10.1038/nature11632 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. The 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073. 10.1038/nature09534 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Zheng J, Li Y, Abecasis GR, Scheet P (2011) A Comparison of Approaches to Account for Uncertainty in Analysis of Imputed Genotypes. Genetic epidemiology 35: 102–110. 10.1002/gepi.20552 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Shriner D (2013) Impact of Hardy—Weinberg disequilibrium on post-imputation quality control. Human genetics 132: 1073–1075. 10.1007/s00439-013-1336-x [DOI] [PubMed] [Google Scholar]
- 20. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira Manuel AR, Bender D, et al. (2007) PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. American Journal of Human Genetics 81: 559–575. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Bierut LJ, Madden PAF, Breslau N, Johnson EO, Hatsukami D, Pomerleau OF, et al. (2007) Novel genes identified in a high-density genome wide association study for nicotine dependence. Human Molecular Genetics 16: 24–35. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Saccone SF, Hinrichs AL, Saccone NL, Chase GA, Konvicka K, Madden PA, et al. (2007) Cholinergic nicotinic receptor genes implicated in a nicotine dependence association study targeting 348 candidate genes with 3713 SNPs. Hum Mol Genet 16: 36–49. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Saccone NL, Culverhouse RC, Schwantes-An T-H, Cannon DS, Chen X, Cichon S, et al. (2010) Multiple Independent Loci at Chromosome 15q25.1 Affect Smoking Quantity: a Meta-Analysis and Comparison with Lung Cancer and COPD. PLoS Genetics 6: e1001053 10.1371/journal.pgen.1001053 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Liu JZ, Tozzi F, Waterworth DM, Pillai SG, Muglia P, Middleton L, et al. (2010) Meta-analysis and imputation refines the association of 15q25 with smoking quantity. Nat Genet 42: 436–440. 10.1038/ng.572 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. Tobacco and Genetics Consortium (2010) Genome-wide meta-analyses identify multiple loci associated with smoking behavior. Nat Genet 42: 441–447. 10.1038/ng.571 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Thorgeirsson TE, Gudbjartsson DF, Surakka I, Vink JM, Amin N, Geller F, et al. (2010) Sequence variants at CHRNB3-CHRNA6 and CYP2A6 affect smoking behavior. Nature genetics 42: 448–453. 10.1038/ng.573 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Luo Z, Alvarado GF, Hatsukami DK, Johnson EO, Bierut LJ, Breslau N (2008) Race Differences in Nicotine Dependence in the Collaborative Genetic Study of Nicotine Dependence (COGEND). Nicotine & Tobacco Research 10: 1223–1230. [DOI] [PubMed] [Google Scholar]
- 28. Winkler TW, Day FR, Croteau-Chonka DC, Wood AR, Locke AE, Mägi R, et al. (2014) Quality control and conduct of genome-wide association meta-analyses. Nat Protocols 9: 1192–1212. 10.1038/nprot.2014.071 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Zheng H-F, Ladouceur M, Greenwood CMT, Richards JB (2012) Effect of Genome-Wide Genotyping and Reference Panels on Rare Variants Imputation. Journal of Genetics and Genomics 39: 545–550. 10.1016/j.jgg.2012.07.002 [DOI] [PubMed] [Google Scholar]
- 30. Zheng H-F, Rong J-J, Liu M, Han F, Zhang X-W, Richards JB, et al. (2015) Performance of Genotype Imputation for Low Frequency and Rare Variants from the 1000 Genomes. PLoS One 10: e0116487 10.1371/journal.pone.0116487 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Liu EY, Buyske S, Aragaki AK, Peters U, Boerwinkle E, Carlson C, et al. (2012) Genotype Imputation of MetabochipSNPs Using a Study-Specific Reference Panel of ∼4,000 Haplotypes in African Americans From the Women's Health Initiative. Genetic epidemiology 36: 107–117. 10.1002/gepi.21603 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32. Asimit J, Zeggini E (2010) Rare Variant Association Analysis Methods for Complex Traits. Annual Review of Genetics 44: 293–308. 10.1146/annurev-genet-102209-163421 [DOI] [PubMed] [Google Scholar]
- 33. Truong L, Park H, Chang S, Ziogas A, Neuhausen S, Wang S, et al. (2015) Human Nail Clippings as a Source of DNA for Genetic Studies. Open Journal of Epidemiology: 41–50. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Rogers A, Beck A, Tintle NL (2014) Evaluating the concordance between sequencing, imputation and microarray genotype calls in the GAW18 data. BMC Proceedings 8: S22–S22. 10.1186/1753-6561-8-S1-S22 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35. Duan Q, Liu EY, Croteau-Chonka DC, Mohlke KL, Li Y (2013) A comprehensive SNP and indel imputability database. Bioinformatics 29: 528–531. 10.1093/bioinformatics/bts724 [DOI] [PMC free article] [PubMed] [Google Scholar]